Don’t Be Like Uber — Have Your Data Wear a Mask

For this week’s blog article, I want to start with a story.

Imagine you’re the CISO of an up-and-coming ride-hailing service. The business has been so successful that you’ve been able to scale at a record pace. And while your developers and QA teams are now knowledgeable about secure coding practices, this wasn’t always the case.

Back in the early days, your engineers were using a GitHub repo for some testing code that was accidentally made public. A little Google dorking led hackers to discover that one of your engineers had hardcoded the login credentials to your test database. Unfortunately for you, the test database was using real production data, which included thousands of records on your customers.

But all was not lost, because your company, unlike Uber, was smart and had recently implemented a data masking policy to protect the sensitive user data in that test database.

What is Data Masking?

So what is data masking anyway? Data masking, or obfuscation, refers to techniques that can be used to hide or de-identify data, usually by substituting it with some form of modified content. In its simplest form, think of the humble password prompt:

Here, the password text is ‘masked’ with bullets. The key is to keep the data format similar, without exposing the values. While the password example is a simple one, data masking is also commonly used between production and development environments, or as a way of sharing data with third parties such as call-center personnel.

For example, during the software development and testing phases, developers and QA team members may need to use real data for testing and debugging applications. Rather than copy live production data into the development environment, data masking can be used to allow for testing formats without the risk of exposing sensitive personal data. There are a number of data masking techniques, including fictive data or substitution (e.g., ‘John Doe’, ‘Anytown, USA’, ‘123-45-6789’), full or partial redaction (‘XXXXXX’), encryption, or using null values.
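To make a few of those techniques concrete, here is a minimal Python sketch of substitution, redaction, and nulling. The function names and the sample record are hypothetical, made up for illustration, and a real masking tool would do considerably more (for example, keeping masked values consistent across tables).

```python
def substitute(_value, replacement="John Doe"):
    """Substitution: swap the real value for a fictive stand-in."""
    return replacement


def redact(value, keep_last=0, mask_char="X"):
    """Full or partial redaction that preserves the original length."""
    hidden = max(len(value) - max(keep_last, 0), 0)
    return mask_char * hidden + value[hidden:]


def nullify(_value):
    """Null out a field that testers do not need at all."""
    return None


# Hypothetical record; every value here is invented.
record = {
    "name": "Jane Roe",
    "ssn": "123-45-6789",
    "phone": "555-867-5309",
    "notes": "Prefers morning pickups",
}

masked = {
    "name": substitute(record["name"]),
    "ssn": redact(record["ssn"]),
    "phone": redact(record["phone"], keep_last=4),
    "notes": nullify(record["notes"]),
}
print(masked)
```

Note that the redacted values keep their original length, which is the point made earlier about keeping the data format similar without exposing the values.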

While data masking can be applied to any field or data element, it’s most commonly used in the context of protecting personal data, such as name, address, phone number, IDs (e.g., SSN, passport number), and payment card details. No matter how it’s used, in order to be effective, your data masking technique must change the live data in such a way that it becomes impossible to reverse engineer the identity or sensitive data.

An Ounce of Prevention Avoids a Pound of Compliance Headaches

In addition to being a general best practice in software development, data masking can also be effective at limiting data breach risk, including third-party breaches, as well as offering significant benefits when it comes to meeting compliance objectives under PCI-DSS, the GDPR and the HIPAA Privacy Rule.

PCI-DSS Requirement 3 explicitly instructs merchants to mask primary account number (PAN) data through truncation or masking whenever it is displayed. At most, the first six and last four digits may be shown; everything else should be masked. And while the GDPR doesn’t specifically mention the idea of ‘data masking,’ Recital 78 does outline how pseudonymization, of which data masking is a subset, can be implemented as an effective technical and organizational control for meeting compliance obligations. Thus, when implemented correctly, data masking can be a means to achieve those ends.
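The PAN display rule is easy to sketch in code. The Python function below is a hypothetical helper (the name mask_pan and its parameters are my own, not from PCI-DSS or any library) that masks the middle digits of a card number and leaves separators in place. Treat it as an illustration of the idea, not a compliance-ready implementation.

```python
def mask_pan(pan, show_first=6, show_last=4, mask_char="*"):
    """Mask the middle digits of a card number, keeping separators in place."""
    if not (0 <= show_first <= 6 and 0 <= show_last <= 4):
        raise ValueError("display at most the first six and last four digits")
    total_digits = sum(ch.isdigit() for ch in pan)
    if total_digits <= show_first + show_last:
        raise ValueError("PAN is too short to mask with these settings")

    masked = []
    seen = 0  # number of digits processed so far
    for ch in pan:
        if not ch.isdigit():
            masked.append(ch)  # keep spaces and dashes so the format survives
            continue
        visible = seen < show_first or seen >= total_digits - show_last
        masked.append(ch if visible else mask_char)
        seen += 1
    return "".join(masked)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
print(mask_pan("4111 1111 1111 1111"))                # 4111 11** **** 1111
print(mask_pan("4111-1111-1111-1111", show_first=0))  # ****-****-****-1111
```

Showing only the last four digits (the second call) is the more conservative choice when nobody viewing the data needs the leading digits.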

Fortunately, there are many solutions that can be used to implement a data masking process, including within cloud environments. Regardless of what application or tool you use, it’s important to look for something that provides the following features:

  • Offers a range of masking techniques
  • Uses rules-based data masking that can be applied to different categories/subsets of data
  • Utilizes format-preserving encryption (FPE) transformation
  • Uses realistic fictive data types
  • Has centralized policy management
  • Robust access controls
  • Audit trails and compliance tracking
  • The ability to share subsets of data
  • Scalability
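As a rough illustration of what rules-based masking with centralized policy management can look like, here is a small Python sketch. Everything in it (the category names, the field mapping, and the apply_policy function) is a hypothetical example, not the design of any particular product.

```python
# One central policy: each data category maps to a single masking rule.
RULES = {
    "name": lambda value: "John Doe",
    "email": lambda value: "user@example.com",
    "national_id": lambda value: "X" * len(str(value)),
    "free_text": lambda value: None,
}

# Each field is classified into a category once, in one place.
FIELD_CATEGORIES = {
    "full_name": "name",
    "contact_email": "email",
    "ssn": "national_id",
    "support_notes": "free_text",
}


def apply_policy(record, field_categories=FIELD_CATEGORIES, rules=RULES):
    """Return a masked copy of the record; unclassified fields pass through."""
    masked = {}
    for field, value in record.items():
        category = field_categories.get(field)
        masked[field] = value if category is None else rules[category](value)
    return masked


# Hypothetical customer record with invented values.
customer = {
    "full_name": "Jane Roe",
    "contact_email": "jane.roe@example.org",
    "ssn": "123-45-6789",
    "support_notes": "Called about a late driver",
    "trip_count": 42,
}
print(apply_policy(customer))
```

Letting unclassified fields pass through keeps the sketch short, but a stricter policy would mask or drop anything it does not recognize, so a newly added sensitive column is not exposed by default.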

Native Solutions for the Cloud

GCP: Google offers a native data masking/de-identification solution via its Cloud Data Loss Prevention (DLP) service. This fully managed service identifies, classifies and de-identifies data, using methods such as masking and tokenization, across Cloud Storage, BigQuery, and Datastore, and it offers a streaming content API that enables support for other data sources. Key features include:

  • Scanning and classification of over 120 types of sensitive data, or infoTypes. This includes name, email, cardholder data, gender, IP address, ID numbers and other types.
  • Automatic data masking capabilities for both structured and unstructured data
  • Detailed findings can be sent to BigQuery for further analysis and auditing
  • Allows organizations to add and manage custom data types
  • Volume-based pricing.

Additional documentation and a full list of features can be found here.
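For a feel of what calling the GCP service looks like, here is a hedged Python sketch based on the google-cloud-dlp client library. The helper names, the project ID, and the choice of '#' as the masking character are my own assumptions. The live call needs the third-party package and valid credentials, so only the request-building part runs on its own.

```python
def build_deidentify_request(project_id, text, info_types):
    """Build a request that masks every character of each matched infoType."""
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "character_mask_config": {
                                "masking_character": "#",
                                # 0 means "mask all matched characters".
                                "number_to_mask": 0,
                            }
                        }
                    }
                ]
            }
        },
        "item": {"value": text},
    }


def deidentify_text(project_id, text, info_types):
    """Send the request to Cloud DLP (assumes google-cloud-dlp is installed)."""
    from google.cloud import dlp_v2  # third-party; imported only when called

    client = dlp_v2.DlpServiceClient()
    request = build_deidentify_request(project_id, text, info_types)
    response = client.deidentify_content(request=request)
    return response.item.value


# "my-project" is a placeholder project ID, not a real one.
request = build_deidentify_request(
    "my-project", "Contact me at jane.roe@example.org", ["EMAIL_ADDRESS"]
)
print(request["inspect_config"])
```

Check the current client library documentation before relying on the exact field names, since the API surface may change over time.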

AWS: While Amazon does not appear to have a single or complete solution, they do offer a native tool known as Macie which allows for discovery and protection of sensitive data at scale. According to Macie documentation, “Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS.” While Macie does not appear to actually perform data masking functions, its identification features do allow for continuous monitoring and response. Other key features include:

  • Automatic inventory of Amazon S3 buckets, including a list of unencrypted, publicly accessible, and shared buckets.
  • Identification and alerting of sensitive data, such as personally identifiable information (PII)
  • Searchable findings in the AWS Management console
  • Integration with Amazon EventBridge (formerly CloudWatch Events) and AWS Step Functions workflows.

Additional documentation can be found here.

Azure: Azure also offers a limited set of native data masking features for SQL databases through its Azure Portal solution. The Dynamic Data Masking recommendation engine automatically flags fields and data types for masking. Existing logic for masking credit card details, email and other custom text fields is built in.

Other key features include:

  • Implementation through the portal, Azure SQL Database cmdlets or REST API
  • Masking rules are customizable and columns can be added manually
  • Simple drop-down functionality to set masking type (e.g., truncation, custom prefixes, random number generation)

Additional documentation can be found here.