A Practical Data Quality Process

This post provides high-level, practical tips for approaching data quality.

Data quality management is necessary for data-driven decision making. However, many data quality initiatives fall short of expectations. This post describes a framework that can help plan and execute a successful data quality project.

What is data quality?

The first question to answer is: what does high data quality mean? There are two common definitions:

  1. Data quality is high if data is fit for the intended purpose.

  2. Data quality is high if the data completely reflects the real-world entity.

By definition 1, data quality is high if it is possible to send an invoice to a customer, even though the data may be missing the customer’s phone number.

By definition 2, data quality is high only if the data correctly captures all attributes of the customer. In the previous example, data quality is low if the data contains the correct address but is missing the customer’s phone number.

The two definitions can pull in opposite directions. A reasonable expectation is for master data to be applicable to multiple business decisions. However, it may be impractical or costly to ensure all-round data quality. Any data quality process has to balance these opposing forces.

A Data Quality Flow

A data quality flow that focuses on business outcomes has the advantage of demonstrating value, which improves the chances of continued funding. Such a flow is described below.

The process is as follows:

  1. Define a business issue, such as the inability to send invoices correctly.

  2. Triage with product and engineering teams. Set expectations on the outcome as well as thresholds and KPIs for data quality. For example, a threshold may be that at most 1% of attempts to send invoices can fail.

  3. Product and engineering teams fix and deploy.

  4. Verify that data quality has improved based on the previously agreed thresholds and KPIs, as sketched below.
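
To make step 4 concrete, here is a minimal sketch that checks the invoice KPI from step 2; the record layout and the "status" field are hypothetical.

# Sketch: verify the agreed data quality KPI (at most 1% of invoice sends may fail).
# Assumes invoice attempts are available as dicts with a hypothetical "status" field.
FAILURE_THRESHOLD = 0.01  # agreed with product and engineering during triage

def invoice_failure_rate(attempts):
    if not attempts:
        return 0.0
    failures = sum(1 for a in attempts if a["status"] == "failed")
    return failures / len(attempts)

sample = [{"status": "sent"}] * 198 + [{"status": "failed"}] * 2
rate = invoice_failure_rate(sample)
print(rate, rate <= FAILURE_THRESHOLD)  # 0.01, so the KPI is met

Running this check on a schedule after the fix is deployed keeps the verification tied to the business threshold agreed during triage.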

Data Quality Tools

There are a number of commercial and open-source tools, such as Great Expectations, that help data engineering teams with data quality management. The two main techniques are:

  1. Data profiling and alerts on changes in data profiling.

  2. Tests for data pipelines.

A typical problem is that these techniques are used to monitor the data quality of all attributes. This can lead to alert fatigue without any perceptible improvement in business outcomes. Another issue is that data profiles and requirements change; since maintaining data quality rules is manually intensive, the rules are not kept up to date.

Tying data quality rules and tool usage to specific business outcomes, as described in the previous section, keeps the monitoring focused on what matters. A sketch of such an outcome-scoped profiling alert follows.
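
As an illustration, the sketch below profiles only the attributes that the invoicing outcome depends on and alerts when their null rate drifts past a threshold; the column names and thresholds are hypothetical.

# Sketch: profile-drift alert scoped to the columns tied to the invoicing outcome.
# Column names and thresholds are hypothetical.
MONITORED_COLUMNS = {"billing_address", "customer_email"}  # needed to send invoices
MAX_NULL_RATE = 0.01

def null_rate(rows, column):
    if not rows:
        return 0.0
    return sum(1 for r in rows if not r.get(column)) / len(rows)

def profile_alerts(rows):
    return [
        f"{col}: null rate {null_rate(rows, col):.1%} exceeds {MAX_NULL_RATE:.0%}"
        for col in sorted(MONITORED_COLUMNS)
        if null_rate(rows, col) > MAX_NULL_RATE
    ]

customers = [
    {"billing_address": "1 Main St", "customer_email": "a@example.com", "phone": None},
    {"billing_address": None, "customer_email": "b@example.com", "phone": None},
]
print(profile_alerts(customers))  # phone is ignored; billing_address triggers an alert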

Data Catalogs, Dictionaries, Taxonomies and Glossaries

This post explores the different types of metadata about data and the personas who use this metadata in their work.

Metadata in a data lake is important for the productivity of everyone in the data ecosystem. The different types of metadata, systems to store them, and their consumers can be very confusing. How is a data catalog different from a dictionary or a glossary? This post will explore all aspects of metadata for data.

Information Schema

The most basic type of metadata is stored by the database itself in an information schema. The information schema is an ANSI SQL standard and provides system information on tables, views, columns, users, and permissions, among other database-specific information. Its primary users are database administrators, who use it to verify the internal state of the database.

The information schema is typically accessed through SQL statements or non-standard commands like SHOW or DESCRIBE at the database prompt or in scripts.

The Hive Metastore and AWS Glue Data Catalog are popular information schemas in data lakes. As an example, the MySQL documentation shows a query that lists all tables in a schema “db5” along with database-specific information like the storage engine; a sketch of running it follows.
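
A sketch of running that query from Python; the connection details are placeholders, and any DB-API driver (here mysql-connector-python) works the same way.

# Sketch: list tables and their storage engine for schema "db5" via the information schema.
# Connection parameters are placeholders.
import mysql.connector  # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(host="localhost", user="dba", password="secret", database="db5")
cur = conn.cursor()
cur.execute(
    """
    SELECT table_name, table_type, engine
    FROM information_schema.tables
    WHERE table_schema = 'db5'
    ORDER BY table_name
    """
)
for table_name, table_type, engine in cur.fetchall():
    print(table_name, table_type, engine)
conn.close()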

There are multiple instances of an information schema - one per database in the organization.

Data Catalog

The Data Catalog is a system-wide inventory of all the data assets. A popular analogy is to compare the data catalog to a library catalog. A library catalog stores whether a book is available, its edition, authors, description, and other metadata. Just like a library catalog can be used to discover books, a data catalog can be used to discover data assets.

Different personas rely on a data catalog. For example:

  • Data engineers want to know the change impact of a new feature in ETL pipelines.

  • Data scientists and analysts use catalogs to find the right data sets for their work.

  • Data stewards scan data catalogs to ensure security and governance policies are being followed.

A data catalog stores technical metadata about data. One of the major sources of the data catalog is the information schema from all the databases, data warehouses, and data lakes. It will also contain other technical information like lineage, ETL scripts, ACLs, and access history.

A data catalog is typically available through UI or web interface and also has APIs for scripting. Popular open-source data catalogs are DataHub and Amundsen.

Business Glossaries

Business glossaries define business terms. A simple example is the definition of a customer or a lead, which different departments must agree on. Without a glossary, it is common for there to be different opinions on even simple terminology such as a customer or a purchase date.

Business glossaries add semantic meaning to data. While a data catalog may state that a column contains a date, a glossary provides information on how that date should be interpreted. Is it the date when a widget was ordered, delivered, or fully paid for?

Data Dictionary

A data dictionary is a searchable repository of the metadata of data assets. The major difference from a data catalog is that it also stores business or semantic information about the data.

The terms “data dictionary” and “data catalog” are used interchangeably, and there is a lot of confusion about when to use which term. In practice, the major difference between the two, storing business or semantic metadata, is not very large: many data catalogs can store semantic information, and the same system can be called either a catalog or a dictionary.

Therefore, the same system can be called a data catalog if its only audience is technical users, or a data dictionary if it is used by a business audience.

Taxonomies

The data taxonomy is an oddball in this list. A taxonomy describes how to assign metadata to data. It provides a framework to name things and to disambiguate when there is confusion about the semantics of data. For example, is the purchase date the date when the first payment was received or the last?

Taxonomies also standardize terms within the data. An example is:

[Image: transactional and master data example]

The above image shows two suppliers that both offer mechanical pencils. However, they may use different terms to describe the same features of a mechanical pencil, as shown in the image below.

[Image: master data management taxonomy]

A taxonomy provides a framework to standardize the values in the Color and Description columns.
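
A minimal sketch of applying such a taxonomy; the supplier terms and standard values below are hypothetical.

# Sketch: standardize supplier-specific color terms using a taxonomy of preferred values.
COLOR_TAXONOMY = {
    "dark blue": "navy",
    "navy blue": "navy",
    "navy": "navy",
    "jet black": "black",
    "black": "black",
}

def standardize_color(raw_value):
    """Map a raw supplier color to the taxonomy's standard term."""
    return COLOR_TAXONOMY.get(raw_value.strip().lower(), "unknown")

supplier_rows = [
    {"supplier": "A", "product": "mechanical pencil", "color": "Navy Blue"},
    {"supplier": "B", "product": "mechanical pencil", "color": "dark blue"},
]
for row in supplier_rows:
    row["color"] = standardize_color(row["color"])
print(supplier_rows)  # both rows now use the standard term "navy"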

Conclusion

The terminology and names of systems to manage metadata can be confusing. The post categorizes metadata and systems based on who uses the metadata. If you agree, disagree, or find it helpful, please share, comment, or like this post.

Data Governance 101

We organized the first Data Governance Meetup on July 4, 2020. This post is a recap of my presentation and is a semi-autobiographical journey in helping data teams set up data governance frameworks.

What is Data Governance?

The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver. In my case, Data Governance is any task required for:

  • Compliance: Data life cycle and usage is in accordance with laws and regulations.

  • Privacy: Protect data as per regulations and user expectations.

  • Security: Data and data infrastructure are adequately protected.

Why is Data Governance hard?

Compliance, Privacy, and Security are different approaches to ensure that data collectors and processors do not gain unregulated insights. It is hard to ensure that the right data governance framework is in place to meet this goal. An interesting example of an unexpected insight is the sequence of events that led to the leakage of the taxi cab tipping history of celebrities.

Paparazzi took photos of celebrities using taxi cabs in New York City. The photos had geolocations and timestamps, along with identifying information about the taxi cabs such as registration and medallion numbers. Independently, the Taxi Commission released an anonymized dataset of taxi trips with times, medallion numbers, fares, and tips. It was possible to link the metadata from the photographs with the taxi trip dataset to recover the tips given by celebrities.

Data Governance is hard because:

  • There is too much data.

  • There is too much complexity in data infrastructure.

  • There is no context for data usage.

There is too much data

The trend is towards businesses collecting more data from users and sharing more data with each other. For example, PayPal publicly lists the companies it has data sharing agreements with.

As companies share and hoard more data, it is possible that they will link these datasets to garner insights that were unexpected by the user.

There is too much complexity

The Data & AI Landscape lists approximately 1500 open-source and commercial technologies. In a small survey, I found that a simple data infrastructure uses 8-10 components. Data and security teams have to ensure similar capabilities in compliance and security across all the parts of the data infrastructure. This is very hard to accomplish.

There is no context for data usage

Analytics, data science, and AI objectives compete with compliance, privacy, and security. Blanket "yes" or "no" access policies do not work. More context is required to enforce access policies appropriately:

  • Who is using the data?

  • For what purpose?

  • When?

How to get started on Data Governance?

I have found it helpful to start teams on Data Governance by having them answer these basic questions:

  • Where is my data?

  • Who has access to data?

  • How is the data used?

Typically, teams care only about sensitive data. Every company and team will have a different definition of what is sensitive; common categories are PII, PHI, and financial data.

It is also important to ensure the process of obtaining answers is automated. Automation will ensure that the data governance framework is relevant and useful when required.

Where is my sensitive data?

A data catalog, scanner, and data lineage application are required to keep track of sensitive data.

An example of a data catalog and scanner is PIICatcher. PIICatcher can scan databases and detect PII data, and it can be extended to detect other types of sensitive data. For example, after scanning data in AWS S3, the metadata stored by PIICatcher can be viewed in the AWS Glue Data Catalog.

Typically it is not practical to scan all datasets. Instead, it is sufficient to scan base datasets and then build a data lineage graph to track sensitive data. A library like data-lineage can build a DAG from query history using a graphing library. The DAG can then be visualized or processed programmatically.
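
The sketch below illustrates the idea (it is not the data-lineage library's actual API): build a DAG from table-to-table edges parsed out of query history, then walk it to find every table downstream of one flagged as containing PII.

# Sketch: propagate a "sensitive" tag through a lineage DAG built from query history.
# The edges stand in for (source table -> target table) pairs parsed from queries.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.users", "staging.users_clean"),
    ("staging.users_clean", "analytics.user_facts"),
    ("raw.orders", "analytics.order_facts"),
])

scanned_as_pii = {"raw.users"}  # e.g., flagged by a scanner such as PIICatcher

sensitive_tables = set(scanned_as_pii)
for table in scanned_as_pii:
    sensitive_tables |= nx.descendants(lineage, table)

print(sorted(sensitive_tables))
# ['analytics.user_facts', 'raw.users', 'staging.users_clean']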

Who has access to my sensitive data?

Most databases have an information schema that stores the privileges of users and roles. These privileges can be joined with a data catalog in which columns containing sensitive data are tagged, to get a list of the users and roles that have access to sensitive data.
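
A sketch of that join; the catalog_column_tags table is hypothetical, and the information schema view and column names follow the ANSI standard, so exact names vary by database.

# Sketch: list users and roles that can read columns tagged as sensitive.
SENSITIVE_ACCESS_SQL = """
SELECT p.grantee, p.table_schema, p.table_name, p.column_name, p.privilege_type
FROM information_schema.column_privileges AS p
JOIN catalog_column_tags AS t
  ON  t.table_schema = p.table_schema
  AND t.table_name   = p.table_name
  AND t.column_name  = p.column_name
WHERE t.tag = 'PII'
  AND p.privilege_type = 'SELECT'
ORDER BY p.grantee
"""

def who_can_read_pii(conn):
    """Run the join on any DB-API connection to the database."""
    cur = conn.cursor()
    cur.execute(SENSITIVE_ACCESS_SQL)
    return cur.fetchall()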

How is sensitive data used?

The first step is to log usage across all databases. Most databases store query history in the information schema. Big Data technologies like Presto provide hooks to capture usage. It is not advisable to log usage from production databases. Instead, proxies should be used for human access. Proxies can log usage and send the information to a log aggregator where it can be analyzed.
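
As a sketch, if the proxy or log aggregator emits one JSON object per query with hypothetical "user" and "tables" fields, the analysis can start as simply as counting who touches sensitive tables.

# Sketch: count how often each user touched tables tagged as sensitive.
import json
from collections import Counter

SENSITIVE_TABLES = {"raw.users", "analytics.user_facts"}  # e.g., from the catalog

def sensitive_access_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        touched = SENSITIVE_TABLES.intersection(event.get("tables", []))
        if touched:
            counts[event["user"]] += len(touched)
    return counts

sample_log = [
    '{"user": "alice", "tables": ["raw.users", "raw.orders"]}',
    '{"user": "bob", "tables": ["raw.orders"]}',
]
print(sensitive_access_counts(sample_log))  # Counter({'alice': 1})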

Conclusion

Data Compliance, Privacy & Security is a journey. Data governance is hard but can be tackled by starting with simple questions and using automation extensively.

Data Lake Chargebacks

Self-service data lakes require robust chargeback systems to prevent runaway costs and to guide investments that improve ROI.

Self-service data lakes give users and departments the freedom to run workloads to get information from data sets. Chargebacks on usage are required to keep a check on costs and on the return on investment in data lakes.

Chargebacks have two major components:

  • Calculate the cost of the workload

  • Attribute the cost to the right user and department

Cost of Workloads

The cost can be broken down into two categories:

  • Cost of storing data in storage devices such as HDFS or AWS S3.

  • Cost of running workloads on compute engines such as Spark, Presto, or Hive.

Storage Costs

Storage cost calculation is straightforward. Every storage device has a cost to store 1 GB of data. For example, AWS S3 standard storage costs $0.023/GB per month (prices vary by region and tier). Similarly, the cost of HDFS or other distributed file systems can be estimated by taking hardware and operating costs into consideration.

Storage costs are then calculated by multiplying the data set size (in GB) by the cost per GB, as in the sketch below.
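
The calculation is a one-liner; the rate below is the S3 price quoted above and is only an assumption.

# Sketch: storage cost = data set size (GB) x cost per GB per month.
COST_PER_GB_MONTH = 0.023  # assumed S3 standard rate; adjust for your storage system

def monthly_storage_cost(dataset_size_gb):
    return dataset_size_gb * COST_PER_GB_MONTH

print(monthly_storage_cost(500))  # a 500 GB data set costs about $11.50 per month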

Workload Costs

A workload uses four main resources in a data lake:

  • Storage

  • CPU

  • Memory

  • Network

Most query engines provide usage metrics for these resources. For example, Spark provides the following execution metrics:

  • executorCpuTime: CPU time the executor spent running the task

  • inputMetrics.bytesRead: Bytes read from tables

  • shuffleReadMetrics.remoteBytesRead: Bytes read over the network.

  • peakExecutionMemory: Bytes used by all internal data structures in memory.

These metrics are per task. Summing them across all tasks gives the complete resource usage of a query or workload. If a cost can be associated with each unit of usage, it is possible to calculate the cost of running the query.

However, this process is onerous and error-prone. A simpler method is to assume that CPU, memory, and network usage is proportional to the bytes read from tables in HDFS or cloud storage, and to charge based on the bytes read by the query or workload.

For example, AWS Athena bases query execution charges on the number of bytes scanned by the query. Even though this measure is not perfect and can be gamed, in practice it approximates workload costs closely with far less implementation effort than calculating exact costs.
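
A sketch of the bytes-read pricing model; the $5 per TB rate mirrors Athena's published pricing at the time of writing and is used here only as an assumption.

# Sketch: charge a workload in proportion to the bytes it read from storage.
PRICE_PER_TB_SCANNED = 5.00  # assumed rate, e.g., AWS Athena's published price
BYTES_PER_TB = 1024 ** 4

def workload_cost(bytes_read):
    return (bytes_read / BYTES_PER_TB) * PRICE_PER_TB_SCANNED

# e.g., a query whose tasks together read 250 GB
print(round(workload_cost(250 * 1024 ** 3), 2))  # ~1.22 (dollars)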

Cost Attribution

Once workload costs can be calculated, the next step is to attribute them to the correct user and aggregate them by department. A data catalog is required to capture metadata about users, departments, data sets, and workloads, as well as the relationships between them.

A data catalog can be stored in a database. SQL queries can then calculate the costs of storage and workloads and attribute them to the right users and departments, as in the sketch below.
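
A sketch of the attribution query; the workload_costs, users, and departments tables are hypothetical catalog tables, and the query runs on any DB-API connection to the catalog database.

# Sketch: attribute workload costs to users and roll them up by department.
CHARGEBACK_SQL = """
SELECT d.department_name,
       u.user_name,
       SUM(w.cost_usd) AS total_cost_usd
FROM workload_costs AS w
JOIN users AS u ON u.user_id = w.user_id
JOIN departments AS d ON d.department_id = u.department_id
WHERE w.run_date >= DATE '2020-07-01'
GROUP BY d.department_name, u.user_name
ORDER BY total_cost_usd DESC
"""

def chargeback_report(conn):
    cur = conn.cursor()
    cur.execute(CHARGEBACK_SQL)
    return cur.fetchall()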

Conclusion

Chargebacks are important to control the costs and improve the ROI of a data lake. Even though most mature data teams have chargeback systems, there are no common best practices or open-source projects that can be adopted instead of building one. One reason is that chargeback policies differ from one company to another.

Does your company have a chargeback system? Or are you considering building one? Please comment or get in touch if you want to discuss how to build a chargeback system.

Automatic Schema Detection And Matching

Data analysts need to know the schema of all the data sets as well as which tables and columns represent common entities. Schema detection and matching techniques automate these operations.

The ability to query and join multiple data sets is an important requirement in modern data lakes. For example, Prestodb (and its fork PrestoSQL) is a popular engine that can query multiple data sources in the same query. However, before an engine like Presto can query these data sets, data analysts need to know the schema of each data set and which tables and columns represent common entities.

Manual processes to specify and standardize schemata are a major roadblock to the productivity of analysts. Schema detection and matching techniques automate these operations and enable analysts to query multiple data sources at scale.


Definitions

Schema Detection

Schema detection infers the schema of third-party data sources. A common use case is to find the schema of CSV, JSON, Avro, or Parquet files.

For example, bigquery_schema_generator [4] can detect the schema of CSV and JSON files, as shown below:

$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
INFO:root:Processed 1 lines
[
     {
       "mode": "NULLABLE",
       "name": "b",
       "type": "BOOLEAN"
     },
     {
       "mode": "NULLABLE",
       "name": "i",
       "type": "INTEGER"
     },
     {
       "mode": "NULLABLE",
       "name": "s",
       "type": "STRING"
     },
     {
       "mode": "NULLABLE",
       "name": "t",
       "type": "TIMESTAMP"
     },
     {
       "mode": "NULLABLE",
       "name": "x",
       "type": "FLOAT"
     }
] 

Schema Matching

Schema matching takes two schemata as input and produces a mapping between elements of the two that correspond semantically to each other.

For example, there are two tables:

  1. vehicle: with columns automobile, color, years, and owner.

  2. transport: with columns car, color, age, and owned_by.

A schema matcher should pair the following columns:

  1. automobile:: car

  2. color:: color

  3. years:: age

  4. owner:: owned_by

Use Cases

  1. Import third-party data sets: Trading partners exchange data sets that describe business transactions or user behavior. Usually, each partner uses its own record format and semantics. For these data sets, the schema has to be detected and then matched with other data sets in the data lake.

  2. Build Data Dictionaries: Data analysts need to understand what's in a dataset (what the fields are and what they mean). Data providers solve this problem today with a "data dictionary", often a spreadsheet explaining the dataset. Data dictionaries can be seeded with schema information automatically. They also provide auxiliary information for schema matching with downstream data sets at different customers.

  3. Build Data Warehouses and Lakes: A Data Warehouse is loaded with data from different sources through an ETL pipeline. The extraction process requires transforming data from the source format into the warehouse format. Schema matching capability helps to automate the generation of the transformation step.

  4. Semantic Query Processing: Data discovery and an intelligent SQL editor are important productivity tools. These require schema, table and column mapping to be able to provide intelligent suggestions when authoring SQL queries.

Classification of Techniques [2]

The techniques for detection and matching are very similar and will be treated together.

Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only metadata.

Element vs structure matching: matching can be performed for individual schema elements, such as attributes, or for combinations of elements. For example, match address fields alone or the combination of (email, address) fields.

Language vs constraint: a matcher can use a linguistic-based approach (e.g., based on names and textual descriptions of schema elements) or a constraint-based approach (e.g., based on keys and relationships).

Matching on data statistics: Match based on the similarity of data types and statistics such as counts, histograms, and the number of nulls.

Auxiliary information: Use auxiliary information, such as dictionaries, global schemas, previous matching decisions, and user input.

Techniques to compare columns

  • String Distance: Column names can be compared using edit distance. The pair with the lowest edit distance is mapped with the highest confidence, for example, birthday and birth_date. (A sketch combining this technique with others follows this list.)

  • Semantic Distance: Instead of checking the spelling, column names are compared by checking the meaning of the parts of the name. For example, salary and compensation. Semantic distance requires a domain taxonomy to find synonyms.

  • Data Types: Columns are paired if they have the same data type, such as integer, float, timestamp, or string. Data types can be obtained from a database catalog or determined by casting the string to each of the data types and checking if there are any errors. This method is used in the tableschema-py [3] project.

  • Statistics: Pair columns if statistics of the data, such as the range of values, average or median, standard deviation or interquartile range, and number of missing values, are similar.

  • Semantics from content: Similar to semantic distance, the semantics of the data such as all possible unique values and range of values can be compared between columns.
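
A minimal sketch of two of these techniques, string distance (using Python's standard difflib) and data type agreement, scoring candidate column pairs for the vehicle and transport tables above. Name similarity alone will not rank pairs like years/age highly, which is exactly where semantic distance or statistics are needed.

# Sketch: score column pairs with string similarity plus a data-type check.
# Uses only the standard library; the data types below are assumed.
from difflib import SequenceMatcher

vehicle = {"automobile": "string", "color": "string", "years": "integer", "owner": "string"}
transport = {"car": "string", "color": "string", "age": "integer", "owned_by": "string"}

def match_score(name_a, type_a, name_b, type_b):
    name_similarity = SequenceMatcher(None, name_a, name_b).ratio()
    type_bonus = 0.2 if type_a == type_b else 0.0
    return name_similarity + type_bonus

pairs = sorted(
    ((match_score(a, ta, b, tb), a, b)
     for a, ta in vehicle.items()
     for b, tb in transport.items()),
    reverse=True,
)
for score, a, b in pairs[:4]:  # print the highest-scoring candidate pairs
    print(f"{a} -> {b}: {score:.2f}")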

Comparing Tables and Schemata

Tables and schemata can be mapped to one another once columns are mapped between tables in the two schemata. The confidence scores from column mappings are averaged for every pair of tables in the two schemata, and the correct mappings are then chosen using a simple threshold on the confidence score.

One disadvantage of this approach is that the number of comparisons grows with the Cartesian product of the columns across every pair of data sets. For example, if there are 5 columns in each data set and there are 1,000 data sets, at least 12,487,500 comparisons are needed (25 column pairs for each of the 499,500 pairs of data sets).

There are general and domain-specific techniques to reduce the number of comparisons. For example, two tables are compared only if both have a date column. Another technique is to use computationally cheap checks, such as the edit distance of column names, to eliminate low-confidence matches early.

Hybrid Detectors and Matchers

No single technique is foolproof. It is common to use multiple detectors and matchers for columns, tables, and schemata. The confidence scores from each technique are combined, and the pairs with the best combined scores are reported, as in the sketch below.
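
A sketch of the combination step, assuming each matcher returns a confidence in [0, 1]; the matcher names, weights, and threshold are hypothetical.

# Sketch: combine confidence scores from several matchers with a weighted average.
MATCHER_WEIGHTS = {"string_distance": 0.3, "data_type": 0.2, "statistics": 0.5}
REPORT_THRESHOLD = 0.6

def combined_confidence(scores):
    """scores: matcher name -> confidence in [0, 1] for one column pair."""
    total_weight = sum(MATCHER_WEIGHTS[m] for m in scores)
    return sum(MATCHER_WEIGHTS[m] * s for m, s in scores.items()) / total_weight

pair_scores = {"string_distance": 0.4, "data_type": 1.0, "statistics": 0.8}
confidence = combined_confidence(pair_scores)
print(round(confidence, 2), confidence >= REPORT_THRESHOLD)  # 0.72 True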

Conclusion

Federated databases and query engines require automated schema detection and matching. There are no popular open-source projects to detect and match schemata, so hopefully the techniques in this newsletter help you roll your own. Start a conversation here or on Twitter if you want to discuss schema detection and matching.

Have a specific topic in mind for the next newsletter or other comments? Send a message here or on Twitter.

References

[1] https://datamade.us/blog/schema-matching/

[2] Rahm, Erhard & Bernstein, Philip A. (2001). A Survey of Approaches to Automatic Schema Matching. The VLDB Journal, 10, 334-350. doi:10.1007/s007780100057

[3] infer() in frictionlessdata/tableschema-py

[4] BigQuery Schema Generator
