A Practical Data Quality Process

This post provides high-level, practical tips for approaching data quality.

Data quality management is necessary for data-driven decision making. However, many data quality initiatives fall short of expectations. This post describes a framework that can help plan and execute a successful data quality project.

What is data quality?

The first question to answer is: what does high data quality mean? There are two common definitions:

  1. Data quality is high if data is fit for the intended purpose.

  2. Data quality is high if the data completely reflects the real-world entity.

By definition 1, data quality is high if it is possible to send an invoice to a customer, even though the data may be missing the customer’s phone number.

By definition 2, data quality is high only if the data correctly captures all attributes of the customer. In the previous example, data quality is low if the data contains the correct address but is missing the customer’s phone number.

The two definitions can pull in opposite directions. A reasonable expectation is for master data to be applicable to multiple business decisions. However, it may be impractical or costly to ensure all-round data quality. Any data quality process has to balance these opposing forces.

A Data Quality Flow

A data quality flow that focuses on business outcomes has the advantage of demonstrating value, which improves the chances of continued funding. Such a flow is described below.

The process is as follows:

  1. Define a business issue, such as the inability to send invoices correctly.

  2. Triage with product and engineering teams. Set expectations on the outcome as well as thresholds and KPIs for data quality. For example, a threshold may be that at most 1% of attempts to send invoices can fail.

  3. Product and engineering teams fix and deploy.

  4. Verify that data quality has improved based on the previously agreed thresholds and KPIs, as sketched below.
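
To make step 4 concrete, here is a minimal sketch that checks the invoice KPI from step 2; the record layout and the "status" field are hypothetical.

# Sketch: verify the agreed data quality KPI (at most 1% of invoice sends may fail).
# Assumes invoice attempts are available as dicts with a hypothetical "status" field.
FAILURE_THRESHOLD = 0.01  # agreed with product and engineering during triage

def invoice_failure_rate(attempts):
    if not attempts:
        return 0.0
    failures = sum(1 for a in attempts if a["status"] == "failed")
    return failures / len(attempts)

sample = [{"status": "sent"}] * 198 + [{"status": "failed"}] * 2
rate = invoice_failure_rate(sample)
print(rate, rate <= FAILURE_THRESHOLD)  # 0.01, so the KPI is met

Running this check on a schedule after the fix is deployed keeps the verification tied to the business threshold agreed during triage.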

Data Quality Tools

There are a number of commercial and open-source tools, such as Great Expectations, that help data engineering teams with data quality management. The two main techniques are:

  1. Data profiling and alerts on changes in data profiling.

  2. Tests for data pipelines.

A typical problem is that these techniques are used to monitor the data quality of all attributes. This can lead to alert fatigue without any perceptible improvement in business outcomes. Another issue is that data profiles and requirements change; since maintaining data quality rules is manually intensive, the rules are not kept up to date.

Tying data quality rules and tool usage to specific business outcomes, as described in the previous section, keeps the monitoring focused on what matters. A sketch of such an outcome-scoped profiling alert follows.
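
As an illustration, the sketch below profiles only the attributes that the invoicing outcome depends on and alerts when their null rate drifts past a threshold; the column names and thresholds are hypothetical.

# Sketch: profile-drift alert scoped to the columns tied to the invoicing outcome.
# Column names and thresholds are hypothetical.
MONITORED_COLUMNS = {"billing_address", "customer_email"}  # needed to send invoices
MAX_NULL_RATE = 0.01

def null_rate(rows, column):
    if not rows:
        return 0.0
    return sum(1 for r in rows if not r.get(column)) / len(rows)

def profile_alerts(rows):
    return [
        f"{col}: null rate {null_rate(rows, col):.1%} exceeds {MAX_NULL_RATE:.0%}"
        for col in sorted(MONITORED_COLUMNS)
        if null_rate(rows, col) > MAX_NULL_RATE
    ]

customers = [
    {"billing_address": "1 Main St", "customer_email": "a@example.com", "phone": None},
    {"billing_address": None, "customer_email": "b@example.com", "phone": None},
]
print(profile_alerts(customers))  # phone is ignored; billing_address triggers an alert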

Data Catalogs, Dictionaries, Taxonomies and Glossaries

This post explores the different types of metadata about data and the personas who use this metadata in their work.

Metadata in a data lake is important for the productivity of everyone in the data ecosystem. The different types of metadata, systems to store them, and their consumers can be very confusing. How is a data catalog different from a dictionary or a glossary? This post will explore all aspects of metadata for data.

Information Schema

The most basic type of metadata is stored by the database itself in an information schema. The information schema is an ANSI SQL standard and provides system information on tables, views, columns, users, and permissions, among other database-specific information. Its primary users are database administrators, who use it to verify the internal state of the database.

The information schema is typically accessed through SQL statements or non-standard commands like SHOW or DESCRIBE at the database prompt or in scripts.

The Hive Metastore and AWS Glue Data Catalog are popular information schemas in data lakes. As an example, the MySQL documentation shows a query that lists all tables in a schema “db5” along with database-specific information like the storage engine; a sketch of running it follows.
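
A sketch of running that query from Python; the connection details are placeholders, and any DB-API driver (here mysql-connector-python) works the same way.

# Sketch: list tables and their storage engine for schema "db5" via the information schema.
# Connection parameters are placeholders.
import mysql.connector  # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(host="localhost", user="dba", password="secret", database="db5")
cur = conn.cursor()
cur.execute(
    """
    SELECT table_name, table_type, engine
    FROM information_schema.tables
    WHERE table_schema = 'db5'
    ORDER BY table_name
    """
)
for table_name, table_type, engine in cur.fetchall():
    print(table_name, table_type, engine)
conn.close()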

There are multiple instances of an information schema - one per database in the organization.

Data Catalog

The Data Catalog is a system-wide inventory of all the data assets. A popular analogy is to compare the data catalog to a library catalog. A library catalog stores whether a book is available, its edition, authors, description, and other metadata. Just like a library catalog can be used to discover books, a data catalog can be used to discover data assets.

Different personas rely on a data catalog. For example:

  • Data engineers want to know the change impact of a new feature in ETL pipelines.

  • Data scientists and analysts use catalogs to find the right data sets for their work.

  • Data stewards scan data catalogs to ensure security and governance policies are being followed.

A data catalog stores technical metadata about data. One of the major sources of the data catalog is the information schema from all the databases, data warehouses, and data lakes. It will also contain other technical information like lineage, ETL scripts, ACLs, and access history.

A data catalog is typically available through UI or web interface and also has APIs for scripting. Popular open-source data catalogs are DataHub and Amundsen.

Business Glossaries

Business glossaries define business terms. A simple example is the definition of a customer or a lead, which different departments must agree on. Without a glossary, it is common for there to be different opinions on even simple terminology such as a customer or a purchase date.

Business glossaries add semantic meaning to data. While a data catalog may state that a column contains a date, a glossary provides information on how that date should be interpreted. Is it the date when a widget was ordered, delivered, or fully paid for?

Data Dictionary

A data dictionary is a searchable repository of the metadata of data assets. The major difference from a data catalog is that it also stores business or semantic information about the data.

The terms “data dictionary” and “data catalog” are used interchangeably, and there is a lot of confusion about when to use which term. In practice, the major difference between the two, storing business or semantic metadata, is not very large: many data catalogs can store semantic information, and the same system can be called either a catalog or a dictionary.

Therefore, the same system can be called a data catalog if its only audience is technical users, or a data dictionary if it is used by a business audience.

Taxonomies

The data taxonomy is an oddball in this list. A taxonomy describes how to assign metadata to data. It provides a framework to name things and to disambiguate when there is confusion about the semantics of data. For example, is the purchase date the date when the first payment was received or the last?

Taxonomies also standardize terms within the data. An example is:

[Image: transactional and master data example]

The above image shows two suppliers that both offer mechanical pencils. However, they may use different terms to describe the same features of a mechanical pencil, as shown in the image below.

[Image: master data management taxonomy]

A taxonomy provides a framework to standardize the values in the Color and Description columns.
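
A minimal sketch of applying such a taxonomy; the supplier terms and standard values below are hypothetical.

# Sketch: standardize supplier-specific color terms using a taxonomy of preferred values.
COLOR_TAXONOMY = {
    "dark blue": "navy",
    "navy blue": "navy",
    "navy": "navy",
    "jet black": "black",
    "black": "black",
}

def standardize_color(raw_value):
    """Map a raw supplier color to the taxonomy's standard term."""
    return COLOR_TAXONOMY.get(raw_value.strip().lower(), "unknown")

supplier_rows = [
    {"supplier": "A", "product": "mechanical pencil", "color": "Navy Blue"},
    {"supplier": "B", "product": "mechanical pencil", "color": "dark blue"},
]
for row in supplier_rows:
    row["color"] = standardize_color(row["color"])
print(supplier_rows)  # both rows now use the standard term "navy"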

Conclusion

The terminology and names of systems to manage metadata can be confusing. The post categorizes metadata and systems based on who uses the metadata. If you agree, disagree, or find it helpful, please share, comment, or like this post.

Data Governance 101

We organized the first Data Governance Meetup on July 4, 2020. This post is a recap of my presentation and is a semi-autobiographical journey in helping data teams set up data governance frameworks.

What is Data Governance?

The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. It has been helpful to define Data Governance based on the outcomes it is supposed to deliver. In my case, Data Governance is any task required for:

  • Compliance: Data life cycle and usage is in accordance with laws and regulations.

  • Privacy: Protect data as per regulations and user expectations.

  • Security: Data and data infrastructure are adequately protected.

Why is Data Governance hard?

Compliance, Privacy, and Security are different approaches to ensure that data collectors and processors do not gain unregulated insights. It is hard to ensure that the right data governance framework is in place to meet this goal. An interesting example of an unexpected insight is the sequence of events that led to the leakage of the taxi cab tipping history of celebrities.

Paparazzi took photos of celebrities using taxi cabs in New York City. The photos had geolocations and timestamps, along with identifying information about the taxi cabs such as registration and medallion numbers. Independently, the Taxi Commission released an anonymized dataset of taxi trips with times, medallion numbers, fares, and tips. It was possible to link the metadata from the photographs with the taxi trip dataset to recover the tips given by celebrities.

Data Governance is hard because:

  • There is too much data.

  • There is too much complexity in data infrastructure.

  • There is no context for data usage.

There is too much data

The trend is towards businesses collecting more data from users and sharing more data with each other. For example, PayPal publicly lists the companies it has data sharing agreements with.

As companies share and hoard more data, it is possible that they will link these datasets to garner insights that were unexpected by the user.

There is too much complexity

The Data & AI Landscape lists approximately 1500 open-source and commercial technologies. In a small survey, I found that a simple data infrastructure uses 8-10 components. Data and security teams have to ensure similar capabilities in compliance and security across all the parts of the data infrastructure. This is very hard to accomplish.

There is no context for data usage

Analytics, data science, and AI objectives compete with compliance, privacy, and security. Blanket "yes" or "no" access policies do not work. More context is required to enforce access policies appropriately:

  • Who is using the data?

  • For what purpose?

  • When?

How to get started on Data Governance?

I have found it helpful to start teams on Data Governance by having them answer these basic questions:

  • Where is my data?

  • Who has access to data?

  • How is the data used?

Typically, teams care only about sensitive data. Every company and team will have a different definition of what is sensitive; common categories are PII, PHI, and financial data.

It is also important to ensure the process of obtaining answers is automated. Automation will ensure that the data governance framework is relevant and useful when required.

Where is my sensitive data?

A data catalog, scanner, and data lineage application are required to keep track of sensitive data.

An example of a data catalog and scanner is PIICatcher. PIICatcher can scan databases and detect PII data, and it can be extended to detect other types of sensitive data. For example, after scanning data in AWS S3, the metadata stored by PIICatcher can be viewed in the AWS Glue Data Catalog.

Typically it is not practical to scan all datasets. Instead, it is sufficient to scan base datasets and then build a data lineage graph to track sensitive data. A library like data-lineage can build a DAG from query history using a graphing library. The DAG can then be visualized or processed programmatically.
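
The sketch below illustrates the idea (it is not the data-lineage library's actual API): build a DAG from table-to-table edges parsed out of query history, then walk it to find every table downstream of one flagged as containing PII.

# Sketch: propagate a "sensitive" tag through a lineage DAG built from query history.
# The edges stand in for (source table -> target table) pairs parsed from queries.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.users", "staging.users_clean"),
    ("staging.users_clean", "analytics.user_facts"),
    ("raw.orders", "analytics.order_facts"),
])

scanned_as_pii = {"raw.users"}  # e.g., flagged by a scanner such as PIICatcher

sensitive_tables = set(scanned_as_pii)
for table in scanned_as_pii:
    sensitive_tables |= nx.descendants(lineage, table)

print(sorted(sensitive_tables))
# ['analytics.user_facts', 'raw.users', 'staging.users_clean']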

Who has access to my sensitive data?

Most databases have an information schema that stores the privileges of users and roles. These privileges can be joined with a data catalog in which columns containing sensitive data are tagged, to get a list of the users and roles that have access to sensitive data.
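
A sketch of that join; the catalog_column_tags table is hypothetical, and the information schema view and column names follow the ANSI standard, so exact names vary by database.

# Sketch: list users and roles that can read columns tagged as sensitive.
SENSITIVE_ACCESS_SQL = """
SELECT p.grantee, p.table_schema, p.table_name, p.column_name, p.privilege_type
FROM information_schema.column_privileges AS p
JOIN catalog_column_tags AS t
  ON  t.table_schema = p.table_schema
  AND t.table_name   = p.table_name
  AND t.column_name  = p.column_name
WHERE t.tag = 'PII'
  AND p.privilege_type = 'SELECT'
ORDER BY p.grantee
"""

def who_can_read_pii(conn):
    """Run the join on any DB-API connection to the database."""
    cur = conn.cursor()
    cur.execute(SENSITIVE_ACCESS_SQL)
    return cur.fetchall()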

How is sensitive data used?

The first step is to log usage across all databases. Most databases store query history in the information schema. Big Data technologies like Presto provide hooks to capture usage. It is not advisable to log usage from production databases. Instead, proxies should be used for human access. Proxies can log usage and send the information to a log aggregator where it can be analyzed.
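
As a sketch, if the proxy or log aggregator emits one JSON object per query with hypothetical "user" and "tables" fields, the analysis can start as simply as counting who touches sensitive tables.

# Sketch: count how often each user touched tables tagged as sensitive.
import json
from collections import Counter

SENSITIVE_TABLES = {"raw.users", "analytics.user_facts"}  # e.g., from the catalog

def sensitive_access_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        touched = SENSITIVE_TABLES.intersection(event.get("tables", []))
        if touched:
            counts[event["user"]] += len(touched)
    return counts

sample_log = [
    '{"user": "alice", "tables": ["raw.users", "raw.orders"]}',
    '{"user": "bob", "tables": ["raw.orders"]}',
]
print(sensitive_access_counts(sample_log))  # Counter({'alice': 1})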

Conclusion

Data Compliance, Privacy & Security is a journey. Data governance is hard but can be tackled by starting with simple questions and using automation extensively.

Data Lake Chargebacks

Self-service data lakes require robust chargeback systems to prevent runaway costs and to guide investments that improve ROI.

Self-service data lakes give users and departments the freedom to run workloads to get information from data sets. Chargebacks on usage are required to keep a check on costs and on the return on investment in data lakes.

Chargebacks have two major components:

  • Calculate the cost of the workload

  • Attribute the cost to the right user and department

Cost of Workloads

The cost can be broken down into two categories:

  • Cost of storing data in storage devices such as HDFS or AWS S3.

  • Cost of running workloads on compute engines such as Spark, Presto, or Hive.

Storage Costs

Storage cost calculation is straightforward. Every storage device has a cost to store 1 GB of data. For example, AWS S3 standard storage costs $0.023/GB per month (prices vary by region and tier). Similarly, the cost of HDFS or other distributed file systems can be estimated by taking hardware and operating costs into consideration.

Storage costs are then calculated by multiplying the data set size (in GB) by the cost per GB, as in the sketch below.
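
The calculation is a one-liner; the rate below is the S3 price quoted above and is only an assumption.

# Sketch: storage cost = data set size (GB) x cost per GB per month.
COST_PER_GB_MONTH = 0.023  # assumed S3 standard rate; adjust for your storage system

def monthly_storage_cost(dataset_size_gb):
    return dataset_size_gb * COST_PER_GB_MONTH

print(monthly_storage_cost(500))  # a 500 GB data set costs about $11.50 per month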

Workload Costs

A workload uses four main resources in a data lake:

  • Storage

  • CPU

  • Memory

  • Network

Most query engines provide usage metrics for these resources. For example, Spark provides the following execution metrics:

  • executorCpuTime: CPU time the executor spent running the task

  • inputMetrics.bytesRead: Bytes read from tables

  • shuffleReadMetrics.remoteBytesRead: Bytes read over the network.

  • peakExecutionMemory: Bytes used by all internal data structures in memory.

These metrics are per task. Summing them across all tasks gives the complete resource usage of a query or workload. If a cost can be associated with each unit of usage, it is possible to calculate the cost of running the query.

However, this process is onerous and error-prone. A simpler method is to assume that CPU, memory, and network usage is proportional to the bytes read from tables in HDFS or cloud storage, and to charge based on the bytes read by the query or workload.

For example, AWS Athena bases query execution charges on the number of bytes scanned by the query. Even though this measure is not perfect and can be gamed, in practice it approximates workload costs closely with far less implementation effort than calculating exact costs.
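
A sketch of the bytes-read pricing model; the $5 per TB rate mirrors Athena's published pricing at the time of writing and is used here only as an assumption.

# Sketch: charge a workload in proportion to the bytes it read from storage.
PRICE_PER_TB_SCANNED = 5.00  # assumed rate, e.g., AWS Athena's published price
BYTES_PER_TB = 1024 ** 4

def workload_cost(bytes_read):
    return (bytes_read / BYTES_PER_TB) * PRICE_PER_TB_SCANNED

# e.g., a query whose tasks together read 250 GB
print(round(workload_cost(250 * 1024 ** 3), 2))  # ~1.22 (dollars)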

Cost Attribution

Once workload costs can be calculated, the next step is to attribute them to the correct user and aggregate them by department. A data catalog is required to capture metadata about users, departments, data sets, and workloads, as well as the relationships between them.

A data catalog can be stored in a database. SQL queries can then calculate the costs of storage and workloads and attribute them to the right users and departments, as in the sketch below.
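
A sketch of the attribution query; the workload_costs, users, and departments tables are hypothetical catalog tables, and the query runs on any DB-API connection to the catalog database.

# Sketch: attribute workload costs to users and roll them up by department.
CHARGEBACK_SQL = """
SELECT d.department_name,
       u.user_name,
       SUM(w.cost_usd) AS total_cost_usd
FROM workload_costs AS w
JOIN users AS u ON u.user_id = w.user_id
JOIN departments AS d ON d.department_id = u.department_id
WHERE w.run_date >= DATE '2020-07-01'
GROUP BY d.department_name, u.user_name
ORDER BY total_cost_usd DESC
"""

def chargeback_report(conn):
    cur = conn.cursor()
    cur.execute(CHARGEBACK_SQL)
    return cur.fetchall()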

Conclusion

Chargebacks are important to control the costs and improve the ROI of a data lake. Even though most mature data teams have chargeback systems, there are no common best practices or open-source projects that can be adopted instead of building one. One reason is that chargeback policies differ from one company to another.

Does your company have a chargeback system? Or are you considering building one? Please comment or get in touch if you want to discuss how to build a chargeback system.

Automatic Schema Detection And Matching

Data analysts need to know the schema of all the data sets as well as which tables and columns represent common entities. Schema detection and matching techniques automate these operations.

The ability to query and join multiple data sets is an important requirement in modern data lakes. For example, Prestodb (and its fork PrestoSQL) is a popular engine that can query multiple data sources in the same query. However, before an engine like Presto can query these data sets, data analysts need to know the schema of each data set and which tables and columns represent common entities.

Manual processes to specify and standardize schemata are a major roadblock to the productivity of analysts. Schema detection and matching techniques automate these operations and enable analysts to query multiple data sources at scale.


Definitions

Schema Detection

Schema detection infers the schema of third-party data sources. A common use case is to find the schema of CSV, JSON, Avro, or Parquet files.

For example, bigquery_schema_generator [4] can detect the schema of CSV and JSON files, as shown below:

$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
INFO:root:Processed 1 lines
[
     {
       "mode": "NULLABLE",
       "name": "b",
       "type": "BOOLEAN"
     },
     {
       "mode": "NULLABLE",
       "name": "i",
       "type": "INTEGER"
     },
     {
       "mode": "NULLABLE",
       "name": "s",
       "type": "STRING"
     },
     {
       "mode": "NULLABLE",
       "name": "t",
       "type": "TIMESTAMP"
     },
     {
       "mode": "NULLABLE",
       "name": "x",
       "type": "FLOAT"
     }
] 

Schema Matching

Schema matching takes two schemata as input and produces a mapping between elements of the two that correspond semantically to each other.

For example, there are two tables:

  1. vehicle: with columns automobile, color, years, and owner.

  2. transport: with columns car, color, age, and owned_by.

A schema matcher should pair the following columns:

  1. automobile:: car

  2. color:: color

  3. years:: age

  4. owner:: owned_by

Use Cases

  1. Import third-party data sets: Trading partners exchange data sets that describe business transactions or user behavior. Usually, each partner uses its own record format and semantics. For these data sets, the schema has to be detected and then matched with other data sets in the data lake.

  2. Build Data Dictionaries: Data analysts need to understand what's in a dataset (what the fields are and what they mean). Data providers solve this problem today with a "data dictionary", often a spreadsheet explaining the dataset. Data dictionaries can be seeded with schema information automatically. They also provide auxiliary information for schema matching with downstream data sets at different customers.

  3. Build Data Warehouses and Lakes: A Data Warehouse is loaded with data from different sources through an ETL pipeline. The extraction process requires transforming data from the source format into the warehouse format. Schema matching capability helps to automate the generation of the transformation step.

  4. Semantic Query Processing: Data discovery and an intelligent SQL editor are important productivity tools. These require schema, table and column mapping to be able to provide intelligent suggestions when authoring SQL queries.

Classification of Techniques [2]

The techniques for detection and matching are very similar and will be treated together.

Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only metadata.

Element vs structure matching: matching can be performed for individual schema elements, such as attributes, or for combinations of elements. For example, match address fields alone or the combination of (email, address) fields.

Language vs constraint: a matcher can use a linguistic-based approach (e.g., based on names and textual descriptions of schema elements) or a constraint-based approach (e.g., based on keys and relationships).

Matching on data statistics: Match based on the similarity of data types and statistics such as counts, histograms, and the number of nulls.

Auxiliary information: Use auxiliary information, such as dictionaries, global schemas, previous matching decisions, and user input.

Techniques to compare columns

  • String Distance: Column names can be compared using edit distance. The pair with the lowest edit distance is mapped with the highest confidence, for example, birthday and birth_date. (A sketch combining this technique with others follows this list.)

  • Semantic Distance: Instead of checking the spelling, column names are compared by checking the meaning of the parts of the name. For example, salary and compensation. Semantic distance requires a domain taxonomy to find synonyms.

  • Data Types: Columns are paired if they have the same data type, such as integer, float, timestamp, or string. Data types can be obtained from a database catalog or determined by casting the string to each of the data types and checking if there are any errors. This method is used in the tableschema-py [3] project.

  • Statistics: Pair columns if statistics of the data, such as the range of values, average or median, standard deviation or interquartile range, and number of missing values, are similar.

  • Semantics from content: Similar to semantic distance, the semantics of the data such as all possible unique values and range of values can be compared between columns.
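
A minimal sketch of two of these techniques, string distance (using Python's standard difflib) and data type agreement, scoring candidate column pairs for the vehicle and transport tables above. Name similarity alone will not rank pairs like years/age highly, which is exactly where semantic distance or statistics are needed.

# Sketch: score column pairs with string similarity plus a data-type check.
# Uses only the standard library; the data types below are assumed.
from difflib import SequenceMatcher

vehicle = {"automobile": "string", "color": "string", "years": "integer", "owner": "string"}
transport = {"car": "string", "color": "string", "age": "integer", "owned_by": "string"}

def match_score(name_a, type_a, name_b, type_b):
    name_similarity = SequenceMatcher(None, name_a, name_b).ratio()
    type_bonus = 0.2 if type_a == type_b else 0.0
    return name_similarity + type_bonus

pairs = sorted(
    ((match_score(a, ta, b, tb), a, b)
     for a, ta in vehicle.items()
     for b, tb in transport.items()),
    reverse=True,
)
for score, a, b in pairs[:4]:  # print the highest-scoring candidate pairs
    print(f"{a} -> {b}: {score:.2f}")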

Comparing Tables and Schemata

Tables and schemata can be mapped to one another once columns are mapped between tables in the two schemata. The confidence scores from column mappings are averaged for every pair of tables in the two schemata, and the correct mappings are then chosen using a simple threshold on the confidence score.

One disadvantage of this approach is that the number of comparisons grows with the Cartesian product of the columns across every pair of data sets. For example, if there are 5 columns in each data set and there are 1,000 data sets, at least 12,487,500 comparisons are needed (25 column pairs for each of the 499,500 pairs of data sets).

There are general and domain-specific techniques to reduce the number of comparisons. For example, two tables are compared only if both have a date column. Another technique is to use computationally cheap checks, such as the edit distance of column names, to eliminate low-confidence matches early.

Hybrid Detectors and Matchers

No single technique is foolproof. It is common to use multiple detectors and matchers for columns, tables, and schemata. The confidence scores from each technique are combined, and the pairs with the best combined scores are reported, as in the sketch below.
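
A sketch of the combination step, assuming each matcher returns a confidence in [0, 1]; the matcher names, weights, and threshold are hypothetical.

# Sketch: combine confidence scores from several matchers with a weighted average.
MATCHER_WEIGHTS = {"string_distance": 0.3, "data_type": 0.2, "statistics": 0.5}
REPORT_THRESHOLD = 0.6

def combined_confidence(scores):
    """scores: matcher name -> confidence in [0, 1] for one column pair."""
    total_weight = sum(MATCHER_WEIGHTS[m] for m in scores)
    return sum(MATCHER_WEIGHTS[m] * s for m, s in scores.items()) / total_weight

pair_scores = {"string_distance": 0.4, "data_type": 1.0, "statistics": 0.8}
confidence = combined_confidence(pair_scores)
print(round(confidence, 2), confidence >= REPORT_THRESHOLD)  # 0.72 True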

Conclusion

Federated databases and query engines require automated schema detection and matching. There are no popular open-source projects to detect and match schemata, so hopefully the techniques in this newsletter help you roll your own. Start a conversation here or on Twitter if you want to discuss schema detection and matching.

Have a specific topic in mind for the next newsletter or other comments? Send a message here or on Twitter.

References

[1] https://datamade.us/blog/schema-matching/

[2] Rahm, Erhard & Bernstein, Philip A. (2001). A Survey of Approaches to Automatic Schema Matching. The VLDB Journal, 10, 334-350. doi:10.1007/s007780100057

[3] infer() in frictionlessdata/tableschema-py

[4] BigQuery Schema Generator
