How to Get Started on Data Governance

This post describes the prerequisites as well as a methodology for a successful implementation of a data governance platform.

In this guest post, Syed Atif Akhtar provides insights on how an organization can get started on Data Governance. These insights are based on his experience helping organizations big & small to put data governance systems in place.

Data Governance helps organizations improve developer productivity, data quality, compliance, and security. However, many organizations fail to extract value from it. This post describes the prerequisites and a methodology for a successful implementation of a data governance platform.

Prerequisites for a data governance implementation

Strategy and Vision

The strategy should clearly articulate the value proposition and goals of data governance across different time horizons. Clear goals and objectives help everyone see the value and stay motivated to stick with the process.

The strategy has to consider both the tools and the organization. It is not sufficient to introduce the right tools alone. Organizations that follow a purely top-down approach often fail because they invest heavily in tools but ignore the organization’s dynamics. Hence, the right incentives have to be created for everyone in the organization to participate and take ownership of data governance within their area.

The strategy should support decentralization and adopt extensible tools. Each team should be able to add features they see fit and evolve the governance fabric itself based on the needs of the department.

The strategy should consider the needs of data engineers, data scientists, and AI engineers along with data analysts. Hence, a data governance strategy should include relational databases and data warehouses as well as technologies for data engineering, data science, and artificial intelligence. It should also consider that new technologies will become popular and will be adopted by the engineers in the team.

The strategy should roll out data governance in phases. The roadmap should consist of incremental improvements rather than an overnight change, so the organization can assimilate the new technologies and processes.

Comprehensive goals

The goals of data governance are typically limited to compliance, accountability, and audits. However, its value goes beyond compliance to developer productivity and data quality. Framing goals around productivity and quality provides stronger motivation to choose the right investments. For example, the organization is then more likely to own data governance instead of outsourcing it to a vendor, and sub-organizations will take ownership of the datasets useful to them.

Technology

There are many factors to consider when choosing the right data governance technologies. Some of them are:

  • Build vs Buy

  • Open Source vs Commercial

The right decision depends on the goals, vision, and capabilities of the team. For example, the organization may not be able to install and maintain an open-source project. In other cases, the project may not work with the ecosystem adopted by the enterprise.

Steps in a successful data governance implementation

Prototype Phase

The goal of the prototype phase is to answer these questions:

  • What are the datasets available?

  • Who is using the datasets?

  • How are the datasets used?

In the prototype phase:

  • Curate Important Data

  • Focus on a few teams, ideally Data Science or AI teams.

  • Implement basic classification and tagging of data.

  • The catalog pulls metadata from storage systems, in contrast to a push-based model where data owners publish metadata to the catalog themselves. At this stage, pull-based is better because organizational buy-in to data governance is not yet there (see the sketch after this list).
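
To make the pull-based model concrete, here is a minimal sketch that crawls a MySQL-compatible source’s information_schema into a simple in-memory catalog. The connection details are placeholders and the catalog structure is an illustrative assumption, not any specific product’s API.

    # A minimal sketch of pull-based metadata ingestion: the catalog crawls a
    # source database's information_schema instead of waiting for data owners
    # to push metadata. Connection details are placeholders and the catalog is
    # a plain dictionary for illustration.
    from collections import defaultdict

    import mysql.connector  # any DB-API driver exposing an information_schema works


    def crawl_schema(conn, schema_name):
        """Pull table and column metadata for one schema into a simple catalog."""
        catalog = defaultdict(list)
        cur = conn.cursor()
        cur.execute(
            "SELECT table_name, column_name, data_type "
            "FROM information_schema.columns WHERE table_schema = %s",
            (schema_name,),
        )
        for table_name, column_name, data_type in cur.fetchall():
            catalog[table_name].append({"column": column_name, "type": data_type})
        return catalog


    if __name__ == "__main__":
        conn = mysql.connector.connect(host="localhost", user="crawler",
                                       password="...", database="sales")  # placeholders
        for table, columns in crawl_schema(conn, "sales").items():
            print(table, columns)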

Enabling Phase

The goal of the enabling phase is to add tooling to analyze metadata and check compliance.

In the enabling phase:

  • The organization moves to a push-based model.

  • Focus on reducing the friction to onboard new teams and datasets.

  • A central team creates compliance and audit policies.

  • The technology is augmented to analyze and implement these policies.

  • Automate the generation of compliance reports.

Scaling Phase

In the final phase, the goal is to align teams on data domains and federate data governance.

In the scaling phase:

  • Data Governance is decentralized. Data Stewards are part of sub-organizations.

  • Sub-organizations have the flexibility to choose their own policies, which are managed and enforced through Open Policy Agent (see the sketch after this list).

  • Distributed data catalogs may be chosen to reflect the architecture of different data marts.

  • A business glossary is created to link code, SQL scripts, models, and transformations across different sub-organizations and data catalogs.
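
To illustrate the Open Policy Agent piece, here is a minimal sketch of a data tool asking OPA for an access decision over its REST Data API. The policy package path (datagov/access/allow) and the input fields are assumptions for illustration; each sub-organization would publish its own policies behind the same interface.

    # A minimal sketch of a federated policy check: a data tool asks Open Policy
    # Agent for a decision over its REST Data API. The policy package path
    # (datagov/access/allow) and the input fields are illustrative assumptions.
    import requests

    OPA_URL = "http://localhost:8181/v1/data/datagov/access/allow"


    def is_access_allowed(user, dataset, purpose):
        """Ask OPA whether `user` may access `dataset` for `purpose`."""
        payload = {"input": {"user": user, "dataset": dataset, "purpose": purpose}}
        response = requests.post(OPA_URL, json=payload, timeout=5)
        response.raise_for_status()
        # OPA wraps the policy decision in a "result" field; default to deny.
        return response.json().get("result", False)


    if __name__ == "__main__":
        print(is_access_allowed("alice", "sales.invoices", "analytics"))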

Conclusion

This post provides a blueprint for organizations to successfully implement data governance and extract value from it. Organizations should focus on strategy, goals, and technology, and then implement them in phases, targeting simple goals first.

A Practical Data Quality Process

This post provides high-level, practical tips for approaching data quality.

Data quality management is necessary for data-driven decision making. However, the outcome of many data quality initiatives is not up to expectations. This post describes one framework that can help to plan and execute a successful data quality project.

What is data quality?

The first step is to understand what high data quality means. There are two common definitions:

  1. Data quality is high if data is fit for the intended purpose.

  2. Data quality is high if the data completely reflects the real-world entity.

By definition 1, data quality is high if it is possible to send an invoice to the customer, even if the data is missing the customer’s phone number.

By definition 2, data quality is high only if the data correctly captures all attributes of the customer. In the previous example, data quality is low if the data contains the correct address but is missing the customer’s phone number.

These two definitions can contradict each other. Definition 2 is a reasonable expectation for master data, which has to serve multiple business decisions. However, it may be impractical or costly to ensure all-round data quality. Any data quality process has to balance these opposing forces.

A Data Quality Flow

A data quality flow that focuses on business outcomes has the advantage of demonstrating value, which improves the chances of continued funding. Such a flow is described below.

The process starts with:

  1. Defining a business issue, such as the inability to send invoices correctly.

  2. Triage with product and engineering teams. Set expectations on the outcome as well as thresholds and KPIs for data quality. For example, a threshold may be that no more than 1% of attempts to send invoices can fail.

  3. Product and engineering teams fix and deploy.

  4. Verify that data quality has improved based on previously agreed thresholds and KPIs.

Data Quality Tools

There are a number of commercial and open-source tools, such as Great Expectations, that help data engineering teams with data quality management. The main techniques are:

  1. Data profiling and alerts on changes in data profiling.

  2. Tests for data pipelines.

A typical problem is that these techniques are used to monitor the data quality of all attributes. This can lead to alert fatigue with no perceptible improvement in business outcomes. Another issue is that data profiles and requirements change over time. Since maintaining data quality rules is manually intensive, they are not kept up to date.

Tying data quality rules and tool usage to specific business outcomes, as described in the previous section, helps focus monitoring effort where it matters.
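
As an illustration, the sketch below uses Great Expectations’ classic pandas API (the exact API differs across versions) to check only the attributes needed to send an invoice. The column names, sample data, and thresholds are assumptions.

    # A minimal sketch of outcome-focused checks with Great Expectations'
    # classic pandas API (the exact API differs across versions). Only the
    # columns needed to send an invoice are checked; the column names and
    # sample data are illustrative assumptions.
    import great_expectations as ge
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "billing_address": ["12 Main St", None, "99 Side Ave"],
        "invoice_amount": [120.0, 80.5, -5.0],
    })

    dataset = ge.from_pandas(customers)

    # Invoices cannot be sent without a billing address.
    address_check = dataset.expect_column_values_to_not_be_null("billing_address")
    # Negative invoice amounts indicate a broken upstream pipeline.
    amount_check = dataset.expect_column_values_to_be_between(
        "invoice_amount", min_value=0, max_value=None
    )

    print(address_check.success, amount_check.success)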

Data Catalogs, Dictionaries, Taxonomies and Glossaries

This post explores the different types of metadata about data and the personas who use this metadata in their work.

Metadata in a data lake is important for the productivity of everyone in the data ecosystem. The different types of metadata, systems to store them, and their consumers can be very confusing. How is a data catalog different from a dictionary or a glossary? This post will explore all aspects of metadata for data.

Information Schema

The most basic type of metadata is stored by the database itself in an information schema. The information schema is an ANSI SQL standard and provides system information on tables, views, columns, users, and permissions, among other database-specific information. Its primary users are database administrators who need to inspect the internal state of the database.

The information schema is typically accessed through SQL statements or non-standard commands like SHOW or DESCRIBE at the database prompt or in scripts.

An example from the MySQL documentation lists all tables in the schema “db5” along with database-specific information like the storage engine; a similar query is sketched below. The Hive Metastore and AWS Glue Data Catalog are popular equivalents of the information schema in data lakes.
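
A query along those lines might look like the following sketch; the connection details are placeholders, and any DB-API driver with an information_schema would work.

    # A sketch of the kind of query the MySQL documentation shows: list tables
    # in schema "db5" with database-specific details such as the storage engine.
    # Connection details are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="dba", password="...")
    cur = conn.cursor()
    cur.execute(
        "SELECT table_name, table_type, engine "
        "FROM information_schema.tables "
        "WHERE table_schema = 'db5' ORDER BY table_name"
    )
    for table_name, table_type, engine in cur.fetchall():
        print(table_name, table_type, engine)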

There are multiple instances of an information schema - one per database in the organization.

Data Catalog

The Data Catalog is a system-wide inventory of all data assets. A popular analogy is to compare the data catalog to a library catalog. A library catalog stores whether a book is available, its edition, authors, description, and other metadata. Just like a library catalog can be used to discover books, a data catalog can be used to discover data assets.

Different personas rely on a data catalog. For example:

  • Data engineers want to know the impact of a change or a new feature on ETL pipelines.

  • Data scientists and analysts use catalogs to find the right data sets for their work.

  • Data stewards scan data catalogs to ensure security and governance policies are being followed.

A data catalog stores technical metadata about data. One of the major sources of the data catalog is the information schema from all the databases, data warehouses, and data lakes. It will also contain other technical information like lineage, ETL scripts, ACLs, and access history.

A data catalog is typically available through UI or web interface and also has APIs for scripting. Popular open-source data catalogs are DataHub and Amundsen.

Business Glossaries

Business glossaries define business terms. A very simple example is the definition of a customer or a lead. Different departments must agree on a single definition; without a glossary, it is common for departments to hold different opinions on even simple terminology such as a customer or a purchase date.

Business glossaries add semantic meaning to data. While a data catalog may state that a column contains a date, a glossary provides information on how that date should be interpreted. Is it the date when a widget was ordered, delivered, or fully paid?

Data Dictionary

A data dictionary is a searchable repository of the business or semantic metadata of data assets. The major difference from a data catalog is that it also stores business or semantic information about the data.

The terms “data dictionary” and “data catalog” are often used interchangeably, and there is a lot of confusion about when to use which term. Moreover, the major difference between the two, storing business or semantic metadata, is not very large in practice. Many data catalogs can store semantic information, so the same system can be called a catalog or a dictionary.

Therefore, the same system can be called a data catalog if its only audience is technical users, or a data dictionary if it is used by a business audience.

Taxonomies

The data taxonomy is an oddball in this list. A taxonomy describes how to assign metadata to data. It provides a framework to name things and to disambiguate when there is confusion about the semantics of data. For example, is the purchase date the date when the first payment was received or the last?

Taxonomies also standardize terms within the data. An example is:

[Image: transactional and master data example]

The above image shows two suppliers that both offer mechanical pencils. However, they may use different terms to describe the same features of a mechanical pencil, as shown in the image below.

[Image: master data management taxonomy]

A taxonomy provides a framework to standardize the values in the Color and Description columns.
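
As a minimal illustration, a taxonomy can be applied in code as a mapping from supplier-specific terms to canonical ones; the terms below are assumptions, not a standard vocabulary.

    # A minimal sketch of applying a taxonomy: map supplier-specific color terms
    # to canonical values. The canonical terms and synonyms are assumptions.
    COLOR_TAXONOMY = {
        "navy": "Blue",
        "dark blue": "Blue",
        "blue": "Blue",
        "charcoal": "Gray",
        "grey": "Gray",
        "gray": "Gray",
    }


    def standardize_color(raw_value):
        """Map a free-text color from a supplier feed to its canonical term."""
        return COLOR_TAXONOMY.get(raw_value.strip().lower(), "Unknown")


    print(standardize_color("Dark Blue"))  # Blue
    print(standardize_color("Charcoal"))   # Gray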

Conclusion

The terminology and names of systems to manage metadata can be confusing. This post categorizes metadata and systems based on who uses the metadata. If you agree, disagree, or find it helpful, please share, comment, or like this post.

Data Governance 101

We organized the first Data Governance Meetup on July 4, 2020. This post is a recap of my presentation and a semi-autobiographical account of helping data teams set up data governance frameworks.

What is Data Governance?

The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. I have found it helpful to define Data Governance based on the outcomes it is supposed to deliver. In my case, Data Governance is any task required for:

  • Compliance: Data life cycle and usage is in accordance with laws and regulations.

  • Privacy: Protect data as per regulations and user expectations.

  • Security: Data & data infrastructure is adequately protected.

Why is Data Governance hard?

Compliance, Privacy, and Security are different approaches to ensure that data collectors and processors do not gain unregulated insights. It is hard to ensure that the right data governance framework is in place to meet this goal. An interesting example of an unexpected insight is the sequence of events leading to the leak of celebrities’ taxi cab tipping history.

Paparazzi took photos of celebrities getting in and out of taxi cabs in New York City. The photos had geolocations and timestamps, along with identifying information about the cabs such as registration and medallion numbers. Independently, the Taxi Commission released an anonymized dataset of taxi trips with times, medallion numbers, fares, and tips. It was possible to link the metadata from the photographs with the taxi trip dataset to recover the tips given by celebrities.

Data Governance is hard because:

  • There is too much data

  • There is too much complexity in data infrastructure.

  • There is no context for data usage.

There is too much data

The trend is towards businesses collecting more data from users and sharing more data with each other. For example, the image below lists some of the companies with which PayPal has data-sharing agreements.

As companies share and hoard more data, it is possible that they will link these datasets to garner insights that were unexpected by the user.

There is too much complexity

The Data & AI Landscape lists approximately 1500 open-source and commercial technologies. In a small survey, I found that even a simple data infrastructure uses 8-10 components. Data and security teams have to ensure consistent compliance and security capabilities across all parts of the data infrastructure. This is very hard to accomplish.

There is no context for data usage

Analytics, Data Science, and AI objectives compete with compliance, privacy, and security. Blanket “yes” or “no” access policies do not work. More context is required to enforce access policies appropriately:

  • Who is using the data?

  • For what purpose?

  • When?

How to get started on Data Governance?

I have found it helpful to get teams started on Data Governance by answering these basic questions:

  • Where is my data?

  • Who has access to data?

  • How is the data used?

Typically, teams care only about sensitive data. Every company and team will have a different definition of what is sensitive; common categories are PII, PHI, and financial data.

It is also important to ensure the process of obtaining answers is automated. Automation will ensure that the data governance framework is relevant and useful when required.

Where is my sensitive data?

A data catalog, scanner, and data lineage application are required to keep track of sensitive data.

An example of a data catalog scanner is PIICatcher. PIICatcher can scan databases and detect PII, and it can be extended to detect other types of sensitive data. The image shows the metadata stored in the AWS Glue Catalog by PIICatcher after scanning data in AWS S3.
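
PIICatcher’s own interface is not reproduced here; as a rough illustration of what such a scanner does, the sketch below samples column values, matches them against known PII patterns, and tags the column.

    # A simplified illustration of what a PII scanner does (not PIICatcher's
    # API): sample values from a column, match them against known PII patterns,
    # and return tags for columns that look sensitive.
    import re

    PII_PATTERNS = {
        "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
        "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }


    def scan_column(column_name, sample_values):
        """Return the set of PII tags detected in a column's sampled values."""
        tags = set()
        for value in sample_values:
            for tag, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    tags.add(tag)
        return tags


    # Hypothetical sample pulled from a customers table.
    print(scan_column("contact", ["jane@example.com", "555-867-5309"]))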

Typically, it is not practical to scan all datasets. Instead, it is sufficient to scan base datasets and then build a data lineage graph to track sensitive data. A library like data-lineage can build a DAG from query history using a graph library. The DAG can be used to visualize the lineage or can be processed programmatically.
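
As a simplified illustration of the idea (not the data-lineage library’s actual API), a lineage DAG can be built with networkx from the source and target tables extracted from query history.

    # A rough sketch of building a lineage DAG from query history with networkx.
    # The data-lineage library extracts sources and targets by parsing SQL; here
    # the parsed result is a hand-written mapping for illustration.
    import networkx as nx

    # (target table, source tables) pairs extracted from INSERT/CTAS queries.
    query_history = [
        ("staging.orders", ["raw.orders"]),
        ("marts.customer_ltv", ["staging.orders", "raw.customers"]),
    ]

    dag = nx.DiGraph()
    for target, sources in query_history:
        for source in sources:
            dag.add_edge(source, target)

    # Every table downstream of a sensitive base table is also sensitive.
    print(sorted(nx.descendants(dag, "raw.customers")))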

Who has access to my sensitive data?

Most databases have an information schema that stores the privileges of users and roles. These tables can be joined with a data catalog, where columns containing sensitive data are tagged, to get a list of users and roles that have access to sensitive data.
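
A minimal sketch of that join is shown below, with both tables mocked in SQLite; the table and column names are assumptions, and in practice the grants would come from each database’s information_schema.

    # A minimal sketch of "who has access to sensitive data?": join catalog tags
    # with grants. Both tables are mocked in SQLite, and the table and column
    # names are assumptions; real grants come from each database's
    # information_schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE column_tags (table_name TEXT, column_name TEXT, tag TEXT);
        CREATE TABLE column_privileges (
            grantee TEXT, table_name TEXT, column_name TEXT, privilege TEXT);

        INSERT INTO column_tags VALUES ('customers', 'email', 'pii');
        INSERT INTO column_privileges VALUES ('analyst_role', 'customers', 'email', 'SELECT');
        INSERT INTO column_privileges VALUES ('etl_role', 'customers', 'email', 'SELECT');
        """
    )

    rows = conn.execute(
        """
        SELECT DISTINCT p.grantee, p.table_name, p.column_name
        FROM column_privileges p
        JOIN column_tags t
          ON p.table_name = t.table_name AND p.column_name = t.column_name
        WHERE t.tag = 'pii'
        """
    ).fetchall()

    for grantee, table, column in rows:
        print(f"{grantee} can read {table}.{column}")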

How is sensitive data used?

The first step is to log usage across all databases. Most databases store query history in the information schema, and big data technologies like Presto provide hooks to capture usage. It is not advisable to log usage directly from production databases; instead, proxies should be used for human access. Proxies can log usage and send the information to a log aggregator where it can be analyzed.
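
As a minimal sketch of the proxy idea, the snippet below records the user, query text, and timestamp to a JSON-lines log before forwarding the query; SQLite stands in for the real backend, and the log format is an illustrative assumption.

    # A minimal sketch of usage logging through a proxy layer: every query a
    # human runs is recorded (user, query, timestamp) before it is forwarded to
    # the database. SQLite stands in for the real backend, and the log format
    # is an illustrative assumption.
    import json
    import sqlite3
    import time


    def run_logged_query(conn, user, query, log_file="query_log.jsonl"):
        """Append a usage record to a JSON-lines log, then execute the query."""
        record = {"user": user, "query": query, "timestamp": time.time()}
        with open(log_file, "a") as log:
            log.write(json.dumps(record) + "\n")
        return conn.execute(query).fetchall()


    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    print(run_logged_query(conn, "alice", "SELECT id, email FROM customers"))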

Conclusion

Data Compliance, Privacy & Security is a journey. Data governance is hard but can be tackled by starting with simple questions and using automation extensively.

Data Lake Chargebacks

Self-service data lakes require robust chargeback systems to prevent runaway costs and to guide investments that improve the ROI of the data lake.

Self-service data lakes give users and departments the freedom to run workloads to get information from data sets. Chargebacks on usage are required to keep costs in check and track the return on investment of the data lake.

Chargebacks have two major components:

  • Calculate the cost of the workload

  • Attribute the cost to user and department

Cost of Workloads

The cost can be broken down into two categories:

  • Cost of storing data in storage devices such as HDFS or AWS S3

  • Cost of running workloads on compute engines such as Spark, Presto, or Hive.

Storage Costs

Storage cost calculation is straightforward. Every storage system has a cost to store 1 GB of data. For example, AWS S3 costs about $0.023/GB per month (caveat emptor). Similarly, the cost of HDFS or other distributed file systems can be estimated by taking hardware and operational costs into consideration.

Storage costs are then calculated by multiplying the data set size (in GB) by the cost per GB.
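
For example, at the S3 rate above, a 500 GB dataset costs roughly 500 × $0.023 ≈ $11.50 per month.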

Workload Costs

A workload uses 4 main resources in a data lake:

  • Storage

  • CPU

  • Memory

  • Network

Most query engines provide usage metrics for these resources. For example, Spark provides the following execution metrics:

  • executorCpuTime: CPU time the executor spent running this task

  • inputMetrics.bytesRead: Bytes read from tables

  • shuffleReadMetrics.remoteBytesRead: Bytes read over the network.

  • peakExecutionMemory: Bytes used by all internal data structures in memory.

These metrics are reported per task. Summing them across all tasks gives the complete resource usage of a query or workload. If a cost can be associated with each unit of usage, it is possible to calculate the cost of running the query.

However, this process is onerous and error-prone. A simpler method is to assume that CPU, memory, and network usage are proportional to the bytes read from tables in HDFS or cloud storage, and to charge based on the bytes read by the query or workload.

For example, AWS Athena charges for query execution based on the number of bytes scanned by the query. Even though this measure is not perfect and can be gamed, in practice it provides a close approximation of workload costs with far less effort than building systems to calculate exact workload costs.
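
A minimal sketch of this charging model is shown below; the price per terabyte scanned is a placeholder to be replaced with your own rate or a vendor’s published price.

    # A minimal sketch of the bytes-read charging model: every workload is
    # charged in proportion to the bytes it scanned. The price per terabyte is
    # a placeholder; use your own rate or a vendor's published price.
    PRICE_PER_TB_SCANNED = 5.00   # assumed rate in dollars per TB scanned
    BYTES_PER_TB = 1024 ** 4


    def workload_cost(bytes_read):
        """Charge a workload proportionally to the bytes it read from storage."""
        return (bytes_read / BYTES_PER_TB) * PRICE_PER_TB_SCANNED


    # A query that scanned 250 GB.
    print(round(workload_cost(250 * 1024 ** 3), 2))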

Cost Attribution

Once workload costs can be calculated, the next step is to attribute them to the correct user and aggregate them by department. A data catalog is required to capture metadata about users, departments, data sets, and workloads, as well as the relationships between them.

The data catalog can be stored in a database. SQL queries can then calculate storage and workload costs and attribute them to the right users and departments.
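
A minimal sketch of such an attribution query is shown below, with the catalog tables mocked in SQLite; the table and column names are assumptions.

    # A minimal sketch of cost attribution: join workload costs to users and
    # roll them up by department. The catalog tables and columns are
    # assumptions; in practice they live in the catalog's backing database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_name TEXT, department TEXT);
        CREATE TABLE workload_costs (user_name TEXT, cost_usd REAL);

        INSERT INTO users VALUES ('alice', 'marketing'), ('bob', 'finance');
        INSERT INTO workload_costs VALUES ('alice', 12.50), ('alice', 3.75), ('bob', 20.00);
        """
    )

    rows = conn.execute(
        """
        SELECT u.department, ROUND(SUM(w.cost_usd), 2) AS total_cost
        FROM workload_costs w
        JOIN users u ON u.user_name = w.user_name
        GROUP BY u.department
        ORDER BY total_cost DESC
        """
    ).fetchall()

    for department, total_cost in rows:
        print(department, total_cost)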

Conclusion

Chargebacks are important to control the costs and improve the ROI of a data lake. Even though most mature data teams have chargeback systems, there are no common best practices or open-source projects that can be adopted instead of building one. One reason is that chargeback policies differ from one company to another.

Does your company have a chargeback system? Or are you considering building one? Please comment or get in touch if you want to discuss how to build a chargeback system.
