Data Catalog

This edition is a survey of data catalogs for big data. Data catalogs manage metadata about data sets and are a foundation for data governance, privacy, and security.

Jan 19, 2020

A data catalog is the foundation of many capabilities such as data discovery, governance, and security. At its simplest, a data catalog manages metadata about all the data sets in your company. The rest of this newsletter surveys data catalogs built at large web companies, open source data catalogs, and SaaS data catalogs.

Why is a data catalog important?

Ground is a research project at UC Berkeley that is building an open-source data context service. Both the research paper introducing Ground and the review of it by The Morning Paper discuss the advantages of investing in a data catalog.

A data catalog helps solve two problems that modern data teams face:

Poor productivity of people and poor ROI of data.
Governance risk.

The metadata captured by a catalog consists of Application Context (scripts, schema), Behavior (how the data is created and used over time), and Change (how the data is changing over time).

The metadata can describe many kinds of entities: data stores, dashboards and reports, events and schemas, streams, ETL jobs, ML workflows, streaming jobs, and even the company's organizational structure.
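To make the three kinds of metadata concrete, here is a minimal Python sketch of what a single catalog entry might hold. All class and field names (CatalogEntry, UsageEvent, SchemaVersion, and so on) are assumptions made for illustration; they are not taken from Ground or from any of the catalogs surveyed here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ApplicationContext:
    """Application Context: what is needed to interpret the data."""
    schema: Dict[str, str]                       # column name -> type
    scripts: List[str] = field(default_factory=list)


@dataclass
class UsageEvent:
    """Behavior: one record of the data being created or used."""
    actor: str
    action: str                                  # e.g. "read" or "write"
    at: datetime


@dataclass
class SchemaVersion:
    """Change: a snapshot of an earlier schema."""
    version: int
    schema: Dict[str, str]


@dataclass
class CatalogEntry:
    """Hypothetical catalog record combining all three kinds of metadata."""
    name: str
    entity_type: str                             # e.g. "data store", "dashboard", "ETL job"
    context: ApplicationContext
    behavior: List[UsageEvent] = field(default_factory=list)
    changes: List[SchemaVersion] = field(default_factory=list)

    def record_use(self, actor: str, action: str) -> None:
        # Append a usage event stamped with the current UTC time.
        self.behavior.append(UsageEvent(actor, action, datetime.now(timezone.utc)))

    def evolve_schema(self, new_schema: Dict[str, str]) -> None:
        # Keep the outgoing schema as a numbered version, then switch to the new one.
        old = SchemaVersion(len(self.changes) + 1, dict(self.context.schema))
        self.changes.append(old)
        self.context.schema = dict(new_schema)


entry = CatalogEntry(
    name="rides",
    entity_type="data store",
    context=ApplicationContext(schema={"ride_id": "string"}),
)
entry.record_use("analyst", "read")
entry.evolve_schema({"ride_id": "string", "city": "string"})
```

A real catalog would persist these records and populate them automatically from query logs and schema registries; the sketch only shows how the three categories can live side by side on one entry.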

Research Paper

Morning Paper Review

Data Catalog Implementations

Many large web companies have built a data catalog for their big data infrastructure. Some of these projects are:

Amundsen by Lyft. It is an open-source project on GitHub.
DataPortal by Airbnb.
Databook by Uber.
Metacat by Netflix. It is an open-source project available on GitHub.

At a high level, all the projects satisfy the same requirements and consist of similar building blocks. They differ in the details of metadata modeling, ingestion, serving, and indexing. These differences stem from the different priorities each company assigns to these functions, as well as from differences in their internal processes for managing data. Check out the links for a deeper dive into use cases and architecture.
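The building blocks named above (modeling, ingestion, indexing, and serving) can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not the design of Amundsen, DataPortal, Databook, or Metacat; the TinyCatalog class and its methods are hypothetical.

```python
import re
from collections import defaultdict


class TinyCatalog:
    """Toy catalog: a dict as the metadata store plus an inverted index."""

    def __init__(self):
        self._entries = {}                  # modeling: name -> metadata record
        self._index = defaultdict(set)      # indexing: token -> entry names

    @staticmethod
    def _tokens(*texts):
        # Lowercase and split on anything that is not a letter or digit.
        for text in texts:
            yield from re.findall(r"[a-z0-9]+", text.lower())

    def ingest(self, name, description, columns):
        # Ingestion: store the record and index every token it contains.
        # Re-ingesting a name does not remove stale tokens in this sketch.
        self._entries[name] = {
            "name": name,
            "description": description,
            "columns": list(columns),
        }
        for token in self._tokens(name, description, *columns):
            self._index[token].add(name)

    def search(self, query):
        # Serving: return entries that match every token in the query.
        tokens = list(self._tokens(query))
        if not tokens:
            return []
        hits = set.intersection(*(self._index.get(t, set()) for t in tokens))
        return [self._entries[name] for name in sorted(hits)]


catalog = TinyCatalog()
catalog.ingest("rides_daily", "Daily ride counts per city", ["city", "ride_count"])
catalog.ingest("drivers", "Driver profile", ["driver_id", "city"])
city_matches = [e["name"] for e in catalog.search("city")]
ride_matches = [e["name"] for e in catalog.search("daily ride")]
```

Production catalogs typically replace the dict with a graph or relational store and the inverted index with a search engine, which is where much of the variation between the projects shows up.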

Apache Atlas and ODPi Egeria

Apache Atlas is a data governance and metadata management platform by the Apache community. It was designed specifically for the Hadoop ecosystem, though later versions support other data infrastructure technologies.

ODPi Egeria is an open standard that enables databases, data warehouses, and data infrastructure from different vendors to communicate with each other. The project addresses the following problems in the data catalog landscape:

There are multiple systems capturing metadata.
Each system is built for a specific technology.
It is impractical to build a catalog that will support all technologies.

Egeria aims to help different systems exchange metadata with each other and, hopefully, break down the silos created by proprietary protocols.
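The general idea of exchanging metadata through a shared format can be illustrated with a small Python sketch. It does not use Egeria's actual types, connectors, or protocols; the two source formats ("system A" and "system B"), the common record shape, and all function names are assumptions invented for this example.

```python
from typing import Dict, List


def from_system_a(record: Dict) -> Dict:
    # Hypothetical system A describes tables with db / tbl / cols keys.
    return {
        "qualified_name": f"system_a.{record['db']}.{record['tbl']}",
        "columns": [col["name"] for col in record["cols"]],
    }


def from_system_b(record: Dict) -> Dict:
    # Hypothetical system B describes files with path / fields keys.
    return {
        "qualified_name": f"system_b.{record['path']}",
        "columns": list(record["fields"]),
    }


def merge_catalogs(*batches: List[Dict]) -> Dict[str, Dict]:
    # Once every system speaks the common format, merging is a plain union
    # keyed on the qualified name.
    merged = {}
    for batch in batches:
        for asset in batch:
            merged[asset["qualified_name"]] = asset
    return merged


a_assets = [
    from_system_a(
        {"db": "sales", "tbl": "orders", "cols": [{"name": "order_id"}, {"name": "amount"}]}
    )
]
b_assets = [from_system_b({"path": "events/clicks", "fields": ["user_id", "url"]})]
shared = merge_catalogs(a_assets, b_assets)
```

The point of the sketch is that each system needs only one translator to the common format, rather than one translator per pair of systems.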

Data Catalog as a service

The three major public clouds each provide a data catalog as a service:

Data Catalog by GCP.
Glue Catalog by AWS.
Data Catalog by Azure.

If you are running your data infrastructure in the cloud, these are a good default choice.

Applications built on a Data Catalog

A Data Catalog enables many applications to improve productivity and governance. A representative list of applications is:

Data Discovery
Data Dictionary
Data Provenance
Measure ROI
Privileged Access Management
Auditing and compliance
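As one example of such an application, data provenance can be answered by walking lineage edges stored in the catalog. The sketch below assumes lineage is available as a simple mapping from each data set to the data sets it is derived from; the function name and the sample data sets are hypothetical.

```python
from collections import deque


def upstream(lineage, dataset):
    """Return every data set that `dataset` depends on, nearest first.

    `lineage` maps a data set name to the names it is derived from.
    """
    seen = {dataset}                 # guards against cycles in the lineage
    queue = deque([dataset])
    order = []
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = {
    "revenue_dashboard": ["daily_agg"],
    "daily_agg": ["raw_events", "users"],
    "raw_events": [],
}
sources = upstream(lineage, "revenue_dashboard")
```

The same traversal run in the opposite direction (from a source to everything derived from it) supports impact analysis, for example when a table containing personal data has to be deleted.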

Conclusion

A data catalog is a necessary foundation for capabilities like a data dictionary, governance, and security. When you plan to add a data catalog, make sure you follow through by implementing at least one application on top of it. If you decide to build your own, check out the open source data catalog projects or learn from the implementations at large web companies.

In subsequent posts, I will go deeper into data catalog architectures as well as applications. If you found this survey useful, send a signal by liking and/or subscribing to the newsletter.

Have a specific topic in mind for the next newsletter, or other comments? Send a message here or on Twitter.

Data Governance, Privacy and Security
