Data Catalogs, Dictionaries, Taxonomies and Glossaries
This post explores the different types of metadata about data and the personas who rely on each type in their work.
Metadata in a data lake matters for the productivity of everyone in the data ecosystem. However, the different types of metadata, the systems that store them, and their consumers can be very confusing. How is a data catalog different from a dictionary or a glossary? The sections below walk through each type in turn.
Information Schema
The most basic type of metadata is stored by the database itself in an information schema. The information schema is defined in the ANSI/ISO SQL standard and provides system information on tables, views, columns, users, and permissions, along with other database-specific information. Its primary users are database administrators, who use it to verify the internal state of the database.
The information schema is typically accessed through SQL statements or non-standard commands like SHOW or DESCRIBE at the database prompt or in scripts.
For example, the MySQL documentation shows a query against the information schema that lists all tables in a schema “db5”, together with system or database-specific information like the storage engine. In data lakes, the Hive Metastore and the AWS Glue Data Catalog are popular systems that play the role of the information schema.
An organization therefore has multiple instances of an information schema, one per database.
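As a rough illustration of reading database-maintained metadata from a script, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a stand-in so that it runs without a server. SQLite does not implement INFORMATION_SCHEMA; it exposes comparable metadata through the sqlite_master table and PRAGMA table_info. The orders table and the list_columns helper are hypothetical names chosen for this example.

```python
import sqlite3


def list_columns(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for all tables."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    metadata = {}
    for table in tables:
        # PRAGMA table_info returns one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [(col[1], col[2]) for col in info]
    return metadata


# Hypothetical table, created in memory only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, purchase_date TEXT)")
print(list_columns(conn))
```

On MySQL or PostgreSQL, the equivalent lookup would be a SELECT against information_schema.columns rather than a PRAGMA.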
Data Catalog
The data catalog is a system-wide inventory of all the data assets. A popular analogy is to compare the data catalog to a library catalog. A library catalog records whether a book is available, its edition, authors, description, and other metadata. Just like a library catalog can be used to discover books, a data catalog can be used to discover data assets.
Different personas require a data catalog. Examples are:
Data engineers want to know the impact of a change, such as a new feature, on ETL pipelines.
Data scientists and analysts use catalogs to find the right data sets for their work.
Data stewards scan data catalogs to ensure security and governance policies are being followed.
A data catalog stores technical metadata about data. One of its major sources is the information schema of every database, data warehouse, and data lake in the organization. It also contains other technical information like lineage, ETL scripts, ACLs, and access history.
A data catalog is typically available through a web UI and also has APIs for scripting. Popular open-source data catalogs are DataHub and Amundsen.
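To show how these pieces fit together, the following toy Python sketch models a catalog as an in-memory inventory that collects table metadata from several sources, supports keyword search, and answers a one-hop lineage question. It is not the DataHub or Amundsen API; the Asset and Catalog classes, the source names, and the tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One catalog entry: a table plus some technical metadata."""
    source: str    # database, warehouse, or lake the table lives in
    table: str
    columns: dict  # column name -> data type
    upstream: list = field(default_factory=list)  # lineage as "source.table" names


class Catalog:
    def __init__(self):
        self.assets = {}

    def register(self, asset):
        """Add or overwrite an entry, keyed by "source.table"."""
        self.assets[f"{asset.source}.{asset.table}"] = asset

    def search(self, keyword):
        """Find assets whose table or column names contain the keyword."""
        kw = keyword.lower()
        return sorted(
            name
            for name, a in self.assets.items()
            if kw in a.table.lower() or any(kw in c.lower() for c in a.columns)
        )

    def downstream_of(self, name):
        """Assets that read directly from `name` (one-hop change impact)."""
        return sorted(n for n, a in self.assets.items() if name in a.upstream)


catalog = Catalog()
catalog.register(Asset("mysql", "orders", {"id": "int", "purchase_date": "date"}))
catalog.register(
    Asset(
        "warehouse",
        "daily_sales",
        {"day": "date", "total": "decimal"},
        upstream=["mysql.orders"],
    )
)
print(catalog.search("purchase"))             # discovery, as an analyst would do
print(catalog.downstream_of("mysql.orders"))  # change impact, as an engineer would do
```

A real catalog would populate these entries by crawling each information schema instead of registering them by hand.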
Business Glossaries
Business glossaries define business terms. A very simple example is the definition of a customer or a lead. Different departments must agree on a single definition. Without a glossary, it is common for them to hold different opinions on even simple terminology such as a customer or a purchase date.
Business glossaries add semantic meaning to data. While a data catalog may state that a column contains a date, a glossary provides information on how that date should be interpreted. Is it the date when a widget was ordered, delivered, or fully paid for?
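One possible way to represent this in code is a small mapping from agreed terms to definitions, plus links from physical columns to those terms. The Python sketch below is only an illustration: the Term class, the owner field, the column name, and the definition text are hypothetical and not taken from any particular glossary tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str
    definition: str
    owner: str  # department responsible for the definition


# Hypothetical glossary with a single agreed definition.
glossary = {
    "purchase date": Term(
        name="purchase date",
        definition="Date on which the customer placed the order.",
        owner="sales",
    ),
}

# Links from physical columns to glossary terms.
column_terms = {"mysql.orders.purchase_date": "purchase date"}


def describe(column):
    """Return the business definition for a column, or None if it is unmapped."""
    term_name = column_terms.get(column)
    if term_name is None:
        return None
    return glossary[term_name].definition


print(describe("mysql.orders.purchase_date"))
```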
Data Dictionary
A data dictionary is a searchable repository of metadata about data assets. The major difference from a data catalog is that it also stores business or semantic information about the data.
The terms “data dictionary” and “data catalog” are often used interchangeably, and there is a lot of confusion about when to use which. The major difference between the two, storing business or semantic metadata, is also not very large: many data catalogs can store semantic information as well.
Therefore, the same system can be called a data catalog if its only audience is technical users, or a data dictionary if it is used by a business audience.
Taxonomies
The data taxonomy is an oddball in this list. A taxonomy describes how to assign metadata to data. It provides a framework to name things and to disambiguate when there is confusion about the semantics of data. For example, is the purchase date the date when the first payment was received or the last?
Taxonomies also standardize terms within the data. An example is:
The image above shows two suppliers that both offer mechanical pencils. However, they may use different terms to describe the same features of a mechanical pencil, as shown in the image below.
A taxonomy provides a framework to standardize the values in the Color and Description columns.
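A minimal Python sketch of this kind of standardization is shown here, assuming the taxonomy is expressed as a set of accepted synonyms per standard value. The color names and abbreviations are hypothetical and are not taken from the suppliers in the example.

```python
# Hypothetical taxonomy: standard value -> spellings seen in supplier data.
COLOR_TAXONOMY = {
    "black": {"black", "blk", "bk"},
    "blue": {"blue", "blu", "bl"},
}

# Invert the taxonomy once so each lookup is a single dictionary access.
SYNONYM_TO_STANDARD = {
    synonym: standard
    for standard, synonyms in COLOR_TAXONOMY.items()
    for synonym in synonyms
}


def standardize_color(raw):
    """Map a supplier's color value to the standard term.

    Returns None for values the taxonomy does not cover, so they can be
    reviewed instead of being silently passed through.
    """
    return SYNONYM_TO_STANDARD.get(raw.strip().lower())


for value in ["BLK", " Blue ", "teal"]:
    print(repr(value), "->", standardize_color(value))
```

Free-text fields such as Description usually need more than a lookup table, so this approach fits enumerated attributes like Color best.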
Conclusion
The terminology and names of systems that manage metadata can be confusing. This post categorizes metadata and systems based on who uses the metadata. If you agree, disagree, or find it helpful, please share, comment, or like this post.