Automatic Schema Detection And Matching
Data analysts need to know the schema of all the data sets as well as which tables and columns represent common entities. Schema detection and matching techniques automate these operations.
The ability to query and join multiple data sets is an important requirement in modern data lakes. For example, Prestodb (and its fork PrestoSQL) is a popular engine that is used to query multiple data sets in the same query. However before a data engine like Presto can query these data sets, data analysts need to know the schema of all the data sets and which tables and columns represent common entities.
Manual processes to specify the schema and standardize them is a major roadblock to the productivity of analysts. Schema detection and matching techniques automate these operations and enable analysts to query multiple data sources at scale.
Definitions
Schema Detection
Detect schema from third-party data sources. A common use case is to find the schema of CSV, JSON, AVRO or PARQUET files.
For example, bigquery_schema_generator[4] can detect the schema of CSV and JSON files as shown below:
$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" } ^D
INFO:root:Processed 1 lines
[
{
"mode": "NULLABLE",
"name": "b",
"type": "BOOLEAN"
},
{
"mode": "NULLABLE",
"name": "i",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "s",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "t",
"type": "TIMESTAMP"
},
{
"mode": "NULLABLE",
"name": "x",
"type": "FLOAT"
}
]
Schema Matching
Takes two schemata as input and produces a mapping between elements of the two schemata that correspond semantically to each other.
For example, there are two tables:
vehicle: with columns automobile, color, years, and owner.
transport: with columns car, color, age, and owned_by
A schema matcher should pair the following columns:
automobile:: car
color:: color
years:: age
owner:: owned_by
Use Cases
Import third-party data sets: Trading partners exchange data sets that describe business or user behavior transactions. Usually, each partner uses its record format and semantics. For these data sets, the schema has to be detected and then matched with other data sets in the data lake.
Build Data Dictionaries: Data analysts need to understand what's in the dataset (what are the fields and what do they mean). Data providers solve this problem today with a "data dictionary" which is a spreadsheet explaining a dataset. Data dictionaries can be seeded with schema information automatically. Data dictionaries also help by providing auxiliary information for schema matching with downstream data sets at different customers.
Build Data Warehouses and Lakes: A Data Warehouse is loaded with data from different sources through an ETL pipeline. The extraction process requires transforming data from the source format into the warehouse format. Schema matching capability helps to automate the generation of the transformation step.
Semantic Query Processing: Data discovery and an intelligent SQL editor are important productivity tools. These require schema, table and column mapping to be able to provide intelligent suggestions when authoring SQL queries.
Classification of Techniques [2]
The techniques for detection and matching are very similar and will be treated together.
• Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only metadata.
• Element vs structure matching: match can be performed for individual schema elements, such as attributes, or combinations of elements. For example match address fields or a combination of (email, address) fields.
• Language vs constraint: a matcher can use a linguistic-based approach (e.g., based on names and textual descriptions of schema elements) or a constraint-based approach (e.g., based on keys and relationships).
• Matching on data statistics: Match based on the similarity of data types and statistics such as count, histograms, no. of nulls.
• Auxiliary information: Use auxiliary information, such as dictionaries, global schemas, previous matching decisions, and user input.
Techniques to compare columns
String Distance: Column names can be compared using edit distance. The pair with the lowest edit distance will be mapped to each other with the highest confidence. For example, birthday and birth_date
Semantic Distance: Instead of checking the spelling, column names are compared by checking the meaning of the parts of the name. For example, salary and compensation. Semantic distance requires a domain taxonomy to find synonyms.
Data Types: Columns are paired if they have the same data type such as integer, float, timestamp or string. Data types can be obtained from a database catalog or determined by casting the string to all of the data types and check if there are any errors. This method is used tableschema-py [3] project.
Statistics: Pair columns if the statistics of the data such as range of values, average or median, standard deviation or interquartile range and number of missing values are similar.
Semantics from content: Similar to semantic distance, the semantics of the data such as all possible unique values and range of values can be compared between columns.
Comparing Tables and Schemata
Tables and schemata can be mapped to one other once columns are mapped between tables in two different schemata. The confidence scores from mapping columns are aggregated using average between all pairs of tables in the two schemata. Then the correct mappings are chosen based on a simple threshold of the confidence score.
One disadvantage of this approach is that the computational complexity is the cartesian product of the number of columns in each data set. For example, if there are 5 columns in each data set and there are 1000 data sets, we will need to make at least 12,487,500 comparisons.
There are general and domain-specific techniques to reduce the number of comparisons. For example, two tables are compared if only they have a date column. Another technique is to use computationally cheap options such as calculating edit distance of column names to eliminate low confidence matches.
Hybrid Detectors and Matchers
No single technique is fool-proof. It is common to use multiple detectors and matchers for columns, tables, and schemata. The confidence scores from each technique are combined and the pairs with the best confidence scores are reported.
Conclusion
Federation databases require automated schema detection and matching. There are no popular open-source projects to detect and match schemata. Hopefully, the techniques in this newsletter help you to roll your own. Start a conversation here or on Twitter if you want to discuss schema detection and matching.
Have a specific topic in mind for the next newsletter or other comments? Send a message here or on Twitter.
References
[1]https://datamade.us/blog/schema-matching/
[2] Rahm, Erhard & Bernstein, Philip. (2001). A Survey of Approaches to Automatic Schema Matching.. VLDB J.. 10. 334-350. 10.1007/s007780100057.
[3] infer() in frictionlessdata/tableschema-py
[4] BigQuery Schema Generator