Open Source SQL Parsers
Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing SQL for data observability.
Query history of a data warehouse is a rich source of information to glean how data is used in your organization. Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can extract:
Popular tables and columns
Unused tables and columns
Column-level lineage
Freshness
These statistics also help to automate common data engineering tasks like:
Backup and Disaster Recovery
Triage Data Quality issues
Track sensitive data and how they are used.
Challenges and Approaches
SQL language is an ISO/IEC standard and the latest version is SQL2016. However, every database implements the standard differently, uses different function names for the same operation, and has extensions to access specific custom features. Therefore, there isn’t one SQL parser for dialects of all popular databases and data warehouses.
Regular expressions is a popular approach to extract information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, windows clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
An alternate approach is to implement the SQL grammar using parser generators like ANTLR. There are similar open-source parser generators in other popular languages.
There are multiple projects that maintain parsers for popular open-source databases like MySQL and Postgres. For other open-source databases, the grammar can be extracted from the open-source project. For commercial databases, the only option is to reverse engineer the complete grammar.
There are SQL parser/optimizer platforms like Apache Calcite that help to reduce the effort to implement the SQL dialect of your choice.
Open Source Parsers
Some popular open-source databases and data warehouses are:
MySQL/MariaDB
Pingcap parser is a MySQL parser in Go.
SQL Parser in phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect.
Postgres
libpg_query extracts the parser (written in C) from the postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like:
Python: pglast
Golang: pg_query_go
JS: psql-parser in Node and pg-query-emscripten in the browser
Rust: pg_query.rs
Multiple Engines
queryparser implements Apache Hive, Presto/Trino and Vertica dialects.
zetasql implements BigQuery, Spanner, and Dataflow dialects.
Generic Parsers
Python: sqlparse
Rust: sqlparser-rs
Python: mo-sql-parsing
Platforms
Parser/Optimizer platforms implement the common SQL language features and allow customization as first-class feature of the platform. Two popular open-source projects are:
Apache Calcite is a popular parser/optimizer that is used in popular databases and query engines like Apache Hive, BlazingSQL and many others.
JSQLParser can parse multiple SQL dialects like MySQL, Postgres and Oracle. The grammar can be modified to support other SQL dialects.
Apache Calcite allows customizations at various points of the parsing process.
Parser rules can be changed to support custom syntax.
Conventions such as quotes vs double quotes, case sensitivity.
Add optimizer rules.
Apache Calcite also provides visitors for traversing the SQL execution plan. A Visitor pattern is an algorithm to traverse a SQL plan.
Practical tips to Getting Started
There are many abandoned open-source SQL parsers. The first filter is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages.
What if there is no parser for your database?
Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries. Many queries will fail to parse such as UNLOAD in AWS Redshift. If it is important to also parse the variants, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.
Conclusion
There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good open-source projects. However, there is a steep learning curve to use these projects and in many cases a project may not fit your specific requirements.
Struggling with parsing query history? Get in touch