Data Lake Chargebacks

Self-service data lakes require robust chargeback systems to prevent runaway costs and to guide investments that improve the ROI of the data lake.

Self-service data lakes give users and departments the freedom to run workloads against shared data sets. Chargebacks on usage are required to keep costs in check and to track the return on investment in the data lake.

Chargebacks have two major components:

  • Calculate the cost of the workload

  • Attribute the cost to user and department

Cost of Workloads

The cost can be broken down into two categories:

  • Cost of storing data in storage devices such as HDFS or AWS S3

  • Cost of running workloads on compute such as Spark, Presto or Hive.

Storage Costs

Storage cost calculation is straightforward. Every storage device has a cost to store 1 GB of data. For example, AWS S3 standard storage costs about $0.023/GB per month (prices vary by region and storage class). Similarly, the cost of HDFS or other distributed file systems can be estimated from hardware and operating costs.

Storage costs are then calculated by multiplying the data set size (in GB) by the cost per GB.
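
The arithmetic can be sketched in a couple of lines of Python; the 500 GB data set and the reuse of the $0.023/GB S3 rate are illustrative assumptions:

```python
def storage_cost(size_gb: float, cost_per_gb: float) -> float:
    """Storage cost: data set size multiplied by the per-GB rate."""
    return size_gb * cost_per_gb

# A 500 GB data set at the $0.023/GB S3 rate quoted above.
print(round(storage_cost(500, 0.023), 2))  # 11.5
```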

Workload Costs

A workload uses 4 main resources in a data lake:

  • Storage

  • CPU

  • Memory

  • Network

All query engines provide usage metrics for all these resources. For example, Spark provides the following execution metrics:

  • executorCpuTime: CPU time the executor spent running the task.

  • inputMetrics.bytesRead: Bytes read from tables

  • shuffleReadMetrics.remoteBytesRead: Bytes read over the network.

  • peakExecutionMemory: Bytes used by all internal data structures in memory.

These metrics are reported per task. Summing them across all tasks gives the complete resource usage of a query or workload. If a cost can be associated with each unit of usage, the cost of running the query can be calculated.

However, this process is onerous and error-prone. A simpler method is to assume that the usage of CPU, memory, and network is proportional to the bytes read from the tables in HDFS or cloud storage, and to charge based on the bytes read by the query or workload.

For example, AWS Athena charges for query execution based on the number of bytes scanned by the query. Even though this measure is not perfect and can be gamed, in practice it is a close approximation of workload cost with far less implementation effort than calculating exact costs.
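
A minimal sketch of this bytes-read charging model in Python; the $5-per-TB rate is an illustrative Athena-style figure, not an exact price:

```python
# Charge a workload by bytes read, the simplification described above.
# COST_PER_TB is an illustrative Athena-style rate, not an exact price.
COST_PER_TB = 5.00
BYTES_PER_TB = 1024 ** 4

def workload_cost(bytes_read: int) -> float:
    """Approximate query cost as a linear function of bytes scanned."""
    return bytes_read / BYTES_PER_TB * COST_PER_TB

# A query that scans 250 GB of table data.
print(round(workload_cost(250 * 1024 ** 3), 2))  # 1.22
```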

Cost Attribution

Once workload costs can be calculated, the next step is to attribute them to the correct users and aggregate them by department. A data catalog is required to capture metadata about users, departments, data sets, and workloads, as well as the relationships between them.

A data catalog can be stored in a database. SQL queries can be run to calculate the costs of storage and workloads followed by attribution to the right users and departments.
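
As a sketch of this step, the toy catalog below uses an in-memory SQLite database; the table and column names are invented for illustration, not a standard catalog schema:

```python
import sqlite3

# Toy data catalog in SQLite; table and column names are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_name TEXT, department TEXT);
    CREATE TABLE workloads (user_name TEXT, cost REAL);
    INSERT INTO users VALUES ('alice', 'marketing'), ('bob', 'finance');
    INSERT INTO workloads VALUES ('alice', 12.5), ('alice', 7.5), ('bob', 3.0);
""")

# Attribute workload costs to users, then aggregate per department.
rows = con.execute("""
    SELECT u.department, SUM(w.cost)
    FROM workloads w JOIN users u ON w.user_name = u.user_name
    GROUP BY u.department
    ORDER BY u.department
""").fetchall()
print(rows)  # [('finance', 3.0), ('marketing', 20.0)]
```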


Chargebacks are important to control costs and improve the ROI of the data lake. Even though most mature data teams have chargeback systems, there are no common best practices or open-source projects that can be used instead of building one in-house. One reason is that chargeback policies differ from one company to another.

Does your company have a chargeback system? Or are you considering building one? Please comment or get in touch if you want to discuss how to build one.

Automatic Schema Detection And Matching

Data analysts need to know the schema of all the data sets as well as which tables and columns represent common entities. Schema detection and matching techniques automate these operations.

The ability to query and join multiple data sets is an important requirement in modern data lakes. For example, Prestodb (and its fork PrestoSQL) is a popular engine used to query multiple data sets in the same query. However, before an engine like Presto can query these data sets, data analysts need to know the schema of each data set and which tables and columns represent common entities.

Manual processes to specify and standardize schemata are a major roadblock to analyst productivity. Schema detection and matching techniques automate these operations and enable analysts to query multiple data sources at scale.


Schema Detection

Schema detection infers the schema of third-party data sources. A common use case is to find the schema of CSV, JSON, Avro, or Parquet files.

For example, bigquery_schema_generator[4] can detect the schema of CSV and JSON files as shown below:

    $ generate-schema
    { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
    ^D
    INFO:root:Processed 1 lines
    [
      {
        "mode": "NULLABLE",
        "name": "b",
        "type": "BOOLEAN"
      },
      {
        "mode": "NULLABLE",
        "name": "i",
        "type": "INTEGER"
      },
      {
        "mode": "NULLABLE",
        "name": "s",
        "type": "STRING"
      },
      {
        "mode": "NULLABLE",
        "name": "t",
        "type": "TIMESTAMP"
      },
      {
        "mode": "NULLABLE",
        "name": "x",
        "type": "FLOAT"
      }
    ]

Schema Matching

Schema matching takes two schemata as input and produces a mapping between elements of the two schemata that correspond semantically to each other.

For example, there are two tables:

  1. vehicle: with columns automobile, color, years, and owner.

  2. transport: with columns car, color, age, and owned_by.

A schema matcher should pair the following columns:

  1. automobile:: car

  2. color:: color

  3. years:: age

  4. owner:: owned_by

Use Cases

  1. Import third-party data sets: Trading partners exchange data sets that describe business transactions or user behavior. Usually, each partner uses its own record format and semantics. For these data sets, the schema has to be detected and then matched with other data sets in the data lake.

  2. Build Data Dictionaries: Data analysts need to understand what's in the dataset (what are the fields and what do they mean). Data providers solve this problem today with a "data dictionary" which is a spreadsheet explaining a dataset. Data dictionaries can be seeded with schema information automatically. Data dictionaries also help by providing auxiliary information for schema matching with downstream data sets at different customers.

  3. Build Data Warehouses and Lakes: A Data Warehouse is loaded with data from different sources through an ETL pipeline. The extraction process requires transforming data from the source format into the warehouse format. Schema matching capability helps to automate the generation of the transformation step.

  4. Semantic Query Processing: Data discovery and an intelligent SQL editor are important productivity tools. These require schema, table and column mapping to be able to provide intelligent suggestions when authoring SQL queries.

Classification of Techniques [2]

The techniques for detection and matching are very similar and will be treated together.

Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only metadata.

Element vs structure matching: a match can be performed on individual schema elements, such as attributes, or on combinations of elements. For example, match address fields alone, or a combination of (email, address) fields.

Language vs constraint: a matcher can use a linguistic-based approach (e.g., based on names and textual descriptions of schema elements) or a constraint-based approach (e.g., based on keys and relationships).

Matching on data statistics: Match based on the similarity of data types and statistics such as counts, histograms, and the number of nulls.

Auxiliary information: Use auxiliary information, such as dictionaries, global schemas, previous matching decisions, and user input.

Techniques to compare columns

  • String Distance: Column names can be compared using edit distance. The pair with the lowest edit distance is mapped with the highest confidence; for example, birthday and birth_date.

  • Semantic Distance: Instead of checking the spelling, column names are compared by checking the meaning of the parts of the name. For example, salary and compensation. Semantic distance requires a domain taxonomy to find synonyms.

  • Data Types: Columns are paired if they have the same data type, such as integer, float, timestamp, or string. Data types can be obtained from a database catalog or determined by casting the string values to each data type and checking for errors. This method is used in the tableschema-py [3] project.

  • Statistics: Pair columns if the statistics of the data such as range of values, average or median, standard deviation or interquartile range and number of missing values are similar.

  • Semantics from content: Similar to semantic distance, the semantics of the data such as all possible unique values and range of values can be compared between columns.
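
A minimal sketch that combines two of these signals, string similarity (via Python's difflib, standing in for true edit distance) and data-type agreement. The schemas and the 0.7/0.3 weights are invented for illustration; note that string similarity alone still misses synonym pairs such as automobile and car, which is why semantic distance needs a taxonomy:

```python
from difflib import SequenceMatcher

def column_score(name_a, type_a, name_b, type_b):
    """Confidence that two columns match: string similarity of the names
    plus a bonus when the data types agree. The 0.7/0.3 weights are
    arbitrary illustration values."""
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    type_sim = 1.0 if type_a == type_b else 0.0
    return 0.7 * name_sim + 0.3 * type_sim

schema_a = [("birthday", "timestamp"), ("color", "string"), ("owner", "string")]
schema_b = [("birth_date", "timestamp"), ("color", "string"), ("owned_by", "string")]

# Pair each column in schema_a with its best-scoring candidate in schema_b.
matches = {}
for name_a, type_a in schema_a:
    best = max(schema_b, key=lambda col: column_score(name_a, type_a, col[0], col[1]))
    matches[name_a] = best[0]
print(matches)  # {'birthday': 'birth_date', 'color': 'color', 'owner': 'owned_by'}
```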

Comparing Tables and Schemata

Tables and schemata can be mapped to one another once columns are mapped between tables in the two schemata. The confidence scores from the column mappings are averaged across all pairs of tables in the two schemata, and the correct mappings are then chosen with a simple threshold on the confidence score.
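
The aggregation step can be sketched as follows; the cheap string-based column scorer, the example tables, and any threshold applied to the averaged score are all illustrative assumptions:

```python
from itertools import product
from statistics import mean
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Cheap column-pair confidence based on string similarity."""
    return SequenceMatcher(None, a, b).ratio()

def table_confidence(cols_a, cols_b):
    """Average the column-pair confidences over all column pairs of two
    tables, the simple aggregation described above."""
    return mean(name_similarity(a, b) for a, b in product(cols_a, cols_b))

vehicle = ["automobile", "color", "years", "owner"]
transport = ["car", "color", "age", "owned_by"]
unrelated = ["latitude", "longitude", "altitude"]

# The related pair of tables scores higher than the unrelated pair; a
# threshold on this averaged score selects the final table mappings.
print(table_confidence(vehicle, transport) > table_confidence(vehicle, unrelated))  # True
```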

One disadvantage of this approach is that the computational complexity grows with the Cartesian product of the columns in every pair of data sets. For example, if there are 5 columns in each data set and there are 1000 data sets, at least 12,487,500 comparisons are needed.

There are general and domain-specific techniques to reduce the number of comparisons. For example, two tables are compared only if both have a date column. Another technique is to use computationally cheap signals, such as the edit distance of column names, to eliminate low-confidence matches early.

Hybrid Detectors and Matchers

No single technique is foolproof. It is common to use multiple detectors and matchers for columns, tables, and schemata. The confidence scores from each technique are combined, and the pairs with the best combined scores are reported.


Federated databases require automated schema detection and matching. There are no popular open-source projects that detect and match schemata, so hopefully the techniques in this newsletter help you roll your own. Start a conversation here or on Twitter if you want to discuss schema detection and matching.

Have a specific topic in mind for the next newsletter or other comments? Send a message here or on Twitter.



[2] Rahm, Erhard & Bernstein, Philip A. (2001). A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4), 334-350. doi:10.1007/s007780100057

[3] infer() in frictionlessdata/tableschema-py

[4] BigQuery Schema Generator

Auto-tune performance and cost

This post will explore techniques to optimize the usage and cost of compute and storage in databases and data lakes.

Databases and data lakes collect detritus over time. Unnecessary compute (queries and jobs) and storage (tables and files) have a cost; at scale, it is common to spend $10K per month or more on unnecessary queries and data. A key takeaway is that automated analysis on top of a data catalog is required to maintain an efficient data lake.

Compute & Memory Usage

Compute performance optimization and cost reduction are two sides of the same coin.

Every data engine has a set of best practices to run queries as efficiently as possible.

To improve performance successfully, the following questions should be answered:

  • Which are the most expensive queries?

  • Which techniques should be applied to optimize each of the queries?

  • How do I apply the technique successfully and confirm the queries are optimized?

  • How do I ensure the queries continue to be optimized?

Storage Usage Analysis

The techniques to reduce storage cost fall into the following categories:

  • Reduce the size of files with encoding or compression.

  • Implement a data life cycle to move data from expensive to cheap storage.

  • Remove redundant data sets, such as duplicates or derived data sets that can be regenerated.

For example, in this blog about reducing AWS Redshift costs, compression provided $3000 of monthly savings and a data life cycle reduced costs by $17000 per month.
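
The compression category can be illustrated with a few lines of Python; the sample payload is invented, but repetitive log-like data generally compresses very well:

```python
import gzip

# Estimate savings from compressing a file (the first category above).
# The sample payload is invented; real event logs compress similarly well.
raw = b"2020-01-01,user_42,page_view,/home\n" * 10_000
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"compressed to {ratio:.1%} of the original size")
```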

To apply these techniques the following questions have to be answered:

  • What is the storage format of all the tables and columns?

  • What is the ideal storage format for the analysis run on them?

  • Which tables are required for fast queries and which can be moved to slow & cheaper storage?

  • What is the lineage of data sets? Can they be derived again?

Calculation of data storage cost differs between shared-nothing (AWS Redshift) vs shared-storage (AWS Athena) technologies. In shared-nothing databases like AWS Redshift, more compute has to be allocated to store data since a single unit consists of disks, compute and memory. In shared-storage databases, more disk capacity has to be added. Therefore the cost of storage is higher in shared-nothing databases.

Automate Best Practices for high performance

At scale, it is hard to apply these techniques manually. The data team has to contend with 100s or 1000s of queries, multiple data engines and new workloads that shift the equilibrium all the time. Many applications automate the analysis and application of best practices to improve performance. An example of a performance tuning application is Dr. Elephant. It suggests optimizations for Apache Spark and Apache Hadoop jobs.

Anatomy of a performance or cost tuning application

Every application has two core parts and an optional third:

  1. A Data Catalog that stores metrics of queries and jobs

  2. An analysis module that uses heuristics to recommend performance optimizations

  3. An optional module to apply the optimizations.

Let us use Dr. Elephant as an example.

Dr. Elephant uses a MySQL database as its data catalog (1), where it stores metrics and metadata about Hadoop and Spark jobs.

Dr. Elephant has a heuristics module (2) that checks for the following best practices:

  • Spark Executor Load Balance: Ensures work is balanced across all executors and there is no skew.

  • Spark Job Runtime: Checks if a job should be split and executed as multiple jobs.

  • Spark Memory Limit: Determines the appropriate memory allocation for a Spark job.
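
Dr. Elephant's real heuristics live in its Java code base; the executor load-balance idea can be sketched roughly as follows, where the coefficient-of-variation measure and the 0.5 threshold are invented illustration values:

```python
from statistics import mean, pstdev

def executor_skew(cpu_seconds):
    """Coefficient of variation of per-executor CPU time; a high value
    means work is unevenly spread across executors (skew)."""
    avg = mean(cpu_seconds)
    return pstdev(cpu_seconds) / avg if avg else 0.0

SKEW_THRESHOLD = 0.5  # invented cut-off for flagging a job

balanced = [100, 110, 95, 105]   # seconds of CPU time per executor
skewed = [10, 12, 11, 400]       # one straggler executor does most work

print(executor_skew(balanced) < SKEW_THRESHOLD)   # True
print(executor_skew(skewed) >= SKEW_THRESHOLD)    # True
```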

An example of an application that optimizes storage is redshift_utils. It uses the AWS Redshift information schema as its data catalog and provides queries that generate storage optimization recommendations specific to AWS Redshift.

Open Source Performance Tuning Applications

There are performance tuning applications for most popular data engines.

Roll your own auto-tuning application

Most likely, you will have to implement your own auto-tuning application for the unique data infrastructure, data sets, and workloads at your company. The high-level steps are:

  1. List the metrics and data of queries, jobs, users, and datasets that are required.

  2. Choose a database as the data catalog.

  3. Create schema and tables to store the metrics.

  4. Scrape metrics and store in the data catalog.

  5. Write SQL queries or build dashboards to analyze & recommend optimizations.

  6. Optionally set alerts to kick off scripts to apply the optimizations.
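
Steps 2 through 5 can be sketched with SQLite as the data catalog; all table and column names, and the sample metrics, are invented for illustration:

```python
import sqlite3

# Steps 2 and 3: choose a database and create the metrics schema.
catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE query_metrics (
        query_id TEXT, user_name TEXT, bytes_read INTEGER, runtime_s REAL
    )
""")

# Step 4: in practice these rows would be scraped from engine APIs or logs.
catalog.executemany(
    "INSERT INTO query_metrics VALUES (?, ?, ?, ?)",
    [("q1", "alice", 5_000_000_000, 120.0),
     ("q2", "bob",     10_000_000,    2.0),
     ("q3", "alice", 9_000_000_000, 300.0)],
)

# Step 5: flag the most expensive queries by bytes read.
top = catalog.execute("""
    SELECT query_id, bytes_read FROM query_metrics
    ORDER BY bytes_read DESC LIMIT 2
""").fetchall()
print(top)  # [('q3', 9000000000), ('q1', 5000000000)]
```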

The blog on taming AWS Costs provides another example of using AWS Athena, S3 and CloudTrail logs to analyze costs.


Auto-tuning is required to ensure that databases and data lakes run efficiently. Auto-tuners are applications built on top of data catalogs. If you think you need an auto-tuner, check out the open-source projects or roll your own. Start a conversation here or on Twitter if you want to discuss auto-tuners for databases and data lakes.



Data Catalog

This edition is a survey of data catalogs for big data. Data catalogs manage metadata about data and are a foundation for data governance, privacy, and security.

A data catalog is the foundation of many capabilities such as data discovery, governance, and security. At its simplest, a data catalog manages metadata about all the data sets in your company. The rest of the newsletter surveys data catalogs at large web companies, open-source data catalogs, and SaaS data catalogs.

Why is a data catalog important?

Ground is a research project at UC Berkeley building an open-source data context service. The research paper introducing Ground and the review by the Morning Paper discuss the advantages of investing in a data catalog.

A data catalog solves two problems in modern data teams:

  1. Poor productivity of people and low ROI of data.

  2. Governance risk.

The metadata captured by a catalog consists of application context (scripts, schema), behavior (how the data is created and used over time), and change (how the data is changing over time).

The metadata can describe data stores, dashboards/reports, events/schemas, streams, ETL jobs, ML workflows, streaming jobs, and the company's organizational structure.


Data Catalog Implementations

Many large web companies have built data catalogs for their big data infrastructure.

At a high level, all the projects satisfy the same requirements and consist of similar building blocks. They differ in the details of metadata modeling, ingestion, serving, and indexing. The differences stem from different priorities for each of these functions as well as differences in internal data management processes. Check out the links for a deeper dive into use cases and architecture.

Apache Atlas and ODPI Egeria

Apache Atlas is a data governance and metadata management platform from the Apache community. It was designed specifically for the Hadoop ecosystem, though later versions support other data infrastructure technologies.

ODPI Egeria is an open standard that enables databases, data warehouses, and data infrastructure from different vendors to communicate with each other. The project addresses the following problems in the data catalog landscape:

  • There are multiple systems capturing metadata.

  • Each system is built for a specific technology.

  • It is impractical to build a catalog that will support all technologies.

Egeria helps different systems communicate metadata with each other and will hopefully break down the silos caused by proprietary protocols.

Data Catalog as a service

All three public clouds provide a data catalog as a service.

If you are running your data infrastructure in the cloud, these are a good default choice.

Applications built on a Data Catalog

A Data Catalog enables many applications to improve productivity and governance. A representative list of applications is:

  • Data Discovery

  • Data Dictionary

  • Data Provenance

  • Measure ROI

  • Privileged Access Management

  • Auditing and compliance


A data catalog is a necessary foundation for capabilities like data dictionary, governance and security. When you plan to add a data catalog, ensure that you follow through with implementing an application. Check out the open source data catalog projects or learn from the implementations at large web companies if you decide to build your own.

In subsequent posts, I will dig deeper into data catalog architectures as well as applications. If you found this survey useful, send a signal by liking and/or subscribing to the newsletter.


MySQL Governance

This issue will focus on governance, privacy and security of MySQL production databases.

This is the first issue of DbAdmin.News! Thanks for taking the time to read it. The newsletter will focus on technology solutions for data governance, privacy, and security.

The data technology and government regulation landscapes are changing independently and very fast, forcing data teams to come up with new solutions for data governance, privacy, and security. This newsletter is my attempt to keep up to date; hopefully you will find it useful as well.

A common governance problem is auditing and managing human access to production databases. In this blog post, Hootsuite describes a MySQL bastion that manages temporary credentials and logs activity for production MySQL databases.

Another common requirement is data masking. In this talk, ProxySQL is used to provide basic data masking with regular expressions. This approach is a good starting point to check whether your company is ready for data masking. However, at scale, a more robust and automated method built on metadata is required.

The last item is a feature request. A major issue in security operations is providing database passwords to applications and users. There are multiple approaches involving a "database of secrets" and scripts that write passwords to configuration files. However, all of them are susceptible to bugs and human error, as well as insider threats. A new approach is to not require passwords at all: AWS has introduced IAM authentication, which grants access based on the role of an application or user.

AWS IAM support was recently merged into MariaDB, allowing headless processes and developers to log into MariaDB without passwords.

Thanks for reading the first issue of DbAdmin.News. If you have any feedback or interesting articles for this newsletter, get in touch by email or by messaging @dbad1minnews on Twitter.

Register for the newsletter or follow us on Twitter @dbad1minnews.
