Auto-tune performance and cost
This post will explore techniques to optimize the usage and cost of compute and storage in databases and data lakes.
Databases and data lakes collect detritus over time. Unnecessary compute (queries and jobs) and storage (tables and files) have a cost. At scale, it is common to spend $10K or more per month on unnecessary queries and data. The key takeaway is that automated analysis on top of a data catalog is required to maintain an efficient data lake.
Compute & Memory Usage
Compute performance optimization and cost reduction are two sides of the same coin.
Every data engine has a set of best practices to run queries as efficiently as possible (a sketch of one such practice follows the list below). For example:
PostgreSQL Performance Tips
AWS Redshift Performance Tuning Techniques
Apache Spark Performance Tuning
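To make this concrete, here is a minimal sketch of one such practice: inspecting a PostgreSQL query plan with EXPLAIN ANALYZE to spot a sequential scan that an index could remove. The connection string, table, and column are hypothetical.

```python
# Sketch: print the actual execution plan of a suspect query.
# Connection string and the page_views table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=report")
with conn.cursor() as cur:
    # EXPLAIN (ANALYZE, BUFFERS) runs the query and reports real timings,
    # making sequential scans and misestimated row counts easy to spot.
    cur.execute(
        "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM page_views WHERE user_id = %s",
        (42,),
    )
    for (line,) in cur.fetchall():
        print(line)
conn.close()
```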
To improve performance successfully, the following questions should be answered:
Which are the most expensive queries? (See the sketch after this list.)
Which techniques should be applied to optimize each of the queries?
How do I apply the technique successfully and confirm the queries are optimized?
How do I ensure the queries continue to be optimized?
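Answering the first question usually starts from the engine's own statistics. A minimal sketch, assuming PostgreSQL with the pg_stat_statements extension enabled (the column names follow recent PostgreSQL releases; other engines expose similar system views):

```python
# Sketch: rank queries by total execution time using pg_stat_statements.
# Assumes the extension is installed; the connection string is a placeholder.
import psycopg2

TOP_QUERIES_SQL = """
SELECT query,
       calls,
       total_exec_time AS total_ms,
       total_exec_time / NULLIF(calls, 0) AS avg_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
"""

conn = psycopg2.connect("dbname=analytics user=report")
with conn.cursor() as cur:
    cur.execute(TOP_QUERIES_SQL)
    for query, calls, total_ms, avg_ms in cur.fetchall():
        print(f"{total_ms:>12.1f} ms total  {calls:>8} calls  {query[:80]}")
conn.close()
```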
Storage Usage Analysis
The techniques to reduce storage cost fall into the following categories:
Reduce the size of files with encoding or compression.
Implement a data life cycle to move data from expensive to cheap storage (see the sketch after this list).
Remove data sets like duplicates or derived data sets.
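A minimal sketch of the second technique, assuming an S3-backed data lake: a lifecycle rule that moves objects under a hypothetical raw/ prefix to a cheaper storage class and eventually expires them. The bucket name, prefix, and retention periods are assumptions.

```python
# Sketch: S3 lifecycle rule that transitions objects under "raw/" to
# Glacier after 90 days and deletes them after 365 days.
# Bucket name, prefix, and day counts are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```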
For example, in this blog about reducing AWS Redshift costs, compression provided $3000 of monthly savings and a data life cycle reduced costs by $17000 per month.
To apply these techniques, the following questions have to be answered:
What is the storage format of all the tables and columns? (See the sketch after this list.)
What is the ideal storage format for the analysis run on them?
Which tables are required for fast queries and which can be moved to slow & cheaper storage?
What is the lineage of data sets? Can they be derived again?
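For the first two questions, most engines expose the metadata in system views. A sketch against AWS Redshift's SVV_TABLE_INFO view (connection parameters are placeholders; the same idea applies to information_schema in other engines):

```python
# Sketch: list the largest Redshift tables and whether their columns are
# compressed, using the SVV_TABLE_INFO system view.
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

REDSHIFT_DSN = "host=my-cluster.example.com port=5439 dbname=prod user=admin"

TABLE_INFO_SQL = """
SELECT "schema", "table", size, encoded, tbl_rows, unsorted
FROM svv_table_info
ORDER BY size DESC
LIMIT 20;
"""

with psycopg2.connect(REDSHIFT_DSN) as conn, conn.cursor() as cur:
    cur.execute(TABLE_INFO_SQL)
    for schema, table, size, encoded, rows, unsorted in cur.fetchall():
        # size is reported in 1 MB blocks; unsorted is a percentage.
        print(f"{schema}.{table}: {size} MB, encoded={encoded}, "
              f"rows={rows}, unsorted={unsorted}%")
```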
The calculation of data storage cost differs between shared-nothing (AWS Redshift) and shared-storage (AWS Athena) technologies. In shared-nothing databases like AWS Redshift, more compute has to be allocated to store more data, since a single unit consists of disks, compute, and memory. In shared-storage databases, only more disk capacity has to be added. Therefore, the cost of storage is higher in shared-nothing databases.
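A back-of-the-envelope comparison makes the difference concrete. The prices and node sizes below are illustrative assumptions, not current AWS pricing; the point is only that shared-nothing storage drags compute cost along with it.

```python
# Back-of-the-envelope storage cost comparison.
# All prices and node sizes are illustrative assumptions.
import math

DATA_TB = 10

# Shared-storage (e.g. Athena over S3): pay only for object storage.
s3_usd_per_gb_month = 0.023                  # assumed S3 price
shared_storage_cost = DATA_TB * 1024 * s3_usd_per_gb_month

# Shared-nothing (e.g. Redshift): storing more data means adding whole
# nodes, and each node bundles disk, compute, and memory.
node_usd_per_hour = 0.25                     # assumed node price
node_capacity_tb = 0.16                      # assumed usable disk per node
nodes_needed = math.ceil(DATA_TB / node_capacity_tb)
shared_nothing_cost = nodes_needed * node_usd_per_hour * 24 * 30

print(f"Shared-storage:  ~${shared_storage_cost:,.0f} per month")
print(f"Shared-nothing:  ~${shared_nothing_cost:,.0f} per month ({nodes_needed} nodes)")
```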
Automate Best Practices for high performance
At scale, it is hard to apply these techniques manually. The data team has to contend with hundreds or thousands of queries, multiple data engines, and new workloads that shift the equilibrium all the time. Several applications automate the analysis and application of best practices to improve performance. An example of a performance tuning application is Dr. Elephant, which suggests optimizations for Apache Spark and Apache Hadoop jobs.
Anatomy of a performance or cost tuning application
Every such application has two core parts, plus an optional third (a skeleton follows the list):
A Data Catalog that stores metrics of queries and jobs
An analysis module that uses heuristics to recommend performance optimizations
An optional module to apply the optimizations.
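A hypothetical skeleton of that structure, showing only the module boundaries; none of this is code from the tools mentioned below.

```python
# Hypothetical skeleton of a tuning application: (1) a catalog of query
# metrics, (2) heuristics that turn metrics into recommendations, and
# (3) an optional module that applies them.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class QueryMetrics:
    query_id: str
    runtime_ms: float
    bytes_scanned: int
    user: str


@dataclass
class Recommendation:
    query_id: str
    message: str
    fix: Optional[Callable[[], None]] = None  # optional automated remediation


# (1) Data catalog: any store of metrics will do; an in-memory list here.
catalog: List[QueryMetrics] = []


# (2) Heuristics: functions from metrics to recommendations.
def long_running_query(m: QueryMetrics) -> Optional[Recommendation]:
    if m.runtime_ms > 60_000:  # arbitrary illustrative threshold
        return Recommendation(m.query_id, "Runs over a minute; review its plan or split it.")
    return None


HEURISTICS = [long_running_query]


def analyze(metrics: Iterable[QueryMetrics]) -> List[Recommendation]:
    return [r for m in metrics for h in HEURISTICS if (r := h(m)) is not None]


# (3) Optional: apply fixes automatically.
def apply_all(recommendations: Iterable[Recommendation]) -> None:
    for rec in recommendations:
        if rec.fix is not None:
            rec.fix()
```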
Let us use Dr. Elephant as an example.
Dr. Elephant uses a MySQL database as its data catalog (1), where it stores metadata and execution metrics about Hadoop and Spark jobs.
Dr. Elephant has a heuristics module (2) that checks for the following best practices:
Spark Executor Load Balance: Ensures work is balanced across all executors and there is no skew (see the sketch after this list)
Spark Job Runtime: Checks if a job should be split and executed as multiple jobs
Spark Memory Limit: Determines the appropriate memory allocation for a Spark job.
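As an illustration of what such a heuristic looks like, here is a hedged sketch of a load-balance check over per-executor task time. It is not Dr. Elephant's actual implementation, and the threshold is arbitrary.

```python
# Hypothetical skew check: flag a job when one executor does far more
# work than the average. The 2x threshold is an illustrative choice.
from statistics import mean


def executor_skew(task_time_ms_by_executor: dict[str, float], max_ratio: float = 2.0) -> bool:
    """Return True if the busiest executor did disproportionate work."""
    times = list(task_time_ms_by_executor.values())
    if len(times) < 2:
        return False
    return max(times) > max_ratio * mean(times)


# Example: executor "3" carries most of the load, so the job is flagged.
print(executor_skew({"1": 12_000, "2": 11_500, "3": 95_000, "4": 12_300}))  # True
```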
An example of an application that optimizes storage is redshift_utils. It uses the AWS Redshift information schema as the data catalog and provides queries that generate storage optimization recommendations specific to AWS Redshift.
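A hedged example of the kind of check such recommendation queries perform (not taken from redshift_utils itself): finding columns stored without compression encoding via the PG_TABLE_DEF system view.

```python
# Sketch: find uncompressed Redshift columns via PG_TABLE_DEF.
# Not a redshift_utils query; connection parameters are placeholders.
# Note: PG_TABLE_DEF only lists tables on the current search_path.
import psycopg2

RAW_COLUMNS_SQL = """
SELECT schemaname, tablename, "column", type, encoding
FROM pg_table_def
WHERE encoding IN ('none', 'raw')
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, tablename;
"""

with psycopg2.connect("host=my-cluster.example.com port=5439 dbname=prod user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(RAW_COLUMNS_SQL)
        for schema, table, column, col_type, encoding in cur.fetchall():
            print(f"{schema}.{table}.{column} ({col_type}) is stored as {encoding}")
```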
Open Source Performance Tuning Applications
There are performance tuning applications for most popular data engines. Popular open-source examples include:
AWS Redshift: AWS Redshift Advisor
Apache Hive: Dr. Elephant
AWS Athena: Wikia Athena_Alerter & Discreetly
AWS EMR: EMR Cost Calculator
Roll your own auto-tuning application
Most likely, you will have to implement your own auto-tuning application for the unique data infrastructure, data sets, and workloads at your company. The high-level steps are (a condensed sketch of steps 2 through 5 follows the list):
List the metrics and data of queries, jobs, users, and datasets that are required.
Choose a database as the data catalog.
Create schema and tables to store the metrics.
Scrape metrics and store in the data catalog.
Write SQL queries or build dashboards to analyze & recommend optimizations.
Optionally set alerts to kick off scripts to apply the optimizations.
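A condensed sketch of steps 2 through 5, using SQLite as the catalog and pg_stat_statements as the metric source. Every table, column, and connection string here is an assumption for illustration.

```python
# Hypothetical auto-tuner scaffold: scrape query metrics into a small
# catalog (SQLite) and run a recommendation query on top of it.
import sqlite3
import psycopg2

# Steps 2-3: choose a catalog and create a schema for the metrics.
catalog = sqlite3.connect("tuning_catalog.db")
catalog.execute("""
CREATE TABLE IF NOT EXISTS query_metrics (
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
    query      TEXT,
    calls      INTEGER,
    total_ms   REAL
)""")

# Step 4: scrape metrics from the engine into the catalog.
with psycopg2.connect("dbname=analytics user=report") as pg, pg.cursor() as cur:
    cur.execute("SELECT query, calls, total_exec_time FROM pg_stat_statements")
    catalog.executemany(
        "INSERT INTO query_metrics (query, calls, total_ms) VALUES (?, ?, ?)",
        cur.fetchall(),
    )
catalog.commit()

# Step 5: a simple analysis query over the catalog that surfaces
# candidates for optimization.
for query, total_ms in catalog.execute(
    "SELECT query, SUM(total_ms) FROM query_metrics GROUP BY query ORDER BY 2 DESC LIMIT 5"
):
    print(f"Candidate for tuning ({total_ms:,.0f} ms total): {query[:80]}")
```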
The blog on taming AWS Costs provides another example of using AWS Athena, S3 and CloudTrail logs to analyze costs.
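For flavour, here is a hedged sketch of the kind of Athena query that approach involves, assuming CloudTrail logs are already exposed as an Athena table named cloudtrail_logs. The database, table, and output location are assumptions.

```python
# Sketch: submit an Athena query over CloudTrail logs to see which
# services and API calls dominate. Table, database, and S3 output
# location are assumptions.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT eventsource, eventname, count(*) AS calls
FROM cloudtrail_logs
GROUP BY eventsource, eventname
ORDER BY calls DESC
LIMIT 20;
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Submitted Athena query:", response["QueryExecutionId"])
```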
Conclusion
Auto-tuning is required to ensure that databases and data lakes run efficiently. Auto-tuners are applications built on top of data catalogs. If you think you need an auto-tuner, check out the open-source projects or roll your own. Start a conversation here or on Twitter if you want to discuss auto-tuners for databases and data lakes.
Have a specific topic in mind for the next newsletter or other comments? Send a message here or on Twitter.