Self-service data lakes require robust chargeback systems to prevent runaway costs and to guide investments that improve ROI
Self-service data lakes give users and departments the freedom to run workloads that extract information from data sets. Chargebacks on usage are required to keep costs in check and to track the return on investment in the data lake.
Chargebacks have two major components:
Calculating the cost of each workload
Attributing that cost to the right user and department
Cost of Workloads
The cost can be broken down into two categories:
The cost of storing data in storage systems such as HDFS or AWS S3
The cost of running workloads on compute engines such as Spark, Presto, or Hive
Storage cost calculation is straightforward. Every storage system has a cost to store 1 GB of data. For example, AWS S3 Standard costs about $0.023 per GB per month (pricing varies by region and storage class). Similarly, the cost of HDFS or other distributed file systems can be estimated from hardware and operating costs.
Storage costs are then calculated by multiplying the data set size (in GB) by the cost per GB.
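This calculation can be sketched in a few lines of Python; the rate constant is illustrative, not a quoted price:

```python
# Storage cost: data set size (GB) multiplied by a per-GB monthly rate.
# The rate below is illustrative; real rates vary by provider, region, and tier.
S3_COST_PER_GB_MONTH = 0.023  # USD per GB per month (assumed)

def storage_cost(size_gb: float, cost_per_gb: float = S3_COST_PER_GB_MONTH) -> float:
    """Monthly cost of storing `size_gb` gigabytes at a flat per-GB rate."""
    return size_gb * cost_per_gb

# A 500 GB data set costs 500 * 0.023 = 11.5 USD per month.
```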
A workload uses four main resources in a data lake:
CPU on the compute nodes
Memory
Network I/O
Disk or object-storage I/O (bytes read)
Most query engines expose usage metrics for these resources. For example, Spark reports the following per-task execution metrics:
executorCpuTime: CPU time the executor spent running this task
inputMetrics.bytesRead: Bytes read from tables
shuffleReadMetrics.remoteBytesRead: Bytes read over the network.
peakExecutionMemory: Bytes used by all internal data structures in memory.
These metrics are reported per task. For a query or workload, summing them across all of its tasks gives the total resource usage. If a cost can be assigned to each unit of usage, the cost of running the query can then be calculated.
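The per-task aggregation can be sketched as follows. The dictionary keys mirror the Spark metric names above, but representing tasks as flat dictionaries is a simplification for illustration:

```python
# Sum per-task metrics into per-query resource usage.
# Each task is modeled as a dict keyed by Spark's metric names; the flat
# dict structure is an assumption made for this sketch.
def aggregate_query_usage(tasks: list[dict]) -> dict:
    totals = {"cpu_time_ns": 0, "bytes_read": 0,
              "shuffle_bytes": 0, "peak_memory": 0}
    for t in tasks:
        totals["cpu_time_ns"] += t["executorCpuTime"]
        totals["bytes_read"] += t["inputMetrics.bytesRead"]
        totals["shuffle_bytes"] += t["shuffleReadMetrics.remoteBytesRead"]
        # Peak memory is a per-task high-water mark; summing the peaks
        # gives a conservative upper bound for the whole query.
        totals["peak_memory"] += t["peakExecutionMemory"]
    return totals
```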
However, pricing each resource separately is onerous and error-prone. A simpler method is to assume that CPU, memory, and network usage are roughly proportional to the bytes read from tables in HDFS or cloud storage, and to charge based on the bytes read by the query or workload.
For example, AWS Athena charges for query execution based on the number of bytes scanned by the query. Even though this measure is not perfect and can be abused, in practice it provides a close approximation of workload costs with far less effort than building a system that meters every resource exactly.
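Bytes-scanned pricing is a one-liner. The $5-per-TB rate below matches Athena's published pricing at the time of writing, but treat it as an assumption rather than a quote:

```python
TB = 10**12  # bytes per terabyte (decimal)

def query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Cost of a query charged purely on bytes scanned (Athena-style).

    The $5/TB default is an assumed rate; real pricing may also apply
    per-query minimums and rounding rules.
    """
    return bytes_scanned / TB * usd_per_tb

# Scanning 200 GB costs 0.2 TB * $5/TB = $1.00
```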
Attributing Costs
Once workload costs can be calculated, the next step is to attribute them to the correct user and aggregate them by department. A data catalog is required to capture metadata about users, departments, data sets, and workloads, as well as the relationships between them.
The data catalog can be stored in a database. SQL queries can then calculate storage and workload costs and attribute them to the right users and departments.
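A toy version of this flow can be sketched with SQLite: a catalog records which user ran which workload and which department the user belongs to, and one join aggregates costs by department. Table names, columns, and the $5/TB rate are all hypothetical:

```python
# A minimal data-catalog sketch: attribute bytes-scanned costs to departments.
# Schema, sample data, and the $5/TB rate are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users(name TEXT PRIMARY KEY, department TEXT);
CREATE TABLE workloads(id INTEGER PRIMARY KEY,
                       user_name TEXT REFERENCES users(name),
                       bytes_scanned INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "marketing"), ("bob", "finance")])
conn.executemany("INSERT INTO workloads VALUES (?, ?, ?)",
                 [(1, "alice", 2 * 10**12),   # 2 TB scanned
                  (2, "bob",   1 * 10**12),   # 1 TB scanned
                  (3, "alice", 1 * 10**12)])  # 1 TB scanned

# Charge $5 per TB scanned and roll the costs up to each department.
rows = conn.execute("""
    SELECT u.department, SUM(w.bytes_scanned) / 1e12 * 5.0 AS cost_usd
    FROM workloads w JOIN users u ON w.user_name = u.name
    GROUP BY u.department ORDER BY u.department
""").fetchall()
# rows -> [('finance', 5.0), ('marketing', 15.0)]
```

The same join, run against a real catalog, is the core of a chargeback report.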
Chargebacks are important for controlling costs and improving the ROI of a data lake. Even though most mature data teams have chargeback systems, there are no common best practices or open-source projects that can be adopted instead of building one in-house. One reason is that chargeback policies differ from company to company.
Does your company have a chargeback system? Or are you considering building one? Please comment or get in touch if you want to discuss how to build a chargeback system.