This post provides high-level, practical tips for approaching data quality.
Data quality management is necessary for data-driven decision making. However, many data quality initiatives fall short of expectations. This post describes one framework that can help plan and execute a successful data quality project.
What is data quality?
The first question is what high data quality means. There are two common definitions:
1. Data quality is high if data is fit for the intended purpose.
2. Data quality is high if the data completely reflects the real-world entity.
By definition 1, the data quality is high if it is possible to send an invoice to a customer, even if the data is missing the customer’s phone number.
By definition 2, the data quality is high if the data correctly captures all attributes of the customer. In the previous example, the data quality is low if the data contains the correct address but is missing the customer’s phone number.
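The difference between the two definitions can be made concrete with a small check. The sketch below is illustrative only: the `Customer` fields, the `INVOICE_FIELDS` tuple, and both function names are assumptions made for this example, not part of any tool. It also simplifies definition 2 to "every attribute is present", since verifying that values are correct would need a trusted reference to compare against.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Customer:
    # Hypothetical customer record; real master data has many more attributes.
    name: Optional[str]
    billing_address: Optional[str]
    phone: Optional[str]


# Assumption: these are the only attributes needed to send an invoice.
INVOICE_FIELDS = ("name", "billing_address")


def _present(value: Optional[str]) -> bool:
    """A value counts as present if it is not None and not blank."""
    return value is not None and value.strip() != ""


def fit_for_invoicing(customer: Customer) -> bool:
    """Definition 1: the data is fit for one intended purpose."""
    return all(_present(getattr(customer, name)) for name in INVOICE_FIELDS)


def is_complete(customer: Customer) -> bool:
    """Definition 2 (simplified): every attribute is populated."""
    return all(_present(getattr(customer, f.name)) for f in fields(customer))


customer = Customer(name="Ada Example", billing_address="1 Main St", phone=None)
print(fit_for_invoicing(customer))  # True: an invoice can be sent
print(is_complete(customer))        # False: the phone number is missing
```

The same record passes one definition and fails the other, which is the tension the rest of this post tries to manage.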
These two definitions can contradict each other. It is reasonable to expect master data to support multiple business decisions, which argues for definition 2. However, it may be impractical or costly to ensure all-round data quality, which argues for definition 1. Any data quality process has to balance these opposing forces.
A Data Quality Flow
A data quality flow that focuses on business outcomes has the advantage of demonstrating value, which improves the chances of continued funding. Such a flow is described below.
The process has four steps:
1. Define a business issue, such as the inability to send invoices correctly.
2. Triage the issue with product and engineering teams. Set expectations for the outcome, and agree on thresholds and KPIs for data quality. For example, a threshold may be that at most 1% of attempts to send invoices fail.
3. Product and engineering teams fix the issue and deploy the fix.
4. Verify that data quality has improved, based on the previously agreed thresholds and KPIs.
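The verification step can be as simple as comparing a measured KPI against the agreed threshold. The following is a minimal sketch, assuming the KPI is the invoice failure rate and the threshold is the 1% from the example above. The function names and the attempt and failure counts are made up for illustration.

```python
# Assumption: the threshold agreed during triage, at most 1% failed attempts.
MAX_FAILURE_RATE = 0.01


def failure_rate(attempts: int, failures: int) -> float:
    """KPI: share of invoice attempts that failed."""
    if attempts < 0 or failures < 0 or failures > attempts:
        raise ValueError("expected 0 <= failures <= attempts")
    if attempts == 0:
        return 0.0  # nothing was attempted, so nothing failed
    return failures / attempts


def meets_threshold(attempts: int, failures: int,
                    max_rate: float = MAX_FAILURE_RATE) -> bool:
    """True when the measured failure rate is within the agreed threshold."""
    return failure_rate(attempts, failures) <= max_rate


# Illustrative numbers only: before and after the fix is deployed.
print(meets_threshold(attempts=2000, failures=90))  # False (4.5% failed)
print(meets_threshold(attempts=2000, failures=14))  # True (0.7% failed)
```

Keeping the threshold as a named constant makes the agreement from the triage step visible in the code that verifies it.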
Data Quality Tools
There are a number of commercial and open source tools, such as Great Expectations, that help data engineering teams with data quality management. Two main techniques are:
Data profiling, with alerts on changes in the profile.
Tests for data pipelines.
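Both techniques can be sketched in a few lines of plain Python. This is not the API of Great Expectations or any other tool. The null-rate profile, the `tolerance` value, the column names, and the sample rows are all assumptions chosen to keep the example small.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_rates(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Profile: fraction of missing (None or empty) values per column."""
    if not rows:
        return {c: 0.0 for c in columns}
    return {
        c: sum(1 for r in rows if r.get(c) in (None, "")) / len(rows)
        for c in columns
    }


def profile_alerts(baseline: Dict[str, float], current: Dict[str, float],
                   tolerance: float = 0.05) -> List[str]:
    """Alert on columns whose null rate moved by more than `tolerance`."""
    return sorted(
        c for c in baseline
        if abs(current.get(c, 0.0) - baseline[c]) > tolerance
    )


def check_invoice_rows(rows: List[Row]) -> None:
    """Pipeline test: every row must carry a billing address."""
    bad = [i for i, r in enumerate(rows) if not r.get("billing_address")]
    if bad:
        raise AssertionError(f"rows missing billing_address: {bad}")


COLUMNS = ["billing_address", "phone"]
yesterday = [
    {"billing_address": "1 Main St", "phone": "555-0100"},
    {"billing_address": "2 Oak Ave", "phone": None},
    {"billing_address": "3 Elm Rd", "phone": "555-0102"},
    {"billing_address": "4 Pine Ct", "phone": "555-0103"},
]
today = [
    {"billing_address": "1 Main St", "phone": "555-0100"},
    {"billing_address": "", "phone": None},
    {"billing_address": None, "phone": "555-0102"},
    {"billing_address": "4 Pine Ct", "phone": "555-0103"},
]

alerts = profile_alerts(null_rates(yesterday, COLUMNS), null_rates(today, COLUMNS))
print(alerts)  # ['billing_address']
```

The profile comparison flags a change without saying whether it matters, while `check_invoice_rows` encodes a specific requirement and fails the pipeline run when it is violated.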
A typical problem is that these techniques are used to monitor the data quality of all attributes. This can lead to alert fatigue and no perceptible improvement in business outcomes. Another issue is that data profiles and requirements change. Because maintaining data quality rules is manually intensive, the rules are often not kept up to date.
Tying data quality rules and tool usage to specific business outcomes, as described in the previous section, keeps monitoring focused on the attributes that matter and limits the number of rules that have to be maintained.
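One way to make that link explicit is to record, for every rule, the business outcome it protects. The sketch below is a hypothetical structure, not a feature of any particular tool. The `Rule` class, the rule names, and the outcome label are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]   # returns True when the row passes
    outcome: Optional[str] = None   # business outcome the rule protects


# Hypothetical rule set; the last rule is not tied to any outcome.
RULES = [
    Rule("name_present", lambda r: bool(r.get("name")), "send invoices"),
    Rule("billing_address_present",
         lambda r: bool(r.get("billing_address")), "send invoices"),
    Rule("fax_present", lambda r: bool(r.get("fax"))),
]


def failing_rules(row: dict, outcome: str) -> List[str]:
    """Run only the rules tied to the given business outcome."""
    return [r.name for r in RULES if r.outcome == outcome and not r.check(row)]


def orphan_rules() -> List[str]:
    """Rules with no business outcome: candidates for review or removal."""
    return [r.name for r in RULES if r.outcome is None]


row = {"name": "Ada Example", "billing_address": "", "fax": None}
print(failing_rules(row, "send invoices"))  # ['billing_address_present']
print(orphan_rules())                       # ['fax_present']
```

A rule that cannot name the outcome it protects is a likely source of alert noise, so listing such rules gives a starting point for pruning.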