Sunday, January 16, 2011

Data Quality as a set of consistent measures

Introduction
Managing Data Quality (DQ) of the information system is a complex and resource demanding task. Though it is often clear why cleaner database will bring more profit, it is not easy to take required steps to get desired result. I have been working on DQ initiative for some time now and though we are nowhere near the final goal, we were doing a good progress. Here I decided to share with you how I understand basic principles of DQ Management. If you think that data quality is the problem of your organisation then you can start from here. I am not getting into the project management aspects of DQ initiative here, as this article is just about the main DQ concepts.
Data Quality is a broad discipline that includes many tasks and preventive measures:

  • Standardisation
  • Profiling
  • Data Quality Assessment (DQA)
  • Matching and Merging (M&M)
  • Correction
  • Monitoring
Though all this measures are well known each organisation has to develop a consistent approach to implement these measures. The good way to start is to draw a broad DQ Strategy document that will explain how DQ will be measured, improved and monitored.


Data Rules or Business Rules
These are rules that define or restrict your business processes. Any data quality project should start from formal definition of these rules.

So the first thing you should do is to go to your users and ask them:

  1. What business processes heavily depend on the database?
  2. What data users need for their business processes?
  3. What data will disturb or break business processes?
After you got answers to these questions you can:

  1. Write a standard for data entry into the system;
  2. Create a catalogue of rules that your data must conform to.
Catalogue of business processes and business rules is the most important part of Data Quality Management Initiative as it provides you with a clear view on what should be done.



Data Standards
While mainly working with CRM system I have created a standard for Customer Data which includes thorough definitions for all data for contact, address and telephone details:

Contacts

  • Title
  • First Name
  • Second Name
  • Sex / Gender
  • Date of birth
Addresses (depends on the country)

  • House Name
  • Address Lines (depend on countries if you have multi-national database
  • Postcode/Zip Code
  • Town
  • Country
Telephone

  • Country Code
  • Region Code
  • Number
Each data item should have a description for business users who key customer details plus technical definition i.e. regular expression.

It is worth mentioning that most contemporary CRM systems will have such standards imbedded in their functionality. However it can be the case that these rules are still ignored by users or data loads.

Data Quality Metadata

When your business rules are defined through the formal requirements gathering exercise it is time to:

  1. Assess the quality of data in your system;
  2. Correct data errors where possible to get improve results that you got through assessment stage;
  3. Set up monitoring activity that will be base on regular DQA reports.
To get all this right you will require a meta-database that will store all data rules and also statistics for DQA;

The main feature of such database is that it will generate SQL for all required steps or store pre-written SQL as part of metadata.

Let's have a look at main tables that can be found in any DQ repository.

Scores or Processes

List of business processes that were chosen for DQ

Rules

List of rules that were formulated in requirements capture stage for DQ

This table can be extended by tables that store details for rules of specific types:

  • Not Null value
  • Regular expression
  • Relational integrity – searches orphan records
  • Custom rule – stores custom SQL that returns a set of erroneous records
  • Field comparison – checks that two fields have consistent values i.e. [Start Date]>=[End Date]
  • Any other type of rule if it frequently appears in the list of business rules.
The purpose of these extensions is to make it possible for the DQ solution to generate SQL basing on minimum of information in the metadata database. So if you want to set a Not Null constraint on one of the fields you just need to specify the attribute for that rule and system will generate SQL that picks erroneous records:


  • SELECT * FROM MyTable WHERE MyField IS NULL;

All you have to store in the extended table is the attribute ID that is linked to Attributes table.

Tables or Entities

Contains the list of all entities from the database

Fields or Attributes

Contains the list of attributes in entities from the database

Data Quality Reports

There are several reports that can be very useful for monitoring of DQ:

  • Process Level – Shows how many times business rule is violated in the database.
  • Entity Level – Shows how many records in table contain errors or missing.
  • Sub Entity Level – Shows how many records in table contain records within a specified group of records (i.e. Country in Addresses table).
  • Errors report – Contains details of erroneous records that violate business rules.
Reporting is essential for DQ management as it allows prioritising data corrections activities basing on number of errors and importance of business rule. It then allows to monitor the progress of DQ initiative and provides easy means to support initiative continuation or cancellation.