Managing Data Quality (DQ) of the information system is a complex and resource demanding task. Though it is often clear why cleaner database will bring more profit, it is not easy to take required steps to get desired result. I have been working on DQ initiative for some time now and though we are nowhere near the final goal, we were doing a good progress. Here I decided to share with you how I understand basic principles of DQ Management. If you think that data quality is the problem of your organisation then you can start from here. I am not getting into the project management aspects of DQ initiative here, as this article is just about the main DQ concepts.
Data Quality is a broad discipline that includes many tasks and preventive measures:
- Standardisation
- Profiling
- Data Quality Assessment (DQA)
- Matching and Merging (M&M)
- Correction
- Monitoring
Data Rules or Business Rules
These are rules that define or restrict your business processes. Any data quality project should start from formal definition of these rules.
So the first thing you should do is to go to your users and ask them:
- What business processes heavily depend on the database?
- What data users need for their business processes?
- What data will disturb or break business processes?
- Write a standard for data entry into the system;
- Create a catalogue of rules that your data must conform to.
Data Standards
While mainly working with CRM system I have created a standard for Customer Data which includes thorough definitions for all data for contact, address and telephone details:
Contacts
- Title
- First Name
- Second Name
- Sex / Gender
- Date of birth
- House Name
- Address Lines (depend on countries if you have multi-national database
- Postcode/Zip Code
- Town
- Country
- Country Code
- Region Code
- Number
It is worth mentioning that most contemporary CRM systems will have such standards imbedded in their functionality. However it can be the case that these rules are still ignored by users or data loads.
Data Quality Metadata
When your business rules are defined through the formal requirements gathering exercise it is time to:- Assess the quality of data in your system;
- Correct data errors where possible to get improve results that you got through assessment stage;
- Set up monitoring activity that will be base on regular DQA reports.
The main feature of such database is that it will generate SQL for all required steps or store pre-written SQL as part of metadata.
Let's have a look at main tables that can be found in any DQ repository.
Scores or Processes
List of business processes that were chosen for DQRules
List of rules that were formulated in requirements capture stage for DQThis table can be extended by tables that store details for rules of specific types:
- Not Null value
- Regular expression
- Relational integrity – searches orphan records
- Custom rule – stores custom SQL that returns a set of erroneous records
- Field comparison – checks that two fields have consistent values i.e. [Start Date]>=[End Date]
- Any other type of rule if it frequently appears in the list of business rules.
- SELECT * FROM MyTable WHERE MyField IS NULL;
All you have to store in the extended table is the attribute ID that is linked to Attributes table.
Tables or Entities
Contains the list of all entities from the databaseFields or Attributes
Contains the list of attributes in entities from the database Data Quality Reports
There are several reports that can be very useful for monitoring of DQ:- Process Level – Shows how many times business rule is violated in the database.
- Entity Level – Shows how many records in table contain errors or missing.
- Sub Entity Level – Shows how many records in table contain records within a specified group of records (i.e. Country in Addresses table).
- Errors report – Contains details of erroneous records that violate business rules.