ETIS09 - Data Quality - Common Problems & Checks - Presentation
-
Upload
davidmwalker -
Category
Documents
-
view
918 -
download
3
Transcript of ETIS09 - Data Quality - Common Problems & Checks - Presentation
![Page 1: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/1.jpg)
Data Quality: Common Problems & Checks
David M. Walker [email protected]
+44 (0) 7050 028 911 - http://www.datamgmt.com
Data Management & Warehousing
Date: 24 April 2009 Location: Zagreb, Croatia
![Page 2: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/2.jpg)
24 April 2009 © 2009 Data Management & Warehousing Page 2
Agenda
• Introduction • Common Problems • Automated Checking • Profiling Checks • Conclusions
![Page 3: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/3.jpg)
Introduction
• Data Quality problems are a SOURCE SYSTEM issue and an ETL issue – They just manifest themselves in the data warehouse
• Prevention is better than cure – Fixing the source system or the ETL is ALWAYS
cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself
• Data Quality is a continuous process – It is never finished and always needs to be monitored
24 April 2009 © 2009 Data Management & Warehousing Page 3
![Page 4: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/4.jpg)
The Impact of Poor Data Quality
• Devalues the data warehouse – Discourages people from trusting or using the
system and therefore curtailing the life of the data warehouse
• Highlights failings in the source system and/or the business process – Businesses would rather fix at any cost in the
data warehouse and pretend that there isn’t a source system problem
24 April 2009 © 2009 Data Management & Warehousing Page 4
![Page 5: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/5.jpg)
Common Problems
• 11 types of problem that account for the most common problems
• They usually reflect poor design and/or implementation of systems
• Most can be fixed or monitored and managed to limit the impact
24 April 2009 © 2009 Data Management & Warehousing Page 5
![Page 6: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/6.jpg)
Referential Issues
• Keys that are not unique – Systems that do not enforce the unique
primary key or flat files or spreadsheets – Also generated by ETL that creates a
surrogate key incorrectly • Referential Integrity Failures
– Where referentially integrity is not enforced values in the child table are created that are not in the parent table
24 April 2009 © 2009 Data Management & Warehousing Page 6
![Page 7: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/7.jpg)
Data Type Issues
• Format Errors – Typically in Date/Time type fields – 02/04/2009 2nd April (UK) or 4th Feb (US)
• Inappropriate Data Types – Storing Dates in Character Strings
20090624 as YYYYMMDD format string – But what about 20090230?
24 April 2009 © 2009 Data Management & Warehousing Page 7
![Page 8: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/8.jpg)
Data Model Issues
• De-normalised tables – Commonly created for performance reasons – Inherently duplicates data – Often gets out of sync
• Data/Column Retirement – Upgrade to system retires a column – ETL continues to use the old column
• Poor Table/Column Naming – Don’t assume that a column does what it says – Don’t assume that a column is still being used for it’s original
purpose
24 April 2009 © 2009 Data Management & Warehousing Page 8
![Page 9: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/9.jpg)
Data Content Issues
• Null Values – Systems that have many optional fields will
often have missing values – Null values allow rows to be silently omitted
from queries • Inappropriate Values
– Databases allow special characters and/or leading/trailing white space
– “DataspaceQuality” != “DatatabQuality” 24 April 2009 © 2009 Data Management & Warehousing Page 9
![Page 10: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/10.jpg)
Data Feed Issues
• Missing Data – Where a stream of files are loaded by ETL if
one is dropped it can go un-noticed – Common with CDR type loads in Telcos
• Late Data – A short term data quality issue – Leaves users believing there is a problem – Produces inconsistent reporting over time
24 April 2009 © 2009 Data Management & Warehousing Page 10
![Page 11: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/11.jpg)
Automated Checking
• Regularly run checks • Broad Coverage across systems
– 100s and 1000s not 10s of queries – Run in a low priority loop in the background
• Used against: – Sources – Data Staging – Data Warehouse
• No Product Required – We often implement this as a controlling shell script and lots of
small scripts, one for each check
24 April 2009 © 2009 Data Management & Warehousing Page 11
![Page 12: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/12.jpg)
Trending
• Absolute Trending – Track an expected value over time
• e.g. Returned Mail is usually less than 500 items per day • If the value is <500 status is green, 501 to 1000 amber and >1000
red
• Statistical Process Control (SPC) Trending – Track an expected value where the value changes over time
• e.g. Telco CDRs – expect more as the company grows • Don’t want to be continuously changing the threshold • Compare current load to historical means • If current load within 2 Standard Deviations – Green,
3 Standard Deviations – Amber, 4 or more Standard Deviations - Red
24 April 2009 © 2009 Data Management & Warehousing Page 12
![Page 13: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/13.jpg)
Flow Control
• Flow Control – ETL manipulates data
• Joins, De-duplicates, Filters, Aggregates, etc
– Use the formula: Source Count - Filtered Count – DeDup Count – Target Count = 0
• Trusted Source – Compare the result with a third system – e.g. Does the Count of Switch CDRs =
Count of those processed by the billing system Count of those processed in the DWH
24 April 2009 © 2009 Data Management & Warehousing Page 13
![Page 14: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/14.jpg)
Business Rules Based
• Specific rules to match known business rules – Account holders > 18 (Sys Date – DoB) – Account holders < 115 – Credit Card numbers are 16 digits long – Number of accounts without a status
• Result should yield Zero
24 April 2009 © 2009 Data Management & Warehousing Page 14
![Page 15: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/15.jpg)
Automated Checking - Ops
• Managed by exception – Red given priority – Amber are always followed up
• Massive number of checks – 100’s are good – 1000’s are better
• Presentation – Alerts, RAG, Graphical, Numerical, etc.
24 April 2009 © 2009 Data Management & Warehousing Page 15
![Page 16: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/16.jpg)
Data Profiling Checks
• Run manually because they need to be interpreted by a human
• Leads to new business rules being added to the automated checks
• Can be done with simple reporting tools or commercial data profiling tools
24 April 2009 © 2009 Data Management & Warehousing Page 16
![Page 17: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/17.jpg)
Frequency Outliers
• Count discreet values in a table and check items with many more or less than normal – e.g. DoB 01-01-01 many times more common
than any other value indicates source default and something that needs work
– e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem
24 April 2009 © 2009 Data Management & Warehousing Page 17
![Page 18: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/18.jpg)
Maximum & Minimum
• Determine what a valid range for any value should be – e.g. age between 18 and 115 – Immediately finds individual data quality
issues that can be resolved – Allows an analyst to create new business
rules to prevent future problems
24 April 2009 © 2009 Data Management & Warehousing Page 18
![Page 19: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/19.jpg)
Sequential Keys
• If a system has a sequential key:
Max Value – Min Value – Count = 0
• If this is true – is it too perfect for an operational system and therefore test data
• If this is false – what has caused the gaps, are the deletions intentional?
24 April 2009 © 2009 Data Management & Warehousing Page 19
![Page 20: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/20.jpg)
Data Types
• Validation of mis-used data types before loading – e.g. Dates in Character fields – Format: YYYYMMDD – Check MM between 01 and 12 – Check DD between 01 and 31 – Check MMDD does not include 0230, 0231 – etc.
24 April 2009 © 2009 Data Management & Warehousing Page 20
![Page 21: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/21.jpg)
Skewed Pattern Profiling
• Looking for specific patterns in data – e.g. UK National Insurance Numbers (?) have
the format AA 99 99 99 A – Pattern match all values looking for
exceptions • Number Lengths are a special case
– e.g Credit Card Numbers are 16 digits long
24 April 2009 © 2009 Data Management & Warehousing Page 21
![Page 22: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/22.jpg)
Content Checking
• Content Checking is the manual review of character strings
• Needs a good understanding of the nature of the data
• Often determines the need to do analysis of other types
24 April 2009 © 2009 Data Management & Warehousing Page 22
![Page 23: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/23.jpg)
Nulls & White Space
• Nulls – Fields that have large proportion of nulls are usually
not useful – Also common is default status of null
(e.g. Account us either closed or null) • White Space
– Not Null fields with a single space – Tab instead of space – Leading/Trailing white space – Double White Space: “DavidSpaceSpaceWalker”
24 April 2009 © 2009 Data Management & Warehousing Page 23
![Page 24: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/24.jpg)
Punctuation & Control Chars
• Punctuation – CSV files that are not properly quoted perform
field shifts – Address lines with extra commas
• Control Characters – Data fields that contain ASCII character codes
0 to 31 and 127 to 159 are often ‘invisible’ when viewed in queries but cause failures
– Also be aware of ‘code-page’ specifics 24 April 2009 © 2009 Data Management & Warehousing Page 24
![Page 25: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/25.jpg)
Problem Management Matrix
24 April 2009 © 2009 Data Management & Warehousing Page 25
![Page 26: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/26.jpg)
Continuous DQ Process
24 April 2009 © 2009 Data Management & Warehousing Page 26
![Page 27: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/27.jpg)
Quality is FREE …
… as long as you are prepared to INVEST HEAVILY in it
Philip Crosby 1980
Especially true of Data Quality
24 April 2009 © 2009 Data Management & Warehousing Page 27
![Page 28: ETIS09 - Data Quality - Common Problems & Checks - Presentation](https://reader034.fdocuments.net/reader034/viewer/2022051609/5466a8f5b4af9fbb068b469b/html5/thumbnails/28.jpg)
Data Quality: Common Problems & Checks
David M. Walker [email protected]
+44 (0) 7050 028 911 - http://www.datamgmt.com
Data Management & Warehousing
Thank You