Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality...

Date posted: 20-Dec-2015

Page 1: Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality Rules.

Data Quality

Class 5

Goals

• Project

• Data Quality Rules (Continued)

• Example

• Use of Data Quality Rules

Data Quality Rules Classes

• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain Mappings
• 5) Relation rules
• 6) Table, Cross-table, and Cross-message assertions
• 7) In-Process directives
• 8) Operational Directives
• 9) Other rules

Representing Data Quality Rules

• Data is divided into 2 sets:
  – conformers
  – violators

• Sets can be represented using SQL

• Create SQL statements representing violating set

Using SQL

• Direct queries

• Embedded queries
  – Using ODBC/JDBC, can create validation scripts in
    • C
    • C++
    • Java
    • Visual Basic
    • Etc.

Null Value Representations

• Maintain a table of null representation types and names:

    create table nullreps (
        name varchar(30),
        nulltype char(1),
        description varchar(1024),
        source varchar(512),
        nullval varchar(100),
        nullrepid integer
    );

Null Value Rules

• Allows nulls
  – If the rule is “allows nulls” without any additional characterization
    • Nothing necessary
  – If the rule is “allows nulls,” but only of a specific type
    • Must check for real nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;
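A minimal sketch of the null test as an embedded query, using Python's built-in sqlite3 module in place of an ODBC/JDBC connection. The customers table, its phone column, and the sample rows are hypothetical; the TRIM test is one possible way to cover the "blanks and spaces" case.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the null test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, phone TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "555-0100"), (2, None), (3, ""), (4, "   ")],
)

# Violating set: real nulls only.
real_nulls = conn.execute(
    "SELECT id FROM customers WHERE phone IS NULL ORDER BY id"
).fetchall()

# Violating set: real nulls plus values that are empty or only spaces.
nulls_and_blanks = conn.execute(
    "SELECT id FROM customers "
    "WHERE phone IS NULL OR TRIM(phone) = '' ORDER BY id"
).fetchall()

print(real_nulls, nulls_and_blanks)
```

Which of the two sets counts as the violating set depends on how the rule characterizes nulls.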

Null Value Rules

• Does not allow nulls
  – Must check for nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;

Value Rules

• Value rule is specified as some set of constraints

• Makes use of operators and functions:
  – +, -, *, /, <, <=, >, >=, !=, ==, AND, OR
  – User defined functions

• Example:
  – value >= 0 AND value <= 100

Value Rules 2

• Validation test is opposite of constraint

• Use DeMorgan’s laws
  – If the constraint was “value >= 0 AND value <= 100”, use:

    SELECT * from <table> where <table>.<attribute> < 0 OR <table>.<attribute> > 100;
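As a quick check of the De Morgan rewrite, the sketch below runs both the directly negated constraint and the rewritten form and compares the results. The scores table and its rows are hypothetical, and sqlite3 stands in for whatever database connection is actually used.

```python
import sqlite3

# Hypothetical table and data for the constraint "value >= 0 AND value <= 100".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 50), (2, -5), (3, 101), (4, 0), (5, 100)],
)

# Violators written as the literal negation of the constraint.
negated = conn.execute(
    "SELECT id FROM scores WHERE NOT (value >= 0 AND value <= 100) ORDER BY id"
).fetchall()

# Violators after applying De Morgan's laws to the negation.
demorgan = conn.execute(
    "SELECT id FROM scores WHERE value < 0 OR value > 100 ORDER BY id"
).fetchall()

print(negated == demorgan, demorgan)
```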

Domain Membership

• Domains are stored in a database table

• Test for domain membership of an attribute is a test to make sure that all values are represented in domain table

Domain Reference Tables

    create table domainref (
        name varchar(30),
        dtype char(1),
        description varchar(1024),
        source varchar(512),
        domainid integer
    );

Domain Reference Tables

    create table domainvals (
        domainid integer,
        value varchar(128)
    );

Domain Membership

• Test for membership of attribute foo in the domain named bar:

    SELECT * from <table> where foo not in
      (SELECT value from domainvals where domainid =
        (SELECT domainid from domainref
          where domainref.name = 'bar'));
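A minimal sketch of the membership test against the domainref and domainvals tables defined earlier. The items table, the domain contents, and the sample rows are hypothetical, and sqlite3 is only a stand-in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Domain reference tables, following the definitions in the slides.
conn.execute(
    "CREATE TABLE domainref (name VARCHAR(30), dtype CHAR(1), "
    "description VARCHAR(1024), source VARCHAR(512), domainid INTEGER)"
)
conn.execute("CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128))")
conn.execute("INSERT INTO domainref (name, domainid) VALUES ('bar', 1)")
conn.executemany("INSERT INTO domainvals VALUES (1, ?)", [("red",), ("green",)])

# Hypothetical data table whose attribute foo should draw from domain 'bar'.
conn.execute("CREATE TABLE items (id INTEGER, foo TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(1, "red"), (2, "blue"), (3, "green")]
)

# Violators: rows whose foo value is not listed in the domain.
domain_violators = conn.execute(
    "SELECT id, foo FROM items WHERE foo NOT IN "
    "(SELECT value FROM domainvals WHERE domainid = "
    "(SELECT domainid FROM domainref WHERE name = 'bar')) ORDER BY id"
).fetchall()
print(domain_violators)
```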

Domain Assignment

• The values in the attribute define the domain:
  – Find all the values not in the domain already
  – Update domain tables with those values

Domain Assignment 2

• SELECT * from <table> where foo not in
    (SELECT value from domainvals where domainid =
      (SELECT domainid from domainref
        where domainref.name = 'bar'));

• For each value in this set, create a record (the value, the domain id for “bar”) and insert it into domainvals.
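One way to carry out the assignment step, sketched with sqlite3: select the distinct values missing from the domain, then insert one domainvals record per value. The items table and its rows are hypothetical, and domainref is reduced to the two columns the example needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domainref (name VARCHAR(30), domainid INTEGER)")
conn.execute("CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128))")
conn.execute("INSERT INTO domainref VALUES ('bar', 1)")
conn.execute("INSERT INTO domainvals VALUES (1, 'red')")

# Hypothetical data table; its foo values define the domain 'bar'.
conn.execute("CREATE TABLE items (id INTEGER, foo TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "red"), (2, "blue"), (3, "blue"), (4, "green")],
)

domain_id = conn.execute(
    "SELECT domainid FROM domainref WHERE name = 'bar'"
).fetchone()[0]

# Step 1: find all values not in the domain already.
missing = conn.execute(
    "SELECT DISTINCT foo FROM items WHERE foo NOT IN "
    "(SELECT value FROM domainvals WHERE domainid = ?) ORDER BY foo",
    (domain_id,),
).fetchall()

# Step 2: insert a (domain id, value) record for each missing value.
conn.executemany(
    "INSERT INTO domainvals (domainid, value) VALUES (?, ?)",
    [(domain_id, value) for (value,) in missing],
)

domain_values = [
    value
    for (value,) in conn.execute(
        "SELECT value FROM domainvals WHERE domainid = ? ORDER BY value",
        (domain_id,),
    )
]
print(domain_values)
```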

Mapping Membership

• Similar to domain membership, except:
  – Must include domain membership tests for both values
  – The pair of values must also be looked up in the mapping tables
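The slides do not give a schema for the mapping tables, so the sketch below assumes a hypothetical mappingvals table holding (source value, target value) pairs per mapping id. It shows only the pair lookup; the two domain membership tests would be run separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mapping table: one row per allowed (source, target) pair.
conn.execute(
    "CREATE TABLE mappingvals "
    "(mappingid INTEGER, source_value TEXT, target_value TEXT)"
)
conn.executemany(
    "INSERT INTO mappingvals VALUES (1, ?, ?)",
    [("NY", "New York"), ("CA", "California")],
)

# Hypothetical data table whose two columns should follow mapping 1.
conn.execute(
    "CREATE TABLE addresses (id INTEGER, state_code TEXT, state_name TEXT)"
)
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?)",
    [(1, "NY", "New York"), (2, "CA", "New York"), (3, "TX", "Texas")],
)

# Violators: rows whose (code, name) pair does not appear in the mapping.
mapping_violators = conn.execute(
    "SELECT a.id FROM addresses AS a WHERE NOT EXISTS "
    "(SELECT 1 FROM mappingvals AS m WHERE m.mappingid = 1 "
    "AND m.source_value = a.state_code AND m.target_value = a.state_name) "
    "ORDER BY a.id"
).fetchall()
print(mapping_violators)
```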

Completeness

• Defines when a record is complete
  – Ex: IF (Orders.Total > 0.0), Complete With {Orders.Billing_Street, Orders.Billing_City, Orders.Billing_State, Orders.Billing_ZIP}

• Format:
  – Condition
  – List of fields that must be complete

Completeness 2

• Equivalent to a set of null tests applied to the records that satisfy the condition

• Select * from <table> where <condition> and (<field 1> is NULL or <field 2> is NULL or ...);
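A minimal sketch of the completeness rule from the Orders example, read as a violator query: the condition holds and at least one listed field is null. The table layout and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Orders table with the fields named in the rule.
conn.execute(
    "CREATE TABLE Orders (id INTEGER, Total REAL, Billing_Street TEXT, "
    "Billing_City TEXT, Billing_State TEXT, Billing_ZIP TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 25.0, "1 Main St", "Springfield", "IL", "62701"),
        (2, 10.0, None, "Springfield", "IL", "62701"),  # incomplete
        (3, 0.0, None, None, None, None),  # condition false, so not checked
    ],
)

# Violators: condition true and at least one required field is null.
incomplete = conn.execute(
    "SELECT id FROM Orders WHERE Total > 0.0 AND "
    "(Billing_Street IS NULL OR Billing_City IS NULL "
    "OR Billing_State IS NULL OR Billing_ZIP IS NULL) ORDER BY id"
).fetchall()
print(incomplete)
```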

Exemption

• Defines which fields may be missing
  – Ex: IF (Orders.Item_Class != “CLOTHING”) Exempt {Orders.Color, Orders.Size}

• Format:
  – Condition
  – List of fields that may be missing

Exemption 2

• If condition is true, the fields may be null

• Therefore, if condition is false, fields may not be null

• Equivalent to a test for records where the condition is false and an exempt field is null
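A minimal sketch of the exemption test for the Orders example: negate the exemption condition and test the exempt fields for nulls. The table layout and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Orders table with the fields named in the exemption rule.
conn.execute(
    "CREATE TABLE Orders (id INTEGER, Item_Class TEXT, Color TEXT, Size TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "CLOTHING", "red", "M"),
        (2, "CLOTHING", None, "L"),  # not exempt, but Color is missing
        (3, "BOOKS", None, None),  # exempt, so nulls are allowed
    ],
)

# Violators: exemption condition is false and an exempt field is null.
exemption_violators = conn.execute(
    "SELECT id FROM Orders WHERE NOT (Item_Class != 'CLOTHING') "
    "AND (Color IS NULL OR Size IS NULL) ORDER BY id"
).fetchall()
print(exemption_violators)
```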

Consistency

• Define a relationship between attributes based on field content
  – IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
  – Format:
    • Condition
    • Assertion

Consistency 2

• If condition is true, the assertion must be true

• Equivalent to test for cases where the condition is true and the assertion is false:

Select * from <table> where <condition> and not (<assertion>);
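A minimal sketch of the consistency test for the Employees example. The assertion is wrapped in parentheses so that NOT applies to all of it. The table layout and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Employees table with the fields named in the rule.
conn.execute("CREATE TABLE Employees (id INTEGER, title TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Staff Member", 25000),
        (2, "Staff Member", 35000),  # above the asserted range
        (3, "Manager", 80000),  # condition false, so not checked
        (4, "Staff Member", 30000),  # upper bound is exclusive
    ],
)

# Violators: condition is true and the assertion is false.
inconsistent = conn.execute(
    "SELECT id FROM Employees WHERE title = 'Staff Member' "
    "AND NOT (Salary >= 20000 AND Salary < 30000) ORDER BY id"
).fetchall()
print(inconsistent)
```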

Derivation

• Prescriptive form of consistency rule

• Details how one attribute’s value is determined based on other attributes
  – IF (Orders.NumberOrdered > 0) Then {Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05}

• Format:
  – Condition
  – Assignment

Derivation 2

• The assigned fields must be updated if condition is true

• Find all records where the condition is true

• Generate update SQL calls with updated values

• Execute updates
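For the Orders example, the find, generate, and execute steps can be collapsed into a single UPDATE whose WHERE clause carries the condition. This is one possible realization, not necessarily the one the course intends, and the table layout and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Orders table with the fields named in the derivation rule.
conn.execute(
    "CREATE TABLE Orders (id INTEGER, NumberOrdered INTEGER, Price REAL, Total REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [(1, 2, 10.0, 0.0), (2, 0, 5.0, 0.0)],
)

# Apply the assignment to every record where the condition is true.
conn.execute(
    "UPDATE Orders SET Total = (NumberOrdered * Price) * 1.05 "
    "WHERE NumberOrdered > 0"
)

# Map each order id to its (possibly updated) total.
totals = dict(conn.execute("SELECT id, Total FROM Orders"))
print(totals)
```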

Functional Dependence

• Functional Dependence between columns X and Y:
  – For any two records R1 and R2 in a table,
    • if field X of record R1 contains value x and field X of record R2 contains the same value x, then if field Y of record R1 contains the value y, field Y of record R2 must also contain the value y.

• In other words, attribute Y is said to be determined by attribute X.

Functional Dependence 2

• Rule Format:
  – Attribute X determines Attribute Y

• Validation test makes sure that the functional dependence criterion is met

• This means that if we take the X values from the set of all distinct (X, Y) value pairs, that list of X values should have no duplicates

Functional Dependence 3

• Create view FD as select distinct X, Y from <table>;

• Select count (*) from FD;

• Select count (distinct X) from <table>;

• These should be the same numbers.
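A minimal sketch of the count comparison, with X = zip and Y = city in a hypothetical zips table. The final GROUP BY query is an addition not found in the slides; it lists the X values that break the dependence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table in which zip (X) should determine city (Y).
conn.execute("CREATE TABLE zips (zip TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO zips VALUES (?, ?)",
    [
        ("10001", "New York"),
        ("10001", "New York"),
        ("60601", "Chicago"),
        ("60601", "Chicgo"),  # misspelling breaks the dependence
    ],
)

# Distinct (X, Y) pairs versus distinct X values, as in the slides.
conn.execute("CREATE VIEW FD AS SELECT DISTINCT zip, city FROM zips")
pair_count = conn.execute("SELECT COUNT(*) FROM FD").fetchone()[0]
x_count = conn.execute("SELECT COUNT(DISTINCT zip) FROM zips").fetchone()[0]
dependence_holds = pair_count == x_count

# Extra step (not in the slides): which X values map to more than one Y?
offenders = conn.execute(
    "SELECT zip FROM zips GROUP BY zip HAVING COUNT(DISTINCT city) > 1"
).fetchall()
print(pair_count, x_count, dependence_holds, offenders)
```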

Primary Key/Uniqueness

• A set of attributes defined as a primary key must uniquely identify a record

• Can also be viewed as a uniqueness constraint

• Format:
  – {attribute list} is PRIMARY
  – {attribute list} is UNIQUE

Primary

• Test to make sure that the number of distinct values of the expected key is the same as the number of records

• Select count(*) from <table>;

• Select count(distinct <attribute list>) from <table>;

• These numbers should be the same
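A minimal sketch of the primary key test for a two-column key. Not every engine accepts COUNT(DISTINCT a, b) with several columns, so the distinct key values are counted through a subquery here. The order_lines table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table whose expected key is (order_id, line_no).
conn.execute(
    "CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER, sku TEXT)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [(1, 1, "A"), (1, 2, "B"), (2, 1, "A"), (1, 2, "C")],  # (1, 2) repeats
)

record_count = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
key_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT order_id, line_no FROM order_lines)"
).fetchone()[0]

# The key is valid only if both counts agree.
key_is_unique = record_count == key_count
print(record_count, key_count, key_is_unique)
```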

Uniqueness

• Test for multiple record occurrences with the same set of values that should have been unique, if there is a separate known primary key

    SELECT t1.<attribute>, t2.<attribute>
    FROM <table> AS t1, <table> AS t2
    WHERE t1.<attribute> = t2.<attribute> and t1.<primary> <> t2.<primary>;
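A minimal sketch of the uniqueness test as a self-join. The employees table, with emp_id as the known primary key and ssn as the attribute that should be unique, is hypothetical. The sketch compares primary keys with < rather than <> so that each duplicate pair is reported once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: emp_id is the primary key, ssn should be unique.
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, ssn TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "111"), (2, "222"), (3, "111")],
)

# Pairs of different records that share the same ssn value.
duplicates = conn.execute(
    "SELECT t1.emp_id, t2.emp_id, t1.ssn "
    "FROM employees AS t1, employees AS t2 "
    "WHERE t1.ssn = t2.ssn AND t1.emp_id < t2.emp_id "
    "ORDER BY t1.emp_id, t2.emp_id"
).fetchall()
print(duplicates)
```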

Foreign Key

• When the values in field f in table T are chosen from the key values in field g in table S, field T.f is said to be a foreign key referencing field S.g

• If f is a foreign key, each of its values must exist in table S, column g (= referential integrity)

Foreign Key 2

• Similar to primary key

• Test is to make sure that all values in foreign key field exist in target table

Select * from <source table> where <attribute> not in (Select distinct <attribute> from <target table>);
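A minimal sketch of the foreign key test. The customers and orders tables and their rows are hypothetical, with orders.customer_id expected to reference customers.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical target table (customers) and source table (orders).
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 2), (12, 99)]
)

# Violators: foreign key values that do not exist in the target table.
orphans = conn.execute(
    "SELECT order_id, customer_id FROM orders WHERE customer_id NOT IN "
    "(SELECT DISTINCT id FROM customers) ORDER BY order_id"
).fetchall()
print(orphans)
```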

Use of Data Quality Rules

• Data Validation

• Root Cause Analysis

• Message Transformation

• Data-driven GUIs

• Metadata Collection

Data Validation

• Translate rule set into select statements

• Create a program that:
  – Loads select statements into an array, indexed by a unique integer
  – Connects to database via ODBC
  – Iterates through the array of select statements, executing each one and collecting those results

Data Validation 2

– Each type of rule has an expected result; check against the expected result

– Outputs the result of each statement to output file, tagged by rule identifier

– Results can be tallied to yield an overall percentage of valid records to total records
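The validation program described above might be sketched as follows. This is an illustration under several assumptions: sqlite3 replaces the ODBC connection, the scores table and the three rules are hypothetical, every rule is written as a violator query whose expected result is the empty set, and the tagged output goes into a list rather than a file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data to validate.
conn.execute("CREATE TABLE scores (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 50), (2, None), (3, 150), (4, 75)],
)

# Violator queries in an array; the index is the unique rule identifier.
rules = [
    "SELECT id FROM scores WHERE value IS NULL",
    "SELECT id FROM scores WHERE value < 0 OR value > 100",
    "SELECT id FROM scores WHERE id IS NULL",
]

total = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
invalid_ids = set()
report = []
for rule_id, sql in enumerate(rules):
    violators = [row[0] for row in conn.execute(sql)]
    # Expected result for a violator query is no rows at all.
    status = "PASS" if not violators else "FAIL"
    report.append(f"rule {rule_id}: {status}, violators={violators}")
    invalid_ids.update(violators)

# A record is valid only if no rule flagged it.
percent_valid = 100.0 * (total - len(invalid_ids)) / total
print("\n".join(report))
print(percent_valid)
```

Collecting the violating record ids in a set keeps a record that breaks several rules from being counted more than once in the overall percentage.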

Root Cause Analysis

• Root cause analysis can be started by looking at the counts of violated rules

• Use the most frequently violated rule as a starting place

Message Transformation

• Electronic Data Interchange

• Use DQ rules to validate incoming messages

• Use DQ rules (derivations, mappings) to transform incoming messages into an internal format

Data-driven GUIs

• Data dependence is specified in a collection of rules

• Generate equivalence classes of data values based on dependence specification

Data-driven GUIs

• First, look for all independent attributes – this is class 0

• For class i, collect all attributes that depend on class (i – 1)

• The GUI will be constructed to iteratively request data from class 0..n

• Based on the results from collecting data at step j, the rules associated with the actual values are applied, determining which values are requested at step j + 1
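The class construction can be sketched as a layering over a dependence map. The depends_on dictionary below is hypothetical, loosely based on the Orders rules used earlier, and the sketch reads "depends on class (i - 1)" as "all of the attribute's dependencies fall in earlier classes".

```python
# Hypothetical dependence specification: attribute -> attributes it depends on.
depends_on = {
    "Item_Class": set(),
    "NumberOrdered": set(),
    "Price": set(),
    "Color": {"Item_Class"},
    "Size": {"Item_Class"},
    "Total": {"NumberOrdered", "Price"},
    "Billing_Street": {"Total"},
}

classes = []  # classes[i] holds the attributes requested at step i
assigned = set()
while len(assigned) < len(depends_on):
    # Next class: unassigned attributes whose dependencies are all assigned.
    layer = sorted(
        attr
        for attr, deps in depends_on.items()
        if attr not in assigned and deps <= assigned
    )
    if not layer:
        raise ValueError("cyclic dependence between attributes")
    classes.append(layer)
    assigned.update(layer)

for step, layer in enumerate(classes):
    print(f"class {step}: {layer}")
```

A GUI built on this ordering would request class 0 first, then use the entered values to decide which attributes of the next class still need to be asked for.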

Metadata Collection

• Use domain and mapping derivation rules to collect metadata

• Use other rules as a documentation of business operations