INDEPTH Data Quality Workshop Program & Curriculum

11-13 May 2010, Accra, Ghana

Course Facilitator : Dr Kobus Herbst


1 Workshop Objectives

1. Create a common understanding of data quality in the context of health and demographic surveillance
2. Learn from the experience regarding data quality in the iShare initiative
3. Gain practical experience in measuring data quality in HDSS databases
4. Derive and agree on minimum data quality metrics for INDEPTH sites
5. Apply a minimum set of common data quality metrics to own HDSS database
6. Discuss the form and content of site data quality improvement projects and INDEPTH’s role in promoting such

2 Outcomes

1. Minimum set of INDEPTH Data Quality Metrics defined
2. Site data quality baselines established
3. Common outline and criteria for site data quality projects agreed to
4. Recommendation made for an INDEPTH Data Quality Assurance Program

3 Program

Time | Topic | Presenter

Day 1 : 11 May 2010
8:00-9:00 | Registration | INDEPTH Secretariat
9:00-9:30 | Welcome and Introduction to Workshop Objectives | INDEPTH Executive, Course Facilitator
9:30-10:30 | What is Data Quality? | Course Facilitator
10:30-11:00 | Tea Break
11:00-11:30 | Impact of Data Quality on Demographic Measures | Ayaga Bawah
11:30-12:00 | Extent and Implications of Poor Quality Data – iShare Experience | iShare representative
12:00-12:30 | Causes of Poor Quality Data | Course Facilitator
12:30-13:30 | Lunch Break
13:30-14:30 | Measuring Data Quality : Theory | Course Facilitator
14:30-15:30 | Measuring Data Quality : iShare Experience | iShare representative
15:30-16:00 | Tea Break
16:00-17:00 | Measuring Data Quality : Practical – Attribute domain constraints | Course Facilitator

Day 2 : 12 May 2010
8:30-9:30 | Measuring Data Quality : Practical – Relational integrity constraints | Course Facilitator
9:30-10:30 | Measuring Data Quality : Practical – Historical Data & State-Dependent Objects | Course Facilitator
10:30-11:00 | Tea Break
11:00-11:30 | Measuring Data Quality : Practical – General Attribute Dependencies | Course Facilitator
11:30-13:00 | Discussion : Agreeing on a minimum set of data quality metrics for INDEPTH | All Participants
13:00-14:00 | Lunch Break
14:00-17:00 | Applying agreed set of data quality metrics to own database | All Participants

Day 3 : 13 May 2010
8:30-10:00 | Comparison & Standardisation of Minimum Data Quality Metrics | Course Facilitator
10:00-10:30 | Tea Break
10:30-11:00 | Total Data Quality Management : Theory | Course Facilitator
11:00-12:30 | Discussion : Data Quality Assurance in INDEPTH : The Way Forward | All Participants
12:30-13:00 | Publication : Workshop Proceedings | Course Facilitator
13:00-14:00 | Lunch
14:00-16:00 | INDEPTH Minimum Dataset | INDEPTH Secretariat

4 Curriculum

4.1 What is Data Quality?

4.1.1 Learning Objectives

1. Explain the different roles that can be identified in the information production system
2. Understand the concept of an information product, and relate that to the HDSS research context
3. Understand and explain the different concepts of data quality
4. Identify the dimensions of data quality most relevant to HDSS

4.1.2 Content

1. Information System Roles
2. Information Products
3. Concepts & Dimensions of Data Quality

4.1.3 Pre-reading and Reference Material

1. Carlo Batini, Monica Scannapieco. Data Quality. Concepts, Methodologies and Techniques. 2006. Springer Berlin. Pp 1-49.
2. Jack E. Olson. Data Quality. The Accuracy Dimension. 2003. Morgan Kaufmann. San Francisco. Pp 3-64.
3. Census Bureau Methodology & Standards Council. Census Bureau Principle: Definition of Data Quality. 2006. US Census Bureau.
4. Danette McGilvray. Executing Data Quality Projects. Ten Steps to Quality Data and Trusted Information. 2008. Morgan Kaufmann Burlington. Pp 30-33.
5. Tim Holt, Tim Jones. Quality work and conflicting quality objectives. 1998. 84th DGINS conference, Stockholm 28-29 May 1998. Office for National Statistics, UK.

4.2 Impact of Data Quality on Demographic Measures

4.2.1 Learning Objectives

To be provided.

4.2.2 Content

To be provided.

4.2.3 Pre-reading and Reference Material

To be provided.

4.3 Extent and Implications of Poor Quality Data – iShare Experience

4.3.1 Learning Objectives

To be provided.

4.3.2 Content

To be provided.

4.3.3 Pre-reading and Reference Material

To be provided.

4.4 Causes of Poor Quality Data

4.4.1 Learning Objectives

1. Able to classify and describe the causes of poor data quality

4.4.2 Content

1. Research Design
   a. Research Question
   b. Research Methodology
   c. Data System Design
2. Population Factors
   a. Education
   b. Cultural
3. Data Collection
   a. Field workers
   b. Data collection instruments
   c. Data Entry
4. Data Analysis
   a. Data Conversion
   b. Data Extraction
   c. Data Cleaning

4.4.3 Pre-reading and Reference Material

1. Van den Broeck, J., S.A. Cunningham, R. Eeckels, and K. Herbst, Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med, 2005. 2(10): p. e267.

4.5 Measuring Data Quality

4.5.1 Learning Objectives

1. Classify, list and explain the different rules that can be applied to measure data quality

4.5.2 Content

1. Data Quality Rules
   a. Attribute domain constraints
   b. Relational integrity constraints
   c. Rules for historical data
   d. Rules for state-dependent objects
   e. General attribute dependency rules

4.5.3 Pre-reading and Reference Material

1. Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. Data Quality Assessment. Communications of the ACM. April 2002/Vol. 45, No. 4ve. p211.
2. Arkady Maydanchik. Data Quality Assessment. 2007. Technics Publications.

4.6 Measuring Data Quality : Practical

4.6.1 Learning Objectives

1. Apply Data Quality Rules to DSS Reference Data Model to derive data quality indicators

4.6.2 Content

The examples all use a sample database built on the INDEPTH Reference Data Model (see Appendix A). The SQL used to derive the data quality indicators is contained in Appendix B. The SQL dialect is SQL Server 2008 T-SQL.

1. Attribute domain constraints

a. Optionality Constraints

These constraints prevent attributes from taking Null, or missing, values. Default values are often entered to circumvent the Not-Null constraints, i.e., the attribute is populated with a default value when the actual value is not available.

Example : Cause of Death codes

Cause | n
Unassigned | 520
Null | 745
Assigned | 8225
Total | 9490

Indicator = 1 − (Unassigned + Null) / Total = 86.7%
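The optionality indicator can be reproduced outside the database once the category counts are known. The following is a minimal Python sketch, not the workshop's own script; the function name `optionality_indicator` and the dictionary layout are assumptions made for illustration, while the counts come from the Cause of Death example.

```python
def optionality_indicator(counts):
    """Return 1 - (Unassigned + Null) / Total for a dict of category counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to assess")
    # Default ("Unassigned") values count as missing, just like Nulls.
    missing = counts.get("Unassigned", 0) + counts.get("Null", 0)
    return 1 - missing / total


# Counts taken from the Cause of Death example.
cause_counts = {"Unassigned": 520, "Null": 745, "Assigned": 8225}
print(f"{optionality_indicator(cause_counts):.1%}")  # 86.7%
```

Treating default values as missing is the design choice that matters here: counting only Nulls would overstate the quality of the attribute.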

[Figure: Data Quality Cause of Death, indicator (0% to 100%) by year, 2000 to 2010]

b. Format Constraints

These constraints define the expected form in which the attribute values are stored in the database field. Format constraints are most important when dealing with “legacy” databases. However, even modern databases are full of surprises. From time to time, numeric and date/time attributes are still stored in text fields.

Example : Surname field containing invalid characters.

Use wildcard characters or regular expressions to detect format violations. The function to use depends on the particular database. In SQL Server 2008 T-SQL, the PATINDEX function can find any LastName containing a character outside the set of upper and lower case letters, the space and the single quote (') character.

    SELECT COUNT(*)
    FROM dbo.Individuals
    WHERE PATINDEX('%[^a-zA-Z '']%', LastName) > 0

LastName | n
Valid | 126275
Invalid | 137
Total | 126412

Indicator = 1 − Invalid / Total = 99.9%
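For sites that profile extracts outside SQL Server, the same character-class test can be written as a regular expression. This is a hedged Python sketch rather than part of the workshop material; the sample surnames and the function names are hypothetical. Note that with this pattern Python flags accented letters (as in "García"), so a site with such names would need to widen the character class.

```python
import re

# Any character other than an ASCII letter, a space or a single quote
# counts as a format violation (mirrors the pattern '%[^a-zA-Z '']%').
INVALID_CHAR = re.compile(r"[^a-zA-Z ']")


def has_invalid_chars(last_name):
    """True if the surname contains at least one disallowed character."""
    return INVALID_CHAR.search(last_name) is not None


def format_indicator(last_names):
    """Return 1 - Invalid / Total for a non-empty list of surnames."""
    invalid = sum(1 for name in last_names if has_invalid_chars(name))
    return 1 - invalid / len(last_names)


# Hypothetical surnames, for illustration only.
sample = ["Nesbit", "O'Brien", "van Wyk", "Smith2", "Myers?"]
print([name for name in sample if has_invalid_chars(name)])
```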

c. Valid Value Constraints

These constraints limit the permitted attribute values to a prescribed list or range. Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To identify valid values, we first need to collect counts of all actual values. These counts can then be analyzed, and actual values can be cross-referenced against the valid value list, if available. Values that are found in many records are probably valid, even if they are missing from the data dictionary. Such circumstances typically arise when new values are added after the original database design, but are not added to the documentation. Values that have low frequency are suspect.

Example : Residency episode initiating event type.

A residency episode should only be started by DSS start, birth or in-migration.

Start Type | n
Valid | 168544
Invalid | 2109
Total | 170653

Indicator = 1 − Invalid / Total = 98.8%

[Figure: Data Quality : Residency Start, indicator (94% to 100%) by year, 2000 to 2010]

Example : Birth Weights

[Figure: Birth Weight distribution, horizontal axis 0 to 5200, vertical axis 0 to 500]

d. Precision and Granularity Constraints

These constraints require all values of an attribute to have the same precision, granularity, and unit of measurement. Precision constraints can apply to both numeric and date/time attributes. For numeric values, they define the desired number of decimals. For date/time attributes, precision can be defined as calendar month, day, hour, minute, or second. Data profiling can be used to calculate the distribution of values for each precision level.

Example : Date of Birth Precision

Date Precision | Precision | n | Score
Day | 1 | 15765 | 141885
Week | 2 | 66 | 528
Fortnight | 3 | 2 | 14
Month | 4 | 519 | 3114
Quarter | 5 | 11 | 55
Semester | 6 | 67 | 268
Year | 7 | 0 | 0
Decade | 8 | 0 | 0
Unknown | 9 | 0 | 0
Total | | 16430 | 145864

Score = (10 − Precision) × Frequency, calculated per precision level
ScoreMax = 9 × (sum of Frequency over all precision levels) = 147870
ScoreTotal = sum of Score over all precision levels = 145864
Indicator = 1 − (ScoreMax − ScoreTotal) / ScoreMax = 98.6%
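The scoring scheme above is easy to check by hand or in a few lines of code. The sketch below, in Python, applies the three formulas to the Date of Birth frequencies from the table; the function name and the dictionary keyed by precision code are assumptions made for illustration.

```python
def precision_indicator(freq_by_precision):
    """freq_by_precision maps precision code (1 = Day ... 9 = Unknown) to n."""
    # Score = (10 - Precision) x Frequency, summed over all levels.
    score_total = sum((10 - code) * n for code, n in freq_by_precision.items())
    # Best case: every record has day precision and so scores 9.
    score_max = 9 * sum(freq_by_precision.values())
    return 1 - (score_max - score_total) / score_max


# Frequencies from the Date of Birth Precision table.
dob = {1: 15765, 2: 66, 3: 2, 4: 519, 5: 11, 6: 67, 7: 0, 8: 0, 9: 0}
print(f"{precision_indicator(dob):.1%}")  # 98.6%
```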

Example : Migration Date Precision

Indicator Value

External In-Migration Date 77.7%

External Out-Migration Date 77.4%

Internal In-Migration Date 79.0%

Internal Out-Migration Date 78.1%

[Figure: Migration Date Precision, indicator (0% to 100%) by year, 2000 to 2009]

2. Relational integrity constraints

a. Identity rules

An identity rule validates that every record in a database table corresponds to one and only one real world entity and that no two records reference the same entity.

Example : Potential Individual duplications

Similarity measure =
    Levenshtein distance [1] (FirstName A, FirstName B)
    + Levenshtein distance (LastName A, LastName B)
    + (0 if Sex A = Sex B, otherwise 1)
    + ABS(YEAR(DoB A) − YEAR(DoB B))
    + ABS(MONTH(DoB A) − MONTH(DoB B))
    + ABS(DAY(DoB A) − DAY(DoB B))

[Figure: Individual Identifier Similarity Measure Distribution, number of pairs (0 to 1000000) by similarity measure (0 to 54)]

Similarity | n
0 | 42
1 | 238
2 | 442
3 | 820
4 | 1699
5 | 3832
6 | 8349
7 | 16849
8 | 31836
9 | 59679
10 | 110038

Indicator = 1 − (Individuals − UniqueIndividuals) / Individuals = 99.4%

Similarity = 1

Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
46143169 [Ind A and Ind B digits run together in the source] | Nesbit, Nqobile | Nesbit, Nqobile | FEM | FEM | 1992/09/08 | 1993/09/08
1001 | 1005 | Nguyen, Simbongiwe | Nguyen, Sibongiwe | FEM | FEM | 1995/03/25 | 1995/03/25

Similarity = 2

Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
1388 | 18938 | Mitchell, Hlengiwe | Mitchell, Hlengiwe | FEM | FEM | 1983/05/16 | 1983/04/15
3378 | 3380 | Myers, Sandile | Myers, Zandile | MAL | FEM | 1987/11/25 | 1987/11/25

Similarity = 3

Ind A | Ind B | Name A | Name B | Sex A | Sex B | DoB A | DoB B
84 | 85 | Johnson, Ntando | Johnson, Nontando | MAL | FEM | 1983/12/03 | 1983/12/03
255 | 260 | Sosibo, Thandiwe | Sosibo, Thandeka | FEM | FEM | 1976/05/07 | 1976/05/07
569 | 12191 | Smith, Bongani | Smith, Lindani | MAL | MAL | 1994/08/14 | 1994/08/14
585 | 35418 | García, Sanele | García, Zanele | MAL | FEM | 1997/12/06 | 1996/12/06

[1] The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
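The similarity measure used for the potential-duplicate tables can be sketched in Python as follows. This is an illustrative implementation, not the workshop's T-SQL: the tuple layout `(first_name, last_name, sex, date_of_birth)` and the function names are assumptions, and the Levenshtein distance is the standard dynamic-programming formulation. The pairs shown are taken from the tables above.

```python
from datetime import date


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """a and b are (first_name, last_name, sex, date_of_birth) tuples."""
    return (levenshtein(a[0], b[0]) + levenshtein(a[1], b[1])
            + (0 if a[2] == b[2] else 1)
            + abs(a[3].year - b[3].year)
            + abs(a[3].month - b[3].month)
            + abs(a[3].day - b[3].day))


pair_1 = (("Simbongiwe", "Nguyen", "FEM", date(1995, 3, 25)),
          ("Sibongiwe", "Nguyen", "FEM", date(1995, 3, 25)))
pair_3 = (("Sanele", "García", "MAL", date(1997, 12, 6)),
          ("Zanele", "García", "FEM", date(1996, 12, 6)))
print(similarity(*pair_1), similarity(*pair_3))  # 1 3
```

Comparing every individual with every other individual grows quadratically with the population size, so in practice the comparison is usually restricted to candidate pairs that already share some attribute.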

b. Reference rules

A reference rule ensures that every reference made from one entity occurrence to another entity occurrence can be successfully resolved. Each reference rule is represented in relational data models by a foreign key that ties an attribute or a collection of attributes of one entity with the primary key of another entity. Foreign keys guarantee that navigation of a reference across entities does not result in a “dead end.”

Example : Child to Parent references.

Status | Mother | Father
Known | 74,043 | 32,257
Missing | 2,855 | 9,708
Unknown | 49,514 | 84,447
Total | 126,412 | 126,412
Indicator A | 97.7% | 92.3%
Indicator B | 58.6% | 25.5%

Indicator A = 1 − Missing / Total
Indicator B = 1 − (Missing + Unknown) / Total
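A short Python sketch of the reference check and the two indicators follows. The source does not define how "Missing" differs from "Unknown"; the helper `dangling_references` assumes one possible reading (a parent ID that is recorded but matches no individual) and is hypothetical, as are all names used here. The counts passed to `reference_indicators` are the Mother column of the table.

```python
def dangling_references(child_to_parent, known_ids):
    """Children whose recorded parent ID matches no known individual.

    Assumption: a parent ID of None means no parent was recorded at all,
    which is a different case from a reference that cannot be resolved.
    """
    return sorted(child for child, parent in child_to_parent.items()
                  if parent is not None and parent not in known_ids)


def reference_indicators(known, missing, unknown):
    """Return (Indicator A, Indicator B) as defined in the table."""
    total = known + missing + unknown
    return 1 - missing / total, 1 - (missing + unknown) / total


mother_a, mother_b = reference_indicators(known=74043, missing=2855, unknown=49514)
print(f"{mother_a:.1%} {mother_b:.1%}")  # 97.7% 58.6%
```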

c. Cardinal rules

A cardinal rule defines the constraints on relationship cardinality. Cardinal rules are not to be confused with reference rules. Whereas reference rules are concerned with the identity of the occurrences in referenced entities, cardinal rules define the allowed number of such occurrences.

Residency | Wrong | Correct
Exists | 170653 | 124657
None | 1755 | 1755
Total | 172408 | 126412
Indicator | 99.0% | 98.6%

Indicator = 1 − None / Total

Residency Cardinality

Cardinality | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
n | 1755 | 90779 | 24790 | 6822 | 1686 | 436 | 115 | 19 | 9 | 1

[Figure: Residency Cardinality, bar chart of the counts above]
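The difference between the "Wrong" and "Correct" columns is the unit of counting: the denominator should be individuals, not residency episodes. A hedged Python sketch of the cardinality profile is given below; the function names are assumptions, and the distribution is the one shown in the Residency Cardinality table.

```python
from collections import Counter


def cardinality_distribution(individual_ids, episode_owner_ids):
    """Count individuals by the number of residency episodes they own."""
    per_individual = Counter(episode_owner_ids)
    # Individuals that never appear as an episode owner get cardinality 0.
    return Counter(per_individual.get(i, 0) for i in individual_ids)


def cardinality_indicator(distribution):
    """Return 1 - None / Total, counting individuals rather than episodes."""
    total = sum(distribution.values())
    return 1 - distribution.get(0, 0) / total


# Distribution from the Residency Cardinality table.
table = {0: 1755, 1: 90779, 2: 24790, 3: 6822, 4: 1686,
         5: 436, 6: 115, 7: 19, 8: 9, 9: 1}
print(f"{cardinality_indicator(table):.1%}")  # 98.6%
```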

d. Inheritance rules

An inheritance rule expresses integrity constraints on entities that are associated through generalization and specialization, or more technically through sub-typing.

Example : Not available.

3. Rules for historical data

a. Currency Rule

A currency rule enforces the desired “freshness” of the historical data. Currency rules are usually expressed in the form of constraints on the effective date of the most recent record in the history. For example, if the status of an individual under surveillance is 'Current', then the last visit date should be no earlier than the start of the previous surveillance round.

Example 1 : The last observation for current residency episodes must fall in the previous census round or later.

Currency | Residency Episodes
Current | 62621
Not Current | 2384
Total | 65005

Indicator = 1 − NotCurrent / Total = 96.3%
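A currency check reduces to comparing each episode's last observation date with a cut-off. The Python sketch below assumes the cut-off is the start date of the previous census round; the function name and the sample dates are hypothetical and only illustrate the rule.

```python
from datetime import date


def currency_indicator(last_observed, previous_round_start):
    """last_observed holds one last-observation date per current episode."""
    not_current = sum(1 for d in last_observed if d < previous_round_start)
    return 1 - not_current / len(last_observed)


# Hypothetical last-observation dates for four current residency episodes.
last_seen = [date(2009, 3, 1), date(2009, 9, 1), date(2008, 1, 1), date(2009, 12, 1)]
print(currency_indicator(last_seen, previous_round_start=date(2009, 1, 1)))  # 0.75
```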

Example 2 : At year end, the last status observation should be no earlier than 1 July of that year (no older than 183 days).

[Figure: Currency of Status Observations, share of observations (0% to 100%) that are Current, NotCurrent or Undefined, by year, 2000 to 2008]

b. Retention Rule

A retention rule enforces the desired depth of the historical data. Retention rules are usually expressed in the form of constraints on the overall duration or the number of records in the history.

c. Granularity rule

A granularity rule requires all measurement periods in an accumulator history to have the same size.

For example, if the surveillance design calls for a visit to each homestead every six months, is that in fact the case?

[Figure: Inter-round Visit Gap, interquartile ranges with 5-95 percentile extremes, in days (0 to 350) by round n-1:n, rounds 2 to 21]

[Figure: Round Granularity, 0% to 100% by round, rounds 2 to 21]

d. Continuity rule

A continuity rule prohibits gaps and overlaps in accumulator histories. Continuity rules require that the beginning date of each measurement period immediately follows the end date of the previous period.

For example, for internal migrations the next residency episode must follow directly on the previous one.

Example : Internal migrations

Continuity | n
Continuity | 18 657
Discontinuity | 6 430
Total | 25 087

Indicator = 1 − Discontinuity / Total = 74.4%
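The continuity test for consecutive residency episodes can be sketched in Python as below. The tolerance of one day between the end of one episode and the start of the next is an assumption (a site may instead require the dates to be equal), and the function name and sample episodes are hypothetical.

```python
from datetime import date, timedelta


def continuity_counts(episodes, max_gap=timedelta(days=1)):
    """episodes: (start, end) date pairs for one individual.

    Returns (continuous, discontinuous) counts over consecutive pairs.
    A negative gap (overlap) or a gap longer than max_gap is a discontinuity.
    """
    ordered = sorted(episodes)
    continuous = discontinuous = 0
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if timedelta(0) <= gap <= max_gap:
            continuous += 1
        else:
            discontinuous += 1
    return continuous, discontinuous


# Hypothetical episodes: the second follows directly, the third after a gap.
history = [(date(2005, 1, 1), date(2006, 3, 31)),
           (date(2006, 4, 1), date(2007, 6, 30)),
           (date(2007, 9, 1), date(2008, 1, 1))]
print(continuity_counts(history))  # (1, 1)
```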

[Figure: Continuity, indicator (0% to 100%) by year, 2000 to 2009]

e. Timestamp pattern rule

A timestamp pattern rule requires all timestamps to fall into a certain prescribed date interval, such as every March or every other Wednesday or between the first and fifth of each month. Occasionally the pattern takes the form of minimum or maximum length of time between measurements. For example, participants in a medical study may be required to take blood pressure readings at least once a week. While the length of time between particular measurements will differ, it has to be no longer than seven days.

Example : Similar to the granularity rule, a homestead has to be visited at least once every six months.

Semester Visits | Locations
Visited | 194,238
Not Visited | 19,488
Total | 213,726

Indicator = 1 − NotVisited / Total = 90.9%

Note : Care should be taken with the type of observations used to derive this measure. If, for example, only observations tied to residency and status observations are considered, locations that were visited but where no observation was recorded because the occupants could not be contacted will be counted as not visited.

[Figure: Proportion of Locations With At Least One Visit per Semester During Period of Occupancy, 0% to 100% by Year/Semester, 2000 to 2009]

f. Value Pattern Rule

Value histories for time-dependent attributes usually also follow systematic patterns. A value pattern rule utilizes these patterns to predict reasonable ranges of values for each measurement and identify likely outliers. Value pattern rules can restrict direction, magnitude, or volatility of change in data values.

i. Direction of Change

The simplest value pattern rules restrict the direction of value changes from measurement to measurement. A person's height is unlikely to decrease over successive measurements, and the same holds for educational attainment.

Example : Educational attainment cannot decline.

Direction | Measures
Invalid | 21,178
Valid | 128,613
Total | 149,791

Indicator = 1 − Invalid / Total = 85.9%

ii. Magnitude of Change

A magnitude-of-change rule is usually expressed as a maximum (and occasionally minimum) allowed change per unit of time.

Example : Educational attainment cannot increase by more than the difference in years between two observation dates.

Direction | Measures
Valid | 117,626
Invalid Direction | 21,181
Invalid Magnitude | 10,984
Total | 149,791

Indicator = 1 − (InvalidDirection + InvalidMagnitude) / Total = 78.5%
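Both value pattern rules can be applied to consecutive pairs of education observations. The Python sketch below assumes that educational attainment is recorded as completed years of education and that elapsed time is measured in years of 365.25 days; both are assumptions, as is the function name. The rule is applied strictly, so a site may want to allow some tolerance for observations taken slightly less than a year apart.

```python
from datetime import date


def classify_change(earlier, later):
    """earlier and later are (observation_date, years_of_education) pairs."""
    (date_a, level_a), (date_b, level_b) = earlier, later
    if level_b < level_a:
        return "invalid direction"       # attainment cannot decline
    years_elapsed = (date_b - date_a).days / 365.25
    if level_b - level_a > years_elapsed:
        return "invalid magnitude"       # gained more years than elapsed
    return "valid"


print(classify_change((date(2005, 1, 1), 8), (date(2006, 1, 1), 7)))
print(classify_change((date(2005, 1, 1), 5), (date(2006, 1, 1), 9)))
print(classify_change((date(2005, 1, 1), 5), (date(2007, 1, 1), 6)))
```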

g. Event History rules

i. Event Dependencies

Various events often affect the same objects and therefore may be interdependent. Data quality rules can use these dependencies to validate the event histories. E.g. an out-migration event cannot be recorded for an individual without a prior birth or in-migration event.

Example : Outmigration events cannot be preceded by ‘Death’, ‘Visit’ or ‘Outmigration’ events.

Dependency | n
Correct | 56,817
Incorrect | 3
Total | 56,820

Indicator = 1 − Incorrect / Total = 99.99%

ii. Event Conditions

Events of many kinds do not occur at random but rather only happen under certain unique circumstances. Event conditions verify these circumstances.

Example : Birth spacing. The time between two subsequent pregnancies with a live birth outcome should not be less than 9 months (280 days).

Birth Spacing | Pregnancies
Too Short | 542
Valid | 40,935
Total | 41,477

Indicator = 1 − TooShort / Total = 98.7%
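The birth spacing condition can be checked per woman by sorting her live-birth pregnancy end dates and comparing neighbours. The following Python sketch uses the 280-day threshold from the example; the function name and the sample dates are hypothetical. Each date stands for one pregnancy, so multiple births from the same pregnancy should appear only once.

```python
from datetime import date


def short_birth_intervals(pregnancy_end_dates, minimum_days=280):
    """Count consecutive live-birth pregnancies spaced closer than minimum_days."""
    dates = sorted(pregnancy_end_dates)
    return sum(1 for a, b in zip(dates, dates[1:]) if (b - a).days < minimum_days)


# Hypothetical pregnancy end dates for one woman: the second follows
# the first by 214 days, which is too short.
deliveries = [date(2001, 5, 1), date(2001, 12, 1), date(2004, 2, 1)]
print(short_birth_intervals(deliveries))  # 1
```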

iii. Event-specific Attribute Constraints

Events themselves are often complex entities, each with numerous attributes.

Example : A pregnancy outcome event requires the mother to be of child bearing age.

Mother's Age | Pregnancies
Valid | 70,601
Invalid | 1,285
Total | 71,886

Indicator = 1 − Invalid / Total = 98.2%

4. Rules for state-dependent objects

These rules place constraints on the lifecycle of objects described by so-called state-transition models.

State-dependent objects go through a sequence of states in the course of their life cycle as a result of various events. Data for state-dependent objects is very common in real world databases and is also the most error-prone. Various data quality rules can be implemented to validate such data. Some of these rules are rather simple, while others can be quite complex and vary significantly depending on the data structure. In all cases, data quality rules for state-dependent objects are key to successful data quality assessment, since data for such objects is typically very important and yet contains numerous "hidden" errors.

[Figure: State-transition model. States: Not under surveillance; Under surveillance (known location); Under surveillance (unknown location); Dead. Actions: Census, Inmigration, Birth, Outmigration, Death, Internal Outmigration, Internal Inmigration, Visit.]

a. State domain constraint

A state domain constraint limits the set of allowed states to only those shown in the state-transition model. Invalid states are usually typos inside otherwise valid records. The true state can often be deduced based on the action value.

b. Action domain constraint

An action domain constraint limits the set of allowed actions to only those shown in the state-transition model. Invalid actions are usually typos inside otherwise valid records. The true action can often be deduced based on the state value.

c. Terminator domain constraint

A terminator domain constraint limits the set of allowed terminators, specifically states in which an object can start and end its life cycle. Invalid terminators often are a symptom of missing records at the beginning of the life cycle.

Example : Invalid states at first transition

To State | Action | n
INV | HMS | 1,838
INV | INM | 51
INV | INT | 9,205
SLK | DLV | 16,430
SLK | DSS | 62,633
SLK | INM | 34,500
Total | | 124,657

Indicator : 91.1%

d. State-transition constraints

These constraints limit state changes to those allowed by the state-transition model. For example, a person who is already out-migrated cannot be out-migrated again without being in-migrated in between. Invalid state-transitions often signify a missing action.

Example : Residency state transitions

Final State | Individuals
Invalid | 16,409
Valid | 108,248
Total | 124,657

Indicator = 1 − InvalidEndState / Individuals = 86.8%

State Transitions | n
Invalid | 16,409
Valid | 296,501
Total | 312,910

Indicator = 1 − InvalidTransition / Transitions = 94.8%
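A state-transition check walks each individual's action history through a table of allowed transitions. The Python sketch below is illustrative only: the `ALLOWED` table is a hypothetical simplification that uses the state and action names from the model, not the workshop's actual transition rules or codes, and leaving the state unchanged after an invalid action is a design assumption.

```python
# Hypothetical transition table: (current state, action) -> next state.
ALLOWED = {
    ("not under surveillance", "census"): "known location",
    ("not under surveillance", "birth"): "known location",
    ("not under surveillance", "inmigration"): "known location",
    ("known location", "visit"): "known location",
    ("known location", "outmigration"): "not under surveillance",
    ("known location", "death"): "dead",
    ("known location", "internal outmigration"): "unknown location",
    ("unknown location", "internal inmigration"): "known location",
}


def check_history(actions, state="not under surveillance"):
    """Return (valid, invalid, final_state) for a sequence of actions."""
    valid = invalid = 0
    for action in actions:
        next_state = ALLOWED.get((state, action))
        if next_state is None:
            invalid += 1          # disallowed here; the state is left unchanged
        else:
            valid += 1
            state = next_state
    return valid, invalid, state


# A second out-migration without an intervening in-migration is invalid.
print(check_history(["birth", "visit", "outmigration", "outmigration"]))
```

Summing the valid and invalid counts over all individuals gives the inputs for the transition-level indicator, while the final state of each individual feeds the individual-level indicator.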

Invalid Transition Causes

Invalid Reason | Action | n | %
Action disallowed if not under surveillance | INT | 11,329 | 69.0%
Invalid action | HDS | 1,161 | 19.8% (HDS and HMS combined)
Invalid action | HMS | 2,088 |
Action cannot start a residency if at unknown location | INM | 1,620 | 9.9%
Temporal integrity violated | HDS | 3 | 0.9% (all four actions combined)
Temporal integrity violated | HMS | 1 |
Temporal integrity violated | INT | 78 |
Temporal integrity violated | OTM | 64 |
Action condition violated | INM | 55 | 0.3%
Action cannot start a residency if already at known location | INM | 4 | 0.1% (INM and INT combined)
Action cannot start a residency if already at known location | INT | 6 |
Total | | 16,409 |

e. State-action constraints

State-action constraints require that each action is consistent with the change in the object state. For example, after an out-migration, the state of an individual must be non-resident.

f. Continuity rules

Prohibit gaps and overlaps in state-transition history. In other words, they require that the effective date of each state record must immediately follow the end date of the previous state record.

Example : See 3.d. Historical data, continuity rule

g. Duration rules

Put a constraint on the maximum and/or minimum length of time an object can stay in any specific state. The simplest form of the duration rule is the zero-length rule, which requires the length of time spent in each state to be greater than zero.

Example : Residency episode duration cannot be negative (end before start) or zero.

Duration | Episodes
Valid | 170,372
Invalid | 281
Total | 170,653

Indicator = 1 − InvalidEpisodes / Episodes = 99.8%

[Figure: Resident Episode Durations, indicator (98.6% to 100.0%) by year, 2000 to 2010]

h. Action pre-conditions

These are the conditions that must be satisfied before an action can take place, e.g. the mother must be resident for a child to start residency with birth.

Example : Mother’s state at child residency start, if child starts residency with delivery.

Mothers | Children
Mother resident | 14,696
Mother non-resident | 1,619
Mother unknown | 131
Total | 16,446

Indicator = 1 − (Mother non-resident + Mother unknown) / Children = 89.4%

i. Action post-conditions

These are the conditions that must be satisfied after the action is successfully completed.

5. General attribute dependency rules

Rules that describe complex attribute relationships, including constraints on redundant, derived, partially dependent, and correlated attributes.

a. Redundant attributes

Redundant attributes are data elements that represent the same attribute of a real world object. While attribute redundancy goes against basic data modelling principles, it is common in practice for several reasons. First, redundancy is widespread in “legacy” databases and certain systems that were converted from the “legacy” databases. Secondly, redundancy is often used even in modern relational databases to improve efficiency of data access, information presentation, and transaction processing. Finally, some data across different systems are invariably redundant. Comparison of redundant attributes is a sure way to identify (and eventually correct) numerous data problems.

Example : The link between mother and child is recorded explicitly via MotherID and implicitly via births and pregnancies; the two should be consistent.

Link | Pairs
Linked | 15746
Not linked | 1
Total | 15747

Of the cases where the residency start is Birth and it is linked to a Pregnancy, in one case this link between child and mother was not reflected in the MotherID of the child.

Indicator = 1 − NotLinked / Pairs = 99.99%

The converse is slightly more complex. Of the children born to the mother while she was resident, are all such children recorded as resident and the residency start marked as Birth? Whether this test is absolute will depend on the eligibility rules of the HDSS.

Link | Pairs | Description
Birth not linked | 1750 | Child resident by birth is not linked to a resident mother via a pregnancy.
Birth not resident | 1102 | Resident mother gave birth to a child that is not resident from birth.
Consistent | 14696 |
Total | 17548 |

Indicator = 1 − (BirthsNotLinked + BirthsNotResident) / Pairs = 83.7%

b. Derived Attributes

Values of derived attributes are calculated based on the values of some other attributes. This approach is very common when the calculation is rather complex and involves data stored in multiple records of possibly multiple entities. Performing the calculation on the fly is then very inefficient. One of the most common special cases of derived attribute constraints is a balancing rule, which requires an aggregate attribute to equal the total of atomic level attribute values.

Example : Data should satisfy the demographic equation:

Population(t+1) = Population(t) + (Births(t) − Deaths(t)) + (Immigration(t) − Emigration(t))

Component | Y2000 | Y2001 | Y2002 | Y2003 | Y2004 | Y2005 | Y2006 | Y2007 | Y2008 | Y2009
Population | 0 | 66035 | 68027 | 67277 | 65785 | 64981 | 64916 | 65376 | 64289 | 63494
Start observation | 62633 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Births | 1675 | 1723 | 1719 | 1641 | 1743 | 1748 | 1749 | 1692 | 1579 | 1101
Immigration | 3887 | 5689 | 7033 | 6032 | 5111 | 5348 | 5511 | 5242 | 5279 | 4992
Deaths | 886 | 1077 | 1098 | 1129 | 983 | 979 | 886 | 913 | 796 | 697
Emigration | 1923 | 4857 | 8194 | 7229 | 6204 | 6038 | 5760 | 6171 | 5702 | 5427
Population t+1 | 65386 | 67513 | 67487 | 66592 | 65452 | 65060 | 65530 | 65226 | 64649 | 63463
Balance | -649 | -514 | 210 | 807 | 471 | 144 | 154 | 937 | 1155 |
Indicator | 99.0% | 99.2% | 99.7% | 98.8% | 99.3% | 99.8% | 99.8% | 98.6% | 98.2% |
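The balancing rule can be worked through for a single year in a few lines of Python. The table adds a "Start observation" term to the demographic equation for the first year, and the Balance row equals the projected Population t+1 minus the opening population of the following year. The source does not state how the Indicator row is derived; the sketch assumes 1 − |Balance| / Population t+1, which reproduces the published percentages for the years checked but should be read as an assumption. Function names are hypothetical.

```python
def projected_population(population, start_observation, births,
                         immigration, deaths, emigration):
    """Population at t+1 implied by the demographic equation."""
    return (population + start_observation
            + births - deaths
            + immigration - emigration)


def balance_indicator(projected, next_opening_population):
    """Return (balance, indicator); the indicator formula is an assumption."""
    balance = projected - next_opening_population
    return balance, 1 - abs(balance) / projected


# Y2000 components from the table; 66035 is the Y2001 opening population.
projected_2000 = projected_population(0, 62633, 1675, 3887, 886, 1923)
balance_2000, indicator_2000 = balance_indicator(projected_2000, 66035)
print(projected_2000, balance_2000, f"{indicator_2000:.1%}")  # 65386 -649 99.0%
```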

Provision made for contextual factors such as change in HDSS boundary and loss to follow-up:

Component | Y2000 | Y2001 | Y2002 | Y2003 | Y2004 | Y2005 | Y2006 | Y2007 | Y2008 | Y2009
Population | 0 | 66033 | 68025 | 67266 | 65654 | 64263 | 63436 | 65376 | 64289 | 63494
Start observation | 62633 | 0 | 0 | 0 | 0 | 0 | 2239 | 0 | 0 | 0
Births | 1675 | 1723 | 1719 | 1640 | 1729 | 1696 | 1685 | 1692 | 1579 | 1101
Immigration | 3885 | 5689 | 7027 | 5974 | 4940 | 5069 | 5172 | 5242 | 5279 | 4992
Deaths | 886 | 1077 | 1098 | 1129 | 983 | 979 | 886 | 913 | 796 | 697
Emigration | 1923 | 4857 | 8194 | 7229 | 6204 | 6038 | 5760 | 6171 | 5702 | 5427
Loss to Follow-up | 56 | 50 | 96 | 78 | 93 | 67 | 204 | 538 | 635 | 36233
Population t+1 | 65328 | 67461 | 67383 | 66444 | 65043 | 63944 | 65682 | 64688 | 64014 | 27230
Balance | -705 | -564 | 117 | 790 | 780 | 508 | 306 | 399 | 520 |
Indicator | 98.9% | 99.2% | 99.8% | 98.8% | 98.8% | 99.2% | 99.5% | 99.4% | 99.2% |

c. Partially Dependent Attributes

The values of redundant and derived attributes are prescribed exactly by the dependency. Oftentimes, the relationships between attributes are not so exact. The value of one attribute may restrict possible values of another attribute to a smaller subset, but not to a single value.

Example : Certain causes of death are possible only for women or only for men, e.g. cancer of the cervix or causes related to maternal death.

Sex | n
FEM | 120
MAL | 1
Total | 121

The table counts deaths with causes that ought to be associated with women only.

Indicator = 1 − MaleDeaths / DeathsFemaleCauses = 99.2%

d. Conditional Optionality

Conditional optionality represents situations where the value of one attribute determines whether another attribute must be Null or not-Null (i.e., whether a value is prohibited or required). Technically speaking, attributes with conditional optionality are a special case of the partially dependent attributes discussed above.

e. Correlated Attributes

Values of one attribute can change the likelihood of values of another one, though not firmly restricting any possibilities. An example is the correlation between gender and first name. The majority of names are distinctly male or female. Thus there is a definite relationship between these attributes; however, the relationship is not exact in nature.

4.7 Total Data Quality Management : Theory

4.7.1 Learning Objectives

1. Able to identify the role players in data quality and their respective roles
2. Able to describe the basic principles of Total Data Quality Management
3. Able to list and describe the steps in the Ten Step Approach to Data Quality Improvement

4.7.2 Content

1. Role Players
   a. Data Collectors
   b. Data Custodians
   c. Data Consumers
2. Total Data Quality Management Cycle
   a. Define
   b. Measure
   c. Analyse
   d. Improve

4.7.3 Pre-reading and Reference Material

1. Carlo Batini, Monica Scannapieco. Data Quality. Concepts, Methodologies and Techniques. 2006. Springer Berlin. Pp 161-188.
2. Danette McGilvray. Executing Data Quality Projects. Ten Steps to Quality Data and Trusted Information. 2008. Morgan Kaufmann Burlington. Pp 54-58.

Appendix A : Sample Database

Tables and their columns:

Births: ResidentEpisode, Pregnancy, Birthweight
CensusRounds: CensusRound, StartDate, EndDate
Deaths: ResidentEpisode, DeathCause, DeathLocation
Individuals: Individual, LastName, FirstName, Sex, DoB, EndDate, MotherID, FatherID
InMigrations: ResidentEpisode, OriginLocation, OriginPlace, Reason
Locations: Location, Latitude, Longitude, StartDate, EndDate
Observations: Observation, Location, CensusRound, ObservationDate, Observer, ObservationType
OutMigrations: ResidentEpisode, DestinationLocation, DestinationPlace, Reason
Pregnancies: Pregnancy, Individual, StartDate, FirstObservation, EndDate, TerminatingEventType, LastObservation, StillBorn, LiveBorn, BirthAttendant, BirthLocation
ResidentEpisodes: ResidentEpisode, Individual, Location, StartDate, StartPrecision, InitiatingEventType, FirstObservation, EndDate, EndPrecision, TerminatingEventType, LastObservation
StatusObservations: StatusObservationID, Individual, Observation, MaritalStatus, EducationLevel


Appendix B : SQL Scripts

--region Attribute Domain Constraints
--region Optionality Constraints
--region Cause of Death example
SELECT DeathCause, COUNT(*) n
FROM dbo.Deaths D
GROUP BY DeathCause
ORDER BY DeathCause --COUNT(*) DESC

SELECT DeathCause, MAX(C.Description) Description, COUNT(*) n
FROM dbo.Deaths D
  LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY DeathCause
ORDER BY DeathCause

-- Final formulation
SELECT
  CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END Cause,
  COUNT(*) n
FROM dbo.Deaths D
  LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY
  CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END

-- Data Quality Trend
SELECT YEAR(E.EndDate) Year,
  CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END Cause,
  COUNT(*) n
FROM dbo.Deaths D
  JOIN dbo.ResidentEpisodes E ON D.ResidentEpisode=E.ResidentEpisode
  LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY YEAR(E.EndDate),
  CASE WHEN DeathCause IS NULL THEN 'Null' WHEN DeathCause<'A' THEN 'Unassigned' ELSE 'Assigned' END
ORDER BY YEAR(E.EndDate),Cause
--endregion

--region Complex example - Internal migration destination
-- Destination location for internal migrations
SELECT DestinationLocation, COUNT(*) n
FROM dbo.OutMigrations OM


  JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY DestinationLocation
ORDER BY COUNT(*) DESC

-- Grouped Destination
SELECT
  CASE
    WHEN DestinationLocation=999998 THEN 'Unknown'
    WHEN DestinationLocation IS NULL THEN 'Null'
    ELSE 'Known'
  END Destination,
  COUNT(*) n
FROM dbo.OutMigrations OM
  JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY
  CASE
    WHEN DestinationLocation=999998 THEN 'Unknown'
    WHEN DestinationLocation IS NULL THEN 'Null'
    ELSE 'Known'
  END

-- Further Investigation
SELECT
  CASE
    WHEN DestinationLocation=999998 THEN 'Unknown'
    WHEN DestinationLocation IS NULL THEN 'Null'
    WHEN L.Location IS NULL THEN 'Location wrong'
    ELSE 'Known'
  END Destination,
  COUNT(*) n
FROM dbo.OutMigrations OM
  JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
  LEFT JOIN dbo.Locations L ON OM.DestinationLocation=L.Location
WHERE TerminatingEventType='INT'
GROUP BY
  CASE
    WHEN DestinationLocation=999998 THEN 'Unknown'
    WHEN DestinationLocation IS NULL THEN 'Null'
    WHEN L.Location IS NULL THEN 'Location wrong'
    ELSE 'Known'
  END
--endregion
--endregion

--region Format Constraints
SELECT COUNT(*) Total,
  SUM(CASE WHEN PATINDEX('%[^a-zA-Z '']%',LastName)>0 THEN 1 ELSE 0 END) Invalid
FROM dbo.Individuals
--endregion

--region Valid Value Constraints
SELECT InitiatingEventType, COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY InitiatingEventType

SELECT YEAR(StartDate) Yr,
  CASE WHEN InitiatingEventType='HMS' THEN 'Invalid' ELSE 'Valid' END Validity,
  COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
  CASE


    WHEN InitiatingEventType='HMS' THEN 'Invalid' ELSE 'Valid' END
ORDER BY Yr,Validity

-- Birth Weight
SELECT Birthweight/100 W100q, COUNT(*) n
FROM dbo.Births B
  JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
WHERE StartDate BETWEEN '20000101' AND '20101231'
GROUP BY Birthweight/100
ORDER BY Birthweight/100
--endregion

--region Precision and Granularity Constraints
--region Date of Birth
-- Birth Date
SELECT StartPrecision, COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
GROUP BY StartPrecision
ORDER BY StartPrecision
--endregion

--region Complex example Migration Date Precision
-- InMigration
SELECT StartPrecision, COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM'
GROUP BY StartPrecision
ORDER BY StartPrecision

-- Internal InMigration
SELECT StartPrecision, COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INT'
GROUP BY StartPrecision
ORDER BY StartPrecision

-- OutMigration
SELECT EndPrecision, COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM'
GROUP BY EndPrecision
ORDER BY EndPrecision

-- Internal OutMigration
SELECT EndPrecision, COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='INT'
GROUP BY EndPrecision
ORDER BY EndPrecision

-- Migration Precision by Time
WITH InPrecision AS (


  SELECT YEAR(StartDate) Yr, StartPrecision [Precision], COUNT(*) n
  FROM dbo.ResidentEpisodes
  WHERE InitiatingEventType IN ('INM','INT')
  GROUP BY YEAR(StartDate),StartPrecision
),
OutPrecision AS (
  SELECT YEAR(EndDate) Yr, EndPrecision [Precision], COUNT(*) n
  FROM dbo.ResidentEpisodes
  WHERE TerminatingEventType IN ('INT','OTM')
  GROUP BY YEAR(EndDate),EndPrecision
),
InScore AS (
  SELECT Yr, SUM((10-[Precision])*n) Score, SUM(9*n) MaxScore
  FROM InPrecision
  GROUP BY Yr
),
OutScore AS (
  SELECT Yr, SUM((10-[Precision])*n) Score, SUM(9*n) MaxScore
  FROM OutPrecision
  GROUP BY Yr
)
SELECT I.Yr,
  SUM(ISNULL(I.Score,0)+ISNULL(O.Score,0)),
  SUM(ISNULL(I.MaxScore,0)+ISNULL(O.MaxScore,0))
FROM InScore I
  JOIN OutScore O ON (I.Yr=O.Yr)
GROUP BY I.Yr
ORDER BY I.Yr
--endregion
--endregion
--endregion

--region Relational Integrity Constraints
--region Identity Rules
-- Duplicate Individuals
SELECT *
INTO IndividualComparison
FROM dbo.udfSeekDuplicates()

SELECT Similarity, COUNT(*) n
FROM dbo.IndividualComparison
GROUP BY Similarity
ORDER BY Similarity

SELECT C.IndA,C.IndB,
  I1.FirstName FirstNameA, I2.FirstName FirstNameB,
  I1.LastName LastNameA, I2.LastName LastNameB,
  I1.Sex SexA, I2.Sex SexB,
  I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
  JOIN dbo.Individuals I1 ON (C.IndA=I1.Individual)


  JOIN dbo.Individuals I2 ON (C.IndB=I2.Individual)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB

--region AC Specific
SELECT C.IndA,C.IndB,
  I1.Name NameA, I2.Name NameB,
  I1.Sex SexA, I2.Sex SexB,
  I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
  JOIN ACDIS.dbo.vacNamedIndividuals I1 ON (C.IndA=I1.IIntID)
  JOIN ACDIS.dbo.vacNamedIndividuals I2 ON (C.IndB=I2.IIntID)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--endregion

SELECT COUNT(*)
FROM dbo.Individuals
--endregion

--region Reference Rules
-- Child to Parent linkages
-- MotherId on Child
SELECT
  CASE WHEN C.MotherID IS NULL THEN 'Unknown' WHEN M.Individual IS NULL THEN 'Missing' ELSE 'Known' END Mother,
  COUNT(*) n
FROM dbo.Individuals C
  LEFT JOIN dbo.Individuals M ON (C.MotherID=M.Individual)
GROUP BY
  CASE WHEN C.MotherID IS NULL THEN 'Unknown' WHEN M.Individual IS NULL THEN 'Missing' ELSE 'Known' END

-- FatherId on Child
SELECT
  CASE WHEN C.FatherID IS NULL THEN 'Unknown' WHEN F.Individual IS NULL THEN 'Missing' ELSE 'Known' END Father,
  COUNT(*) n
FROM dbo.Individuals C
  LEFT JOIN dbo.Individuals F ON (C.FatherID=F.Individual)
GROUP BY
  CASE WHEN C.FatherID IS NULL THEN 'Unknown' WHEN F.Individual IS NULL THEN 'Missing' ELSE 'Known' END
--endregion

--region Cardinal Rules
-- Incorrect formulation
SELECT
  CASE WHEN R.Individual IS NULL THEN 'None' ELSE 'Exists' END Residency,
  COUNT(*) n
FROM dbo.Individuals I
  LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY
  CASE WHEN R.Individual IS NULL THEN 'None' ELSE 'Exists' END

--region Correct formulation


WITH UniqueResidencies AS (
  SELECT DISTINCT Individual
  FROM dbo.ResidentEpisodes
)
SELECT
  CASE WHEN R.Individual IS NULL THEN 'None' ELSE 'Exists' END Residency,
  COUNT(*) n
FROM dbo.Individuals I
  LEFT JOIN UniqueResidencies R ON (I.Individual=R.Individual)
GROUP BY
  CASE WHEN R.Individual IS NULL THEN 'None' ELSE 'Exists' END

-- Residency Cardinality
WITH ResidencyCount AS (
  SELECT I.Individual, COUNT(*) n
  FROM dbo.Individuals I
    JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  GROUP BY I.Individual
  UNION
  SELECT I.Individual, 0 n
  FROM dbo.Individuals I
    LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  WHERE R.Individual IS NULL
)
SELECT n ResidencyCardinality, COUNT(*) Cnt
FROM ResidencyCount
GROUP BY n
ORDER BY n
--endregion
--endregion
--endregion

--region Rules for Historical Data
--region Currency Rule
-- Last visit of current residency episodes
SELECT CensusRound, MIN(ObservationDate) MinDate, MAX(ObservationDate) MaxDate
FROM dbo.Observations
GROUP BY CensusRound
ORDER BY CensusRound

-- Start of previous round 13 Jul 2009
SELECT
  CASE WHEN EndDate>'20090712' THEN 'Current' ELSE 'Not Current' END Currency,
  COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='VIS'
GROUP BY
  CASE
    WHEN EndDate>'20090712' THEN 'Current'


    ELSE 'Not Current'
  END

-- Currency of Statusobservation, e.g. MaritalStatus
WITH YearEnds AS (
  SELECT CAST('20001231' AS datetime) YearEnd
  UNION SELECT CAST('20011231' AS datetime) YearEnd
  UNION SELECT CAST('20021231' AS datetime) YearEnd
  UNION SELECT CAST('20031231' AS datetime) YearEnd
  UNION SELECT CAST('20041231' AS datetime) YearEnd
  UNION SELECT CAST('20051231' AS datetime) YearEnd
  UNION SELECT CAST('20061231' AS datetime) YearEnd
  UNION SELECT CAST('20071231' AS datetime) YearEnd
  UNION SELECT CAST('20081231' AS datetime) YearEnd
  UNION SELECT CAST('20091231' AS datetime) YearEnd
),
YearEndIndividuals AS (
  SELECT DISTINCT Individual,YearEnd
  FROM dbo.ResidentEpisodes R
    CROSS JOIN YearEnds
  WHERE R.EndDate>=YearEnd AND R.StartDate<YearEnd
),
SOCurrency AS (
  SELECT S.Individual,YearEnd,
    MIN(DateDiff(day,O.ObservationDate,YearEnd)) Currency
  FROM dbo.StatusObservations S
    JOIN dbo.Observations O ON (S.Observation=O.Observation)
    JOIN YearEndIndividuals I ON (S.Individual=I.Individual) AND (O.ObservationDate<=I.YearEnd)
  GROUP BY S.Individual,YearEnd
)
SELECT I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END Currency,
  COUNT(*) n
FROM YearEndIndividuals I
  LEFT JOIN SOCurrency C ON (I.Individual=C.Individual) AND (I.YearEnd=C.YearEnd)
GROUP BY I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END
ORDER BY I.YearEnd,
  CASE
    WHEN C.Currency IS NULL THEN 'Undefined'
    WHEN C.Currency>183 THEN 'NotCurrent'
    ELSE 'Current'
  END
--endregion


--region Granularity Rule
WITH NumberedVisits AS (
  SELECT Location, CensusRound, ObservationDate,
    ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
    COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
  FROM dbo.Observations
  WHERE CensusRound BETWEEN 1 AND 21
    AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
  SELECT Location, CensusRound, CAST(ObservationDate AS float) fDate, RowNum, Cnt
  FROM NumberedVisits
  WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
  SELECT Location, CensusRound, AVG(fDate) mDate
  FROM MidVisits
  GROUP BY Location, CensusRound
),
MedianVisits AS (
  SELECT Location, CensusRound, CONVERT(datetime,mDate) MedianDate
  FROM MedianVisitDate
),
VisitGaps AS (
  SELECT R1.Location, R1.CensusRound Rn, R2.CensusRound Rnn,
    DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
  FROM MedianVisits R1
    JOIN MedianVisits R2 ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
)
SELECT Rn,Rnn,Granularity,COUNT(*) n
FROM VisitGaps
GROUP BY Rn,Rnn,Granularity
ORDER BY Rn,Rnn,Granularity

-- Quality indicator based on granularity
-- Gap should be +-15 days within 183 (twice yearly rounds)
SELECT Rnn CensusRound,
  CASE
    WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
    WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
    ELSE 'Outside'
  END Indicator,
  COUNT(*) n
FROM dbo.vLocationVisitGaps
GROUP BY Rnn,
  CASE
    WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
    WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
    ELSE 'Outside'
  END
ORDER BY Rnn, Indicator
--endregion


--region Continuity Rule
WITH NumberedEpisodes AS (
  SELECT Individual, StartDate, InitiatingEventType, EndDate, TerminatingEventType,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY StartDate) RowNum
  FROM dbo.ResidentEpisodes
)
SELECT YEAR(E2.StartDate) Yr,
  CASE
    WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
    WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
    ELSE 'Continuity'
  END Continuity,
  COUNT(*) n
FROM NumberedEpisodes E1
  JOIN NumberedEpisodes E2 ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E1.TerminatingEventType='INT'
GROUP BY YEAR(E2.StartDate),
  CASE
    WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
    WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
    ELSE 'Continuity'
  END
ORDER BY Yr, Continuity
--endregion

--region Timestamp pattern rule
WITH NumberedVisits AS (
  SELECT Location, CensusRound, ObservationDate,
    ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
    COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
  FROM dbo.Observations
  WHERE CensusRound BETWEEN 1 AND 21
    AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
  SELECT Location, CensusRound, CAST(ObservationDate AS float) fDate, RowNum, Cnt
  FROM NumberedVisits
  WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
  SELECT Location, CensusRound, AVG(fDate) mDate
  FROM MidVisits
  GROUP BY Location, CensusRound
),
MedianVisits AS (
  SELECT Location, CensusRound, CONVERT(datetime,mDate) MedianDate
  FROM MedianVisitDate
),
Semesters AS (
  SELECT 1 AS Semester,
    CAST('20000101' AS datetime) SemStart,
    DATEADD(day,-1,DATEADD(quarter,2,'20000101')) SemEnd


  UNION ALL
  SELECT Semester+1 Semester,
    DATEADD(day,1,SemEnd) SemStart,
    DATEADD(day,-1,DATEADD(quarter,2,DATEADD(day,1,SemEnd))) SemEnd
  FROM Semesters
  WHERE SemStart<'20090701'
),
SemesterVisits AS (
  SELECT Location,Semester,COUNT(*) n
  FROM MedianVisits V
    JOIN Semesters ON (MedianDate>=Semstart) AND (MedianDate<=SemEnd)
  GROUP BY Location,Semester
)
SELECT *
FROM SemesterVisits
ORDER BY Location,Semester
--endregion
--endregion

--region Value Pattern Rule
--region Direction of Change
-- Example : Educational Attainment
WITH EducationStatus AS (
  SELECT Individual,ObservationDate,Years,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
  FROM dbo.StatusObservations SO
    JOIN dbo.Observations O ON (SO.Observation=O.Observation)
    JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
  WHERE NOT E.Years IS NULL
)
SELECT
  CASE WHEN E2.Years>=E1.Years THEN 'Valid' ELSE 'Invalid' END Direction,
  COUNT(*) Measures
FROM EducationStatus E1
  JOIN EducationStatus E2 ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY
  CASE WHEN E2.Years>=E1.Years THEN 'Valid' ELSE 'Invalid' END
--endregion

--region Magnitude of Change
WITH EducationStatus AS (
  SELECT Individual,ObservationDate,Years,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
  FROM dbo.StatusObservations SO
    JOIN dbo.Observations O ON (SO.Observation=O.Observation)
    JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
  WHERE NOT E.Years IS NULL
)
SELECT
  CASE
    WHEN E2.Years<E1.Years THEN 'Invalid Direction'
    WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate) THEN 'Invalid Magnitude'
    ELSE 'Valid'


  END Direction,
  COUNT(*) Measures
FROM EducationStatus E1
  JOIN EducationStatus E2 ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY
  CASE
    WHEN E2.Years<E1.Years THEN 'Invalid Direction'
    WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate) THEN 'Invalid Magnitude'
    ELSE 'Valid'
  END
--endregion
--endregion

--region Event History Rule
--region Event Dependencies
-- Out migration not preceded by Death, Visit or Outmigration
WITH Events AS (
  SELECT Individual, InitiatingEventType Event, StartDate EventDate, ResidentEpisode
  FROM dbo.ResidentEpisodes
  WHERE StartDate<>Enddate
  UNION ALL
  SELECT Individual, TerminatingEventType Event, EndDate EventDate, ResidentEpisode
  FROM dbo.ResidentEpisodes
  WHERE StartDate<>Enddate
),
NumberedEvents AS (
  SELECT Individual, Event,EventDate,
    ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY EventDate, ResidentEpisode) RowNum
  FROM Events
)
SELECT
  CASE WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect' ELSE 'Correct' END Dependency,
  --E1.Event,
  COUNT(*) n
FROM NumberedEvents E1
  JOIN NumberedEvents E2 ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E2.Event='OTM'
GROUP BY
  --E1.Event
  CASE WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect' ELSE 'Correct' END
--endregion

--region Event Conditions
-- Pregnancies with live births should be spaced by 9 months (280 days)
WITH NumberedPregnancies AS (
  SELECT Individual,EndDate DeliveryDate,
    ROW_NUMBER()


      OVER(PARTITION BY Individual ORDER BY EndDate) RowNum
  FROM dbo.Pregnancies
  WHERE LiveBorn>0
)
SELECT
  CASE WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort' ELSE 'Valid' END BirthSpacing,
  COUNT(*) Pregnancies
FROM NumberedPregnancies P1
  JOIN NumberedPregnancies P2 ON (P1.Individual=P2.Individual) AND (P1.RowNum=P2.RowNum-1)
GROUP BY
  CASE WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort' ELSE 'Valid' END
--endregion

--region Event-specific attribute constraints
-- Age of the mother at the end of the pregnancy should be between 15 and 49 years
SELECT
  CASE WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid' ELSE 'Invalid' END MaternalAge,
  COUNT(*) Pregnancies
FROM dbo.Pregnancies P
  JOIN dbo.Individuals I ON (P.Individual=I.Individual)
GROUP BY
  CASE WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid' ELSE 'Invalid' END
--endregion
--endregion
--endregion

--region Rules for state-dependent objects
--region State domain constraint
--endregion
--region Action domain constraint
--endregion
--region Terminator domain constraint
SELECT ToState,Action,COUNT(*) n
FROM dbo.udfStateTransitions('20000101')
WHERE Transition=1
GROUP BY ToState,Action
ORDER BY ToState,Action
--endregion

--region State-transition constraints
-- Individuals with invalid end states
WITH LastTransition AS (
  SELECT Individual,MAX(Transition) LastTransition
  FROM dbo.udfStateTransitions('20000101')
  GROUP BY Individual
)
SELECT
  CASE WHEN ToState='INV' THEN 'Invalid' ELSE 'Valid' END Quality,
  COUNT(*) Individuals
FROM dbo.udfStateTransitions('20000101') T
  JOIN LastTransition LT ON (T.Individual=LT.Individual) AND (T.Transition=LT.LastTransition)
GROUP BY


  CASE WHEN ToState='INV' THEN 'Invalid' ELSE 'Valid' END

-- Invalid transitions
SELECT
  CASE WHEN ToState='INV' THEN 'Invalid' ELSE 'Valid' END Quality,
  COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
GROUP BY
  CASE WHEN ToState='INV' THEN 'Invalid' ELSE 'Valid' END

-- Breakdown of invalid transitions
SELECT InvalidReason,Action, COUNT(*) n
FROM dbo.udfStateTransitions('20000101') T
WHERE ToState='INV'
GROUP BY InvalidReason, Action
ORDER BY InvalidReason, Action

-- Breakdown by surveillance round
SELECT O.CensusRound,
  SUM(CASE WHEN ToState='INV' THEN 1 ELSE 0 END) Invalid,
  SUM(CASE WHEN ToState='INV' THEN 0 ELSE 1 END) Valid,
  COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
  JOIN dbo.Observations O ON (T.Observation=O.Observation)
GROUP BY O.CensusRound
ORDER BY O.CensusRound
--endregion

--region State-action constraints
--endregion
--region Continuity rules
--endregion
--region Duration rules
-- Residency episode cannot be of zero or negative duration
SELECT YEAR(StartDate) Yr,
  CASE WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid' ELSE 'Invalid' END Duration,
  COUNT(*) Episodes
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
  CASE WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid' ELSE 'Invalid' END
ORDER BY Yr,Duration
--endregion

--region Action pre-conditions
WITH ResidentBabies AS (
  SELECT


    Individual Baby,StartDate
  FROM dbo.ResidentEpisodes
  WHERE InitiatingEventType='DLV'
),
ResidentBabyMothers AS (
  SELECT DISTINCT B.Baby, I.MotherID Mother
  FROM dbo.Individuals I
    JOIN ResidentBabies B ON (I.Individual=B.Baby)
  WHERE NOT MotherID IS NULL
)
SELECT
  CASE
    WHEN BM.Baby IS NULL THEN 'Mother unknown'
    WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
    ELSE 'Mother resident'
  END Mothers,
  COUNT(*) Babies
FROM ResidentBabies B
  LEFT JOIN ResidentBabyMothers BM ON (B.Baby=BM.Baby)
  LEFT JOIN dbo.ResidentEpisodes RE ON (BM.Mother=RE.Individual)
    AND (RE.StartDate<=B.StartDate) AND (RE.EndDate>=B.StartDate)
GROUP BY
  CASE
    WHEN BM.Baby IS NULL THEN 'Mother unknown'
    WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
    ELSE 'Mother resident'
  END
--endregion

--region Action post-conditions
--endregion
--endregion

--region General attribute dependency rules
--region Redundant attributes
-- Are all cases where residency is started by birth
-- which is linked to a pregnancy and then to the mother,
-- also reflected in the MotherID link of the child?
WITH DirectMCLink AS ( --76898 pairs
  SELECT MotherID, Individual ChildID
  FROM dbo.Individuals
  WHERE NOT MotherID IS NULL
),
IndirectMCLink AS ( --15747
  SELECT DISTINCT P.Individual MotherID, R.Individual ChildID
  FROM dbo.Pregnancies P
    JOIN dbo.Births B ON (P.Pregnancy=B.Pregnancy)
    JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
  CASE WHEN D.MotherID IS NULL THEN 'Not linked' ELSE 'Linked' END Link,
  COUNT(*) Pairs
FROM IndirectMCLink I


  LEFT JOIN DirectMCLink D ON (I.MotherID=D.MotherID) AND (I.ChildID=D.ChildID)
GROUP BY
  CASE WHEN D.MotherID IS NULL THEN 'Not linked' ELSE 'Linked' END

-- Of the children born to the mother while she was resident,
-- are all such children recorded as resident
-- and the residency start marked as Birth?
WITH MotherBirths AS ( --21907
  SELECT MotherID, Individual ChildID, DoB
  FROM dbo.Individuals
  WHERE NOT MotherID IS NULL
    AND DoB>='20000101' -- After start of DSS
),
BirthsDuringResidency AS ( --15798
  SELECT B.*
  FROM MotherBirths B
    JOIN dbo.ResidentEpisodes R ON (R.Individual=B.MotherID)
      AND (B.DoB>=R.StartDate) AND (B.DoB<=R.EndDate)
),
ResidenciesFromBirth AS ( --16430
  SELECT Individual ChildID
  FROM dbo.Births B
    JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
  CASE
    WHEN A.ChildID IS NULL THEN 'Birth not linked'
    WHEN B.ChildID IS NULL THEN 'Birth not resident'
    ELSE 'Consistent'
  END Link,
  COUNT(*) Pairs
FROM BirthsDuringResidency A
  FULL JOIN ResidenciesFromBirth B ON (A.ChildID=B.ChildID)
GROUP BY
  CASE
    WHEN A.ChildID IS NULL THEN 'Birth not linked'
    WHEN B.ChildID IS NULL THEN 'Birth not resident'
    ELSE 'Consistent'
  END
ORDER BY Link
--endregion

--region Derived Attributes
-- Data should satisfy the demographic equation
-- Resident Population at start of year
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,


  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,


  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'

-- Taking into account contextual factors, such as change in DSS boundary
WITH CensoredEpisodes AS (
  SELECT
    CASE WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS' ELSE R.InitiatingEventType END InitiatingEventType,
    CASE WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001' ELSE R.StartDate END StartDate,
    R.EndDate,
    R.TerminatingEventType
  FROM dbo.ResidentEpisodes R
    JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,


  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,


  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS';
--
-- Taking into account contextual factors and loss to follow-up
--
WITH CensoredEpisodes AS (
  SELECT
    CASE WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS' ELSE R.InitiatingEventType END InitiatingEventType,
    CASE WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001' ELSE R.StartDate END StartDate,
    R.EndDate,
    R.TerminatingEventType
  FROM dbo.ResidentEpisodes R
    JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT 'Population' AS Component,
  SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0 END) Y2006,

  SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
UNION
SELECT 'Start observation' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT 'Births' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT 'Immigration' AS Component,
  SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT 'Deaths' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,

  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT 'Emigration' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
UNION
SELECT 'Loss to Follow-up' AS Component,
  SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
  SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
  SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
  SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
  SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
  SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
  SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
  SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
  SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
  SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='VIS';
--
-- Find 705 people present in 2001 in excess of expectations
--
WITH PresentIn2001 AS (
  SELECT DISTINCT Individual
  FROM dbo.ResidentEpisodes
  WHERE StartDate<'20010101' AND EndDate>='20010101'
),
CameIn2000 AS (
  SELECT DISTINCT Individual
  FROM dbo.ResidentEpisodes
  WHERE YEAR(StartDate)=2000 AND InitiatingEventType IN ('DSS','DLV','INM')
),
LeftIn2000 AS (
  SELECT DISTINCT Individual
  FROM dbo.ResidentEpisodes
  WHERE YEAR(EndDate)=2000 AND TerminatingEventType IN ('DTH','VIS','OTM')
)
SELECT A.Individual

FROM PresentIn2001 A
  JOIN LeftIn2000 B ON (A.Individual=B.Individual)
--
SELECT *
FROM dbo.ResidentEpisodes
WHERE Individual=56179
--endregion
--region Partially Dependent Attributes
SELECT D.DeathCause, C.Description, COUNT(*) n
FROM dbo.Deaths D
  JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY D.DeathCause,C.Description
ORDER BY n DESC
--
SELECT I.Sex, COUNT(*) n
FROM dbo.Individuals I
  JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
WHERE DeathCause IN ('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
GROUP BY I.Sex
--
SELECT I.Sex,C.Description
FROM dbo.Individuals I
  JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
  JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
WHERE DeathCause IN ('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
  AND I.Sex='MAL'
--endregion
--endregion
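
The last query above lists female-specific causes of death recorded against males. The reverse check can be sketched in the same way. This is an illustration only, not part of the workshop script: it assumes the same tables and three-character ICD-10 codes, and the code list (C60, C61 and C62: malignant neoplasms of the penis, prostate and testis) is chosen here purely as an example.

```sql
-- Hypothetical counterpart check: male-specific causes of death
-- recorded for individuals whose sex is not coded 'MAL'.
-- Rows with a NULL Sex are not returned by the <> comparison.
SELECT I.Individual, I.Sex, D.DeathCause, C.Description
FROM dbo.Individuals I
  JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
  JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
  JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
WHERE D.DeathCause IN ('C60','C61','C62')
  AND I.Sex<>'MAL';
```

Any rows returned point to an error in either the recorded sex or the assigned cause of death and need to be resolved against the source documents.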

Procedures, Views and User-defined Functions

CREATE FUNCTION dbo.udfStateTransitions(@DSSStart datetime)
RETURNS @Transitions TABLE (
  [RecID] int IDENTITY (1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,
  Individual int NOT NULL,
  Transition int NOT NULL,
  FromState char(3) NOT NULL,
    --NUS (Not under surveillance)
    --SLK (under surveillance location known)
    --SLU (under surveillance location unknown)
    --DTH (Death)
    --INV (Invalid state)
  ToState char(3) NOT NULL,
  Action char(3) NOT NULL,
    --DSS (Surveillance Start),
    --INM (Inmigration),
    --DLV (Delivery),
    --INT (Internal migration),
    --DTH (Death),
    --OTM (Outmigration),
    --VIS (Visit),
    --INV (Invalid action)
  TransitionDate datetime NOT NULL,
  Observation int NOT NULL,
  InvalidReason varchar(80) NULL
)

AS
BEGIN
  DECLARE @Individual int
  DECLARE @DoB datetime
  DECLARE @InitiatingEventType char(3)
  DECLARE @StartDate datetime
  DECLARE @TerminatingEventType char(3)
  DECLARE @EndDate datetime
  DECLARE @Transition int
  DECLARE @NextState char(3)
  DECLARE @CurrentState char(3)
  DECLARE @LastEvent char(3)
  DECLARE @LastIndividual int
  DECLARE @LastDate datetime
  DECLARE @FirstObservation int
  DECLARE @LastObservation int
  DECLARE C CURSOR LOCAL FAST_FORWARD FOR
    SELECT I.Individual,DoB,
      InitiatingEventType,StartDate,FirstObservation,
      TerminatingEventType,R.EndDate,LastObservation
    FROM dbo.Individuals I
      JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
    ORDER BY Individual,StartDate,ResidentEpisode;
  OPEN C;
  SET @LastIndividual=-1;
  FETCH C INTO @Individual, @DoB, @InitiatingEventType, @StartDate, @FirstObservation,
    @TerminatingEventType, @EndDate, @LastObservation
  WHILE (@@FETCH_STATUS=0) BEGIN
    IF (@LastIndividual<>@Individual) BEGIN --next individual
      SET @CurrentState='NUS';
      SET @LastDate=@DoB;
      SET @Transition=0;
      SET @LastIndividual=@Individual
    END;
    -- Do start event transition
    SET @Transition = @Transition+1;
    IF (@CurrentState='NUS') BEGIN
      IF (@InitiatingEventType='DSS' AND @StartDate=@DSSStart AND @Transition=1 AND @LastDate<=@StartDate) BEGIN
        SET @NextState='SLK';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,@FirstObservation);
      END
      ELSE IF (@InitiatingEventType='INM' AND @StartDate>@DSSStart AND @LastDate<=@StartDate) BEGIN
        SET @NextState='SLK';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,@FirstObservation);
      END
      ELSE IF (@InitiatingEventType='DLV' AND @StartDate=@DoB AND @Transition=1 AND @LastDate<=@StartDate) BEGIN
        SET @NextState='SLK';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)

        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,@FirstObservation);
      END
      ELSE IF (@InitiatingEventType IN ('INT','DTH','OTM','VIS')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action disallowed if not under surveillance',@FirstObservation);
      END
      ELSE IF (@InitiatingEventType IN ('DSS','INM','DLV')) BEGIN --Invalid action condition
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action condition violated',@FirstObservation);
      END
      ELSE IF (@LastDate>@StartDate) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Temporal integrity violated',@FirstObservation);
      END
      ELSE BEGIN --Invalid event
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action',@FirstObservation);
      END;
    END;
    IF (@CurrentState='SLK') BEGIN
      IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency',@FirstObservation);
      END
      ELSE IF (@InitiatingEventType IN ('INT','INM','DLV','DSS')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency if already at known location',@FirstObservation);
      END

      ELSE BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action',@FirstObservation);
      END
    END;
    IF (@CurrentState='SLU') BEGIN
      IF (@InitiatingEventType='INT' AND @LastDate<=@StartDate) BEGIN
        SET @NextState='SLK';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,@FirstObservation);
      END
      ELSE IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency',@FirstObservation);
      END
      ELSE IF (@InitiatingEventType IN ('INM','DLV','DSS')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Action cannot start a residency if at unknown location',@FirstObservation);
      END
      ELSE IF (@LastDate>@StartDate) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Temporal integrity violated',@FirstObservation);
      END
      ELSE BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'Invalid action',@FirstObservation);
      END
    END;
    IF (@CurrentState='DTH') BEGIN
      SET @NextState='INV';
      INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)

      VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,@StartDate,'No transitions after terminating state',@FirstObservation);
    END;
    SET @LastDate=@StartDate;
    SET @CurrentState=@NextState;
    SET @Transition=@Transition+1;
    -- Do end event transition
    IF (@CurrentState='NUS') BEGIN
      SET @NextState='INV';
      INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
      VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be not under surveillance before residency end',@LastObservation);
    END;
    IF (@CurrentState='SLK') BEGIN
      IF (@TerminatingEventType='INT' AND @LastDate<@EndDate) BEGIN
        SET @NextState='SLU';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,@LastObservation);
      END
      ELSE IF (@TerminatingEventType='OTM' AND @LastDate<@EndDate) BEGIN
        SET @NextState='NUS';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,@LastObservation);
      END
      ELSE IF (@TerminatingEventType='VIS' AND @LastDate<=@EndDate) BEGIN
        SET @NextState='SLK';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,@LastObservation);
      END
      ELSE IF (@TerminatingEventType='DTH' AND @LastDate<=@EndDate) BEGIN
        SET @NextState='DTH';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,@LastObservation);
      END
      ELSE IF (@TerminatingEventType IN ('INM','DSS','DLV')) BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Action cannot end a residency',@LastObservation);
      END
      ELSE IF (@LastDate>=@EndDate) BEGIN
        SET @NextState='INV';

        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Temporal integrity violated',@LastObservation);
      END
      ELSE BEGIN
        SET @NextState='INV';
        INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
        VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Invalid action',@LastObservation);
      END
    END;
    IF (@CurrentState='SLU') BEGIN
      SET @NextState='INV';
      INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
      VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be at unknown location before residency end',@LastObservation);
    END;
    IF (@CurrentState='DTH') BEGIN
      SET @NextState='INV';
      INSERT INTO @Transitions (Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
      VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,@EndDate,'Cannot be dead before residency end',@LastObservation);
    END;
    SET @LastDate=@EndDate;
    SET @CurrentState=@NextState;
    FETCH C INTO @Individual, @DoB, @InitiatingEventType, @StartDate, @FirstObservation,
      @TerminatingEventType, @EndDate, @LastObservation
  END;
  CLOSE C;
  DEALLOCATE C;
  RETURN
END
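
Once dbo.udfStateTransitions has been created it can be queried like a table. The following is a minimal usage sketch, not part of the workshop script, that summarises the invalid transitions by reason. The surveillance start date '20000101' is an assumption made for illustration; replace it with the site's actual start date.

```sql
-- Count invalid transitions by reason, starting state and action.
-- '20000101' is a placeholder surveillance start date (assumption).
SELECT T.InvalidReason, T.FromState, T.Action, COUNT(*) AS n
FROM dbo.udfStateTransitions('20000101') T
WHERE T.ToState='INV'
GROUP BY T.InvalidReason, T.FromState, T.Action
ORDER BY n DESC;
```

Because the function walks every residency episode with a cursor, it may be slow on a large database. Storing its output in a work table before running several summaries against it is one way to avoid recomputing it.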

CREATE FUNCTION dbo.udfSeekDuplicates ()
RETURNS @Duplicates TABLE (
  IndA int,
  IndB int,
  Similarity int
)
AS
BEGIN
  DECLARE @Individual int
  DECLARE @LastName varchar(50)
  DECLARE @FirstName varchar(50)
  DECLARE @Sex char(3)
  DECLARE @DoB datetime
  DECLARE C CURSOR LOCAL FAST_FORWARD FOR
    SELECT Individual,LastName,FirstName,Sex,DoB
    FROM dbo.Individuals

    ORDER BY Individual
  OPEN C;
  FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;
  WHILE (@@FETCH_STATUS=0) BEGIN
    INSERT INTO @Duplicates
    SELECT @Individual,Individual,
      dbo.fnacLevenshtein(@LastName,LastName)+
      dbo.fnacLevenshtein(@FirstName,FirstName)+
      CASE WHEN @Sex=Sex THEN 0 ELSE 1 END +
      ABS(YEAR(@DoB)-YEAR(DoB)) +
      ABS(MONTH(@DoB)-MONTH(DoB)) +
      ABS(DAY(@DoB)-DAY(DoB))
    FROM dbo.Individuals
    WHERE @Individual<Individual -- Do not re-evaluate inverse
      AND ABS(DATEDIFF(day,@DoB,DoB))<366
      AND dbo.fnacLevenshtein(@LastName,LastName)<10
      AND dbo.fnacLevenshtein(@FirstName,FirstName)<5
    FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;
  END;
  CLOSE C;
  DEALLOCATE C;
  RETURN
END
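
In dbo.udfSeekDuplicates a lower Similarity score means the two records are more alike, since the score is a sum of name edit distances and differences in sex and date of birth. A minimal usage sketch follows, not part of the workshop script. It assumes that dbo.fnacLevenshtein, which is not defined in this listing, already exists and returns an edit distance; the cut-off of 3 and the limit of 100 rows are arbitrary values chosen for illustration.

```sql
-- List the most similar pairs of individuals side by side for manual review.
SELECT TOP (100)
  D.Similarity,
  A.Individual AS IndA, A.FirstName AS FirstNameA, A.LastName AS LastNameA, A.Sex AS SexA, A.DoB AS DoBA,
  B.Individual AS IndB, B.FirstName AS FirstNameB, B.LastName AS LastNameB, B.Sex AS SexB, B.DoB AS DoBB
FROM dbo.udfSeekDuplicates() D
  JOIN dbo.Individuals A ON (A.Individual=D.IndA)
  JOIN dbo.Individuals B ON (B.Individual=D.IndB)
WHERE D.Similarity<=3 -- arbitrary cut-off (assumption)
ORDER BY D.Similarity, A.Individual, B.Individual;
```

A low score only marks a pair as a candidate duplicate; twins, for example, may legitimately score low.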

CREATE VIEW dbo.vLocationVisitGaps
AS
WITH NumberedVisits AS (
  SELECT Location, CensusRound, ObservationDate,
    ROW_NUMBER() OVER(PARTITION BY Location, CensusRound ORDER BY ObservationDate) RowNum,
    COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
  FROM dbo.Observations
  WHERE CensusRound BETWEEN 1 AND 21
    AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
  SELECT Location, CensusRound, CAST(ObservationDate AS float) fDate, RowNum, Cnt
  FROM NumberedVisits
  WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
  SELECT Location, CensusRound, AVG(fDate) mDate
  FROM MidVisits
  GROUP BY Location, CensusRound
),
MedianVisits AS (
  SELECT Location, CensusRound, CONVERT(datetime,mDate) MedianDate
  FROM MedianVisitDate
)
SELECT R1.Location, R1.CensusRound Rn,

  R2.CensusRound Rnn,
  DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
  JOIN MedianVisits R2 ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
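
The view returns one row per location and pair of consecutive census rounds, with Granularity being the number of days between the median visit dates of the two rounds. A minimal usage sketch follows, not part of the workshop script, that summarises the gap per pair of rounds; the output column names are chosen here for illustration.

```sql
-- Summarise the gap in days between consecutive rounds across all locations.
SELECT Rn, Rnn,
  COUNT(*) AS Locations,
  MIN(Granularity) AS MinDays,
  AVG(CAST(Granularity AS float)) AS MeanDays, -- cast avoids integer averaging
  MAX(Granularity) AS MaxDays
FROM dbo.vLocationVisitGaps
GROUP BY Rn, Rnn
ORDER BY Rn;
```

Negative or very large values of MinDays or MaxDays suggest that observations have been assigned to the wrong census round or carry a wrong date.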
