Data Management for Longitudinal Data

20.7.04: LSS 1

Longitudinal Studies Seminars: Longitudinal Analyses Using STATA

Stirling University, 20.7.04

Data and Variable Management

Paul Lambert

20.7.04: LSS 2

Data Management for Longitudinal Data

1. The Nature of ‘Large and Complex’ Data

2. Data management & STATA – getting started

3. Longitudinal Data Types

4. Merging Datasets

20.7.04: LSS 3

The nature of ‘large and complex’ longitudinal resources: complicating the variable by case matrix

Cases   Variables
1       1   17   1.73   A   .   .   .   .
2       1   18   1.85   B   .   .   .   .
3       2   17   1.60   C   .   .   .   .
4       2   18   1.69   A   .   .   .   .
.       .   .    .      .   .   .   .   .
N

20.7.04: LSS 4

Large and complex = complexity in:
• Multiple hierarchies of measurement
• Array of variables / operationalisations
• Relations between / subgroups of cases
• Multiple points of measurement
  – Balanced or unbalanced repeated contacts
  – Censored duration data
• Sample collection and weighting

20.7.04: LSS 5

i) Multiple hierarchies (levels) of measurement

Common examples:
• Both individuals and households
• Schools and pupils
• People and local districts and regions

Solutions:
• Separate VxC matrix for each level, eg BHPS
• Merged VxC matrix at lowest level

20.7.04: LSS 6

Illustration: Hierarchical dataset

Cluster   Person   Person-level Vars
1         1        1   38   1   1
1         2        2   34   2   2
1         3        2    6   -   -
2         1        1   45   1   3
2         2        2   41   1   1
3         1        1   20   2   2
3         2        1   25   2   2
3         3        1   20   1   1

n1=3   n2=8

20.7.04: LSS 7

ii) Array of variables

Vast number of variable responses, eg 1K+
• Recoding multiplies these up, eg dummies
• Multiple response vars (‘all that apply’)
• Categorisations / indexes (eg occupations)

Implication:
• Either separate files for separate var. groups
• Or very long and difficult files…

20.7.04: LSS 8

iii) Relations between cases

All respondents in a household
Husbands and wives both sampled
Fellow school pupils sampled
Longitudinal: differing relations with others at different times

Outcomes:
• Link information between related cases

20.7.04: LSS 9

iv) Multiple measurement points

Longitudinal: information on same cases for multiple time points

Panel or cohort: several records via repeated contact for each individual
• Problems of ‘unbalanced’ panels

Life history / retrospective:
• Durations in spells: multistate / multiepisode, overlapping spells; time varying covariates
• Left or right censoring of durations in spells

20.7.04: LSS 10

v) Sample collection / weighting

Multistage cluster sampling particularly popular
Sample may have been clustered, stratified
Longitudinal: uneven inclusion of cases over time
Sample weights are designed to address these problems, but:
• Complex in application
• Not suited to all applications


20.7.04: LSS 12

STATA data management examples: see datmanag_part1.do

Claim: For data management, STATA is powerful, but not always well designed

Batch files / interactive syntax / programs

Data entry / browsing
Variable labels
Computing / recoding
Missing values
Weighting data
Survey estimators (svy)


20.7.04: LSS 14

Typology of longitudinal data files

3 sets of contrasts:
1. Repeated X-section / Panel / Cohort / Event History / Time Series
2. Wide vs Long
3. Discrete vs Continuous time

See datmanag_part2.do

20.7.04: LSS 15

Contrast 1 Type A: Repeated x-sect data

Survey   Person   Person-level Vars
1        1        1   38   1   1
1        2        2   34   2   2
1        3        2    6   -   -
2        4        1   45   1   3
2        5        2   41   1   1
3        6        1   20   2   2
3        7        1   25   2   2
3        8        1   20   1   1

N_s=3   N_c=8

20.7.04: LSS 16

C1 Type B: Panel dataset (Unbalanced)

Cases   Year   Variables
1       1      1   17   1   1
1       2      1   18   2   1
1       3      1   19   2   -
2       1      1   17   1   3
2       2      1   18   1   1
3       2      2   20   2   2
3       3      2   21   2   2
3       4      2   22   1   1

n1=3   n2=8
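What makes this panel ‘unbalanced’ can be checked mechanically. The following is a minimal sketch in plain Python (not the seminar's STATA syntax), using only the case and year columns of the illustration above; the variable names are made up for the example.

```python
from collections import defaultdict

# (case, year) pairs copied from the unbalanced panel illustration.
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 2), (3, 3), (3, 4)]

# Collect the set of years in which each case was observed.
years_by_case = defaultdict(set)
for case, year in records:
    years_by_case[case].add(year)

# Every year that appears anywhere in the file.
all_years = {year for _, year in records}

# A panel is balanced when every case is observed at every time point,
# so list the years each case is missing.
missing = {case: sorted(all_years - years) for case, years in years_by_case.items()}
is_balanced = all(not gaps for gaps in missing.values())

print("n1 (cases):", len(years_by_case), "n2 (records):", len(records))
print("missing years by case:", missing)
print("balanced:", is_balanced)
```

Each case here lacks at least one year, which is why the panel counts as unbalanced.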

20.7.04: LSS 17

C1 Type C: Event history data analysis

Alternative data sources:
• Panel / cohort (more reliable)
• Retrospective (cheaper, but recall errors)

Aka: ‘Survival data analysis’; ‘Failure time analysis’; ‘hazards’; ‘risks’; ..

Focus shifts to length of time in a ‘state’ - analyses determinants of time in state

20.7.04: LSS 18

Key to event histories is ‘state space’

[Figure: Episodes within state space: lifetime work histories for 3 adults born 1935. For each of Person 1, Person 2 and Person 3, the states FT work / PT work / Not in work are plotted against a time axis running from 1950 to 2000.]

20.7.04: LSS 19

C1 Type D: Time series data

Exact equivalence to panel data format

Examples:
Unemployment rates by year in UK
University entrance rates by year by country

Statistical summary of one particular concept, collected at repeated time points from one or more subjects

20.7.04: LSS 20

Contrast 2: ‘Wide’ versus ‘Long’ format

Relevant to all types of dataset:

‘Wide’ = 1 case per record (person), additional vars for time points:
Person 1   Sex   YoB   Var1_92   Var1_93   Var1_94   …
Person 2   …

‘Long’ = 1 case per time point within person (as panel data example)

STATA: ‘reshape’ command allows transfer between the two formats
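The wide/long contrast can be made concrete with a small sketch. This is plain Python rather than STATA's ‘reshape’, and it only illustrates the restructuring itself; the column names follow the Var1_92 pattern on the slide, while the data values and the helper names wide_to_long and long_to_wide are hypothetical.

```python
# Hypothetical wide-format file: one record per person, one column per
# year for the repeated variable (values are made up).
wide = [
    {"pid": 1, "sex": 1, "yob": 1960, "var1_92": 10, "var1_93": 12, "var1_94": 15},
    {"pid": 2, "sex": 2, "yob": 1971, "var1_92": 7, "var1_93": None, "var1_94": 9},
]

def wide_to_long(rows, stub, id_vars):
    """Return one record per person-year for columns named <stub>_<yy>."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(stub + "_"):
                rec = {k: row[k] for k in id_vars}
                rec["year"] = int(name[len(stub) + 1:])
                rec[stub] = value
                long_rows.append(rec)
    return long_rows

def long_to_wide(rows, stub, id_vars):
    """Return one record per person, with a <stub>_<yy> column per year."""
    people = {}
    for rec in rows:
        key = tuple(rec[k] for k in id_vars)
        person = people.setdefault(key, {k: rec[k] for k in id_vars})
        person[f"{stub}_{rec['year']}"] = rec[stub]
    return list(people.values())

id_vars = ["pid", "sex", "yob"]
long = wide_to_long(wide, "var1", id_vars)
for rec in long:
    print(rec)

# Going back again recovers the original wide records.
print(long_to_wide(long, "var1", id_vars) == wide)
```

Time-constant variables (sex, year of birth) are repeated on every long record, which is the price of the long format.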

20.7.04: LSS 21

Contrast 3: Continuous vs Discrete time

Primarily in terms of event history datasets

Continuous time (‘spell files’, ‘event oriented’)
• One episode per case, time in case is a variable

Discrete time
• One episode per time unit, type of event and event occurrence as variables

Analyses: Most packages can handle either format comfortably

20.7.04: LSS 22

Illustration of a continuous time retrospective dataset

Case   Person   Start time   End time   Duration   Origin state   Destination state   {Other vars, person/state}
1      1        1            158        157        1 (FT)         3 (NW)
2      1        158          170        12         3 (NW)         3 (NW)
3      2        1            22         21         3 (NW)         1 (FT)
4      2        22           106        84         1 (FT)         3 (NW)
5      2        106          149        43         3 (NW)         2 (PT)
6      2        149          170        21         2 (PT)         2 (PT)
7      3        1            10         9          1 (FT)         2 (PT)
.      .        .            .          .          .              .
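The duration column of a spell file is derived from the start and end times, and the rows above can be used to check this. The sketch below is plain Python over the seven illustrated rows. Treating a spell whose destination state equals its origin state as right-censored is an assumption about how this illustration is coded, not something the source states.

```python
# Rows copied from the continuous-time illustration:
# (case, person, start, end, origin_state, destination_state)
spells = [
    (1, 1, 1, 158, "FT", "NW"),
    (2, 1, 158, 170, "NW", "NW"),
    (3, 2, 1, 22, "NW", "FT"),
    (4, 2, 22, 106, "FT", "NW"),
    (5, 2, 106, 149, "NW", "PT"),
    (6, 2, 149, 170, "PT", "PT"),
    (7, 3, 1, 10, "FT", "PT"),
]

# Duration of each spell = end time minus start time.
durations = [end - start for _, _, start, end, _, _ in spells]

# Assumption: a spell whose destination equals its origin had not ended
# when observation stopped, i.e. it is right-censored.
censored_cases = [case for case, _, _, _, origin, dest in spells if origin == dest]

# Total observed time per (person, origin state).
time_in_state = {}
for _, person, start, end, origin, _ in spells:
    key = (person, origin)
    time_in_state[key] = time_in_state.get(key, 0) + (end - start)

print("durations:", durations)
print("censored cases:", censored_cases)
print("time in state:", time_in_state)
```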

20.7.04: LSS 23

Illustration of a discrete time retrospective dataset

Case   Person   Discrete time   Approx real time   State   End of state   {Other person, state, or time unit level variables}
1      1        1               5                  1 FT    0
2      1        2               20                 1 FT    0
3      1        3               35                 1 FT    0
4      1        4               50                 1 FT    0
5      1        5               65                 1 FT    0
6      1        6               80                 1 FT    0
7      1        7               95                 1 FT    0
8      1        8               110                1 FT    0
9      1        9               125                1 FT    0
10     1        10              140                1 FT    1
11     1        11              155                3 NW    0
12     1        12              170                3 NW    1
13     2        1               5                  3 NW    0
14     2        2               20                 3 NW    1
15     2        3               35                 1 FT    0
16     2        4               50                 1 FT    1
.      .        .               .                  .       .
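A discrete-time file of this shape can be built by expanding each spell into one record per time unit. The following is a minimal sketch in plain Python under simplifying assumptions: spell lengths are already expressed in whole time units, the input values are hypothetical (they are not the rows above), and the helper name to_person_period is made up for the example.

```python
def to_person_period(spells):
    """Expand spells into one record per person per time unit.

    spells: list of (person, state, n_units), in chronological order
    within each person. The end_of_state flag is 1 only on the last
    time unit of each spell.
    """
    rows = []
    clock = {}  # last discrete time unit used so far, per person
    for person, state, n_units in spells:
        for i in range(n_units):
            t = clock.get(person, 0) + 1
            clock[person] = t
            rows.append({
                "person": person,
                "time": t,
                "state": state,
                "end_of_state": int(i == n_units - 1),
            })
    return rows

# Hypothetical input: person 1 spends 3 units in FT then 2 in NW;
# person 2 spends 1 unit in NW.
example = [(1, "FT", 3), (1, "NW", 2), (2, "NW", 1)]
person_period = to_person_period(example)
for row in person_period:
    print(row)
```

The expansion multiplies the number of records, which is the usual trade-off of the discrete-time format.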

20.7.04: LSS 24





4. Merging Datasets

20.7.04: LSS 25

Matching files

Complex data inevitably involves more than one related data file

A vital data analysis skill!!

Link data between files by connecting them according to key linking variable(s)
• Eg, ‘person identifier’ variable ‘pid’
• Eg: http://iserwww.essex.ac.uk/bhps/doc/

See datmanag_part3.do


20.7.04: LSS 26

Types of file matching

Case-to-case matching
• One-to-one link, eg two files with different sets of variables for same people
• STATA: append or merge

Table distribution
• One-to-many link, eg one file has individuals, another has households, and match household info to the individuals
• STATA: merge
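Table distribution can be sketched as a lookup on the key linking variable. The example below is plain Python, not STATA's merge; the files, the household identifier ‘hid’ and all values are hypothetical, with only ‘pid’ taken from the slides.

```python
# Hypothetical household-level file, keyed by household identifier.
households = {
    10: {"region": "Scotland", "hh_size": 3},
    11: {"region": "Wales", "hh_size": 1},
}

# Hypothetical individual-level file; several people share a household.
individuals = [
    {"pid": 1, "hid": 10, "age": 38},
    {"pid": 2, "hid": 10, "age": 34},
    {"pid": 3, "hid": 11, "age": 45},
]

# One-to-many link: copy the household record onto every individual
# with the same hid. Individuals with no matching household keep only
# their own variables.
merged = [{**person, **households.get(person["hid"], {})} for person in individuals]

for row in merged:
    print(row)
```

The result stays at the individual level, so household values are repeated across household members.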

20.7.04: LSS 27

Types of file matching ctd

Aggregating
• Summarise over multiple cases then link summaries back to cases
• STATA: collapse

Related cases matching
• Link info from one related case to another case, eg info on spouse put on own case
• STATA: merge or joinby
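Both ideas on this slide can be sketched together in plain Python (again, not the STATA commands themselves). The file layout, the ‘hid’ and ‘spouse_pid’ variables and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical individual-level file with a pointer to each person's spouse.
people = [
    {"pid": 1, "hid": 10, "age": 38, "spouse_pid": 2},
    {"pid": 2, "hid": 10, "age": 34, "spouse_pid": 1},
    {"pid": 3, "hid": 10, "age": 6, "spouse_pid": None},
    {"pid": 4, "hid": 11, "age": 45, "spouse_pid": None},
]

# Aggregating: summarise over the cases in each household...
ages = defaultdict(list)
for p in people:
    ages[p["hid"]].append(p["age"])
hh_mean_age = {hid: sum(v) / len(v) for hid, v in ages.items()}

# ...then link the summaries back to the cases, and at the same time
# do related-cases matching: put the spouse's age on the person's own record.
by_pid = {p["pid"]: p for p in people}
for p in people:
    p["hh_mean_age"] = hh_mean_age[p["hid"]]
    spouse = by_pid.get(p["spouse_pid"])
    p["spouse_age"] = spouse["age"] if spouse else None

for p in people:
    print(p)
```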

20.7.04: LSS 28

STATA file matching crib:

_merge = indicator of cases present for:
1 = Master file but not input file
2 = Input file but not Master file
3 = Master and input file
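The meaning of the three _merge codes can be illustrated with a small sketch. This is a plain Python imitation of a one-to-one match, not STATA's own implementation; the function name merge_with_indicator and the data are hypothetical, and letting master values win when both files hold a variable of the same name is a choice made for this sketch.

```python
def merge_with_indicator(master, using, key):
    """One-to-one match on `key`, adding a _merge code to every record:
    1 = master file only, 2 = input file only, 3 = both files."""
    using_by_key = {row[key]: row for row in using}
    matched = set()
    out = []
    for row in master:
        k = row[key]
        if k in using_by_key:
            # Master values win on name clashes (assumption of this sketch).
            out.append({**using_by_key[k], **row, "_merge": 3})
            matched.add(k)
        else:
            out.append({**row, "_merge": 1})
    for k, row in using_by_key.items():
        if k not in matched:
            out.append({**row, "_merge": 2})
    return out

# Hypothetical files sharing the person identifier pid.
master = [{"pid": 1, "age": 38}, {"pid": 2, "age": 34}]
using = [{"pid": 2, "job": "teacher"}, {"pid": 3, "job": "nurse"}]

result = merge_with_indicator(master, using, "pid")
for row in result:
    print(row)
```

Tabulating the indicator after a match is a quick check on whether the link behaved as expected.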
