Data Management for Longitudinal Data
20.7.04: LSS 1
Longitudinal Studies Seminars: Longitudinal Analyses Using STATA
Stirling University, 20.7.04
Data and Variable Management
Paul Lambert
20.7.04: LSS 2
Data Management for Longitudinal Data
1. The Nature of ‘Large and Complex’ Data
2. Data management & STATA – getting started
3. Longitudinal Data Types
4. Merging Datasets
20.7.04: LSS 3
The nature of ‘large and complex’ longitudinal resources: complicating the variable by case matrix
Cases   Variables
1       1   17   1.73   A   . . . .
2       1   18   1.85   B   . . . .
3       2   17   1.60   C   . . . .
4       2   18   1.69   A   . . . .
.       .   .    .      .   . . . .
N
20.7.04: LSS 4
Large and complex =
Complexity in:
• Multiple hierarchies of measurement
• Array of variables / operationalisations
• Relations between / subgroups of cases
• Multiple points of measurement
  – Balanced or unbalanced repeated contacts
  – Censored duration data
• Sample collection and weighting
20.7.04: LSS 5
i) Multiple hierarchies (levels) of measurement
Common examples:
• Both individuals and households
• Schools and pupils
• People and local districts and regions
Solutions:
• Separate VxC matrix for each level, eg BHPS
• Merged VxC matrix at lowest level
20.7.04: LSS 6
Illustration: Hierarchical dataset
Cluster   Person   Person-level Vars
1         1        1   38   1   1
1         2        2   34   2   2
1         3        2   6    -   -
2         1        1   45   1   3
2         2        2   41   1   1
3         1        1   20   2   2
3         2        1   25   2   2
3         3        1   20   1   1
n1=3 n2=8
20.7.04: LSS 7
ii) Array of variables
Vast number of variable responses, eg 1K+
• Recoding multiplies these up, eg dummies
• Multiple response var.s (‘all that apply’)
• Categorisations / indexes (eg occupations)
Implication:
• Either separate files for separate var. groups
• Or very long and difficult files…
20.7.04: LSS 8
iii) Relations between cases
All respondents in a household
Husbands and wives both sampled
Fellow school pupils sampled
Longitudinal: differing relations with others at different times
Outcomes:
• Link information between related cases
20.7.04: LSS 9
iv) Multiple measurement points
Longitudinal: information on same cases for multiple time points
Panel or cohort: several records via repeated contact for each individual
• Problems of ‘unbalanced’ panels
Life history / retrospective:
• Durations in spells: multistate / multiepisode, overlapping spells; time varying covariates
• Left or right censoring of durations in spells
20.7.04: LSS 10
v) Sample collection / weighting
Multistage cluster particularly popular
Sample may have been clustered, stratified
Longitudinal: uneven inclusion of cases over time
Sample weights designed to solve, but:
• Complex in application
• Not suited to all applications
20.7.04: LSS 11
2. Data management & STATA – getting started
20.7.04: LSS 12
STATA data management examples: see datmanag_part1.do
Claim: For data management, STATA is powerful, but not always well designed
Batch files / interactive syntax / programs
Data entry / browsing
Variable labels
Computing / recoding
Missing values
Weighting data
Survey estimators (svy)
20.7.04: LSS 13
3. Longitudinal Data Types
20.7.04: LSS 14
Typology of longitudinal data files
3 sets of contrasts:
1. Repeated X-section / Panel / Cohort / Event History / Time Series
2. Wide vs Long
3. Discrete vs Continuous time
See datmanag_part2.do
20.7.04: LSS 15
Contrast 1 Type A: Repeated x-sect data
Survey   Person   Person-level Vars
1        1        1   38   1   1
1        2        2   34   2   2
1        3        2   6    -   -
2        4        1   45   1   3
2        5        2   41   1   1
3        6        1   20   2   2
3        7        1   25   2   2
3        8        1   20   1   1
N_s=3 N_c=8
20.7.04: LSS 16
C1 Type B: Panel dataset (Unbalanced)
Cases   Year   Variables
1       1      1   17   1   1
1       2      1   18   2   1
1       3      1   19   2   -
2       1      1   17   1   3
2       2      1   18   1   1
3       2      2   20   2   2
3       3      2   21   2   2
3       4      2   22   1   1
n1=3 n2=8
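One way to see why this panel counts as ‘unbalanced’ is to list, for each case, the years in which it has no record. The sketch below is in Python rather than STATA and is only illustrative: it uses the (case, year) pairs from the table above, and the variable names are assumptions.

```python
from collections import defaultdict

# (case, year) pairs copied from the unbalanced panel illustration
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 2), (3, 3), (3, 4)]

# Collect the set of years observed for each case
years_by_case = defaultdict(set)
for case, year in records:
    years_by_case[case].add(year)

# Every year that appears anywhere in the file
all_years = set().union(*years_by_case.values())

# A panel is balanced when every case is observed at every time point
missing = {case: sorted(all_years - years) for case, years in years_by_case.items()}
balanced = all(not gaps for gaps in missing.values())

print(balanced)
print(missing)
```

Here no case is present in all four years, so the panel is unbalanced.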
20.7.04: LSS 17
C1 Type C: Event history data analysis
Alternative data sources:
• Panel / cohort (more reliable)
• Retrospective (cheaper, but recall errors)
Aka: ‘Survival data analysis’; ‘Failure time analysis’; ‘hazards’; ‘risks’; ..
Focus shifts to length of time in a ‘state’ - analyses determinants of time in state
20.7.04: LSS 18
Key to event histories is ‘state space’
Episodes within state space: Lifetime work histories for 3 adults born 1935
[Figure: for each of Person 1, Person 2 and Person 3, episodes in the state space (FT work / PT work / Not in work) plotted against a time axis running from 1950 to 2000]
20.7.04: LSS 19
C1 Type D: Time series data
**Exact equivalence to panel data format
Examples:
• Unemployment rates by year in UK
• University entrance rates by year by country
Statistical summary of one particular concept, collected at repeated time points from one or more subjects
20.7.04: LSS 20
Contrast 2: ‘Wide’ versus ‘Long’ format
Relevant to all types of dataset:
‘Wide’ = 1 case per record (person), additional vars for time points:
Person 1 Sex YoB Var1_92 Var1_93 Var1_94 …
Person 2 …
‘Long’ = 1 case per time point within person (as panel data example)
STATA: ‘reshape’ command allows transfer between the two formats
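The wide-to-long transfer can be sketched outside STATA to show what changes in the file. The Python below is a minimal illustration, not the ‘reshape’ command itself: the data values, the ‘pid’ / ‘sex’ / ‘yob’ names and the wide_to_long helper are all hypothetical.

```python
# Hypothetical wide records: one row per person, one Var1 column per year
wide = [
    {"pid": 1, "sex": 1, "yob": 1960, "var1_92": 10, "var1_93": 12, "var1_94": 11},
    {"pid": 2, "sex": 2, "yob": 1971, "var1_92": 7, "var1_93": None, "var1_94": 9},
]

def wide_to_long(rows, stub="var1_", id_vars=("pid", "sex", "yob")):
    """Return one record per person per time point."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(stub):
                rec = {k: row[k] for k in id_vars}   # fixed person-level vars
                rec["year"] = int(name[len(stub):])  # time point from the suffix
                rec[stub.rstrip("_")] = value        # the time-varying value
                long_rows.append(rec)
    return long_rows

long_rows = wide_to_long(wide)
for rec in long_rows:
    print(rec)
```

Two wide records with three time points each become six long records; the fixed variables (sex, year of birth) are repeated on every record for the person.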
20.7.04: LSS 21
Contrast 3: Continuous vs Discrete time
Primarily in terms of event history datasets
Continuous time (‘spell files’, ‘event oriented’)
• One episode per case, time in case is a variable
Discrete time
• One episode per time unit, type of event and event occurrence as variables
Analyses: Most packages can handle either format comfortably
20.7.04: LSS 22
Illustration of a continuous time retrospective dataset
Case   Person   Start time   End time   Duration   Origin state   Destination state   {Other vars, person/state}
1      1        1            158        157        1 (FT)         3 (NW)
2      1        158          170        12         3 (NW)         3 (NW)
3      2        1            22         21         3 (NW)         1 (FT)
4      2        22           106        84         1 (FT)         3 (NW)
5      2        106          149        43         3 (NW)         2 (PT)
6      2        149          170        21         2 (PT)         2 (PT)
7      3        1            10         9          1 (FT)         2 (PT)
.      .        .            .          .          .              .
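The duration column in this kind of spell file is simply end time minus start time. The Python sketch below recomputes it from the illustration. It also flags spells whose destination state equals the origin state as right-censored; that reading is an assumption (a common convention for spell files), not something the slide states.

```python
# (case, person, start, end, origin, destination) from the illustration above
spells = [
    (1, 1, 1, 158, "FT", "NW"),
    (2, 1, 158, 170, "NW", "NW"),
    (3, 2, 1, 22, "NW", "FT"),
    (4, 2, 22, 106, "FT", "NW"),
    (5, 2, 106, 149, "NW", "PT"),
    (6, 2, 149, 170, "PT", "PT"),
    (7, 3, 1, 10, "FT", "PT"),
]

rows = []
for case, person, start, end, origin, dest in spells:
    rows.append({
        "case": case,
        "person": person,
        "duration": end - start,
        # Assumption: no change of state means the spell was still running
        # when observation stopped, i.e. it is right-censored
        "censored": origin == dest,
    })

durations = [r["duration"] for r in rows]
censored_cases = [r["case"] for r in rows if r["censored"]]
print(durations)
print(censored_cases)
```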
20.7.04: LSS 23
Illustration of a discrete time retrospective dataset
Case   Person   Discrete time   Approx real time   State   End of state   {Other person, state, or time unit level variables}
1      1        1               5                  1 FT    0
2      1        2               20                 1 FT    0
3      1        3               35                 1 FT    0
4      1        4               50                 1 FT    0
5      1        5               65                 1 FT    0
6      1        6               80                 1 FT    0
7      1        7               95                 1 FT    0
8      1        8               110                1 FT    0
9      1        9               125                1 FT    0
10     1        10              140                1 FT    1
11     1        11              155                3 NW    0
12     1        12              170                3 NW    1
13     2        1               5                  3 NW    0
14     2        2               20                 3 NW    1
15     2        3               35                 1 FT    0
16     2        4               50                 1 FT    1
.      .        .               .                  .       .
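Moving from a spell file to a discrete time file means writing one record per time unit within each spell. A minimal Python sketch follows, assuming spell boundaries are already measured in whole time units; the data and the to_discrete helper are hypothetical, and the rule used here is not claimed to reproduce the table above exactly.

```python
# Hypothetical spell file: (person, start, end, state), times in whole units
spells = [
    (1, 0, 3, "FT"),
    (1, 3, 5, "NW"),
    (2, 0, 2, "NW"),
]

def to_discrete(spell_rows):
    """Expand each spell into one record per time unit it covers."""
    records = []
    for person, start, end, state in spell_rows:
        for unit in range(start, end):
            records.append({
                "person": person,
                "time": unit + 1,                      # discrete time index
                "state": state,
                "end_of_state": int(unit == end - 1),  # 1 in the spell's last unit
            })
    return records

discrete = to_discrete(spells)
for rec in discrete:
    print(rec)
```

Three spells expand to seven person-period records, with ‘end_of_state’ set to 1 only on the final unit of each spell.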
20.7.04: LSS 24
4. Merging Datasets
20.7.04: LSS 25
Matching files
Complex data inevitably involves more than one related data file
A vital data analysis skill!!
Link data between files by connecting them according to key linking variable(s)
Eg, ‘person identifier’ variable ‘pid’
Eg: http://iserwww.essex.ac.uk/bhps/doc/
See datmanag_part3.do
20.7.04: LSS 26
Types of file matching
Case-to-case matching
• One-to-one link, eg two files with different sets of variables for same people
• STATA: append or merge
Table distribution
• One-to-many link, eg one file has individuals, another has households, and match household info to the individuals
• STATA: merge
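The one-to-many ‘table distribution’ link can be sketched as a lookup on the key variable. The Python below is illustrative only: the files, the household identifier ‘hid’ and the other variable names are hypothetical, and it assumes every individual has a matching household record.

```python
# Hypothetical files, linked by a household identifier 'hid'
households = [
    {"hid": 1, "region": "North", "hh_size": 3},
    {"hid": 2, "region": "South", "hh_size": 2},
]
individuals = [
    {"pid": 11, "hid": 1, "age": 38},
    {"pid": 12, "hid": 1, "age": 34},
    {"pid": 13, "hid": 1, "age": 6},
    {"pid": 21, "hid": 2, "age": 45},
    {"pid": 22, "hid": 2, "age": 41},
]

# Index the household file by its key, then copy household info to each person
hh_lookup = {hh["hid"]: hh for hh in households}
merged = [{**person, **hh_lookup[person["hid"]]} for person in individuals]

for rec in merged:
    print(rec)
```

The result stays at the lowest level (one record per individual), with the household variables repeated for every member of the household.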
20.7.04: LSS 27
Types of file matching ctd
Aggregating
• Summarise over multiple cases then link summaries back to cases
• STATA: collapse
Related cases matching
• Link info from one related case to another case, eg info on spouse put on own case
• STATA: merge or joinby
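Both ideas can be shown on one small person file. In the Python sketch below, household mean age is computed and linked back to each member (aggregating), and the spouse's age is copied onto each person's own record (related cases matching). The data and the ‘hid’ / ‘spouse_pid’ names are hypothetical.

```python
from collections import defaultdict

# Hypothetical person file; 'hid' and 'spouse_pid' are illustrative names
people = [
    {"pid": 11, "hid": 1, "age": 38, "spouse_pid": 12},
    {"pid": 12, "hid": 1, "age": 34, "spouse_pid": 11},
    {"pid": 13, "hid": 1, "age": 6, "spouse_pid": None},
    {"pid": 21, "hid": 2, "age": 45, "spouse_pid": 22},
    {"pid": 22, "hid": 2, "age": 41, "spouse_pid": 21},
]

# Aggregating: summarise age over the cases in each household
ages_by_hh = defaultdict(list)
for p in people:
    ages_by_hh[p["hid"]].append(p["age"])
mean_age = {hid: sum(ages) / len(ages) for hid, ages in ages_by_hh.items()}

# Related cases matching: index people by pid so a spouse can be looked up
by_pid = {p["pid"]: p for p in people}

linked = []
for p in people:
    spouse = by_pid.get(p["spouse_pid"])  # None when no spouse is recorded
    linked.append({
        **p,
        "hh_mean_age": mean_age[p["hid"]],
        "spouse_age": spouse["age"] if spouse else None,
    })

for rec in linked:
    print(rec)
```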
20.7.04: LSS 28
STATA file matching crib:
_merge = indicator of cases present for:
1 = Master file but not input file
2 = Input file but not Master file
3 = Master and input file
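The three _merge codes can be reproduced in a small Python sketch of a one-to-one match. This is an illustration of the coding scheme, not STATA's implementation: the merge_with_indicator helper and the data are hypothetical, and it assumes the key is unique within each file.

```python
def merge_with_indicator(master, using, key):
    """One-to-one match on `key`, adding a _merge code to every output row."""
    using_by_key = {row[key]: row for row in using}
    master_keys = set()
    out = []
    for row in master:
        master_keys.add(row[key])
        match = using_by_key.get(row[key])
        if match is None:
            out.append({**row, "_merge": 1})           # master file only
        else:
            # Master values take precedence here if a variable is in both files
            out.append({**match, **row, "_merge": 3})  # present in both files
    for row in using:
        if row[key] not in master_keys:
            out.append({**row, "_merge": 2})           # input file only
    return out

master = [{"pid": 1, "age": 38}, {"pid": 2, "age": 34}]
using = [{"pid": 2, "job": "nurse"}, {"pid": 3, "job": "clerk"}]

result = merge_with_indicator(master, using, "pid")
codes = {row["pid"]: row["_merge"] for row in result}
print(codes)
```

Tabulating the indicator after a match is a quick check on how many cases failed to link and from which file they came.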