Data Management for Longitudinal Data

20.7.04: LSS 1 Longitudinal Studies Seminars: Longitudinal Analyses Using STATA Stirling University, 20.7.04 Data and Variable Management Paul Lambert

Transcript of Data Management for Longitudinal Data

Page 1: Data Management for Longitudinal Data

Longitudinal Studies Seminars: Longitudinal Analyses Using STATA

Stirling University, 20.7.04

Data and Variable Management

Paul Lambert

Page 2: Data Management for Longitudinal Data

Data Management for Longitudinal Data

1. The Nature of ‘Large and Complex’ Data

2. Data management & STATA – getting started

3. Longitudinal Data Types

4. Merging Datasets

Page 3: Data Management for Longitudinal Data

The nature of ‘large and complex’ longitudinal resources: complicating the variable by case matrix

Cases   Variables
1       1    17    1.73    A    . . . .
2       1    18    1.85    B    . . . .
3       2    17    1.60    C    . . . .
4       2    18    1.69    A    . . . .
.       .    .     .       .    . . . .
N

Page 4: Data Management for Longitudinal Data

Large and complex =

Complexity in:
• Multiple hierarchies of measurement
• Array of variables / operationalisations
• Relations between / subgroups of cases
• Multiple points of measurement
  – Balanced or unbalanced repeated contacts
  – Censored duration data
• Sample collection and weighting

Page 5: Data Management for Longitudinal Data

i) Multiple hierarchies (levels) of measurement

Common examples:
• Both individuals and households
• Schools and pupils
• People and local districts and regions

Solutions:
• Separate VxC matrix for each level, eg BHPS
• Merged VxC matrix at lowest level

Page 6: Data Management for Longitudinal Data

Illustration: Hierarchical dataset

Cluster   Person   Person-level Vars
1         1        1    38    1    1
1         2        2    34    2    2
1         3        2    6     -    -
2         1        1    45    1    3
2         2        2    41    1    1
3         1        1    20    2    2
3         2        1    25    2    2
3         3        1    20    1    1

n1=3 n2=8

Page 7: Data Management for Longitudinal Data

ii) Array of variables

Vast number of variable responses, eg 1K+
• Recoding multiplies these up, eg dummies
• Multiple response var.s (‘all that apply’)
• Categorisations / indexes (eg occupations)

Implication:
• Either separate files for separate var. groups
• Or very long and difficult files…

Page 8: Data Management for Longitudinal Data

iii) Relations between cases

All respondents in a household
Husbands and wives both sampled
Fellow school pupils sampled
Longitudinal: differing relations with others at different times

Outcomes:
• Link information between related cases

Page 9: Data Management for Longitudinal Data

iv) Multiple measurement points

Longitudinal: information on same cases for multiple time points

Panel or cohort: several records via repeated contact for each individual
• Problems of ‘unbalanced’ panels

Life history / retrospective:
• Durations in spells: multistate / multiepisode, overlapping spells; time varying covariates
• Left or right censoring of durations in spells

Page 10: Data Management for Longitudinal Data

v) Sample collection / weighting

Multistage cluster particularly popular
Sample may have been clustered, stratified
Longitudinal: uneven inclusion of cases over time
Sample weights designed to solve, but:
• Complex in application
• Not suited to all applications

Page 11: Data Management for Longitudinal Data

Data Management for Longitudinal Data

1. The Nature of ‘Large and Complex’ Data

2. Data management & STATA – getting started

3. Longitudinal Data Types

4. Merging Datasets

Page 12: Data Management for Longitudinal Data

STATA data management examples: see datmanag_part1.do

Claim: For data management, STATA is powerful, but not always well designed

Batch files / interactive syntax / programs

Data entry / browsing
Variable labels
Computing / recoding
Missing values
Weighting data
Survey estimators (svy)

Page 13: Data Management for Longitudinal Data

Data Management for Longitudinal Data

1. The Nature of ‘Large and Complex’ Data

2. Data management & STATA – getting started

3. Longitudinal Data Types

4. Merging Datasets

Page 14: Data Management for Longitudinal Data

Typology of longitudinal data files

3 sets of contrasts:

1. Repeated X-section / Panel / Cohort / Event History / Time Series

2. Wide vs Long

3. Discrete vs Continuous time

See datmanag_part2.do

Page 15: Data Management for Longitudinal Data

Contrast 1 Type A: Repeated x-sect data

Survey   Person   Person-level Vars
1        1        1    38    1    1
1        2        2    34    2    2
1        3        2    6     -    -
2        4        1    45    1    3
2        5        2    41    1    1
3        6        1    20    2    2
3        7        1    25    2    2
3        8        1    20    1    1

N_s=3 N_c=8

Page 16: Data Management for Longitudinal Data

C1 Type B: Panel dataset (Unbalanced)

Cases   Year   Variables
1       1      1    17    1    1
1       2      1    18    2    1
1       3      1    19    2    -
2       1      1    17    1    3
2       2      1    18    1    1
3       2      2    20    2    2
3       3      2    21    2    2
3       4      2    22    1    1

n1=3 n2=8
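Whether a long-format panel is balanced can be checked by listing the years in which each case was observed. The following is a minimal sketch in plain Python (not STATA), using only the case and year columns of the illustration above; the function name `waves_by_case` is illustrative and does not come from the seminar materials.

```python
from collections import defaultdict

# (case, year) pairs taken from the panel illustration above
records = [(1, 1), (1, 2), (1, 3),
           (2, 1), (2, 2),
           (3, 2), (3, 3), (3, 4)]

def waves_by_case(rows):
    """Return {case: sorted list of years in which the case was observed}."""
    waves = defaultdict(list)
    for case, year in rows:
        waves[case].append(year)
    return {case: sorted(years) for case, years in waves.items()}

observed = waves_by_case(records)
all_years = sorted({year for _, year in records})

# A panel is balanced only if every case is observed in every year
balanced = all(years == all_years for years in observed.values())

# Years in which each case has no record
missing = {case: [y for y in all_years if y not in years]
           for case, years in observed.items()}
```

Here no case is observed in all four years, so the panel is unbalanced, as the slide title says.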

Page 17: Data Management for Longitudinal Data

C1 Type C: Event history data analysis

Alternative data sources:
• Panel / cohort (more reliable)
• Retrospective (cheaper, but recall errors)

Aka: ‘Survival data analysis’; ‘Failure time analysis’; ‘hazards’; ‘risks’; ..

Focus shifts to length of time in a ‘state’ - analyses determinants of time in state

Page 18: Data Management for Longitudinal Data

Key to event histories is ‘state space’

Episodes within state space: lifetime work histories for 3 adults born 1935

[Figure: for each of Person 1, Person 2 and Person 3, episodes in the states FT work, PT work and Not in work, plotted against a time axis running from 1950 to 2000]

Page 19: Data Management for Longitudinal Data

C1 Type D: Time series data

**Exact equivalence to panel data format

Examples:
Unemployment rates by year in UK
University entrance rates by year by country

Statistical summary of one particular concept, collected at repeated time points from one or more subjects

Page 20: Data Management for Longitudinal Data

Contrast 2: ‘Wide’ versus ‘Long’ format

Relevant to all types of dataset:

‘Wide’ = 1 case per record (person), additional vars for time points:
Person 1 Sex YoB Var1_92 Var1_93 Var1_94 …
Person 2 …

‘Long’ = 1 case per time point within person (as panel data example)

STATA: ‘reshape’ command allows transfer between the two formats
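The wide-to-long transfer can be sketched outside STATA to show what changes in the data. This is a plain-Python illustration, not the ‘reshape’ command itself: the function `wide_to_long`, the lower-case column names and all values are assumptions made for the example.

```python
# Wide format: one record per person, one column per variable-year.
# The values are invented purely for illustration.
wide = [
    {"pid": 1, "sex": 1, "yob": 1960, "var1_92": 10, "var1_93": 12, "var1_94": 11},
    {"pid": 2, "sex": 2, "yob": 1971, "var1_92": 7, "var1_93": None, "var1_94": 9},
]

def wide_to_long(rows, stub, id_vars):
    """Turn columns named stub + year into one record per person-year."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name.startswith(stub):
                record = {k: row[k] for k in id_vars}    # time-constant variables
                record["year"] = int(name[len(stub):])   # suffix identifies the time point
                record[stub.rstrip("_")] = value         # the time-varying value
                long_rows.append(record)
    return long_rows

long_format = wide_to_long(wide, stub="var1_", id_vars=["pid", "sex", "yob"])
```

Each person now contributes three records, one per year, and the time-constant variables (sex, year of birth) are repeated on every record.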

Page 21: Data Management for Longitudinal Data

Contrast 3: Continuous vs Discrete time

Primarily in terms of event history datasets

Continuous time (‘spell files’, ‘event oriented’)
One episode per case, time in case is a variable

Discrete time
One episode per time unit, type of event and event occurrence as variables

Analyses: Most packages can handle either format comfortably

Page 22: Data Management for Longitudinal Data

Illustration of a continuous time retrospective dataset

Case   Person   Start time   End time   Duration   Origin state   Destination state   {Other vars, person/state}
1      1        1            158        157        1 (FT)         3 (NW)
2      1        158          170        12         3 (NW)         3 (NW)
3      2        1            22         21         3 (NW)         1 (FT)
4      2        22           106        84         1 (FT)         3 (NW)
5      2        106          149        43         3 (NW)         2 (PT)
6      2        149          170        21         2 (PT)         2 (PT)
7      3        1            10         9          1 (FT)         2 (PT)
.      .        .            .          .          .              .
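In a spell file the duration is derived from the start and end times. The sketch below, in plain Python, recomputes it for the first six spells of the illustration. It also flags spells whose destination state equals the origin state as right-censored; that reading of the table is an assumption made here, not something the slide states.

```python
# Spell records from the illustration: (case, person, start, end, origin, destination)
spells = [
    (1, 1, 1, 158, "FT", "NW"),
    (2, 1, 158, 170, "NW", "NW"),
    (3, 2, 1, 22, "NW", "FT"),
    (4, 2, 22, 106, "FT", "NW"),
    (5, 2, 106, 149, "NW", "PT"),
    (6, 2, 149, 170, "PT", "PT"),
]

def describe(spell):
    """Derive the duration and a censoring flag for one spell record."""
    case, person, start, end, origin, destination = spell
    return {
        "case": case,
        "person": person,
        "duration": end - start,
        # Assumption: an unchanged state marks a spell still in progress
        # when observation stopped (right-censored)
        "censored": origin == destination,
    }

summary = [describe(s) for s in spells]
```

The computed durations agree with the Duration column above, and under the stated assumption the two flagged spells are the ones that end at time 170.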

Page 23: Data Management for Longitudinal Data

Illustration of a discrete time retrospective dataset

Case   Person   Discrete time   Approx real time   State   End of state   {Other person, state, or time unit level variables}
1      1        1               5                  1 FT    0
2      1        2               20                 1 FT    0
3      1        3               35                 1 FT    0
4      1        4               50                 1 FT    0
5      1        5               65                 1 FT    0
6      1        6               80                 1 FT    0
7      1        7               95                 1 FT    0
8      1        8               110                1 FT    0
9      1        9               125                1 FT    0
10     1        10              140                1 FT    1
11     1        11              155                3 NW    0
12     1        12              170                3 NW    1
13     2        1               5                  3 NW    0
14     2        2               20                 3 NW    1
15     2        3               35                 1 FT    0
16     2        4               50                 1 FT    1
.      .        .               .                  .       .
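Moving from spell records to a discrete-time file means expanding each spell into one record per time unit. The plain-Python sketch below shows one way to do this. The spell values, the step of 10 time units, and the rule that the event indicator is 1 only in the last unit of a completed spell are all assumptions for illustration; they are not taken from the table above, whose exact coding rule is not stated.

```python
def to_discrete(spells, step):
    """Expand (person, start, end, state, ended) spells into one record per time unit."""
    rows = []
    for person, start, end, state, ended in spells:
        units = list(range(start, end, step))
        for t in units:
            is_last_unit = (t == units[-1])
            rows.append({
                "person": person,
                "time": t,
                "state": state,
                # 1 only in the final unit of a spell that ended in a transition
                "end_of_state": int(is_last_unit and ended),
            })
    return rows

# Hypothetical spells: one completed FT spell, then a censored NW spell
example_spells = [(1, 0, 30, "FT", True), (1, 30, 50, "NW", False)]
discrete = to_discrete(example_spells, step=10)
```

Two spell records become five person-period records, which is why discrete-time files are typically much longer than the spell files they are built from.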

Page 24: Data Management for Longitudinal Data

Data Management for Longitudinal Data

1. The Nature of ‘Large and Complex’ Data

2. Data management & STATA – getting started

3. Longitudinal Data Types

4. Merging Datasets

Page 25: Data Management for Longitudinal Data

Matching files

Complex data inevitably involves more than one related data file

A vital data analysis skill!!

Link data between files by connecting them according to key linking variable(s)

Eg, ‘person identifier’ variable ‘pid’

Eg: http://iserwww.essex.ac.uk/bhps/doc/

See datmanag_part3.do

Page 26: Data Management for Longitudinal Data

Types of file matching

Case-to-case matching
• One-to-one link, eg two files with different sets of variables for same people
• STATA: append or merge

Table distribution
• One-to-many link, eg one file has individuals, another has households, and match household info to the individuals
• STATA: merge
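Table distribution can be pictured as a lookup: each individual record carries a household key, and the household's variables are copied onto every member. The following is a plain-Python sketch of that idea, not STATA's merge. The identifier `hid`, the function `distribute` and all values are hypothetical.

```python
# Household-level file keyed by 'hid'; individual-level file keyed by 'pid'.
# All identifiers and values are hypothetical.
households = {
    10: {"region": "North", "hhsize": 2},
    11: {"region": "South", "hhsize": 1},
}
individuals = [
    {"pid": 1, "hid": 10, "age": 38},
    {"pid": 2, "hid": 10, "age": 34},
    {"pid": 3, "hid": 11, "age": 45},
]

def distribute(persons, hh_lookup, key):
    """One-to-many link: copy each household's variables onto its members."""
    merged = []
    for person in persons:
        hh_vars = hh_lookup.get(person[key], {})  # no match leaves the person unchanged
        merged.append({**person, **hh_vars})
    return merged

linked = distribute(individuals, households, key="hid")
```

The result stays at the individual level: three records, with household 10's values appearing on both of its members.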

Page 27: Data Management for Longitudinal Data

Types of file matching ctd

Aggregating
• Summarise over multiple cases then link summaries back to cases
• STATA: collapse

Related cases matching
• Link info from one related case to another case, eg info on spouse put on own case
• STATA: merge or joinby
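Both operations can be sketched in a few lines of plain Python (again, not STATA's collapse, merge or joinby). The variable names `hid` and `spouse`, and all values, are hypothetical.

```python
from collections import defaultdict

# Hypothetical person-level file with household and spouse identifiers
persons = [
    {"pid": 1, "hid": 10, "spouse": 2, "age": 38},
    {"pid": 2, "hid": 10, "spouse": 1, "age": 34},
    {"pid": 3, "hid": 11, "spouse": None, "age": 45},
]

# Aggregating: summarise age within each household
ages = defaultdict(list)
for p in persons:
    ages[p["hid"]].append(p["age"])
mean_age = {hid: sum(values) / len(values) for hid, values in ages.items()}

# Related cases: index the records by person identifier for spouse lookups
by_pid = {p["pid"]: p for p in persons}

for p in persons:
    p["hh_mean_age"] = mean_age[p["hid"]]                  # summary linked back to the case
    partner = by_pid.get(p["spouse"])                      # None when there is no spouse
    p["spouse_age"] = partner["age"] if partner else None  # spouse's info put on own case
```

In the aggregating step the household summary is computed once and then attached to every member. In the related-cases step the same file is, in effect, matched to itself through the spouse identifier.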

Page 28: Data Management for Longitudinal Data

STATA file matching crib:

_merge = indicator of cases present for:

1 = Master file but not input file
2 = Input file but not Master file
3 = Master and input file
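The logic behind the three codes can be reproduced with set membership. This plain-Python sketch mimics the coding listed above for two lists of identifiers; the function name `merge_indicator` and the identifier values are illustrative only.

```python
def merge_indicator(master_ids, input_ids):
    """Return {id: code} using the 1/2/3 coding listed above."""
    master, incoming = set(master_ids), set(input_ids)
    codes = {}
    for key in sorted(master | incoming):
        if key in master and key in incoming:
            codes[key] = 3   # present in both files
        elif key in master:
            codes[key] = 1   # master file only
        else:
            codes[key] = 2   # input file only
    return codes

codes = merge_indicator(master_ids=[1, 2, 3], input_ids=[2, 3, 4])
```

Tabulating the indicator after a match is a quick check on how many cases linked as expected.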