Analysis of Complex Survey Data Katherine M. Keyes Kmk2104@columbia.edu

  • date post

    19-Dec-2015

Page 1: Analysis of Complex Survey Data Katherine M. Keyes Kmk2104@columbia.edu.

Analysis of Complex Survey Data

Katherine M. Keyes

Kmk2104@columbia.edu

Purpose of this class

• Teach you how to analyze complex survey data using SUDAAN

• Provide you with the tools to:
– 1) find datasets that fit your research interests
– 2) download and manage those datasets
– 3) do your own analyses

Structure of the class

• 1:00-2:00 Lecture
• 2:00-3:30 Guided exercise
• 3:30-3:45 Break
• 3:45-5:00 Independent research project

Today’s schedule

• Introduction to each other
• Key concepts in complex surveys
• Introduction to the NHANES
– Focus on describing the complexities in sample and design weights
• PREPARING AN ANALYTIC DATASET
– Locate variables
– Download data files
– Append and merge datasets
– Clean and recode data
– Format and label variables
– Save datasets

Who am I?

Who are you?

What is ‘complex survey data’

• Complex survey data usually refers to sample designs in which respondents have been sampled in a way that is multi-stage, stratified, unequally weighted, and/or clustered.

• Because of these design elements, the sample is no longer a simple random sample: respondents are not selected independently or with equal probability, which violates the assumptions of basic large-sample statistics

What is ‘complex survey data’

• Because of this, we need to take into account the design elements when estimating standard errors.

Two types of weights commonly used

• SAMPLE WEIGHTS: adjust for the oversampling of certain groups, typically those that are hard to reach (e.g., young people), and for informative nonresponse

• DESIGN WEIGHTS: adjust the standard errors for the nonrandom probability of selection into the sample

• TAKE HOME MESSAGE:
– Sample weights affect the ESTIMATES and not the STANDARD ERRORS
– Design weights affect the STANDARD ERRORS and not the ESTIMATES

• We need SUDAAN to incorporate the design weights.
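
To make the role of the sample weights concrete, here is a minimal Python sketch with made-up numbers (the records and weights are hypothetical, not NHANES values). Three respondents come from an oversampled group, so each carries a small weight; the weighted estimate scales them back to their share of the population.

```python
# Hypothetical toy records: (outcome, sample_weight).
# The first three respondents belong to an oversampled group, so each one
# represents fewer people in the population (smaller weight).
records = [
    (1.0, 100.0),
    (1.0, 100.0),
    (1.0, 100.0),
    (0.0, 700.0),
]


def unweighted_mean(rows):
    """Mean that treats every respondent as representing one person."""
    return sum(y for y, _ in rows) / len(rows)


def weighted_mean(rows):
    """Mean in which each respondent counts in proportion to the weight."""
    total_weight = sum(w for _, w in rows)
    return sum(w * y for y, w in rows) / total_weight


print(unweighted_mean(records))  # 0.75
print(weighted_mean(records))    # 0.3
```

Ignoring the weights would overstate the outcome here, because the oversampled group makes up three quarters of the sample but only 30% of the weighted population.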

Design weights: what are they

• Strata: larger geographic unit
• Primary Sampling Units (PSUs): generally single counties or groups of small counties
• Households
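
As a rough illustration of how strata and PSUs enter the standard error, the sketch below computes a weighted mean and a Taylor-linearized standard error under the assumption that PSUs are sampled with replacement within strata. This is one common approach, shown with hypothetical toy data; it is not a reproduction of SUDAAN output, and the function name `design_based_se` is made up for this example.

```python
import math
from collections import defaultdict


def design_based_se(rows):
    """rows: iterable of (stratum, psu, weight, y). Returns (mean, se).

    Assumes PSUs are sampled with replacement within each stratum and that
    every stratum contains at least two PSUs.
    """
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Linearized contribution of each respondent, summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for stratum, psus in psu_totals.items():
        n = len(psus)
        if n < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than 2 PSUs")
        avg = sum(psus.values()) / n
        variance += n / (n - 1) * sum((z - avg) ** 2 for z in psus.values())
    return mean, math.sqrt(variance)


# Hypothetical data: (stratum, psu, weight, y). In stratum 1 the outcome is
# perfectly clustered by PSU; in stratum 2 it is evenly mixed.
toy = [
    (1, 1, 1.0, 1.0), (1, 1, 1.0, 1.0),
    (1, 2, 1.0, 0.0), (1, 2, 1.0, 0.0),
    (2, 1, 1.0, 1.0), (2, 1, 1.0, 0.0),
    (2, 2, 1.0, 1.0), (2, 2, 1.0, 0.0),
]
mean, se = design_based_se(toy)
print(mean, se)
```

With these toy numbers the design-based standard error is 0.25, compared with about 0.18 from the naive formula sqrt(p(1-p)/n) that treats the eight respondents as independent. The estimate itself (0.5) is unchanged by the design variables.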

Introduction to the data we will be using in this class

• National Health and Nutrition Examination Survey

• “A program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews with physical examinations.”

• http://www.cdc.gov/nchs/nhanes.htm

Introduction to the data we will be using in this class

Years       Survey name           Age range
1959-1962   NHES I                18-79
1963-1965   NHES II               12-17
1966-1970   NHES III              12-17
1971-1975   NHANES I              1-74
1976-1980   NHANES II             1-74
1989-1991   NHANES III Phase I    1-74
1991-1994   NHANES III Phase II   1-74
1999-2000   NHANES 99-00          0-75
2001-2002   NHANES 01-02          0-75
2003-2004   NHANES 03-04          0-75
2005-2006   NHANES 05-06          0-75
2007-2008   NHANES 07-08          0-75
2009-2010   NHANES 09-10          0-75

Domains of inquiry in the NHANES

• Demographic background
• Housing characteristics
• Smoking
• Consumer behavior
• Income
• Food security
• Tracking and tracing
• Acculturation
• Arthritis
• Audiometry
• Blood pressure
• Cardiovascular disease
• Dermatology
• Diabetes
• Dietary screener
• Dietary behavior
• Early childhood
• Health insurance
• Hospital utilization and access to care
• Immunization
• Kidney conditions
• Occupation
• Oral health
• Osteoporosis

Domains of inquiry in the NHANES

• Physical activity and physical fitness
• Physical functioning
• Respiratory Health and Disease
• Sleep disorders
• Weight history
• Reproductive health
• Illegal drug use
• Depression
• Alcohol use
• Pesticide use
• Bowel health

Physical exam includes measures of:

• Arthritis
• Audiometry
• Bone density (DXA)
• Anthropometry
• Oral Glucose Tolerance Test
• Oral Health
• Physician’s Exam
• Respiratory Health

Laboratory components include measures of:

• Venipuncture
• Urine collection
• Bone mineral status markers
• Diabetes profile
• Infectious disease profile
• Oral HPV
• C-reactive protein
• Thyroid profile
• Standard biochemical profile
• Kidney disease profile
• Pregnancy test
• Prostate Specific Antigen
• Nutritional biochemistries and hematologies
• STD profile
• Blood lipids
• Environmental health profile

DNA

• Blood samples for DNA purification were collected from participants aged 20 years or older in survey years 1999-2002 and 2007-2008.

• These are restricted-access data

Landmark findings and public health results

• High blood lead levels
– Lead out of gasoline

• Low folate levels
– Mandatory food fortification

• Rising levels of obesity
– Public health action plan

• Racial/ethnic disparities in Hepatitis B
– Universal vaccination of all infants and children

NHANES not for you?

• The concepts we will discuss apply to many other publicly available datasets, and you are encouraged to use these data for your in-class project if your research questions are not covered in the NHANES

• Where can I find other publicly available datasets?

– ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/

SAMPLE WEIGHTING IN THE NHANES

Design weights: variable names

• Strata: SDMVSTRA

• PSU: SDMVPSU

Sample weights in the NHANES

• If only data from the interviewed sample are used, then the appropriate SAS variable is:
– WTINT2YR

• If data from the medical examination are used, then the appropriate SAS variable is:
– WTMEC2YR

• Some data are only collected on sub-samples of NHANES participants. These data are generally not publicly available or are only released a few years after the main interview data. If you are using data on a subsample of NHANES participants, appropriate subsample weights must be used and they are included on any data file where relevant.
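
The rule for the two main weights can be written as a small lookup. This is only a sketch of the rule as stated on this slide: the function name and its argument are hypothetical, and subsample weights are deliberately left out because they depend on the specific data file.

```python
def two_year_weight_name(uses_exam_data):
    """Return the 2-year weight variable named on the slide.

    uses_exam_data: True if any analysis variable comes from the medical
    examination; False if only interview data are used.
    Subsample files carry their own weights and are not handled here.
    """
    return "WTMEC2YR" if uses_exam_data else "WTINT2YR"


print(two_year_weight_name(False))  # WTINT2YR
print(two_year_weight_name(True))   # WTMEC2YR
```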

Combining NHANES samples

• For NHANES 1999-2000, SDMVSTRA is numbered 1 to 13; for NHANES 2001-2002 SDMVSTRA is numbered 14-28; for NHANES 2003-2004 SDMVSTRA is numbered 29-43; etc.

• Therefore, two year NHANES cycles can be combined without any recoding of this variable
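
Before stacking cycles it is cheap to confirm that no stratum number is shared between them. The sketch below assumes each cycle has already been read into a list of dictionaries keyed by variable name; the helper name `shared_strata` and the toy records are hypothetical.

```python
def shared_strata(cycles):
    """Return the set of SDMVSTRA values that appear in more than one cycle.

    cycles: dict mapping a cycle label to a list of record dicts, each
    containing an 'SDMVSTRA' entry. An empty result means the cycles can
    be stacked without recoding the stratum variable.
    """
    first_seen = {}
    clashes = set()
    for label, rows in cycles.items():
        for stratum in {row["SDMVSTRA"] for row in rows}:
            if stratum in first_seen and first_seen[stratum] != label:
                clashes.add(stratum)
            first_seen.setdefault(stratum, label)
    return clashes


# Toy records using the numbering described above (1-13, 14-28, 29-43).
cycles = {
    "1999-2000": [{"SDMVSTRA": 1, "SDMVPSU": 1}, {"SDMVSTRA": 13, "SDMVPSU": 2}],
    "2001-2002": [{"SDMVSTRA": 14, "SDMVPSU": 1}, {"SDMVSTRA": 28, "SDMVPSU": 2}],
    "2003-2004": [{"SDMVSTRA": 29, "SDMVPSU": 1}, {"SDMVSTRA": 43, "SDMVPSU": 2}],
}
print(shared_strata(cycles))  # set()
```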

Combining NHANES samples: 1999-2006

• For the 1999-2002 and 2003-2006 survey periods, Mexican Americans were oversampled but non-Mexican American Hispanics were not oversampled.

• Therefore, estimates for Hispanics that are not Mexican Americans are generally unreliable and should not be analyzed

• Further, estimates for ‘all Hispanics’ should not be calculated

Combining NHANES samples: 2007-2008, 2009-2010

• The sample design of NHANES 2007-2010 differs from the sample designs for earlier cycles.

• Adolescents were no longer oversampled
• Non-Mexican American Hispanics were oversampled, allowing for estimates of “all Hispanics” (but smaller subgroups remain unreliable).

Summary: combining samples

• The NHANES sample designs for the periods 1999-2002 and 2003-2006 were similar, such that combining data cycles within these periods does not present any analytic issues.

• When combining with the 2007-2008 data, however, data users should not create estimates for total Hispanics for the 2005-2008 data period.

• For non-Hispanic white, non-Hispanic black, and Mexican American sample domains, rescaling the sample weights to create four-year weights should be sufficient

• But users should check estimates carefully to see if the four year estimates and sampling errors are consistent with each set of 2 year estimates.

Reweighting the data when combining samples

• When combining two or more 2-year cycles of the continuous NHANES, the user must calculate new sample weights before beginning any analysis of the data.

• A set of four year weights has already been created for the 1999-2002 data (e.g., for the MEC sample it’s WTMEC4YR).

• For four year estimates for 2001-2004, one can create a new variable for a four year weight by assigning ½ of the 2 year weight for 2001-2002 if the person was sampled in 2001-2002 or assigning ½ of the 2 year weight for 2003-2004 if the person was sampled in 2003-2004.

• For an estimate covering the 6 years of 1999-2004, a 6-year weight variable can be created by assigning 2/3 of the 4 year weight for 1999-2002 if the person was sampled in 1999-2002 or assigning 1/3 of the 2 year weight for 2003-2004 if the person was sampled in 2003-2004.
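
The two reweighting rules above can be sketched in Python as follows. The sketch assumes each record is a dictionary holding the MEC weights WTMEC2YR and WTMEC4YR plus a hypothetical "cycle" label added by the analyst; the function names are also made up for illustration.

```python
def four_year_weight_2001_2004(record):
    """4-year weight for 2001-2004: half of the 2-year weight."""
    if record["cycle"] not in ("2001-2002", "2003-2004"):
        raise ValueError("record is outside 2001-2004")
    return 0.5 * record["WTMEC2YR"]


def six_year_weight_1999_2004(record):
    """6-year weight for 1999-2004.

    2/3 of the 4-year weight for people sampled in 1999-2002,
    1/3 of the 2-year weight for people sampled in 2003-2004.
    """
    if record["cycle"] in ("1999-2000", "2001-2002"):
        return (2.0 / 3.0) * record["WTMEC4YR"]
    if record["cycle"] == "2003-2004":
        return (1.0 / 3.0) * record["WTMEC2YR"]
    raise ValueError("record is outside 1999-2004")


# Hypothetical records; the weight values are invented for the example.
early = {"cycle": "2001-2002", "WTMEC2YR": 8000.0, "WTMEC4YR": 3000.0}
late = {"cycle": "2003-2004", "WTMEC2YR": 6000.0}

print(four_year_weight_2001_2004(late))   # 3000.0
print(six_year_weight_1999_2004(early))   # 2/3 of 3000, about 2000
print(six_year_weight_1999_2004(late))    # 1/3 of 6000, about 2000
```

The fractions reflect how many of the combined years each source weight already covers: the 4-year weight spans two of the three 2-year cycles, and the 2-year weight spans one.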

LAB #1: PREPARING AN ANALYTIC DATASET

Open the Word document “Lab 1: Preparing an analytic dataset”