A Crash Course in Secondary Data Sources for Berkeley Researchers
Jon Stiles D-Lab
Lesson 1 & 2: Plan Ahead!
Take a little time to check out the landscape and see what you might want to look for.
Lesson 1 & 2: Look where you’re going!
…and don’t go so fast that you lose control of what you’re doing!
Secondary Data Resources
Road Map for Today
Secondary data: what is it and where does it come from?
Why and how would you want to use it?
Secondary data: where can you find it?
Sites (archives, research organizations, government agencies)
Strategies (keyword, literature, snowball)
Tools to help you extract and use secondary data
Local resources to help you
Secondary data: what is it and where does it come from?
Secondary data: what is it?
Data: plural of "datum", from the Latin for "something given."
Plural : Right on!
Something Given: Not so much….
Secondary data: what is it?
Primary data: "New" data
Typically collected to answer specific questions or serve specific needs
Known universe/sample, intentional design
Tailored data items
Secondary data: "Recycled" data
Collected by others and re-used
Often (but not always) collected for a different use
Value reliant on metadata (information about the data)
Secondary data: basic characteristics
Secondary data tend to emerge from three kinds of collection processes:
Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
Administrative data: collection for program administration or routine record-keeping
Digital exhaust: an electronic byproduct or residue of activities
Secondary data may be available either as:
Microdata: individual-level records for a unit of analysis
Aggregate data: summary counts or statistics across multiple units
Secondary data may be available either as:
Cross-sectional: data collected at a single point in time
Longitudinal: data collected for the same unit of observation at multiple points in time
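To make the microdata/aggregate and cross-sectional/longitudinal distinctions concrete, here is a minimal Python sketch. The records, the field names (`person_id`, `year`, `county`, `employed`), and all values are invented for illustration and do not come from any real data source.

```python
from collections import defaultdict

# Hypothetical microdata: one record per person per year (invented values).
microdata = [
    {"person_id": 1, "year": 2010, "county": "Alameda", "employed": True},
    {"person_id": 2, "year": 2010, "county": "Alameda", "employed": False},
    {"person_id": 3, "year": 2010, "county": "Marin", "employed": True},
    {"person_id": 1, "year": 2012, "county": "Alameda", "employed": True},
    {"person_id": 2, "year": 2012, "county": "Alameda", "employed": True},
]


def aggregate_by_county(records, year):
    """Collapse one year of microdata into aggregate counts per county."""
    totals = defaultdict(lambda: {"n": 0, "employed": 0})
    for rec in records:
        if rec["year"] == year:
            totals[rec["county"]]["n"] += 1
            totals[rec["county"]]["employed"] += int(rec["employed"])
    return dict(totals)


def person_history(records, person_id):
    """Longitudinal view: the same unit observed at multiple points in time."""
    rows = [r for r in records if r["person_id"] == person_id]
    return sorted(rows, key=lambda r: r["year"])


# Cross-sectional aggregate: a single point in time, summarized across units.
aggregate_2010 = aggregate_by_county(microdata, 2010)

# Longitudinal microdata: person 2 followed across years.
history_2 = person_history(microdata, 2)
```

Note that the aggregate can always be rebuilt from the microdata, but not the other way around, which is one reason microdata are usually the more flexible starting point when they are available.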
Data Characteristics
Survey Data Characteristics
Well-defined sampling process
Usually fewer observations
American Community Survey (~200K/month)
GSS (~1500-6000)
Public Opinion (~1200)
Individual opinions and characteristics often gathered
Administrative Data Characteristics
Restricted universe, but can have large amounts of data (millions of observations)
Data collected only for program administration
Other data spotty, even if described in program
Often linkable to other data
Rarely includes participant opinion
"Data Exhaust" Characteristics
Often very large
Skewed populations – unclear sampling frame
Uncertain but developing capacity to link
Secondary data: origins
Secondary data emerge from several kinds of collection processes:
Survey data: collection for research purposes, coherent research design, well-defined sampling process, intent to generalize
Examples: General Social Survey (GSS), National Health Interview Surveys (NHIS), Current Population Survey (CPS)
Administrative data: collection for program administration or routine record-keeping
Examples: Marriage Records, Property Sales, Hospital Discharge Records, Court Records
Data exhaust: byproduct or residue of activities
Examples: Twitter collections, cell phone location data, newspaper articles
Advantages of Secondary Data
Cost: the original data collector bears the burden
Comparability: results may be contrasted with others using the same or similar sources
Chronology: research process can be shortened dramatically
Coverage: data may address points in time or geographies not directly available to the researcher
Credibility: data collection may use specially trained/knowledgeable staff
Disadvantages/Concerns about Secondary Data
Sample design may be unknown/undocumented
Quality of data elements may vary dramatically
Data collection challenges may be difficult to ascertain
Data may be gathered for different purposes/coded in inappropriate ways
Data may be outdated
Cost/Availability: proprietary or confidential data
Break & Introduction
Next we are going to talk about places which serve as repositories for data, and how to locate data….
But before we do that, let’s take a break and talk about your interests and needs.
Secondary data: where can you find it?
Archives: Academic
ICPSR (Inter-University Consortium for Political and Social Research) is a membership-based organization which collects data from individual researchers, polling agencies, and governmental and international agencies. Data sets cover areas such as political attitudes and behavior patterns, crime and criminal justice, state and national voting records, election studies, census enumerations, economic behavior, family studies, and social attitudes. Holdings at ICPSR are available to UCB subject to IP verification. (www.icpsr.umich.edu)
Archives: Polling Data
Roper Center: The Roper Center archives data from thousands of surveys with national adult, state, foreign, and special subpopulation samples conducted by Gallup, NORC, CBS, ABC, Harris, the LA Times, the NY Times, and many other polling organizations. Polls are available from as far back as the mid-1930s. Holdings at the Roper Center are, effective as of this month, also available via IP screening. (www.ropercenter.uconn.edu)
Government: NCES
http://nces.ed.gov/
NCES: Data Access
http://nces.ed.gov/edat/
Government: NCHS
http://www.cdc.gov/nchs/surveys.htm
Government: NSF - College, Doctoral, Post-Doctoral
http://www.nsf.gov/statistics/data-tools.cfm#micro-data
Government: BEA
Government: BLS
http://www.bls.gov/data/
Government: USDA
UKDA: General Purpose Archive
http://discover.ukdataservice.ac.uk/
IEA: TIMSS
OECD: PISA
http://www.oecd.org/pisa/
http://www.asdfree.com/2013/12/analyze-program-for-international.html
Other Archives/Data Resources on the net
Office of Population Research at Princeton http://opr.princeton.edu/archive/
This archive focuses on data of interest to demographers: data about fertility, mortality, and migration.
The Mexican Migration Project (MMP), an ongoing multidisciplinary study of migration from Mexico to the United States, has released data for 93 communities in 17 states in Mexico. The Latin American Migration Project (LAMP), which extends the MMP design to a study of migration flows originating in other Latin American countries, has now released data for the Dominican Republic, Nicaragua, Costa Rica, Haiti, Peru and Paraguay.
Demographic and Health Surveys http://www.measuredhs.com/
Surveys from Central and South America, Africa, and Asia dealing with health, family planning, education, and household characteristics. Free, but registration required.
Archives: Distributed
http://thedata.harvard.edu/dvn/
http://thedata.harvard.edu/dvn/faces/site/BrowseDataversesPage.xhtml?initialSort=Released
Add Health
http://www.cpc.unc.edu/projects/addhealth/data
Other Archives/Data Resources on the net
Integrated Public Use Microdata Series (IPUMS) http://www.ipums.umn.edu/
This is THE starting place if you have any interest in using microdata from the decennial censuses of the US. The documentation provides wording and context, extractions are straightforward, and multiple statistical packages are supported.
General Social Survey http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10
The GSS (General Social Survey) is an almost annual "omnibus" personal interview survey of U.S. households conducted by the National Opinion Research Center (NORC) since 1972. It covers a broad range of topics, with a strong core of items replicated each year, and modules which have been fielded concurrently in many other countries since the mid-1980s.
Panel Study of Income Dynamics (PSID) http://psidonline.isr.umich.edu/
The PSID is a longitudinal survey of a representative sample of US individuals and their families, ongoing since 1968. The data were collected each year through 1997, and every other year starting in 1999. Topics include income and wealth, expenses, education, and health care.
The National Survey of Families and Households (NSFH) http://www.ssc.wisc.edu/nsfh/home.htm
The NSFH fielded three waves of interviews between 1987 and 2002, covering family structure, household division of labor, employment, cohabitation, parenting, health and well-being, etc.
National Historical Geographic Information System (NHGIS) http://www.nhgis.org/
Provides, free of charge, aggregate census data and GIS compatible boundary files for the United States between 1790 and 2000.
International Social Survey Programme (ISSP) http://www.gesis.org/en/data_service/issp/data/list_quest_pdf.htm
The ISSP topical modules have focused on the Role of Government (1985, 1990, 1996), Family (1988, 1994, 2002), Social Inequality (1987, 1992, 1999), Social Networks (1986, 2001), Religion (1991, 1998), National Identity (1995, 2003), the Environment (1993, 2000) and Work Orientations (1989, 1997). The most recent modules are fielded in almost 40 countries.
National Bureau of Economic Research
http://nber.org/data/ Downloadable macro and microdata. Includes Consumer Expenditures data, Survey of Program Dynamics (SPD), Survey of Income and Program Participation (SIPP), natality and mortality files from NCHS, a long time series on segregation, and more.
http://nber.org/data/cps.html A very nice description and organization of the topical supplements to the CPS, as well as the data, documentation, and (in many cases) SAS, SPSS, and Stata syntax to read in the data.
American Religion Data Archive http://www.arda.tm/
Consortium for Earth Science Information Network (CIESIN) http://sedac.ciesin.org/data.html
1980/1990/2000 Census summary files in easily usable format
Boundary files in popular GIS formats
University of Wisconsin-Madison Center for Demography and Ecology ftp site http://www.ssc.wisc.edu/cde/library/cdeftp.htm
University of Virginia Library http://fisher.lib.virginia.edu/
Other Data Resources/tools on the net
The DataFerrett
http://dataferrett.census.gov/TheDataWeb/index.html
A collaboration between the CDC and Census Bureau which allows you to extract and download data from:
American Community Survey (ACS)
American Housing Survey (AHS)
Behavioral Risk Factor Surveillance System (BRFSS)
Consumer Expenditure Survey (CES)
Current Population Survey (CPS)
Decennial Census of Population and Housing (Census2000)
National Ambulatory Medical Care Survey (NAMCS)
National Center for Health Statistics Mortality-Underlying Cause-of-Death (MORT)
National Health and Nutrition Examination Survey (HANES)
National Health Interview Survey (NHIS)*
National Hospital Ambulatory Medical Care Survey (NHAMCS)
National Survey of Fishing, Hunting, and Wildlife-Associated Recreation (FHWAR)
Survey of Income and Program Participation (SIPP)
Survey of Program Dynamics (SPD)
Tools to help you extract and use secondary data
www.socialexplorer.com
Local resources to help you
Selected Data Resources at Berkeley
D-Lab http://dlab.berkeley.edu/
UC DATA http://ucdata.berkeley.edu/
California Census Research Data Center http://www.census.gov/ces/
Library Data Lab http://www.lib.berkeley.edu/wikis/datalab/
SDA (Survey Documentation & Analysis) http://sda.berkeley.edu/
Geospatial Innovation Facility http://gif.berkeley.edu/
Thank you. (Slides will be posted.)
Road Map (I)
Primary or Secondary – or both?
Research Design & Implementation (& Documentation)
Data Collection (& Documentation)
Data Entry (& Documentation)
Road Map (II)
Data Cleaning (& Documentation)
Reading Data In (& Documentation)
Labelling (& Documentation)
Edit Checks (More Cleaning) (& Documentation)
Weighting (& Documentation)
Road Map (III)
Descriptive Statistics (& Documentation)
Data Transformation (& Documentation)
Record Matching (& Documentation)
Aggregation/Collapsing (& Documentation)
First Stops
Data Cleaning: Skip Patterns, Missing Data, Range Checks
Reading Data In: Fixed format / Delimited / Hierarchical; Variable Typing (String / Numeric)
Labelling: Variables & Values
Edit Checks (More Cleaning): Consistency / Imputation
Weighting: Sampling Probability, Non-Response, Population
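The first stops above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column layout, the missing-data code `999`, the valid age range, the value labels, and the four raw records are all hypothetical, and a real file's codebook would supply the actual values.

```python
# Hypothetical fixed-format layout: (name, start, end, type).
# Columns are 0-based and end-exclusive; a real codebook defines these.
LAYOUT = [
    ("id", 0, 4, int),
    ("age", 4, 7, int),
    ("sex", 7, 8, str),
    ("weight", 8, 14, float),
]
MISSING_AGE = 999      # assumed missing-data code
AGE_RANGE = (0, 120)   # assumed valid range for the range check
VALUE_LABELS = {"sex": {"F": "Female", "M": "Male"}}  # assumed value labels

RAW_LINES = [          # invented records
    "0001 34F  1.50",
    "0002999M  0.75",  # age carries the missing-data code
    "0003150F  1.00",  # age fails the range check
    "0004 52M  2.25",
]


def parse_line(line):
    """Reading data in: slice one fixed-format line into typed fields."""
    return {name: cast(line[start:end].strip())
            for name, start, end, cast in LAYOUT}


def clean(record):
    """Data cleaning: set age to None if it is missing or out of range."""
    age = record["age"]
    if age == MISSING_AGE or not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        record = {**record, "age": None}
    return record


def weighted_mean(records, field):
    """Weighting: mean of `field` over valid records, using survey weights."""
    valid = [r for r in records if r[field] is not None]
    total_weight = sum(r["weight"] for r in valid)
    return sum(r[field] * r["weight"] for r in valid) / total_weight


records = [clean(parse_line(line)) for line in RAW_LINES]
mean_age = weighted_mean(records, "age")
sex_labels = [VALUE_LABELS["sex"][r["sex"]] for r in records]
```

Setting a failed value to `None` rather than dropping the whole record keeps the other fields usable, which is one common choice; imputation would be an alternative at the edit-check stage.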
Along the way
Descriptive Statistics: Min / Mean / Ptiles / Max / Valid N
Data Transformation: Recoding, Complex, Scales & Indices
Record Matching: Linking (1-1) / (1-Many)
Aggregation/Collapsing: Summary Statistics
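As a rough illustration of these later steps, the sketch below computes descriptive statistics, recodes a variable, links a person file to a household file (1-to-many), and collapses back to household totals. The two small files, their field names, and the age cut-points are invented for the example.

```python
import statistics
from collections import defaultdict

households = [  # hypothetical household-level file
    {"hh_id": "A", "region": "North"},
    {"hh_id": "B", "region": "South"},
]
persons = [     # hypothetical person-level file, many persons per household
    {"hh_id": "A", "age": 34, "income": 52000},
    {"hh_id": "A", "age": 8, "income": None},
    {"hh_id": "B", "age": 61, "income": 31000},
    {"hh_id": "B", "age": 59, "income": 45000},
]


def describe(values):
    """Descriptive statistics over non-missing values, including valid N."""
    valid = [v for v in values if v is not None]
    return {
        "min": min(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        "max": max(valid),
        "valid_n": len(valid),
    }


def recode_age(age):
    """Data transformation: recode age into groups (assumed cut-points)."""
    if age < 18:
        return "child"
    if age < 65:
        return "working-age"
    return "65+"


# Record matching (1-to-many): attach household fields to each person.
hh_lookup = {h["hh_id"]: h for h in households}
linked = [
    {**p, "region": hh_lookup[p["hh_id"]]["region"],
     "age_group": recode_age(p["age"])}
    for p in persons
]

# Aggregation/collapsing: total income per household, missing counted as 0.
income_by_hh = defaultdict(int)
for p in linked:
    income_by_hh[p["hh_id"]] += p["income"] or 0

income_summary = describe(p["income"] for p in persons)
```

Treating missing income as zero when collapsing is itself an analytic decision, so it is the kind of step that belongs in the documentation.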
Planning Your Trip and Getting on the Road
Research Design & Implementation
What do you want to be able to say at the end?
Who/what are your units of analysis?
What is the universe of the units you want to talk about?
How are the units you observe selected from the universe?
What is/are the instruments used to collect data?
Data Collection
How was the sampling strategy implemented?
Non-response – unit-level, item-level – and followup
Data Entry
Coding, Collapsing, Open-ended
Validation
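The difference between unit-level and item-level non-response can be shown with a small Python sketch. The sample IDs and responses are invented, and the rates are deliberately simplified: formal response-rate definitions also account for eligibility and partial interviews, which this ignores.

```python
# Hypothetical sample of five units; unit 3 never responded.
sampled_ids = [1, 2, 3, 4, 5]
responses = {
    1: {"age": 34, "income": 52000},
    2: {"age": 61, "income": None},   # item non-response on income
    4: {"age": None, "income": 31000},  # item non-response on age
    5: {"age": 45, "income": 45000},
}


def unit_response_rate(sampled, responded):
    """Share of sampled units that returned any response at all."""
    return sum(1 for unit_id in sampled if unit_id in responded) / len(sampled)


def item_nonresponse_rate(responded, item):
    """Share of responding units that left a given item unanswered."""
    missing = sum(1 for answers in responded.values() if answers[item] is None)
    return missing / len(responded)


unit_rate = unit_response_rate(sampled_ids, responses)
income_missing = item_nonresponse_rate(responses, "income")
age_missing = item_nonresponse_rate(responses, "age")
```

Unit non-response is usually handled through weighting adjustments, while item non-response is more often handled through imputation or by reporting the valid N for each variable.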