ICPSR Data Sharing

ICPSR AT 50: Facilitating Research and Data Sharing Part II: Data Sharing IASSIST Vancouver, BC May 31, 2011

description

This is Part II of a workshop presented by ICPSR at IASSIST 2011. This section focuses on data sharing of publicly available data.

Transcript of ICPSR Data Sharing

Page 1: ICPSR Data Sharing

ICPSR AT 50: Facilitating Research and Data Sharing

Part II: Data Sharing
IASSIST, Vancouver, BC
May 31, 2011

Page 2: ICPSR Data Sharing

“Public” Data Sharing begins at 10:45

Page 3: ICPSR Data Sharing

ICPSR’s Public Data

Sharing Public Data - Agenda
• 2010 US Census
• ICPSR’s “Public” Archives

Page 4: ICPSR Data Sharing

DISSEMINATION OF DATA - MICRODATA

From the Office of Management and Budget (OMB) Policy Directive published in the Federal Register, Vol. 73, No. 46, Friday, March 7, 2008, Notices, pp. 12622-12626:

“When appropriate to facilitate in-depth research, and feasible in the presence of resource constraints, statistical agencies should provide public access to microdata files with secure safeguards to protect the confidentiality of individually-identifiable responses and with readily accessible documentation, metadata, or other means to facilitate user access to and manipulation of the data.”

Page 5: ICPSR Data Sharing
Page 6: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: KEY DATES

• National Census Day: 1 April 2010
• April - July 2010: Census takers visit households that did not return a form by mail
• December 2010: By law, the Census Bureau delivers population information to the President for apportionment
• March 2011: By law, the Census Bureau completes delivery of redistricting data to states

Page 7: ICPSR Data Sharing
Page 8: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF RESULTS

American FactFinder (AFF) is an online source for population, housing, economic and geographic data that presents the results from four key data programs:
• Decennial Census of Housing and Population - 1990 and 2000
• Economic Census - 1997, 2002, and 2007
• American Community Survey 1-Year Estimates and 3-Year Estimates
• Population Estimates Program - July 1, 2006 to July 1, 2009

Results from each of these data programs are provided in the form of data sets, tables, thematic maps, and reference maps. 

Page 9: ICPSR Data Sharing
Page 10: ICPSR Data Sharing
Page 11: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

• Direct File Access through Download FTP Center at Census Bureau
• Free Access to all PUBLIC-USE DATA FILES
• First Release of Data (February – March 2011)
– 2010 Census Redistricting Data Summary File (P.L. 94-171):
  • State and sub-state population counts to the block level for the total population and the population 18 years and over for 63 race groups; and not Hispanic or Latino origin by 63 race groups
  • State and sub-state housing unit counts down to the block level by occupancy status (occupied units, vacant units)
• Quickly followed by (April 2011):
– National Summary File of Redistricting Data: Contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, other areas that cross state boundaries, and a small subset of the geographic areas shown in the state files.

Page 12: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

SUMMARY FILE 1 (SF 1):

This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, detailed race and Hispanic or Latino origin groups, and group quarters. Most tables are shown down to the block or census tract level. Some tables are repeated for nine race/Hispanic or Latino origin groups. The nine groups are (1) White alone, (2) Black or African American alone, (3) American Indian and Alaska Native alone, (4) Asian alone, (5) Native Hawaiian and Other Pacific Islander alone, (6) Some Other Race alone, (7) Two or More Races, (8) Hispanic or Latino, and (9) White alone, Not Hispanic or Latino. (Release: June-August 2011)

Page 13: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

SUMMARY FILE 1 (SF 1 CONTINUED):

• The SF 1 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: November 2011)

• The SF 1 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to block) and characteristics for urbanized areas and urban clusters. (Release: October 2012)

• The SF 1 Redefined Core Based Statistical Areas Update File contains the same data tables as the state files for redefined CBSAs as defined by OMB following the 2010 Census. (Release: August 2013)

Page 14: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

SUMMARY FILE 2 (SF 2):

This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, and group quarters. Most tables are shown down to the census tract level. Tables are repeated by 141 race groups, 98 American Indian and Alaska Native tribes/tribal groupings, and 39 Hispanic or Latino origin groups. In order for any of the tables for a specific group to be shown in SF 2, the data must meet a minimum population threshold. The tables in SF 2 will be repeated for each group if there are at least 100 people of that specific group in a particular geographic area. (Release: December 2011-April 2012)

Page 15: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

SUMMARY FILE 2 (SF 2 CONTINUED):

• The SF 2 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: May 2012)

• The SF 2 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to census tract) and characteristics for urbanized areas and urban clusters. (Release: January 2013)

Page 16: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

• Congressional District Summary File – This file is a re-tabulation of Summary File 1 for newly redistricted Congressional Districts for the 113th Congress. State-based files will be released in January 2013 and every 2 years thereafter for states where congressional redistricting occurs.

• State Legislative District Summary File – This file is a re-tabulation of Summary File 1 for State Legislative Districts drawn following the 2010 Census. State-based files will be released in June 2013 and every 2 years thereafter for states where legislative redistricting occurs.

Page 17: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

American Indian and Alaska Native (AIAN) Summary File – This is a national-level file showing the same content as Summary File 2. Tables are repeated for the total population, the total AIAN population, the total American Indian population, the total Alaska Native population, and for numerous American Indian and Alaska Native tribes. In order for any of the tables for a specific group to be shown, the data must meet a minimum population threshold of at least 100 people of that specific group in a particular geographic area. (Release: April 2013)

Page 18: ICPSR Data Sharing

U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA

• Public Use Microdata Sample (PUMS) Files – The PUMS files contain state-level 2010 Census data containing individual records of characteristics for a 10 percent sample of people and housing units. Data will be included for age, sex, race, Hispanic or Latino origin, household type and relationship, and tenure data with identifying information removed, for PUMAs of 100,000 or more population. (Release: TBD)

• Of lesser importance than 2000?

Page 19: ICPSR Data Sharing

Decennial Census

• In Census 2000, the census used 2 forms

– “short” form – asked for basic demographic and housing information, such as age, sex, race, how many people lived in the housing unit, and if the housing unit was owned or rented by the resident

– “long” form – collected the same information as the short form but also collected more in-depth information such as income, education, and language spoken at home

• Only a small portion of the population, called a sample, received the long form.

Page 20: ICPSR Data Sharing

2010 Census and American Community Survey

• 2010 Census will focus on counting the U.S. population

• The sample data are now collected in the ACS

• Puerto Rico is the only U.S. territory where the ACS is conducted

• 2010 Census will have a long form for U.S. territories such as Guam and U.S. Virgin Islands

• Same “short form” questions on the ACS

Page 21: ICPSR Data Sharing

American Community Survey 2008 Content Changes

• Three new questions
– Health Insurance Coverage
– Veteran’s Service-connected Disability
– Marital History

• Deletion of one question
– Time and main reason for staying at the address

• Changes in some wording and format

Page 22: ICPSR Data Sharing

American Community Survey Methodology

• Sample includes about 3 million addresses each year

• Three modes of data collection
– mail
– phone
– personal visit

• Data are collected continuously throughout the year

Page 23: ICPSR Data Sharing

American Community Survey Target Population

• Resident population of the United States and Puerto Rico

– Living in housing units and group quarters

• Current residents at the selected address
– “Two month” rule

Page 24: ICPSR Data Sharing

American Community Survey Group Quarters

• Place where people live or stay that is normally owned or managed by an entity or organization providing housing or services for the residents.

• 2 categories of group quarters:
– Institutional
– Non-institutional

Page 25: ICPSR Data Sharing

American Community Survey Period Estimates

• ACS estimates are period estimates, describing the average characteristics over a specified period

• Contrast with point-in-time estimates that describe the characteristics of an area on a specific date

• 1-year, 3-year, and 5-year estimates will be released for geographic areas that meet specific population thresholds

Page 26: ICPSR Data Sharing

American Community Survey Data Products Release Schedule

Each cell gives the year(s) in which the data were collected.

Data released in   1-Year Estimates   3-Year Estimates   5-Year Estimates
                   (areas 65,000+)    (areas 20,000+)    (all areas*)
2006               2005
2007               2006
2008               2007               2005-2007
2009               2008               2006-2008
2010               2009               2007-2009          2005-2009
2011               2010               2008-2010          2006-2010
2012               2011               2009-2011          2007-2011
2013               2012               2010-2012          2008-2012

* Five-year estimates will be available for areas as small as census tracts and block groups.
Source: US Census Bureau
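
The pattern in the release schedule is regular: each product is released the year after its collection period ends, and the period spans 1, 3, or 5 calendar years. A minimal sketch of that rule follows. The function and variable names are hypothetical and are not part of any Census Bureau tool; the thresholds and first release years are taken from the schedule above.

```python
# Hypothetical helper names; values come from the release schedule above.
# Minimum area population per product; None means no threshold (all areas).
POPULATION_THRESHOLD = {1: 65_000, 3: 20_000, 5: None}

# First year in which each product appears in the schedule.
FIRST_RELEASE = {1: 2006, 3: 2008, 5: 2010}


def collection_period(span_years, release_year):
    """Return (first_year, last_year) of data collection for a release."""
    if span_years not in FIRST_RELEASE:
        raise ValueError("span_years must be 1, 3, or 5")
    if release_year < FIRST_RELEASE[span_years]:
        raise ValueError("product not yet released in that year")
    last = release_year - 1          # collection ends the year before release
    first = last - span_years + 1    # period covers span_years calendar years
    return first, last


def available_products(population, release_year):
    """List the estimate spans (1, 3, 5) published for an area of this size."""
    spans = []
    for span, threshold in POPULATION_THRESHOLD.items():
        large_enough = threshold is None or population >= threshold
        if large_enough and release_year >= FIRST_RELEASE[span]:
            spans.append(span)
    return spans


if __name__ == "__main__":
    print(collection_period(3, 2008))
    print(available_products(30_000, 2010))
```

An area of 30,000 people falls below the 1-year threshold, so under these assumptions it would receive only 3-year and 5-year estimates.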

Page 27: ICPSR Data Sharing

American Community Survey Data Products

• Profiles
– Data Profiles
– Narrative Profiles
– Comparison Profiles
– Selected Population Profiles

• Tables
– Detailed Tables
– Subject Tables
– Ranking Tables
– Geographic Comparison Tables

• Thematic Maps
• Public Use Microdata Sample (PUMS) Files

Page 28: ICPSR Data Sharing

American Community Survey Similarities with Census 2000

• Same questions and many of the same basic statistics

• 5-year estimates will be produced for same broad set of geographic areas including census tracts and block groups

Page 29: ICPSR Data Sharing

American Community Survey Key Differences from Census 2000

• Beginning in 2010, data for small geographic areas will be produced every year versus once every 10 years

• Data for larger areas are available now and data for mid-sized areas will be available in December 2008

• Census 2000 data described the population and housing as of April 1, 2000 while ACS data describe a period of time and require data for 12 months, 36 months, or 60 months

Page 30: ICPSR Data Sharing

American Community Survey Key Differences from Census 2000

• The goal of ACS is to produce data comparable to the Census 2000 long form data

• These estimates will cover the same small areas as Census 2000 but with smaller sample sizes

• Smaller sample sizes for 5-year ACS estimates result in reductions in the reliability of estimates

Page 31: ICPSR Data Sharing

Cooperative Agreements

• Close collaboration with the Bureau over the years in making data available to the academic research community.

• Since the 1980s ICPSR has sought outside funding to deal with Census data and entered into joint statistical agreements with the Bureau to facilitate its distribution and use.

• Importance in 1990: High cost of raw data ($175 per reel of tape; the entire Census comprised about 2,000 tapes, or roughly $350,000).

Page 32: ICPSR Data Sharing

Cooperative Agreements

• Data available at no cost to member institutions without any rights to redistribute or resell.

• Joint annual summer workshops to offer training on the new Census data products.
– One week training sessions held in 1991-1994 and 2001-2004
– Census Bureau staff participated extensively in these courses
– Attracted both researchers and ICPSR Official Representatives who attended to learn how to provide assistance to faculty and students on their campuses

Page 33: ICPSR Data Sharing

The Decennial In(di)gestion

• Census Data: Collected regularly since the 1960s.

• Number of files and bytes have grown exponentially with every new Census.

• Main reason for the rapid growth in the numbers of data files archived and disseminated by ICPSR.

• How much and how rapid?

Page 34: ICPSR Data Sharing

The Decennial In(di)gestion

Census Year   Number of Files   Number of Kilobytes
1960                       83               586,848
1970                      502             9,622,368
1980                    1,134            62,947,672
1990                    4,880           312,893,930
2000                  412,051           556,240,394
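
To answer "how much and how rapid?" in numbers, the growth between one census and the next can be computed directly from the table. The sketch below only divides each row by the previous one; the function name is hypothetical and the figures are copied from the slide.

```python
# Figures copied from the table above: (census year, files, kilobytes).
HOLDINGS = [
    (1960, 83, 586_848),
    (1970, 502, 9_622_368),
    (1980, 1_134, 62_947_672),
    (1990, 4_880, 312_893_930),
    (2000, 412_051, 556_240_394),
]


def growth_factors(rows):
    """Return (year, file_factor, kilobyte_factor) versus the prior census."""
    factors = []
    # Pair each census with the one before it.
    for (_, prev_files, prev_kb), (year, files, kb) in zip(rows, rows[1:]):
        factors.append((year, files / prev_files, kb / prev_kb))
    return factors


if __name__ == "__main__":
    for year, file_factor, kb_factor in growth_factors(HOLDINGS):
        print(f"{year}: files x{file_factor:.1f}, kilobytes x{kb_factor:.1f}")
```

Reporting the two factors separately matters because the number of files and the volume of data do not necessarily grow at the same rate.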

Page 35: ICPSR Data Sharing

U.S. CENSUS DATA: DISSEMINATION OF DATA AND ICPSR

• Another access point, focused on the social science research community, to Census data and documentation
• Original Census data available from the 1960s onward as well as special samples created for earlier years
• TIGER Line Files
• American Community Survey
• Many of the newer files are available in a variety of formats:
  • SAS
  • SPSS
  • Stata
  • ASCII text files
  • Tab-delimited

Page 36: ICPSR Data Sharing

Special Census Subsets

These files report population and housing data for national and specific sub-national geographical entities, for example:
• The entire nation
• Each individual state
• Counties
• Metropolitan Areas
• Places
• Census Tracts

Page 37: ICPSR Data Sharing

Contextual File

• Based largely on Census data
• Provides information at the ‘county’ level in the U.S. (subunits of states numbering more than 3,100 in all)

• Contains data from other government and private sources at the same geographic level

• Under certain circumstances, can be merged with survey data

Page 38: ICPSR Data Sharing

Contextual File - 2

• Population by age, sex, race, and Hispanic origin
• Labor force size and unemployment
• Personal income
• Earnings and employment by industry
• Land surface form topography
• Climate
• Government revenue and expenditures
• Crimes reported to police
• Presidential election results
• Housing authorized by building permits
• Medicare enrollment
• Health profession shortage areas
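
Merging county-level context onto survey records amounts to a lookup on a shared county identifier. The following is a minimal sketch, assuming the survey carries a county code for each respondent. All field names and values are made up for illustration and do not reflect the actual layout of the contextual file.

```python
# Hypothetical county-level context, keyed by a county code string.
# Values are invented for illustration only.
county_context = {
    "26161": {"unemployment_rate": 5.1, "medicare_enrollment": 40_000},
    "26163": {"unemployment_rate": 9.8, "medicare_enrollment": 300_000},
}

# Hypothetical survey records; the second has no matching county.
survey = [
    {"respondent_id": 1, "county_code": "26161", "age": 34},
    {"respondent_id": 2, "county_code": "99999", "age": 51},
]


def attach_context(respondents, context, key="county_code"):
    """Return new records with county variables added where a match exists."""
    merged = []
    for row in respondents:
        county = context.get(row[key], {})  # empty dict when no match
        merged.append({**row, **county, "has_context": row[key] in context})
    return merged


if __name__ == "__main__":
    for record in attach_context(survey, county_context):
        print(record)
```

The `has_context` flag keeps unmatched respondents visible instead of silently dropping them, which is one reason a merge is only appropriate "under certain circumstances": the survey must identify the county, and doing so must not create a disclosure risk.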

Page 39: ICPSR Data Sharing

Preservation

• ICPSR provides another location to preserve data and documentation files produced by the Census Bureau

• ICPSR keeps multiple copies of these files both at its home location at the University of Michigan and at other sites in the United States

• Copies are continually checked and updated when necessary

• Considerable interest in historical Census data by demographers, historians, and economists.

Page 40: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

Consulted with Collection Development Committee of ICPSR Council:

• Advised to continue ICPSR precedent of acquiring Census 2010 since the membership and the research community in general have traditionally come to ICPSR for their Census data needs.

• Suggestion that the data files need not be archived right away since all public-use data will be available directly from the Census Bureau.

• Emphases should center on archiving the most important Census data products when it could be best determined that final versions were created.

• The Committee also suggested that ICPSR consider holding training workshops on Census data once again as they did during the last decade and decide how best to finance them within the context of the Summer Program.

Page 41: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

• Suggestion to study possibility that SDA functionality might work to produce subsets for Census data instead of creating specific data products to do so.

• Emphasis placed on partnerships and as an example working with the University of Minnesota Population Center and their National Historical Geographic Information System (NHGIS) which is expected to be able to produce subsets of 2010 Census data.

• Determine in general from membership and user community what value-added features might make sense for academic researchers as greater amounts of Census 2010 data become available.

Page 42: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

Select files archived at ICPSR beginning with 1996 ACS:

• Emphasis on PUMS files at first

• Greater interest in Summary Files as more data is released and, in particular, with the recent appearance of the first 5-year Estimates File covering calendar years 2005-2009

Page 43: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

TIGER files (Topologically Integrated Geographic Encoding and Referencing System)

• 2010 extracts containing geographic and cartographic information from the Census Bureau's MAF/TIGER® (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database.

• These files support the 2010 Census Redistricting Data (P.L. 94-171) and the National Summary File of Redistricting Data/Summary File 1 releases.

• The files provide the digital map base for a Geographic Information System or mapping software. The files do not contain any mapping software.

Page 44: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

TIGER files (Topologically Integrated Geographic Encoding and Referencing System)

• All legal boundaries and names are as of January 1, 2010. The boundaries shown are for Census Bureau statistical data collection and tabulation purposes only; their depiction and designation for statistical purposes does not constitute a determination of jurisdictional authority or rights of ownership or entitlement.

• The geographic entity codes needed to link the Census Bureau's demographic data to the geography are included in the files. The TIGER/Line Shapefiles do not contain any demographic or economic data; data can be downloaded separately using American FactFinder.
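
Linking demographic tables to geography relies on the geographic entity codes nesting inside one another. As a minimal sketch, assume a 15-digit 2010 census block identifier made up of state (2 digits), county (3), tract (6), and block (4). The function names and the sample identifier are hypothetical, and the attribute names used in actual TIGER/Line files vary by file.

```python
def split_block_id(block_id):
    """Split a 15-digit census block identifier into its component codes."""
    if len(block_id) != 15 or not block_id.isdigit():
        raise ValueError("expected a 15-digit census block identifier")
    return {
        "state": block_id[0:2],
        "county": block_id[2:5],
        "tract": block_id[5:11],
        "block": block_id[11:15],
    }


def parent_tract_id(block_id):
    """Return the 11-digit identifier of the tract containing the block."""
    split_block_id(block_id)  # validates the input
    return block_id[:11]


if __name__ == "__main__":
    sample = "261614001001000"  # illustrative identifier only
    print(split_block_id(sample))
    print(parent_tract_id(sample))
```

Because the codes nest, a block-level record can be matched to tract-level or county-level tables by truncating its identifier, with no need for a separate crosswalk.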

Page 45: ICPSR Data Sharing

Current Happenings with ACS and Plans for Census 2010

TIGER files (Topologically Integrated Geographic Encoding and Referencing System)

• Differences between shape files and line files
• Data stored at ICPSR through designated Web site
• Maintain archival copies as older versions of TIGER files cease to be distributed by Census Bureau

http://www.icpsr.umich.edu/TIGER/index.html

Page 46: ICPSR Data Sharing

ICPSR’s Public Archives

Page 47: ICPSR Data Sharing

ICPSR’s Public Archives

Three Differentiating Characteristics of a “Public Archive”

• Funding Sources

• Access

• Search

Page 48: ICPSR Data Sharing

Funding Sources & Long Term Access

ICPSR’s public archives are funded by entities including:
• Government agencies
• Foundations
• Other Organizations

And if the funding ceases:
• ICPSR commitment to support access
• Access generally reverts to membership-only after some time period

Page 49: ICPSR Data Sharing

Why are Funders using ICPSR? An Archive’s Reasons for Being

• Dissemination Infrastructure
– Systems & Search = technology, security, & metadata
– Data Community Base (700 immediate members to share with)
– Community Outreach/engagement expertise
• Preservation
• Fulfillment of Data Management Plan (Grant) Requirements
• Ability to Measure & Report Dissemination Statistics

Page 50: ICPSR Data Sharing

Data Search within our Public Archives

A search for data/documents from within a public archive defaults to searches of materials (data) within that archive
• A strategy to help one narrow their scope
• All materials are publicly available

Page 51: ICPSR Data Sharing

The Relationship Visual

[Diagram: ICPSR at the center, connected to NACDA, SAMHDA, NCAA, NACJD, HMCA, DSDR, Research Connections, and NAHDAP]

A common hub, yet each unique

Page 52: ICPSR Data Sharing

NACJD: National Archive of Criminal Justice Data

• Study topic: criminal justice

• Funders: BJS, OJJDP, NIJ

• Unique attribute: staff routinely assist non-researchers (police departments) in data use

Page 53: ICPSR Data Sharing

DSDR: Data Sharing for Demographic Research

• Study topic: demography

• Partnership of several institutions

• Unique attribute: as much a resource for data producers as a mechanism for dissemination

Page 54: ICPSR Data Sharing

NACDA: National Archive of Computerized Data on Aging

• Study topic: Aging – gerontological research

• Funder: National Institute on Aging

• Unique attribute: largest library of electronic data on aging in the US

Page 55: ICPSR Data Sharing

Research Connections: Child Care and Early Education

• Study topic: early education

• Funder: US Dept. of Health & Human Services

• Unique attribute: goal is more than data – to be the destination for child care & early education research

Page 56: ICPSR Data Sharing

NCAA Student-Athlete Experiences Data Archive

• Study topic: intercollegiate athletics and higher education

• Funder: NCAA

• Unique attribute: to assist in the development of national athletics policies

Page 57: ICPSR Data Sharing

Health and Mental Health Collections

• Enhanced sensitivity in the area of disclosure risk

• From ingest of data to storage of data to analysis of data

• Has driven ICPSR, as the hub, to heighten its computing and data sharing environments

• Increasing demand has led to a need to automate – in a secured manner

Page 58: ICPSR Data Sharing

Center for Population Research in LGBT Health

• Partner: Fenway Institute

• Unique attribute: data is processed offsite – ICPSR acts as the host

Page 59: ICPSR Data Sharing

SAMHDA: Substance Abuse & Mental Health Data Archive

• Funder: SAMHSA

• Unique attribute: driving our online services and virtual analysis capabilities

Page 60: ICPSR Data Sharing

NAHDAP: National Addiction & HIV Data Archive Program

• Funder: NIDA

• Unique attribute: driving restricted contract system

Page 61: ICPSR Data Sharing

IFSS: Integrated Fertility Survey Series

• Funder: NICHD

• Unique attribute: data harmonization

Page 62: ICPSR Data Sharing

Let’s Take a Break
Return at 11:45