RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partn…
Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples
Jared Lyle, RDAP13
About ICPSR
• Founded in 1962 as a consortium of 21 universities to share the National Election Survey
• Today: 700+ members around the world
• Data dissemination for more than 20 federal and non-government sponsors
• 600,000+ visitors per year
What we do
• Acquire and archive social science data
• Distribute data to researchers
• Preserve data for future generations
• Provide training in quantitative methods
Archive size
• 8,000 data collections, over 60,000 data sets
• Grows by 300+ collections a year
• 9 Terabytes, soon to be 40+ Terabytes
http://www.icpsr.umich.edu
http://www.flickr.com/photos/dwiggs/3983200894/sizes/l/in/photostream/
1. Sharing Data (Archiving)
“It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
“Virtually all geneticists believe that scientists should share their results freely with peers…”
Louis, Jones, and Campbell (2002). “Sharing in Science.” http://dx.doi.org/10.1511/2002.4.304
“…the era of data sharing has arrived.”
Samet (2009). “Data: To Share or Not to Share?” http://dx.doi.org/10.1097/EDE.0b013e3181930df3
http://www.data-pass.org/
Most PIs indicated that they wanted to be “Good Citizens” and help:
“This sounds like an exciting project.”
“I hope your project is successful because I think that it is important.”
“Good Citizens” = high willingness
…but no time, money, or resources to submit data to us.
Data Sharing (N=1,544)
Bar chart categories: Data Are Archived; Has Copy of Data; Data Are Lost
Bar chart values: 14.2%; 58.7%; 25.7%
Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
Data Sharing (N=935)
Federal Agency   Shared Formally, Archived (n=111)   Shared Informally, Not Archived (n=415)   Not Shared (n=409)
NSF (27.3%)      22.4%                               43.7%                                     33.9%
NIH (72.7%)      7.4%                                45.0%                                     47.6%
Total            11.5%                               44.6%                                     43.9%
Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data”. http://hdl.handle.net/2027.42/78307
2. Enhancing Data (Curating)
A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
A corollary: Do no harm.
http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg
Data
Documentation
http://dx.doi.org/10.3886/ICPSR31521.v1
Disclosure Issues
• Direct Identifiers?
  – personal names
  – addresses (including ZIP codes)
  – telephone numbers
  – social security numbers
  – driver license numbers
  – patient numbers
  – certification numbers
Disclosure Issues
• Indirect Identifiers?
  – detailed geography (e.g., state, county, or census tract of residence)
  – exact date of birth
  – exact occupations held
  – exact dates of events
  – detailed income
Disclosure Issues
• External Linkages?
  – public patient/medical records
  – court records
  – police and correction records
  – Social Security records
  – Medicare records
  – driver’s licenses
  – military records
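A first pass at spotting the direct and indirect identifiers listed above can be sketched in Python. This is a minimal illustration, not ICPSR's review procedure: the keyword lists, the function name, and the example variable names are all hypothetical, and matching on variable names is crude, so it can only prompt a human review, not replace one.

```python
# Hypothetical keyword lists loosely based on the identifier types named on the slides.
DIRECT_KEYWORDS = ("name", "address", "zip", "phone", "ssn", "license", "patient")
INDIRECT_KEYWORDS = ("state", "county", "tract", "birth", "occupation", "date", "income")


def flag_variables(variable_names):
    """Return {variable: "direct" | "indirect"} for names that match a keyword.

    Matching is a case-insensitive substring test, so false positives and
    false negatives are both possible.
    """
    flags = {}
    for var in variable_names:
        lowered = var.lower()
        if any(keyword in lowered for keyword in DIRECT_KEYWORDS):
            flags[var] = "direct"
        elif any(keyword in lowered for keyword in INDIRECT_KEYWORDS):
            flags[var] = "indirect"
    return flags


# Hypothetical variable names from an imagined survey file.
example_flags = flag_variables(["RESP_NAME", "ZIPCODE", "BIRTHDATE", "Q12_TRUST"])
print(example_flags)
```

Variables that match nothing (here, the hypothetical Q12_TRUST) are simply left out of the result; that does not mean they are safe to release.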
http://www.flickr.com/photos/k3v1nm/3366181223/
Opportunity
“It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
“Search/Compare Variables” examines 2.1 million variables in 4,000 data collections
Emerging sources and types of data
• Geo-spatial
• Video
• Administrative data
• Online text
• Transactions
• Clicks
• Sensors
Partnerships
Green, Ann G., and Myron P. Gutmann. (2007) "Building Partnerships Among Social Science Researchers, Institution-based Repositories, and Domain Specific Data Archives." OCLC Systems and Services: International Digital Library Perspectives. 23: 35-53. http://hdl.handle.net/2027.42/41214
“We propose that domain specific archives partner with institution based repositories to provide expertise, tools, guidelines, and best practices to the research communities they serve.”
Support:
http://www.icpsr.umich.edu/icpsrweb/IR/
5 Pilot Data Collections
http://www.flickr.com/photos/smithsonian/2551170386/
Selection & Appraisal
Recovery
Finding interested partners
http://www.flickr.com/photos/usnationalarchives/4726917373/
Time & Willingness
http://www.flickr.com/photos/floridamemory/7026619371/
Inter-university Consortium for Political and Social Research. Survey of Data Curation Services for Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2012-09-21. doi:10.3886/ICPSR34302.v1
Survey of Repositories’ Data Needs
• Media recovery, format migration, data recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog linkages
• Support networks and training
• Confidential data dissemination and confidentiality review
Repository Suggested Solutions:
1. Community Wayfinder
http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
2. Confidentiality Review & Treatment
• Suppressing unique cases
• Grouping values (e.g., 13-29=1, 30-49=2)
• Top-coding (e.g., >1,000=1,000)
• Aggregating geographic areas
• Swapping values
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
http://www.icpsr.umich.edu/icpsrweb/content/DSDR/tools/qualanon.html
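Three of the treatments listed above (grouping values, top-coding, and suppressing unique cases) can be sketched in Python. This is an illustrative sketch only, not ICPSR's tooling: the function names and sample records are hypothetical, and the cut points are taken from the slide's own examples (13-29=1, 30-49=2, and a cap of 1,000).

```python
from collections import Counter


def group_age(age):
    """Grouping values: recode an exact age into a coded range."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    return None  # outside the two ranges given on the slide; left uncoded here


def top_code(value, cap=1000):
    """Top-coding: any value above the cap is replaced by the cap."""
    return min(value, cap)


def suppress_unique_cases(records, keys):
    """Suppressing unique cases: drop records whose combination of key values occurs only once."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] > 1]


# Hypothetical respondent records.
raw = [
    {"age": 25, "income": 500},
    {"age": 27, "income": 2500},
    {"age": 45, "income": 1200},
]

treated = [
    {"age_group": group_age(r["age"]), "income": top_code(r["income"])}
    for r in raw
]
released = suppress_unique_cases(treated, keys=("age_group",))
print(treated)
print(released)
```

In this toy data the only respondent in age group 2 is unique on that variable, so the record is dropped. Which variables to use as keys, and what counts as "unique enough" to suppress, are judgment calls the sketch does not settle.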
3. Access to Processing Tools
The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
Hermes Outputs
• ASCII data files
  – Column- and tab-delimited
• Stat package setup files
  – SAS, SPSS, Stata (.do and .dct)
• “Ready-to-go” data files
  – SAS transport (CPORT engine)
  – SPSS system (.sav)
  – Stata system (.dta)
  – R (.rda)
Useful categories for discussion?
• Media recovery, format migration, data recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog linkages
• Support networks and training
• Confidential data dissemination and confidentiality review
Your ideas on partnerships?