Data Sharing & Data Citation
Micah Altman, Institute for Quantitative Social Science, Harvard University
Prepared for data coding, analysis, archiving, and sharing for open collaboration
NSF, Sept 15-16, 2011
Collaborators*
Margaret Adams, George Alter, Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Tom Carsey, Kevin Condon, Jonathan Crabtree, Merce Crosas, Darrell Donakowski, Myron Gutmann, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Amy Pienta, Lois Timms-Ferrara, Akio Sone, Bob Treacy, Copeland Young
Research Support
Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
* And co-conspirators
Related Work
Altman, M., and J. Crabtree. 2011. “Using the SafeArchive System: TRAC-Based Auditing of LOCKSS”, Proceedings of Archiving 2011.
Crosas, M. 2011. “The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data”, D-Lib Magazine 17(1/2).
Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., and Young, C. 2009. “Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences”, The American Archivist 72(1): 169-182.
Gutmann, M., Abrahamson, M., Adams, M.O., Altman, M., Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R., Timms-Ferrara, L., and Young, C. 2009. “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3): 315-33.
Altman, M. 2008. “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag.
Altman, M., and G. King. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4) (March/April).
King, G. 2007. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, Vol. 32, No. 2, pp. 173-199.
Motivations
Access to Data is the Foundation of Science
Science is not (only) about being scientific
- Scientific progress requires community: competition and collaboration in the pursuit of common goals
- Without access to the same materials, no community exists
… data is the nucleus of scientific collaboration
The value of an article that can’t be replicated: ?
- Scholarly articles are summaries, not the actual research results
- Experimental data are expensive to reproduce; observational data, impossible
- Hard for journal editors to verify
- If you find it, how do you know it’s the same?
- Replication projects show: many published articles cannot be replicated
… data is needed for scientific replication
Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009
Open Data Broadens & Deepens Impact
Data-Intensive Science
- Increased opportunities for interdisciplinarity
- Science modeling across multiple scales
- Continuous, complete, fine-grained information on physical processes, systems, human behavior
Education
- Data eases transition from education to research
Open Data Democratizes Science
- Citizen-scientists
- Developing countries
- Researchers outside of inner-circle of institution
- Crowd-sourcing, open notebooks, and mashups
& Data Sharing Increases Publication Impact
[Gleditsch 2003; Wilson 2008; Piwowar 2007]
Data is Key to Government
Statistics = state-istics
- Reformers use data
  - To assess the performance of the state
  - To assess social conditions
- Governments attempt to control access to data to evade accountability
Policy debates often center on data
- War on poverty, civil rights, consumer protection: all made heavy use of statistical arguments
- Economic and environmental policies are data-intensive
Data access brings together both sides of the political spectrum
- In a modern democracy the public needs a direct source of information
- Liberals and conservatives support access to data informing policy
Source: “Propaganda” http://www.media-studies.ca/articles/images/berlin_wall.jpg
Sources: Gough 2003; Shulman 2006; Wagner & Steinzor 2006; Alonso and Starr 1988
Open Data is “Research Insurance”
Keeps options open after the nominal end of the project: extends the lifecycle
- Continuation projects
- Publication revisions
- Broader research programs
Insures against loss of “project memory”
- Departure of senior personnel from institution
- Departure of post-docs, graduate students
- Accidental loss of data due to local IT failures
- Reduces questions from secondary analysts
Insures against intentional and unintentional errors
- All collaborators can verify results prior to publication
- Enables more intensive peer review
Source: Berman et al. 2008.
Data Sharing Across Communities
[Micah Altman, 10/6/2009]
Data sharing practices vary greatly across communities
- Proprietary
- Formal sharing
- Formal deposit
Significant correlates: tacit knowledge, individual investment of time in data collection, confidentiality, journal practices, funder policies & practices
Source: R.I.N. 2008; also see Borgman 2007; Niu 2006
So when do things go wrong?
Source: Reich & Rosenthal 2005
Confidentiality Restrictions for Personal Private Information
Overlapping laws differ:
- People/subjects covered
- Organizations covered
- Required technical and procedural controls
- Definition of identifiability
Some Strategies
- Consent for sharing up front
- Commercialize
- Observe public activity
- Share aggregates only
- De-identify
Recent Statistical Results (Oversimplified)
- De-identification often leaks
- Aggregation sometimes leaks
Not included: EU directives, foreign laws,
ANPRM Request for Comment on proposed revisions to 45 CFR 46
www.hhs.gov/ohrp/humansubjects/anprm2011page.html
Integrating Tools
Data Management - Goals
Data Management Elements
Core Requirements for Data Sharing Infrastructure
- Stakeholder incentives: recognition; citation; payment; compliance; services
- Dissemination: access to metadata; documentation; data
- Access control: authentication; authorization; rights management
- Provenance: chain of control; verification of metadata, bits, semantic content
- Persistence: bits; semantic content; use
- Legal protection: rights management; consent; record keeping; auditing
- Usability: discovery; deposit; curation; administration; collaboration
- Business model
Sources: King 2007; ICSU 2004; NSB 2005
Why is Infrastructure for Data Sharing Necessary?
Accessibility:
- Many large data sets: in public archives
- Most data in published articles: not accessible, results not replicable without the original author
- Most data sets from federal grants: not publicly available
Problems with discovery and linking even with professional archives:
- Data in different archives have different identifiers
- Archives change identifiers, links
- Changes to data are made; identifiers are reused or removed; old data are lost
- Locating/browsing/extracting requires specialized tools & approaches
Sharing data requires exposing tacit knowledge
- Explicit documentation of data structure, collection process, interpretation
- Harmonizing/linking to known ontologies, metadata schemas, vocabularies
Data sets are not preserved like books
- Static data files (even if on the web): unreadable after a few years
- When storage methods change: some data sets are lost; others have altered content!
Why not a single centralized infrastructure?
- Single point of failure
- Difficult when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.
- Data producers want credit, control, and visibility
For Scholars
• Brand it like your own website.
• Upload any type of data.
• Establish a persistent data citation
• Facilitate data discovery
• Provide live analysis
• Receive permanent storage space
For Organizations
• Used by archives, libraries, journals, schools
• Enable contributors to upload data
• Organize studies by collections
• Search across a universe of data
• Control access and terms of use
• Federate with catalogs and partners: OAI-PMH, LOCKSS, Z39.50, DDI
Dataverse Gateway to over 39,000 social science studies (world’s largest catalog)
Web Virtual Hosting 2.0 Service -- Over 350 virtual archives
Federated search and delivery
Virtual Archive: Scholar Site
Scholar retains control over branding and dissemination
Preservation and long-term access are guaranteed
Dissemination and compliance with Data Management Plans are verifiable
Integrates with OpenScholar
Interoperability & Integration
Mind the Gaps
- GAP: Coverage across entire lifecycle -- decoupling of dissemination, formal publication, long-term access, reuse
- GAP: Interoperability and integration across tools
- GAP: Maturity and sustainability of tools -- most tools have small communities of maintainers, particularly worrisome w/ lack of interoperability
[Figure: tool categories mapped against research lifecycle stages. Stages: design, publishing, dissemination, archiving, reuse, collection, processing, integration, analysis. Tool categories: CATI / CAPI; enhanced publication (Sweave); identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; scientific workflow systems.]
Supporting Institutions
Institutional Data Access Strategies*
“Ignore it, maybe someone else will take care of it” (internet archive, …)
“We’ll always be here” (self-preservation)
“Let the publishers do it”
“We are ever true to [Insert Alma Mater]” (institutional archives)
“Ask us (domain archive) to do it” (ICPSR, MRA, Roper, …)
“Ask someone(s) else to do it” (Data-PASS, Meta-Archive, CLOCKSS)
“Trust No One” (LOCKSS)
*All quotes are entirely fictional :-)
Institutional Preservation Strategies -- Corollaries
There are potential single points of failure in technology, organization, and legal regimes:
- Diversify your portfolio: multiple software systems, hardware, organization (e.g., Data-PASS :-)
- Seek international partners
Many combinations of preservation & dissemination strategies are compatible:
- Layer technologies and strategies
- Leverage dissemination (in a planned way) for preservation (and vice-versa)
Preservation is impossible to demonstrate conclusively:
- Consider organizational credentials
- No organization is absolutely certain to be reliable
Institutional Topologies
Projects at the Intersection
Partnership Agreements
- MOU
- Succession Plans & Agreements
Coordinating Operations
- Development of shared procedures
Joint “Not-bad” practices
- Identification & selection
- Metadata
- Confidentiality
Shared Catalog
- Unified Discovery
- Content replication
Data-PASS is a broad-based partnership of data archives dedicated to acquiring and preserving data at-risk of being lost to the social science research community.
Data-PASS partners have rescued thousands of data sets and created the largest catalog of social science data in existence.
Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard
Ideal integration of policy and technology?
- Expressed in high-level domain/business language
- Captures a significant portion of business domain
- Translated to a formal schematization
- Automatically measurable
- Directly controls procedures & actions to achieve compliance
- Verifiable translation from business domain policy
Policy: A set of rules and objectives, expressed at a high domain level, that controls actions at a lower level
“The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies.”
Policy
Schematization
Behavior (Operationalization)
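A minimal sketch of how the Policy / Schematization / Behavior layering could look for the copy-count requirement quoted above. It is illustrative only: `ReplicationPolicy`, `CopyRecord`, `locate_copies`, `audit`, the host names, and the three-copy threshold are all hypothetical and are not taken from SafeArchive or TRAC.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Schematization: the prose policy reduced to one measurable rule (hypothetical)."""
    min_copies: int


@dataclass(frozen=True)
class CopyRecord:
    """One observed copy of a digital object on one storage host (hypothetical)."""
    object_id: str
    host: str


def locate_copies(records):
    """Map each object ID to the sorted list of distinct hosts holding a copy."""
    hosts = defaultdict(set)
    for rec in records:
        hosts[rec.object_id].add(rec.host)
    return {obj: sorted(h) for obj, h in hosts.items()}


def audit(policy, records, expected_ids):
    """Behavior: return {object_id: hosts} for every object below min_copies."""
    located = locate_copies(records)
    return {
        obj: located.get(obj, [])
        for obj in expected_ids
        if len(located.get(obj, [])) < policy.min_copies
    }


# Made-up inventory: study-1 has three copies, study-2 one, study-3 none.
records = [
    CopyRecord("study-1", "archive-a.example"),
    CopyRecord("study-1", "archive-b.example"),
    CopyRecord("study-1", "archive-c.example"),
    CopyRecord("study-2", "archive-a.example"),
]
policy = ReplicationPolicy(min_copies=3)
violations = audit(policy, records, ["study-1", "study-2", "study-3"])
print(violations)
```

The point of the split is that the rule (`min_copies`) can be changed without touching the audit code, and the audit output lists both the number and the location of copies, which is what the quoted requirement asks a repository to be able to report.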
SafeArchive: TRAC-Based Management of LOCKSS
Facilitating collaborative replication and preservation with technology…
- Collaborators declare explicit non-uniform resource commitments
- Policy records commitments, storage network properties
- Storage layer provides replication, integrity, freshness, versioning
- SafeArchive software provides monitoring, auditing, and provisioning
- Content is harvested through HTTP (LOCKSS) or OAI-PMH
- Integration of LOCKSS, The Dataverse Network, TRAC
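To make "non-uniform resource commitments" and "provisioning" concrete, here is a small sketch of placing collections onto partner hosts that have each committed a different amount of storage. This is a simple greedy heuristic chosen for illustration, not SafeArchive's actual provisioning logic; the function name `provision`, the host names, and the sizes are assumptions.

```python
def provision(collections, commitments, replicas):
    """Greedy placement sketch (hypothetical, not SafeArchive's algorithm).

    collections: {collection_name: size_gb}
    commitments: {host: committed_gb}, may differ per host (non-uniform)
    replicas:    number of distinct hosts each collection must land on

    Returns {collection_name: [hosts]}; raises ValueError if some
    collection cannot be given the requested number of copies.
    """
    free = dict(commitments)
    plan = {}
    # Place the largest collections first so they are not squeezed out.
    ordered = sorted(collections.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        # Hosts with enough remaining space, most free space first.
        candidates = sorted(
            (h for h in free if free[h] >= size),
            key=lambda h: (-free[h], h),
        )
        if len(candidates) < replicas:
            raise ValueError(f"cannot place {replicas} copies of {name!r}")
        chosen = candidates[:replicas]
        for h in chosen:
            free[h] -= size
        plan[name] = chosen
    return plan


# Made-up commitments (GB) from three partners and two collections.
commitments = {"a": 100, "b": 60, "c": 40}
collections = {"x": 40, "y": 30}
plan = provision(collections, commitments, replicas=2)
print(plan)
```

A monitoring pass can then compare a plan like this with what the storage layer actually reports and flag collections that have fallen below their committed replica count.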
Aligning Incentives
Stakeholders & Information Flow
Research sources:
- Research subjects
- Owners of subject material
- Owners of supplementary data
Research sponsors:
- Home institution
- Funding sources
Project personnel:
- Investigators
- Research staff
Research publishers:
- Print publishers
- Research archives
Research consumers:
- Readers
- Secondary researchers
Licensing; Copyright; DMCA; Informed Consent; Privacy; Trade secrets
Licensing; Freedom of Information; Copyright
Copyright
Copyright; Licensing
Fair Use
Information Transfer
Privacy; Confidentiality; Intellectual Property
Replicable Research; Policy Relevance; Accessibility of Research; Protect IP; Avoid third party IP/Privacy Issues
Replicable Research; Publish; Promote use of Publications; Track use
Replicable research; Promote use of their publications; Protect publisher IP; Avoid third party IP/Privacy Issues
Replicate and extend; Secondary analysis; Link research
Stakeholder Concerns Legal Issues
Data Collection
Publication of Research Products
Data Citation as a Leverage Point
Services
- Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
- Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
- Persistence of identifiers is needed to maintain long-term access
Incentives
- Scholarly credit (intellectual attribution) is a large motivator for many researchers -- citation creates an incentive for researchers to publish data
- Scholars also comply with enforceable journal policies -- requiring data citation is a light-weight method to make data access policies auditable
- Impact/usage is a motivator for public research funders -- data citation provides a foundation for measures of usage and impact
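One way an identifier can point at a specific fixed version of data is to pair a persistent ID with a fingerprint of the content. The sketch below uses a plain SHA-256 over a canonical JSON serialization; it is not the UNF fingerprint method cited under Related Work. The identifier `hdl:0000/example`, the `#sha256:` suffix convention, and the sample rows are made up for illustration.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 over a canonical JSON serialization of the rows (not UNF)."""
    canonical = json.dumps(
        rows, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def versioned_identifier(persistent_id, rows, length=12):
    """Append a truncated fingerprint so the identifier pins one fixed version."""
    return f"{persistent_id}#sha256:{fingerprint(rows)[:length]}"


# Hypothetical data: v1 and v1_reordered_keys hold the same values; v2 has one edit.
v1 = [{"id": 1, "income": 52000}, {"id": 2, "income": 48000}]
v1_reordered_keys = [{"income": 52000, "id": 1}, {"income": 48000, "id": 2}]
v2 = [{"id": 1, "income": 52000}, {"id": 2, "income": 48500}]

print(versioned_identifier("hdl:0000/example", v1))
print(fingerprint(v1) == fingerprint(v1_reordered_keys))
print(fingerprint(v1) == fingerprint(v2))
```

Because the key order is normalized before hashing, a cosmetic change in how the file is written leaves the fingerprint alone, while any change to a value produces a different one. A reader holding a citation can therefore check that the data retrieved is the version that was cited.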
Common Principles
Thanks to 37 Participants
What is a citation?
Workflow
Workflow
Design Principles
- Separate scientific principles, use cases, requirements
- Distinguish syntax, semantics, from presentation
- Design for ecosystem & lifecycle
- Incremental value for incremental effort
- Think Globally, Act Locally
Theory
Theory +
- Data citations should be first-class objects for publication -- they should appear with other citations and be as easy to cite as other works
- At minimum, all data necessary to understand, assess, and extend conclusions in a scholarly work should be cited
- Citations should persist and enable access to a fixed version of the data at least as long as the citing work exists
- Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem
Theory + Practice
Use Cases
Use Cases (details)
Operational Constraints?
- Syntax
- Interoperability
- Technical contexts of use
Actors
Simple Proposal
- Semantic:
  - Persistent ID, Author, Title, Version (or at least date)
- Presentation:
  - Any style
  - Grouped with other references
  - Actionable in context
- Policy:
  - Treat data cites as first class
  - If it's needed to support a claim, cite it
  - Offer credit to contributors
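As a sketch of the semantic elements in the simple proposal, the snippet below stores the four fields (persistent ID, author, title, version) separately from how they are presented, and checks a record for missing elements. The class `DataCitation`, the `render` layout, the helper `missing_elements`, and the sample values are hypothetical; the proposal itself allows any presentation style.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Semantic elements of a data citation (field names are illustrative)."""
    persistent_id: str
    authors: tuple
    title: str
    version: str  # a version label, or at least a date

    def render(self):
        """One possible presentation; any reference style could be substituted."""
        names = "; ".join(self.authors)
        return f"{names}. {self.title}. Version {self.version}. {self.persistent_id}"


def missing_elements(fields):
    """Return the required semantic elements that are absent or empty."""
    required = ("persistent_id", "authors", "title", "version")
    return [name for name in required if not fields.get(name)]


# Made-up example record.
cite = DataCitation(
    persistent_id="hdl:0000/example",
    authors=("Doe, Jane", "Roe, Richard"),
    title="Example Survey Data",
    version="2",
)
print(cite.render())
print(missing_elements({"persistent_id": "hdl:0000/example", "title": "Example"}))
```

Keeping the fields separate from `render` mirrors the design principle of distinguishing semantics from presentation: a journal can restyle the citation without losing the identifier or version.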
Discussion
- We cannot depend on a single tool -- plans for integration and interoperability through citations and linking mechanisms, interchange formats, ontology hooks, protocols?
- A large portion of the benefit from data sharing arises from open access… -- how can OpenShare “nudge” researchers toward Open Data?
- Individual researchers cannot ensure long-term access -- how will OpenShapa fit in the institutional ecosystem?
Contact
Micah Altman
futurelib.org