7/25/2014
Data Sharing and
Accessibility
Bill Michener, University of New Mexico
“More than a Collection: Applied Uses
of Supplemental Data”
CSE Annual Conference
May 2-5, 2014, San Antonio, TX
An archiving crisis?
What happens to the
data underlying the
millions of articles
published every year?
Data entropy
[Figure: Information Content (vertical axis) plotted against Time (horizontal axis). Labels along the curve: Time of publication; Specific details; General details; Accident; Retirement or career change; Death. (Michener et al. 1997)]
80% of biology data is irretrievable after
20 years
Vines TH et al. (2013) Current Biology DOI:10.1016/j.cub.2013.11.014
Who cares if data are lost?
By Agrant141 (Own work) [CC-BY-SA-3.0
(http://creativecommons.org/licenses/by-sa/3.0)],
via Wikimedia Commons
James Cook, portrait by Nathaniel
Dance-Holland, c. 1775, National
Maritime Museum, Greenwich
Stakeholder perspectives
Steve Aulenbach, Scientist [1], USGCRP
Where are the data? How can they be more easily discovered, integrated and analyzed?

Kara Woo, Researcher [2], Washington State Univ.
How do I manage and analyze 60+ years of Lake Baikal data? Reproducible science?

Carly Strasser, Librarian [3], University of California
How can I help researchers better manage and share their data?

[1] University Corporation for Atmospheric Research, US Global Change Research Program: Curation, analysis, and synthesis of global change data
[2] NSF Dimensions of Biodiversity: Lake Baikal responses to global change: the role of genetic, functional and taxonomic diversity in the plankton
[3] Data Curation Specialist implementing many of the UC Curation Center's services, including the DMPTool and DataShare
The Road Ahead
Journals, publishers, and societies make a difference!
Joint Data Archiving Policy (JDAP)
Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future.
As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive.
Authors may elect to embargo access to the data for a period up to a year after publication.
Exceptions may be granted at the discretion of the editor, especially for sensitive information.
http://datadryad.org/pages/jdap
Dryad process
Materials and Methods
References
Dryad impact
Dryad uptake
• >5,000 data packages containing >16,000 files associated with articles in 300 journals
• >200 submissions each month
• >50,000 downloads each month; some data packages have been downloaded more than 10,000 times
• Fewer than 10% of authors chose to embargo their data when this option is allowed by the journal
Why use Dryad rather than
Supplementary Online Materials?
Comparison of Dryad and Supplementary Online Materials (SOM):
Discoverable: indexed and exposed to both web and bibliographic search engines. Dryad: ✔; SOM: ✗
Identifiable: DataCite DOIs within articles serve as permanent, resolvable identifiers. Dryad: ✔; SOM: ✗*
Permanent: processes in place to promote preservation (incl. format migration). Dryad: ✔; SOM: ✔/✗**
Curated: quality control by both automated processes and human inspection. Dryad: ✔; SOM: ✗*
Ease of deposit: streamlined deposit, allowance for large and complex datasets. Dryad: ✔; SOM: ✔/✗**
Formatted for reuse: support for non-PDF file formats. Dryad: ✔; SOM: ✔/✗**
Updatable: new versions of data files can be added, metadata can be enhanced. Dryad: ✔; SOM: ✗
Support for embargoes: can delay release of data in accordance with journal policy. Dryad: ✔; SOM: ✗
Free reuse: no paywall, clear terms of reuse (all data released under CC Zero). Dryad: ✔; SOM: ✔/✗**
Economy of scale: cost efficiency from shared infrastructure. Dryad: ✔; SOM: ✔/✗**
Alignment to organizational mission: focus on archiving and reuse of scientific data. Dryad: ✔; SOM: ✗
* A few publisher SOM sites are exceptions to the general rule
** Practices differ among publishers, see Smit (2011), doi:10.1045/january2011-smit
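To illustrate what "resolvable identifier" means in practice, here is a minimal sketch that turns a DOI string into a link. It assumes the public https://doi.org/ resolver convention; the function name doi_to_url is hypothetical and is not part of Dryad or DataCite. The example DOI is the Smit (2011) reference cited in the footnote above.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver accepts URLs of the form https://doi.org/<doi>.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a DOI, tolerating a leading 'doi:' label."""
    cleaned = doi.strip()
    for prefix in ("doi:", "DOI:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return DOI_RESOLVER + quote(cleaned.strip(), safe="/")


print(doi_to_url("doi:10.1045/january2011-smit"))
```

A link built this way keeps pointing at the dataset even if the repository reorganizes its own web pages, which is the property the "Identifiable" row describes.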
DataDryad.org
Sponsoring open data
24 organizations sponsor data deposits in >60 journals
American Genetic Association
American Society of Naturalists
American Society of Plant Biologists
American Society of Plant Taxonomists
BMJ Publishing Group
Botanical Society of America
British Ecological Society
Canadian Healthy Oceans Network (CHONe)
Ecological Society of America
Elementa: Science of the Anthropocene
eLife
European Society for Evolutionary Biology
German National Library of Medicine
Nature Publishing Group
Pensoft
Society for the Study of Evolution
Society of Systematic Biologists
The Genetics Society
The Palaeontological Association
The Paleontological Society
The Royal Society
University of Rochester
US Fish and Wildlife Service
Wiley
Flood of data and repositories
DataONE: Federating data discovery
Three major components for a flexible, scalable, sustainable network
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
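The division of labor between Coordinating Nodes and Member Nodes can be sketched as a toy model. This is only an illustration under stated assumptions: the class names, method names, and identifiers below are hypothetical and do not reflect the actual DataONE software or API. The idea shown is that member nodes hold the data, while a coordinating node keeps the metadata catalog, answers searches, and tracks replicas.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical member node: holds data objects for one institution."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> content

    def store(self, identifier, content):
        self.objects[identifier] = content


@dataclass
class CoordinatingNode:
    """Hypothetical coordinating node: metadata catalog, search, replication."""
    catalog: dict = field(default_factory=dict)    # identifier -> metadata text
    locations: dict = field(default_factory=dict)  # identifier -> node names

    def register(self, node, identifier, metadata, content):
        # The data stay on the member node; only metadata are cataloged here.
        node.store(identifier, content)
        self.catalog[identifier] = metadata
        self.locations.setdefault(identifier, []).append(node.name)

    def replicate(self, identifier, source, target):
        # Copy an object to a second member node to protect availability.
        target.store(identifier, source.objects[identifier])
        self.locations[identifier].append(target.name)

    def search(self, term):
        # Case-insensitive substring match over the metadata catalog.
        term = term.lower()
        return sorted(i for i, m in self.catalog.items() if term in m.lower())


cn = CoordinatingNode()
node_a = MemberNode("node-a")
node_b = MemberNode("node-b")
cn.register(node_a, "id-1", "Lake plankton counts", b"plankton data")
cn.register(node_b, "id-2", "Bird observations", b"bird data")
cn.replicate("id-1", node_a, node_b)

print(cn.search("plankton"))
print(cn.locations["id-1"])
```

In this sketch a search never touches the member nodes directly; it consults only the catalog, which is why a single query can cover many independent repositories.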
Investigator Toolkit
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
DataONE: Enabling science through tools and services
From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962
DataONE vision
From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962
DataONE solutions
Semantics-enabled
discovery service
Data services
Provenance services
Education
Impact on the Community
Challenge: Using historical data to understand and conserve North American birds
Inputs: eBird; Land Cover; Meteorology; MODIS – Remote sensing data
Bird observations and environmental data from >350,000 locations in US integrated and analyzed using High Performance Computing Resources
Model results: Spatio-Temporal Exploratory Models predict the probability of occurrence of bird species across the United States at a 3 km x 3 km grid.
Potential Uses:
• Examine patterns of migration
• Infer impacts of climate change
• Measure patterns of habitat use
• Measure population trends
[Figure: Occurrence of Indigo Bunting (2008); panels: Jan, Apr, Jun, Sep, Dec]
Results: Full Life-cycle Distribution Estimates for
300+ Species of North American Birds
Impact: new conservation approaches
on public and private lands
DataONE Team and Sponsors
• Bertram Ludaescher
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Peter Honeyman
• Carole Goble
• Cliff Duke
• Donald Hobern
• Ewa Deelman
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Patricia Cruse, John Kunze
• Dave Vieglais
• Paul Allen, Rick Bonney, Steve Kelling
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin, Mark Schildhauer, Jing Tao
• Suzie Allard, Kimberly Douglass, Laura Moyers, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• David DeRoure
• Jane Greenberg, Ryan Scherle, Todd Vision
LEON LEVY FOUNDATION
• Randy Butler
• Paolo Missier
Visit
Datadryad.org
DataONE.org