
7/25/2014

Data Sharing and Accessibility

Bill Michener, University of New Mexico

“More than a Collection: Applied Uses of Supplemental Data”

CSE Annual Conference

May 2-5, 2014, San Antonio, TX

An archiving crisis?

What happens to the data underlying the millions of articles published every year?

Data entropy

[Figure: Information Content (vertical axis) plotted against Time (horizontal axis). Annotations: Time of publication; Specific details; General details; Accident; Retirement or career change; Death.]

(Michener et al. 1997)


80% of biology data is irretrievable after 20 years

Vines TH et al. (2013) Current Biology DOI:10.1016/j.cub.2013.11.014

Who cares if data are lost?

[Image credit: By Agrant141 (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons]

[Image credit: James Cook, portrait by Nathaniel Dance-Holland, c. 1775, National Maritime Museum, Greenwich]

Stakeholder perspectives

Steve Aulenbach, Scientist [1], USGCRP
Where are the data? How can they be more easily discovered, integrated and analyzed?

Kara Woo, Researcher [2], Washington State Univ.
How do I manage and analyze 60+ years of Lake Baikal data? Reproducible science?

Carly Strasser, Librarian [3], University of California
How can I help researchers better manage and share their data?

[1] University Corporation for Atmospheric Research, US Global Change Research Program: Curation, analysis, and synthesis of global change data
[2] NSF Dimensions of Biodiversity: Lake Baikal responses to global change: the role of genetic, functional and taxonomic diversity in the plankton.
[3] Data Curation Specialist implementing many of the UC Curation Center's services, including the DMPTool and DataShare


The Road Ahead

Journals, publishers, and societies make a difference!

Joint Data Archiving Policy (JDAP)

Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future.

As a condition for publication, data supporting the results in the article should be deposited in an appropriate public archive.

Authors may elect to embargo access to the data for a period up to a year after publication.

Exceptions may be granted at the discretion of the editor, especially for sensitive information.

http://datadryad.org/pages/jdap
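The embargo rule in the policy text can be made concrete with a small date calculation. The sketch below is illustrative only: the function name `embargo_release_date`, the reading of "up to a year" as 365 days, and the sample publication date are assumptions, not part of JDAP or of any Dryad interface. Editor-granted exceptions are not modeled.

```python
from datetime import date, timedelta

# Assumption: "up to a year" is read as at most 365 days after publication.
MAX_EMBARGO_DAYS = 365


def embargo_release_date(published: date, embargo_days: int = 0) -> date:
    """Return the date on which embargoed data would become public.

    Hypothetical helper: rejects embargoes longer than the one-year
    ceiling described in the policy text.
    """
    if embargo_days < 0:
        raise ValueError("embargo_days must be non-negative")
    if embargo_days > MAX_EMBARGO_DAYS:
        raise ValueError("embargo exceeds the one-year maximum")
    return published + timedelta(days=embargo_days)


# Example with an invented publication date and a 180-day embargo.
release = embargo_release_date(date(2014, 5, 2), embargo_days=180)
print(release.isoformat())
```

With `embargo_days=0` (the default) the data become public on the publication date, which matches the choice most authors make according to the uptake figures later in the talk.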


Dryad process

[Figure: article page with the sections "Materials and Methods" and "References" marked]

Dryad impact

Dryad uptake

• >5,000 data packages containing >16,000 files associated with articles in 300 journals
• >200 submissions each month
• >50,000 downloads each month; some data packages have been downloaded more than 10,000 times
• Fewer than 10% of authors chose to embargo their data when this option is allowed by the journal


Why use Dryad rather than Supplementary Online Materials (SOM)?

Criterion | Dryad | SOM
Discoverable: indexed and exposed to both web and bibliographic search engines | ✔ | ✗
Identifiable: DataCite DOIs within articles serve as permanent, resolvable identifiers | ✔ | ✗*
Permanent: processes in place to promote preservation (incl. format migration) | ✔ | ✔/✗**
Curated: quality control by both automated processes and human inspection | ✔ | ✗*
Ease of deposit: streamlined deposit, allowance for large and complex datasets | ✔ | ✔/✗**
Formatted for reuse: support for non-PDF file formats | ✔ | ✔/✗**
Updatable: new versions of data files can be added, metadata can be enhanced | ✔ | ✗
Support for embargoes: can delay release of data in accordance with journal policy | ✔ | ✗
Free reuse: no paywall, clear terms of reuse (all data released under CC Zero) | ✔ | ✔/✗**
Economy of scale: cost efficiency from shared infrastructure | ✔ | ✔/✗**
Alignment to organizational mission: focus on archiving and reuse of scientific data | ✔ | ✗

* A few publisher SOM sites are exceptions to the general rule
** Practices differ among publishers, see Smit (2011), doi:10.1045/january2011-smit

DataDryad.org
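To illustrate what a "permanent, resolvable identifier" means in practice, the following sketch turns a DOI string into a resolver URL. The helper name `doi_to_url` is hypothetical; the only external fact assumed is that DOIs resolve through the public `https://doi.org/` proxy. The example DOI is the Smit (2011) reference cited in the comparison footnote.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a DOI written as 'doi:10.x/y', 'DOI:10.x/y' or '10.x/y'."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' literal; percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(cleaned, safe="/")


print(doi_to_url("doi:10.1045/january2011-smit"))
```

Because the identifier rather than a publisher URL is cited in the article, the link keeps working even if the files move to a different host.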

Sponsoring open data

24 organizations sponsor data deposits in >60 journals

• American Genetic Association
• American Society of Naturalists
• American Society of Plant Biologists
• American Society of Plant Taxonomists
• BMJ Publishing Group
• Botanical Society of America
• British Ecological Society
• Canadian Healthy Oceans Network (CHONe)
• Ecological Society of America
• Elementa: Science of the Anthropocene
• eLife
• European Society for Evolutionary Biology
• German National Library of Medicine
• Nature Publishing Group
• Pensoft
• Society for the Study of Evolution
• Society of Systematic Biologists
• The Genetics Society
• The Palaeontological Association
• The Paleontological Society
• The Royal Society
• University of Rochester
• US Fish and Wildlife Service
• Wiley

Flood of data and repositories


DataONE: Federating data discovery

Three major components for a flexible, scalable, sustainable network

Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data

Investigator Toolkit
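The division of labour between the node types can be sketched as a toy in-memory model. This is not the DataONE API: the class names, methods, node names and sample record below are all hypothetical, and exist only to show Member Nodes holding data while a Coordinating Node keeps the metadata catalog, answers searches, and replicates content.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical repository that retains copies of data objects."""
    name: str
    objects: dict[str, bytes] = field(default_factory=dict)

    def store(self, identifier: str, payload: bytes) -> None:
        self.objects[identifier] = payload


@dataclass
class CoordinatingNode:
    """Hypothetical coordinator: metadata catalog, search, replication."""
    catalog: dict[str, dict] = field(default_factory=dict)
    holders: dict[str, list[MemberNode]] = field(default_factory=dict)

    def register(self, node: MemberNode, identifier: str, metadata: dict) -> None:
        # Retain the complete metadata record and remember which node holds the data.
        self.catalog[identifier] = metadata
        self.holders.setdefault(identifier, []).append(node)

    def search(self, term: str) -> list[str]:
        # Naive substring match over metadata values stands in for real indexing.
        term = term.lower()
        return sorted(
            pid for pid, meta in self.catalog.items()
            if any(term in str(value).lower() for value in meta.values())
        )

    def replicate(self, identifier: str, target: MemberNode) -> None:
        # Copy the object to a second node so the content stays available.
        source = self.holders[identifier][0]
        target.store(identifier, source.objects[identifier])
        self.holders[identifier].append(target)


# Usage with invented node names and an invented record.
node_a, node_b = MemberNode("node-a"), MemberNode("node-b")
cn = CoordinatingNode()
node_a.store("pkg-1", b"year,count\n2008,42\n")
cn.register(node_a, "pkg-1", {"title": "Lake plankton counts", "creator": "Example Lab"})
cn.replicate("pkg-1", node_b)
print(cn.search("plankton"), [n.name for n in cn.holders["pkg-1"]])
```

The point of the design is that a search never has to contact every repository: only metadata is centralised, while the data stay with the institutions that curate them.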

[Figure: data life cycle stages: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]

DataONE: Enabling science through tools and services

From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962


DataONE vision

From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962

DataONE solutions

Semantics-enabled discovery service

Data services

Provenance services

Education

Impact on the Community

Challenge: Using historical data to understand and conserve North American birds

Bird observations and environmental data from >350,000 locations in US integrated and analyzed using High Performance Computing Resources

[Figure labels: eBird; Land Cover; Meteorology; MODIS – Remote sensing data; Model results]

Spatio-Temporal Exploratory Models predict the probability of occurrence of bird species across the United States at a 3 km x 3 km grid.

Potential Uses:
• Examine patterns of migration
• Infer impacts of climate change
• Measure patterns of habitat use
• Measure population trends

Results: Full Life-cycle Distribution Estimates for 300+ Species of North American Birds

[Figure: Occurrence of Indigo Bunting (2008); panels: Jan, Apr, Jun, Sep, Dec]
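The idea of reporting occurrence on a regular 3 km x 3 km grid can be illustrated with a deliberately naive sketch. It is not the Spatio-Temporal Exploratory Model described in the talk: it only bins observations into square cells and reports the fraction of checklists in each cell that recorded the species. It assumes coordinates are already projected to metres, and the function names and sample observations are invented.

```python
from __future__ import annotations

from collections import defaultdict

CELL_SIZE_M = 3_000  # 3 km cell side, matching the grid resolution quoted above


def cell_index(x_m: float, y_m: float, size: int = CELL_SIZE_M) -> tuple[int, int]:
    """Map projected coordinates in metres to a (column, row) grid cell."""
    return int(x_m // size), int(y_m // size)


def occurrence_frequency(observations):
    """Return {cell: fraction of checklists in that cell that detected the species}.

    `observations` is an iterable of (x_m, y_m, detected) tuples.
    """
    seen = defaultdict(int)
    total = defaultdict(int)
    for x_m, y_m, detected in observations:
        cell = cell_index(x_m, y_m)
        total[cell] += 1
        seen[cell] += int(bool(detected))
    return {cell: seen[cell] / total[cell] for cell in total}


# Invented checklists: three fall in cell (0, 0), one in cell (1, 0).
sample = [
    (500.0, 700.0, True),
    (2900.0, 100.0, False),
    (1200.0, 2999.0, True),
    (3100.0, 50.0, False),
]
print(occurrence_frequency(sample))
```

A raw frequency like this is only defined where checklists exist; a predictive model is what allows estimates for cells and dates that were never surveyed, using covariates such as land cover and weather.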

Impact: new conservation approaches on public and private lands

DataONE Team and Sponsors

• Bertram Ludaescher
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Peter Honeyman
• Carole Goble
• Cliff Duke
• Donald Hobern
• Ewa Deelman
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Patricia Cruse, John Kunze
• Dave Vieglais
• Paul Allen, Rick Bonney, Steve Kelling
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin, Mark Schildhauer, Jing Tao
• Suzie Allard, Kimberly Douglass, Laura Moyers, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• David DeRoure
• Jane Greenberg, Ryan Scherle, Todd Vision
• Randy Butler
• Paolo Missier

LEON LEVY FOUNDATION

Visit

Datadryad.org

DataONE.org