Whither Small Data?

Whither Small Data? Some Thoughts on Managing Research Data

February 26, 2013

Anita de Waard

VP Research Data Collaborations, Elsevier [email protected]

Why should data be saved?

A. Hold scientists accountable:
– Preserve record of scientific process, provenance
– Enable reproducible research

B. Do better science:
– Use results obtained by others!
– Improve interdisciplinary work

C. Enable long-term access:
– Use for technology transfer; societal/industrial development
– Reward scientists for data creation (credit/attribution)
– Allow public/others insight/use of results

Data Preservation

Data Use

Sustainable Models

Where The Data Goes Now:

> 50 M papers, 2 M scientists, 2 M papers/year

Majority of data (90%?) is stored on local hard drives

Some data (8%?) stored in large, generic data repositories:
– Dryad: 7,631 files
– Dataverse: 0.6 M
– Datacite: 1.5 M

A small portion of data (1-2%?) stored in small, topic-focused data repositories:
– MiRB: 25 k
– PetDB: 1.5 k
– TAIR: 72.1 k
– PDB: 88.3 k
– SedDB: 0.6 k

Key Needs:

INCREASE DATA PRESERVATION

IMPROVE DATA USE

DEVELOP SUSTAINABLE MODELS

A. Data Preservation:

• Issues:
– Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
– Often not in electronic form (maps, images)
– No metadata: when, where, by whom, WHY was this data collected?

• Needs:
– Tools to make data export/storage simple and unavoidable
– Policies that make data sharing mandatory and simple
– Systems that reward data sharing/digitisation
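One way to picture the "no metadata" issue above is a small record that refuses to export until the when, where, by whom, and why are filled in. This is only a minimal sketch: the class name `DatasetMetadata`, its field names, and the JSON sidecar layout are assumptions made for illustration, not a format taken from the talk or from any repository.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record (all field names are assumptions)."""
    collected_at: str   # when: ISO 8601 timestamp
    location: str       # where
    collected_by: str   # by whom
    purpose: str        # WHY the data was collected
    data_file: str      # the file this record describes


def missing_fields(record):
    """Return the names of fields that are empty or whitespace-only."""
    return [name for name, value in asdict(record).items()
            if not str(value).strip()]


def export_sidecar(record):
    """Serialise the record to JSON, refusing to export incomplete metadata."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError("metadata incomplete, missing: " + ", ".join(gaps))
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = DatasetMetadata(
        collected_at="2013-02-26T10:00:00+00:00",
        location="Lab 3",                 # made-up example values
        collected_by="A. Researcher",
        purpose="Pilot recording",
        data_file="trace_001.dat",
    )
    print(export_sidecar(record))
```

Making the export step fail on missing fields is one reading of "simple and unavoidable": the check happens at the moment of saving rather than as a later curation task.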

B. Data Use:

• Issues:
– In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
– It’s expensive to make data usable!
– Domain-specific data stores are not cross-searchable across discipline/national borders

• Needs:
– Standardised metadata systems across systems/repositories and tools to apply them easily
– Integration layers to enable cross-repository queries
– A funding model to enable long-term preservation
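The idea of an integration layer over standardised metadata can be sketched as a mapping from each repository's local field names onto one shared schema, followed by a single query over the normalised records. Everything here is hypothetical: the repository names `repo_a` and `repo_b`, the field maps, and the sample records are made up for illustration and do not describe any real repository's API.

```python
# Hypothetical mapping: shared field name -> each repository's local field name.
FIELD_MAPS = {
    "repo_a": {"title": "title", "creator": "author", "year": "pub_year"},
    "repo_b": {"title": "name", "creator": "investigator", "year": "year"},
}

# Made-up sample records, each in its repository's own idiosyncratic format.
REPOSITORIES = {
    "repo_a": [
        {"title": "Basalt glass chemistry", "author": "Lee", "pub_year": 2011},
        {"title": "Sediment cores", "author": "Okafor", "pub_year": 2012},
    ],
    "repo_b": [
        {"name": "Rat EEG traces", "investigator": "Lee", "year": 2011},
    ],
}


def normalise(repo_name, record):
    """Translate one local record into the shared schema, tagging its source."""
    mapping = FIELD_MAPS[repo_name]
    common = {field: record.get(local) for field, local in mapping.items()}
    common["source"] = repo_name
    return common


def cross_search(repositories, **criteria):
    """Return normalised records from every repository matching all criteria."""
    hits = []
    for repo_name, records in repositories.items():
        for record in records:
            common = normalise(repo_name, record)
            if all(common.get(key) == value for key, value in criteria.items()):
                hits.append(common)
    return hits


if __name__ == "__main__":
    for hit in cross_search(REPOSITORIES, year=2011):
        print(hit["source"], hit["title"])
```

The design choice in this sketch is that repositories keep their own formats and only the thin mapping table has to be agreed on, which is one way a query could cross discipline borders without migrating the data.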

C. Sustainable Models:

• Issues:
– Many successful domain-specific data repositories are running out of funding
– Is adding metadata something you want to keep paying PhD+ scientists to do?
– Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?

• Needs:
– Attribution models for rewarding scientists
– Policies to improve cross-domain and cross-national collaborations
– Funding models to sustain databases long-term

Linking papers to research data:

Database         | Object Linked        | Displayed
Pangaea          | Google Maps Location | Map with location
Protein Databank | PDB Protein          | 3D Protein Visualisation
Genbank          | Gene Name            | NCBI Gene Viewer
Exoplanets +     | Exoplanet name       | Rich information on extrasolar planets
Species +        | Species name         | Rich information on species

Towards ‘wrapping papers around data’:

1. Store metadata on all materials
2. Track the methods while doing them
3. Write papers that ‘wrap around’ this
4. Don’t ‘send’ your papers – just expose them to the outside world
5. Invite reviews; open data to trusted parties, at trusted time
6. Allow apps/tools to integrate

Diagram labels: Calculate, coordinate… / Compile, comment, compare… / Review / Edit / Revise

Sample paper text: Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-
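The step "track the methods while doing them" could be approximated by a log that timestamps each action with its parameters as the experiment runs, so the record exists before the paper is written. This is a sketch under stated assumptions: the class name `MethodLog`, the entry layout, and the example actions are hypothetical and are not taken from the talk.

```python
import json
from datetime import datetime, timezone


class MethodLog:
    """Hypothetical running log of experimental steps (layout is an assumption)."""

    def __init__(self, clock=None):
        # The clock is injectable so the log can be tested with a fixed time.
        self._clock = clock or (lambda: datetime.now(timezone.utc))
        self.steps = []

    def record(self, action, **parameters):
        """Append one step with its sequence number, timestamp, and parameters."""
        entry = {
            "step": len(self.steps) + 1,
            "time": self._clock().isoformat(),
            "action": action,
            "parameters": parameters,
        }
        self.steps.append(entry)
        return entry

    def to_json(self):
        """Serialise the whole log, e.g. to store next to the raw data."""
        return json.dumps(self.steps, indent=2)


if __name__ == "__main__":
    log = MethodLog()
    log.record("prepare sample", temperature_c=37)   # made-up example steps
    log.record("record trace", duration_s=120)
    print(log.to_json())
```

A paper that "wraps around" the data could then point at entries in such a log rather than restating the method from memory.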

Research Data Services:

A. Increase Data Preservation: Help increase the amount and quality of data preserved and shared

B. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability

C. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.

Guiding Principles of RDS:

• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests - ‘service-model’ type:
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas

Three pilots:

1. Carnegie Mellon Electrophysiology Lab:

A. Data Input: Develop a suite of tools to enable simple data capture on a handheld device, add metadata during the experiment, store it with the raw traces, and create a dashboard for viewing

B. Data Use: Integrate with NIF and eagle-i ontologies, enable access through NIF; combine with other sources

2. ImageVault, with Duke CIVM:

A. Data Input: Get 3D image data into common format, resolution, annotated to allow comparison

B. Data Use: View other image data sets & do image analytics

C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics.

3. IEDA Data Rescue Process Study

Data Rescue:
– Identify 3-5 data sets that need to be ‘rescued’
– Work with investigators to identify data sources, formats
– Work with IEDA to define metadata standards, quality checks etc.

Data Rescue Process:
– A group of data wranglers perform ‘electrification’ and annotation
– (Open source) software is developed where needed, to help this process
– We help develop common standards, if needed

3. IEDA Data Rescue Process Study

Data Rescue Process Study: Jointly publish a report on a ‘gap analysis’ comparing where we are now vs. where we need to be, including:
– What we did (data imported, processes/standards created/described; software built; user tests, outcomes)
– Effort involved (time, software, equipment, skills, etc.)
– How easy it would be to scale up; what part of the data out there could be done this way
– Recommendations for tools and skills that are needed, if we want to scale up this process

Summary:

• Three key issues:
A. Data Preservation
B. Data Use
C. Sustainable Models

• Elsevier’s approach:
– Linking data to papers
– Wrap papers around data
– Explore role in the research data space

• Elsevier RDS:
– Three pilots (CMU, Duke, IEDA) to investigate issues
– We’ll report back in about a year!

Questions?

Anita de Waard VP Research Data Collaborations, Elsevier

[email protected]


