Whither Small Data?
![Page 1: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/1.jpg)
Whither Small Data? Some Thoughts on Managing Research Data
February 26, 2013
Anita de Waard
VP Research Data Collaborations, Elsevier
![Page 2: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/2.jpg)
Why should data be saved?
A. Hold scientists accountable:
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Do better science:
– Use results obtained by others!
– Improve interdisciplinary work
C. Enable long-term access:
– Use for technology transfer; societal/industrial development
– Reward scientists for data creation (credit/attribution)
– Allow public/others insight/use of results
Data Preservation
Data Use
Sustainable Models
![Page 3: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/3.jpg)
![Page 4: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/4.jpg)
Where The Data Goes Now:
> 50 M papers; 2 M scientists; 2 M papers/year
• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) stored in large, generic data repositories:
– Dryad: 7,631 files
– Dataverse: 0.6 M
– Datacite: 1.5 M
• A small portion of data (1-2%?) stored in small, topic-focused data repositories:
– MiRB: 25 k
– PetDB: 1.5 k
– TAIR: 72.1 k
– PDB: 88.3 k
– SedDB: 0.6 k
![Page 5: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/5.jpg)
Key Needs:
• INCREASE DATA PRESERVATION
• IMPROVE DATA USE
• DEVELOP SUSTAINABLE MODELS
![Page 6: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/6.jpg)
A. Data Preservation:
• Issues:
– Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
– Often not in electronic form (maps, images)
– No metadata: when, where, by whom, WHY was this data collected?
• Needs:
– Tools to make data export/storage simple and unavoidable
– Policies that make data sharing mandatory and simple
– Systems that reward data sharing/digitisation
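The "no metadata" issue above can be made concrete with a small sketch. This is only an illustration, assuming a hypothetical `DatasetMetadata` record with one field per question on the slide (when, where, by whom, why); the field names and the `export_allowed` check are invented for this example and do not describe any existing tool.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record: the four questions from the slide."""
    collected_on: Optional[date] = None  # when
    location: str = ""                   # where
    collected_by: str = ""               # by whom
    purpose: str = ""                    # why


def missing_fields(record: DatasetMetadata) -> List[str]:
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def export_allowed(record: DatasetMetadata) -> bool:
    """An export tool could refuse to store data until every field is filled."""
    return not missing_fields(record)
```

A tool built this way would make the metadata step hard to skip: `missing_fields` tells the researcher exactly what is still needed before the data can be stored.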
![Page 7: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/7.jpg)
B. Data Use:
• Issues:
– In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
– It’s expensive to make data useable!
– Domain-specific data stores are not cross-searchable across discipline/national borders
• Needs:
– Standardised metadata systems across systems/repositories and tools to apply them easily
– Integration layers to enable cross-repository queries
– A funding model to enable long-term preservation
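As a rough sketch of what an integration layer for cross-repository queries might look like: each repository's native field names are mapped onto one shared schema, and a query runs against the normalised records. The repository names (`repo_a`, `repo_b`), the field names, and the functions below are all hypothetical and stand in for no real repository API.

```python
# Hypothetical mapping from each repository's native field names
# to a shared schema (title, author, collected).
FIELD_MAPS = {
    "repo_a": {"title": "title", "creator": "author", "date": "collected"},
    "repo_b": {"name": "title", "investigator": "author", "sampled_on": "collected"},
}


def normalise(repo, record):
    """Translate one native record into the shared schema."""
    mapping = FIELD_MAPS[repo]
    return {common: record[native]
            for native, common in mapping.items()
            if native in record}


def cross_search(repositories, **criteria):
    """Return (repository, normalised record) pairs matching all criteria."""
    hits = []
    for repo, records in repositories.items():
        for record in records:
            common = normalise(repo, record)
            if all(common.get(key) == value for key, value in criteria.items()):
                hits.append((repo, common))
    return hits
```

With this in place, a call such as `cross_search(repositories, author="Smith")` would search every repository at once, even though one stores the name under `creator` and the other under `investigator`. Exact matching is used only to keep the sketch short.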
![Page 8: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/8.jpg)
C. Sustainable Models:
• Issues:
– Many successful domain-specific data repositories are running out of funding
– Is adding metadata something you want to keep paying PhD+ scientists to do?
– Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?
• Needs:
– Attribution models for rewarding scientists
– Policies to improve cross-domain and cross-national collaborations
– Funding models to sustain databases long-term
![Page 9: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/9.jpg)
Linking papers to research data:

| Database | Object Linked | Displayed |
| --- | --- | --- |
| Pangaea | Google Maps Location | Map with location |
| Protein Databank | PDB Protein | 3D Protein Visualisation |
| Genbank | Gene Name | NCBI Gene Viewer |
| Exoplanets + | Exoplanet name | Rich information on extrasolar planets |
| Species + | Species name | Rich information on species |
![Page 10: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/10.jpg)
Towards ‘wrapping papers around data’
1. Store metadata on all materials
2. Track the methods while doing them
3. Write papers that ‘wrap around’ this
4. Don’t ‘send’ your papers – just expose them to the outside world
5. Invite reviews; open data to trusted parties, at trusted time
6. Allow apps/tools to integrate
Diagram labels: Review, Edit, Revise; Compile, comment, compare…; Calculate, coordinate…
Example paper text: Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-
![Page 11: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/11.jpg)
Research Data Services:
A. Increase Data Preservation: Help increase the amount and quality of data preserved and shared
B. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability
C. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.
![Page 12: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/12.jpg)
Guiding Principles of RDS:
• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas
![Page 13: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/13.jpg)
Three pilots:
1. Carnegie Mellon Electrophysiology Lab:
A. Data Input: Develop a suite of tools to enable simple data capturing on a handheld device, add metadata during experiment, store with raw traces and create dashboard for viewing
B. Data Use: Integrate with NIF and eagle-I ontologies, enable access through NIF; combine with other sources
2. ImageVault, with Duke CIVM:
A. Data Input: Get 3D image data into common format, resolution, annotated to allow comparison
B. Data Use: View other image data sets & do image analytics
C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics.
![Page 14: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/14.jpg)
3. IEDA Data Rescue Process Study
Data Rescue:
– Identify 3-5 data sets that need to be ‘rescued’
– Work with investigators to identify data sources, formats
– Work with IEDA to define metadata standards, quality checks etc.
Data Rescue Process:
– A group of data wranglers perform ‘electrification’ and annotation
– (Open source) software is developed where needed, to help this process
– We help develop common standards, if needed
![Page 15: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/15.jpg)
3. IEDA Data Rescue Process Study
Data Rescue Process Study: Jointly publish a report on a ‘gap analysis’ comparing where we are now vs. where we need to be, including:
– What we did (data imported, processes/standards created/described; software built; user tests, outcomes)
– Effort involved (time, software, equipment, skills, etc.)
– How easy it would be to scale up; what part of data out there could be done this way.
– Recommendations for tools and skills that are needed, if we want to scale up this process
![Page 16: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/16.jpg)
Summary:
• Three key issues:
A. Data Preservation
B. Data Use
C. Sustainable Models
• Elsevier’s approach:
– Linking data to papers
– Wrap papers around data
– Explore role in the research data space
• Elsevier RDS:
– Three pilots (CMU, Duke, IEDA) to investigate issues
– We’ll report back in about a year!
![Page 17: Whither Small Data?](https://reader036.fdocuments.net/reader036/viewer/2022062418/554e790fb4c9054a698b4f82/html5/thumbnails/17.jpg)
Questions?
Anita de Waard VP Research Data Collaborations, Elsevier