Ten Habits of Highly Successful Data
Slides for the http://www.discoveryinformaticsinitiative.org/ workshop, Quebec City, Canada, Sunday July 27, 2014.
Ten habits of highly effective data: Helping your dataset achieve its full potential
Anita de Waard, VP Research Data Collaborations
[email protected] | http://researchdata.elsevier.com/
Quebec City, Canada, July 2014

Who cares about Research Data?

Funding bodies:
- Demonstrate impact
- Guarantee permanence and discoverability
- Avoid fraud
- Avoid double funding
- Serve the general public

Research Management/Library:
- Generate and track outputs
- Comply with mandates
- Ensure availability

Phil Bourne, Associate Director for Data Science at NIH: "Foster an ecosystem that enables biomedical research to be done as a digital enterprise."

Mike Huerta, Associate Director, NLM: "Today, the major public products of science are concepts, written down in papers. But tomorrow, data will be the main product of science. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today."

Researchers:
- Derive credit
- Comply with mandates
- Discover and use
- Cite/acknowledge

Nathan Urban, PI, Urban Lab, CMU, 3/13: "If we can share our data, we can write a paper that will knock everybody's socks off!"

Barbara Ransom, NSF Program Director, Earth Sciences: "We're not going to spend any more money for you to go out and get more data! We want you first to show us how you're going to use all the data we paid y'all to collect in the past!"

What's the problem?
One example: using antibodies and squishy bits. Grad students run experiments and enter the details into their lab notebooks. The PI then tries to make sense of their slides, and writes a paper. End of story.

Maslow's Hierarchy of Needs (for Research Data):
1. Preserved (existing in some form)
2. Archived (long-term & format-independent)
3. Accessible (can be accessed by others)
4. Comprehensible (others can understand data & processes)
5. Discoverable (can be indexed by a system)
6. Reproducible (others can redo experiments)
7. Trusted (validated/checked by reviewers)
8. Citable (able to point & track citations)
9. Usable (allow tools to run on it)

1. Preserve: Data Rescue Challenge
- With IEDA/Lamont: award successful data rescue attempts
- Awarded at AGU 2013; 23 submissions of data that was digitized, preserved, and made available
- Winner: NIMBUS Data Rescue: recovery, reprocessing and digitization of the infrared and visible observations, along with their navigation and formatting. Over 4,000 7-track tapes of global infrared satellite data were read and reprocessed. Nearly 200,000 visible light images were scanned, rectified and navigated. All the resultant data was converted to HDF-5 (NetCDF) format and freely distributed to users from NASA and NSIDC servers (a minimal sketch of this kind of conversion step follows below). This data was then used to calculate monthly sea ice extents for both the Arctic and the Antarctic.
- Conclusion: we (collectively) need to do more of this! How can we fund it?
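To make the conversion step concrete, here is a minimal Python sketch of writing a rescued satellite grid to a self-describing NetCDF file, using the standard netCDF4 library. The variable names, dimensions, and attributes are illustrative assumptions, not the NIMBUS project's actual schema.

```python
# Minimal sketch: store one rescued satellite scan as a self-describing
# NetCDF file, the format family the NIMBUS rescue converted its data to.
# Variable names, dimensions, and attributes are illustrative assumptions.
import numpy as np
from netCDF4 import Dataset

brightness = np.random.rand(180, 360).astype("f4")  # stand-in for one IR scan

with Dataset("nimbus_ir_scan.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    var = nc.createVariable("ir_brightness", "f4", ("lat", "lon"))
    var[:] = brightness
    var.units = "K"  # metadata travels with the data, which is what
    nc.title = "Rescued infrared scan (illustrative example)"  # makes the format long-term and tool-friendly
```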
2. Archive: Olive Project

- CMU CS & Library, funded by a grant from the IMLS; Elsevier is a partner
- Goal: preservation of executable content, nowadays a large part of intellectual output, and very fragile
- Identified a series of software packages and prepared VMs to preserve them
- Does it work? Yes: see the video (1:24)
3. Access: Urban Legend
Part 1: Metadata acquisition
- Step through the experimental process in a series of dropdown menus in a simple web UI
- Can be tailored to the workflow of an individual researcher
- Connected to shared ontologies through a lookup table, managed centrally in the lab (see the sketch below)
- Connects to the data input console (Igor Pro)
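As a rough illustration of the lookup-table idea (not the Urban Lab's actual implementation), the sketch below maps a lab-local dropdown choice to a shared ontology identifier. The table contents are assumptions; the mappings marked "hypothetical" are not verified against the ontologies named.

```python
# Sketch of "shared ontologies through a lookup table": a term chosen from a
# dropdown is resolved to a shared ontology ID via a centrally managed table.
ONTOLOGY_LOOKUP = {
    "mouse":       "NCBITaxon:10090",  # real NCBI Taxonomy ID for Mus musculus
    "patch clamp": "OBI:0002175",      # hypothetical mapping
    "mitral cell": "CL:9999999",       # hypothetical mapping
}

def annotate(field: str, value: str) -> dict:
    """Attach the shared-ontology ID (if any) to one metadata entry."""
    return {
        "field": field,
        "value": value,
        "ontology_id": ONTOLOGY_LOOKUP.get(value.lower()),  # None = needs curation
    }

print(annotate("organism", "Mouse"))
# {'field': 'organism', 'value': 'Mouse', 'ontology_id': 'NCBITaxon:10090'}
```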
4. Comprehend: Urban Legend
Part 2: Data Dashboard
- Access, select and manipulate data (calculate properties, sort and plot); a minimal sketch follows below
- Final goal: interactive figures linked to the data
- Plan to expand to more labs and other data types
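A minimal sketch of the dashboard's core loop: take recorded measurements, derive a property, sort, and plot. The column names and values are invented stand-ins for the lab's actual exports, not the dashboard's real data model.

```python
# Sketch: derive a property from recorded measurements, sort, and plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "cell_id":     ["c01", "c02", "c03"],
    "spike_count": [120, 45, 300],
    "duration_s":  [60.0, 30.0, 60.0],
})
df["spike_rate"] = df["spike_count"] / df["duration_s"]  # derived property

df.sort_values("spike_rate", ascending=False).plot(
    x="cell_id", y="spike_rate", kind="bar", legend=False)
plt.ylabel("spikes / s")
plt.tight_layout()
plt.savefig("dashboard_spike_rates.png")
```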
5. Discover: Data Discovery Index

- NIH is interested in creating a DDI consortium
- There are three places where data is deposited:
  1. Curated sources for a single data type (e.g. Protein Data Bank, VentDB, Hubble Space Data)
  2. Non- or semi-curated sources for different data types (e.g. DataDryad, Dataverse, Figshare)
  3. Tables in papers
- Ways to find these: cross-domain query tools (e.g. NIF, DataONE); searching for papers and following links to data. But how do we find data in papers? (See the heuristic sketch below.)
- Proposal: build prototypes across all of these data sources. This needs NLP and models of data patterns; what else?
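One crude way to frame the "how to find data in papers?" problem is a surface heuristic: scan the text for phrases and DOI-like tokens that often signal deposited data. The patterns below are illustrative assumptions, not a validated model of the kind the slide calls for.

```python
# Crude heuristic for spotting sentences that likely mention deposited data.
# The cue phrases and the example DOI are illustrative, not a real model.
import re

DATA_CUES = re.compile(
    r"(deposited in|available at|accession (?:no\.|number)|doi:\s*10\.\S+)",
    re.IGNORECASE,
)

def likely_data_sentences(text: str) -> list[str]:
    """Return sentences that contain a data-deposition cue."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
            if DATA_CUES.search(s)]

sample = ("We measured 412 cells. Raw traces were deposited in Dryad "
          "(doi:10.5061/dryad.example).")
print(likely_data_sentences(sample))
# ['Raw traces were deposited in Dryad (doi:10.5061/dryad.example).']
```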
6. Reproduce: Resource Identifier Initiative

- A Force11 Working Group to add resource identifiers to articles that are 1) machine readable; 2) free to generate and access; and 3) consistent across publishers and journals
- Authors publishing in participating journals will be asked to provide RRIDs for their resources; these are added to the keyword field (a sketch of extracting them follows below)
- RRIDs will be drawn from: the Antibody Registry, Model Organism Databases, and the NIF Resource Registry
- So far Springer, Wiley, Biomednet, and Elsevier journals have signed up, with 11 journals and more to come. Wide community adoption!
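Because RRIDs follow a regular "RRID:<prefix>_<id>" shape, they are easy to pull out of a methods section or keyword field by machine, which is the point of the initiative. The sketch below shows one way; the regex covers the common shape only, and the example identifiers follow the published format but are not verified against the registries.

```python
# Sketch: extract RRID-style identifiers (e.g. "RRID:AB_2313773") from text
# so they become machine readable. Example IDs are format illustrations only.
import re

RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_\w+)")

methods = ("Cells were stained with an anti-GFAP antibody (RRID:AB_2313773) "
           "and analyzed with open-source software (RRID:SCR_003070).")
print(RRID_PATTERN.findall(methods))
# ['AB_2313773', 'SCR_003070']
```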
7. Trust: Moonrocks

How can we scale up data curation?
- Pilot project with IEDA: lunar geochemistry database; leapfrog & improve curation time
- 1-year pilot, funded by Elsevier
- If spreadsheet columns/headers map to the RDB schema, we can scale up the curation process and move from tables to curated databases! (A sketch of the header-mapping step follows below.)
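A minimal sketch of the header-mapping step, under assumed names: contributed spreadsheet headers are normalized against a curated lookup table, and anything unmapped is flagged for a human curator. The header variants and schema field names are invented for illustration.

```python
# Sketch of "spreadsheet columns/headers map to RDB schema": normalize
# contributed headers to database fields via a curated lookup table.
HEADER_TO_FIELD = {
    "sio2 (wt%)": "sio2_wt_pct",
    "sio2 wt %":  "sio2_wt_pct",
    "sample id":  "sample_id",
    "lat":        "latitude",
}

def map_headers(headers: list[str]) -> dict:
    """Return spreadsheet header -> schema field (None = needs a curator)."""
    return {h: HEADER_TO_FIELD.get(h.strip().lower()) for h in headers}

print(map_headers(["Sample ID", "SiO2 (wt%)", "Unknown column"]))
# {'Sample ID': 'sample_id', 'SiO2 (wt%)': 'sio2_wt_pct', 'Unknown column': None}
```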
8. Cite: Force11 Data Citation Principles

Another Force11 Working Group defined 8 principles and is now seeking endorsement and working on implementation:

1. Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
2. Credit and attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
3. Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.
4. Unique identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community. (See the resolution sketch after this list.)
5. Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials as are necessary for both humans and machines to make informed use of the referenced data.
6. Persistence: Metadata describing the data, and unique identifiers, should persist, even beyond the lifespan of the data they describe.
7. Versioning and granularity: Data citations should facilitate identification of, and access to, different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.
8. Interoperability and flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
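What "machine actionable" (principle 4) means in practice: a DOI resolves not just to a human-readable landing page but, via standard DOI content negotiation, to structured citation metadata. The sketch below uses the requests library and the well-known example DOI from the Crossref content-negotiation documentation.

```python
# Sketch: resolve a DOI to machine-readable citation metadata via standard
# DOI content negotiation (CSL JSON), illustrating principle 4.
import requests

doi = "10.1126/science.169.3946.635"  # Crossref's documented example DOI
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=10,
)
meta = resp.json()
print(meta.get("title"), "|", meta.get("publisher"))
```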
9. Use: Executable Papers

- Result of a challenge to come up with cyberinfrastructure components to enable executable papers
- Pilot in Computer Science journals
- See all the code in the paper; save it, export it; change it and rerun it on the data set (a toy example follows below)
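A toy illustration of what "change it and rerun it on the data set" means, not the pilot's actual infrastructure: the analysis that produced a paper's figure ships with the paper, so a reader can edit a parameter and regenerate the figure from the same data. Everything below is invented for illustration.

```python
# Toy "executable figure": the script that made Figure 1 travels with the
# paper; a reader can change BIN_COUNT and rerun it on the same data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)          # fixed seed -> reproducible figure
data = rng.normal(loc=5.0, scale=2.0, size=1000)

BIN_COUNT = 30                           # a reader can edit this and rerun

plt.hist(data, bins=BIN_COUNT)
plt.xlabel("measurement")
plt.ylabel("count")
plt.title("Figure 1 (re-runnable)")
plt.savefig("figure1.png")
```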
10. Putting it all together:

- Record Metadata: DOI, date, author, institute, etc.
- Experimental Metadata: workflows, samples, settings, reagents, organisms, etc.
- Methods and Equipment: reagents, settings, manufacturer's details, etc.
- Raw Data: direct outputs from equipment: images, traces, spectra, etc.
- Processed Data: mathematically/computationally processed data: correlations, plots, etc.
- Validation: approval, reproduction, selection, quality stamp

(More curation -> more usable. A sketch of the layered record follows below.)

So how can we help
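The layered record can also be read as a data structure. The sketch below expresses it as a Python dataclass; the field names and example values are illustrative assumptions, not a published schema.

```python
# Sketch of the layered record: each published dataset carries record and
# experimental metadata, methods, raw and processed data, and validation.
from dataclasses import dataclass, field

@dataclass
class DataRecord:
    record_metadata: dict        # DOI, date, author, institute, ...
    experimental_metadata: dict  # workflows, samples, settings, reagents, ...
    methods_and_equipment: dict  # reagents, settings, manufacturer details
    raw_data: list               # direct instrument outputs: images, traces
    processed_data: list         # correlations, plots, derived values
    validation: list = field(default_factory=list)  # approvals, quality stamps

record = DataRecord(
    record_metadata={"doi": "10.1234/example", "author": "A. Researcher"},
    experimental_metadata={"organism": "mouse"},
    methods_and_equipment={"rig": "patch clamp"},
    raw_data=["trace_001.abf"],
    processed_data=["spike_rates.csv"],
)
record.validation.append("reviewer-checked")  # more curation -> more usable
```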
So how can we help research data be happier and more productive?

- Group therapy: Force11, W3C and other fora; shared standards help everyone (we play well with others!)
- Financial therapy: find grants to support data-driven processes to support grant proposals; funders like us.
- Creative therapy: innovative collaboration projects that expand everyone's mind; let's put data together and put it through its paces.
- Relationship therapy: different communities working together are likely to produce new efforts! E.g. the Big Mechanism effort: scientists + clinicians + publishers + CS =

Collaborations and discussions gratefully acknowledged:
CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
UCSD: Brian Schottlaender, David Minor, Declan Fleming, Ilya Zaslavsky
NIF: Maryann Martone, Anita Bandrowski
Force11: Ed Hovy, Tim Clark, Ivan Herman, Paul Groth, Maryann Martone, Cameron Neylon, Stephanie Hagstrom
OHSU: Melissa Haendel, Nicole Vasilevsky
Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
MIT: Micah Altman

Thank you!
http://researchdata.elsevier.com/
Anita de Waard, [email protected]