NIH Data Commons Pilot (DCPPC): lessons learned from ... · The commons is the cultural and natural...
Martin Fenner DataCite Technical Director https://orcid.org/0000-0003-1419-2405
NIH Data Commons Pilot (DCPPC): lessons learned from collaboratively building infrastructure to provide global unique identifiers for FAIR biomedical digital objects
WHAT THE TALK IS ABOUT
Overview of the US National Institutes of Health (NIH) Data Commons Pilot.
Our work in the pilot: globally unique identifiers (GUIDs) for FAIR biomedical digital objects.
(Some) lessons learned for Open Science.
WARNING: THIS PRESENTATION CONTAINS PERSONAL VIEWS.
Phil Bourne NIH Associate Director for Data Science 2014-2017
THE COMMONS
The commons is the cultural and natural resources accessible to all members of a society, including natural materials such as air, water,
and a habitable earth. These resources are held in common, not owned privately.
https://en.wikipedia.org/wiki/Commons
PERSPECTIVE
Should biomedical research be like Airbnb?
Vivien R. Bonazzi, Philip E. Bourne*
National Institutes of Health, Bethesda, Maryland, United States of America
Abstract
The thesis presented here is that biomedical research is based on the trusted exchange of services. That exchange would be conducted more efficiently if the trusted software platforms to exchange those services, if they exist, were more integrated. While simpler and narrower in scope than the services governing biomedical research, comparison to existing internet-based platforms, like Airbnb, can be informative. We illustrate how the analogy to internet-based platforms works and does not work and introduce The Commons, under active development at the National Institutes of Health (NIH) and elsewhere, as an example of the move towards platforms for research.
At first glance, the idea that biomedical research has any relationship to an accommodation rental service probably seems absurd. Let us explain.
As we worked through a strategy for improved data management and sustainability as part of our work with the National Institutes of Health (NIH), one of us (PEB) was also in the process of making an apartment available through Airbnb. While having a number of satisfactory rental experiences with Airbnb, he had never been a host (the person renting to travelers) before. Hosting further exemplifies what one experiences as a renter—it succeeds because it is a relationship built upon trust. Using Google Drive or Github are other simple examples of a relationship built on trust. Using the Airbnb software platform, the renter trusts that the accommodation is going to be as advertised; the host trusts that the person renting is not going to trash their property. Host and renter both trust Airbnb to facilitate and manage the transaction. The software platform upon which Airbnb is based makes every effort to gather as much data on both renters and hosts to maximize the sense of trust. The platform is easy to use, and transactions are inexpensive. The service is far from perfect. Issues have arisen concerning how Airbnb can change neighborhoods in areas with high tourist potential [1] or, indeed, of claims of racial bias by hosts [2]. Nevertheless, something is working, since as of February 2016, Airbnb had 60 million users searching 1.5 million listings in 191 countries, with an average of 500,000 stays per night [3]—all leading to a valuation of US$25 billion. So, what does this have to do with biomedical research?
Airbnb supports a trusted service between providers and consumers of that service. Consider biomedical preclinical and clinical research, in which the trusted service involves the exchange of papers, data, software, reagents, and so on. An author publishes a paper having
PLOS Biology | https://doi.org/10.1371/journal.pbio.2001818 | April 7, 2017
OPEN ACCESS
Citation: Bonazzi VR, Bourne PE (2017) Should biomedical research be like Airbnb? PLoS Biol 15(4): e2001818. https://doi.org/10.1371/journal.pbio.2001818
Published: April 7, 2017
Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: BD2K, Big Data to Knowledge; CDISC, Clinical Data Interchange Standards Consortium; FAIR, Findable, Accessible, Interoperable, and Reusable; FDA, the Food and Drug Administration; GNP, gross national product; NIH, National Institutes of Health; OSF, Open Science Framework; SaaS, software as a service
Provenance: Not commissioned; externally peer reviewed.
https://www.slideshare.net/pebourne/the-commons-leveraging-the-power-of-the-cloud-for-big-data/12
What are the PRINCIPLES of The Commons?
• Supports a digital biomedical ecosystem
• Treats products of research – data, software, methods, papers etc. as digital research objects
• Digital research objects exist in a shared virtual space
  Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows
• Digital objects need to conform to FAIR principles:
  • Findable
  • Accessible (and usable)
  • Interoperable
  • Reusable
https://www.slideshare.net/pebourne/the-commons-leveraging-the-power-of-the-cloud-for-big-data/14
The Commons Framework
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing
Software: Services & Tools (scientific analysis tools/workflows)
Data: "Reference" Data Sets, User defined data
Digital Object Compliance
App store/User Interface
TEST CASE DATA SETS
Trans-Omics for Precision Medicine (TOPMed): The TOPMed program collects and pairs whole-genome sequencing (WGS) and other large-scale data (e.g., DNA methylation signatures, RNA expression profiles, metabolite profiles, proteomics) with molecular, behavioral, imaging, environmental, and clinical data from studies focused on heart, lung, blood and sleep (HLBS) disorders.
Alliance of Genome Resources (AGR): The Model Organism Databases (MODs) provide in-depth biological data for intensively studied model organisms.
Genotype-Tissue Expression (GTEx): The GTEx program explores how human genes are expressed and regulated in different tissues, and the role that genomic variation plays in changing gene expression.
Pilot phase 1 ran from April 2018 to October 2018. DataCite participated in Team Sodium, which led KC2 (GUIDs and digital objects).
Team Sodium: Harvard University, California Digital Library, DataCite, EMBL, University of Virginia, University of Illinois
Mercè Crosas @mercecrosas Principal Investigator
Tim Clark Co-PI
Martin Fenner @mfenner Co-PI
Christian Haselgrove
Ian Fore
Philippe Rocca-Serra
Andy Jenkinson
Repository Metadata Expert Group
Leads: Martin Fenner (DataCite), Mercè Crosas (Dataverse)
Force11 Data Citation Implementation Pilot (DCIP)
https://doi.org/10.1101/097196
DCPPC KC2 Outputs
CORE METADATA SPECIFICATION: Based on schema.org for datasets and data catalogs.
OBJECT REGISTRATION SERVICES: Built registration and landing page service that interoperates with full stacks and other services.
GLOBALLY UNIQUE IDENTIFIER (GUID) SERVICES DOCUMENT REGISTRATION OF GUIDS FOR TEST CASE DATA SETS
Collecting and providing metrics data should become part of scholarly infrastructure.
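To make "based on schema.org for datasets and data catalogs" concrete, here is a minimal sketch of what such a metadata record might look like. It is not the DCPPC core metadata specification itself: the helper name `dataset_jsonld`, the choice of fields, and every example value are assumptions for illustration. Only the schema.org type and property names (`Dataset`, `DataCatalog`, `includedInDataCatalog`, `identifier`, `name`, `description`, `url`) come from the public schema.org vocabulary.

```python
import json


def dataset_jsonld(identifier, name, description, catalog_name, catalog_url):
    """Build a schema.org Dataset record linked to its DataCatalog.

    Hypothetical helper: the field selection is an assumption, not the
    DCPPC core metadata specification.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": identifier,
        "identifier": identifier,
        "name": name,
        "description": description,
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
            "url": catalog_url,
        },
    }


# Placeholder values for illustration only; they do not describe a real record.
record = dataset_jsonld(
    identifier="https://doi.org/10.1234/example",
    name="Example test case data set",
    description="Placeholder description of a data set.",
    catalog_name="Example data catalog",
    catalog_url="https://example.org/catalog",
)

# Serialized as JSON-LD, a record like this can be embedded in a landing page.
print(json.dumps(record, indent=2))
```

Embedding a record of this shape in a landing page is what lets generic crawlers and dataset search engines read the metadata without a repository-specific API.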
THINGS WE DID
Lots of discussion and document writing among the various teams to align on identifiers and metadata.
Building of prototype and production services implementing the specifications we agreed on.
Demos using the prototype and production services working with the test case data sets.
Alignment of the GUID services used in the NIH Data Commons.
https://ors.datacite.org/doi:/10.25491/ky16-6894
https://toolbox.google.com/datasetsearch/search?query=ors.datacite.org
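As a rough illustration of how a prefixed GUID can be turned into a resolvable link, the sketch below splits an identifier of the form `prefix:local-id` and maps it to a resolver URL. This is not the code behind the object registration service shown above: the function names and the `RESOLVERS` table are hypothetical, and the example uses the plain form `doi:10.25491/ky16-6894` rather than the exact path syntax in the URL above. The only external fact relied on is that DOIs resolve under https://doi.org/.

```python
from urllib.parse import quote

# Hypothetical resolver table; only the DOI entry is filled in here.
RESOLVERS = {
    "doi": "https://doi.org/",
}


def split_guid(guid):
    """Split 'prefix:local-id' into a lowercase prefix and the local id."""
    prefix, sep, local_id = guid.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed identifier: {guid!r}")
    return prefix.lower(), local_id


def resolver_url(guid):
    """Return the URL under which the identifier should resolve."""
    prefix, local_id = split_guid(guid)
    if prefix not in RESOLVERS:
        raise ValueError(f"no resolver known for prefix {prefix!r}")
    # Percent-encode unusual characters but keep the '/' inside the DOI.
    return RESOLVERS[prefix] + quote(local_id, safe="/")


print(resolver_url("doi:10.25491/ky16-6894"))
```

Keeping the prefix-to-resolver mapping in a table means other identifier types could be added without changing the parsing logic.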
CRITERIA FOR SUCCESS
A. Team Sodium members knew each other well from previous projects.
B. We all engaged in discussions and joint work with each other and the other teams.
C. Monthly in-person meetings of all NIH Data Commons teams, building common understanding and trust.
D. Open to change: goals and workflows changed constantly during the project.
E. Project management, in the team and in the overall project, that kept things on track.
NEXT STEPS
DCPPC Phase 1 ended in October 2018. NIH is currently evaluating the work done and how to move forward. A decision about Phase 2 will come soon.
DCPPC vs. EOSC
NIH DATA COMMONS PILOT, TEAM SODIUM
Pilot.
Intense.
Concrete use cases around test case datasets.
All work is towards a single Commons.
Everything is built for and in the cloud.
EUROPEAN OPEN SCIENCE CLOUD, FREYA PROJECT
Program.
Iterative.
Wide variety of use cases defined by calls and projects.
Loose connection of outputs.
Cloud more as a concept than a technical infrastructure.
LESSONS LEARNED FOR OPEN SCIENCE
LESSONS LEARNED
Open Science has become mainstream.
FAIR is everywhere.
The relationships between people, organizations and projects are redefined.
Collaboration is the key challenge.
There is still so much work to do.
THANK YOU