Page 1: NIH Data Commons Pilot (DCPPC): lessons learned from ... · The commons is the cultural and natural resources accessible to all members of a society, including natural materials such

Martin Fenner DataCite Technical Director https://orcid.org/0000-0003-1419-2405

NIH Data Commons Pilot (DCPPC): lessons learned from collaboratively building infrastructure to provide global unique identifiers for FAIR biomedical digital objects

Page 2

WHAT THE TALK IS ABOUT

Overview of the US National Institutes of Health (NIH) Data Commons Pilot.

Our work in the pilot: globally unique identifiers (GUIDs) for FAIR biomedical digital objects.

(Some) lessons learned for Open Science.

Page 3

WARNING: THIS PRESENTATION CONTAINS PERSONAL VIEWS.

Page 4

Phil Bourne NIH Associate Director for Data Science 2014-2017

Page 5

THE COMMONS

The commons is the cultural and natural resources accessible to all members of a society, including natural materials such as air, water, and a habitable earth. These resources are held in common, not owned privately.

https://en.wikipedia.org/wiki/Commons

Page 6

PERSPECTIVE

Should biomedical research be like Airbnb?

Vivien R. Bonazzi, Philip E. Bourne*

National Institutes of Health, Bethesda, Maryland, United States of America

* [email protected]

Abstract

The thesis presented here is that biomedical research is based on the trusted exchange of services. That exchange would be conducted more efficiently if the trusted software platforms to exchange those services, if they exist, were more integrated. While simpler and narrower in scope than the services governing biomedical research, comparison to existing internet-based platforms, like Airbnb, can be informative. We illustrate how the analogy to internet-based platforms works and does not work and introduce The Commons, under active development at the National Institutes of Health (NIH) and elsewhere, as an example of the move towards platforms for research.

At first glance, the idea that biomedical research has any relationship to an accommodation rental service probably seems absurd. Let us explain.

As we worked through a strategy for improved data management and sustainability as part of our work with the National Institutes of Health (NIH), one of us (PEB) was also in the process of making an apartment available through Airbnb. While having a number of satisfactory rental experiences with Airbnb, he had never been a host (the person renting to travelers) before. Hosting further exemplifies what one experiences as a renter—it succeeds because it is a relationship built upon trust. Using Google Drive or Github are other simple examples of a relationship built on trust. Using the Airbnb software platform, the renter trusts that the accommodation is going to be as advertised; the host trusts that the person renting is not going to trash their property. Host and renter both trust Airbnb to facilitate and manage the transaction. The software platform upon which Airbnb is based makes every effort to gather as much data on both renters and hosts to maximize the sense of trust. The platform is easy to use, and transactions are inexpensive. The service is far from perfect. Issues have arisen concerning how Airbnb can change neighborhoods in areas with high tourist potential [1] or, indeed, of claims of racial bias by hosts [2]. Nevertheless, something is working, since as of February 2016, Airbnb had 60 million users searching 1.5 million listings in 191 countries, with an average of 500,000 stays per night [3]—all leading to a valuation of US$25 billion. So, what does this have to do with biomedical research?

Airbnb supports a trusted service between providers and consumers of that service. Consider biomedical preclinical and clinical research, in which the trusted service involves the exchange of papers, data, software, reagents, and so on. An author publishes a paper having

PLOS Biology | https://doi.org/10.1371/journal.pbio.2001818 April 7, 2017 1 / 9

OPEN ACCESS

Citation: Bonazzi VR, Bourne PE (2017) Should biomedical research be like Airbnb? PLoS Biol 15(4): e2001818. https://doi.org/10.1371/journal.pbio.2001818

Published: April 7, 2017

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: BD2K, Big Data to Knowledge; CDISC, Clinical Data Interchange Standards Consortium; FAIR, Findable, Accessible, Interoperable, and Reusable; FDA, the Food and Drug Administration; GNP, gross national product; NIH, National Institutes of Health; OSF, Open Science Framework; SaaS, software as a service

Provenance: Not commissioned; externally peer reviewed.

Published: April 7, 2017 https://doi.org/10.1371/journal.pbio.2001818

Page 7

https://www.slideshare.net/pebourne/the-commons-leveraging-the-power-of-the-cloud-for-big-data/12

What are the PRINCIPLES of The Commons?

• Supports a digital biomedical ecosystem

• Treats products of research – data, software, methods, papers etc. as digital research objects

• Digital research objects exist in a shared virtual space: Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows

• Digital objects need to conform to FAIR principles:
  • Findable
  • Accessible (and usable)
  • Interoperable
  • Reusable

Page 8

https://www.slideshare.net/pebourne/the-commons-leveraging-the-power-of-the-cloud-for-big-data/14

The Commons Framework

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: "Reference" Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Page 9

TEST CASE DATA SETS

Trans-Omics for Precision Medicine (TOPMed)
The TOPMed program collects and pairs whole-genome sequencing (WGS) and other large-scale data (e.g., DNA methylation signatures, RNA expression profiles, metabolite profiles, proteomics) with molecular, behavioral, imaging, environmental, and clinical data from studies focused on heart, lung, blood and sleep (HLBS) disorders.

Alliance of Genome Resources (AGR)
The Model Organism Databases (MODs) provide in-depth biological data for intensively studied model organisms.

Genotype-Tissue Expression (GTEx)
The GTEx program explores how human genes are expressed and regulated in different tissues, and the role that genomic variation plays in changing gene expression.

Page 10

Pilot phase 1 ran from April 2018 to October 2018. DataCite participated in Team Sodium, which led Key Capability 2 (KC2): GUIDs and digital objects.

Page 11

Mercè Crosas @mercecrosas Principal Investigator

Tim Clark Co-PI

Martin Fenner @mfenner Co-PI

Team Sodium: Harvard University, California Digital Library, DataCite, EMBL, University of Virginia, University of Illinois

Page 12

Force11 Data Citation Implementation Pilot (DCIP)

Repository Metadata Expert Group

Leads: Martin Fenner (DataCite), Mercè Crosas (Dataverse)

Christian Haselgrove

Ian Fore

Philippe Rocca-Serra

Andy Jenkinson

Page 13

https://doi.org/10.1101/097196

Page 14

DCPPC KC2 Outputs

CORE METADATA SPECIFICATION
Based on schema.org for datasets and data catalogs.

OBJECT REGISTRATION SERVICES
Built registration and landing page service that interoperates with full stacks and other services.

GLOBALLY UNIQUE IDENTIFIER (GUID) SERVICES DOCUMENT
Alignment of the GUID services used in the NIH Data Commons.

REGISTRATION OF GUIDS FOR TEST CASE DATA SETS

Collecting and providing metrics data should become part of scholarly infrastructure.
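
To make the schema.org-based core metadata more concrete, here is a minimal sketch of what a dataset record with a DOI as its GUID could look like as JSON-LD. It is an illustration only, not the DCPPC specification: the helper name `dataset_jsonld`, the choice of properties, and the title, description and catalog values are all assumptions. Only the DOI is taken from the landing page URL shown later in the talk.

```python
import json


def dataset_jsonld(doi, name, description, catalog_name):
    """Build a minimal schema.org Dataset record as a JSON-LD dictionary.

    Hypothetical helper: the property selection is an assumption and
    does not reproduce the actual DCPPC core metadata specification.
    """
    doi_url = f"https://doi.org/{doi}"
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": doi_url,          # the DOI, expressed as a URL, serves as the GUID
        "identifier": doi_url,
        "name": name,
        "description": description,
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
    }


# Placeholder values; only the DOI appears in the talk.
record = dataset_jsonld(
    doi="10.25491/ky16-6894",
    name="Placeholder dataset title",
    description="Placeholder description, not the real metadata.",
    catalog_name="Placeholder data catalog",
)
print(json.dumps(record, indent=2))
```

Embedding a record like this in a landing page is one common way to make a dataset discoverable by crawlers that understand schema.org markup.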

Page 15

THINGS WE DID

Lots of discussion and document writing among the various teams to align on identifiers and metadata.

Building of prototype and production services implementing the specifications we agreed on.

Demos using the prototype and production services working with the test case data sets.

Page 16

https://ors.datacite.org/doi:/10.25491/ky16-6894

Page 17

https://toolbox.google.com/datasetsearch/search?query=ors.datacite.org

Page 18

Page 19

CRITERIA FOR SUCCESS

A. Team Sodium members knew each other well from previous projects.
B. We all engaged in discussions and joint work with each other and the other teams.
C. Monthly in-person meetings of all NIH Data Commons teams, building common understanding and trust.
D. Openness to change: goals and workflows changed constantly during the project.
E. Project management, within the team and across the overall project, that kept things on track.

Page 20

NEXT STEPS

DCPPC Phase 1 ended in October 2018. NIH is currently evaluating the work done and how to move forward. A decision about Phase 2 is expected soon.

Page 21

DCPPC vs. EOSC

NIH DATA COMMONS PILOT: TEAM SODIUM
Pilot.
Intense.
Concrete use cases around test case datasets.
All work is towards a single Commons.
Everything is built for and in the cloud.

EUROPEAN OPEN SCIENCE CLOUD: FREYA PROJECT
Program.
Iterative.
Wide variety of use cases defined by calls and projects.
Loose connection of outputs.
Cloud more as a concept than a technical infrastructure.

Page 22

LESSONS LEARNED FOR OPEN SCIENCE

Page 23

LESSONS LEARNED

Open Science has become mainstream.

Page 24

LESSONS LEARNED

FAIR is everywhere.

Page 25

LESSONS LEARNED

The relationships between people, organizations and projects are redefined.

Page 26

LESSONS LEARNED

Collaboration is the key challenge.

Page 27

LESSONS LEARNED

There is still so much work to do.

Page 28

THANK YOU