NIH Data Summit - The NIH Data Commons

NIH Data Commons NIH Data Storage Summit October 20, 2017 Vivien Bonazzi Ph.D. Senior Advisor for Data Science (NIH/OD) Project Leader for the NIH Data Commons

Transcript of NIH Data Summit - The NIH Data Commons

Page 1: NIH Data Summit - The NIH Data Commons

NIH Data Commons

NIH Data Storage Summit

October 20, 2017

Vivien Bonazzi Ph.D.

Senior Advisor for Data Science (NIH/OD)

Project Leader for the NIH Data Commons

Page 2: NIH Data Summit - The NIH Data Commons

What’s driving the need for a Data Commons?

Page 3: NIH Data Summit - The NIH Data Commons

Challenges with the current state of data

Generating large volumes of biomedical data

Cheap to generate, costly to store on local servers

Multiple copies of the same data in different locations

Building data resources that cannot be easily found by others

Data resources are not connected to each other and cannot share data or tools

No standards and guidelines on how to share and access data

Page 4: NIH Data Summit - The NIH Data Commons

Convergence of factors

Increasing recognition of the need to support data sharing

Availability of digital technologies and infrastructures that support data at scale

Cloud: data storage, compute and sharing

FAIR – Findable Accessible Interoperable Reusable

Understanding that data is a valuable resource that needs to be sustained

Page 5: NIH Data Summit - The NIH Data Commons

https://gds.nih.gov/

Went into effect January 25, 2015

NCI guidance:

http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data

Requires public sharing of genomic data sets

Page 6: NIH Data Summit - The NIH Data Commons
Page 7: NIH Data Summit - The NIH Data Commons
Page 8: NIH Data Summit - The NIH Data Commons
Page 9: NIH Data Summit - The NIH Data Commons

Findable

Accessible

Interoperable

Reusable

Page 10: NIH Data Summit - The NIH Data Commons

DATA has VALUE

DATA is CENTRAL to the Digital Economy

a signal of the coming Digital Economy

Page 11: NIH Data Summit - The NIH Data Commons

Scientific digital assets

Data

Software

Workflows

Documentation

Journal Articles

Organizations will be defined by their digital assets

Page 12: NIH Data Summit - The NIH Data Commons

The most successful organizations of the future will be those that can leverage their digital assets and transform them into a digital enterprise

Page 13: NIH Data Summit - The NIH Data Commons

Data Commons

Enabling data driven science

Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures

by

driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering

Page 14: NIH Data Summit - The NIH Data Commons

Developing a Data Commons

Treats products of research – data, methods, tools, papers etc. as digital objects

For this presentation: Data = Digital Objects

These digital objects exist in a shared virtual space

Find, Deposit, Manage, Share, and Reuse data, software, metadata and workflows

Digital object compliance through FAIR principles:

Findable

Accessible (and usable)

Interoperable

Reusable
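One way to picture "digital object compliance through FAIR principles" is a record that carries one supporting field per FAIR heading. The following is a minimal Python sketch under stated assumptions: the `DigitalObject` fields and the `fair_gaps` checks are hypothetical illustrations, not an NIH specification or the Data Commons' actual metrics.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for one digital object (data, software, workflow, ...)."""
    guid: str = ""          # globally unique identifier (supports Findable)
    access_url: str = ""    # where the object can be retrieved (supports Accessible)
    data_format: str = ""   # name of a shared/community format (supports Interoperable)
    license: str = ""       # terms under which it may be reused (supports Reusable)
    metadata: dict = field(default_factory=dict)  # descriptive metadata for search


def fair_gaps(obj: DigitalObject) -> list[str]:
    """Return the FAIR headings for which this sketch finds no supporting field."""
    checks = {
        "Findable": bool(obj.guid and obj.metadata),
        "Accessible": bool(obj.access_url),
        "Interoperable": bool(obj.data_format),
        "Reusable": bool(obj.license),
    }
    return [name for name, ok in checks.items() if not ok]
```

Calling `fair_gaps` on an object that has an identifier, metadata and an access URL but no format or license would report the Interoperable and Reusable headings as gaps. Real FAIR assessment is considerably richer than these presence checks.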

Page 15: NIH Data Summit - The NIH Data Commons

The Data Commons

is a platform that allows transactions to occur on FAIR data at scale

Page 16: NIH Data Summit - The NIH Data Commons

The Data Commons Platform

Compute Platform: Cloud

Services: APIs, Containers, Indexing

Software: Services & Tools

scientific analysis tools/workflows

Data

“Reference” Data Sets

User defined data

FAIR

App store/User Interface/Portal

PaaS

SaaS

IaaS

Page 17: NIH Data Summit - The NIH Data Commons

Other Data Commons

Page 18: NIH Data Summit - The NIH Data Commons

Data Commons Engagement

US Government Agencies & EU groups

Page 19: NIH Data Summit - The NIH Data Commons

Interoperability with other Commons

Common goals – democratizing, collaborating & sharing data

Reuse of currently available open source tools which support interoperability: GA4GH, UCSC, GDC, NYGC

May 2017 BioIT Commons Session

Shared open standard APIs for data access and computing

Ability to deploy and compute across multiple cloud environments

Docker containers – Dockstore/Docker registry

Workflow management, sharing and deployment

Discoverability (indexing) objects across cloud commons

Global Unique identifiers

Common user authentication system
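The items on discoverability and global unique identifiers can be sketched as an index that maps one identifier to copies of the same object held in several clouds. This is a toy Python illustration only: `CommonsIndex`, its method names, and the cloud names and URLs used with it are hypothetical, and it does not represent any API named on this slide.

```python
from collections import defaultdict


class CommonsIndex:
    """Toy index mapping a GUID to copies of an object in several clouds."""

    def __init__(self):
        # guid -> {cloud name: URL of the copy held in that cloud}
        self._locations = defaultdict(dict)

    def register(self, guid, cloud, url):
        """Record that `cloud` holds a copy of the object at `url`."""
        self._locations[guid][cloud] = url

    def resolve(self, guid, preferred_cloud=None):
        """Return a URL for the object, favouring `preferred_cloud` if it has a copy."""
        copies = self._locations.get(guid)
        if not copies:
            raise KeyError(f"unknown GUID: {guid}")
        if preferred_cloud in copies:
            return copies[preferred_cloud]
        # Otherwise fall back to the first copy that was registered.
        return next(iter(copies.values()))
```

Because callers refer to objects only by identifier, a workflow running in one cloud can ask for the nearest copy without knowing where the other copies live, which is the point of indexing objects across cloud commons.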

Page 20: NIH Data Summit - The NIH Data Commons

The Good News

Considerable agreement about the general approaches to be taken

Many people are already addressing many of the problems:

Data architectures/platforms

Automated/semi-automated data access/authentication protocols

Common metadata standards and templates

Open tools and software

Instantiation and initial metrics of Findability, Accessibility, Interoperability, and Reusability

Relationships/agreements with Cloud Service Providers that leverage their interest in hosting NIH data

Moving data to the cloud and operating in a cloud environment

Page 21: NIH Data Summit - The NIH Data Commons

The Challenges

A need to “Bring it all Together” – Community endorsement of:

Metadata standards/tools/approaches

Crosswalks between equivalent terms/ontologies

Robust, shared approaches to data access/authentication

Best practices that will enable existing data to become FAIR and will guide generation of future datasets

Rapidly evolving field makes approaches/tools/etc. subject to change – approaches need to be adaptable

Effort is required to adapt data to community standards and move data to the cloud

How much does that cost and how long does it take?

Lack of interoperability between cloud providers

Page 22: NIH Data Summit - The NIH Data Commons

The Challenges

Making data FAIR comes with a cost

How much does it actually cost?

How can we minimize the cost?

How do we determine whether any one set of data warrants the expense?

What is the value added to the data by making it FAIR?

What new science can be achieved?

How can new derived data or new computational approaches be added to the dataset to enrich it?

What are the limitations of FAIRness from dataset to dataset?

Page 23: NIH Data Summit - The NIH Data Commons

Development of a NIH Data Commons Pilot

Page 24: NIH Data Summit - The NIH Data Commons

NIH Data Commons Pilot

allows access, use and sharing of large, high value NIH data in the cloud

Page 25: NIH Data Summit - The NIH Data Commons

NIH Data Commons Pilot

Page 26: NIH Data Summit - The NIH Data Commons

NIH Data Commons Structure


Cloud

Services: APIs, Containers, GUIDs, Indexing, Search, Auth

ACCESS

Scientific analysis tools/workflows

Data

“Reference” Data Sets

TOPMed, GTEx, MODs

FAIR

App store/User Interface/Portal/Workspace

PaaS

SaaS

IaaS

Page 27: NIH Data Summit - The NIH Data Commons

Operationalizing the NIH Data Commons Pilot

Page 28: NIH Data Summit - The NIH Data Commons

NIH Data Commons Pilot: Implementation

Storage, NIH Marketplace, Metrics and Costs

Leveraging and extending relationships established as part of BD2K to provide access to cloud storage and compute

Supplements: TOPMed, GTEx, MODs groups

Prepare (and move) data sets to the cloud for storage, access and scientific use

Work collaboratively with the OT awardees to build towards data access

Data Commons OT Solicitation: Other Transaction

ROA: Research Opportunity Announcement

Developing the fundamental FAIR computational components to support access, use and sharing of the 3 data sets above

Page 29: NIH Data Summit - The NIH Data Commons

NIH Data Commons Pilot Consortium

Page 30: NIH Data Summit - The NIH Data Commons

Establishing a new NIH Marketplace

access to a sustainable cloud infrastructure for data science at NIH

Over the next 18 months, NIH will establish its own NIH Cloud Marketplace

Give Data Commons Pilot Consortium awardees the ability to acquire cloud storage and compute services

Enable ICs to easily acquire cloud storage and storage services from commercial cloud providers, resellers, and integrators

Building on existing relationships with CSPs

Led by CIT with input from Multi-IC working group

Storage, NIH Marketplace, Metrics and Costs

Page 31: NIH Data Summit - The NIH Data Commons

Assessment and Evaluation

What are the costs associated with cloud storage and usage?

What are the business best practices?

How should costs be paid?

Who should pay them?

How should highly used data be managed vs less used data?

Are data producers supportive of this model?

Are users (of all experience levels) able to access and use data effectively?

How will we know if the Data Commons Pilot is successful?

How to adjust to changing needs?

Storage, NIH Marketplace, Metrics and Costs

Page 32: NIH Data Summit - The NIH Data Commons

Supplements to 3 Test Data Set Groups

Administrative Supplements to TOPMed, GTEx and MODs

PIs for each data set were requested to review the OT (ROA) and determine appropriate ways to interact

Prepare (and move) data sets to the cloud for storage, access and scientific use

Make community workflows and cloud based tools of popular analysis pipelines from the 3 datasets accessible

Facilitate discovery and interpretation of the association of human and model organism genotypes and phenotypes

Page 33: NIH Data Summit - The NIH Data Commons

NIH Data Commons: OT ROA

Key Capabilities – modular components

Development of Community Supported FAIR Guidelines and Metrics

Global Unique Identifiers (GUID) for FAIR biomedical data

Open Standard APIs (interoperability & connectivity)

Cloud Agnostic Architecture and Frameworks

Cloud User Workspaces

Research Ethics, Privacy, and Security (AUTH)

Indexing and Search

Scientific Use cases

Training, Outreach, Coordination

Page 34: NIH Data Summit - The NIH Data Commons

Stage 1: 180 day window

Develop MVPs (Minimum Viable Products)

Demonstrations of the Data Commons and its components

Have one copy of each test data set in each cloud provider

Understanding of the process required to achieve this

Draft version of a single standard access control system

Be able to access and use the data through the access control system

Able to use a variety of analysis tools and pipelines on the 3 data sets in the cloud – (driven by scientific use cases)

Have a rudimentary ability to query across test data sets

Display phenotype, expression and variant data aligned with a specific gene or genomic location

Display model organism orthologs for a given set of human genes

Draft FAIR guidelines and metrics

Understand how each of the computational components that support the ability to access data fit together and what standards are needed

Written plans of how and why these demonstrations should be extended into a full Pilot

NIH Data Commons Pilot: Outcomes
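The Stage 1 goals of a rudimentary query across test data sets, and of displaying model organism orthologs for a set of human genes, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the record layout, and all gene, variant, tissue and phenotype values are placeholders, not content from TOPMed, GTEx or the MODs.

```python
def query_by_gene(gene, datasets):
    """Collect the records for `gene` from every data set.

    `datasets` maps a data set name to a list of dict records,
    each assumed to carry a "gene" key.
    """
    return {
        name: [rec for rec in records if rec.get("gene") == gene]
        for name, records in datasets.items()
    }


def orthologs_for(human_genes, ortholog_table):
    """Return a sorted list of model organism orthologs for each human gene."""
    return {gene: sorted(ortholog_table.get(gene, [])) for gene in human_genes}


# Placeholder data only; none of these values come from a real data set.
DATASETS = {
    "variants": [{"gene": "GENE_A", "variant": "placeholder-variant-1"}],
    "expression": [
        {"gene": "GENE_A", "tissue": "placeholder-tissue"},
        {"gene": "GENE_B", "tissue": "placeholder-tissue"},
    ],
    "phenotypes": [{"gene": "GENE_B", "phenotype": "placeholder-phenotype"}],
}
ORTHOLOGS = {"GENE_A": {"mouse:gene_a", "fly:gene_a"}}

hits = query_by_gene("GENE_A", DATASETS)
gene_orthologs = orthologs_for(["GENE_A", "GENE_B"], ORTHOLOGS)
```

A shared gene key is what makes the cross-data-set query possible in this sketch, which is one reason the deck stresses common identifiers and metadata standards.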

Page 35: NIH Data Summit - The NIH Data Commons

Stage 2: 4 year period

To extend and fully implement the Data Commons Pilot based on the design strategies and capabilities developed as part of Stage 1

Review of MVP/demonstrations and written plans from Stage 1

Goals and Milestones with clear and specific outcomes

Evaluate, negotiate, and revise terms of existing awards

Award additional OTs

NIH Data Commons Pilot: Outcomes

Page 36: NIH Data Summit - The NIH Data Commons

Acknowledgments

DPCPSI: Jim Anderson, Betsy Wilder, Vivien Bonazzi, Marie Nierras, Rachel Britt, Sonyka Ngosso, Lora Kutkat, Kristi Faulk, Jen Lewis, Kate Nicholson, Chris Darby, Tonya Scott

NHLBI: Gary Gibbons, Alastair Thomson, Teresa Marquette, Jeff Snyder, Melissa Garcia, Maarten Lerkes, Ann Gawalt, Cashell Jaquish, George Papanicolaou

NHGRI: Eric Green, Valentina di Francesco, Ajay Pillai, Simona Volpi, Ken Wiley

NIAID: Nick Weber

CIT: Andrea Norris

NLM: Patti Brennan

NCBI: Steve Sherry