BD2K and the Commons : ELIXR All Hands

BD2K & the Commons @ NIH Vivien Bonazzi, Ph.D. Senior Advisor for Data Science Technologies Office of Data Science (ADDS) National Institutes of Health

Transcript of BD2K and the Commons : ELIXR All Hands

Page 1: BD2K and the Commons : ELIXR All Hands

BD2K & the Commons @ NIH

Vivien Bonazzi, Ph.D.

Senior Advisor for Data Science Technologies, Office of Data Science (ADDS), National Institutes of Health

Page 2: BD2K and the Commons : ELIXR All Hands
Page 3: BD2K and the Commons : ELIXR All Hands

A Digital Story

Page 4: BD2K and the Commons : ELIXR All Hands
Page 5: BD2K and the Commons : ELIXR All Hands

NIH Data

Page 6: BD2K and the Commons : ELIXR All Hands

NIH Data NIH Data

Page 7: BD2K and the Commons : ELIXR All Hands
Page 8: BD2K and the Commons : ELIXR All Hands
Page 9: BD2K and the Commons : ELIXR All Hands
Page 10: BD2K and the Commons : ELIXR All Hands
Page 11: BD2K and the Commons : ELIXR All Hands

US Government Memo - Increasing Access to Results of Federally Funded Scientific Research

In Feb 2013 the US OSTP issued a memo calling for all US Federal Agencies to make digital assets from federally funded research available

OSTP - Office of Science and Technology Policy at the White House

Public Access to Data Memo: http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

Page 12: BD2K and the Commons : ELIXR All Hands

US Government Memo - Increasing Access to Results of Federally Funded Scientific Research

Each agency’s public access plan shall:

Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds while:

i) protecting confidentiality and personal privacy

ii) recognizing proprietary interests, business confidential information, and intellectual property rights and avoiding significant negative impact on intellectual property rights, innovation, and U.S. competitiveness, and

iii) preserving the balance between the relative value of long-term preservation and access and the associated cost and administrative burden.

Page 13: BD2K and the Commons : ELIXR All Hands

NIH Response

In response to the incredible growth of large biomedical (digital) datasets, the Director of NIH established a special Data and Informatics Working Group (DIWG)

http://acd.od.nih.gov/diwg.htm

Page 14: BD2K and the Commons : ELIXR All Hands

NIH Response

Establish new data science research and training programs, fulfilling the recommendation of the ACD WG report

Big Data to Knowledge (BD2K) - 2013: http://datascience.nih.gov/bd2k

Establish a new position: NIH Associate Director of Data Science (ADDS), Phil Bourne – 2014

Page 15: BD2K and the Commons : ELIXR All Hands

CHAPTER 3

Page 16: BD2K and the Commons : ELIXR All Hands

BD2K – Big Data to Knowledge

• Expanding training programs in data science
• Finding and sharing data & software through indexes
• Targeted software tools and methods: data wrangling, privacy and security of data, data repurposing, applications of metadata
• Advancing Big Data methods, tools and applications (BD2K Centers of Excellence)

https://datascience.nih.gov/bd2k/funded-programs

Page 17: BD2K and the Commons : ELIXR All Hands

To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.

Page 18: BD2K and the Commons : ELIXR All Hands

NIH ADDS Mission Statement

To use data science to foster an Open Digital Ecosystem that will accelerate efficient, cost-effective biomedical research to enhance health, lengthen life, and reduce illness and disability

Page 19: BD2K and the Commons : ELIXR All Hands

Enabling digital Ecosystems via a Commons & BD2K

Leveraging BD2K efforts

Harnessing e-infrastructures - Public-private partnerships & Interagency collaborations

Collaborating with external communities

Page 20: BD2K and the Commons : ELIXR All Hands

Commons: Achieving a Balance

Biomedical Use Cases + Data Science + e-infrastructures

Supporting open biomedical science using robust, scalable and flexible digital technologies

In collaboration with global communities

Page 21: BD2K and the Commons : ELIXR All Hands

What are the PRINCIPLES of a Commons?

• Supports a digital biomedical ecosystem
• Treats products of research (data, software, methods, papers, etc.) as digital objects
• Digital objects exist in a shared virtual space
• Find, deposit, manage, share and reuse data, software, metadata and workflows
• Digital objects need to conform to FAIR principles: Findable, Accessible (and usable), Interoperable, Reusable

Page 22: BD2K and the Commons : ELIXR All Hands

Developing a Commons Framework

• Exploits new scalable computing technologies (cloud)
• Makes digital objects FAIR: Indexable/Findable, Accessible & Usable, Interoperable, Reusable
• Simplifies access, sharing and interoperability of digital objects such as data, software, metadata and workflows
• Provides physical or logical access to digital objects
• Provides understanding and accounting of usage patterns
• Is potentially more cost effective given digital growth
• Gives currency to digital objects and the people who develop and support them

Page 23: BD2K and the Commons : ELIXR All Hands

Commons Framework

Compute Platform: Cloud or SC Facilities

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

https://datascience.nih.gov/commons

Page 24: BD2K and the Commons : ELIXR All Hands

Commons Framework

Compute Platform: Cloud or SC Facilities

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

IaaS

PaaS

SaaS

https://datascience.nih.gov/commons

Page 25: BD2K and the Commons : ELIXR All Hands

Commons: Digital Object Compliance

Attributes of digital research objects in the Commons

Initial Phase
• Unique digital object identifiers, resolvable to the original authoritative source
• Machine readable
• A minimal set of searchable metadata
• Physically available in a cloud-based Commons provider
• Clear access rules (especially important for human subjects data)
• An entry (with metadata) in one or more indices

Future Phases
• Standard, community-based unique digital object identifiers
• Conform to community-approved standard metadata and ontologies for enhanced searching
• Digital objects accessible via open standard APIs
• Are physically and logically available to the Commons
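The initial-phase attributes can be read as a checklist. The following is a minimal sketch of such a check, not an NIH specification: the `DigitalObject` fields, the `REQUIRED_METADATA` set and the `compliance_gaps` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical "minimal set of searchable metadata"; the slides do not define it.
REQUIRED_METADATA = ("title", "description")


@dataclass
class DigitalObject:
    """Illustrative record for one digital research object (assumed fields)."""
    identifier: str = ""           # unique ID resolvable to the authoritative source
    machine_readable: bool = False
    metadata: dict = field(default_factory=dict)
    cloud_location: str = ""       # where the object lives at a Commons provider
    access_rules: str = ""         # e.g. "open" or "controlled"
    indices: list = field(default_factory=list)


def compliance_gaps(obj: DigitalObject) -> list:
    """Return the initial-phase attributes that the object does not yet satisfy."""
    gaps = []
    if not obj.identifier:
        gaps.append("missing unique identifier")
    if not obj.machine_readable:
        gaps.append("not machine readable")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("missing metadata: " + ", ".join(missing))
    if not obj.cloud_location:
        gaps.append("not available at a cloud-based Commons provider")
    if not obj.access_rules:
        gaps.append("no access rules stated")
    if not obj.indices:
        gaps.append("not listed in any index")
    return gaps
```

An object with every field filled in returns an empty list; anything else returns a list of gaps that a data provider could work through.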

Page 26: BD2K and the Commons : ELIXR All Hands

Towards Data Commons’

Page 27: BD2K and the Commons : ELIXR All Hands

Towards Data Commons’

co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing, sharing data to create an open interoperable resource for the research community.

Page 28: BD2K and the Commons : ELIXR All Hands

NIH Commons PILOTS

Page 29: BD2K and the Commons : ELIXR All Hands

Current Commons Pilots

• Commons Framework Pilots: explore feasibility of the Commons framework; provide data objects to populate the Commons; facilitate collaboration and interoperability
• Cloud Credit Model: provide access to cloud (IaaS) and PaaS/SaaS via credits; connecting credits to NIH grants
• Reference Data Sets: making large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Resource Search & Index: developing data & software indexing methods; leveraging BD2K efforts (bioCADDIE et al.); collaborating with external groups

Page 30: BD2K and the Commons : ELIXR All Hands

Other Commons Activities

HMP Cloud (NIAID/Common Fund)

NCI Cloud Pilots & GDC

NIH affiliated Commons projects

• Testing cloud environments to enable access, sharing, use and reuse of large data sets and accompanying tools: The Cancer Genome Atlas (TCGA) - NCI; Human Microbiome Project (HMP) - NIAID
• Providing portals to view representation and analysis of large data sets (Genomic Data Commons – NCI)

?

Other Commons’

Page 31: BD2K and the Commons : ELIXR All Hands

Commons Framework Pilots

Page 32: BD2K and the Commons : ELIXR All Hands

Exploring feasibility of the Commons framework using the BD2K Centers, MODs, and HMP groups

Facilitating connectivity, interoperability and access to digital objects

Providing digital research objects to populate the Commons

Enable biomedical science to happen more easily and robustly

Connecting biology use cases with data science

Commons Framework Pilots: BD2K Centers, MODs, HMP

Page 33: BD2K and the Commons : ELIXR All Hands

BD2K Centers, MODS and HMP

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Mapping to the Commons framework: Commons Framework Pilots

PaaS

SaaS

Page 34: BD2K and the Commons : ELIXR All Hands

• Does your work map to the Commons framework? Good, Bad, Ugly
• How does it enable science? Using robust computational methods; enabling biomedical use cases

Commons Framework Pilots: BD2K Centers, MODs, HMP

Page 35: BD2K and the Commons : ELIXR All Hands

Commons Framework Pilots

PI | Parent grant’s IC | Project description
TOGA | NIBIB | Cloud-hosted data publication system; allows the automatic creation and publication of data in a personalized data repository
MUSEN | NIAID | Smart APIs: improved handling for metadata within APIs; ontological support for metadata within an API; improving smart API discoverability: a registry of APIs
HAN | NIGMS | Docker container hub for BD2K community; Docker containers for genomic analysis applications and pipelines; benchmark, evaluation & best practices
COOPER/KOHANE | NHGRI | Cloud-based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI); Docker containers for CCD tools available in AWS
HAUSSLER | NHGRI | Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations; (GA4GH) API: being able to query this data and metadata
Ohno-Machado | NHLBI | Development of an ecosystem for repeatable science; easy reuse of data AND software; tracking of provenance; use of container technologies for software and data reuse
Sternberg | NHGRI | Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites; an API to provide programmatic access to the relevant papers in PMC
White | NHGRI | The entire HMP1 data set made accessible on AWS; analysis tools for microbiome data in AWS
Westerfield | NHGRI | Development of a common data model for the MODs; development of APIs accessing data across the MODs

Page 36: BD2K and the Commons : ELIXR All Hands

More specifically, from a Data Science perspective:

• Open standards for APIs and Docker containers
• Docker registry and best practices
• Improved metadata handling in APIs
• Data object registry and indexing: reusing what is currently available (bioCADDIE and schema.org)
• Publication: preprint server with links to all digital objects

Commons Framework Pilots: BD2K Centers, MODs, HMP

Page 37: BD2K and the Commons : ELIXR All Hands

Example of a biomedical use case:

• Develop a common gene model for all the MODs
• Develop an open, well-structured, reusable and documented API that can be used across the MOD data

Why?
• To be able to query a human gene against all MOD orthologs
• Improved understanding of health and disease states
• Improved understanding of genome structure & organization

Commons Framework Pilots: BD2K Centers, MODs, HMP
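The use case above (one human gene queried against orthologs in every MOD) becomes a simple lookup once records share a gene model. The sketch below is illustrative only: the record fields, the MOD names, the gene identifiers and the `orthologs_across_mods` function are all made-up placeholders, not the pilot's actual data model or API.

```python
# Hypothetical records that already follow one shared gene model.
# Every value here is a placeholder, not real MOD data.
GENES = [
    {"mod": "ExampleMOD-A", "gene_id": "A:0001", "human_ortholog": "HUMAN_GENE_1"},
    {"mod": "ExampleMOD-B", "gene_id": "B:0042", "human_ortholog": "HUMAN_GENE_1"},
    {"mod": "ExampleMOD-B", "gene_id": "B:0099", "human_ortholog": "HUMAN_GENE_2"},
]


def orthologs_across_mods(human_gene, records):
    """Group the IDs of genes orthologous to `human_gene` by the MOD that holds them."""
    hits = {}
    for record in records:
        if record["human_ortholog"] == human_gene:
            hits.setdefault(record["mod"], []).append(record["gene_id"])
    return hits


result = orthologs_across_mods("HUMAN_GENE_1", GENES)
print(result)
```

Because every record uses the same field names, one function serves all MODs; without a common model, each database would need its own query logic.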

Page 38: BD2K and the Commons : ELIXR All Hands

The purpose of the Commons Framework is to support

BOTH

Biological use cases + Data Science methods

To allow biological research to happen at scale

Commons Framework Pilots: BD2K Centers, MODs, HMP

Page 39: BD2K and the Commons : ELIXR All Hands

Commons Credits Model

Page 40: BD2K and the Commons : ELIXR All Hands

The Cloud Credits Model

The Commons: Cloud Provider A, Cloud Provider B, HPC Provider

NIH: provides credits

Investigator: uses credits in the Commons

Enabling search: Index

Commons Compliance

Commons Conformance
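The flow in this diagram (NIH provides credits, the investigator spends them with providers in the Commons) can be sketched as a small ledger. This is a toy model under stated assumptions: the `CreditAccount` class, its method names and the rule that only conformant providers accept credits are illustrative, not a description of any NIH system.

```python
class CreditAccount:
    """Toy ledger for one investigator's cloud credits (hypothetical design)."""

    def __init__(self, investigator):
        self.investigator = investigator
        self.balance = 0
        self.history = []  # list of (provider, amount) tuples

    def grant(self, amount):
        """NIH side: add credits to the investigator's account."""
        if amount <= 0:
            raise ValueError("grant must be positive")
        self.balance += amount

    def spend(self, provider, amount, conformant_providers):
        """Investigator side: use credits at a Commons-conformant provider."""
        if provider not in conformant_providers:
            raise ValueError(f"{provider} is not a conformant provider")
        if amount > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= amount
        self.history.append((provider, amount))


# Assumed provider names, for illustration only.
CONFORMANT = {"cloud-provider-a", "cloud-provider-b", "hpc-provider"}

account = CreditAccount("example-investigator")
account.grant(100)
account.spend("cloud-provider-a", 30, CONFORMANT)
print(account.balance)
```

Letting the investigator choose the provider at spend time is what would create the competitive marketplace described later in the deck.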

Page 41: BD2K and the Commons : ELIXR All Hands

Drivers of the Cloud Credits Model

• Scalability
• Exploiting new computing models
• Potential cost effectiveness
• Simplified sharing of digital objects
• Cloud computing supports many of these objectives

Page 42: BD2K and the Commons : ELIXR All Hands

Cloud credits model (CCM)

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Mapping pilots to the Commons framework: Cloud Credits Model:

IaaS

PaaS

SaaS

Page 43: BD2K and the Commons : ELIXR All Hands

Advantages of this Model

• Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator-level access control
• Scalable for the needs of the scientific community for the next 5 years
• Democratizes access to data and computational tools
• Cost effective: competitive marketplace for biomedical computing services; reduces redundancy; uses resources efficiently

Page 44: BD2K and the Commons : ELIXR All Hands

Potential Disadvantages of this Model

• Novelty: never been tried, so we don’t have data about likelihood of success
• Cost models: assumes stable or declining prices among providers. True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
• Service providers: assumes that providers are willing to make the investment to become conformant. Market research suggests 3-5 providers within 2-3 months of launch
• Persistence: the model is ‘Pay As You Go’, which means if you stop paying it stops going, giving investigators an unprecedented level of control over what lives (or dies) in the Commons

Page 45: BD2K and the Commons : ELIXR All Hands

Cloud Commons Reference Data Sets

Page 46: BD2K and the Commons : ELIXR All Hands

Data Sets in a Cloud Commons

• Making high-value and/or high-volume NIH-funded data sets available in a cloud commons
• Co-location of large datasets and compute power enables access, use, reuse and sharing of data and tools
• Data must adhere to FAIR/Commons compliance principles
• Helps “seed” the Commons with FAIR/Commons compliant data
• Provides indexable test data sets for bioCADDIE (and other indexing efforts)

Page 47: BD2K and the Commons : ELIXR All Hands

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Mapping pilots to the Commons framework : Large, high value Data Sets

NIH defined data sets

Page 48: BD2K and the Commons : ELIXR All Hands

Data Sets in the Cloud Commons: preliminary possible data sets

• GTEx (Genotype-Tissue Expression)
• LINCS (Library of Integrated Network-Based Cellular Signatures)
• Model Organism Databases (MODs)
• UniProt
• Neuroimaging Resource (NITRC)
• Radiology Image Share
• Epigenomics
• GenPort
• The Cancer Genome Atlas Project (TCGA): this data set is currently housed at the GDC but there are plans to move to AWS and Google
• BTRIS Data – NIH Clinical Center
• NIAID AIDS Data
• dbGaP
• GEO

Page 49: BD2K and the Commons : ELIXR All Hands

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Mapping pilots to the Commons framework : Community Defined Data Sets

Community defined data sets

Page 50: BD2K and the Commons : ELIXR All Hands

Data Sets in a Cloud Commons: Opportunities

• Ability to share data more easily
• Ability to access and compute on data more easily
• Reduced costs: cost is paid by NIH, not the individual PI; stops continuous uploads of the same data sets
• FAIR/Commons compliance of data sets

Page 51: BD2K and the Commons : ELIXR All Hands

Data Sets in a Cloud Commons: Challenges

• Supporting sensitive (human) data in commercial clouds
• Updating, versioning, maintaining
• Consents for data: can be very strict and only valid across 1 data set; analysis across data sets may be constrained by consents
• Optimizing for cloud environments: performance
• Incentivizing data (and tool) generators to move and maintain their data in the cloud
• Data peering across clouds: commercial clouds are resistant (“cylinders of excellence”); peering and virtualization of services

Page 52: BD2K and the Commons : ELIXR All Hands

Making things Findable

Indexing & Search methods

Page 53: BD2K and the Commons : ELIXR All Hands

Commons Pilots: Search & Index

Indexing and searching digital objects in a Commons

Leveraging indexing methods within BD2K:
• bioCADDIE
• Other approaches within BD2K
• Schema.org

Coexisting efforts

Page 54: BD2K and the Commons : ELIXR All Hands

BD2K Indexing, e.g. bioCADDIE, other, schema.org

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data: “Reference” Data Sets

User defined data

Digital Object Compliance

App store/User Interface

Mapping pilots to the Commons framework : Indexing & Searching

Page 55: BD2K and the Commons : ELIXR All Hands

What is bioCADDIE?

biomedical and healthCAre Data Discovery Index Ecosystem

University of California San Diego, PI Lucila Ohno-Machado

Development of a prototype of a Data Discovery Index (DDI)

Aims – “PubMed” for Data
1. Help users find shared data
2. Build a prototype data discovery index
3. Evaluate requirements for next phase

Page 56: BD2K and the Commons : ELIXR All Hands

Ecosystem components for finding data

• Policies: criteria for inclusion, sustainability
• Standards: metadata, data
• Identifiers: reuse of existing ID issuing services
• Metadata: minimal set, guidelines for mapping, accessibility information, provenance
• Search engine: connection to other engines, repositories, data sets

Page 57: BD2K and the Commons : ELIXR All Hands

Commons Pilots Leveraging Schema.org

• Marking up a biomedical resource using schema.org
• Flexible and scalable
• Developing a bioschema.org approach
• Helps drive a community standard for reuse by other groups
• Harnesses the power of search engines to find digital objects
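Marking up a resource with schema.org usually means publishing a small JSON-LD description alongside it. The sketch below builds one for a data set using the schema.org `Dataset` type and its `name`, `description`, `url`, `identifier` and `keywords` properties. The helper name `dataset_jsonld` and every value passed to it are placeholders for illustration, not a real resource.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
    }


# All values below are made-up placeholders.
markup = dataset_jsonld(
    name="Example microbiome reference data set",
    description="Placeholder description of a biomedical data set.",
    url="https://example.org/datasets/1",
    identifier="example:dataset-1",
    keywords=["microbiome", "reference data"],
)

# This string would typically be embedded in the resource's web page
# inside a <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```

Because the vocabulary is shared, general-purpose search engines and domain-specific indexers can read the same markup.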

Page 58: BD2K and the Commons : ELIXR All Hands

Commons: Achieving a Balance

Biomedical Use Cases + Data Science + e-infrastructures

Supporting open biomedical science using robust, scalable and flexible digital technologies

In collaboration with global communities

Page 59: BD2K and the Commons : ELIXR All Hands

Thank you

ADDS Office: Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso

NCBI: George Komatsoulis

NHGRI: Valentina di Francesco, Kevin Lee

CIT: Debbie Sinmao, Andrea Norris, Stacy Charland

Trans NIH BD2K Executive Committee & Working groups

NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore

NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan

Many biomedical researchers, cloud providers, IT professionals

Page 60: BD2K and the Commons : ELIXR All Hands

The end