How Cyverse.org enables scalable data discoverability and re-use

Page 1: How Cyverse.org enables scalable data discoverability and re-use

Transforming Science Through Data-driven Discovery

How Cyverse.org enables scalable data discoverability and re-use

Matt Vaughn, co-PI, @mattdotvaughn

[email protected]

Page 2: How Cyverse.org enables scalable data discoverability and re-use

History and Context

~$100M direct NSF investment over 10 years

Currently working to sustain its successes beyond 2018

iPlant 2008: Empowering a New Plant Biology

iPlant 2013: Cyberinfrastructure for Life Science

CyVerse 2016: Transforming Science Through Data-Driven Discovery

Plant Science Cyberinfrastructure Collaborative: A "new type of organization" that is "community-driven", uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"

Page 3: How Cyverse.org enables scalable data discoverability and re-use

What is Cyberinfrastructure?

• Data storage and retrieval
• Software (system & user)
• Computing capability
• Human expertise and support

Organized into systems that solve problems of size and scope that would not otherwise be solvable

Page 4: How Cyverse.org enables scalable data discoverability and re-use

Platform Overview

Ready-to-use Platforms

Foundational Capabilities

Established CI Components

Extensible Services

Ease of Use

Page 5: How Cyverse.org enables scalable data discoverability and re-use

Adoption and Outputs

• Over 40K registered users (15-20% active)
• Millions of computing hours on XSEDE, campus HPC, Cyverse systems, and commercial cloud
• 2+ PB user data stored in CyVerse Data Store
• Hundreds of publications, courses, and discoveries
• Spin-off technologies:
    • Jetstream: NSF production cloud
    • Syndicate: Software-defined storage system
    • Agave API: Multitenant science PaaS
• Communities such as iAnimal, iMicrobe, iPlant.UK
• 3rd party software resources using it as a platform

Page 6: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Data (1)

At the heart of all Cyverse applications is a data-centric architecture, designed to be scaled and extended

[Architecture diagram labels: Federation; Metadata; iRODS (2+ PB); ElasticSearch; Catalog Servers; Tucson Resources; Austin Resources; CSHL Resource; iPlant.UK Resources; Data Store APIs; Agave API; AWS S3; Public FTP; SFTP]

Page 7: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Data (2)

• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any Cyverse user

The Cyverse Discovery Environment Data Window
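
The AVU metadata listed above follows the iRODS attribute-value-unit model. As a minimal sketch, assuming a hypothetical template that simply names the attributes a file must carry, a structured template check might look like the following. The `Avu` type, the `missing_attributes` helper, and the attribute names are illustrative assumptions, not the actual CyVerse schema or API.

```python
from typing import NamedTuple


class Avu(NamedTuple):
    """One attribute-value-unit triple, as in iRODS metadata."""
    attribute: str
    value: str
    unit: str = ""  # the unit is frequently left empty


def missing_attributes(avus: list[Avu], required: set[str]) -> set[str]:
    """Return the required attribute names that have no non-empty value."""
    present = {a.attribute for a in avus if a.value.strip()}
    return required - present


# Hypothetical template: attribute names chosen for illustration only.
TEMPLATE_REQUIRED = {"title", "creator", "species"}

file_metadata = [
    Avu("title", "Maize leaf RNA-seq"),
    Avu("species", "Zea mays"),
    Avu("read_length", "100", "bp"),
]

missing = missing_attributes(file_metadata, TEMPLATE_REQUIRED)
print(sorted(missing))  # template fields that still need a value
```

Keeping the template as a plain set of names makes the check independent of any particular storage backend; a real template would likely also constrain value types and units.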

Page 8: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Data (3)

• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any Cyverse user

Google Drive, for big data

The Cyverse Discovery Environment Data Window

Page 9: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Software (1)

• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service

Info view for a Cyverse Discovery Environment application

Page 10: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Software (2)

• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
• Require links to documentation, example files and usage, appropriate software and domain ontologies

Public or shared Atmosphere VM images tagged with “GWAS”

Page 11: How Cyverse.org enables scalable data discoverability and re-use

Finding and re-using Software (3)

• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
• Require links to documentation, example files and usage, appropriate software and domain ontologies
• Give credit to app author and software author

Application and Data catalogs available to 3rd parties

Page 12: How Cyverse.org enables scalable data discoverability and re-use

Cyverse Data Commons (1)

Data Commons Landing Page (1.0)

Persistent URL for each data set. No authentication required. Fast browsing and retrieval.

NCBI SRA Submission Workflow in DE

Cyverse is the analysis home for a lot of genomics data. To get it off our systems, we need to help get it into the SRA!

Page 13: How Cyverse.org enables scalable data discoverability and re-use

Cyverse Data Commons (2)

Actively facilitating publication and discovery of data stored with CyVerse

Candidate Research Data @ Data Store

• Identify, organize, rename files and folders
• Prepare a DataCite metadata document
• Submit to Cyverse Curation Team
• Data snapshot made public. DOI issued.

Candidate VM image

• Document contents & capabilities
• Prepare a DataCite metadata document
• Submit to Cyverse Curation Team
• Public image released. DOI issued.
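
The "Prepare a DataCite metadata document" step can be sketched as a simple completeness check. DataCite's schema makes identifier, creator, title, publisher, publication year, and resource type mandatory; since the workflow above issues the DOI only after curation, this sketch checks the other five. The dictionary layout, key names, the `missing_datacite_fields` helper, and every value are assumptions for illustration, not the CyVerse submission format.

```python
# Mandatory DataCite properties other than the identifier (the DOI itself),
# which is assumed here to be assigned only after curation.
REQUIRED_BEFORE_SUBMISSION = (
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_datacite_fields(record: dict) -> list[str]:
    """List mandatory properties that are absent or empty in the record."""
    return [name for name in REQUIRED_BEFORE_SUBMISSION if not record.get(name)]


# Hypothetical data set description; every value is a placeholder.
candidate = {
    "creators": ["Example, Author"],
    "titles": ["Example data set"],
    "publisher": "Example Publisher",
    "publicationYear": 2016,
    "resourceType": "Dataset",
}

problems = missing_datacite_fields(candidate)
print("ready to submit" if not problems else f"missing: {problems}")
```

Running a check like this before handing a data set or VM image to a curation team catches omissions early, when the submitter can still fix them cheaply.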

Page 14: How Cyverse.org enables scalable data discoverability and re-use

Summary

• Cyverse is a model for providing cyberinfrastructure to diverse bioscience user communities

• State of the art has shifted at least twice since we started work

• Had to overcome initial reticence to “give data” to Cyverse

• Still hard to get developers and providers to maintain what they have contributed

• Cost recovery model: we have started using the term ‘subsidized’ rather than ‘free’, but it might be too late

• Natural synergy between our organization and ODEN objectives

Page 15: How Cyverse.org enables scalable data discoverability and re-use

Transforming Science Through Data-driven Discovery

CyVerse Executive Team

Parker Antin

Nirav Merchant

Eric Lyons

Matt Vaughn, @mattdotvaughn

[email protected]

Doreen Ware

Dave Micklos

CyVerse is supported by the National Science Foundation under Grant Nos. DBI-0735191 and DBI-1265383.