Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus: A Cloud-Based Infrastructure for Managing,
Analyzing and Sharing Genomics Data
Robert Grossman
Institute for Genomics & Systems Biology
Computation Institute
University of Chicago
and
Open Cloud Consortium
March 29, 2011
Part 1. Biology, Big Data & Clouds
Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Source: Lincoln Stein
The Challenge Is to Support Cubes of Next-Gen Sequence Data
The axes of the data cube: perturbations of the environment, different developmental stages, and different pathologies.
Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.
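The cube above can be sketched as a sparse mapping from (perturbation, stage, pathology) coordinates to data sets. This is a minimal illustration, not part of Bionimbus; the class name, methods, and sample identifiers are all made up.

```python
# Hypothetical sketch: the "cube" of next-gen sequencing experiments, where
# each cell is keyed by environmental perturbation, developmental stage, and
# pathology, and holds one or more data sets (ChIP-chip, ChIP-seq, RNA-seq, ...).

from collections import defaultdict

class ExperimentCube:
    """A sparse 3-D cube of experiments; most cells are empty in practice."""

    def __init__(self):
        # (perturbation, stage, pathology) -> list of data-set descriptors
        self._cells = defaultdict(list)

    def add(self, perturbation, stage, pathology, assay, path):
        self._cells[(perturbation, stage, pathology)].append(
            {"assay": assay, "path": path}
        )

    def cell(self, perturbation, stage, pathology):
        return self._cells[(perturbation, stage, pathology)]

    def assays(self):
        """All assay types present anywhere in the cube."""
        return {d["assay"] for cell in self._cells.values() for d in cell}

# Example usage with made-up identifiers:
cube = ExperimentCube()
cube.add("heat_shock", "embryo", "none", "RNA-seq", "s3://bucket/r1.fastq")
cube.add("heat_shock", "embryo", "none", "ChIP-seq", "s3://bucket/c1.fastq")
```

The sparse dictionary matters here: the full cross-product of perturbations, stages, and pathologies is huge, but only the cells that were actually sequenced consume storage.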
Discipline         Duration   Size            # Devices
HEP - LHC          10 years   15 PB/year      One
Astronomy - LSST   10 years   10 PB/year      One
Genomics - NGS     2-4 years  0.5 TB/genome   Hundreds
Genomics as a Big Data Science
What is new about clouds?
Scale is New
Elastic, On-Demand Computing with Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
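The equivalence above is just arithmetic under usage-based pricing: cost depends only on total node-hours, not on how they are spread across machines. A minimal sketch, with an assumed hourly rate:

```python
# Usage-based pricing makes node-hours fungible: a job's cost depends only on
# nodes * hours, so a wide short job and a narrow long job price the same.
# The rate below is an assumption for illustration, not a real price.

RATE_PER_NODE_HOUR = 0.10  # assumed dollars per node-hour

def job_cost(nodes, hours):
    return nodes * hours * RATE_PER_NODE_HOUR

serial = job_cost(nodes=1, hours=120)    # 1 computer for 120 hours
elastic = job_cost(nodes=120, hours=1)   # 120 computers in three racks for 1 hour
assert serial == elastic  # same node-hours, same price
```

The practical consequence is elasticity: for the same money, the wide job finishes 120 times sooner.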
Part 2. What is Bionimbus?
www.bionimbus.org
Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.
[Diagram: data flows between the IGSB sequencers, external sequencers, the UC Bionimbus Private Cloud, the Bionimbus Community Cloud, a Bionimbus Private Cloud XY, Amazon, and dbGaP.]
Step 1. Get Bionimbus ID (BID) from the BID Generator; assign project, private/community/public cloud, etc.
Step 2. Send sample to be sequenced.
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNV, annotation…
Step 4. Secure data routing to the appropriate cloud based upon the BID.
Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
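The steps above can be sketched as a small routing protocol keyed on the BID. Everything here is illustrative (the function names, record shape, and cloud labels are assumptions), not the actual Bionimbus API; the point is that the destination cloud is fixed when the BID is issued, so later data never routes to the wrong place.

```python
# Hypothetical sketch of BID-based routing. Names and labels are invented
# for illustration; they are not the real Bionimbus interfaces.

import uuid

# Step 1: a BID record carries the project and the cloud the data may go to.
def make_bid(project, cloud):
    assert cloud in {"private", "community", "public"}
    return {"bid": uuid.uuid4().hex, "project": project, "cloud": cloud}

# Step 4: secure routing -- the destination is determined by the BID alone,
# so raw reads (step 3a) and derived calls (step 3b) land in the same,
# correct cloud without per-file decisions.
DESTINATIONS = {
    "private": "bionimbus-private",
    "community": "bionimbus-community",
    "public": "amazon",
}

def route(bid_record):
    return DESTINATIONS[bid_record["cloud"]]

bid = make_bid(project="modENCODE", cloud="community")
# Step 5 would then run the analysis wherever route(bid) points.
```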
What is a good unit to understand data-intensive computing of biological data?
Bionimbus & OSDC Today
• The NIH in the U.S. currently makes available for download approximately 2 PB of data.
• Bionimbus 2010 consists of 6 racks, 212 nodes, 1568 cores and 0.9 PB of storage.
• Bionimbus is part of the proof-of-concept Open Science Data Cloud, which consists of 14 racks, 472 nodes, 3776 cores and 3+ PB of storage.
Bionimbus services:
• GWT-based Front End
• Database Services
• Analysis Pipelines & Re-analysis Services
• Data Ingestion Services
• Large Data Cloud Services
• Elastic Cloud Services
• Intercloud Services
Bionimbus Deployment Options
• Bionimbus Community Cloud (www.bionimbus.org)
• Bionimbus AMIs & Amazon-hosted applications
• Bionimbus Private Clouds
Part 3. Some Bionimbus Case Studies
Case Study: Public Datasets in Bionimbus
Case Study: ModENCODE
• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).
• Bionimbus VMs were used for some of the integrative analysis.
• Bionimbus is used as a backup for the modENCODE DCC.
Case Study: IGSB
• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.
Bionimbus Virtual Machine Releases
Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP
Quality Control: Various
Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard
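As an illustration of how the tools shipped in these VM images chain together, a ChIP-seq run might pass through alignment (Bowtie), BAM processing (Samtools), and peak calling (MACS). The sketch below only assembles the command lines; the flags are assumptions drawn from those tools' common usage and should be checked against each tool's own documentation for the versions installed.

```python
# Hypothetical sketch: chaining the VM tools above into a ChIP-seq workflow.
# File names and the reference index ("dm3") are made up; flags are
# illustrative and vary by tool version.

def alignment_cmd(index, reads, out_sam):
    # Bowtie with SAM output (-S); index and reads are placeholders.
    return ["bowtie", "-S", index, reads, out_sam]

def sort_cmd(sam, out_bam):
    # Samtools sort; the -o output flag varies across samtools versions.
    return ["samtools", "sort", sam, "-o", out_bam]

def peak_call_cmd(treatment, control, name):
    # MACS peak calling on treatment vs. control alignments.
    return ["macs", "-t", treatment, "-c", control, "--name", name]

pipeline = [
    alignment_cmd("dm3", "chip.fastq", "chip.sam"),
    sort_cmd("chip.sam", "chip.bam"),
    peak_call_cmd("chip.bam", "input.bam", "chip_peaks"),
]
# Each command would be dispatched to a VM, e.g. with subprocess.run(cmd).
```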
Part 4. What is the OSDC?
Open Science Data Cloud
• Astronomical data
• Biological data (Bionimbus)
• Earth science data (& disaster relief)
• NSF-PIRE OSDC Data Challenge
Open Cloud Consortium
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
• Develops reference implementations, benchmarks and standards.
OCC Members
• Companies: Cisco, Citrix, Yahoo!, …
• Universities: University of Chicago, Calit2, Johns Hopkins, Northwestern Univ., ORNL, University of Illinois at Chicago, …
• Federal agencies: NASA
• Other: National Lambda Rail
• Adding international partners in 2011.
Infrastructure
• 2010 Proof-of-Concept Infrastructure
– 450+ nodes
– 3000+ cores
– 3+ PB
– Four data centers (two more to come in 2011)
– Data centers have 10G network connections (some 100G links in 2011)
• Plan to add approximately 1 PB of data in 2011.
• With current funding, we will refresh 1/3 of the infrastructure in 2011 and 2012.
Towards a Long Term, Sustainable Model
• Capital expenses: about $1M/year.
• Operating expenses: about $1M/year.
• The Moore Foundation is providing $1M/year for 2011 and 2012 to cover the capital expenses.
[Diagram: matching data to infrastructure. One chart plots data size (small; medium to large; very large) against variety of analysis (low; med; wide): a scientist with a laptop needs no infrastructure, the Open Science Data Cloud provides general infrastructure, and sequencing centers, the LHC and the LSST run dedicated infrastructure. A companion scale contrasts cycles (single workstations; small to medium clusters; HPC) with persistent data (databases; data clouds; large & specialized clusters), from small to large.]
Bionimbus Team*
David Hanley, Nicolas Negre, Elizabeth Bartom, Nicholas Bild, Christopher D. Brown, Marc Domanus, Robert L. Grossman, A. Jason Grundstad, Xiangjun Liu, Michal Sabala, Parantu K. Shah, Kevin P. White
Institute for Genomics & Systems Biology, University of Chicago

Jia Chen, Yunhong Gu and Damian Roqueiro
University of Illinois at Chicago

Lincoln Stein and Zheng Zha
Ontario Institute for Cancer Research

*In alphabetical order
Acknowledgements
Questions?
Thank You
For more information: www.bionimbus.org
www.opencloudconsortium.org
www.igsb.org
rgrossman.com