Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data Robert Grossman Institute for Genomics & Systems Biology (IGSB) Computation Institute University of Chicago and Open Cloud Consortium April 21, 2011

Transcript of Bionimbus - Northwestern CGI Workshop 4-21-2011

Page 1: Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data

Robert Grossman

Institute for Genomics & Systems Biology (IGSB)

Computation Institute

University of Chicago

and

Open Cloud Consortium

April 21, 2011

Page 2: Bionimbus - Northwestern CGI Workshop 4-21-2011

Background

Page 3: Bionimbus - Northwestern CGI Workshop 4-21-2011

Growth of Genomic Data

Sequencing technology:

1977: Sanger sequencing
1995: Microarray technology
2001: HGP
2003: ENCODE
2005: 454, Solexa sequencing

Sequence species, sequence environment, sequence everything

GenBank: 10^5, 10^8, 10^10

Computing infrastructure:

2003: GFS
2006: AWS
2008: Hadoop

Page 4: Bionimbus - Northwestern CGI Workshop 4-21-2011

Source: Lincoln Stein

Page 5: Bionimbus - Northwestern CGI Workshop 4-21-2011

The Challenge is to Support Cubes of High Throughput Sequence Data

Perturb the environment

Different developmental stages

Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.

Different pathologies

Page 6: Bionimbus - Northwestern CGI Workshop 4-21-2011

We Have a Problem

• More and more of your colleagues produce so much data that they cannot easily manage, move, analyze and share it.

• Centers and large projects build their own infrastructure.

• Everyone else is on their own.

vs…

Page 7: Bionimbus - Northwestern CGI Workshop 4-21-2011

Part 1. Using Bionimbus

www.bionimbus.org

Page 8: Bionimbus - Northwestern CGI Workshop 4-21-2011


Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.

Page 9: Bionimbus - Northwestern CGI Workshop 4-21-2011

Enabling a broad community to utilize genome research

Diagram: User, Bionimbus Cloud, and Sequencing Partner or Center, connected by steps 1, 2 and 3.

Page 10: Bionimbus - Northwestern CGI Workshop 4-21-2011

Step 1. Prepare a Sample

Page 11: Bionimbus - Northwestern CGI Workshop 4-21-2011

Step 2. Log in to Bionimbus and get a Bionimbus Key.

Page 12: Bionimbus - Northwestern CGI Workshop 4-21-2011

Step 3. FedEx your sample to CGI.

Page 13: Bionimbus - Northwestern CGI Workshop 4-21-2011

Step 4. Log in to Bionimbus and view your data.

Page 14: Bionimbus - Northwestern CGI Workshop 4-21-2011

Step 5. Use Bionimbus to run standard and custom pipelines.

Using the ability of Bionimbus to launch multiple virtual machines reduced this analysis from 25 days to 1 day.
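The 25-day to 1-day reduction is what one would expect if the analysis splits into independent, equal-sized tasks spread over many virtual machines. The sketch below works through that arithmetic. The even split, the function names, and the optional overhead term are assumptions for illustration; the slide does not say how many VMs were used or how the work was divided.

```python
import math


def wall_clock_days(days_on_one_vm: float, n_vms: int,
                    overhead_days: float = 0.0) -> float:
    """Ideal wall-clock time when the work divides evenly across n_vms.

    overhead_days is a hypothetical fixed cost (VM start-up, data staging).
    """
    if n_vms < 1:
        raise ValueError("n_vms must be at least 1")
    return days_on_one_vm / n_vms + overhead_days


def vms_needed(days_on_one_vm: float, target_days: float) -> int:
    """Smallest number of VMs that meets target_days under an even split."""
    if target_days <= 0:
        raise ValueError("target_days must be positive")
    return math.ceil(days_on_one_vm / target_days)


# Under these assumptions, a 25-day serial job needs 25 VMs to finish in 1 day.
print(vms_needed(25, 1))
print(wall_clock_days(25, 25))
```

Real pipelines rarely divide perfectly, so the actual VM count for a given target would usually be somewhat higher than this ideal figure.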

Page 15: Bionimbus - Northwestern CGI Workshop 4-21-2011

Diagram components: Bionimbus Private Cloud (UC), Bionimbus Community Cloud, Bionimbus Private Cloud XY, Amazon, dbGaP, CGI, Internal Sequencers, BID Generator.

Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc.

Step 2. Send sample to be sequenced.

Step 3a. Return raw reads.

Step 3b. Return variant calls, CNV, annotation…

Step 4. Secure data routing to appropriate cloud based upon BID.

Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
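Steps 1 and 4 describe assigning a target cloud when a BID is issued and later routing returned data by that BID. A minimal sketch of that idea follows. The class names, the BID format, and the set of cloud labels are hypothetical and are not taken from the Bionimbus implementation.

```python
from dataclasses import dataclass

# Hypothetical labels for the kinds of cloud named in Step 1.
CLOUDS = {"private", "community", "public"}


@dataclass(frozen=True)
class BidRecord:
    bid: str      # Bionimbus ID (format here is made up)
    project: str  # project assigned in Step 1
    cloud: str    # destination cloud chosen in Step 1


class BidRegistry:
    """Records the cloud chosen for each BID and looks it up for routing."""

    def __init__(self) -> None:
        self._records: dict[str, BidRecord] = {}

    def register(self, bid: str, project: str, cloud: str) -> BidRecord:
        if cloud not in CLOUDS:
            raise ValueError(f"unknown cloud: {cloud}")
        if bid in self._records:
            raise ValueError(f"BID already registered: {bid}")
        record = BidRecord(bid, project, cloud)
        self._records[bid] = record
        return record

    def route(self, bid: str) -> str:
        """Return the destination cloud; unknown BIDs are rejected, not defaulted."""
        if bid not in self._records:
            raise KeyError(f"unknown BID: {bid}")
        return self._records[bid].cloud


registry = BidRegistry()
registry.register("BID-0001", "demo-project", "private")
print(registry.route("BID-0001"))
```

Rejecting unknown BIDs instead of falling back to a default cloud is one simple way to keep data from landing somewhere less restricted than intended.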

Page 16: Bionimbus - Northwestern CGI Workshop 4-21-2011

Part 2. Introduction to Clouds

Page 17: Bionimbus - Northwestern CGI Workshop 4-21-2011


Clouds provide on-demand computing and storage resources at the scale and with the reliability of a data center.

Computer scientists were caught by surprise.

Page 18: Bionimbus - Northwestern CGI Workshop 4-21-2011

What is a Cloud?


Software as a Service (SaaS)

Page 19: Bionimbus - Northwestern CGI Workshop 4-21-2011

What Else is a Cloud?

Infrastructure as a Service (IaaS)

Users get one or more virtual machines “on demand”

Page 20: Bionimbus - Northwestern CGI Workshop 4-21-2011

Are There Other Types of Clouds?

Hadoop was developed for processing Internet scale data for ad targeting and related applications but is now used for processing genomics data and many other applications.

Page 21: Bionimbus - Northwestern CGI Workshop 4-21-2011

What is new about clouds?

Page 22: Bionimbus - Northwestern CGI Workshop 4-21-2011


Scale is New

Page 23: Bionimbus - Northwestern CGI Workshop 4-21-2011

Elastic, On-Demand Computing with Usage Based Pricing Is New

1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour

Data center scale computing often leverages virtualization technologies.
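The pricing claim on this slide is plain arithmetic under usage-based billing: cost depends on machine-hours, not on how they are arranged. A small sketch follows. The hourly rate is a made-up placeholder, and the model ignores anything other than a flat per-machine-hour charge.

```python
def usage_cost(machines: int, hours: int, rate_per_machine_hour: float) -> float:
    """Cost under flat usage-based pricing: machine-hours times the hourly rate."""
    return machines * hours * rate_per_machine_hour


RATE = 0.10  # hypothetical price per machine-hour, not a quoted figure

one_machine = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

# Both runs consume 120 machine-hours, so the two costs are equal.
print(one_machine == many_machines)
```

The practical consequence is that, for work that parallelizes, finishing 120 times sooner carries no price penalty under this model.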

Page 24: Bionimbus - Northwestern CGI Workshop 4-21-2011

Part 3. Some Bionimbus Cases

Page 25: Bionimbus - Northwestern CGI Workshop 4-21-2011

Case Study: Public Datasets in Bionimbus

Page 26: Bionimbus - Northwestern CGI Workshop 4-21-2011
Page 27: Bionimbus - Northwestern CGI Workshop 4-21-2011

Case Study: modENCODE

• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).

• Bionimbus VMs were used for some of the integrative analysis.

• Bionimbus is used as a backup for the modENCODE DCC.

Page 28: Bionimbus - Northwestern CGI Workshop 4-21-2011

>300 ChIP datasets
• Chromatin/RNA timecourse
• CBP
• PolII
• Pho/silencers
• HDACs
• Insulators
• TFs

Predictions
• 537 silencers
• 2,307 new promoters
• 12,285 enhancers
• 14,145 insulators

www.modencode.org
www.cistrack.org
Negre et al. Nature 2011

Page 29: Bionimbus - Northwestern CGI Workshop 4-21-2011

Case Study: IGSB

• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.

Page 30: Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus Virtual Machine Releases

Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP

Quality Control: Various

Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard

Page 31: Bionimbus - Northwestern CGI Workshop 4-21-2011

Part 4. Data Centers for Science

Page 32: Bionimbus - Northwestern CGI Workshop 4-21-2011

experimental science

simulation science

data science

1609: 30x

1670: 250x

1976: 10x-100x

2004: 10x-100x

Page 33: Bionimbus - Northwestern CGI Workshop 4-21-2011

Open Science Data Cloud

• Astronomical data
• Biological data (Bionimbus)
• NSF-PIRE OSDC Data Challenge
• Earth science data (& disaster relief)

Page 34: Bionimbus - Northwestern CGI Workshop 4-21-2011

The goal is to build a data center in Chicago for biological, scientific, medical and health care data in 4 to 5 years.

Page 35: Bionimbus - Northwestern CGI Workshop 4-21-2011

Part 5. More About Bionimbus

Page 36: Bionimbus - Northwestern CGI Workshop 4-21-2011

Database Services

Analysis Pipelines & Re-analysis Services

GWT-based Front End

Large Data Cloud Services

Data Ingestion Services

Elastic Cloud Services

Intercloud Services

Page 37: Bionimbus - Northwestern CGI Workshop 4-21-2011

Database Services (PostgreSQL)

Analysis Pipelines & Re-analysis Services

GWT-based Front End

Large Data Cloud Services (Hadoop, Sector/Sphere)

Data Ingestion Services

Elastic Cloud Services (Eucalyptus, OpenStack)

Intercloud Services

(IDs, etc.) (UDT, replication)

Page 38: Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus Deployment Options

Bionimbus Community Cloud (www.bionimbus.org)

Bionimbus AMIs & Amazon hosted applications

Bionimbus Private Clouds

Page 39: Bionimbus - Northwestern CGI Workshop 4-21-2011

A successful cloud will…

1. Provide long term persistent storage services at the scale of a data center.

2. Provide compute services at the scale of a data center.

3. Provide high performance ingestion and transport of data.

Page 40: Bionimbus - Northwestern CGI Workshop 4-21-2011

A successful cloud will…

4. Support the liberation of data.

5. Peer with public clouds.

6. Peer with private genomics clouds.

Page 41: Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus satisfies each of these six requirements.

Page 42: Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus Road Map

Over the next 3 to 4 months, we will:

• Launch Bionimbus (we are in a pre-launch)
• Add Galaxy-based workflow to Bionimbus
• Add secure routing of genomes
• Add more public datasets
• Add more pipelines

Page 43: Bionimbus - Northwestern CGI Workshop 4-21-2011

For More Information

www.bionimbus.org