Bionimbus - Northwestern CGI Workshop 4-21-2011

Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data

Robert Grossman

Institute for Genomics & Systems Biology (IGSB)

Computation Institute

University of Chicago

and

Open Cloud Consortium

April 21, 2011

Background

Growth of Genomic Data

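[Figure: timeline of sequencing technology, genome projects, GenBank size, and cloud infrastructure]

1977: Sanger sequencing

1995: Microarray technology

2005: 454, Solexa sequencing

2001: HGP

2003: ENCODE

Sequence species; sequence everything; sequence environment

GenBank: 10^5, 10^8, 10^10

2003: GFS

2008: Hadoop

2006: AWS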

Source: Lincoln Stein

The Challenge is to Support Cubes of High Throughput Sequence Data

Perturb the environment

Different developmental stages

Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, or other data set.

Different pathologies
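
One way to picture the cube is as a mapping from (environment, developmental stage, pathology) coordinates to the data sets stored in that cell. The sketch below is illustrative only: the `DataSet` and `DataCube` names, the axis values, and the file paths are hypothetical and are not part of Bionimbus.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataSet:
    """One data set stored in a cell of the cube (hypothetical type)."""
    assay: str  # e.g. "ChIP-seq" or "RNA-seq"
    path: str   # hypothetical location of the data


# A cell is addressed by (environment, developmental stage, pathology).
Coordinate = Tuple[str, str, str]


class DataCube:
    """Sparse three-axis cube; each cell holds a list of data sets."""

    def __init__(self) -> None:
        self._cells: Dict[Coordinate, List[DataSet]] = {}

    def add(self, environment: str, stage: str, pathology: str,
            data_set: DataSet) -> None:
        self._cells.setdefault((environment, stage, pathology), []).append(data_set)

    def select(self, environment: Optional[str] = None,
               stage: Optional[str] = None,
               pathology: Optional[str] = None) -> List[DataSet]:
        """Return data sets from every cell matching the given axes.

        An axis left as None matches any value along that axis.
        """
        wanted = (environment, stage, pathology)
        result: List[DataSet] = []
        for coord, data_sets in self._cells.items():
            if all(w is None or w == c for w, c in zip(wanted, coord)):
                result.extend(data_sets)
        return result
```

Selecting along a single axis, for example `cube.select(stage="embryo")`, returns every data set for that developmental stage across all environments and pathologies.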

We Have a Problem

• More and more of your colleagues produce so much data that they cannot easily manage, move, analyze and share it.

• Centers and large projects build their own infrastructure.

• Everyone else is on their own.

vs…

Part 1. Using Bionimbus

www.bionimbus.org

Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.

Enabling a broad community to utilize genome research

[Figure: a User interacts, in three numbered steps, with the Bionimbus Cloud and a Sequencing Partner or Center.]

Step 1. Prepare a Sample

Step 2. Log in to Bionimbus and get a Bionimbus Key.

Step 3. FedEx your sample to CGI.

Step 4. Log in to Bionimbus and view your data.

Step 5. Use Bionimbus to run standard and custom pipelines.

Launching multiple virtual machines in Bionimbus reduced this analysis from 25 days to 1 day.

[Figure labels: Bionimbus Private Cloud UC; Bionimbus Community Cloud; Bionimbus Private Cloud XY; Amazon; dbGaP; CGI; Internal Sequencers]

Step 1. Get a Bionimbus ID (BID), assign project, private/community, public cloud, etc.

Step 2. Send sample to be sequenced.

BID Generator

Step 3a. Return raw reads.

Step 3b. Return variant calls, CNV, annotation…

Step 4. Secure data routing to the appropriate cloud based upon BID.

Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
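
The BID-based routing in these steps can be sketched as a small registry that issues an ID with a target cloud and later looks up where data tagged with that ID should go. This is a minimal sketch under stated assumptions: the `BidRegistry` class, the `BID-` prefix, and the ID format are hypothetical, since the slides do not describe how the BID Generator works internally.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Cloud(Enum):
    """Destination types named on the slide."""
    PRIVATE = "private"
    COMMUNITY = "community"
    PUBLIC = "public"


@dataclass(frozen=True)
class BidRecord:
    bid: str      # hypothetical ID format, not the real Bionimbus one
    project: str
    cloud: Cloud


class BidRegistry:
    """Hypothetical stand-in for the BID Generator plus routing lookup."""

    def __init__(self) -> None:
        self._records: Dict[str, BidRecord] = {}

    def issue(self, project: str, cloud: Cloud) -> BidRecord:
        # Step 1: create an ID and record the project and destination cloud.
        bid = "BID-" + uuid.uuid4().hex[:12]
        record = BidRecord(bid=bid, project=project, cloud=cloud)
        self._records[bid] = record
        return record

    def route(self, bid: str) -> Cloud:
        # Step 4: decide the destination from the BID alone.
        record = self._records.get(bid)
        if record is None:
            raise KeyError(f"unknown BID: {bid}")
        return record.cloud
```

Because the destination is fixed when the ID is issued, the sequencing side only needs to attach the BID to its output; it never has to know which cloud a project uses.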

Part 2. Introduction to Clouds

Clouds provide on-demand computing and storage resources at the scale and with the reliability of a data center.

Computer scientists were caught by surprise.

What is a Cloud?

Software as a Service (SaaS)

What Else Is a Cloud?

Infrastructure as a Service (IaaS)

Users get one or more virtual machines “on demand”

Are There Other Types of Clouds?

Hadoop was developed for processing Internet-scale data for ad targeting and related applications, but is now used for processing genomics data and many other applications.
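
Hadoop implements the MapReduce programming model. The following is a minimal in-memory sketch of that model in plain Python, not Hadoop code: it counts k-mers across a handful of reads using separate map, shuffle, and reduce steps. The function names and the k-mer counting task are chosen for illustration and do not come from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run on many machines in parallel and the shuffle moves data between them; here everything runs in one process so the structure is easy to follow.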

What is New About Clouds?

Scale is New

Elastic, On-Demand Computing with Usage Based Pricing Is New

1 computer in a rack for 120 hours

costs the same as

120 computers in three racks for 1 hour
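
The arithmetic behind this comparison: under usage-based pricing the bill depends only on total machine-hours, so 1 × 120 and 120 × 1 cost the same. A minimal sketch follows, assuming a flat per-machine-hour rate; the `usage_cost` function and the rate used in the example are hypothetical, not actual provider prices.

```python
def usage_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under flat usage-based pricing: machine-hours times the hourly rate."""
    if machines < 0 or hours < 0 or rate_per_machine_hour < 0:
        raise ValueError("machines, hours, and rate must be non-negative")
    machine_hours = machines * hours
    return machine_hours * rate_per_machine_hour


# Hypothetical rate, chosen only to make the comparison concrete.
RATE = 0.10

serial = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
parallel = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)
```

Both calls cover 120 machine-hours, so `serial` and `parallel` are equal; the difference is that the parallel run finishes in 1 hour instead of 120.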

Data center scale computing often leverages virtualization technologies.

Part 3. Some Bionimbus Cases

Case Study: Public Datasets in Bionimbus

Case Study: modENCODE

• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).

• Bionimbus VMs were used for some of the integrative analysis.

• Bionimbus is used as a backup for the modENCODE DCC.

>300 ChIP datasets

- Chromatin/RNA timecourse

- CBP

- PolII

- Pho/silencers

- HDACs

- Insulators

- TFs

Predictions

537 silencers

2,307 new promoters

12,285 enhancers

14,145 insulators

http://www.modencode.org/

http://www.cistrack.org/

Negre et al. Nature 2011

Case Study: IGSB

• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.

Bionimbus Virtual Machine Releases

Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP

Quality Control: Various

Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard

Part 4

Data Centers for Science

[Figure: experimental science, simulation science, data science]

1609: 30x

1670: 250x

1976: 10x-100x

2004: 10x-100x

Astronomical data

Biological data (Bionimbus)

NSF-PIRE OSDC Data Challenge

Earth science data (& disaster relief)

Open Science Data Cloud

The goal is to build a data center in Chicago for biological, scientific, medical and health care data in 4 to 5 years.

Part 5. More About Bionimbus

Database Services

Analysis Pipelines & Re-analysis Services

GWT-based Front End

Large Data Cloud Services

Data Ingestion Services

Elastic Cloud Services

Intercloud Services

(Hadoop, Sector/Sphere)

(Eucalyptus, OpenStack)

(PostgreSQL)

(IDs, etc.)

(UDT, replication)

Bionimbus Deployment Options

Bionimbus Community Cloud

www.bionimbus.org

Bionimbus AMIs & Amazon hosted applications

Bionimbus Private Clouds

A successful cloud will…

1. Provide long term persistent storage services at the scale of a data center.

2. Provide compute services at the scale of a data center.

3. Provide high performance ingestion and transport of data.

4. Support the liberation of data.

5. Peer with public clouds.

6. Peer with private genomics clouds.

Bionimbus satisfies each of these six requirements.

Bionimbus Road Map

Over the next 3 to 4 months, we will:

• Launch Bionimbus (we are in a pre-launch)

• Add Galaxy-based workflow to Bionimbus

• Add secure routing of genomes

• Add more public datasets

• Add more pipelines

For More Information

www.bionimbus.org
