Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus: A Cloud-Based Infrastructure for Managing,
Analyzing and Sharing Genomics Data
Robert Grossman
Institute for Genomics & Systems Biology (IGSB)
Computation Institute
University of Chicago
and
Open Cloud Consortium
April 21, 2011
Background
Growth of Genomic Data
1977: Sanger sequencing
1995: Microarray technology
2001: HGP
2003: ENCODE
2005: 454, Solexa sequencing
Sequence species, sequence environment, sequence everything
Genbank: 10^5, 10^8, 10^10
2003: GFS
2006: AWS
2008: Hadoop
Source: Lincoln Stein
The Challenge is to Support Cubes of High Throughput Sequence Data
Perturb the environment
Different developmental stages
Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, or other data set.
Different pathologies
We Have a Problem
• More and more of your colleagues produce so much data that they cannot easily manage, move, analyze and share it.
• Centers and large projects build their own infrastructure.
• Everyone else is on their own.
vs…
Part 1. Using Bionimbus
www.bionimbus.org
Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.
Enabling a broad community to utilize genome research
Diagram: User, Bionimbus Cloud, and Sequencing Partner or Center, connected by steps 1 to 3.
Step 1. Prepare a Sample
Step 2. Log in to Bionimbus and get a Bionimbus Key.
Step 3. FedEx your sample to CGI.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines.
Using the ability of Bionimbus to launch multiple virtual machines reduced this analysis from 25 days to 1 day.
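A rough sketch of where a reduction like this can come from, assuming the analysis splits into independent tasks of equal length (an assumption; the slide does not describe the workload). The function name and the task and VM counts are hypothetical.

```python
import math


def wall_clock_days(total_task_days: float, num_tasks: int, num_vms: int) -> float:
    """Estimate wall-clock days when equal, independent tasks are spread over VMs.

    Assumes no scheduling or data-transfer overhead, which a real cloud would add.
    """
    if num_tasks <= 0 or num_vms <= 0:
        raise ValueError("num_tasks and num_vms must be positive")
    days_per_task = total_task_days / num_tasks
    # Tasks run in "rounds": each round keeps every VM busy with one task.
    rounds = math.ceil(num_tasks / num_vms)
    return rounds * days_per_task


# Hypothetical workload: 25 one-day tasks.
serial_days = wall_clock_days(25, 25, 1)
parallel_days = wall_clock_days(25, 25, 25)
print(serial_days, parallel_days)
```

Under these assumptions, one VM needs 25 rounds and 25 VMs need a single round; with fewer VMs than tasks the estimate rounds up to whole rounds.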
Diagram components: Bionimbus Private Cloud (UC), Bionimbus Community Cloud, Bionimbus Private Cloud XY, Amazon, dbGaP, CGI, internal sequencers.
BID Generator
Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc.
Step 2. Send sample to be sequenced.
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNV, annotation…
Step 4. Secure data routing to appropriate cloud based upon BID.
Step 5. Cloud based analysis using IGSB and 3rd party tools and applications.
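The BID-based routing in the steps above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the Bionimbus implementation: the class names, the BID string format, and the three destination labels are all hypothetical, chosen to mirror the "private/community, public cloud" choice made when the BID is assigned.

```python
from dataclasses import dataclass

# Hypothetical destination labels, mirroring the choice made in Step 1.
ALLOWED_DESTINATIONS = {"private", "community", "public"}


@dataclass(frozen=True)
class BidRecord:
    bid: str          # Bionimbus ID (format here is made up)
    project: str      # project the sample belongs to
    destination: str  # which cloud should receive the data


class BidRegistry:
    """Toy stand-in for a BID generator plus routing table."""

    def __init__(self) -> None:
        self._records: dict[str, BidRecord] = {}

    def register(self, bid: str, project: str, destination: str) -> BidRecord:
        if destination not in ALLOWED_DESTINATIONS:
            raise ValueError(f"unknown destination: {destination}")
        record = BidRecord(bid, project, destination)
        self._records[bid] = record
        return record

    def route(self, bid: str) -> str:
        """Return the cloud that data tagged with this BID should be sent to."""
        # An unregistered BID raises KeyError rather than defaulting to any cloud.
        return self._records[bid].destination


registry = BidRegistry()
registry.register("BID-0001", "demo-project", "community")
registry.register("BID-0002", "clinical-project", "private")
print(registry.route("BID-0002"))
```

Refusing to route an unknown BID, rather than falling back to a default cloud, is the design choice that keeps the routing decision tied to what was declared in Step 1.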
Part 2. Introduction to Clouds
Clouds provide on-demand computing and storage resources at the scale and with the reliability of a data center.
Computer scientists were caught by surprise.
What is a Cloud?
Software as a Service (SaaS)
What Else is a Cloud?
Infrastructure as a Service (IaaS)
Users get one or more virtual machines “on demand”
Are There Other Types of Clouds?
Hadoop was developed for processing Internet-scale data for ad targeting and related applications but is now used for processing genomics data and many other applications.
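To give a feel for the map-and-reduce style of processing that Hadoop popularized, here is a minimal sketch in plain Python. It does not use any Hadoop API; the reads, the k-mer counting task, and the function names are made up for illustration.

```python
from collections import Counter
from functools import reduce


def map_kmers(read: str, k: int) -> Counter:
    """Map step: count the k-mers in one read, independently of all other reads."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Made-up reads; in a real system each would be processed on a different node.
reads = ["ACGTAC", "GTACGT"]
partial_counts = [map_kmers(read, 3) for read in reads]
total_counts = reduce(reduce_counts, partial_counts, Counter())
print(total_counts.most_common())
```

Because each map call touches only its own read, the map step can be spread across many machines, and only the small partial counts need to be combined.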
What is new about clouds?
Scale is New
Elastic, On-Demand Computing with Usage Based Pricing Is New
1 computer in a rack for 120 hours
costs the same as
120 computers in three racks for 1 hour
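The equivalence holds when the bill depends only on machine-hours consumed. A small worked sketch, assuming a flat per-machine-hour rate with no minimums or volume discounts; the rate below is a hypothetical number, not a real price.

```python
def usage_cost_cents(num_machines: int, hours: int, cents_per_machine_hour: int) -> int:
    """Cost under simple usage-based pricing: pay only for machine-hours used."""
    return num_machines * hours * cents_per_machine_hour


RATE_CENTS = 10  # hypothetical rate, in cents per machine-hour

one_machine_long = usage_cost_cents(1, 120, RATE_CENTS)
many_machines_short = usage_cost_cents(120, 1, RATE_CENTS)
print(one_machine_long, many_machines_short)
```

Both cases consume 120 machine-hours, so the cost is identical while the second finishes 120 times sooner.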
Data center scale computing often leverages virtualization technologies.
Part 3. Some Bionimbus Cases
Case Study: Public Datasets in Bionimbus
Case Study: modENCODE
• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).
• Bionimbus VMs were used for some of the integrative analysis.
• Bionimbus is used as a backup for the modENCODE DCC.
>300 ChIP datasets
- Chromatin/RNA timecourse
- CBP
- PolII
- Pho/silencers
- HDACs
- Insulators
- TFs
Predictions
537 silencers
2,307 new promoters
12,285 enhancers
14,145 insulators
www.modencode.org
www.cistrack.org
Negre et al. Nature 2011
Case Study: IGSB
• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.
Bionimbus Virtual Machine Releases
Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP
Quality Control: Various
Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard
Part 4
Data Centers for Science
experimental science
simulation science
data science
1609: 30x
1670: 250x
1976: 10x-100x
2004: 10x-100x
Astronomical data
Biological data (Bionimbus)
NSF-PIRE OSDC Data Challenge
Earth science data (& disaster relief)
Open Science Data Cloud
The goal is to build a data center in Chicago for biological, scientific, medical and health care data in 4 to 5 years.
Part 5. More About Bionimbus
Database Services (PostgreSQL)
Analysis Pipelines & Re-analysis Services
GWT-based Front End
Large Data Cloud Services (Hadoop, Sector/Sphere)
Data Ingestion Services
Elastic Cloud Services (Eucalyptus, OpenStack)
Intercloud Services
(IDs, etc.) (UDT, replication)
Bionimbus Deployment Options
Bionimbus Community Cloud
www.bionimbus.org
Bionimbus AMIs & Amazon hosted applications
Bionimbus Private Clouds
A successful cloud will…
1. Provide long term persistent storage services at the scale of a data center.
2. Provide compute services at the scale of a data center.
3. Provide high performance ingestion and transport of data.
4. Support the liberation of data.
5. Peer with public clouds.
6. Peer with private genomics clouds.
Bionimbus satisfies each of these six requirements.
Bionimbus Road Map
Over the next 3 to 4 months, we will:
• Launch Bionimbus (we are in pre-launch)
• Add Galaxy-based workflow to Bionimbus
• Add secure routing of genomes
• Add more public datasets
• Add more pipelines
For More Information
www.bionimbus.org