Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus: A Cloud-Based Infrastructure for Managing,
Analyzing and Sharing Genomics Data
Robert Grossman
Institute for Genomics & Systems Biology
Computation Institute
University of Chicago
and
Open Cloud Consortium
March 29, 2011
Part 1. Biology, Big Data & Clouds
Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Source: Lincoln Stein
The Challenge Is to Support Cubes of Next-Gen Sequence Data
The axes of the data cube: perturbations of the environment, different developmental stages, and different pathologies.
Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.
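The cube above can be sketched as a sparse mapping from (perturbation, stage, pathology) coordinates to data sets. This is a minimal illustration, not part of Bionimbus; the class name, methods, and sample identifiers are all made up.

```python
# Hypothetical sketch: the "cube" of next-gen sequencing experiments, where
# each cell is keyed by environmental perturbation, developmental stage, and
# pathology, and holds one or more data sets (ChIP-chip, ChIP-seq, RNA-seq, ...).

from collections import defaultdict

class ExperimentCube:
    """A sparse 3-D cube of experiments; most cells are empty in practice."""

    def __init__(self):
        # (perturbation, stage, pathology) -> list of data-set descriptors
        self._cells = defaultdict(list)

    def add(self, perturbation, stage, pathology, assay, path):
        self._cells[(perturbation, stage, pathology)].append(
            {"assay": assay, "path": path}
        )

    def cell(self, perturbation, stage, pathology):
        return self._cells[(perturbation, stage, pathology)]

    def assays(self):
        """All assay types present anywhere in the cube."""
        return {d["assay"] for cell in self._cells.values() for d in cell}

# Example usage with made-up identifiers:
cube = ExperimentCube()
cube.add("heat_shock", "embryo", "none", "RNA-seq", "s3://bucket/r1.fastq")
cube.add("heat_shock", "embryo", "none", "ChIP-seq", "s3://bucket/c1.fastq")
```

The sparse dictionary matters here: the full cross-product of perturbations, stages, and pathologies is huge, but only the cells that were actually sequenced consume storage.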
Discipline         Duration   Size            # Devices
HEP - LHC          10 years   15 PB/year      One
Astronomy - LSST   10 years   10 PB/year      One
Genomics - NGS     2-4 years  0.5 TB/genome   Hundreds
Genomics as a Big Data Science
What is new about clouds?
Scale is New
Elastic, On-Demand Computing with Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
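The equivalence above is just arithmetic under usage-based pricing: cost depends only on total node-hours, not on how they are spread across machines. A minimal sketch, with an assumed hourly rate:

```python
# Usage-based pricing makes node-hours fungible: a job's cost depends only on
# nodes * hours, so a wide short job and a narrow long job price the same.
# The rate below is an assumption for illustration, not a real price.

RATE_PER_NODE_HOUR = 0.10  # assumed dollars per node-hour

def job_cost(nodes, hours):
    return nodes * hours * RATE_PER_NODE_HOUR

serial = job_cost(nodes=1, hours=120)    # 1 computer for 120 hours
elastic = job_cost(nodes=120, hours=1)   # 120 computers in three racks for 1 hour
assert serial == elastic  # same node-hours, same price
```

The practical consequence is elasticity: for the same money, the wide job finishes 120 times sooner.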
Part 2. What is Bionimbus?
www.bionimbus.org
Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.
[Diagram: data flows between the IGSB sequencers, external sequencers, the UC Bionimbus Private Cloud, the Bionimbus Community Cloud, a Bionimbus Private Cloud XY, Amazon, and dbGaP.]
Step 1. Get Bionimbus ID (BID) from the BID Generator; assign project, private/community/public cloud, etc.
Step 2. Send sample to be sequenced.
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNV, annotation…
Step 4. Secure data routing to the appropriate cloud based upon the BID.
Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
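The steps above can be sketched as a small routing protocol keyed on the BID. Everything here is illustrative (the function names, record shape, and cloud labels are assumptions), not the actual Bionimbus API; the point is that the destination cloud is fixed when the BID is issued, so later data never routes to the wrong place.

```python
# Hypothetical sketch of BID-based routing. Names and labels are invented
# for illustration; they are not the real Bionimbus interfaces.

import uuid

# Step 1: a BID record carries the project and the cloud the data may go to.
def make_bid(project, cloud):
    assert cloud in {"private", "community", "public"}
    return {"bid": uuid.uuid4().hex, "project": project, "cloud": cloud}

# Step 4: secure routing -- the destination is determined by the BID alone,
# so raw reads (step 3a) and derived calls (step 3b) land in the same,
# correct cloud without per-file decisions.
DESTINATIONS = {
    "private": "bionimbus-private",
    "community": "bionimbus-community",
    "public": "amazon",
}

def route(bid_record):
    return DESTINATIONS[bid_record["cloud"]]

bid = make_bid(project="modENCODE", cloud="community")
# Step 5 would then run the analysis wherever route(bid) points.
```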
What is a good unit to understand data-intensive computing of biological data?
Bionimbus & OSDC Today
• The NIH in the U.S. currently makes available for download approximately 2 PB of data.
• Bionimbus 2010 consists of 6 racks, 212 nodes, 1568 cores and 0.9 PB of storage.
• Bionimbus is part of the proof-of-concept Open Science Data Cloud, which consists of 14 racks, 472 nodes, 3776 cores and 3+ PB of storage.
Bionimbus services:
• GWT-based Front End
• Database Services
• Analysis Pipelines & Re-analysis Services
• Data Ingestion Services
• Large Data Cloud Services
• Elastic Cloud Services
• Intercloud Services
Bionimbus Deployment Options
• Bionimbus Community Cloud (www.bionimbus.org)
• Bionimbus AMIs & Amazon-hosted applications
• Bionimbus Private Clouds
Part 3. Some Bionimbus Case Studies
Case Study: Public Datasets in Bionimbus
Case Study: ModENCODE
• Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments).
• Bionimbus VMs were used for some of the integrative analysis.
• Bionimbus is used as a backup for the modENCODE DCC.
Case Study: IGSB
• All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.
Bionimbus Virtual Machine Releases
Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP
Quality Control: Various
Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard
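As an illustration of how the tools shipped in these VM images chain together, a ChIP-seq run might pass through alignment (Bowtie), BAM processing (Samtools), and peak calling (MACS). The sketch below only assembles the command lines; the flags are assumptions drawn from those tools' common usage and should be checked against each tool's own documentation for the versions installed.

```python
# Hypothetical sketch: chaining the VM tools above into a ChIP-seq workflow.
# File names and the reference index ("dm3") are made up; flags are
# illustrative and vary by tool version.

def alignment_cmd(index, reads, out_sam):
    # Bowtie with SAM output (-S); index and reads are placeholders.
    return ["bowtie", "-S", index, reads, out_sam]

def sort_cmd(sam, out_bam):
    # Samtools sort; the -o output flag varies across samtools versions.
    return ["samtools", "sort", sam, "-o", out_bam]

def peak_call_cmd(treatment, control, name):
    # MACS peak calling on treatment vs. control alignments.
    return ["macs", "-t", treatment, "-c", control, "--name", name]

pipeline = [
    alignment_cmd("dm3", "chip.fastq", "chip.sam"),
    sort_cmd("chip.sam", "chip.bam"),
    peak_call_cmd("chip.bam", "input.bam", "chip_peaks"),
]
# Each command would be dispatched to a VM, e.g. with subprocess.run(cmd).
```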
Part 4. What is the OSDC?
Open Science Data Cloud
• Astronomical data
• Biological data (Bionimbus)
• Earth science data (& disaster relief)
• NSF-PIRE OSDC Data Challenge
Open Cloud Consortium
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
• Develops reference implementations, benchmarks and standards.
OCC Members
• Companies: Cisco, Citrix, Yahoo!, …
• Universities: University of Chicago, Calit2, Johns Hopkins, Northwestern Univ., ORNL, University of Illinois at Chicago, …
• Federal agencies: NASA
• Other: National Lambda Rail
• Adding international partners in 2011.
Infrastructure
• 2010 Proof-of-Concept Infrastructure
– 450+ nodes
– 3000+ cores
– 3+ PB
– Four data centers (two more to come in 2011)
– Data centers have 10G network connections (some 100G links in 2011)
• Plan to add approximately 1 PB of data in 2011.
• With current funding, we will refresh 1/3 of the infrastructure in 2011 and 2012.
Towards a Long Term, Sustainable Model
• Capital expenses: about $1M/year.
• Operating expenses: about $1M/year.
• The Moore Foundation is providing $1M/year for 2011 and 2012 to cover the capital expenses.
[Diagram: matching data to infrastructure. One chart plots data size (small; medium to large; very large) against variety of analysis (low; med; wide): a scientist with a laptop needs no infrastructure, the Open Science Data Cloud provides general infrastructure, and sequencing centers, the LHC and the LSST run dedicated infrastructure. A companion scale contrasts cycles (single workstations; small to medium clusters; HPC) with persistent data (databases; data clouds; large & specialized clusters), from small to large.]
Bionimbus Team*
David Hanley, Nicolas Negre, Elizabeth Bartom, Nicholas Bild, Christopher D. Brown, Marc Domanus, Robert L. Grossman, A. Jason Grundstad, Xiangjun Liu, Michal Sabala, Parantu K. Shah, Kevin P. White
Institute for Genomics & Systems Biology, University of Chicago

Jia Chen, Yunhong Gu and Damian Roqueiro
University of Illinois at Chicago

Lincoln Stein and Zheng Zha
Ontario Institute for Cancer Research

*In alphabetical order
Acknowledgements
Questions?
Thank You
For more information: www.bionimbus.org
www.opencloudconsortium.org
www.igsb.org
rgrossman.com