Everything Comes in 3’s
Angel Pizarro, Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine
Outline
• This talk looks at the practical aspects of cloud computing
– We will be diving into specific examples
• 3 pillars of systems design
• 3 storage implementations
• 3 areas of bioinformatics – And how they are affected by clouds
• 3 interesting internal projects
There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
Pillars of Systems Design
1. Provisioning
– API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
– Not discussing further, since this is the WHOLE POINT of cloud computing
2. Configuration
– How to get a system up to the point you can do something with it
3. Command and Control
– How to tell the system what to do
System Configuration with Chef
• Automatic installation of packages, service configuration and initialization
• Specifications use a real programming language with known behavior
• Brings the system to a desired state; runs are idempotent
• http://opscode.com/chef/
Chef Recipes & Cookbooks
• The specification for installing and configuring a system component
• Able to support more than one platform
• Has access to system-wide information
– hostname, IP addr, RAM, # processors, etc.
• Contain templates, documentation, static files & assets
• Can define dependencies on other recipes
• Executed in order; execution stops at first failure
Simple Recipe: Rsync
• Installs rsync on the system
• Metadata file states what platforms are supported
• Note that Chef is a Linux-centric system
• BUT, the WikiWiki is MessyMessy
– Look at Chef Solo & Resources
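A minimal sketch of what such a cookbook might look like, using the standard Chef recipe and metadata DSL (file names and the platform list are illustrative, not the talk's actual cookbook):

```ruby
# recipes/default.rb -- install rsync on whatever platform the node runs
package "rsync" do
  action :install
end

# metadata.rb -- states which platforms the cookbook supports
%w{ubuntu debian centos redhat fedora}.each { |os| supports os }
```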
More Complex Recipe: Heartbeat
• Installs heartbeat package
• Registers the service, specifying that it can be restarted and that it provides a status message
• Finally, it starts the service
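The steps above map onto Chef's package and service resources roughly like this (a sketch, not the talk's actual recipe):

```ruby
# recipes/default.rb -- install the heartbeat package
package "heartbeat"

# Register the service, declare that it supports restart and a status
# message, then enable it at boot and start it now.
service "heartbeat" do
  supports :restart => true, :status => true
  action [:enable, :start]
end
```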
Command and Control
• Traditional grid computing
– QSUB – SGE, PBS, Torque
– Usually requires tightly coupled and static systems
– Shared file systems, firewalls, user accounts, shared exe & lib locations
– Best for capability processes (e.g. MPI)
• Map-Reduce is the new hotness
– Best for data-parallel processes
– Assumes loosely coupled, non-static components
– Job staging is a critical component
Map Reduce in a Nutshell
• Algorithm pioneered by Google for distributed data analysis
– Data-parallel analyses fit well into this model
– Split data, work on each part in parallel, then merge results
• Hadoop, Disco, CloudCrowd, …
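The split/work/merge cycle can be sketched in a few lines of Ruby — a toy word count, using threads in place of a real cluster:

```ruby
# Toy map-reduce: split the data, map each chunk in its own thread,
# then merge the per-chunk results.
def map_reduce(data, chunks:, map:, reduce:)
  parts  = data.each_slice((data.size.to_f / chunks).ceil).to_a
  mapped = parts.map { |part| Thread.new { map.call(part) } }.map(&:value)
  mapped.reduce { |a, b| reduce.call(a, b) }
end

# Word count, the canonical example
counts = map_reduce(
  %w[the cat sat on the mat the end],
  chunks: 3,
  map:    ->(words) { words.tally },
  reduce: ->(a, b)  { a.merge(b) { |_k, x, y| x + y } }
)
# counts["the"] => 3
```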
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS
• Define small scripts to
– Split a FASTA file
– Run a BLAT search
– The first script defines the inputs of the second
• Submit the input FASTA to S3
• Start a master node as the central communication hub
• Start slave nodes, configured to ask for work from the master and save results back to S3
• Press "Play"
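The first step, splitting a FASTA file into per-job batches, might look like this (a hedged sketch; `split_fasta` and the chunk size are illustrative, not the talk's actual script):

```ruby
# Break a multi-record FASTA string into batches of N records,
# one batch per slave job. Each record starts with a ">" header line.
def split_fasta(fasta_text, records_per_chunk)
  records = fasta_text.split(/^(?=>)/)  # zero-width split before each header
  records.reject!(&:empty?)
  records.each_slice(records_per_chunk).map(&:join)
end

fasta  = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
chunks = split_fasta(fasta, 2)
# chunks.size => 2 ; the first chunk holds seq1 and seq2
```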
Workflow of Distributed BLAT
[Diagram: a PC uploads the input FASTA to S3 and submits the BLAT job to the master node, which boots and coordinates four slave nodes. The initial process splits the FASTA file; subsequent jobs BLAT the smaller files and save each result to S3 as it goes, and the PC downloads the results.]
Master Node => Resque
• GitHub-developed background job processing framework
• Jobs are attached to a class from your application and stored as JSON
• Uses the Redis key-value store
• Simple front end for viewing job queue status and failed jobs
Resque can invoke any class that has a class method "perform()"
http://github.com/defunkt/resque
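In Resque's model, any class exposing a class-level `perform` is a job. `BlatJob` and its argument below are hypothetical, and the push/reserve cycle is simulated in-process rather than going through Redis:

```ruby
require "json"

# A Resque-style job class: @queue names the queue, perform does the work.
class BlatJob
  @queue = :blat

  def self.perform(chunk_key)
    # In the real pipeline this would fetch the chunk from S3 and run BLAT;
    # here we just report that the job ran.
    "ran BLAT on #{chunk_key}"
  end
end

# Jobs are stored as JSON (class name + args), as Resque keeps them in Redis.
payload = JSON.generate("class" => "BlatJob", "args" => ["chunk-001.fa"])

# A worker reserves the payload and dispatches to the named class.
job    = JSON.parse(payload)
result = Object.const_get(job["class"]).perform(*job["args"])
# result => "ran BLAT on chunk-001.fa"
```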
The scripts
Storage in the Cloud : S3
• Permanent storage for your data
• Pay as you go for transmission and holding
• Eliminates backups
• Pretty good CDN
– Able to hook into a better CDN SLA via CloudFront
• Can be slow at times
– Reports of 10-second delays, but average response is 300 ms
S3 Costs
Usage rates | Usage example
$0.15 per GB / month (storage) | 1,690 GB
$0.10 per GB IN | 100 GB IN
$0.15 per GB OUT | 100 GB OUT
$0.01 per 1,000 PUT/POST requests | 1,000,000 requests
$0.01 per 10,000 GET requests | 1,000,000 requests
Total: $289.50 per month
= $0.17 per GB per month
= $2.06 per GB per year
= $3,474.00 per year for 1,690 GB
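The table's total can be reproduced directly from the listed rates (historical prices, quoted as on the slide):

```ruby
# S3 monthly bill for the usage example above
storage   = 1690 * 0.15              # holding 1,690 GB
xfer_in   = 100  * 0.10              # 100 GB transferred in
xfer_out  = 100  * 0.15              # 100 GB transferred out
puts_cost = (1_000_000 / 1_000)  * 0.01   # 1M PUT/POST requests
gets_cost = (1_000_000 / 10_000) * 0.01   # 1M GET requests
monthly   = storage + xfer_in + xfer_out + puts_cost + gets_cost
# monthly => 289.5 ; monthly * 12 => 3474.0 per year for 1,690 GB
```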
Storage 2: Distributed FS on EC2
• Hadoop HDFS, Gigaspaces, etc.
• Network latency may be an issue for traditional DFSs
– Gluster, GPFS, etc.
• Tighter integration with the execution framework, better performance?
[Diagram: your data distributed across the local disks of multiple EC2 nodes]
DFS on EC2 m1.xlarge Costs
Initial cost: $2,800.00 (3-yr reserved instance fee)
Usage cost: $0.24 / hr × 24 hours / day × 365 days / yr × 3 yrs
Total 3-yr cost: $9,107.20
= $3,035.73 per year for 1,690 GB*
= $1.80 per GB per year*
* Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
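The arithmetic behind this table (and the memory-grid table that follows) is simply the upfront reserved-instance fee plus three years of hourly charges:

```ruby
# Three-year cost of a reserved EC2 instance: upfront fee + hourly usage
def three_year_cost(upfront, hourly)
  upfront + hourly * 24 * 365 * 3
end

dfs_total  = three_year_cost(2_800.00, 0.24)   # m1.xlarge table above
grid_total = three_year_cost(9_800.00, 0.84)   # memory-grid table
# dfs_total => 9107.2 ; per year ~3035.73 ; per GB over 1,690 GB ~1.80
```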
Storage 3: Memory Grids
• “RAM is the new Disk”
• Application-level RAM clustering
– Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
• Performance for capability jobs?
[Diagram: your data held across the RAM of multiple EC2 nodes]
* There is also a “Disk is the new RAM” camp, where redundant disk is used to mitigate seek times on subsequent reads
Memory Grid Costs
Initial cost: $9,800.00 (3-yr reserved instance fee)
Usage cost: $0.84 / hr × 24 hours / day × 365 days / yr × 3 yrs
Total 3-yr cost: $31,875.20
= $10,625.07 per year
= $155.34 per GB per year
= $262,519.92 per year for 1,690 GB
Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
Cloud Influence on Bioinformatics
• Computational Biology
– Algorithms will need to account for large I/O latency
– Statistical tests will need to account for incomplete information, or incremental results
• Software Engineering
– Built-for-the-cloud algorithms are popping up
– CloudBurst is a featured example in AWS EMR!
• Application to Life Sciences
– Deploy ready-made images for use
– Cycle Computing, ViPDAC, others soon to follow
Algorithms need to be I/O centric
• Incur a slightly higher computational burden to reduce I/O across non-optimal networks
P. Balaji, W. Feng, H. Lin 2008
Some Internal Projects
• Resource Manager
– Service for on-demand provisioning and release of EC2 nodes
– Utilizes Chef to define and apply roles (compute node, DB server, etc.)
– Terminates idle compute nodes at 52 minutes
• Workflow Manager
– Defines and executes data analysis workflows
– Relies on RM to provision nodes
– Once appropriate worker nodes are available, acts as the central work queue
• RUM
– RNA-Seq Ultimate Mapper
– Map-Reduce RNA-Seq analysis pipeline
– Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
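A toy illustration of the decision-tree idea, not RUM's actual logic: prefer a unique hit from one mapper, fall back to a unique hit from the other, and otherwise call the read ambiguous or unmapped. The function and hit formats are invented for illustration:

```ruby
# Combine the hit lists two mappers produced for one read.
def combine(bowtie_hits, blat_hits)
  return [:unique, bowtie_hits.first] if bowtie_hits.size == 1
  return [:unique, blat_hits.first]   if blat_hits.size == 1
  return [:ambiguous, nil] if bowtie_hits.size > 1 || blat_hits.size > 1
  [:unmapped, nil]
end

combine(["chr1:100"], [])            # => [:unique, "chr1:100"]
combine([], ["chr2:50"])             # => [:unique, "chr2:50"]
combine(["chr1:100", "chr3:7"], [])  # => [:ambiguous, nil]
combine([], [])                      # => [:unmapped, nil]
```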
Bowtie Alone
Mapping Efficiency: Mapped 74%, Ambiguous 8%, Unmapped 18%
Mapping Breakdown: Unique Paired 38.0%, Unique Single 37.0%, Ambiguous 25.0%

RUM (Bowtie + BLAT + processing)
Mapping Breakdown: Unique Paired 70%, Unique Single 16%, Ambiguous 14%
Mapping Efficiency: Mapped 81%, Mapped Ambiguously 15%, Unmapped 4%

Significantly increases the confidence of your data
RUM Costs
• Computational cost ~$100 - $200
– 6-8 hours per lane on m2.4xlarge ($2.40 / hour)
• Cost of reagents ~ $10,000
• Computation is ~1% of total cost
Acknowledgements
• Garret FitzGerald
• Ian Blair
• John Hogenesch
• Greg Grant
• Tilo Grosser
• NIH & UPENN for support
• My Team
– David Austin
– Andrew Brader
– Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s