Everything Comes in 3’s
Angel Pizarro, Director, ITMAT Bioinformatics Facility
University of Pennsylvania School of Medicine
Outline
• This talk looks at the practical aspects of cloud computing
– We will be diving into specific examples
• 3 pillars of systems design
• 3 storage implementations
• 3 areas of bioinformatics – And how they are affected by clouds
• 3 interesting internal projects
There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
Pillars of Systems Design
1. Provisioning
– API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
– Not discussing further, since this is the WHOLE POINT of cloud computing
2. Configuration
– How to get a system up to the point you can do something with it
3. Command and Control
– How to tell the system what to do
System Configuration with Chef
• Automatic installation of packages, service configuration and initialization
• Specifications use a real programming language with known behavior
• Brings the system to a desired state; runs are idempotent
• http://opscode.com/chef/
Chef Recipes & Cookbooks
• The specification for installing and configuring a system component
• Able to support more than one platform
• Has access to system-wide information
– hostname, IP addr, RAM, # processors, etc.
• Contain templates, documentation, static files & assets
• Can define dependencies on other recipes
• Executed in order; execution stops at first failure
Simple Recipe: Rsync
• Installs rsync on the system
• Metadata file states what platforms are supported
• Note that Chef is a Linux-centric system
• BUT, the WikiWiki is MessyMessy
– Look at Chef Solo & Resources
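A minimal sketch of what such a cookbook might look like, using the standard Chef recipe and metadata DSL (file names and the platform list are illustrative, not the talk's actual cookbook):

```ruby
# recipes/default.rb -- install rsync on whatever platform the node runs
package "rsync" do
  action :install
end

# metadata.rb -- states which platforms the cookbook supports
%w{ubuntu debian centos redhat fedora}.each { |os| supports os }
```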
More Complex Recipe: Heartbeat
• Installs heartbeat package
• Registers the service, specifying that it can be restarted and that it provides a status message
• Finally, it starts the service
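The steps above map onto Chef's package and service resources roughly like this (a sketch, not the talk's actual recipe):

```ruby
# recipes/default.rb -- install the heartbeat package
package "heartbeat"

# Register the service, declare that it supports restart and a status
# message, then enable it at boot and start it now.
service "heartbeat" do
  supports :restart => true, :status => true
  action [:enable, :start]
end
```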
Command and Control
• Traditional grid computing
– QSUB – SGE, PBS, Torque
– Usually requires tightly coupled and static systems
– Shared file systems, firewalls, user accounts, shared exe & lib locations
– Best for capability processes (e.g. MPI)
• Map-Reduce is the new hotness
– Best for data-parallel processes
– Assumes loosely coupled, non-static components
– Job staging is a critical component
Map Reduce in a Nutshell
• Algorithm pioneered by Google for distributed data analysis
– Data-parallel analyses fit well into this model
– Split data, work on each part in parallel, then merge results
• Hadoop, Disco, CloudCrowd, …
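The split/work/merge cycle can be sketched in a few lines of Ruby — a toy word count, using threads in place of a real cluster:

```ruby
# Toy map-reduce: split the data, map each chunk in its own thread,
# then merge the per-chunk results.
def map_reduce(data, chunks:, map:, reduce:)
  parts  = data.each_slice((data.size.to_f / chunks).ceil).to_a
  mapped = parts.map { |part| Thread.new { map.call(part) } }.map(&:value)
  mapped.reduce { |a, b| reduce.call(a, b) }
end

# Word count, the canonical example
counts = map_reduce(
  %w[the cat sat on the mat the end],
  chunks: 3,
  map:    ->(words) { words.tally },
  reduce: ->(a, b)  { a.merge(b) { |_k, x, y| x + y } }
)
# counts["the"] => 3
```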
Serial Execution of Proteomics Search
Parallel Proteomics Search
Roll-Your-Own MR on AWS
• Define small scripts to
– Split a FASTA file
– Run a BLAT search
– The first script defines the inputs of the second
• Submit the input FASTA to S3
• Start a master node as the central communication hub
• Start slave nodes, configured to ask for work from the master and save results back to S3
• Press "Play"
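The first step, splitting a FASTA file into per-job batches, might look like this (a hedged sketch; `split_fasta` and the chunk size are illustrative, not the talk's actual script):

```ruby
# Break a multi-record FASTA string into batches of N records,
# one batch per slave job. Each record starts with a ">" header line.
def split_fasta(fasta_text, records_per_chunk)
  records = fasta_text.split(/^(?=>)/)  # zero-width split before each header
  records.reject!(&:empty?)
  records.each_slice(records_per_chunk).map(&:join)
end

fasta  = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
chunks = split_fasta(fasta, 2)
# chunks.size => 2 ; the first chunk holds seq1 and seq2
```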
Workflow of Distributed BLAT
[Diagram: a PC uploads the input FASTA to S3 and submits the BLAT job to the master node, which boots and coordinates four slave nodes. The initial process splits the FASTA file; subsequent jobs BLAT the smaller files and save each result to S3 as it goes, and the PC downloads the results.]
Master Node => Resque
• GitHub-developed background job processing framework
• Jobs are attached to a class from your application and stored as JSON
• Uses the Redis key-value store
• Simple front end for viewing job queue status and failed jobs
Resque can invoke any class that has a class method "perform()"
http://github.com/defunkt/resque
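In Resque's model, any class exposing a class-level `perform` is a job. `BlatJob` and its argument below are hypothetical, and the push/reserve cycle is simulated in-process rather than going through Redis:

```ruby
require "json"

# A Resque-style job class: @queue names the queue, perform does the work.
class BlatJob
  @queue = :blat

  def self.perform(chunk_key)
    # In the real pipeline this would fetch the chunk from S3 and run BLAT;
    # here we just report that the job ran.
    "ran BLAT on #{chunk_key}"
  end
end

# Jobs are stored as JSON (class name + args), as Resque keeps them in Redis.
payload = JSON.generate("class" => "BlatJob", "args" => ["chunk-001.fa"])

# A worker reserves the payload and dispatches to the named class.
job    = JSON.parse(payload)
result = Object.const_get(job["class"]).perform(*job["args"])
# result => "ran BLAT on chunk-001.fa"
```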
The scripts
Storage in the Cloud : S3
• Permanent storage for your data
• Pay as you go for transmission and holding
• Eliminates backups
• Pretty good CDN
– Able to hook into a better CDN SLA via CloudFront
• Can be slow at times
– Reports of 10-second delays, but average response is 300 ms
S3 Costs
Usage rates | Usage example
$0.15 per GB / month (storage) | 1,690 GB
$0.10 per GB IN | 100 GB IN
$0.15 per GB OUT | 100 GB OUT
$0.01 per 1,000 PUT/POST requests | 1,000,000 requests
$0.01 per 10,000 GET requests | 1,000,000 requests
Total: $289.50 per month
= $0.17 per GB per month
= $2.06 per GB per year
= $3,474.00 per year for 1,690 GB
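The table's total can be reproduced directly from the listed rates (historical prices, quoted as on the slide):

```ruby
# S3 monthly bill for the usage example above
storage   = 1690 * 0.15              # holding 1,690 GB
xfer_in   = 100  * 0.10              # 100 GB transferred in
xfer_out  = 100  * 0.15              # 100 GB transferred out
puts_cost = (1_000_000 / 1_000)  * 0.01   # 1M PUT/POST requests
gets_cost = (1_000_000 / 10_000) * 0.01   # 1M GET requests
monthly   = storage + xfer_in + xfer_out + puts_cost + gets_cost
# monthly => 289.5 ; monthly * 12 => 3474.0 per year for 1,690 GB
```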
Storage 2: Distributed FS on EC2
• Hadoop HDFS, Gigaspaces, etc.
• Network latency may be an issue for traditional DFSs
– Gluster, GPFS, etc.
• Tighter integration with the execution framework, better performance?
[Diagram: your data distributed across the local disks of multiple EC2 nodes]
DFS on EC2 m1.xlarge Costs
Initial cost: $2,800.00 (3-yr reserved instance fee)
Usage cost: $0.24 / hr × 24 hours / day × 365 days / yr × 3 yrs
Total 3-yr cost: $9,107.20
= $3,035.73 per year for 1,690 GB*
= $1.80 per GB per year*
* Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
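The arithmetic behind this table (and the memory-grid table that follows) is simply the upfront reserved-instance fee plus three years of hourly charges:

```ruby
# Three-year cost of a reserved EC2 instance: upfront fee + hourly usage
def three_year_cost(upfront, hourly)
  upfront + hourly * 24 * 365 * 3
end

dfs_total  = three_year_cost(2_800.00, 0.24)   # m1.xlarge table above
grid_total = three_year_cost(9_800.00, 0.84)   # memory-grid table
# dfs_total => 9107.2 ; per year ~3035.73 ; per GB over 1,690 GB ~1.80
```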
Storage 3: Memory Grids
• “RAM is the new Disk”
• Application-level RAM clustering
– Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
• Performance for capability jobs?
[Diagram: your data held across the RAM of multiple EC2 nodes]
* There is also a “Disk is the new RAM” camp, where redundant disk is used to mitigate seek times on subsequent reads
Memory Grid Costs
Initial cost: $9,800.00 (3-yr reserved instance fee)
Usage cost: $0.84 / hr × 24 hours / day × 365 days / yr × 3 yrs
Total 3-yr cost: $31,875.20
= $10,625.07 per year
= $155.34 per GB per year
= $262,519.92 per year for 1,690 GB
Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
Cloud Influence on Bioinformatics
• Computational Biology
– Algorithms will need to account for large I/O latency
– Statistical tests will need to account for incomplete information, or incremental results
• Software Engineering
– Built-for-the-cloud algorithms are popping up
– CloudBurst is a featured example in AWS EMR!
• Application to Life Sciences
– Deploy ready-made images for use
– Cycle Computing, ViPDAC, others soon to follow
Algorithms need to be I/O centric
• Incur a slightly higher computational burden to reduce I/O across non-optimal networks
P. Balaji, W. Feng, H. Lin 2008
Some Internal Projects
• Resource Manager
– Service for on-demand provisioning and release of EC2 nodes
– Utilizes Chef to define and apply roles (compute node, DB server, etc.)
– Terminates idle compute nodes at 52 minutes
• Workflow Manager
– Defines and executes data analysis workflows
– Relies on RM to provision nodes
– Once appropriate worker nodes are available, acts as the central work queue
• RUM
– RNA-Seq Ultimate Mapper
– Map-Reduce RNA-Seq analysis pipeline
– Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
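A toy illustration of the decision-tree idea, not RUM's actual logic: prefer a unique hit from one mapper, fall back to a unique hit from the other, and otherwise call the read ambiguous or unmapped. The function and hit formats are invented for illustration:

```ruby
# Combine the hit lists two mappers produced for one read.
def combine(bowtie_hits, blat_hits)
  return [:unique, bowtie_hits.first] if bowtie_hits.size == 1
  return [:unique, blat_hits.first]   if blat_hits.size == 1
  return [:ambiguous, nil] if bowtie_hits.size > 1 || blat_hits.size > 1
  [:unmapped, nil]
end

combine(["chr1:100"], [])            # => [:unique, "chr1:100"]
combine([], ["chr2:50"])             # => [:unique, "chr2:50"]
combine(["chr1:100", "chr3:7"], [])  # => [:ambiguous, nil]
combine([], [])                      # => [:unmapped, nil]
```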
Bowtie Alone
Mapping Efficiency: Mapped 74%, Ambiguous 8%, Unmapped 18%
Mapping Breakdown: Unique Paired 38.0%, Unique Single 37.0%, Ambiguous 25.0%

RUM (Bowtie + BLAT + processing)
Mapping Breakdown: Unique Paired 70%, Unique Single 16%, Ambiguous 14%
Mapping Efficiency: Mapped 81%, Mapped Ambiguously 15%, Unmapped 4%

Significantly increases the confidence of your data
RUM Costs
• Computational cost ~$100 - $200
– 6-8 hours per lane on m2.4xlarge ($2.40 / hour)
• Cost of reagents ~ $10,000
• Computation is ~1% of total cost
Acknowledgements
• Garret FitzGerald
• Ian Blair
• John Hogenesch
• Greg Grant
• Tilo Grosser
• NIH & UPENN for support
• My Team
– David Austin
– Andrew Brader
– Weichen Wu
Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s