Talk at NCRR P41 Director's Meeting

Invited Talk given at the NCRR P41 Director's meeting on October 12, 2010

Transcript of Talk at NCRR P41 Director's Meeting

Page 1: Talk at NCRR P41 Director's Meeting

Amazon Web Services
A platform for life science research

Deepak Singh, Ph.D.
Amazon Web Services

NCRR P41 PI meeting, October 2010

Page 2: Talk at NCRR P41 Director's Meeting

the new reality

Page 3: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of data

Page 4: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of people

Page 5: Talk at NCRR P41 Director's Meeting

lots and lots and lots and lots and lots of places

Page 6: Talk at NCRR P41 Director's Meeting

constant change

Page 7: Talk at NCRR P41 Director's Meeting

science in a new reality

Page 8: Talk at NCRR P41 Director's Meeting

science in a new reality^

Page 9: Talk at NCRR P41 Director's Meeting

data science in a new reality

Page 11: Talk at NCRR P41 Director's Meeting

goal

Page 12: Talk at NCRR P41 Director's Meeting

optimize the most valuable resource

Page 13: Talk at NCRR P41 Director's Meeting

compute, storage, workflows, memory, transmission, algorithms, cost, …

Page 15: Talk at NCRR P41 Director's Meeting

enter the cloud

Page 16: Talk at NCRR P41 Director's Meeting

what is the cloud?

Page 17: Talk at NCRR P41 Director's Meeting

infrastructure

Page 18: Talk at NCRR P41 Director's Meeting
Page 19: Talk at NCRR P41 Director's Meeting

scalable

Page 20: Talk at NCRR P41 Director's Meeting

3000 CPU’s for one firm’s risk management application

[Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 to Tuesday 4/28/2009. Axis marks at 3000 and 300. Annotation: 300 CPU’s on weekends.]
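As a rough illustration of why this usage pattern favors elastic capacity, the sketch below compares weekly instance-hours for a cluster held at 3000 instances against one that drops to 300 on weekends. It assumes a flat 3000 instances on each of five weekdays and a flat 300 on each of two weekend days, which simplifies the chart; the function name is illustrative and no pricing is implied.

```python
HOURS_PER_DAY = 24

def weekly_instance_hours(weekday_instances, weekend_instances):
    """Instance-hours for one week: 5 weekdays plus 2 weekend days (flat load assumed)."""
    return (5 * weekday_instances + 2 * weekend_instances) * HOURS_PER_DAY

fixed = weekly_instance_hours(3000, 3000)    # capacity sized for the peak all week
elastic = weekly_instance_hours(3000, 300)   # scaled down on weekends
saved_fraction = (fixed - elastic) / fixed

print(fixed, elastic, round(saved_fraction * 100, 1))
```

Under these assumptions the elastic schedule uses roughly a quarter fewer instance-hours per week than the fixed one.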

Page 21: Talk at NCRR P41 Director's Meeting

highly available

Page 22: Talk at NCRR P41 Director's Meeting

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D
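One way to use several Availability Zones is to spread instances evenly across them, so that losing a single zone removes only a share of capacity. A minimal sketch of that placement logic, with no AWS calls; the zone labels are the ones on the slide and the function name is hypothetical:

```python
from collections import Counter
from itertools import cycle, islice

ZONES = ["A", "B", "C", "D"]  # zone labels as they appear on the slide

def place_instances(count, zones=ZONES):
    """Assign `count` instances to zones round-robin; returns one zone per instance."""
    return list(islice(cycle(zones), count))

placement = place_instances(10)
per_zone = Counter(placement)
print(per_zone)
```

Round-robin keeps the per-zone counts within one of each other, so no zone carries much more than its share.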

Page 23: Talk at NCRR P41 Director's Meeting

durable

Page 24: Talk at NCRR P41 Director's Meeting

99.999999999%
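To get a feel for what eleven nines means, the sketch below turns the figure into an expected loss rate. It reads 99.999999999% as the probability that a single stored object survives one year, which is an assumption about the slide rather than something it states; the function names are illustrative.

```python
from fractions import Fraction

# Assumption: the slide's figure is annual, per-object durability.
DURABILITY = Fraction("0.99999999999")

def expected_annual_losses(num_objects):
    """Expected number of objects lost per year out of num_objects stored."""
    return num_objects * (1 - DURABILITY)

def years_per_expected_loss(num_objects):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(num_objects)

print(years_per_expected_loss(10000))
```

Exact fractions are used instead of floats because 1 - 0.99999999999 sits close to floating-point precision limits.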

Page 25: Talk at NCRR P41 Director's Meeting

dynamic

Page 26: Talk at NCRR P41 Director's Meeting

extensible

Page 27: Talk at NCRR P41 Director's Meeting
Page 28: Talk at NCRR P41 Director's Meeting

secure

Page 29: Talk at NCRR P41 Director's Meeting

a utility

Page 30: Talk at NCRR P41 Director's Meeting

on-demand instances

reserved instances

spot instances
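A simple way to compare the first two purchasing options is a break-even calculation: a reserved instance trades a one-time fee for a lower hourly rate, so it pays off only past a certain number of hours. The sketch below is arithmetic only; the prices are hypothetical and chosen to keep the numbers easy to follow. Spot instances are priced by bidding and are not modeled here.

```python
def break_even_hours(upfront_fee, on_demand_rate, reserved_rate):
    """Hours of use after which a reserved instance costs less than on-demand."""
    if reserved_rate >= on_demand_rate:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - reserved_rate)

# Hypothetical prices, not actual AWS rates.
hours = break_even_hours(upfront_fee=200.0, on_demand_rate=0.50, reserved_rate=0.25)
print(hours)
```

Workloads expected to run fewer hours than the break-even point would be cheaper on-demand under these assumptions.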

Page 31: Talk at NCRR P41 Director's Meeting
Page 32: Talk at NCRR P41 Director's Meeting
Page 33: Talk at NCRR P41 Director's Meeting

infrastructure as code

Page 34: Talk at NCRR P41 Director's Meeting

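class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  # Public DNS name of the instance, or "" if it has none yet
  def public_dns
    @aws_hash[:dns_name] || ""
  end

  # Short host name, falling back to the instance status
  # (`status` is not defined in this excerpt)
  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end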

Page 35: Talk at NCRR P41 Director's Meeting

include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos","redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end

Page 36: Talk at NCRR P41 Director's Meeting

import time

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction

# set your aws keys and S3 bucket, e.g. from environment or .boto
AWSKEY = ""
SECRETKEY = ""
S3_BUCKET = ""
NUM_INSTANCES = 1

# Connect to Elastic MapReduce
conn = boto.connect_emr(AWSKEY, SECRETKEY)

# Install packages with a bootstrap action
bootstrap_step = BootstrapAction("download.tst",
                                 "s3://elasticmapreduce/bootstrap-actions/download.sh",
                                 None)

# Set up the mapper and reducer
step = StreamingStep(name='Wordcount',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(
    name="testbootstrap",
    log_uri="s3://" + S3_BUCKET + "/logs",
    steps=[step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES)

print("finished spawning job (note: starting still takes time)")

# Poll the job state every 30 seconds until the job flow stops
state = conn.describe_jobflow(jobid).state
print("job state = %s" % state)
print("job id = %s" % jobid)
while state not in (u'COMPLETED', u'FAILED', u'TERMINATED'):
    print(time.localtime())
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print("job state = %s" % state)
    print("job id = %s" % jobid)

print("final output can be found in s3://" + S3_BUCKET + "/output")
print("try: $ s3cmd sync s3://" + S3_BUCKET + "/output .")

Connect to Elastic MapReduce

Install packages

Set up mappers & reducers

job state
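The streaming step above pairs a Python mapper with Hadoop's built-in 'aggregate' reducer, which sums values for keys prefixed with LongValueSum. The sketch below shows that contract locally, without a cluster. It is not the wordSplitter.py sample itself: the mapper is a simplified stand-in, and the `aggregate` function only imitates what the reducer does for this one key type.

```python
from collections import defaultdict
import re

WORD_RE = re.compile(r"[a-zA-Z]+")

def map_line(line):
    """Mapper: emit one 'LongValueSum:<word><TAB>1' record per word in the line."""
    return ["LongValueSum:%s\t1" % word.lower() for word in WORD_RE.findall(line)]

def aggregate(records):
    """Local stand-in for the 'aggregate' reducer: sum the values for each key."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.split("\t")
        totals[key.split(":", 1)[1]] += int(value)
    return dict(totals)

records = []
for line in ["the cloud", "The data and the cloud"]:
    records.extend(map_line(line))
counts = aggregate(records)
print(counts)
```

On a real cluster the mapper would read lines from standard input and print records to standard output; returning lists here just makes the logic easy to check.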

Page 37: Talk at NCRR P41 Director's Meeting

a data science platform

Page 38: Talk at NCRR P41 Director's Meeting

dataspaces

Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data

Page 39: Talk at NCRR P41 Director's Meeting

accept all data formats

Page 40: Talk at NCRR P41 Director's Meeting

evolve APIs

Page 41: Talk at NCRR P41 Director's Meeting

beyond the database and the data warehouse

Page 42: Talk at NCRR P41 Director's Meeting

move compute to the data
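A back-of-the-envelope calculation shows why moving large datasets to the compute is costly. The sketch below estimates transfer time from dataset size and sustained link speed; the 1 TB dataset and 100 Mbit/s link are hypothetical inputs, and protocol overhead is ignored.

```python
def transfer_hours(size_bytes, link_megabits_per_second):
    """Hours needed to move size_bytes over a link at the given sustained rate."""
    bits = size_bytes * 8
    seconds = bits / (link_megabits_per_second * 1000000)
    return seconds / 3600

one_terabyte = 10 ** 12  # decimal terabyte
hours = transfer_hours(one_terabyte, 100)
print(round(hours, 1))
```

Under these assumptions a single terabyte takes most of a day to move, while a job launched next to the data can start immediately.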

Page 43: Talk at NCRR P41 Director's Meeting

data is a royal garden

Page 44: Talk at NCRR P41 Director's Meeting

compute is a fungible commodity

Page 45: Talk at NCRR P41 Director's Meeting

“I terminate the instance and relaunch it. Thats my error handling”
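The pattern in the quote can be sketched as a small retry loop: treat each instance as disposable, and on failure throw it away and start a fresh one. The code below uses stand-in functions instead of real AWS calls; `run_with_relaunch`, the fake launcher, and the instance IDs are all hypothetical.

```python
def run_with_relaunch(launch, work, terminate, max_attempts=3):
    """Run work(instance); on failure, terminate the instance and try a new one."""
    for attempt in range(1, max_attempts + 1):
        instance = launch()
        try:
            return work(instance)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
        finally:
            terminate(instance)  # instances are never repaired or reused

launched, terminated = [], []

def fake_launch():
    """Stand-in for starting an instance; returns a made-up ID."""
    launched.append("i-%d" % len(launched))
    return launched[-1]

def flaky_work(instance):
    """Fails on the first instance only, to exercise the relaunch path."""
    if instance == "i-0":
        raise RuntimeError("instance went bad")
    return "done on " + instance

result = run_with_relaunch(fake_launch, flaky_work, terminated.append)
print(result)
```

Nothing here tries to diagnose the broken instance, which is the point of the quote: when compute is a commodity, replacing it is cheaper than fixing it.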

Source: @jtimberman on Twitter

Page 46: Talk at NCRR P41 Director's Meeting

the cloud is an architectural and cultural fit for data science

Page 47: Talk at NCRR P41 Director's Meeting

amazon web services

Page 48: Talk at NCRR P41 Director's Meeting

your data science platform

Page 49: Talk at NCRR P41 Director's Meeting

s3://1000genomes
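Tools that work with public datasets like this one usually need the bucket name and object key separately. A minimal sketch of splitting an s3:// URI with the standard library; the function name and the example key are hypothetical, and no request is made to S3.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key); key is '' for a bare bucket."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")

print(split_s3_uri("s3://1000genomes"))
print(split_s3_uri("s3://1000genomes/example/key.txt"))  # hypothetical key
```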

Page 50: Talk at NCRR P41 Director's Meeting
Page 52: Talk at NCRR P41 Director's Meeting

Credit: Angel Pizarro, U. Penn

Page 53: Talk at NCRR P41 Director's Meeting

http://usegalaxy.org/cloud

Page 55: Talk at NCRR P41 Director's Meeting
Page 56: Talk at NCRR P41 Director's Meeting

AWS knows scalable infrastructure

Page 57: Talk at NCRR P41 Director's Meeting

you know the science

Page 58: Talk at NCRR P41 Director's Meeting

we can make this work together

Page 60: Talk at NCRR P41 Director's Meeting

Twitter: @mndoci

http://slideshare.net/mndoci
http://mndoci.com

Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig

Credit: Oberazzi, under a CC-BY-NC-SA license