Big Data Dive: Amazon EMR Processing


Page 1: Big data dive amazon emr processing

DATA PROCESSING WITH AMAZON ELASTIC MAPREDUCE, AMAZON AWS USE CASES

Sergey Sverchkov, Project Manager, Altoros Systems

[email protected]
Skype: sergey.sverchkov

Page 2: Big data dive amazon emr processing

AMAZON EMR – SCALABLE DATA PROCESSING SERVICE

Amazon EMR service = Amazon EC2 + Amazon S3 + Apache Hadoop
• Cost-effective
• Automated
• Scalable
• Easy to use

Page 3: Big data dive amazon emr processing

MAPREDUCE

• Simple data-parallel programming model designed for scalability and fault-tolerance

• Pioneered by Google – processes 20 petabytes of data per day
• Popularized by the open-source Hadoop project – used at Yahoo!, Facebook, Amazon, …
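
To make the model concrete, below is a minimal sketch of the classic word-count job as a pair of Hadoop Streaming scripts in Python. The file names mapper.py and reducer.py are placeholders for this illustration; the EMR sample shown later in this deck uses its own wordSplitter.py mapper.

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word; Hadoop delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair can be tested locally by piping a text file through the mapper, sort, and the reducer before submitting it to a cluster.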

Page 4: Big data dive amazon emr processing

AMAZON EC2 SERVICE

• Elastic – increase or decrease capacity within minutes, not hours or days
• Completely controlled
• Flexible – multiple instance types (CPU, memory, storage), operating systems, and software packages
• Reliable – 99.95% availability for each Amazon EC2 Region
• Secure – numerous mechanisms for securing your compute resources
• Inexpensive – Reserved Instances and Spot Instances
• Easy to start
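
Although this deck launches instances through EMR and the web console, an instance can also be started with a single API call. The sketch below uses boto3, which is not part of the deck; the AMI ID and key pair name are placeholders.

# Sketch: launch one EC2 instance with boto3 (AMI ID and key pair are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",     # placeholder AMI ID
    InstanceType="m1.xlarge",   # instance size of the era covered by this deck
    MinCount=1,
    MaxCount=1,
    KeyName="my-ec2-keypair",   # placeholder EC2 key pair
)
print(resp["Instances"][0]["InstanceId"])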

Page 5: Big data dive amazon emr processing

AMAZON S3 STORAGE

• Write, read, and delete objects containing from 1 byte to 5 terabytes of data
• Objects are stored in buckets
• Authentication mechanisms
• Options for secure data upload/download and encryption of data at rest
• Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
• Reduced Redundancy Storage (RRS)
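
The basic object operations map directly onto the S3 API. A minimal boto3 sketch is shown below; boto3 itself is an assumption (the deck uses s3cmd and the console), and the bucket name and key are placeholders.

# Sketch: write, read, and delete an S3 object (bucket name and key are placeholders).
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "samples/hello.txt"

s3.put_object(Bucket=bucket, Key=key, Body=b"hello, S3")      # write
data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()   # read
print(data)
s3.delete_object(Bucket=bucket, Key=key)                      # delete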

Page 6: Big data dive amazon emr processing

AMAZON EMR FEATURES

• Web-based interface and command-line tools for running Hadoop jobs on Amazon EC2

• Data stored in Amazon S3
• Monitors the job and shuts down machines after use
• Small extra charge on top of EC2 pricing
• Significantly reduces the complexity of the time-consuming set-up, management, and tuning of Hadoop clusters

Page 7: Big data dive amazon emr processing

GETTING STARTED – SIGN UP

• Sign up for Amazon EMR / AWS at http://aws.amazon.com

• You also need to be signed up for Amazon S3 and Amazon EC2
• Locate and save your AWS credentials:
  – AWS Access Key ID
  – AWS Secret Access Key
  – EC2 Key Pair
• Optionally install on your desktop:
  – EMR command-line client
  – S3 command-line client (s3cmd)
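
The saved Access Key ID and Secret Access Key are what the command-line clients and SDKs use to authenticate. Below is a minimal sketch of wiring them up programmatically with boto3 (an assumption, not part of the deck); the values shown are placeholders.

# Sketch: make the saved credentials available via the standard environment variables.
import os
import boto3

os.environ["AWS_ACCESS_KEY_ID"] = "<your AWS Access Key ID>"          # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your AWS Secret Access Key>"  # placeholder

session = boto3.Session(region_name="us-east-1")
# Sanity check: ask STS which account the credentials belong to.
print(session.client("sts").get_caller_identity()["Account"])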

Page 8: Big data dive amazon emr processing

GETTING STARTED – SECURITY, TOOLS

Page 9: Big data dive amazon emr processing

EMR JOB FLOW - BASIC STEPS

1. Upload input data to S3
2. Create a job flow by defining the Map and Reduce steps
3. Download output data from S3
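
The same three steps can also be driven through the EMR API. The sketch below uses boto3's run_job_flow call (an assumption; the deck itself uses the elastic-mapreduce CLI), with bucket names, instance types, and the release label as illustrative placeholders.

# Sketch: create a job flow with one Hadoop Streaming step (values are illustrative).
import boto3

emr = boto3.client("emr", region_name="us-east-1")
resp = emr.run_job_flow(
    Name="word count api test",
    ReleaseLabel="emr-5.36.0",                  # assumed release label
    LogUri="s3://my-example-bucket/logs/",      # placeholder bucket
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 2,
        "KeepJobFlowAliveWhenNoSteps": False,   # shut the cluster down after the step
    },
    Steps=[{
        "Name": "word count",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
                "-mapper", "wordSplitter.py",
                "-reducer", "aggregate",
                "-input", "s3://elasticmapreduce/samples/wordcount/input",
                "-output", "s3://my-example-bucket/wordcount/output",   # placeholder bucket
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Created job flow", resp["JobFlowId"])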

Page 10: Big data dive amazon emr processing

EMR WORD COUNT SAMPLE

Page 11: Big data dive amazon emr processing

WORD COUNT – INPUT DATA

• Word count input data size in the sample S3 bucket:
./s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856   s3://elasticmapreduce/samples/wordcount/input/

• Word count input data files:
./s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55   2392524   s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55   2396618   s3://elasticmapreduce/samples/wordcount/input/0002
2009-04-02 02:55   1593915   s3://elasticmapreduce/samples/wordcount/input/0003
2009-04-02 02:55   1720885   s3://elasticmapreduce/samples/wordcount/input/0004
2009-04-02 02:55   2216895   s3://elasticmapreduce/samples/wordcount/input/0005

Page 12: Big data dive amazon emr processing

EMR WORD COUNT SAMPLE

• Starting instances, bootstrapping, running job steps:

Page 13: Big data dive amazon emr processing

EMR WORD COUNT SAMPLE

• Start the word count sample job from the EMR command line:

$ ./elastic-mapreduce --create --name "word count commandline test" \
    --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://test.emr.bucket/wordcount/output2

• The output contains the job flow ID: Created job flow j-317IN1TUMRQ5B
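
The wordSplitter.py mapper itself is not reproduced in the deck. The sketch below is a hypothetical mapper written for the built-in "aggregate" reducer used above: each key is prefixed with the aggregator name (LongValueSum), so the reducer sums the counts per word.

#!/usr/bin/env python
# Sketch of a streaming mapper compatible with the built-in "aggregate" reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # The "LongValueSum:" prefix tells the aggregate reducer to sum the values.
        print("LongValueSum:%s\t%d" % (word, 1))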

Page 14: Big data dive amazon emr processing

WORD COUNT – OUTPUT DATA

• Locate and download the output data from the specified output S3 bucket:
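
Besides the console and s3cmd, the part-* result files can be fetched programmatically. Below is a minimal boto3 sketch (boto3 is an assumption; the bucket and prefix are taken from the example command above).

# Sketch: download every output object written under the job's output prefix.
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix = "test.emr.bucket", "wordcount/output2/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        name = os.path.basename(obj["Key"])
        if name:                                  # skip the "folder" placeholder key
            s3.download_file(bucket, obj["Key"], name)
            print("downloaded", obj["Key"])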

Page 15: Big data dive amazon emr processing

REAL-WORLD EXAMPLE - GENOTYPING

• Crossbow is a scalable, portable, and automatic Cloud Computing tool for finding SNPs from short read data.

• Crossbow is designed to be easy to run (a) in "the cloud" using Amazon's Elastic MapReduce service, (b) on any Hadoop cluster, or (c) on any single computer, without Hadoop.

• Open source and available to anyone:

http://bowtie-bio.sourceforge.net/crossbow/

Page 16: Big data dive amazon emr processing

SINGLE-NUCLEOTIDE POLYMORPHISM

• A single-nucleotide polymorphism (SNP, pronounced snip) is a DNA sequence variation occurring when a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in an individual.

Page 17: Big data dive amazon emr processing

SNP ANALYSIS IN AMAZON EMR

• Crossbow web interface:

http://bowtie-bio.sourceforge.net/crossbow/ui.html

Page 18: Big data dive amazon emr processing

SNP ANALYSIS – DATA IN AMAZON S3

• Data for SNP analysis is uploaded to an Amazon S3 bucket
• Output of the analysis is placed in S3

Page 19: Big data dive amazon emr processing

SNP ANALYSIS – INPUT / OUTPUT DATA

• Input data – a single FASTQ file, ~1.4 GB:

@E201_120801:4:1:1208:14983#ACAGTG/1
GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:14983#ACAGTG/1
gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@E201_120801:4:1:1208:6966#ACAGTG/1
GCTGGGATTACAGACACANGCCACCACANNTNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:6966#ACAGTG/1

• Output data – multiple files:

chr1   841900    G  A  3   A  68  2  2  G  0   0  0  2  0  1.00000  1.00000  1
chr1   922615    T  G  2   G  38  3  3  T  67  1  1  4  0  1.00000  1.00000  0
chr1   1011278   A  G  12  G  69  1  1  A  0   0  0  1  0  1.00000  1.00000  1
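
For reference, the input shown above is plain FASTQ: four lines per read (header, sequence, separator, qualities). A minimal Python sketch for iterating over such records follows; the file name is a placeholder.

# Sketch: iterate over FASTQ records (4 lines per read) and count them.
def read_fastq(path):
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:                 # end of file
                break
            sequence = handle.readline().rstrip()
            handle.readline()              # "+" separator line
            qualities = handle.readline().rstrip()
            yield header, sequence, qualities

if __name__ == "__main__":
    total = sum(1 for _ in read_fastq("reads.fastq"))   # placeholder file name
    print("reads:", total)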

Page 20: Big data dive amazon emr processing

SNP ANALYSIS - TIME

• Processing 1.4 GB on 1 EMR instance – 6 hours
• Processing 1.4 GB on 2 EMR instances – 4 hours
• Processing 1.4 GB on 4 EMR instances – 2.5 hours (roughly a 2.4× speed-up, so scaling is sub-linear)
• Haven't tried more instances…

Page 21: Big data dive amazon emr processing

MORE USE CASES FOR AMAZON AWS

Customer 1 – a successful migration from dedicated hosting to Amazon:

• 1 EC2 xlarge Linux instance (15 GB RAM, 4 cores, 64-bit) with 4 EBS volumes of 250 GB in the US West (Northern California) region
• Runs one heavy web site with >1K concurrent users
• Tomcat application server and Oracle SE 11.2
• Amazon Elastic IP for the web site
• Continuous Oracle backup to Amazon S3 through Oracle Secure Backup for S3
• And it costs the customer only …wow
• <2 days for the LIVE migration, done over a weekend

Page 22: Big data dive amazon emr processing

MORE USE CASES FOR AMAZON AWS

Customer 2 – a successful migration from Rackspace to Amazon:

• Rackspace hosting + service cost $..K, and the service level was very low. The Rackspace server capacity was fixed.
• Migrated to 1 Amazon 2xlarge EC2 Windows 2008 R2 instance (34.2 GB RAM, 4 virtual cores).
• >100 web sites for corporate customers.
• 2 EBS volumes of 1.5 TB.
• Amazon RDS for Oracle as the backend – a fully automated Oracle database with expandable storage.
• 200 GB of user data in RDS.
• Full LIVE migration completed in 48 hours, with a DNS name switch.
• And the budget is significantly lower!

Page 23: Big data dive amazon emr processing

THANK YOU. WELCOME TO THE DISCUSSION…

Sergey Sverchkov
[email protected]

skype: sergey.sverchkov