Cost effective BigData Processing on Amazon EC2

Big Data Cloud Meetup — Cost Effective Big-Data Processing using Amazon Elastic Map Reduce. Sujee Maniyam, [email protected] / www.sujee.net. July 08, 2011.

description

This is the talk I gave at Big Data Cloud Meetup on July 08, 2011. http://www.meetup.com/BigDataCloud/events/19869611/

Transcript of Cost effective BigData Processing on Amazon EC2

Page 1: Cost effective BigData Processing on Amazon EC2

Big Data Cloud Meetup

Cost Effective Big-Data Processing using Amazon Elastic Map Reduce

Sujee Maniyam

[email protected] / www.sujee.net

July 08, 2011

Page 2: Cost effective BigData Processing on Amazon EC2

Hi, I’m Sujee

• 10+ years of software development

– enterprise apps, web apps, iPhone apps, Hadoop

• More : http://sujee.net/tech

Page 3: Cost effective BigData Processing on Amazon EC2

I am an ‘expert’

Page 4: Cost effective BigData Processing on Amazon EC2

Quiz

• PRIZE!

• Where was this picture taken?

Page 5: Cost effective BigData Processing on Amazon EC2

Quiz : Where was this picture taken?

Page 6: Cost effective BigData Processing on Amazon EC2

Answer : Montara Light House

Page 7: Cost effective BigData Processing on Amazon EC2

Ah.. Data

Page 8: Cost effective BigData Processing on Amazon EC2

Nature of Data…

• Primary Data

– Email, blogs, pictures, tweets

– Critical for operation (Gmail can’t lose emails)

• Secondary data

– Wikipedia access logs, Google search logs

– Not ‘critical’, but used to ‘enhance’ user experience

– Search logs help predict ‘trends’

– Yelp can figure out you like Chinese food

Page 9: Cost effective BigData Processing on Amazon EC2

Data Explosion

• Primary data has grown phenomenally

• But secondary data has exploded in recent years

• “Log everything and ask questions later”

• Used for

– Recommendations (books, restaurants ..etc)

– Predict trends (job skills in demand)

– Show ADS ($$$)

– ..etc

• ‘Big Data’ is no longer just a problem for BigGuys (Google / Facebook)

• Startups are struggling to get on top of ‘big data’

Page 10: Cost effective BigData Processing on Amazon EC2

Big Guys

Page 11: Cost effective BigData Processing on Amazon EC2

Startups

Page 12: Cost effective BigData Processing on Amazon EC2

Startups and bigdata

Page 13: Cost effective BigData Processing on Amazon EC2

Hadoop to Rescue

• Hadoop can help with BigData

• Hadoop has been proven in the

field

• Under active development

• Throw hardware at the problem

– Getting cheaper by the year

• Bleeding edge technology

– Hire good people!

Page 14: Cost effective BigData Processing on Amazon EC2

Hadoop : It is a CAREER

Page 15: Cost effective BigData Processing on Amazon EC2

Data Spectrum

Page 16: Cost effective BigData Processing on Amazon EC2

Who is Using Hadoop?

• 60 PB; 40,000 machines running Hadoop; 4,000-node cluster (largest)

• 36 PB; 100 TB / day; 2,500-node cluster

• 7 TB / day; 40 PB; 50 TB / day; 1,000+ node clusters

• Many, many startups: multiple terabytes, 100s of GB / day, cluster sizes of 10-50

Page 17: Cost effective BigData Processing on Amazon EC2

About This Presentation

• Based on my experience with a startup

• 5 people (3 Engineers)

• Ad-Serving Space

• Amazon EC2 is our ‘data center’

• Technologies:

– Web stack: Python, Tornado, PHP, MySQL, LAMP

– Amazon EMR to crunch data

• Data size : 1 TB / week

Page 18: Cost effective BigData Processing on Amazon EC2

Story of a Startup…month-1

• Each web server writes logs locally

• Logs were copied to a log-server and purged from web servers

• Log data size: ~100-200 GB

Page 19: Cost effective BigData Processing on Amazon EC2

Story of a Startup…month-6

• More web servers come online

• Aggregate log server falls behind

Page 20: Cost effective BigData Processing on Amazon EC2

Data @ 6 months

• 2 TB of data already

• 50-100 GB of new data / day

• And we were operating at 20% of our capacity!

Page 21: Cost effective BigData Processing on Amazon EC2

Future…

Page 22: Cost effective BigData Processing on Amazon EC2

Solution?

• Scalable database (NoSQL)

– HBase

– Cassandra

• Hadoop log processing / Map Reduce

Page 23: Cost effective BigData Processing on Amazon EC2

What We Evaluated

• 1) HBase cluster

• 2) Hadoop cluster

• 3) Amazon EMR

Page 24: Cost effective BigData Processing on Amazon EC2

Hadoop on Amazon EC2

• 1) Permanent cluster

• 2) On-demand cluster (Elastic MapReduce)

Page 25: Cost effective BigData Processing on Amazon EC2

1) Permanent Hadoop Cluster

Page 26: Cost effective BigData Processing on Amazon EC2

Architecture 1

Page 27: Cost effective BigData Processing on Amazon EC2

Hadoop Cluster

• 7 c1.xlarge machines

• 15 TB of EBS volumes

• Sqoop exports MySQL log tables into HDFS

• Logs are compressed (gz) to minimize disk usage (data-locality trade-off)

• All is working well…

Page 28: Cost effective BigData Processing on Amazon EC2

2 months later

• A couple of EBS volumes DIE

• A couple of EC2 instances DIE

• Maintaining the Hadoop cluster is a mechanical, less appealing job

• COST!

– Our job utilization is about 50%

– But we still pay for machines running 24x7

Page 29: Cost effective BigData Processing on Amazon EC2

Lessons Learned

• c1.xlarge is pretty stable (8 cores / 8 GB memory)

• EBS volumes

– max size is 1 TB, so string a few together for higher density per node

– DON’T RAID them; let Hadoop handle them as individual disks

– They might fail

• Back up data on S3

• Or skip EBS: use instance-store disks, and store data in S3

• Use Apache Whirr to set up a cluster easily

Page 30: Cost effective BigData Processing on Amazon EC2

Amazon Storage Options

S3:

- highly available storage

- high-latency access time

- recommended uses: distributing media / video, backups, storing big data

- cost: 1 TB = $150 / month (compare: a 2 TB disk = $150)

EBS:

- in between S3 and ephemeral storage

- persistent storage

- attaches to instances like regular disks (/dev/sdx)

- access is over the network

- recommended uses: database storage, etc.

- max size: 1 TB

- these do fail, so snapshot backups are recommended

Ephemeral:

- the disk you get with the server instance

- free

- data is local (not over the network)

- fast access times

- size up to 1.5 TB

- recommended uses: cache, tmp data

- goes away when your server instance is terminated

Page 31: Cost effective BigData Processing on Amazon EC2

Amazon EC2 Cost

Page 32: Cost effective BigData Processing on Amazon EC2

Hadoop cluster on EC2 cost

• $3,500 = 7 c1.xlarge @ $500 / month

• $1,500 = 15 TB EBS storage @ $0.10 per GB

• $500 = EBS I/O requests @ $0.10 per 1 million I/O requests

• $5,500 / month

• ≈ $66,000 / year!
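The monthly figure above can be sanity-checked in a few lines (a rough sketch at the 2011 rates quoted; the annual total works out to $66,000):

```python
# Rough sanity check of the permanent-cluster cost figures above (2011 rates)
instance_cost = 7 * 500         # 7 c1.xlarge at ~$500 / month each
ebs_storage = 15 * 1000 * 0.10  # 15 TB of EBS at $0.10 per GB-month
ebs_io = 500                    # estimated EBS I/O request charges
monthly = instance_cost + ebs_storage + ebs_io
annual = 12 * monthly           # paid 24x7, regardless of ~50% utilization
```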

Page 33: Cost effective BigData Processing on Amazon EC2

Buy / Rent ?

• Typical hadoop machine cost : $10-15k

– 10 node cluster = $100k

– Plus data center costs

– Plus IT-ops costs

• Amazon EC2 10-node cluster:

– $500 × 10 = $5,000 / month = $60k / year

Page 34: Cost effective BigData Processing on Amazon EC2

Buy / Rent

• Amazon EC2 is great for

– Quickly getting started

• startups

– Scaling on demand / rapidly adding more servers

• popular social games

– the Netflix story

• Streaming is powered by EC2

• Encoding movies, etc.

• Uses 1000s of instances

• Not so economical for running clusters 24x7

• http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/

Page 35: Cost effective BigData Processing on Amazon EC2

Buy vs Rent

Page 36: Cost effective BigData Processing on Amazon EC2

Next : Amazon EMR

Page 37: Cost effective BigData Processing on Amazon EC2

Where was this picture taken?

Page 38: Cost effective BigData Processing on Amazon EC2

Answer : Pacifica Pier

Page 39: Cost effective BigData Processing on Amazon EC2

Amazon’s Elastic Map Reduce

• Basically an ‘on demand’ Hadoop cluster

• Store data on Amazon S3

• Kick off a Hadoop cluster to process the data

• Shut it down when done

• Pay only for the HOURS used

Page 40: Cost effective BigData Processing on Amazon EC2

Architecture2 : Amazon EMR

Page 41: Cost effective BigData Processing on Amazon EC2

Moving parts

• Logs go into Scribe

• The Scribe master ships logs into S3, gzipped

• Spin up an EMR cluster, run the job, done

• Using the same old Java MR jobs on EMR

• Summary data gets written directly to MySQL (no output files from reducers)

Page 42: Cost effective BigData Processing on Amazon EC2

EMR Wins

• Cost: only pay for use

• http://aws.amazon.com/elasticmapreduce/pricing/

• Example: EMR ran on 5 c1.xlarge for 3 hrs

– EC2 instances for 3 hrs = $0.68 per hr × 5 instances × 3 hrs = $10.20

– http://aws.amazon.com/elasticmapreduce/faqs/#billing-4

– (1 hour of c1.xlarge = 8 hours of normalized compute time)

– EMR cost = 5 instances × 3 hrs × 8 normalized hrs × $0.12 EMR rate = $14.40

– Plus S3 storage cost: 1 TB / month = $150

– Data bandwidth from S3 to EC2 is FREE!

– ≈ $25 total
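The billing example above can be reproduced directly (prices as quoted in 2011):

```python
# Reproducing the EMR billing example: 5 c1.xlarge instances for 3 hours
instances, hours = 5, 3
ec2_cost = instances * hours * 0.68   # c1.xlarge on-demand rate: $0.68 / hr
normalized = instances * hours * 8    # 1 c1.xlarge hour = 8 normalized hours
emr_fee = normalized * 0.12           # EMR surcharge per normalized hour
total = ec2_cost + emr_fee            # about $25 for the whole run
```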

Page 43: Cost effective BigData Processing on Amazon EC2

Design Wins

• Bidders now write logs to Scribe directly

– No MySQL at web server machines

– Writes are much faster!

• S3 has been reliable and cheap storage

Page 44: Cost effective BigData Processing on Amazon EC2

EMR Wins

• No Hadoop cluster to maintain → no failed nodes / disks

Page 45: Cost effective BigData Processing on Amazon EC2

EMR Wins

• Hadoop clusters can be of any size!

• Can have multiple Hadoop clusters

– smaller jobs → fewer machines

– memory-hungry tasks → m1.xlarge

– CPU-hungry tasks → c1.xlarge

Page 46: Cost effective BigData Processing on Amazon EC2

EMR trade-offs

• Lower performance on MR jobs compared to a dedicated cluster

– Reduced data throughput (S3 isn’t the same as local disk)

• Streaming data from S3, for each job

• EMR’s Hadoop is not the latest version

• Missing tools: Oozie

• Right now, trading performance for convenience and cost

Page 47: Cost effective BigData Processing on Amazon EC2

Lessons Learned

• Debugging a failed MR job is tricky

– Because the Hadoop cluster is terminated → no log files

• Save log files to S3

Page 48: Cost effective BigData Processing on Amazon EC2

Lessons: Script everything

• Scripts

– to launch jar EMR jobs

– with custom parameters depending on job needs (instance types, size of cluster, etc.)

– to monitor job progress

– to save logs for later inspection

– to report job status (finished / cancelled)

• https://github.com/sujee/amazon-emr-beyond-basics

Page 49: Cost effective BigData Processing on Amazon EC2

Sample Launch Script

#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo "$(date +%Y%m%d.%H%M%S) > $0 : starting...."
export t1=$(date +%s)

export JOBID=$(elastic-mapreduce --plain-output --create \
  --name "${JOBNAME}__${TIMESTAMP}" \
  --num-instances "$INSTANCES" \
  --master-instance-type "$MASTER_INSTANCE_TYPE" \
  --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
  --jar s3://my_bucket/jars/adp.jar \
  --main-class com.adpredictive.hadoop.mr.SiteStats4 \
  --arg s3://my_bucket/jars/sitestats4-prod.config \
  --log-uri s3://my_bucket/emr-logs/ \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh

Page 50: Cost effective BigData Processing on Amazon EC2

Lessons : tweak cluster for each job

Mapred-config-m1-xl.xml:

<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3000M</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>

Page 51: Cost effective BigData Processing on Amazon EC2

Saved Logs

Page 52: Cost effective BigData Processing on Amazon EC2

Sample Saved Log

Page 53: Cost effective BigData Processing on Amazon EC2

Map reduce tips : Control the amount of Input

• We get different types of events

– event A (freq: 10,000) >>> event B (100) >> event C (1)

• Initially we put them all into a single log file:

A A A A B A A B C

Page 54: Cost effective BigData Processing on Amazon EC2

Control Input…

• So we have to process the entire file, even if we are interested only in ‘event C’

→ too much wasted processing

• So we split the logs

– log_A….gz

– log_B….gz

– log_C…gz

• Now we process only a fraction of our logs

– Input: s3://my_bucket/logs/log_B*

– x-ref using memcache if needed
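The split itself can be sketched in a few lines of Python (a minimal sketch; the field layout and bucket names here are hypothetical):

```python
from collections import defaultdict

def split_by_event_type(lines):
    """Route each log line into a per-event-type bucket (first CSV field = type)."""
    buckets = defaultdict(list)
    for line in lines:
        event_type = line.split(",", 1)[0]
        buckets[event_type].append(line)
    return buckets

# each bucket would then be gzipped into its own file, e.g. log_C....gz
buckets = split_by_event_type(["A,x", "A,y", "B,x", "A,z", "C,x"])
```

With one file per type, a rare-event job reads only its own fraction of the input.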

Page 55: Cost effective BigData Processing on Amazon EC2

Map reduce tips: Data joining (x-ref)

• Data is split across log files, so we need to x-ref during the Map phase

• We used to load the data into the mapper’s memory (the data was small and in MySQL)

• Now we use Membase (Memcached)

• Two MR jobs are chained

– The first one processes logfile_type_A and populates Membase (very quick, takes minutes)

– The second one processes logfile_type_B, cross-referencing values from Membase
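A minimal sketch of that chained join, with a plain dict standing in for Membase (the record layout is hypothetical):

```python
def populate_cache(lines_a):
    # Job 1: scan type-A logs and publish id -> attributes into the cache
    cache = {}
    for line in lines_a:
        rec_id, attrs = line.split(":", 1)
        cache[rec_id] = attrs
    return cache

def enrich(lines_b, cache):
    # Job 2: cross-reference each type-B record against the cache
    return [line + "|" + cache.get(line.split(":", 1)[0], "?")
            for line in lines_b]

cache = populate_cache(["42:likes-chinese-food"])
enriched = enrich(["42:click"], cache)
```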

Page 56: Cost effective BigData Processing on Amazon EC2

X-ref

Page 57: Cost effective BigData Processing on Amazon EC2

Map reduce tips: Logfile format

• CSV → JSON

• Started with CSV

• CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL

• 20-40 fields… fragile, position-dependent, hard to code

– url = csv[18]… counting position numbers gets old after the 100th time around

– if (csv.length == 29) url = csv[28] else url = csv[26]

Page 58: Cost effective BigData Processing on Amazon EC2

Map reduce tips: Logfile format

• JSON:

{ exchange_id: 2, url: "http://housemdvideos.com/seasons/video.php?s=01&e=07" … }

• Self-describing, easy to add new fields, easy to process

– url = map.get(‘url’)

• Flatten JSON to fit on ONE LINE

• Compresses pretty well (not much data inflation)
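The difference is easy to see in code (field names taken from the sample above):

```python
import json

# one flattened JSON object per log line
line = '{"exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07"}'
record = json.loads(line)
url = record["url"]  # access by name -- no csv[18]-style position counting
```

Adding a new field to the log format leaves this consumer code unchanged, which is exactly what the positional CSV parsing could not offer.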

Page 59: Cost effective BigData Processing on Amazon EC2

Map reduce tips: Incremental Log Processing

• Recent data (today / yesterday / this week) is more relevant than older data (6 months +)

Page 60: Cost effective BigData Processing on Amazon EC2

Map reduce tips: Incremental Log Processing

• Adding a ‘time window’ to our stats

• Only process newer logs → faster
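A sketch of such a window filter (dates, window length, and file naming here are hypothetical):

```python
from datetime import date, timedelta

def in_window(log_date, days=7, today=date(2011, 7, 8)):
    """Keep only logs newer than the cutoff; older logs are skipped entirely."""
    return log_date >= today - timedelta(days=days)

logs = {date(2011, 7, 7): "log_B.20110707.gz",
        date(2011, 1, 1): "log_B.20110101.gz"}
recent = [name for d, name in logs.items() if in_window(d)]
```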

Page 61: Cost effective BigData Processing on Amazon EC2

Next Steps

Page 62: Cost effective BigData Processing on Amazon EC2

Where was this pic taken?

Page 63: Cost effective BigData Processing on Amazon EC2

Answer : Foster City

Page 64: Cost effective BigData Processing on Amazon EC2

Next steps: faster processing

• Streaming S3 data for each MR job is not optimal

– Spin up the cluster

– Copy data from S3 to HDFS

– Run all MR jobs (making use of data locality)

– Terminate
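The planned flow could be scripted roughly like this (a dry-run sketch that only builds the command list; bucket, jar, and job names are hypothetical, `hadoop distcp` is the standard bulk-copy tool):

```python
def plan_cluster_run(jobs):
    # Copy the input once from S3 into HDFS, then run every job against HDFS
    cmds = ["hadoop distcp s3n://my_bucket/logs/ hdfs:///input/"]
    for jar, main_class in jobs:
        cmds.append(
            f"hadoop jar {jar} {main_class} hdfs:///input/ hdfs:///out/{main_class}/")
    cmds.append("terminate-cluster")  # placeholder for the EMR shutdown step
    return cmds

cmds = plan_cluster_run([("adp.jar", "SiteStats4")])
```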

Page 65: Cost effective BigData Processing on Amazon EC2

Next Steps : More Processing

• More MR jobs

• More frequent data processing

– Frequent log rolls

– Smaller delta window (1 hr / 15 mins)

Page 66: Cost effective BigData Processing on Amazon EC2

Next steps : new software

• New software

– Pig, Python mrjob (from Yelp)

• Scribe → Cloudera Flume?

• Use workflow tools like Oozie

• Hive?

– Ad-hoc SQL-like queries

Page 67: Cost effective BigData Processing on Amazon EC2

Next Steps : SPOT instances

• SPOT instances: name your price (eBay style)

• Available on EC2 for a while

• Just became available for Elastic MapReduce!

• New cluster setup:

– 10 normal instances + 10 spot instances

– Spot instances may go away at any time

• That is fine! Hadoop will handle node failures

• Bigger cluster: cheaper & faster

Page 68: Cost effective BigData Processing on Amazon EC2

Example Price Comparison

Page 69: Cost effective BigData Processing on Amazon EC2

In summary…

• Amazon EMR could be a great solution

• We are happy!

Page 70: Cost effective BigData Processing on Amazon EC2

Take a test drive

• Just bring your credit card

• http://aws.amazon.com/elasticmapreduce/

• Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52

Page 71: Cost effective BigData Processing on Amazon EC2

Thanks

• Questions?

• Sujee Maniyam

– http://sujee.net

[email protected]

Devil’s Slide, Pacifica