(BDT208) A Technical Introduction to Amazon Elastic MapReduce


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Abhishek Sinha, Amazon Web Services

Gaurav Agrawal, AOL Inc

October 2015

BDT208

A Technical Introduction to Amazon EMR

What to Expect from the Session

• Technical introduction to Amazon EMR

• Basic tenets

• Amazon EMR feature set

• Real-life experience of moving a 2 PB on-premises Hadoop cluster to the AWS cloud

• Not a technical introduction to Apache Spark, Apache Hadoop, or other frameworks

Amazon EMR

• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize

Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud

What Do I Need to Build a Cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

An Example EMR Cluster

• Master node – r3.2xlarge: NameNode (HDFS), ResourceManager (YARN)
• Slave group, Core – c3.2xlarge: DataNode (HDFS), NodeManager (YARN)
• Slave group, Task – m3.xlarge
• Slave group, Task – m3.2xlarge (EC2 Spot)
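For reference, the same example cluster can be written out as the instance-group list an EMR provisioning call would take (boto3 syntax); the counts and the Spot bid price below are illustrative assumptions, since the slide only shows instance types.

# The example cluster above as EMR instance groups (boto3 run_job_flow form).
# Counts and the Spot bid are illustrative; only the instance types come
# from the slide.
example_instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER",
     "InstanceType": "r3.2xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",
     "InstanceType": "c3.2xlarge", "InstanceCount": 4},
    {"Name": "Task-OnDemand", "InstanceRole": "TASK",
     "InstanceType": "m3.xlarge", "InstanceCount": 2},
    {"Name": "Task-Spot", "InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m3.2xlarge", "InstanceCount": 4, "BidPrice": "0.10"},
]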

Choice of Multiple Instances

• CPU – c3 family, cc1.4xlarge, cc2.8xlarge: machine learning
• Memory – m2 family, r3 family: in-memory (Spark & Presto)
• Disk/IO – d2 family, i2 family: large HDFS
• General – m1 family, m3 family: batch processing

Select an Instance

Choose Your Software (Quick Bundles)

Choose Your Software – Custom

Hadoop Applications Available in Amazon EMR

Choose Security and Access Control

You Are Up and Running!

• Master node DNS
• Information about the software you are running, logs, and features
• Infrastructure for this cluster
• Security groups and roles

Use the CLI

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK

Programmatic Access to Cluster Provisioning
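As an illustration of the "use your favorite SDK" point, here is a minimal sketch with the Python SDK (boto3) that mirrors the CLI example above; the cluster name, applications, and default roles are assumptions, not values from the talk.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="bdt208-demo",                       # hypothetical cluster name
    ReleaseLabel="emr-4.0.0",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running; omit for transient
    },
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles assumed
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])                  # e.g. j-XXXXXXXXXXXX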

Now that I have a cluster, I need to process some data

Amazon EMR can process data from multiple sources:

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis


In an On-Premises Environment

Compute and storage are tightly coupled: storage grows along with compute, even though compute requirements vary.

Underutilized or Scarce Resources

[Chart: provisioned capacity vs. actual utilization over time – a steady state with weekly peaks and re-processing spikes, leaving large amounts of underutilized capacity.]

Contention for the Same Resources

Compute-bound and memory-bound workloads compete on the same cluster.

Separation of Resources Creates Data Silos

Giving each team its own cluster creates data silos.

Replication Adds to Cost

3x HDFS replication in a single datacenter.

So how does Amazon EMR solve these problems?

Decouple Storage and Compute

Amazon S3 is Your Persistent Data Store

• 11 9s of durability
• $0.03 per GB-month in US East
• Lifecycle policies
• Versioning
• Distributed by default

The Amazon EMR File System (EMRFS)

• Lets you use Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediate data
• Better read/write performance and error handling than the open-source components
• Consistent view – read-after-write consistency
• Support for encryption
• Fast listing of objects

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- "input.regex" elided on the slide
)
LOCATION 'samples/pig-apache/input/'

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- "input.regex" elided on the slide
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/'

Benefit 1: Switch Off Clusters

Auto-terminate clusters when the work is done; the data persists in Amazon S3.

You Can Build a Pipeline (Or You Can Use AWS Data Pipeline)

Sample pipeline: input data → use Amazon EMR to transform unstructured data to structured → push to Amazon S3 → ingest into Amazon Redshift
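A sketch of the "transform with Amazon EMR, land in Amazon S3" leg of that pipeline, submitted as a step to an existing cluster; the cluster ID, script, and bucket paths are placeholders, and command-runner.jar assumes an emr-4.x release. The Redshift ingest would then be a COPY from the structured S3 prefix, or the whole chain can be expressed as an AWS Data Pipeline definition as the slide suggests.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "transform-raw-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",      # available on emr-4.x releases
            "Args": [
                "spark-submit",
                "s3://my-bucket/code/transform.py",  # hypothetical script
                "s3://my-bucket/raw/",               # unstructured input
                "s3://my-bucket/structured/",        # output for Redshift COPY
            ],
        },
    }],
)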

Run Transient or Long-Running Clusters

Benefit 2: Resize Your Cluster

Scale up, scale down, stop a resize, or issue another resize.
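What a resize looks like programmatically – a minimal sketch, assuming a running cluster whose TASK group is grown to ten nodes (IDs are placeholders).

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the TASK instance group of the (hypothetical) cluster.
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXX")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Set a new target count; EMR adds or removes nodes to match it.
emr.modify_instance_groups(
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}]
)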

How Do You Scale Up and Save Cost?

[Chart: Spot instance price over time relative to the bid price and the On-Demand (OD) price.]

Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

The Spot Bid Advisor

Spot Integration with Amazon EMR

• Can provision instances from the Spot market
• Replaces a Spot instance in case of interruption
• Impact of interruption:
  • Master node – can lose the cluster
  • Core node – can lose intermediate data
  • Task nodes – jobs will restart on other nodes (application dependent)

Scale Up with Spot Instances

10-node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140

Resize Nodes with Spot Instances

Add 10 more nodes on Spot:

20-node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total = $105

50% less run-time (14 hours → 7 hours)
25% less cost ($140 → $105)
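The same arithmetic as a quick check – a sketch assuming the $1.00/hr On-Demand and $0.50/hr Spot rates implied by the slide's totals.

# Quick check of the figures above, assuming $1.00/hr On-Demand and
# $0.50/hr Spot (the rates implied by the slide's totals).
on_demand_only = 1.00 * 10 * 14               # 10 nodes, 14 hours -> $140
mixed_fleet = 1.00 * 10 * 7 + 0.50 * 10 * 7   # 20 nodes, 7 hours  -> $105
print(on_demand_only, mixed_fleet)            # 140.0 105.0
print(1 - 7 / 14)                             # 0.5  -> 50% less run-time
print(1 - mixed_fleet / on_demand_only)       # 0.25 -> 25% less cost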

Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

1,500 to 2,000 clusters
6,000 jobs

for each instance_type in (Availability Zone, Region) {
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        maxCpuPerUnitPrice = cpuPerUnitPrice
        optimalInstanceType = instance_type
    }
}

Source: GitHub – bloomreach/briefly
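A runnable sketch of the same heuristic, using current Spot prices from the EC2 API; the vCPU table and candidate list are assumptions for illustration and are not part of the Bloomreach tooling.

# Pick the instance type with the most vCPUs per Spot dollar right now.
# The vCPU table and candidate list below are illustrative assumptions.
from datetime import datetime, timezone
import boto3

VCPUS = {"m3.xlarge": 4, "m3.2xlarge": 8, "c3.2xlarge": 8}

ec2 = boto3.client("ec2", region_name="us-east-1")

best_type, best_cpu_per_dollar = None, 0.0
for instance_type, cores in VCPUS.items():
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),   # current prices only
    )["SpotPriceHistory"]
    if not history:
        continue
    price = min(float(h["SpotPrice"]) for h in history)  # cheapest AZ
    if cores / price > best_cpu_per_dollar:
        best_type, best_cpu_per_dollar = instance_type, cores / price

print(best_type, round(best_cpu_per_dollar, 1), "vCPUs per $/hr")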

Intelligent Scale Down

Intelligent Scale Down: HDFS

Effectively Utilize Clusters

[Chart: with resizing, provisioned capacity tracks actual utilization over time instead of staying flat.]

Benefit 3: Logical Separation of Jobs

Run separate clusters over the same data in Amazon S3 – for example, Hive, Pig, and Cascading for production jobs and Presto for ad hoc queries.

Benefit 4: Disaster Recovery Built In

Multiple clusters in different Availability Zones can all work against the same data in Amazon S3.

Amazon S3 as a Data Lake

Nate Sammons, Principal Architect, NASDAQ (reference: AWS Big Data Blog)

Re-cap

• Rapid provisioning of clusters
• Hadoop, Spark, Presto, and other applications
• Standard open-source packaging
• Decouple storage and compute and scale them independently
• Resize clusters to manage demand
• Save costs with Spot instances

How AOL Inc. moved a 2 PB Hadoop cluster to the AWS cloud

Gaurav Agrawal

Senior Software Engineer, AOL Inc.

AWS Certified Associate Solutions Architect

AOL Data Platforms Architecture 2014

Data Stats & Insights

• Cluster size: 2 PB
• In-house cluster: 100 nodes
• Raw data/day: 2–3 TB
• Data retention: 13–24 months

Challenges with In-House Infrastructure

• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve
• Static: not scalable
• Outages impact production
• Upgrades
• Storage and compute coupled

AOL Data Platforms Architecture 2015


Migration

• Web Console vs. CLI

Web Console and CLI

• Web console for training
• Set up IAM for users
• AWS service options
• S3 data upload
• EMR creation & steps
• Try & test multiple approaches
• The CLI is your friend!

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

Copy Existing Data to S3

• Environment-level buckets – Dev, QA, Production, Analyst: bucket-dev, bucket-qa, bucket-prod, bucket-analyst
• Project-level buckets – Code, Data, Log, Extract, and Control: bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
• Recompressed Snappy data to GZIP: multi-platform support, best compression, lowest storage cost, low cost for data out
• Result: 76% less storage, 70K saving/year
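One way to perform a Snappy-to-GZIP recompression during the copy is S3DistCp submitted as an EMR step; a sketch with placeholder paths and cluster ID. command-runner.jar assumes an emr-4.x release label, whereas the AOL clusters above ran AMI 3.x, where the S3DistCp JAR is referenced directly.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "hdfs-to-s3-gzip",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///data/events/",            # placeholder path
                "--dest", "s3://bucket-prod-data/events/",
                "--outputCodec", "gz",        # recompress output as gzip
            ],
        },
    }],
)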

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

EMR Design Options

• Transient vs. persistent cluster
• Amazon S3 vs. local HDFS
• Elastic vs. static cluster
• On-Demand vs. Reserved vs. Spot
• Core nodes vs. task nodes

AOL Data Platforms Architecture 2015

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission - CLI

EMR Jobs Submission - CLI

• In-house scheduler with common utilities:
  • Provision EMR
  • Push/pull data to S3
  • Job submission to the scheduler
  • Database load
• JSON files for applications, steps, bootstrap actions, EC2 attributes, and instance groups
• Future: event-driven design with Lambda and SQS

EMR Jobs Submission - CLI

aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
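The talk does not show the contents of those JSON files; as an illustration, the instance-groups file might be generated like this, following the documented --instance-groups JSON format. Types, counts, and the bid price are made up.

import json

# Illustrative contents only; the real omni_awssot files are not shown.
instance_groups = [
    {"Name": "Master", "InstanceGroupType": "MASTER",
     "InstanceType": "m3.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceGroupType": "CORE",
     "InstanceType": "m3.xlarge", "InstanceCount": 4},
    {"Name": "Task-Spot", "InstanceGroupType": "TASK",
     "InstanceType": "m3.2xlarge", "InstanceCount": 20,
     "BidPrice": "0.10"},                     # Spot bid in USD/hour
]

with open("omni_awssot.generic.instance_groups.json", "w") as f:
    json.dump(instance_groups, f, indent=2)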

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

Monitoring

EMR WatchDog (Node.js) watches for:
• Duplicate clusters
• Failed clusters
• Long-running clusters
• Long-provisioning clusters

CloudWatch alarms:
• Monthly billing
• S3 bucket size

SNS email notifications (Amazon CloudWatch + Amazon SNS)
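A sketch of one alarm of the kind listed above: notify an SNS topic when a cluster has sat idle for an hour, a proxy for long-running clusters that were never terminated. The cluster ID and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-idle-j-XXXXXXXXXXXX",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",                      # 1 when the cluster has no work
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXX"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=12,                     # 12 x 5 min = 1 hour idle
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],  # placeholder
)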

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

Elasticity

Why be Elastic?

[Charts: core-node demand over 24 hours – daily processes on 09/05/2015; no clusters followed by a spike in demand on 09/20/2015; and a major restatement on 06/01/2015 with demand for more than 10,000 EC2 instances.]

• True cloud architecture
• Spot is an open market
• Scale horizontally
• Our limit: 3,000 EC2 instances per region
• Multiple regions
• Multiple instance types

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

Cost Management & BCDR

• Multi-region deployment
• Best AZ for pricing
• Design for failure
• Global. BC-DR.

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

• Optimization

Optimization: Data Management

• Partition data on S3
• S3 versioning/lifecycle

How many nodes?

• Based on data volume
• Fill the complete hour (pricing is per hour)

Hadoop run-time parameters

• Memory tuning
• Compress map & reduce output
• Combine splits (input format)
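The "compress map & reduce output" tuning could be expressed as an EMR configuration block (emr-4.x and later) rather than a bootstrap action; a sketch, with an illustrative codec choice.

# Passed as the Configurations parameter of run_job_flow (or the CLI's
# --configurations option) on emr-4.x and later; codec choice is illustrative.
configurations = [{
    "Classification": "mapred-site",
    "Properties": {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec":
            "org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec":
            "org.apache.hadoop.io.compress.GzipCodec",
    },
}]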

Security

Score Card

Feature – AWS
• Pay for what you use ✔
• Decouple storage and compute ✔
• True cloud architecture ✔
• Self-service model ✔
• Elastic & scalable ✔
• Global infrastructure. BCDR. ✔
• Quick & easy deployments ✔
• Redshift external tables on S3 – ?
• More languages for Lambda – ?

AWS vs. In-House Cost

[Chart: cost comparison per month – in-house is roughly 4x the AWS cost. Source: AOL & AWS billing tool. The in-house figure includes storage, power, and network cost.]

Running on AWS costs roughly 1/4 of the in-house Hadoop infrastructure. (10/8/2015, Data Platforms, AOL Inc.)

[Chart: core-node demand over 24 hours on 06/01/2015.]

Restatement Use Case

• Restate historical data going back 6 months
• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances

[Chart: timing comparison, in-house vs. AWS.]

Best Practices & Suggestions

• Tag all resources
• Infrastructure as code – command line interface
• JSON as configuration files
• IAM roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management and S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• Spot EC2 instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally

Remember to complete your evaluations!

Thank you!

Photo Credits

• Keyboard: http://bit.ly/1LRQMdR
• Compression: http://bit.ly/1MtT3Pa
• Optimization: http://bit.ly/1FlidQD
• WatchDog: http://bit.ly/1OX50j6
• Elasticity: http://bit.ly/1YFfCr4
• Fish Bowl: http://bit.ly/1VjrcJd
• Blank Cheque: http://bit.ly/1RkTgGe