(BDT208) A Technical Introduction to Amazon Elastic MapReduce


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Abhishek Sinha, Amazon Web Services

Gaurav Agrawal, AOL Inc

October 2015

BDT208

A Technical Introduction to Amazon EMR

What to Expect from the Session

• Technical introduction to Amazon EMR

• Basic tenets

• Amazon EMR feature set

• Real-life experience of moving a 2 PB on-premises Hadoop cluster to the AWS cloud

• Not a technical introduction to Apache Spark, Apache Hadoop, or other frameworks

Amazon EMR

• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize

Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud

What Do I Need to Build a Cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

An Example EMR Cluster

• Master node – r3.2xlarge: NameNode (HDFS), ResourceManager (YARN)
• Slave group, Core – c3.2xlarge: DataNode (HDFS), NodeManager (YARN)
• Slave group, Task – m3.xlarge
• Slave group, Task – m3.2xlarge (EC2 Spot)
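For reference, the same example cluster can be written out as the instance-group list an EMR provisioning call would take (boto3 syntax); the counts and the Spot bid price below are illustrative assumptions, since the slide only shows instance types.

# The example cluster above as EMR instance groups (boto3 run_job_flow form).
# Counts and the Spot bid are illustrative; only the instance types come
# from the slide.
example_instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER",
     "InstanceType": "r3.2xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",
     "InstanceType": "c3.2xlarge", "InstanceCount": 4},
    {"Name": "Task-OnDemand", "InstanceRole": "TASK",
     "InstanceType": "m3.xlarge", "InstanceCount": 2},
    {"Name": "Task-Spot", "InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m3.2xlarge", "InstanceCount": 4, "BidPrice": "0.10"},
]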

Choice of Multiple Instances

• CPU – c3 family, cc1.4xlarge, cc2.8xlarge: machine learning
• Memory – m2 family, r3 family: in-memory (Spark & Presto)
• Disk/IO – d2 family, i2 family: large HDFS
• General – m1 family, m3 family: batch processing

Select an Instance

Choose Your Software (Quick Bundles)

Choose Your Software – Custom

Hadoop Applications Available in Amazon EMR

Choose Security and Access Control

You Are Up and Running!

• Master node DNS
• Information about the software you are running, logs, and features
• Infrastructure for this cluster
• Security groups and roles

Use the CLI

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK

Programmatic Access to Cluster Provisioning
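As an illustration of the "use your favorite SDK" point, here is a minimal sketch with the Python SDK (boto3) that mirrors the CLI example above; the cluster name, applications, and default roles are assumptions, not values from the talk.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="bdt208-demo",                       # hypothetical cluster name
    ReleaseLabel="emr-4.0.0",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running; omit for transient
    },
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles assumed
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])                  # e.g. j-XXXXXXXXXXXX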

Now that I have a cluster, I need to process some data

Amazon EMR can process data from multiple sources:

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis


In an On-Premises Environment

Compute and storage are tightly coupled: storage grows along with compute, even though compute requirements vary.

Underutilized or Scarce Resources

[Chart: provisioned capacity vs. actual utilization over time – a steady state with weekly peaks and re-processing spikes, leaving large amounts of underutilized capacity.]

Contention for the Same Resources

Compute-bound and memory-bound workloads compete on the same cluster.

Separation of Resources Creates Data Silos

Giving each team its own cluster creates data silos.

Replication Adds to Cost

3x HDFS replication in a single datacenter.

So how does Amazon EMR solve these problems?

Decouple Storage and Compute

Amazon S3 is Your Persistent Data Store

• 11 9s of durability
• $0.03 per GB-month in US East
• Lifecycle policies
• Versioning
• Distributed by default

The Amazon EMR File System (EMRFS)

• Lets you use Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediate data
• Better read/write performance and error handling than the open-source components
• Consistent view – read-after-write consistency
• Support for encryption
• Fast listing of objects

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- "input.regex" elided on the slide
)
LOCATION 'samples/pig-apache/input/'

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- "input.regex" elided on the slide
)
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/'

Benefit 1: Switch Off Clusters

Auto-terminate clusters when the work is done; the data persists in Amazon S3.

You Can Build a Pipeline (Or You Can Use AWS Data Pipeline)

Sample pipeline: input data → use Amazon EMR to transform unstructured data to structured → push to Amazon S3 → ingest into Amazon Redshift
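A sketch of the "transform with Amazon EMR, land in Amazon S3" leg of that pipeline, submitted as a step to an existing cluster; the cluster ID, script, and bucket paths are placeholders, and command-runner.jar assumes an emr-4.x release. The Redshift ingest would then be a COPY from the structured S3 prefix, or the whole chain can be expressed as an AWS Data Pipeline definition as the slide suggests.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "transform-raw-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",      # available on emr-4.x releases
            "Args": [
                "spark-submit",
                "s3://my-bucket/code/transform.py",  # hypothetical script
                "s3://my-bucket/raw/",               # unstructured input
                "s3://my-bucket/structured/",        # output for Redshift COPY
            ],
        },
    }],
)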

Run Transient or Long-Running Clusters

Benefit 2: Resize Your Cluster

Scale up, scale down, stop a resize, or issue another resize.
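What a resize looks like programmatically – a minimal sketch, assuming a running cluster whose TASK group is grown to ten nodes (IDs are placeholders).

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the TASK instance group of the (hypothetical) cluster.
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXX")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Set a new target count; EMR adds or removes nodes to match it.
emr.modify_instance_groups(
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}]
)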

How Do You Scale Up and Save Cost?

[Chart: Spot instance price over time relative to the bid price and the On-Demand (OD) price.]

Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

The Spot Bid Advisor

Spot Integration with Amazon EMR

• Can provision instances from the Spot market
• Replaces a Spot instance in case of interruption
• Impact of interruption:
  • Master node – can lose the cluster
  • Core node – can lose intermediate data
  • Task nodes – jobs will restart on other nodes (application dependent)

Scale Up with Spot Instances

10-node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140

Resize Nodes with Spot Instances

Add 10 more nodes on Spot:

20-node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total = $105

50% less run-time (14 hours → 7 hours)
25% less cost ($140 → $105)
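The same arithmetic as a quick check – a sketch assuming the $1.00/hr On-Demand and $0.50/hr Spot rates implied by the slide's totals.

# Quick check of the figures above, assuming $1.00/hr On-Demand and
# $0.50/hr Spot (the rates implied by the slide's totals).
on_demand_only = 1.00 * 10 * 14               # 10 nodes, 14 hours -> $140
mixed_fleet = 1.00 * 10 * 7 + 0.50 * 10 * 7   # 20 nodes, 7 hours  -> $105
print(on_demand_only, mixed_fleet)            # 140.0 105.0
print(1 - 7 / 14)                             # 0.5  -> 50% less run-time
print(1 - mixed_fleet / on_demand_only)       # 0.25 -> 25% less cost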

Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

1,500 to 2,000 clusters
6,000 jobs

for each instance_type in (Availability Zone, Region) {
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        maxCpuPerUnitPrice = cpuPerUnitPrice
        optimalInstanceType = instance_type
    }
}

Source: GitHub – bloomreach/briefly
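A runnable sketch of the same heuristic, using current Spot prices from the EC2 API; the vCPU table and candidate list are assumptions for illustration and are not part of the Bloomreach tooling.

# Pick the instance type with the most vCPUs per Spot dollar right now.
# The vCPU table and candidate list below are illustrative assumptions.
from datetime import datetime, timezone
import boto3

VCPUS = {"m3.xlarge": 4, "m3.2xlarge": 8, "c3.2xlarge": 8}

ec2 = boto3.client("ec2", region_name="us-east-1")

best_type, best_cpu_per_dollar = None, 0.0
for instance_type, cores in VCPUS.items():
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),   # current prices only
    )["SpotPriceHistory"]
    if not history:
        continue
    price = min(float(h["SpotPrice"]) for h in history)  # cheapest AZ
    if cores / price > best_cpu_per_dollar:
        best_type, best_cpu_per_dollar = instance_type, cores / price

print(best_type, round(best_cpu_per_dollar, 1), "vCPUs per $/hr")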

Intelligent Scale Down

Intelligent Scale Down: HDFS

Effectively Utilize Clusters

[Chart: with resizing, provisioned capacity tracks actual utilization over time instead of staying flat.]

Benefit 3: Logical Separation of Jobs

Run separate clusters over the same data in Amazon S3 – for example, Hive, Pig, and Cascading for production jobs and Presto for ad hoc queries.

Benefit 4: Disaster Recovery Built In

Multiple clusters in different Availability Zones can all work against the same data in Amazon S3.

Amazon S3 as a Data Lake

Nate Sammons, Principal Architect, NASDAQ (reference: AWS Big Data Blog)

Re-cap

• Rapid provisioning of clusters
• Hadoop, Spark, Presto, and other applications
• Standard open-source packaging
• Decouple storage and compute and scale them independently
• Resize clusters to manage demand
• Save costs with Spot instances

How AOL Inc. moved a 2 PB Hadoop cluster to the AWS cloud

Gaurav Agrawal

Senior Software Engineer, AOL Inc.

AWS Certified Associate Solutions Architect

AOL Data Platforms Architecture 2014

Data Stats & Insights

• Cluster size: 2 PB
• In-house cluster: 100 nodes
• Raw data/day: 2–3 TB
• Data retention: 13–24 months

Challenges with In-House Infrastructure

• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve
• Static: not scalable
• Outages impact production
• Upgrades
• Storage and compute coupled

AOL Data Platforms Architecture 2015


Migration

• Web Console vs. CLI

Web Console and CLI

• Web console for training
• Set up IAM for users
• AWS service options
• S3 data upload
• EMR creation & steps
• Try & test multiple approaches
• The CLI is your friend!

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

Copy Existing Data to S3

• Environment-level buckets – Dev, QA, Production, Analyst: bucket-dev, bucket-qa, bucket-prod, bucket-analyst
• Project-level buckets – Code, Data, Log, Extract, and Control: bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
• Recompressed Snappy data to GZIP: multi-platform support, best compression, lowest storage cost, low cost for data out
• Result: 76% less storage, 70K saving/year
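One way to perform a Snappy-to-GZIP recompression during the copy is S3DistCp submitted as an EMR step; a sketch with placeholder paths and cluster ID. command-runner.jar assumes an emr-4.x release label, whereas the AOL clusters above ran AMI 3.x, where the S3DistCp JAR is referenced directly.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "hdfs-to-s3-gzip",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///data/events/",            # placeholder path
                "--dest", "s3://bucket-prod-data/events/",
                "--outputCodec", "gz",        # recompress output as gzip
            ],
        },
    }],
)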

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

EMR Design Options

• Transient vs. persistent cluster
• Amazon S3 vs. local HDFS
• Elastic vs. static cluster
• On-Demand vs. Reserved vs. Spot
• Core nodes vs. task nodes

AOL Data Platforms Architecture 2015

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission - CLI

EMR Jobs Submission - CLI

• In-house scheduler with common utilities:
  • Provision EMR
  • Push/pull data to S3
  • Job submission to the scheduler
  • Database load
• JSON files for applications, steps, bootstrap actions, EC2 attributes, and instance groups
• Future: event-driven design with Lambda and SQS

EMR Jobs Submission - CLI

aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
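The talk does not show the contents of those JSON files; as an illustration, the instance-groups file might be generated like this, following the documented --instance-groups JSON format. Types, counts, and the bid price are made up.

import json

# Illustrative contents only; the real omni_awssot files are not shown.
instance_groups = [
    {"Name": "Master", "InstanceGroupType": "MASTER",
     "InstanceType": "m3.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceGroupType": "CORE",
     "InstanceType": "m3.xlarge", "InstanceCount": 4},
    {"Name": "Task-Spot", "InstanceGroupType": "TASK",
     "InstanceType": "m3.2xlarge", "InstanceCount": 20,
     "BidPrice": "0.10"},                     # Spot bid in USD/hour
]

with open("omni_awssot.generic.instance_groups.json", "w") as f:
    json.dump(instance_groups, f, indent=2)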

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

Monitoring

EMR WatchDog (Node.js) watches for:
• Duplicate clusters
• Failed clusters
• Long-running clusters
• Long-provisioning clusters

CloudWatch alarms:
• Monthly billing
• S3 bucket size

SNS email notifications (Amazon CloudWatch + Amazon SNS)
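A sketch of one alarm of the kind listed above: notify an SNS topic when a cluster has sat idle for an hour, a proxy for long-running clusters that were never terminated. The cluster ID and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-idle-j-XXXXXXXXXXXX",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",                      # 1 when the cluster has no work
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXX"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=12,                     # 12 x 5 min = 1 hour idle
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],  # placeholder
)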

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

Elasticity

Why be Elastic?

[Charts: core-node demand over 24 hours – daily processes on 09/05/2015; no clusters followed by a spike in demand on 09/20/2015; and a major restatement on 06/01/2015 with demand for more than 10,000 EC2 instances.]

• True cloud architecture
• Spot is an open market
• Scale horizontally
• Our limit: 3,000 EC2 instances per region
• Multiple regions
• Multiple instance types

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

Cost Management & BCDR

• Multi-region deployment
• Best AZ for pricing
• Design for failure
• Global. BC-DR.

Migration

• Web Console vs. CLI

• Copy Existing Data to S3

• EMR Design options

• EMR Jobs Submission – CLI

• Monitoring

• Elasticity

• Cost Management & BCDR

• Optimization

Optimization: Data Management

• Partition data on S3
• S3 versioning/lifecycle

How many nodes?

• Based on data volume
• Fill the complete hour (pricing is per hour)

Hadoop run-time parameters

• Memory tuning
• Compress map & reduce output
• Combine splits (input format)
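The "compress map & reduce output" tuning could be expressed as an EMR configuration block (emr-4.x and later) rather than a bootstrap action; a sketch, with an illustrative codec choice.

# Passed as the Configurations parameter of run_job_flow (or the CLI's
# --configurations option) on emr-4.x and later; codec choice is illustrative.
configurations = [{
    "Classification": "mapred-site",
    "Properties": {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec":
            "org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec":
            "org.apache.hadoop.io.compress.GzipCodec",
    },
}]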

Security

Score Card

Feature – AWS
• Pay for what you use ✔
• Decouple storage and compute ✔
• True cloud architecture ✔
• Self-service model ✔
• Elastic & scalable ✔
• Global infrastructure. BCDR. ✔
• Quick & easy deployments ✔
• Redshift external tables on S3 – ?
• More languages for Lambda – ?

AWS vs. In-House Cost

[Chart: cost comparison per month – in-house is roughly 4x the AWS cost. Source: AOL & AWS billing tool. The in-house figure includes storage, power, and network cost.]

Running on AWS costs roughly 1/4 of the in-house Hadoop infrastructure. (10/8/2015, Data Platforms, AOL Inc.)

[Chart: core-node demand over 24 hours on 06/01/2015.]

Restatement Use Case

• Restate historical data going back 6 months
• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances

[Chart: timing comparison, in-house vs. AWS.]

Best Practices & Suggestions

• Tag all resources
• Infrastructure as code – command line interface
• JSON as configuration files
• IAM roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management and S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• Spot EC2 instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally

Remember to complete your evaluations!

Thank you!

Photo Credits

• Keyboard: http://bit.ly/1LRQMdR
• Compression: http://bit.ly/1MtT3Pa
• Optimization: http://bit.ly/1FlidQD
• WatchDog: http://bit.ly/1OX50j6
• Elasticity: http://bit.ly/1YFfCr4
• Fish Bowl: http://bit.ly/1VjrcJd
• Blank Cheque: http://bit.ly/1RkTgGe