Masterclass Live: Amazon EMR


Abhishek Sinha – Sr. Product Manager | sinhaar@amazon.com | @abysinha

Amazon EMR

Making it easy, secure, and cost-effective to run data processing frameworks on the AWS cloud

Amazon EMR

• Managed platform
• Hadoop MapReduce, Spark, Presto, and more
• Launch clusters in minutes
• Apache Bigtop-based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable infrastructure

What do I need to build a cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

Cluster composition

• Master node: NameNode (HDFS), ResourceManager (YARN), and other components
• Core instance group: HDFS DataNode and YARN NodeManager
• Task instance groups: YARN NodeManager only

Choice of multiple instances

• CPU (machine learning): c3, c4 families
• Memory (in-memory Spark & Presto): m2, r3 families
• Disk/IO (large HDFS): d2, i2 families
• General (batch processing): m1, m3, m4 families

Or add EBS volumes if you need additional on-cluster storage.

Hadoop applications available in EMR

Or, use bootstrap actions to install arbitrary applications on your cluster!

Choose your software - Quick Create

Choose your software – Advanced Options

Configuration API for custom configs

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

Use the AWS CLI to easily create clusters:

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK for programmatic provisioning:
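For example, a minimal provisioning sketch with the Python SDK (boto3); the cluster name and region are placeholders, and the default EMR roles are assumed to exist:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="demo-cluster",                    # hypothetical name
    ReleaseLabel="emr-4.3.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
    },
    # Custom configs such as the core-site/mapred-site JSON above can be
    # passed via the Configurations parameter.
    JobFlowRole="EMR_EC2_DefaultRole",      # default instance profile
    ServiceRole="EMR_DefaultRole",          # default service role
)
print(response["JobFlowId"])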

Use Amazon EMR to separate your compute and storage.

On premises: compute and storage grow together

• Tightly coupled
• Storage grows along with compute
• Compute requirements vary

On premises: Underutilized or scarce resources

[Chart: cluster utilization over 26 time periods, showing a steady-state baseline, weekly peaks, and re-processing spikes]

On premises: Contention for same resources

Compute-bound and memory-bound workloads contend for the same cluster resources.

Separation of resources creates data silos


On premises: Replication adds to cost

HDFS needs 3x replication, plus multi-data center DR copies.

Use Amazon EMR to separate your compute and storage.

EMR can process data from many sources

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)

Amazon S3 is your persistent data store

• 11 9's of durability
• $0.03/GB/month in US East
• Lifecycle policies
• Available across AZs
• Easy access

The EMR File System (EMRFS)

• Allows you to leverage S3 as a file system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediate data
• Better read/write performance and error handling than open source components
• Optional consistent view for consistent listing
• Support for encryption
• Fast listing of objects

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/'

Benefit 1: Switch off clusters


Auto-terminate clusters after job completion
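A transient-cluster sketch in boto3: one hypothetical Spark step is queued at launch, and with KeepJobFlowAliveWhenNoSteps set to False the cluster terminates itself once the step queue drains.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="nightly-etl",                        # hypothetical job name
    ReleaseLabel="emr-4.3.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the last step
    },
    Steps=[{
        "Name": "etl-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)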

You can build a pipeline

Submit jobs using:

- EMR Step API (see the sketch below)
- Oozie
- SSH directly
- Genie (gateway)
- OSS workflow tools (e.g., Luigi)
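A minimal Step API sketch with boto3, assuming an already-running cluster; the cluster ID and script location are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # placeholder cluster ID
    Steps=[{
        "Name": "ad-hoc-spark-job",
        "ActionOnFailure": "CONTINUE",     # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/report.py"],  # hypothetical
        },
    }],
)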

You can use AWS Data Pipeline

[Pipeline: input data → EMR transforms unstructured data to structured data → push to S3 → ingest into Redshift]

Run transient or long-running clusters

Benefit 2: Resize your cluster to match workload requirements

Resize using the Console, CLI, or API.
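For example, a resize sketch via the API with boto3; the cluster ID is a placeholder, and which instance group to grow is your choice:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the cluster's instance groups, then grow one of them to 10 nodes
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")  # placeholder ID
task_group = groups["InstanceGroups"][0]  # pick the group you want to resize
emr.modify_instance_groups(
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,
    }],
)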

Save costs with EC2 Spot Instances

[Chart: the Spot market price fluctuating below the On-Demand (OD) price, with your bid price in between]

Spot integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

The Spot Bid Advisor

Spot Integration with EMR

• Can provision instances from the Spot market
• Replaces a Spot Instance in case of interruption
• Impact of interruption:
  • Master node: can lose the cluster
  • Core node: can lose data stored in HDFS
  • Task nodes: lose the task (but the task will run elsewhere)

Scale up with Spot Instances

10-node cluster running for 14 hours:
Cost = 1.0 * 10 * 14 = $140

Resize Nodes with Spot Instances

Add 10 more nodes on Spot.

20-node cluster running for 7 hours:
Cost = 1.0 * 10 * 7 = $70
    + 0.5 * 10 * 7 = $35
Total = $105

50% less run time (14 → 7 hours)
25% less cost ($140 → $105)

Intelligent scale down

Intelligent scale down – HDFS

Effectively utilize clusters

[Chart: cluster capacity resized over 26 time periods to track the workload]

Benefit 3: Logical separation of jobs

[Diagram: a prod cluster running Hive, Pig, and Cascading and a separate ad hoc Presto cluster, both reading the same data from Amazon S3]

Benefit 4: Disaster recovery built-in

[Diagram: clusters 1–4 spread across two Availability Zones, all reading from Amazon S3, with a shared Hive Metastore in Amazon RDS]

S3 as a data lake

Nate Summons, Principal Architect, NASDAQ

Monitoring with CloudWatch (or Ganglia)

EMR logging to S3 makes logs easily available

Spark moves at interactive speed

[Diagram: a Spark job as a DAG of map, filter, join, and groupBy operations split into stages, with cached RDD partitions]

• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in RDDs in memory
• Partitioning-aware to avoid network-intensive shuffle

Spark components to match your use case

Spark speaks your language

Use DataFrames to easily interact with data

• Distributed collection of data organized in columns
• An extension of the existing RDD API
• Optimized for query execution

Easily create DataFrames from many formats

Additional libraries for Spark SQL Data Sources at spark-packages.org

Load data with the Spark SQL Data Sources API

Additional libraries at spark-packages.org
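For example, a small PySpark sketch using the built-in JSON and Parquet sources (bucket and paths hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="datasources-sketch")
sqlContext = SQLContext(sc)

# JSON and Parquet readers ship with Spark SQL; other formats come
# from data source packages on spark-packages.org
logs = sqlContext.read.json("s3://my-bucket/logs/")         # hypothetical path
events = sqlContext.read.parquet("s3://my-bucket/events/")  # hypothetical path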

Sample DataFrame manipulations
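A representative set of manipulations on the logs DataFrame loaded above (column names hypothetical):

from pyspark.sql import functions as F

# Top 10 hosts by number of successful requests
(logs
    .filter(logs["status"] == 200)            # hypothetical column
    .groupBy("host")                          # hypothetical column
    .agg(F.count("*").alias("requests"))
    .orderBy(F.desc("requests"))
    .show(10))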

Use DataFrames for machine learning

• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models
• Create ML pipelines with a variety of distributed algorithms (see the sketch below)
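A minimal spark.ml pipeline sketch in PySpark, patterned on the standard tokenizer / hashing-TF / logistic regression example; the toy data is made up:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="ml-pipeline-sketch")
sqlContext = SQLContext(sc)

# Toy training data: (id, text, label)
training = sqlContext.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
], ["id", "text", "label"])

# Chain feature extraction and a classifier into a single pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(training)  # fits all stages as one unit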

Create DataFrames on streaming data

• Access data in Spark Streaming DStreams
• Create a SQLContext on the SparkContext used for the Spark Streaming application to run ad hoc queries (see the sketch below)
• Incorporate DataFrames in a Spark Streaming application
• Checkpoint streaming jobs
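A minimal sketch of the DStream-to-DataFrame pattern (source host, port, and checkpoint location hypothetical):

from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-df-sketch")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("s3://my-bucket/checkpoints/")    # hypothetical checkpoint dir
sqlContext = SQLContext(sc)                      # reuses the streaming app's SparkContext

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

def process(time, rdd):
    # Turn each micro-batch into a DataFrame for ad hoc SQL
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(rdd.map(lambda w: Row(word=w)))
        df.registerTempTable("words")
        sqlContext.sql("SELECT word, COUNT(*) AS c FROM words GROUP BY word").show()

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()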

Spark Pipeline

Use R to interact with DataFrames

• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala DataFrames

Amazon EMR runs Spark on YARN

• Dynamically share and centrally configure the same pool of cluster resources across engines
• Schedulers for categorizing, isolating, and prioritizing workloads
• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)
• Kerberos authentication

[Stack diagram: applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL) running on batch (MapReduce) and in-memory (Spark) engines, YARN for cluster resource management, and storage on S3 and HDFS]

Inside Spark Executor on YARN

The executor container must fit within the max container size on the node. Part of it is the executor memory overhead: off-heap memory for VM overheads, interned strings, etc.

spark.yarn.executor.memoryOverhead = executorMemory * 0.10

Config file: spark-defaults.conf

Inside Spark Executor on YARN

The rest of the container is the Spark executor memory: the amount of memory to use per executor process.

spark.executor.memory

Config file: spark-defaults.conf

Inside Spark Executor on YARN

Within the executor memory, the shuffle memoryFraction (pre-Spark 1.6) is reserved for shuffle. Default: 0.2.

Inside Spark Executor on YARN

The storage memoryFraction (pre-Spark 1.6) is reserved for caching. Default: 0.6.

Inside Spark Executor on YARN

In Spark 1.6+, Spark automatically balances the amount of memory used for execution and cached data within a unified region. Default: 0.6.
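As a worked example (numbers hypothetical): with spark.executor.memory = 9216 MB, the overhead is 9216 * 0.10 ≈ 922 MB, so the executor container needs roughly 9216 + 922 ≈ 10138 MB, which must fit within the max container size on the node.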

Dynamic Allocation on YARN

Scaling up on executors:
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time

The new default beginning with EMR 4.4.

Dynamic allocation setup

Property                                                              Value
spark.dynamicAllocation.enabled                                       true
spark.shuffle.service.enabled                                         true
spark.dynamicAllocation.minExecutors (optional)                       5
spark.dynamicAllocation.maxExecutors (optional)                       17
spark.dynamicAllocation.initialExecutors (optional)                   0
spark.dynamicAllocation.executorIdleTimeout (optional)                60s
spark.dynamicAllocation.schedulerBacklogTimeout (optional)            5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (optional)   5s

Compress your input data set

• Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs

Compression

Compression types:
– Some are fast but offer less space reduction
– Some are space-efficient but slower
– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s

Data Serialization

• Data is serialized when cached or shuffled
• Default: Java serializer
• Kryo serialization (up to 10x faster than Java serialization)
  • Does not support all Serializable types
  • Register the class in advance

Usage: set in SparkConf (see the sketch below):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
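A minimal PySpark sketch; the registered class is hypothetical:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Registering classes up front lets Kryo write compact class IDs
        .set("spark.kryo.classesToRegister", "com.example.MyRecord"))  # hypothetical class

sc = SparkContext(conf=conf)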

Running Spark on Amazon EMR

Focus on deriving insights from your data instead of manually configuring clusters:

• Easy to install and configure Spark
• Secured
• Spark submit, Oozie, or the Zeppelin UI
• Quickly add and remove capacity
• Hourly, reserved, or EC2 Spot pricing
• Use S3 to decouple compute and storage

Launch the latest Spark version: Spark 1.6.1 is the current version on EMR, with a <3-week cadence behind the latest open source release.

Create a fully configured cluster in minutes

• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API

Or easily change your settings.

Many storage layers to choose from

• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC data source with Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Spark-Redshift connector
• Amazon S3: EMR File System (EMRFS)

Decouple compute and storage by using S3 as your data layer

S3 is designed for 11 9's of durability and is massively scalable.

[Diagram: multiple Amazon EMR clusters (EC2 instances with memory and HDFS) all reading from and writing to Amazon S3]

Easy to run your Spark workloads

Submit a Spark application with the Amazon EMR Step API, or SSH to the master node and use spark-submit, Oozie, or Zeppelin.

Customer use cases

Some of our customers running Spark on EMR

Integration Pattern – ETL with Spark

[Diagram: Amazon EMR reads unstructured data from Amazon S3 (or loads from HDFS), extracts and transforms it, then writes structured output back to S3]

Integration Pattern – Tumbling Window Reporting

[Diagram: streaming input from Amazon Kinesis → tumbling/fixed-window aggregation on Amazon EMR (HDFS for intermediate state) → periodic output into Amazon Redshift via COPY from EMR, or checkpoint to S3 and use the Lambda loader app]

EMR Security Overview

Security fundamentals:
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints

Encryption:
• Server-side encryption (SSE-S3)
• Server-side encryption with KMS-provided keys (coming soon)
• Client-side encryption

Compliance:
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes

Networking: VPC private subnets

• Use Amazon S3 endpoints for connectivity to S3
• Use managed NAT for connectivity to other services or the Internet
• Control the traffic using security groups:
  • ElasticMapReduce-Master-Private
  • ElasticMapReduce-Slave-Private
  • ElasticMapReduce-ServiceAccess

Access Control: IAM Users and Roles

• IAM policies for access to the Amazon EMR service (IAM users or federated users):
  • AmazonElasticMapReduceFullAccess
  • AmazonElasticMapReduceReadOnlyAccess
• IAM roles for the Amazon EMR cluster:
  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances
  • Instance profile (AmazonElasticMapReduceforEC2Role): for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster

Data at Rest: S3 client-side encryption

[Diagram: EMRFS enabled for Amazon S3 client-side encryption; the Amazon S3 encryption clients use a key vendor (AWS KMS or your custom key vendor) to read and write client-side encrypted objects in Amazon S3]

Customer Stories

AOL’s Spot use case: restate 6 months of historical data

• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances

[Chart: timing comparison, in-house vs. AWS]

OUR CLOUD ARCHITECTURE

FINRA saves money with comparable performance running Hive on Tez against S3

Using EMR and cloud capacity for ETL

Bridging on-prem and EMR for easy ETL

Twitter (Answers) uses EMR as the batch layer in their Lambda architecture

SmartNews: using EMR for batch, streaming, and ad hoc processing

Nasdaq: data lake architecture diagram

Optimizing data warehousing costs with S3 and EMR

AWS Pop-up Loft London

Thank You

Abhishek Sinha | sinhaar@amazon.com | @abysinha