Masterclass Live: Amazon EMR

Abhishek Sinha Sr. Product Manager [email protected] @abysinha Amazon EMR

Transcript of Masterclass Live: Amazon EMR

Page 1: Masterclass Live: Amazon EMR

Abhishek Sinha – Sr. Product Manager

[email protected]

@abysinha

Amazon EMR

Page 2: Masterclass Live: Amazon EMR

Amazon EMR

Making it easy, secure and cost-effective to run data processing frameworks on the AWS cloud

Page 3: Masterclass Live: Amazon EMR

Amazon EMR

• Managed platform

• Hadoop MapReduce, Spark, Presto, and more

• Launch clusters in minutes

• Apache Bigtop based distribution

• Leverage the elasticity of the cloud

• Added security features

• Pay by the hour and save with Spot

• Flexibility to customize

• Programmable Infrastructure

Page 4: Masterclass Live: Amazon EMR

What do I need to build a cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

Page 5: Masterclass Live: Amazon EMR

Cluster composition

Master Node: NameNode (HDFS), ResourceManager (YARN), and other components

Core Instance Group: HDFS DataNode, YARN NodeManager

Task Instance Groups: YARN NodeManager

Page 6: Masterclass Live: Amazon EMR

Choice of multiple instances

CPU: c3 family, c4 family

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

General: m1 family, m3 family, m4 family

Example workloads: Machine Learning, Batch Processing, In-memory (Spark & Presto), Large HDFS

Or add EBS volumes if you need additional on-cluster storage.

Page 7: Masterclass Live: Amazon EMR

Hadoop applications available in EMR

Or, use bootstrap actions to install arbitrary applications on your cluster!

Page 8: Masterclass Live: Amazon EMR

Choose your software - Quick Create

Page 9: Masterclass Live: Amazon EMR

Choose your software – Advanced Options

Page 10: Masterclass Live: Amazon EMR

Configuration API for custom configs

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

Page 11: Masterclass Live: Amazon EMR

Use the AWS CLI to easily create clusters:

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK for programmatic provisioning:
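
A minimal sketch of programmatic provisioning, assuming the AWS SDK for Python (boto3) and its run_job_flow call. It mirrors the CLI example above (emr-4.3.0, one master and two core m3.xlarge nodes) and reuses the core-site classification from the Configuration API slide. The cluster name, the role names (EMR_DefaultRole, EMR_EC2_DefaultRole), the region, and the helper function names are illustrative assumptions, not taken from the slides.

```python
def build_cluster_request(name="masterclass-demo"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster name and IAM role names are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # True keeps the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "Configurations": [
            {"Classification": "core-site",
             "Properties": {"hadoop.security.groups.cache.secs": "250"}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Submit the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the request builder works without the SDK
    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]


request = build_cluster_request()
total_nodes = sum(g["InstanceCount"] for g in request["Instances"]["InstanceGroups"])
print(total_nodes)
```

Keeping the request builder separate from the API call makes it possible to review or unit test the cluster definition without touching an AWS account.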

Page 12: Masterclass Live: Amazon EMR

Use Amazon EMR to

separate your

compute and storage.

Page 13: Masterclass Live: Amazon EMR

On premises: compute and storage grow together

Tightly coupled

Storage grows along with

compute

Compute requirements vary

Page 14: Masterclass Live: Amazon EMR

On premises: Underutilized or scarce resources

[Chart labels: Re-processing, Weekly peaks, Steady state]

Page 15: Masterclass Live: Amazon EMR

On premises: Contention for same resources

Compute bound

Memory bound

Page 16: Masterclass Live: Amazon EMR

Separation of resources creates data silos

Team A

Page 17: Masterclass Live: Amazon EMR

On premises: Replication adds to cost

3x

HDFS needs 3x

Multi-Data Center DR

Page 18: Masterclass Live: Amazon EMR

Use Amazon EMR to

separate your

compute and storage.

Page 19: Masterclass Live: Amazon EMR

EMR can process data from many sources

• Hadoop Distributed File System (HDFS)

• Amazon S3 (EMRFS)

• Amazon DynamoDB, Redshift, Aurora, RDS

• Amazon Kinesis

• Other applications running in your architecture (Kafka, Elasticsearch, etc.)

Page 20: Masterclass Live: Amazon EMR

Amazon S3 is your persistent data store

11 9’s of durability

$0.03 / GB / Month in US-East

Life Cycle Policies

Available across AZs

Easy access

Amazon S3

Page 21: Masterclass Live: Amazon EMR

The EMR Filesystem (EMRFS)

• Allows you to leverage S3 as a file-system for Hadoop

• Streams data directly from S3

• Cluster still uses local disk/HDFS for intermediates

• Better read/write performance and error handling than open source components

• Optional consistent view for consistent listing

• Support for encryption

• Fast listing of objects

Page 22: Masterclass Live: Amazon EMR

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Page 23: Masterclass Live: Amazon EMR

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'

Page 24: Masterclass Live: Amazon EMR

Benefit 1: Switch off clusters

Amazon S3

Page 25: Masterclass Live: Amazon EMR

Auto-terminate clusters after job completion

Page 26: Masterclass Live: Amazon EMR

You can build a pipeline

Submit jobs using:

- EMR Step API

- Oozie

- SSH directly

- Genie (Gateway)

- OSS workflow tools (e.g., Luigi)
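
As a hedged illustration of the first option, the sketch below builds a step definition for the EMR Step API and submits it with boto3's add_job_flow_steps. It assumes an EMR 4.x release, where command-runner.jar can run spark-submit on the cluster. The bucket path, step name, region, and helper names are hypothetical.

```python
def spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Return one step definition in the shape the EMR Step API expects."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs the given command on the master node.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_steps(cluster_id, steps, region="us-east-1"):
    """Send the steps to a running cluster. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the step builder works without the SDK
    emr = boto3.client("emr", region_name=region)
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)["StepIds"]


# Hypothetical script location, used only for illustration.
steps = [spark_step("nightly-etl", "s3://example-bucket/jobs/etl.py")]
print(steps[0]["HadoopJarStep"]["Args"])
```

Steps run in the order they are added, so a pipeline can be expressed as a list of such definitions.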

Page 27: Masterclass Live: Amazon EMR

You can use AWS Data Pipeline

Input data

Use EMR to transform unstructured to structured data

Push to S3

Ingest into Redshift

Page 28: Masterclass Live: Amazon EMR

Run transient or long-running clusters

Page 29: Masterclass Live: Amazon EMR

Benefit 2: Resize your cluster to match workload requirements

Page 30: Masterclass Live: Amazon EMR

Resize using the Console, CLI, or API

Page 31: Masterclass Live: Amazon EMR

Save costs with EC2 Spot instances

Bid Price

OD Price

Page 32: Masterclass Live: Amazon EMR

Spot integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Page 33: Masterclass Live: Amazon EMR

The Spot Bid Advisor

Page 34: Masterclass Live: Amazon EMR

Spot Integration with EMR

• Can provision instances from the Spot Market

• Replaces a Spot Instance in case of interruption

• Impact of interruption

  • Master Node: can lose the cluster

  • Core Node: can lose data stored in HDFS

  • Task Nodes: lose the task (but the task will run elsewhere)

Page 35: Masterclass Live: Amazon EMR

Scale up with Spot Instances

10 node cluster running for 14 hours

Cost = 1.0 * 10 * 14 = $140

Page 36: Masterclass Live: Amazon EMR

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

Page 37: Masterclass Live: Amazon EMR

Resize Nodes with Spot Instances

20 node cluster running for 7 hours

On-Demand cost = 1.0 * 10 * 7 = $70

Spot cost = 0.5 * 10 * 7 = $35

Total = $105

Page 38: Masterclass Live: Amazon EMR

Resize Nodes with Spot Instances

50% less run-time (14 → 7 hours)

25% less cost ($140 → $105)
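
The arithmetic on the preceding slides can be checked with a short sketch. It assumes, as the slides do, an hourly rate of $1.00 per On-Demand node and $0.50 per Spot node, and that doubling the node count halves the run-time. The function name is illustrative.

```python
def cluster_cost(node_groups, hours):
    """Sum the cost over (node_count, hourly_rate) groups for a run of `hours`."""
    return sum(count * rate * hours for count, rate in node_groups)


# 10 On-Demand nodes for 14 hours.
baseline = cluster_cost([(10, 1.0)], hours=14)

# 10 On-Demand nodes plus 10 Spot nodes for 7 hours.
resized = cluster_cost([(10, 1.0), (10, 0.5)], hours=7)

runtime_saving = 1 - 7 / 14
cost_saving = 1 - resized / baseline

print(baseline, resized, runtime_saving, cost_saving)
```

In practice the Spot price and the speed-up both vary, so the saving is workload dependent.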

Page 39: Masterclass Live: Amazon EMR

Intelligent scale down

Page 40: Masterclass Live: Amazon EMR

Intelligent scale down – HDFS

Page 41: Masterclass Live: Amazon EMR

Effectively utilize clusters


Page 42: Masterclass Live: Amazon EMR

Benefit 3: Logical separation of jobs

Prod: Hive, Pig, Cascading

Ad-Hoc: Presto

Amazon S3

Page 43: Masterclass Live: Amazon EMR

Benefit 4: Disaster recovery built-in

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Hive Metastore in

Amazon RDS

Page 44: Masterclass Live: Amazon EMR

S3 as a data-lake

Nate Sammons, Principal Architect - NASDAQ

Page 45: Masterclass Live: Amazon EMR

Monitoring with CloudWatch (or Ganglia)

Page 46: Masterclass Live: Amazon EMR

EMR logging to S3 makes logs easily available

Page 47: Masterclass Live: Amazon EMR
Page 48: Masterclass Live: Amazon EMR

Spark moves at interactive speed

[Diagram: a DAG of RDDs A to F split into Stage 1, Stage 2, and Stage 3, connected by map, filter, groupBy, and join; legend: RDD, cached partition]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in RDDs in memory

• Partitioning-aware to avoid network-intensive shuffle

Page 49: Masterclass Live: Amazon EMR

Spark components to match your use case

Page 50: Masterclass Live: Amazon EMR

Spark speaks your language

Page 51: Masterclass Live: Amazon EMR

Use DataFrames to easily interact with data

• Distributed collection of data organized in columns

• An extension of the existing RDD API

• Optimized for query execution

Page 52: Masterclass Live: Amazon EMR

Easily create DataFrames from many formats

RDD

Additional libraries for Spark SQL Data Sources at spark-packages.org

Page 53: Masterclass Live: Amazon EMR

Load data with the Spark SQL Data Sources API

Additional libraries at spark-packages.org

Page 54: Masterclass Live: Amazon EMR

Sample DataFrame manipulations

Page 55: Masterclass Live: Amazon EMR

Use DataFrames for machine learning

• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models

• Create ML pipelines with a variety of distributed algorithms

Page 56: Masterclass Live: Amazon EMR

Create DataFrames on streaming data

• Access data in Spark Streaming DStream

• Create a SQLContext on the SparkContext used by the Spark Streaming application for ad hoc queries

• Incorporate DataFrame in Spark Streaming application

• Checkpointing streaming jobs

Page 57: Masterclass Live: Amazon EMR

Spark Pipeline

Page 58: Masterclass Live: Amazon EMR

Use R to interact with DataFrames

• SparkR package for using R to manipulate DataFrames

• Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156)

• Comparable performance to Python and Scala DataFrames

Page 59: Masterclass Live: Amazon EMR
Page 60: Masterclass Live: Amazon EMR

Amazon EMR runs Spark on YARN

• Dynamically share and centrally configure the same pool of cluster resources across engines

• Schedulers for categorizing, isolating, and prioritizing workloads

• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)

• Kerberos authentication

Storage: S3, HDFS

YARN: Cluster Resource Management

Batch: MapReduce

In Memory: Spark

Applications: Pig, Hive, Cascading, Spark Streaming, Spark SQL

Page 61: Masterclass Live: Amazon EMR

Inside Spark Executor on YARN

Max Container size on node

Executor Memory Overhead: off-heap memory (VM overheads, interned strings, etc.)

spark.yarn.executor.memoryOverhead = executorMemory * 0.10

Executor Container: Memory Overhead

Config File: spark-defaults.conf
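
To make the overhead formula concrete, here is a small sketch of how the YARN container request for one executor could be estimated. The 0.10 factor comes from the slide. The 384 MB floor is not on the slide: it is the minimum overhead Spark applies on YARN, stated here as an added assumption. The function names are illustrative, and YARN's rounding to its allocation increment is ignored.

```python
MIN_OVERHEAD_MB = 384   # assumed floor applied by Spark on YARN (not on the slide)
OVERHEAD_FACTOR = 0.10  # factor shown on the slide


def executor_container_mb(executor_memory_mb):
    """Estimate the container size requested for one executor, in MB."""
    overhead = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead


def fits_in_container(executor_memory_mb, max_container_mb):
    """Check the estimate against the maximum container size on a node."""
    return executor_container_mb(executor_memory_mb) <= max_container_mb


print(executor_container_mb(8192))
print(executor_container_mb(2048))
print(fits_in_container(8192, 8192))
```

The last call shows why spark.executor.memory has to be set below the maximum container size: the overhead is added on top of it.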

Page 62: Masterclass Live: Amazon EMR

Inside Spark Executor on YARN

Max Container size on node

Spark executor memory - Amount of memory to use per executor process

spark.executor.memory

Executor Container: Memory Overhead, Spark Executor Memory

Config File: spark-defaults.conf

Page 63: Masterclass Live: Amazon EMR

Inside Spark Executor on YARN

Max Container size on node

Shuffle memory fraction (spark.shuffle.memoryFraction), pre-Spark 1.6

Executor Container: Memory Overhead, Spark Executor Memory (Shuffle memoryFraction)

Default: 0.2

Page 64: Masterclass Live: Amazon EMR

Inside Spark Executor on YARN

Max Container size on node

Storage memory fraction (spark.storage.memoryFraction), pre-Spark 1.6

Executor Container: Memory Overhead, Spark Executor Memory (Shuffle memoryFraction, Storage memoryFraction)

Default: 0.6
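
A rough sketch of how the pre-Spark 1.6 fractions divide executor memory, using the defaults shown on these slides (0.2 for shuffle, 0.6 for storage). It is a simplification: it ignores the additional safety fractions that Spark applied to each region, and the function name is illustrative.

```python
def legacy_memory_regions_mb(executor_memory_mb,
                             shuffle_fraction=0.2,
                             storage_fraction=0.6):
    """Split executor memory into shuffle, storage, and remaining space (MB)."""
    shuffle = round(executor_memory_mb * shuffle_fraction)
    storage = round(executor_memory_mb * storage_fraction)
    other = executor_memory_mb - shuffle - storage
    return {"shuffle": shuffle, "storage": storage, "other": other}


regions = legacy_memory_regions_mb(10000)
print(regions)
```

Because the split is static, a job that caches little still cannot lend its unused storage region to shuffles.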

Page 65: Masterclass Live: Amazon EMR

Inside Spark Executor on YARN

Max Container size on node

In Spark 1.6+, Spark automatically balances the amount of memory for execution and cached data.

Executor Container: Memory Overhead, Spark Executor Memory (Execution / Cache)

Default: 0.6

Page 66: Masterclass Live: Amazon EMR

Dynamic Allocation on YARN

Scaling up on executors

- Request when you want the job to complete faster

- Idle resources on cluster

- Exponential increase in executors over time

New default beginning with EMR 4.4

Page 67: Masterclass Live: Amazon EMR

Dynamic allocation setup

Property                                                   Value
spark.dynamicAllocation.enabled                            true
spark.shuffle.service.enabled                              true

Optional:
spark.dynamicAllocation.minExecutors                       5
spark.dynamicAllocation.maxExecutors                       17
spark.dynamicAllocation.initialExecutors                   0
spark.dynamicAllocation.executorIdleTimeout                60s
spark.dynamicAllocation.schedulerBacklogTimeout            5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout   5s
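
The "exponential increase in executors over time" can be illustrated with a simplified model. It assumes tasks stay backlogged for every round, that the first request comes after schedulerBacklogTimeout and later ones after each sustainedSchedulerBacklogTimeout, and that each request doubles the previous one. Real Spark also limits requests by the number of pending tasks, which this sketch ignores. The function name is illustrative.

```python
def executor_ramp(min_executors, max_executors, rounds):
    """Simulate the target executor count while the task backlog persists.

    Each round adds twice as many executors as the previous round
    (1, 2, 4, ...), capped at max_executors.
    """
    target = min_executors
    to_add = 1
    history = [target]
    for _ in range(rounds):
        target = min(max_executors, target + to_add)
        history.append(target)
        to_add *= 2
    return history


# minExecutors and maxExecutors taken from the table above.
print(executor_ramp(5, 17, 5))
```

With the 5s timeouts from the table, this model reaches the maximum of 17 executors after four rounds, that is, about 20 seconds of sustained backlog.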

Page 68: Masterclass Live: Amazon EMR

Compress your input data set

• Always compress data files on Amazon S3

  • Reduces storage cost

  • Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs

Page 69: Masterclass Live: Amazon EMR

Compression

Compression types:

– Some are fast BUT offer less space reduction

– Some are space efficient BUT slower

– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
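
The "% space remaining" column can be reproduced for your own data with a few lines. This sketch uses only GZIP from the Python standard library, since LZO and Snappy need third-party packages. The sample log line is synthetic and highly repetitive, so its ratio is an assumption-laden illustration and is not comparable to the figures in the table.

```python
import gzip


def space_remaining(data, level=6):
    """Return compressed size divided by original size for GZIP."""
    compressed = gzip.compress(data, compresslevel=level)
    return len(compressed) / len(data)


# Synthetic, repetitive sample data used only for illustration.
sample = b"127.0.0.1 - - GET /index.html 200\n" * 10000

ratio = space_remaining(sample)
round_trip_ok = gzip.decompress(gzip.compress(sample)) == sample

print(f"{ratio:.1%} of the original size, round trip ok: {round_trip_ok}")
```

Measuring on a representative slice of real input is the more reliable way to choose between a space-efficient and a fast codec.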

Page 70: Masterclass Live: Amazon EMR

Data Serialization

• Data is serialized when cached or shuffled

• Default: Java serializer

• Kryo serialization (up to 10x faster than Java serialization)

  • Does not support all Serializable types

  • Register the class in advance

Usage: Set in SparkConf

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Page 71: Masterclass Live: Amazon EMR

Running Spark on

Amazon EMR

Page 72: Masterclass Live: Amazon EMR

Focus on deriving insights from your data

instead of manually configuring clusters

Easy to install and configure Spark

Secured

Submit with spark-submit or Oozie, or use the Zeppelin UI

Quickly add and remove capacity

Hourly, reserved, or EC2 Spot pricing

Use S3 to decouple compute and storage

Page 73: Masterclass Live: Amazon EMR

Launch the latest Spark version

Spark 1.6.1 is the current version on EMR.

< 3 week cadence with latest open source release

Page 74: Masterclass Live: Amazon EMR

Create a fully configured cluster in minutes

AWS Management Console

AWS Command Line Interface (CLI)

Or use an AWS SDK directly with the Amazon EMR API

Page 75: Masterclass Live: Amazon EMR

Or easily change your settings

Page 76: Masterclass Live: Amazon EMR

Many storage layers to choose from

Amazon DynamoDB: EMR-DynamoDB connector

Amazon RDS: JDBC Data Source w/ Spark SQL

Amazon Kinesis: Streaming data connectors

Elasticsearch: Elasticsearch connector

Amazon Redshift: Spark-Redshift connector

Amazon S3: EMR File System (EMRFS)

Amazon EMR

Page 77: Masterclass Live: Amazon EMR

Decouple compute and storage by using S3 as your data layer

S3 is designed for 11 9’s of durability and is massively scalable

[Diagram labels: HDFS, EC2 Instance Memory, Amazon S3, Amazon EMR]

Page 78: Masterclass Live: Amazon EMR

Easy to run your Spark workloads

Amazon EMR Step API: submit a Spark application

SSH to the master node and use spark-submit, Oozie, or Zeppelin

Amazon EMR

Page 79: Masterclass Live: Amazon EMR

Customer use cases

Page 80: Masterclass Live: Amazon EMR

Some of our customers running Spark on EMR

Page 81: Masterclass Live: Amazon EMR
Page 82: Masterclass Live: Amazon EMR

Integration Pattern – ETL with Spark

[Diagram labels: Amazon S3, Amazon EMR, HDFS]

Read Unstructured Data

Write Structured

Extract

Load from HDFS

Store Output Data

Page 83: Masterclass Live: Amazon EMR

Integration Pattern – Tumbling Window Reporting

Amazon Kinesis: Streaming Input

Amazon EMR (HDFS): Tumbling/Fixed Window Aggregation

Amazon Redshift: Periodic Output, COPY from EMR

Or checkpoint to S3 and use the Lambda loader app
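
To show what a tumbling (fixed) window aggregation computes, here is a plain Python sketch. It is not Spark Streaming code and does not read from Kinesis. The event format (timestamp in seconds, key) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (seconds since start, page id).
events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
result = tumbling_window_counts(events, 60)
print(result)
```

Each completed window yields one batch of rows, which is what makes the periodic load into a warehouse straightforward.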

Page 84: Masterclass Live: Amazon EMR

EMR Security Overview

Page 85: Masterclass Live: Amazon EMR

Security Fundamentals

• Identity and Access Management (IAM) policies

• Bucket policies

• Access Control Lists (ACLs)

• Query string authentication

Encryption

• SSL endpoints

• Server Side Encryption (SSE-S3)

• Server Side Encryption with KMS provided keys (coming soon)

• Client-side Encryption

Compliance

• Bucket access logs

• Lifecycle Management Policies

• Access Control Lists (ACLs)

• Versioning & MFA deletes

Page 86: Masterclass Live: Amazon EMR

Networking: VPC private subnets

• Use Amazon S3 Endpoints for connectivity to S3

• Use Managed NAT for connectivity to other services or the Internet

• Control the traffic using Security Groups

  • ElasticMapReduce-Master-Private

  • ElasticMapReduce-Slave-Private

  • ElasticMapReduce-ServiceAccess

Page 87: Masterclass Live: Amazon EMR

Access Control: IAM Users and Roles

• IAM Policies for access to Amazon EMR service (IAM users or federated users)

  • AmazonElasticMapReduceFullAccess

  • AmazonElasticMapReduceReadOnlyAccess

• IAM Policies for Amazon EMR cluster

  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances.

  • Instance profile (AmazonElasticMapReduceforEC2Role): permissions for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster.

Page 88: Masterclass Live: Amazon EMR

Data at Rest: S3 client-side encryption

Amazon S3

Amazon S3 encryption clients

EMRFS enabled for Amazon S3 client-side encryption

Key vendor (AWS KMS or your custom key vendor)

(client-side encrypted objects)

Page 89: Masterclass Live: Amazon EMR

Customer Stories

Page 90: Masterclass Live: Amazon EMR

AOL’s Spot Use Case: restate 6 months of historical data

10 Availability Zones

550 EMR Clusters

24,000 Spot EC2 Instances

[Chart: Timing Comparison, In-House vs. AWS]

Page 91: Masterclass Live: Amazon EMR

OUR CLOUD ARCHITECTURE

Page 92: Masterclass Live: Amazon EMR

FINRA saves money with comparable performance with Hive on Tez with S3

Page 93: Masterclass Live: Amazon EMR

Using EMR and cloud capacity for ETL

Page 94: Masterclass Live: Amazon EMR

Bridging on-prem and EMR for easy ETL

Page 95: Masterclass Live: Amazon EMR
Page 96: Masterclass Live: Amazon EMR

Twitter (Answers) uses EMR as the batch layer in their Lambda architecture

Page 97: Masterclass Live: Amazon EMR

Using EMR for batch, streaming, and ad hoc

SmartNews

Page 98: Masterclass Live: Amazon EMR

Nasdaq: data lake architecture diagram

Optimizing data warehousing costs with S3 and EMR

Page 99: Masterclass Live: Amazon EMR

AWS Pop-up Loft London

Thank You

Abhishek Sinha | [email protected] | @abysinha