Masterclass Live: Amazon EMR


Abhishek Sinha – Sr. Product Manager | sinhaar@amazon.com | @abysinha

Amazon EMR

Making it easy, secure, and cost-effective to run data processing frameworks on the AWS cloud

Amazon EMR

• Managed platform
• Hadoop MapReduce, Spark, Presto, and more
• Launch clusters in minutes
• Apache Bigtop-based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable infrastructure

What do I need to build a cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

Cluster composition

• Master node: NameNode (HDFS), ResourceManager (YARN), and other components
• Core instance group: HDFS DataNode and YARN NodeManager
• Task instance groups: YARN NodeManager only

Choice of multiple instances

• CPU (machine learning): c3, c4 families
• Memory (in-memory Spark & Presto): m2, r3 families
• Disk/IO (large HDFS): d2, i2 families
• General (batch processing): m1, m3, m4 families

Or add EBS volumes if you need additional on-cluster storage.

Hadoop applications available in EMR

Or, use bootstrap actions to install arbitrary applications on your cluster!

Choose your software - Quick Create

Choose your software – Advanced Options

Configuration API for custom configs

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

Use the AWS CLI to easily create clusters:

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK for programmatic provisioning:
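For example, a minimal provisioning sketch with the Python SDK (boto3); the cluster name and region are placeholders, and the default EMR roles are assumed to exist:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="demo-cluster",                    # hypothetical name
    ReleaseLabel="emr-4.3.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
    },
    # Custom configs such as the core-site/mapred-site JSON above can be
    # passed via the Configurations parameter.
    JobFlowRole="EMR_EC2_DefaultRole",      # default instance profile
    ServiceRole="EMR_DefaultRole",          # default service role
)
print(response["JobFlowId"])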

Use Amazon EMR to separate your compute and storage.

On premises: compute and storage grow together

• Tightly coupled
• Storage grows along with compute
• Compute requirements vary

On premises: Underutilized or scarce resources

[Chart: cluster utilization over 26 time periods, showing a steady-state baseline, weekly peaks, and re-processing spikes]

On premises: Contention for same resources

Compute-bound and memory-bound workloads contend for the same cluster resources.

Separation of resources creates data silos


On premises: Replication adds to cost

HDFS needs 3x replication, plus multi-data center DR copies.

Use Amazon EMR to separate your compute and storage.

EMR can process data from many sources

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)

Amazon S3 is your persistent data store

• 11 9's of durability
• $0.03/GB/month in US East
• Lifecycle policies
• Available across AZs
• Easy access

The EMR File System (EMRFS)

• Allows you to leverage S3 as a file system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediate data
• Better read/write performance and error handling than open source components
• Optional consistent view for consistent listing
• Support for encryption
• Fast listing of objects

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/'

Benefit 1: Switch off clusters


Auto-terminate clusters after job completion
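A transient-cluster sketch in boto3: one hypothetical Spark step is queued at launch, and with KeepJobFlowAliveWhenNoSteps set to False the cluster terminates itself once the step queue drains.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="nightly-etl",                        # hypothetical job name
    ReleaseLabel="emr-4.3.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the last step
    },
    Steps=[{
        "Name": "etl-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)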

You can build a pipeline

Submit jobs using:

- EMR Step API (see the sketch below)
- Oozie
- SSH directly
- Genie (gateway)
- OSS workflow tools (e.g., Luigi)
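A minimal Step API sketch with boto3, assuming an already-running cluster; the cluster ID and script location are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # placeholder cluster ID
    Steps=[{
        "Name": "ad-hoc-spark-job",
        "ActionOnFailure": "CONTINUE",     # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/report.py"],  # hypothetical
        },
    }],
)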

You can use AWS Data Pipeline

[Pipeline: input data → EMR transforms unstructured data to structured data → push to S3 → ingest into Redshift]

Run transient or long-running clusters

Benefit 2: Resize your cluster to match workload requirements

Resize using the Console, CLI, or API.
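For example, a resize sketch via the API with boto3; the cluster ID is a placeholder, and which instance group to grow is your choice:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the cluster's instance groups, then grow one of them to 10 nodes
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")  # placeholder ID
task_group = groups["InstanceGroups"][0]  # pick the group you want to resize
emr.modify_instance_groups(
    InstanceGroups=[{
        "InstanceGroupId": task_group["Id"],
        "InstanceCount": 10,
    }],
)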

Save costs with EC2 Spot Instances

[Chart: the Spot market price fluctuating below the On-Demand (OD) price, with your bid price in between]

Spot integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

The Spot Bid Advisor

Spot Integration with EMR

• Can provision instances from the Spot market
• Replaces a Spot Instance in case of interruption
• Impact of interruption:
  • Master node: can lose the cluster
  • Core node: can lose data stored in HDFS
  • Task nodes: lose the task (but the task will run elsewhere)

Scale up with Spot Instances

10-node cluster running for 14 hours:
Cost = 1.0 * 10 * 14 = $140

Resize Nodes with Spot Instances

Add 10 more nodes on Spot.

20-node cluster running for 7 hours:
Cost = 1.0 * 10 * 7 = $70
    + 0.5 * 10 * 7 = $35
Total = $105

50% less run time (14 → 7 hours)
25% less cost ($140 → $105)

Intelligent scale down

Intelligent scale down – HDFS

Effectively utilize clusters

[Chart: cluster capacity resized over 26 time periods to track the workload]

Benefit 3: Logical separation of jobs

[Diagram: a prod cluster running Hive, Pig, and Cascading and a separate ad hoc Presto cluster, both reading the same data from Amazon S3]

Benefit 4: Disaster recovery built-in

[Diagram: clusters 1–4 spread across two Availability Zones, all reading from Amazon S3, with a shared Hive Metastore in Amazon RDS]

S3 as a data lake

Nate Summons, Principal Architect, NASDAQ

Monitoring with CloudWatch (or Ganglia)

EMR logging to S3 makes logs easily available

Spark moves at interactive speed

[Diagram: a Spark job as a DAG of map, filter, join, and groupBy operations split into stages, with cached RDD partitions]

• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in RDDs in memory
• Partitioning-aware to avoid network-intensive shuffle

Spark components to match your use case

Spark speaks your language

Use DataFrames to easily interact with data

• Distributed collection of data organized in columns
• An extension of the existing RDD API
• Optimized for query execution

Easily create DataFrames from many formats

Additional libraries for Spark SQL Data Sources at spark-packages.org

Load data with the Spark SQL Data Sources API

Additional libraries at spark-packages.org
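For example, a small PySpark sketch using the built-in JSON and Parquet sources (bucket and paths hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="datasources-sketch")
sqlContext = SQLContext(sc)

# JSON and Parquet readers ship with Spark SQL; other formats come
# from data source packages on spark-packages.org
logs = sqlContext.read.json("s3://my-bucket/logs/")         # hypothetical path
events = sqlContext.read.parquet("s3://my-bucket/events/")  # hypothetical path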

Sample DataFrame manipulations
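A representative set of manipulations on the logs DataFrame loaded above (column names hypothetical):

from pyspark.sql import functions as F

# Top 10 hosts by number of successful requests
(logs
    .filter(logs["status"] == 200)            # hypothetical column
    .groupBy("host")                          # hypothetical column
    .agg(F.count("*").alias("requests"))
    .orderBy(F.desc("requests"))
    .show(10))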

Use DataFrames for machine learning

• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models
• Create ML pipelines with a variety of distributed algorithms (see the sketch below)
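A minimal spark.ml pipeline sketch in PySpark, patterned on the standard tokenizer / hashing-TF / logistic regression example; the toy data is made up:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="ml-pipeline-sketch")
sqlContext = SQLContext(sc)

# Toy training data: (id, text, label)
training = sqlContext.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
], ["id", "text", "label"])

# Chain feature extraction and a classifier into a single pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(training)  # fits all stages as one unit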

Create DataFrames on streaming data

• Access data in Spark Streaming DStreams
• Create a SQLContext on the SparkContext used for the Spark Streaming application to run ad hoc queries (see the sketch below)
• Incorporate DataFrames in a Spark Streaming application
• Checkpoint streaming jobs
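A minimal sketch of the DStream-to-DataFrame pattern (source host, port, and checkpoint location hypothetical):

from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-df-sketch")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("s3://my-bucket/checkpoints/")    # hypothetical checkpoint dir
sqlContext = SQLContext(sc)                      # reuses the streaming app's SparkContext

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

def process(time, rdd):
    # Turn each micro-batch into a DataFrame for ad hoc SQL
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(rdd.map(lambda w: Row(word=w)))
        df.registerTempTable("words")
        sqlContext.sql("SELECT word, COUNT(*) AS c FROM words GROUP BY word").show()

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()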

Spark Pipeline

Use R to interact with DataFrames

• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala DataFrames

Amazon EMR runs Spark on YARN

• Dynamically share and centrally configure the same pool of cluster resources across engines
• Schedulers for categorizing, isolating, and prioritizing workloads
• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)
• Kerberos authentication

[Stack diagram: applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL) running on batch (MapReduce) and in-memory (Spark) engines, YARN for cluster resource management, and storage on S3 and HDFS]

Inside Spark Executor on YARN

The executor container must fit within the max container size on the node. Part of it is the executor memory overhead: off-heap memory for VM overheads, interned strings, etc.

spark.yarn.executor.memoryOverhead = executorMemory * 0.10

Config file: spark-defaults.conf

Inside Spark Executor on YARN

The rest of the container is the Spark executor memory: the amount of memory to use per executor process.

spark.executor.memory

Config file: spark-defaults.conf

Inside Spark Executor on YARN

Within the executor memory, the shuffle memoryFraction (pre-Spark 1.6) is reserved for shuffle. Default: 0.2.

Inside Spark Executor on YARN

The storage memoryFraction (pre-Spark 1.6) is reserved for caching. Default: 0.6.

Inside Spark Executor on YARN

In Spark 1.6+, Spark automatically balances the amount of memory used for execution and cached data within a unified region. Default: 0.6.
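As a worked example (numbers hypothetical): with spark.executor.memory = 9216 MB, the overhead is 9216 * 0.10 ≈ 922 MB, so the executor container needs roughly 9216 + 922 ≈ 10138 MB, which must fit within the max container size on the node.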

Dynamic Allocation on YARN

Scaling up on executors:
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time

The new default beginning with EMR 4.4.

Dynamic allocation setup

Property                                                              Value
spark.dynamicAllocation.enabled                                       true
spark.shuffle.service.enabled                                         true
spark.dynamicAllocation.minExecutors (optional)                       5
spark.dynamicAllocation.maxExecutors (optional)                       17
spark.dynamicAllocation.initialExecutors (optional)                   0
spark.dynamicAllocation.executorIdleTimeout (optional)                60s
spark.dynamicAllocation.schedulerBacklogTimeout (optional)            5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (optional)   5s

Compress your input data set

• Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs

Compression

Compression types:
– Some are fast but offer less space reduction
– Some are space-efficient but slower
– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s

Data Serialization

• Data is serialized when cached or shuffled
• Default: Java serializer
• Kryo serialization (up to 10x faster than Java serialization)
  • Does not support all Serializable types
  • Register the class in advance

Usage: set in SparkConf (see the sketch below):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
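A minimal PySpark sketch; the registered class is hypothetical:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Registering classes up front lets Kryo write compact class IDs
        .set("spark.kryo.classesToRegister", "com.example.MyRecord"))  # hypothetical class

sc = SparkContext(conf=conf)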

Running Spark on Amazon EMR

Focus on deriving insights from your data instead of manually configuring clusters:

• Easy to install and configure Spark
• Secured
• Spark submit, Oozie, or the Zeppelin UI
• Quickly add and remove capacity
• Hourly, reserved, or EC2 Spot pricing
• Use S3 to decouple compute and storage

Launch the latest Spark version: Spark 1.6.1 is the current version on EMR, with a <3-week cadence behind the latest open source release.

Create a fully configured cluster in minutes

• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API

Or easily change your settings.

Many storage layers to choose from

• Amazon DynamoDB: EMR-DynamoDB connector
• Amazon RDS: JDBC data source with Spark SQL
• Amazon Kinesis: streaming data connectors
• Elasticsearch: Elasticsearch connector
• Amazon Redshift: Spark-Redshift connector
• Amazon S3: EMR File System (EMRFS)

Decouple compute and storage by using S3 as your data layer

S3 is designed for 11 9's of durability and is massively scalable.

[Diagram: multiple Amazon EMR clusters (EC2 instances with memory and HDFS) all reading from and writing to Amazon S3]

Easy to run your Spark workloads

Submit a Spark application with the Amazon EMR Step API, or SSH to the master node and use spark-submit, Oozie, or Zeppelin.

Customer use cases

Some of our customers running Spark on EMR

Integration Pattern – ETL with Spark

[Diagram: Amazon EMR reads unstructured data from Amazon S3 (or loads from HDFS), extracts and transforms it, then writes structured output back to S3]

Integration Pattern – Tumbling Window Reporting

[Diagram: streaming input from Amazon Kinesis → tumbling/fixed-window aggregation on Amazon EMR (HDFS for intermediate state) → periodic output into Amazon Redshift via COPY from EMR, or checkpoint to S3 and use the Lambda loader app]

EMR Security Overview

Security fundamentals:
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints

Encryption:
• Server-side encryption (SSE-S3)
• Server-side encryption with KMS-provided keys (coming soon)
• Client-side encryption

Compliance:
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes

Networking: VPC private subnets

• Use Amazon S3 endpoints for connectivity to S3
• Use managed NAT for connectivity to other services or the Internet
• Control the traffic using security groups:
  • ElasticMapReduce-Master-Private
  • ElasticMapReduce-Slave-Private
  • ElasticMapReduce-ServiceAccess

Access Control: IAM Users and Roles

• IAM policies for access to the Amazon EMR service (IAM users or federated users):
  • AmazonElasticMapReduceFullAccess
  • AmazonElasticMapReduceReadOnlyAccess
• IAM roles for the Amazon EMR cluster:
  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances
  • Instance profile (AmazonElasticMapReduceforEC2Role): for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster

Data at Rest: S3 client-side encryption

[Diagram: EMRFS enabled for Amazon S3 client-side encryption; the Amazon S3 encryption clients use a key vendor (AWS KMS or your custom key vendor) to read and write client-side encrypted objects in Amazon S3]

Customer Stories

AOL’s Spot use case: restate 6 months of historical data

• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances

[Chart: timing comparison, in-house vs. AWS]

OUR CLOUD ARCHITECTURE

FINRA saves money with comparable performance running Hive on Tez against S3

Using EMR and cloud capacity for ETL

Bridging on-prem and EMR for easy ETL

Twitter (Answers) uses EMR as the batch layer in their Lambda architecture

SmartNews: using EMR for batch, streaming, and ad hoc processing

Nasdaq: data lake architecture diagram

Optimizing data warehousing costs with S3 and EMR

AWS Pop-up Loft London

Thank You

Abhishek Sinha | sinhaar@amazon.com | @abysinha