Hadoop in the cloud with AWS' EMR



Quick intro to, and walkthrough of, the AWS Elastic MapReduce (EMR) service. Part of a larger course at http://bit.ly/get-hadoop

Transcript of Hadoop in the cloud with AWS' EMR

Page 1: Hadoop in the cloud with AWS' EMR

Hadoop in the Cloud: AWS Elastic MapReduce

• What is EMR?
• How does EMR compare to Hadoop?
• Use cases

Page 2: Hadoop in the cloud with AWS' EMR

EMR is an AWS Service

• An AWS review is helpful for understanding EMR
• Infiniteskills offers a course!
  – http://bit.ly/learn-aws
• AWS is constantly changing and evolving

http://aws.amazon.com/documentation/elasticmapreduce/

Page 3: Hadoop in the cloud with AWS' EMR

EMR Overview

• Abstracts out cluster setup & management
  – Integrated provisioning, tooling, debug, monitoring
  – AWS constantly tuning and optimizing
  – Failed nodes automatically re-provisioned by AWS

• Reduced costs
  – Clusters shut down automatically by default
  – Excellent for sporadic MapReduce needs

• Integration to AWS
  – Leverage cost-effective EC2 instances for processing, S3 for storage
  – Monitoring done via CloudWatch
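The "reduced costs" point can be made concrete with a little arithmetic. The sketch below compares a cluster that runs all month with one that exists only while jobs run. Every number in it (node count, hourly rate, job length, jobs per month) is a made-up assumption for illustration, not AWS pricing, and it ignores storage, the EMR service charge, and billing granularity.

```python
def cluster_cost(nodes, hourly_rate, hours):
    """Compute cost of running `nodes` instances at `hourly_rate` for `hours`."""
    return nodes * hourly_rate * hours


# All values below are hypothetical, chosen only to illustrate the comparison.
NODES = 10             # instances in the cluster
HOURLY_RATE = 0.25     # USD per instance-hour (not a real AWS price)
HOURS_PER_MONTH = 730  # approximate hours in a month
JOB_HOURS = 2          # duration of one MapReduce job
JOBS_PER_MONTH = 8     # "sporadic" workload: about two jobs a week

# Always-on cluster: pay for every hour of the month.
always_on = cluster_cost(NODES, HOURLY_RATE, HOURS_PER_MONTH)

# Transient cluster: pay only while a job is running, then shut down.
transient = cluster_cost(NODES, HOURLY_RATE, JOB_HOURS * JOBS_PER_MONTH)

print(f"always-on: ${always_on:.2f}")
print(f"transient: ${transient:.2f}")
```

Under these assumptions the transient cluster is billed for 16 hours instead of 730, which is why automatic shutdown suits sporadic workloads.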

Page 4: Hadoop in the cloud with AWS' EMR

EMR Architecture

[Diagram: Master Instance Group (EC2); Core Instance Group (EC2 instances with HDFS); Task Instance Group (EC2 instances); S3]

• Master group controls cluster
• Core group runs DataNode & TaskTracker daemons
• Task group runs tasks
  – Can be added & removed
• S3 can be used for data input / output
• Master group coordinates core + task activities and manages cluster state
• Core + task instances read / write to / from S3
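As a rough illustration of the three instance groups and S3 input / output, here is a minimal sketch that only builds a request dictionary and never contacts AWS. Its shape follows the keyword arguments of the boto3 EMR client's `run_job_flow` call. The bucket name, instance types, counts, and script paths are hypothetical, and a real request would also need a release label and IAM roles, which are left out here.

```python
def instance_group(name, role, instance_type, count):
    """Describe one EMR instance group (role is MASTER, CORE, or TASK)."""
    return {
        "Name": name,
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "ON_DEMAND",
    }


def build_cluster_request(bucket, core_count=2, task_count=4):
    """Build a run_job_flow-style request; all names and paths are hypothetical."""
    return {
        "Name": "example-emr-cluster",
        "LogUri": f"s3://{bucket}/logs/",
        "Instances": {
            "InstanceGroups": [
                # Master group: coordinates the cluster.
                instance_group("master", "MASTER", "m5.xlarge", 1),
                # Core group: stores HDFS data and runs tasks.
                instance_group("core", "CORE", "m5.xlarge", core_count),
                # Task group: runs tasks only, so it can be resized freely.
                instance_group("task", "TASK", "m5.xlarge", task_count),
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "streaming-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hadoop-streaming",
                        "-files",
                        f"s3://{bucket}/code/mapper.py,s3://{bucket}/code/reducer.py",
                        "-mapper", "mapper.py",
                        "-reducer", "reducer.py",
                        # Input is read from S3 and output is written back to S3.
                        "-input", f"s3://{bucket}/input/",
                        "-output", f"s3://{bucket}/output/",
                    ],
                },
            }
        ],
    }


request = build_cluster_request("my-example-bucket")
roles = [g["InstanceRole"] for g in request["Instances"]["InstanceGroups"]]
print(roles)
```

Keeping the request as plain data makes it easy to inspect before anything is launched; with credentials configured and the missing fields filled in, a dictionary like this could be passed to `run_job_flow` as keyword arguments.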

Page 5: Hadoop in the cloud with AWS' EMR

EMR AWS Integration

• Datastore pull / push to
  – RDS
  – DynamoDB
  – S3

• Derived data can be stored in Redshift
  – Via AWS Data Pipeline
  – Further post-processing

• Data can be pre-processed with Kinesis

Page 6: Hadoop in the cloud with AWS' EMR

What you give up with EMR

• Control
  – Always 2-3 months behind Hadoop releases
  – Cannot use CDH or HDP releases (although MapR is supported)

• Speed (if you’re not an AWS customer)
• Vendor lock-in

Page 7: Hadoop in the cloud with AWS' EMR

EMR Use Cases

• Already an AWS customer
  – Lots of data in S3 / DynamoDB / RDS

• Sporadic MapReduce needs
• Hadoop proofs of concept
• Ease of use
  – Seamless, near-infinite scale
  – Simple administration

Page 8: Hadoop in the cloud with AWS' EMR

Hadoop in the Cloud: AWS Elastic MapReduce

• What is EMR?
• How does EMR compare to Hadoop?
• Benefits & downsides
• Use cases