Amazon Elastic MapReduce (EMR): Hadoop as a Service

Amazon Elastic MapReduce

Ville Seppänen | Jari Voutilainen | @Vilsepi @Zharktas @GoforeOy

https://twitter.com/Zharktas

https://twitter.com/GoforeOy

https://twitter.com/Vilsepi

Agenda

1. Introduction to Hadoop Streaming and Elastic MapReduce

2. Simple EMR web interface demo

3. Introduction to our dataset

4. Using EMR from command line with boto

All presentation material is available at https://github.com/gofore/aws-emr

Hadoop Streaming

Utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input my/Input/Directories \
        -output my/Output/Directory \
        -mapper myMapperProgram.py \
        -reducer myReducerProgram.py

cat input_data.txt | mapper.py | reducer.py > output_data.txt
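The shell pipeline above leaves out one thing Hadoop does between the two stages: it sorts the mapper output by key, so that all values for a key reach the reducer together. The following is a minimal local sketch of that contract, not Hadoop itself; the names `mapper`, `reducer` and `run_streaming` are made up for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the values of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_streaming(lines):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


print(run_streaming(["to be or", "not to be"]))
```

The reducer only works because its input is sorted; feeding it unsorted mapper output would produce several partial sums for the same key.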

Elastic MapReduce (EMR)

Amazon Elastic MapReduce

Hadoop-based MapReduce cluster as a service

Can run either Amazon-optimized Hadoop or MapR

Managed from a web UI or through API

http://aws.amazon.com/elasticmapreduce/

Hadoop streaming in EMR

Quick EMR demo

The endlessly fascinating example of counting words in Hadoop

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started-count-words.html

Cluster creation steps

Cluster: name, logging

Tags: keywords for the cluster

Software: Hadoop distribution and version, pre-installed applications (Hive, Pig,...)

File System: encryption, consistency

Hardware: number and type of instances

Security and Access: ssh keys, node access roles

Bootstrap Actions: scripts to customize the cluster

Steps: a queue of mapreduce jobs for the cluster

(mapper) WordSplitter.py

    #!/usr/bin/python
    import sys
    import re

    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            print "LongValueSum:" + word.lower() + "\t" + "1"

Mapper output:

    LongValueSum:i        1
    LongValueSum:count    1
    LongValueSum:words    1
    LongValueSum:with     1
    LongValueSum:hadoop   1

https://s3.amazonaws.com/elasticmapreduce/samples/wordcount/wordSplitter.py
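The sample mapper is written for Python 2. As a rough sketch, the same logic could look like this in Python 3, together with a small local imitation of what the built-in `aggregate` reducer does with `LongValueSum:` records. The helper `aggregate_long_value_sum` is hypothetical and only mimics the summing behaviour; it is not the Hadoop implementation.

```python
import re
from collections import Counter

PREFIX = "LongValueSum:"
PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def split_words(lines):
    """Python 3 version of the mapper: one prefixed record per word."""
    for line in lines:
        for word in PATTERN.findall(line):
            yield PREFIX + word.lower() + "\t1"


def aggregate_long_value_sum(records):
    """Imitate the aggregate reducer: strip the prefix and sum per key."""
    totals = Counter()
    for record in records:
        key, value = record.split("\t", 1)
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


print(aggregate_long_value_sum(split_words(["I count words", "words with Hadoop"])))
```

The prefix is how the mapper tells the reducer which aggregation to apply, which is why no reducer script has to be written for a plain word count.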

Filesystems

EMRFS is an implementation of HDFS, with reading and writing of files directly to S3.

HDFS should be used to cache results of intermediate steps.

Hadoop's S3 block filesystem (s3://) is block-based just like HDFS. The S3 native filesystem (s3n://) is file-based, so the files can be accessed with other tools, but file size is limited to 5 GB.

S3 is not a file system, it is a RESTish object storage.

S3 has eventual consistency: files written to S3 might not be immediately available for reading.

EMRFS can be configured to encrypt files in S3 and to monitor the consistency of files, which can detect events that try to use inconsistent files.

http://wiki.apache.org/hadoop/AmazonS3

Our dataset

Digitraffic is a service offering real-time information and data about the traffic, weather and road conditions on the Finnish main roads.

The service is provided by the Finnish Transport Agency (Liikennevirasto), and produced by Gofore and Infotripla.

Digitraffic: http://www.infotripla.fi/digitraffic/doku.php?id=start_en

Finnish Transport Agency: http://www.liikennevirasto.fi/

Gofore: http://gofore.com/

Infotripla: http://infotripla.fi/

Traffic fluency

Our data consists of traffic fluency information, i.e. how quickly vehicles have been identified to pass through a road segment (a link).

Data is gathered with camera-based Automatic License Plate Recognition (ALPR), and more recently with mobile-device-based Floating Car Data (FCD).

ALPR: http://en.wikipedia.org/wiki/Automatic_number_plate_recognition

FCD: http://en.wikipedia.org/wiki/Floating_car_data

Travel time link network

http://www.infotripla.fi/digitraffic/lib/exe/fetch.php?media=linkkiverkosto.pdf

<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
    <link id="410102" data_source="1">
      <recognition offset_seconds="6" travel_time="126"></recognition>
      <recognition offset_seconds="45" travel_time="152"></recognition>
    </link>
    <link id="810502" data_source="1">
      <recognition offset_seconds="25" travel_time="66"></recognition>
      <recognition offset_seconds="34" travel_time="79"></recognition>
      <recognition offset_seconds="35" travel_time="67"></recognition>
      <recognition offset_seconds="53" travel_time="58"></recognition>
    </link>
  </recognitions>
</ivjtdata>

Each file contains the finished passthroughs for each road segment during one minute.
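As an illustration of how such a one-minute file can be read, here is a short sketch using Python's standard `xml.etree.ElementTree`. The function name `parse_minute` and the shortened `SAMPLE` document are made up for this example; the element and attribute names are taken from the file shown above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file, for illustration only.
SAMPLE = """<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
  </recognitions>
</ivjtdata>"""


def parse_minute(xml_text):
    """Return (periodstart, {link id: [travel times as integers]})."""
    root = ET.fromstring(xml_text)
    links = {}
    for link in root.iter("link"):
        links[link.get("id")] = [
            int(recognition.get("travel_time"))
            for recognition in link.iter("recognition")
        ]
    return root.get("periodstart"), links


print(parse_minute(SAMPLE))
```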

Some numbers

6.5 years' worth of data from January 2008 to June 2014

3.9 million XML files (525600 files per year)

6.3 GB of compressed archives (with 7.5 GB of additional median data as CSV)

42 GB of data as XML (and 13 GB as CSV)

Potential research questions

1. Do people drive faster during the night?

2. Does winter time have fewer recognitions (either due to fewer cars or snowy plates)?

3. How well does the number of recognitions correlate with speed (rush hour probably slows travel, but are speeds higher during days with less traffic)?

4. Is it possible to identify speed limits from the travel times? How much dispersion is there in speeds?

5. When do speed limits change (winter and summer limits)?

Munging

The small files problem

Unpacked the tar.gz archives and uploaded the XML files as such to S3 (using AWS CLI tools, http://aws.amazon.com/cli/).

Turns out that 4 million small (11 kB) files with Hadoop is not fun. Hadoop does not cope well with files significantly smaller than the HDFS block size (default 64 MB). See:

http://amilaparanawithana.blogspot.fi/2012/06/small-file-problem-in-hadoop.html

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

And well, XML is not fun either, so...
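To put the small files problem in numbers, here is a back-of-the-envelope calculation from the figures quoted in the slides. It assumes decimal gigabytes for the data set size and binary megabytes for the block size; neither unit convention is stated in the source.

```python
XML_BYTES = 42 * 10**9          # "42 GB of data as XML" (decimal GB assumed)
FILE_COUNT = 3_900_000          # "3.9 million XML files"
HDFS_BLOCK_BYTES = 64 * 2**20   # default block size of 64 MB (MiB assumed)

# One file per minute gives the quoted number of files per year.
minutes_per_year = 60 * 24 * 365

avg_file_bytes = XML_BYTES / FILE_COUNT
files_per_block = HDFS_BLOCK_BYTES / avg_file_bytes

print(f"files per year:        {minutes_per_year}")
print(f"average file size:     {avg_file_bytes / 1000:.1f} kB")
print(f"files per HDFS block:  {files_per_block:.0f}")
```

In other words, each file fills well under a thousandth of a block, so per-file overhead dominates the actual processing.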

JSONify all the things!

Wrote scripts to parse, munge and upload data

Concatenated data into bigger files, calculated some extra data, and converted it into JSON. Size reduced to 60% of the original XML.

First munged 1-day files (10-20 MB each) and later 1-month files (180-540 MB each)

Munging 6.5 years' worth of XML takes 8.5 hours on a single t2.medium instance

Static link information (120 kB JSON)

{
  "sites": [
    { "id": "1205", "name": "Viinikka", "lat": 61.488282, "lon": 23.779057, "rno": "3495", "tro": "3495/1-2930" }
  ],
  "links": [
    { "id": "99001041", "name": "Hallila -> Viinikka", "dist": 5003.0, "site_start": "1108", "site_end": "1205" }
  ]
}

{
  "date": "2014-06-01T02:52:00Z",
  "recognitions": [
    { "id": "4510201", "tt": 117, "cars": 4, "itts": [ 100, 139, 121, 110 ] }
  ]
}

Programming EMR

Alternatives for the web interface

AWS Command line tools: http://aws.amazon.com/cli/

SDKs like boto for Python: http://docs.pythonboto.org/en/latest/

Connect to EMR

    #!/usr/bin/env python

    import boto.emr
    from boto.emr.instance_group import InstanceGroup

    # Requires that AWS API credentials have been exported as env variables
    connection = boto.emr.connect_to_region('eu-west-1')

Specify EC2 instances

    instance_groups = []
    instance_groups.append(InstanceGroup(
        role="MASTER", name="Main node",
        type="m1.medium", num_instances=1,
        market="ON_DEMAND"))
    instance_groups.append(InstanceGroup(
        role="CORE", name="Worker nodes",
        type="m1.medium", num_instances=3,
        market="ON_DEMAND"))
    instance_groups.append(InstanceGroup(
        role="TASK", name="Optional spot-price nodes",
        type="m1.medium", num_instances=20,
        market="SPOT", bidprice=0.012))

Start EMR cluster

    cluster_id = connection.run_jobflow(
        "Our awesome cluster",
        instance_groups=instance_groups,
        action_on_failure='CANCEL_AND_WAIT',
        keep_alive=True,
        enable_debugging=True,
        log_uri="s3://our-s3-bucket/logs/",
        ami_version="3.3.1",
        bootstrap_actions=[],
        ec2_keyname="name-of-our-ssh-key",
        visible_to_all_users=True,
        job_flow_role="EMR_EC2_DefaultRole",
        service_role="EMR_DefaultRole")

Add job step to cluster

    steps = []
    steps.append(boto.emr.step.StreamingStep(
        "Our awesome streaming app",
        input="s3://our-s3-bucket/our-input-data",
        output="s3://our-s3-bucket/our-output-path/",
        mapper="our-mapper.py",
        reducer="aggregate",
        # Files listed here are copied to every node; the part after '#'
        # is the local name the mapper sees.
        cache_files=[
            "s3://our-s3-bucket/programs/our-mapper.py#our-mapper.py",
            "s3://our-s3-bucket/data/our-dictionary.json#our-dictionary.json",
        ],
        action_on_failure='CANCEL_AND_WAIT',
        jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar'))
    connection.add_jobflow_steps(cluster_id, steps)

Recap

    #!/usr/bin/env python

    import boto.emr
    from boto.emr.instance_group import InstanceGroup

    connection = boto.emr.connect_to_region('eu-west-1')
    cluster_id = connection.run_jobflow(**cluster_parameters)
    connection.add_jobflow_steps(cluster_id, **steps_parameters)
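Since starting a cluster takes a while, a script usually has to wait before the results are available. Below is a hedged sketch of a polling helper. It assumes that the boto 2 connection offers `describe_cluster(cluster_id)` and that the returned object exposes the state as `.status.state`; treat both as assumptions to check against the boto documentation. The helper name `wait_for_state` and its parameters are made up for this example.

```python
import time

# Cluster states as named by the EMR API.
FAILED_STATES = ("TERMINATED", "TERMINATED_WITH_ERRORS")


def wait_for_state(connection, cluster_id, wanted=("WAITING",),
                   poll_seconds=30, sleep=time.sleep):
    """Poll the cluster until it reaches a wanted state or has terminated."""
    while True:
        # ASSUMPTION: describe_cluster() returns an object with .status.state
        state = connection.describe_cluster(cluster_id).status.state
        if state in wanted:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"Cluster {cluster_id} ended in state {state}")
        sleep(poll_seconds)
```

With the connection and cluster id from the recap, the call would be `wait_for_state(connection, cluster_id)`. The `sleep` parameter exists only so that the loop can be exercised without real delays.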

Step 1 of 2: Run mapreduce

    # Create new cluster
    aws-tools/run-jobs.py create-cluster "Car speed counting cluster"

    Starting cluster j-F0K0A4Q9F5O0 Car speed counting cluster

    # Add job step to the cluster
    aws-tools/run-jobs.py run-step j-F0K0A4Q9F5O0 05-car-speed-for-time-of-day_map.py digitraffic/munged/links-by-month/2014

    Step will output data to s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/
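The mapper script itself is not shown in the slides. As a purely hypothetical sketch of what a speed-by-hour mapper for the `aggregate` reducer could look like, the code below assumes one munged JSON record per input line, that `dist` in the static link information is in metres, and that the aggregate package accepts a `DoubleValueSum:` prefix alongside `LongValueSum:`. The key names `<hour>_kmh_sum` and `<hour>_count` are invented for this example.

```python
import json


def map_record(line, link_dist_m):
    """Emit a speed sum and a count per hour of day for one munged record."""
    record = json.loads(line)
    hour = record["date"][11:13]  # hour of day, in the timestamp's time zone
    for recognition in record["recognitions"]:
        dist = link_dist_m.get(recognition["id"])
        if dist is None:
            continue  # link missing from the static link information
        for travel_time in recognition["itts"]:
            if travel_time <= 0:
                continue  # guard against division by zero
            kmh = dist / travel_time * 3.6  # metres per second to km/h
            yield f"DoubleValueSum:{hour}_kmh_sum\t{kmh:.1f}"
            yield f"LongValueSum:{hour}_count\t1"


sample = json.dumps({
    "date": "2014-06-01T02:52:00Z",
    "recognitions": [{"id": "4510201", "tt": 150, "cars": 2, "itts": [100, 200]}],
})
for output_line in map_record(sample, {"4510201": 1000.0}):
    print(output_line)
```

Emitting a sum and a count separately is one way to get averages out of a reducer that can only add numbers up; the division then happens in the analysis step.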

Step 2 of 2: Analyze results

    # Download and concatenate output
    aws s3 cp s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/ /tmp/emr --recursive --profile hadoop-seminar-emr

    cat /tmp/emr/part-* > /tmp/emr/output

    # Analyze results
    result-analysis/05_speed_during_day/05-car-speed-for-time-of-day_output.py /tmp/emr/output example-data/locationdata.json
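The analysis script is not reproduced in the slides either. The following sketch shows one way the concatenated output could be turned into an average speed per hour. It assumes tab-separated `key<TAB>value` lines with hypothetical keys of the form `<hour>_kmh_sum` and `<hour>_count`; the real output format of the job may differ.

```python
def mean_speed_by_hour(lines):
    """Combine per-hour sums and counts into per-hour mean speeds (km/h)."""
    sums, counts = {}, {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        hour, _, kind = key.partition("_")
        if kind == "kmh_sum":
            sums[hour] = float(value)
        elif kind == "count":
            counts[hour] = int(value)
    # Only report hours that have both a sum and a non-zero count.
    return {hour: sums[hour] / counts[hour]
            for hour in sorted(sums) if counts.get(hour)}


sample_output = ["02_count\t2", "02_kmh_sum\t54.0", "14_count\t4", "14_kmh_sum\t320.0"]
print(mean_speed_by_hour(sample_output))
```

For the real data, the same function could be fed the lines of `/tmp/emr/output` via `open("/tmp/emr/output")`.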

Some statistics

We experimented with different input files and cluster sizes

Execution time was about half an hour with a small cluster and 30 small 15-20 MB files

The same input parsed with a simple Python script took about 5 minutes

A larger cluster and 6 larger 500 MB files took 17 minutes.

"Too small problem for EMR/Hadoop"

Summary

Takeaways

Make sure your problem is big enough for Hadoop

Munging wisely makes streaming programs easier and faster

Always use Spot instances with EMR

Further reading

Amazon EMR Developer Guide: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html

Amazon EMR Best practices: https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf

Ubuntu MaaS blog: Scaling a 2000-node Hadoop cluster on EC2, https://maas.ubuntu.com/2012/06/04/scaling-a-2000-node-hadoop-cluster-on-ec2ubuntu-with-juju/

Big Data Borat: "Quiz: Is it a Pokemon or a big data technology?", http://www.slate.com/blogs/future_tense/2014/05/02/big_data_borat_tests_people_on_pok_mon_versus_big_data_technology_names.html
