Amazon Elastic MapReduce (EMR): Hadoop as a Service
![Page 1: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/1.jpg)
Amazon Elastic MapReduce
Ville Seppänen | Jari Voutilainen | @Vilsepi @Zharktas @GoforeOy
![Page 2: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/2.jpg)
Agenda
1. Introduction to Hadoop Streaming and Elastic MapReduce
2. Simple EMR web interface demo
3. Introduction to our dataset
4. Using EMR from command line with boto

All presentation material is available at https://github.com/gofore/aws-emr
![Page 3: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/3.jpg)
Hadoop Streaming
Utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input my/Input/Directories \
        -output my/Output/Directory \
        -mapper myMapperProgram.py \
        -reducer myReducerProgram.py

    cat input_data.txt | mapper.py | reducer.py > output_data.txt
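The shell pipeline above can also be mimicked in plain Python to check mapper and reducer logic before a job is submitted. The following is only a minimal sketch, assuming a word-count style mapper and reducer; the names `mapper`, `reducer` and `run_streaming` are hypothetical and not part of Hadoop. Hadoop sorts the mapper output by key before the reduce phase, which is reproduced here with `sorted`.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line per word, like a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield word.lower() + "\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key + "\t" + str(sum(int(value) for _, value in group))


def run_streaming(lines):
    """Local stand-in for map -> sort -> reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_streaming(["I count words", "words with Hadoop"]):
        print(out)
```

The reducer relies on its input being sorted by key, which is why the sort step sits between the two phases.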
![Page 4: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/4.jpg)
Elastic MapReduce (EMR)
![Page 5: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/5.jpg)
Amazon Elastic MapReduce
Hadoop-based MapReduce cluster as a service
Can run either Amazon-optimized Hadoop or MapR
Managed from a web UI or through API
![Page 6: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/6.jpg)
Hadoop streaming in EMR
![Page 7: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/7.jpg)
Quick EMR demo
![Page 8: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/8.jpg)
The endlessly fascinating example of counting words in Hadoop
![Page 9: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/9.jpg)
Cluster creation steps
Cluster: name, logging
Tags: keywords for the cluster
Software: Hadoop distribution and version, pre-installed applications (Hive, Pig,...)
File System: encryption, consistency
Hardware: number and type of instances
Security and Access: ssh keys, node access roles
Bootstrap Actions: scripts to customize the cluster
Steps: a queue of mapreduce jobs for the cluster
![Page 10: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/10.jpg)
(mapper) WordSplitter.py

    #!/usr/bin/python
    import sys
    import re

    # A word starts with a letter and continues with letters or digits
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            print("LongValueSum:" + word.lower() + "\t" + "1")

    LongValueSum:i 1
    LongValueSum:count 1
    LongValueSum:words 1
    LongValueSum:with 1
    LongValueSum:hadoop 1
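The `LongValueSum:` prefix in the mapper output tells Hadoop's built-in `aggregate` reducer to sum the values for each key. The sketch below is not Hadoop's implementation, only an illustration of the effect under that assumption; the function name `long_value_sum` is hypothetical.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def long_value_sum(mapper_lines):
    """Sum the integer values of "LongValueSum:<key><TAB><value>" lines per key."""
    totals = defaultdict(int)
    for line in mapper_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "LongValueSum:i\t1",
        "LongValueSum:count\t1",
        "LongValueSum:words\t1",
        "LongValueSum:words\t1",
    ]
    for word, count in sorted(long_value_sum(sample).items()):
        print(word + "\t" + str(count))
```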
![Page 11: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/11.jpg)
Filesystems
EMRFS is an implementation of HDFS that reads and writes files directly to S3.
HDFS should be used to cache results of intermediate steps.
S3 is block-based just like HDFS. S3n is file-based, so its files can be accessed with other tools, but file size is limited to 5 GB.
![Page 12: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/12.jpg)
S3 is not a file system, it is a RESTish object storage.
S3 has eventual consistency: files written to S3 might not be immediately available for reading.
EMRFS can be configured to encrypt files in S3 and to monitor the consistency of files, which can detect events that try to use inconsistent files.
http://wiki.apache.org/hadoop/AmazonS3
![Page 13: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/13.jpg)
Our dataset
![Page 14: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/14.jpg)
Digitraffic is a service offering real-time information and data about the traffic, weather and conditions on the Finnish main roads.
The service is provided by the Finnish Transport Agency (Liikennevirasto), and produced by Gofore and Infotripla.
![Page 15: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/15.jpg)
Traffic fluency
Our data consists of traffic fluency information, i.e. how quickly vehicles have been identified to pass through a road segment (a link).
Data is gathered with camera-based Automatic License Plate Recognition (ALPR), and more recently with mobile-device-based Floating Car Data (FCD).
![Page 16: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/16.jpg)
Travel time link network
![Page 17: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/17.jpg)
    <link>
      <linkno>310102</linkno>
      <startsite>1108</startsite>
      <endsite>1107</endsite>
      <name language="en">Hallila -> Kaukajärvi</name>
      <name language="fi">Hallila -> Kaukajärvi</name>
      <name language="sv">Hallila -> Kaukajärvi</name>
      <distance>
        <value>3875.000</value>
        <unit>m</unit>
      </distance>
    </link>

Static link information (271 kB XML)
642 one-way links, 243 sites
![Page 18: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/18.jpg)
    <ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
      <recognitions>
        <link id="110302" data_source="1">
          <recognition offset_seconds="8" travel_time="152"></recognition>
          <recognition offset_seconds="36" travel_time="155"></recognition>
        </link>
        <link id="410102" data_source="1">
          <recognition offset_seconds="6" travel_time="126"></recognition>
          <recognition offset_seconds="45" travel_time="152"></recognition>
        </link>
        <link id="810502" data_source="1">
          <recognition offset_seconds="25" travel_time="66"></recognition>
          <recognition offset_seconds="34" travel_time="79"></recognition>
          <recognition offset_seconds="35" travel_time="67"></recognition>
          <recognition offset_seconds="53" travel_time="58"></recognition>
        </link>
      </recognitions>
    </ivjtdata>

Each file contains finished passthroughs for each road segment during one minute.
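One of these per-minute files can be read with the Python standard library. The sketch below assumes the element and attribute names shown in the sample above; the helper name `travel_times_by_link` and the shortened `SAMPLE` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file above
SAMPLE = """<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
    <link id="810502" data_source="1">
      <recognition offset_seconds="25" travel_time="66"></recognition>
      <recognition offset_seconds="34" travel_time="79"></recognition>
      <recognition offset_seconds="35" travel_time="67"></recognition>
      <recognition offset_seconds="53" travel_time="58"></recognition>
    </link>
  </recognitions>
</ivjtdata>"""


def travel_times_by_link(xml_text):
    """Return (period start, {link id: [travel times]}) for one XML file."""
    root = ET.fromstring(xml_text)
    times = {}
    for link in root.iter("link"):
        times[link.get("id")] = [
            int(rec.get("travel_time")) for rec in link.iter("recognition")
        ]
    return root.get("periodstart"), times


if __name__ == "__main__":
    start, times = travel_times_by_link(SAMPLE)
    print(start)
    for link_id, values in sorted(times.items()):
        print(link_id, len(values), values)
```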
![Page 19: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/19.jpg)
Some numbers
6.5 years worth of data from January 2008 to June 2014
3.9 million XML files (525600 files per year)
6.3 GB of compressed archives (with 7.5 GB of additional median data as CSV)
42 GB of data as XML (and 13 GB as CSV)
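The per-year file count follows from one file per minute, and dividing the XML volume by the file count gives a rough average file size. A small sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes) and a hypothetical helper name:

```python
# One file per minute for a 365-day year
MINUTES_PER_YEAR = 60 * 24 * 365


def average_file_size_kb(total_gb, file_count):
    """Average file size in kB, assuming 1 GB = 1e9 bytes and 1 kB = 1e3 bytes."""
    return total_gb * 1e9 / file_count / 1e3


if __name__ == "__main__":
    print(MINUTES_PER_YEAR)
    print(round(average_file_size_kb(42, 3.9e6), 1))
```

The result is in line with the roughly 11 kB per file mentioned for the small files problem.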
![Page 20: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/20.jpg)
Potential research questions
1. Do people drive faster during the night?
2. Does winter time have fewer recognitions (either due to fewer cars or snowy plates)?
3. How well does the number of recognitions correlate with speed (rush hour probably slows travel, but are speeds higher during days with less traffic)?
4. Is it possible to identify speed limits from the travel times? How much dispersion is there in speeds?
5. When do speed limits change (winter and summer limits)?
![Page 21: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/21.jpg)
Munging
![Page 22: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/22.jpg)
The small files problem
Unpacked the tar.gz archives and uploaded the XML files as such to S3 (using AWS CLI tools).
Turns out that 4 million small (11 kB) files with Hadoop is not fun. Hadoop does not cope well with files significantly smaller than the HDFS block size (default 64 MB).
And well, XML is not fun either, so...
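The cost of small files can be estimated with simple arithmetic. The sketch below assumes that Hadoop creates at least one input split (and thus one map task) per file and one split per block for larger, splittable files; the function name `estimated_splits` is hypothetical, and real split counts depend on the input format and configuration.

```python
import math


def estimated_splits(file_count, file_size_bytes, block_size_bytes=64 * 1024 ** 2):
    """Rough number of input splits: at least one per file, one per block otherwise."""
    per_file = max(1, int(math.ceil(file_size_bytes / float(block_size_bytes))))
    return file_count * per_file


if __name__ == "__main__":
    # About 44 GB either way: 4 million 11 kB files versus 88 files of 500 MB
    print(estimated_splits(4000000, 11000))
    print(estimated_splits(88, 500000000))
```

Under these assumptions the same amount of data needs millions of map tasks as small files but only hundreds as large files.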
![Page 23: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/23.jpg)
JSONify all the things!
Wrote scripts to parse, munge and upload data
Concatenated data into bigger files, calculated some extra data, and converted it into JSON. Size reduced to 60% of the original XML.
First munged 1-day files (10-20 MB each) and later 1-month files (180-540 MB each)
Munging 6.5 years' worth of XML takes 8.5 hours on a single t2.medium instance
![Page 24: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/24.jpg)
Static link information (120 kB JSON)

    {
      "sites": [
        {
          "id": "1205",
          "name": "Viinikka",
          "lat": 61.488282,
          "lon": 23.779057,
          "rno": "3495",
          "tro": "3495/1-2930"
        }
      ],
      "links": [
        {
          "id": "99001041",
          "name": "Hallila -> Viinikka",
          "dist": 5003.0,
          "site_start": "1108",
          "site_end": "1205"
        }
      ]
    }
![Page 25: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/25.jpg)
    {
      "date": "2014-06-01T02:52:00Z",
      "recognitions": [
        {
          "id": "4510201",
          "tt": 117,
          "cars": 4,
          "itts": [ 100, 139, 121, 110 ]
        }
      ]
    }
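A streaming mapper over records like the one above could look roughly as follows. This is only a sketch: it assumes that each munged record is stored as one JSON object per line (the file layout is not shown here), and the function name `map_record` is hypothetical. It emits the number of recognized cars per hour of day in the format expected by the `aggregate` reducer.

```python
import json


def map_record(line):
    """Turn one munged JSON record into "LongValueSum:<hour><TAB><cars>" lines."""
    record = json.loads(line)
    hour = record["date"][11:13]  # "2014-06-01T02:52:00Z" -> "02"
    for recognition in record["recognitions"]:
        yield "LongValueSum:%s\t%d" % (hour, recognition["cars"])


if __name__ == "__main__":
    sample = ('{"date": "2014-06-01T02:52:00Z", "recognitions": '
              '[{"id": "4510201", "tt": 117, "cars": 4, '
              '"itts": [100, 139, 121, 110]}]}')
    for output_line in map_record(sample):
        print(output_line)
```

In an actual streaming job the input lines would come from `sys.stdin` instead of the hard-coded sample.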
![Page 26: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/26.jpg)
Programming EMR
![Page 27: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/27.jpg)
Alternatives for the web interface
AWS command line tools
SDKs like boto for Python
![Page 28: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/28.jpg)
Connect to EMR

    #!/usr/bin/env python

    import boto.emr
    from boto.emr.instance_group import InstanceGroup

    # Requires that AWS API credentials have been exported as env variables
    connection = boto.emr.connect_to_region('eu-west-1')
![Page 29: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/29.jpg)
Specify EC2 instances

    instance_groups = []
    instance_groups.append(InstanceGroup(
        role="MASTER", name="Main node",
        type="m1.medium", num_instances=1,
        market="ON_DEMAND"))
    instance_groups.append(InstanceGroup(
        role="CORE", name="Worker nodes",
        type="m1.medium", num_instances=3,
        market="ON_DEMAND"))
    instance_groups.append(InstanceGroup(
        role="TASK", name="Optional spot-price nodes",
        type="m1.medium", num_instances=20,
        market="SPOT", bidprice=0.012))
![Page 30: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/30.jpg)
Start EMR cluster

    cluster_id = connection.run_jobflow(
        "Our awesome cluster",
        instance_groups=instance_groups,
        action_on_failure='CANCEL_AND_WAIT',
        keep_alive=True,
        enable_debugging=True,
        log_uri="s3://our-s3-bucket/logs/",
        ami_version="3.3.1",
        bootstrap_actions=[],
        ec2_keyname="name-of-our-ssh-key",
        visible_to_all_users=True,
        job_flow_role="EMR_EC2_DefaultRole",
        service_role="EMR_DefaultRole")
![Page 31: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/31.jpg)
Add job step to cluster

    steps = []
    steps.append(boto.emr.step.StreamingStep(
        "Our awesome streaming app",
        input="s3://our-s3-bucket/our-input-data",
        output="s3://our-s3-bucket/our-output-path/",
        mapper="our-mapper.py",
        reducer="aggregate",
        # Files copied to every node; the part after '#' is the local name
        cache_files=[
            "s3://our-s3-bucket/programs/our-mapper.py#our-mapper.py",
            "s3://our-s3-bucket/data/our-dictionary.json#our-dictionary.json",
        ],
        action_on_failure='CANCEL_AND_WAIT',
        jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar'))
    connection.add_jobflow_steps(cluster_id, steps)
![Page 32: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/32.jpg)
Recap

    #!/usr/bin/env python

    import boto.emr
    from boto.emr.instance_group import InstanceGroup

    connection = boto.emr.connect_to_region('eu-west-1')
    cluster_id = connection.run_jobflow(**cluster_parameters)
    connection.add_jobflow_steps(cluster_id, **steps_parameters)
![Page 33: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/33.jpg)
Step 1 of 2: Run mapreduce

    # Create new cluster
    aws-tools/run-jobs.py create-cluster "Car speed counting cluster"

    Starting cluster j-F0K0A4Q9F5O0 Car speed counting cluster

    # Add job step to the cluster
    aws-tools/run-jobs.py run-step j-F0K0A4Q9F5O0 05-car-speed-for-time-of-day_map.py digitraffic/munged/links-by-month/2014

    Step will output data to s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/
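The output location above combines a timestamp with the mapper name. Hadoop refuses to write to an output directory that already exists, so a unique path per run avoids collisions. How `run-jobs.py` actually builds the path is not shown here; the sketch below is a hypothetical helper (`output_path`) that produces a path of the same shape.

```python
from datetime import datetime


def output_path(bucket, mapper_name, now=None):
    """Build a unique, timestamped S3 output path for one job step."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return "s3://%s/digitraffic/outputs/%s_%s/" % (bucket, stamp, mapper_name)


if __name__ == "__main__":
    print(output_path("hadoop-seminar-emr",
                      "05-car-speed-for-time-of-day_map.py",
                      datetime(2015, 2, 18, 11, 8, 24)))
```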
![Page 34: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/34.jpg)
Step 2 of 2: Analyze results

    # Download and concatenate output
    aws s3 cp s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/ /tmp/emr --recursive --profile hadoop-seminar-emr

    cat /tmp/emr/part-* > /tmp/emr/output

    # Analyze results
    result-analysis/05_speed_during_day/05-car-speed-for-time-of-day_output.py /tmp/emr/output example-data/locationdata.json
![Page 35: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/35.jpg)
![Page 36: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/36.jpg)
![Page 37: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/37.jpg)
Some statistics
We experimented with different input files and cluster sizes
Execution time was about half an hour with a small cluster and 30 small 15-20 MB files
The same input parsed with a simple Python script took about 5 minutes
A larger cluster and 6 larger 500 MB files took 17 minutes.
"Too small problem for EMR/Hadoop"
![Page 38: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/38.jpg)
Summary
![Page 39: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/39.jpg)
Takeaways
Make sure your problem is big enough for Hadoop
Munging wisely makes streaming programs easier and faster
Always use Spot instances with EMR
![Page 40: Amazon Elastic MapReduce (EMR): Hadoop as a Service](https://reader033.fdocuments.net/reader033/viewer/2022050923/55cae60cbb61eb39788b4825/html5/thumbnails/40.jpg)
Further reading
Amazon EMR Developer Guide
Amazon EMR Best practices
Ubuntu MaaS blog: Scaling a 2000-node Hadoop cluster on EC2
Big Data Borat: "Quiz: Is it a Pokemon or a big data technology?"