GoodOle Hadoop!$How$We$Love$...

Good Ole Hadoop! How We Love Thee!

Jerome E. Mitchell

Indiana University

Outline

•  Why Should You Care about Hadoop? •  What Exactly is Hadoop? •  Overview of Hadoop MapReduce and Hadoop Distributed File System (HDFS)

•  Workflow •  Conclusions and References •  GeOng Started with Hadoop

Why Should You Care? – Lots of Data

LOTS OF DATA EVERYWHERE

Why Should You Care? – Even Grocery Stores Care

September 2, 2009

Tesco, British Grocer, Uses Weather to Predict Sales By Julia Werdigier

LONDON — Britain oUen conjures images of unpredictable weather, with downpours someXmes followed by sunshine within the same hour — several Xmes a day. “Rapidly changing weather can be a real challenge,” Jonathan Church, a Tesco spokesman, said in a statement. “The system successfully predicted temperature drops during July that led to a major increase in demand for soup, winter vegetables and cold-‐weather puddings.”

What is Hadoop?

•  At Google MapReduce operaXon are run on a special file system called Google File System (GFS) that is highly opXmized for this purpose.

•  GFS is not open source. •  Doug CuOng and Yahoo! reverse engineered the GFS and called it Hadoop Distributed File System (HDFS).

•  The soUware framework that supports HDFS, MapReduce and other related enXXes is called the project Hadoop or simply Hadoop.

•  This is open source and distributed by Apache.

What Exactly is Hadoop?

•  A growing collecXon of subprojects

Why Hadoop for BIG Data? •  Most creditable open source toolkit for large scale, general

purpose compuXng •  Backed by ,

•  Used by , , and many others

•  Increasing support from web services •  Hadoop closely imitates infrastructure developed by

•  Hadoop processes petabytes daily, right now

MoXvaXon for MapReduce

•  Large-‐Scale Data Processing – Want to use 1000s of CPUs

•  But, don’t want the hassle of managing things

•  MapReduce Architecture provides –  AutomaXc ParallelizaXon and DistribuXon –  Fault Tolerance –  I/O Scheduling – Monitoring and Status Updates

MapReduce Model •  Input & Output: a set of key/value pairs •  Two primiXve operaXons

–  map: (k1,v1) à list(k2,v2) –  reduce: (k2,list(v2)) à list(k3,v3)

•  Each map operaXon processes one input key/value pair and produces a set of key/value pairs

•  Each reduce operaXon – Merges all intermediate values (produced by map ops) for a parXcular key

–  Produce final key/value pairs •  OperaXons are organized into tasks

– Map tasks: apply map operaXon to a set of key/value pairs –  Reduce tasks: apply reduce operaXon to intermediate key/value pairs

HDFS Architecture

The WorkFlow

•  Load data into the Cluster (HDFS writes) •  Analyze the data (MapReduce) •  Store results in the Cluster (HDFS) •  Read the results from the Cluster (HDFS reads)

Distributed Data Processing

Slaves

Master

Client

Distributed Data Storage

Name Node Secondary Name

Data Nodes & Task Tracker

MapReduce HDFS

Job Tracker

Hadoop Server Roles

Hadoop Cluster

Switch

Switch Switch Switch

DN + TT DN + TT DN + TT DN + TT

DN + TT DN + TT DN + TT DN + TT DN + TT

Switch

Rack 1 Rack 2 Rack 3 Rack N

Hadoop Rack Awareness – Why?

Data Node 1

Data Node 2

Data Node 3 Data Node 5 Rack 1 Rack 5 Rack 9

Data Node 5

Data Node 6

Data Node 7 Data Node 8

Data Node 5

Data Node 10

Data Node 11

Data Node 12

Switch

Name Node

Results.txt = BLK A: DN1, DN5, DN6 BLK B: DN7, DN1, DN2 BLK C DN5, DN8, DN9

metatadata

Rack 1: Data Node 1 Data Node 2 Data Node 3 Rack 1: Data Node 5 Data Node 6 Data Node 7

Rack Aware

Sample Scenario

Huge file containing all emails sent to Indiana University… File.txt

How many Xmes did our customers type the word refund into emails sent to Indiana University?

WriXng Files to HDFS File.txt

Data Node 1 Data Node 5 Data Node 6

BLK A BLK B BLK C

Data Node N

Client

NameNode

Data Node Reading Files From HDFS

Data Node 1

Data Node 2

Data Node Data Node Rack 1 Rack 5 Rack 9

Data Node 1

Data Node 2

Data Node Data Node

Data Node 1

Data Node 2

Data Node Data Node

Switch

Name Node

metatadata Rack 1: Data Node 1 Data Node 2 Data Node 3 Rack 1: Data Node 5

Rack Aware

MapReduce: Three Phases

1. Map 2. Sort

3. Reduce

Data Processing: Map

Map Task Map Task Map Task

BLK A BLK B BLK C

Job Tracker Name Node

File.txt

Refund = 3 Refund = 0 Refund= 11

MapReduce: The Map Step

map v k

k v map

Input key-‐value pairs

Intermediate key-‐value pairs

The Map (Example)

When in the course of human events it …

It was the best of Xmes and the worst of Xmes…

map (in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …

(when,1), (course,1) (human,1) (events,1) (best,1) …

inputs tasks (M=3) parFFons (intermediate files) (R=2)

This paper evaluates the suitability of the … map (this,1) (paper,1) (evaluates,1) (suitability,1) …

(the,1) (of,1) (the,1) …

Over the past five years, the authors and many…

map (over,1), (past,1) (five,1) (years,1) (authors,1) (many,1) …

(the,1), (the,1) (and,1) …

Data Processing: Reduce

Map Task Map Task Map Task

Reduce Task Results.txt HDFS

BLK A BLK B BLK C

Client Job Tracker

Refund = 3

Refund = 0

Data Node 3 X Y Z

MapReduce: The Reduce Step

Intermediate key-‐value pairs

reduce

Key-‐value groups Output key-‐value pairs

The Reduce (Example)

reduce

(in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …

(the,1) (of,1) (the,1) …

reduce task parFFon (intermediate files) (R=2)

(the,1), (the,1) (and,1) …

(and, (1)) (in,(1)) (it, (1,1)) (the, (1,1,1,1,1,1)) (of, (1,1,1)) (was,(1))

(and,1) (in,1) (it, 2) (of, 3) (the,6) (was,1) Note: only one of the two reduce tasks shown

Clients Reading Files from HDFS

Data Node 1

Data Node 2

Data Node Data Node Rack 1 Rack 5 Rack 9

Data Node 1

Data Node 2

Data Node Data Node

Data Node 1

Data Node 2

Data Node Data Node

Client NameNode

metatadata

Conclusions

•  We introduced MapReduce programming model for processing large scale data

•  We discussed the supporXng Hadoop Distributed File System

•  The concepts were illustrated using a simple example •  We reviewed some important parts of the source code for the example.

References

1.  Apache Hadoop Tutorial: hsp://hadoop.apache.org hsp://hadoop.apache.org/core/docs/current/mapred_tutorial.html

2.  Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communica)on of ACM 51, 1 (Jan. 2008), 107-‐113.

3.  Cloudera Videos by Aaron Kimball: hsp://www.cloudera.com/hadoop-‐training-‐basic

FINALLY, A HANDS-‐ON ASSIGNMENT!

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job

Job ConfiguraXon Parameters •  190+ parameters in

Hadoop •  Set manually or defaults

are used

Even though the Hadoop framework is wrisen in Java, programs for Hadoop need not to be coded in Java but can also be developed in other languages like Python or C++. CreaXng a launching program for your applicaXon The launching program configures: – The Mapper and Reducer to use – The output key and value types (input types are inferred from the InputFormat) – The locaXons for your input and output The launching program then submits the job and typically waits for it to complete A Map/Reduce may specify how it’s input is to be read by specifying an InputFormat to be used A Map/Reduce may specify how it’s output is to be wrisen by specifying an OutputFormat to be used

HANDS – ON SESSION

Now, Your Turn…

•  Verify the output from the wordcount program –  HINT: hadoop fs –ls output

•  View the output from the wordcount program –  HINT: hadoop fs –cat output/part-‐00000

Hadoop – The Good/Bad/Ugly

•  Hadoop is Good – that is why we are all here

•  Hadoop is not Bad – else we would not be here

•  Hadoop is someXmes Ugly – why? –  Out of the box configuraXon – not friendly –  Difficult to debug –  Performance – tuning/opXmizaXons is a black magic

Number of Maps/Reducers

•  mapred.tasktracker.map/reduce.tasks.maximum – Maximum tasks (map/reduce) for a tasktracker

•  Default: 2 •  SuggesXons

–  Recommended range – (cores_per_node)/2 to 2x(cores_per_node)

–  This value should be set according to the hardware specificaXon of cluster nodes and resource requirements of tasks (map/reduce)

GoodOle Hadoop!$How$We$Love$...

Documents

Transcript of GoodOle Hadoop!$How$We$Love$...

RACK and RACK-MOUNT OPTIONS CONFIGURATION GUIDE

5.2 Involutes cylindrical gear 5.2.1 Rack and Basic Rack Rack

ACM ScienceCloud 2013 2 - Data-Intensive Distributed ...datasys.cs.iit.edu/events/ScienceCloud2013/concluding-remarks.pdf · ACM ScienceCloud 2013 10 . ... ternatives to purchasing

Balancing & Coordination of Big Data in HDFS with ... · directories[1][2]. Datanode GNU/Linux operating system and datanode software. For every node (Commodity hardware/System) in

NameNode and DataNode Couplingfor a Power-proportional Hadoop Distributed File System

Rack power distribution and rack systems worldwide

Pipe Rack Rack Piping SAMSUNG Presentation SLIDE

ScienceCloud 2012 oresentation

RACK ER12W-18-27-35-42 IM V3mstrbrand.com/admin/uploads/products/manuals/27251_RACK...V1 RACK ER12W RACK ER18 RACK ER27 RACK ER35 RACKER42 INSTRUCTION MANUAL MANUAL DE INSTRUCCIONES

PROGRAM AGENDA - 3dsbiovia.com · ScienceCloud/Notebook: Secure cloud-based project management for a broad range of scientific workflows, using ... Biologics: A complete collaborative

Pipe rack & rack piping

STRUCTURAL RACK APPLICATIONS - nationwideshelving.com · STRUCTURAL RACK APPLICATIONS SELECTIVE RACK Selective Rack is the most commonly used pallet rack. It allows for the maximum

DataNode 硬碟空間配置 軟體貨櫃主機. DataNode 硬碟空間配置 ( 一 ) $ ssh dna1 $ df -h Filesystem Size Used Avail Use% Mounted on rootfs 19G 5.6G 12G 32% / none 19G

Vertiv™ Knürr INR Network Rack INR Server Rack KNÜRR INR NETWORK RACK - INR SERVER RACK 3 VERTIV KNÜRR INR NETWORK RACK 4 Vertiv Knürr INR Network Rack The standard Vertiv ...

Wooden Wine Rack by Modula Rack

Guide d'installation du rack Rack-Installationsanleitung · cabinet using the customer rack kit. The installation of your system and rack kit in any other rack cabinet has not been

Rack Hosting - Server Room and Data Centre solutions ... · Rack Hosting Rack Hosting Rack Hosting Rack Hosting 3 Contemporary Data Centre Space Workspace Technology’s Rack Hosting

19” Rack cabInetS aRMadI Rack 19” - Eurometeuromet.com/download/catalogo_euromet_rack.pdf · 19” Rack cabInetS aRMadI Rack 19” Stability and responsibility are the requirements

BIOVIA SCIENCECLOUD SECURITY

CANTILEVER RACK - Heartland Steel Storage Rack Drive-In/Drive Through Rack Push Back Rack Cantilever Rack Pallet Flow Rack P Engineered Pick Modules Platforms and Mezzanines Stairways,

DataNode 硬碟空間配置軟體貨櫃主機. DataNode 硬碟空間配置 ( 一 ) $ ssh dna1 $ df -h Filesystem Size Used Avail Use% Mounted on rootfs 19G 5.6G 12G 32% / none 19G