A NEW PLATFORM FOR A NEW ERA

Transcript of A NEW PLATFORM FOR A NEW ERA Immersion V1.1 - CapGemini WebEx.pdf

Page 1:

A NEW PLATFORM FOR A NEW ERA

Page 2:

Hadoop Immersion v1.1 Internal Use Only Do Not Distribute

Page 3:

Welcome!

Page 4:

Agenda

•  Hadoop Overview
  –  Why Hadoop
  –  History

•  All About HDFS
  –  What is HDFS
  –  Design Assumptions
  –  HDFS Architecture
  –  About the NameNode

•  HDFS – NFS Bridge
  –  What is the HDFS – NFS Bridge

•  All About MapReduce
  –  What is MapReduce
  –  How does MapReduce work
  –  Architecture
  –  Fault Tolerance

Page 5:

A NEW PLATFORM FOR A NEW ERA

Page 6:

Hadoop Overview

Page 7:

Hadoop Overview

•  Why Hadoop?

•  History of Hadoop

Page 8:

Why Hadoop?

Page 9:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 10:

Big Data! - Big Problem!

•  3 main characteristics of big data: Volume, Velocity, Variety

•  Volume
  –  Need to manage lots of data
  –  Social enterprises and the Industrial Internet generate lots of data

•  Velocity
  –  Need fast read / write access to data - measured in nano- or microseconds

•  Variety
  –  Today's agile businesses require rapid response to changing requirements
  –  Fixed schemas are too rigid; need 'emerging' schemas
  –  May not fit into the traditional 'relational' and 'transactional' data model of an RDBMS

Page 11:

Where Does The Data Come From?

•  Social Media

•  Log Files

•  Video Networks

•  Sensor Data

•  Transactions ( retail / banking / stock market / etc )

•  e-mail / text messaging

•  Legacy Documents

Page 12:

There is Value in The Data!

•  The value depends on the use case
  –  Fraud Detection
  –  Marketing Analysis
  –  Threat Analysis
  –  Forecasting
  –  Recommendation Engines
  –  Trade Surveillance
  –  . . .

Page 13:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 14:

Monolithic Computing Model

•  Processor bound
  –  Very fast with small amounts of data

•  The solution was to build bigger and faster computers
  –  More CPUs, more memory

•  Had serious limitations
  –  Expensive, and did not scale as data volumes increased

Page 15:

How About Distributed Computing?

'In Pioneer days they used oxen for heavy pulling, and when one ox was not sufficient to accomplish a task, we did not try to grow a larger ox.'

Admiral Grace Hopper

Page 16:

About Distributed Computing

•  Processing is distributed across a cluster of machines
  –  Multiple nodes
  –  Common frameworks include MPI, PVM and Condor

•  Primarily focused on distributing the processing workload
  –  Powerful compute nodes
  –  Data typically on a separate storage appliance

MPI - Message Passing Interface
PVM - Parallel Virtual Machine
Condor - Batch Queueing

Page 17:

The Challenge of Distributed Processing

•  Does not scale gracefully
  –  Data has to be copied to be processed
  –  Problem escalates as more nodes are added to the cluster

•  Hardware failure
  –  Distributing data means a failure could corrupt part of it
  –  Replication can be a solution
  –  Requires 'management' of the pieces

•  Sort, combine and analyze data from multiple systems
  –  Finding ways to tie related data together

Page 18:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 19:

Ideal Cluster Requirements

•  Linear horizontal scalability
  –  Adding nodes should increase capacity proportionally
  –  Avoid contention with a 'shared nothing' architecture
  –  Expandable at a reasonable cost

•  Jobs run in relative isolation
  –  Results independent of other running jobs
  –  Reasonable performance degradation due to concurrency

•  Simple programming model
  –  Simple API
  –  Multiple language support

Page 20:

Ideal Cluster Requirements ( continued )

•  Failure Happens - Handle it Efficiently

Ideal Failure-Handling Properties
  –  Automatic: Jobs complete automatically
  –  Transparent: Failed tasks are restarted
  –  Graceful: Proportional loss of capacity
  –  Recoverable: Restoration of the failed component restores lost capacity
  –  Consistent: No corruption or invalid results

Page 21:

Why Hadoop?

•  Data . . . It's Everywhere

•  Traditional Computing

•  Is Hadoop The Solution?

•  Core Hadoop

Page 22:

The Essence of Hadoop

•  Distributed file system (HDFS)
  –  Reliable shared / distributed storage of large data files
  –  Potential for redundant copies for reliability

•  Flexible analysis (MapReduce)
  –  Abstracting the process of reading and writing data
  –  Breaking raw data into keys and values for tailored processing

•  In addition to HDFS and MapReduce, Hadoop includes the infrastructure to make these components work
  –  Web Interface
  –  Monitoring and Scheduling tools
  –  Filesystem utilities

Page 23:

Hadoop Eco-System

•  Eco-system tools available to integrate with Hadoop
  –  Hive and Pig – Data Analysis
  –  Tableau and Datameer – Data Visualization
  –  Sqoop – Data Integration
  –  Oozie – Workflow Management
  –  Puppet and Chef – Cluster Management

•  These are not 'Core Hadoop' but rather eco-system tools
  –  Many ( but not all ) are top-level Apache projects

Page 24:

Hadoop - A Different Approach

•  Distributed computing typically requires
  –  Complex synchronization code
  –  Expensive fault-tolerant hardware
  –  Redundant high-performance storage appliances

•  Hadoop's approach, based on the Google File System and MapReduce whitepapers, addresses the problems implicit in distributed systems

Page 25:

Hadoop Scalability

•  Hadoop's goal is to achieve linear horizontal scalability
  –  Minimal chatter between nodes ( shared nothing environment )
  –  Scale horizontally by adding nodes to increase capacity and / or performance

•  Components in a cluster will fail
  –  Provision using widely available commodity hardware
  –  Provision what you currently need and 'scale out' when necessary

Page 26:

Data Access Is A Bottleneck

•  Moving data from a storage appliance to the processors in a traditional distributed system is very time consuming

•  Store and process the data on the same machines
  –  Eliminates the need to copy data between nodes

•  Process data intelligently with data locality
  –  Bring the computation to the data
  –  Process data on the same node it is stored on when possible

Page 27:

Disk Performance Is A Bottleneck

•  Disk technology has advanced significantly
  –  Single large-capacity disks are readily available
  –  Scan performance has not scaled as well as capacity has

•  Take advantage of multiple disks in parallel
  –  Assuming 1 disk, 3TB of data and a transfer rate of 300MB / s
     ▪  Roughly 2h 47m to read the 3TB file
  –  Assuming the 3TB of data is distributed across 1000 disks, they can transfer roughly 300GB / s in aggregate
     ▪  Less than 10 seconds to read the same 3TB file

•  Co-located storage and processing makes this possible ( the arithmetic is sketched below )
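The arithmetic above is easy to check; a minimal Java sketch ( not part of the original deck ) of the same calculation:

public class ScanTime {
    public static void main(String[] args) {
        double dataMB = 3_000_000;           // 3TB expressed in MB
        double perDiskMBperSec = 300;        // 300MB / s per disk

        double oneDisk = dataMB / perDiskMBperSec;                // 10,000 s, roughly 2h 47m
        double thousandDisks = dataMB / (perDiskMBperSec * 1000); // 10 s in aggregate

        System.out.printf("1 disk:     %.0f s (~%.2f h)%n", oneDisk, oneDisk / 3600);
        System.out.printf("1000 disks: %.0f s%n", thousandDisks);
    }
}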

Page 28:

Complex Processing Code

•  Code for a distributed computing environment is complex

•  Hadoop's infrastructure abstracts complex programming requirements
  –  No synchronization code
  –  No networking code
  –  No file I / O code

•  MapReduce developers focus on programs that provide value to the business
  –  Typically written in Java or with Hadoop Streaming

Page 29:

Fault Tolerance

•  Traditional distributed systems rely on expensive fault-tolerant components

•  Failure is inevitable – plan for it
  –  Minimize the effect of failure
  –  Hadoop does this very well

•  Machine failure is a common and regular occurrence
  –  Consider a server with an MTBF of 5 years ( 1825 days or so )
  –  In a 2000-node cluster that is roughly one failure per day ( 2000 / 1825 ≈ 1.1 )

Page 30:

The History of Hadoop

Page 31:

Prior To Hadoop

•  Nutch - Open Source Web Search Engine - 2002
  –  Created by Doug Cutting and Mike Cafarella

•  Google Whitepapers
  –  Google File System - 2003
  –  MapReduce - 2004

•  Nutch Re-Architecture
  –  Doug Cutting - 2005

Page 32:

Early Hadoop Years

•  Hadoop factored out of Nutch
  –  Sub-project of Nutch - 2006
  –  Later became a top-level Apache project - 2008

•  Much of the early development was led by Yahoo!

Page 33:

Hadoop Today

•  Hadoop is mainstream

•  Many eco-system projects have been spawned
  –  Most are top-level Apache projects
  –  Hive, Pig, Oozie, Flume, HBase and others

•  Many organizations have migrated to Hadoop
  –  This has increased the focus on 'enterprise-level' features and tools

•  Hadoop is evolving and changing very frequently

Page 34:

A NEW PLATFORM FOR A NEW ERA

Page 35:

About HDFS

Page 36:

About HDFS

•  What is HDFS?

•  Basic HDFS Assumptions

•  Architecture

•  About The NameNode

•  Lab Exercise – Getting Familiar With HDFS

Page 37:

What is HDFS?

Page 38:

What is HDFS?

•  Hadoop Distributed File System, inspired by GoogleFS

•  Features:
  –  High-performance file system for storing data
  –  Relatively simple centralized management: Master / Slave architecture
  –  Fault-tolerant: data replication
  –  Optimized for MapReduce processing - Data Locality
  –  Scalability: scale horizontally for additional capacity / performance

•  Java application – slower but more portable

•  Other file-systems are supported (FTP, S3)

Page 39:

Basic HDFS Assumptions

Page 40:

Basic HDFS Assumptions

•  Cluster components fail
  –  Use commodity hardware

•  Files are write once / read many

•  Leverage large streaming reads, not random access

•  Favors high sustained throughput over low latency

•  Modest number of HUGE files
  –  Multi-gigabyte files are typical

Page 41:

Architecture

Page 42:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 43:

HDFS Blocks

•  When a file is added to HDFS it is split into blocks

•  This is very similar to files added to any filesystem
  –  Default block size is 64MB
  –  Block size is configurable ( see the sketch below )

•  Blocks are replicated across the cluster
  –  Based on the replication factor ( default is 3 )
  –  Achieves the data-locality goal by making data more available
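A minimal Java sketch of setting these parameters per file; the path and values here are illustrative, and fs.create(path, overwrite, bufferSize, replication, blockSize) is the standard org.apache.hadoop.fs.FileSystem call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.dat");   // hypothetical path
        short replication = 3;                      // the default replication factor
        long blockSize = 64L * 1024 * 1024;         // 64MB, the default block size here

        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}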

Page 44:

Multi-Block Replication Pipeline

Page 45:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 46:

Classic HDFS Components

•  HDFS has three main components
  –  Namenode (single)
     ▪  Manages the file-system content tree
     ▪  Manages file & directory meta-data
     ▪  Manages the datanodes and the blocks they hold
  –  Checkpoint namenode (single)
     ▪  Also known as the 'secondary' or 'backup' namenode
     ▪  Not an active standby of the namenode
     ▪  Merges in-memory and disk-based namenode metadata
     ▪  Keeps a copy of this data, but it is almost always out of date
  –  Datanodes (multiple)
     ▪  Store & retrieve data blocks
     ▪  Report block usage to the namenode

Page 47:

Classic HDFS Architecture

[Diagram: two master nodes ( NameNode and CheckPoint NameNode ) above six DataNode slave nodes.]

Page 48:

NameNode

•  Persistently stores all metadata:
  –  File-to-block mappings
  –  File-system tree
  –  File ownership and permissions

•  Metadata is stored on disk and read into RAM when the NameNode daemon starts

•  Transiently stores block-to-DataNode mappings

•  Changes to the metadata are held in RAM
  –  The NameNode requires a lot of memory

•  All file operations start with the NameNode

•  Single point of failure

Page 49:

NameNode Storage

•  Edit log
  –  In-memory write-ahead log of modification operations

•  edits file
  –  Disk-based file of edits
  –  The log is flushed to this file after writes, before returning to the client

•  fsimage file
  –  Disk-based file of the complete file-system metadata
  –  Asynchronous write-behind merges occur periodically

•  Get the current state by loading fsimage and replaying edits

Page 50:

Checkpoint NameNode

•  Also known as the 'secondary' NameNode

•  The CheckPoint NameNode is not an active standby of the NameNode

•  Performs merging of the edits and fsimage files

•  Merging requires all metadata to be held in memory

•  Has memory requirements similar to those of the NameNode

Page 51:

Updating fsimage

•  NameNode starts writing to an edits.new file

•  edits and fsimage are copied to the checkpoint NameNode

•  Checkpoint NameNode applies edits to fsimage and writes the result as fsimage.ckpt

•  fsimage.ckpt is copied to the NameNode

•  NameNode replaces fsimage with fsimage.ckpt

•  NameNode replaces edits with edits.new

•  NameNode stores the merge time in fstime

Page 52:

DataNodes

•  The actual contents of files are stored as blocks on the DataNodes

•  Blocks are files on the DataNode's underlying filesystem
  –  Named: blk_xxxxxxx
  –  A DataNode cannot associate any block it stores with a file

•  Each block is stored on multiple DataNodes
  –  Default replication factor is 3

•  Each DataNode runs a DataNode daemon
  –  Controls access to the blocks of data
  –  Communicates with the NameNode

Page 53:

What Is Contemporary HDFS?

•  The release of the 0.23 branch of Hadoop ( remember the confusing versioning ) introduced Contemporary HDFS

•  Introduced a High Availability NameNode option
  –  Passive NameNode

•  Introduced a Federation option
  –  Multiple NameNodes

•  We will address these options in the Advanced Configuration module

Page 54:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 55:

HDFS Write

[Diagram: HDFS write path. The HDFS client in the client JVM uses a DistributedFileSystem object and an FSDataOutputStream; a DataStreamer pipelines packets through three DataNodes, each in its own JVM, under the control of the NameNode.]

1: create ( HDFS client to DistributedFileSystem )
2: create ( DistributedFileSystem to NameNode )
3a: write ( client to FSDataOutputStream ); 3b: request block allocation ( as new blocks are required )
4a / 4b / 4c: write packet ( pipelined from DataNode to DataNode )
5a / 5b / 5c: ack packet ( back up the pipeline )
6: close
7: complete ( to NameNode )

Page 56:

Writing To HDFS

Page 57:

Replication Strategy

•  The first copy of a block is written to the same node the client is on
  –  If the client is not part of the cluster, the first copy is written to a random node that is not too busy

•  The second copy of the block is written to a node in a different rack

•  The third copy of the block is written to a different node in the same rack as the second copy ( the policy is sketched below )
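The policy above expressed as a short illustrative Java sketch; Node, Rack and Cluster are hypothetical stand-ins for this example, not Hadoop's actual BlockPlacementPolicy API:

import java.util.List;

class ReplicaPlacement {
    // Choose three targets following the default strategy described above.
    List<Node> chooseTargets(Node client, Cluster cluster) {
        Node first = cluster.contains(client)
                ? client                               // client is a cluster node: write locally
                : cluster.randomLightlyLoadedNode();   // otherwise: any node that is not too busy
        Node second = cluster.randomNodeOffRack(first.rack());              // different rack
        Node third = cluster.randomOtherNodeOnRack(second.rack(), second);  // same rack as second
        return List.of(first, second, third);
    }
}

interface Rack {}
interface Node { Rack rack(); }
interface Cluster {
    boolean contains(Node n);
    Node randomLightlyLoadedNode();
    Node randomNodeOffRack(Rack r);
    Node randomOtherNodeOnRack(Rack r, Node exclude);
}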

Page 58:

Data Locality

•  Key factors in achieving performance
  –  Network bandwidth
  –  Disk seek time
  –  Node processing power

•  Hadoop emphasizes the importance of data locality
  –  Processing as close as possible to the data ( notably the mapper )
  –  Minimize use of network bandwidth

•  Terms for data locality ( in descending order of preference )
  –  Local
  –  On-Rack
  –  Off-Rack

Page 59:

Data Locality - Local

•  Data node on the same host as the processing (mapper)

[Diagram: Rack A ( Nodes A.0 – A.2 ) and Rack B ( Nodes B.0 – B.2 ); the mapper reads its input split from HDFS on the very node it runs on.]

Page 60:

Data Locality - On Rack

•  Data node on the same rack as the processing node (mapper)

[Diagram: the mapper reads its split from another node in the same rack over a very high-speed in-rack connection.]

Page 61:

Data Locality - Off Rack

•  Data node on a different rack from the processing node (mapper)

[Diagram: the mapper reads its split from a node in another rack over a medium to high-speed inter-rack connection.]

Page 62:

HDFS Read

[Diagram: HDFS read path. The HDFS client in the client JVM uses a DistributedFileSystem object and an FSDataInputStream to read blocks directly from the DataNodes.]

1: open ( HDFS client to DistributedFileSystem )
2: request file block locations ( from the NameNode )
3: read ( client to FSDataInputStream )
4: read from block ( first DataNode )
5: read from block ( next DataNode )
6: close

Page 63:

Reading From HDFS

Page 64:

Data Corruption

•  Hadoop employs checksums to ensure block integrity ( see the sketch below )

•  As a block is read, a checksum is calculated and compared against the checksum calculated and stored when the block was written
  –  Fast to calculate and space-efficient

•  If the checksums differ, the client reads the block from the next DataNode in the list provided by the NameNode
  –  The NameNode will replicate the corrupted block elsewhere

•  DataNodes verify block checksums periodically to avoid 'bit rot'
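A minimal standalone sketch of the verify-on-read idea, using java.util.zip.CRC32; real HDFS checksums every 512-byte chunk of a block by default, and this example is not Hadoop code:

import java.util.zip.CRC32;

public class ChunkChecksum {
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some block data".getBytes();
        long storedAtWrite = checksum(chunk);  // computed and persisted when the block was written

        // On read, recompute and compare; a mismatch means this replica is corrupt
        // and the client should fall back to the next DataNode in the list.
        boolean corrupt = checksum(chunk) != storedAtWrite;
        System.out.println(corrupt ? "corrupt - try next replica" : "block OK");
    }
}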

Page 65:

Data Reliability and Recovery

•  DataNodes send heartbeats to the NameNode every 3 seconds

•  If no heartbeats are received, the DataNode is considered lost
  –  The NameNode determines which blocks were on the lost DataNode
  –  The NameNode finds other DataNodes with valid copies of those blocks
  –  It instructs the DataNodes with the valid copies to replicate those blocks to other DataNodes in the cluster

Page 66:

HDFS Architecture

•  Blocks

•  Components

•  HDFS – Writing / Reading

•  Interacting With HDFS

Page 67:

Interacting with HDFS

•  The primary interface is a Java CLI app – the Hadoop shell

•  Java API and others ( C++, Python ) – see the sketch below

•  Web GUI for read-only access

•  HFTP provides an HTTP/S read-only view
  –  No relation to FTP

•  WebHDFS provides a read / write RESTful interface

•  FUSE and MapR allow mounting HDFS as a standard filesystem

CLI – Command-Line Interface
FUSE – Filesystem in Userspace
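A minimal sketch of the Java API mentioned above ( the path is hypothetical ): open a file in HDFS and copy its contents to stdout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS from the config files
        FileSystem fs = FileSystem.get(conf);

        try (FSDataInputStream in = fs.open(new Path("/path/on/hdfs/file"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // 4KB buffer, leave stdout open
        }
    }
}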

Page 68:

HDFS Shell Commands

•  Parallels of many common Linux commands

$ hadoop fs -ls /
$ hadoop fs -tail /path/on/hdfs/file

Commands                               Purpose
ls, lsr, count, du, dus, stat, test    Display file-system information
cat, tail, text                        Display file contents
cp, mkdir, mv, rm, rmr, touchz         Manipulate remote objects
chgrp, chmod, chown                    Change security permissions

Page 69:

HDFS Unique Shell Commands

$ hadoop fs -put /my/local/file /path/on/hdfs
$ hadoop fs -getmerge /directory/with/files /my/local/path
$ hadoop fs -setrep -w 5 /path/on/hdfs/file
$ hadoop fs -expunge
$ hadoop fs -text /path/on/hdfs/file.zip

•  Commands Unique to HDFS ( Java equivalents are sketched below )

Commands                                Purpose
get, put, copyFromLocal, copyToLocal    Move files between the local file-system and HDFS
getmerge                                Concatenate files in a directory and copy to local
setrep                                  Set the replication factor of a file
expunge                                 Empty trash ( 'deleted' files are moved to /trash first )
text                                    Display zip or TextRecordInputStream files as text
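For comparison, a small sketch ( paths hypothetical ) of the Java-API equivalents of two of the commands above, -put and -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutAndSetrep {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -put /my/local/file /path/on/hdfs
        fs.copyFromLocalFile(new Path("/my/local/file"), new Path("/path/on/hdfs"));

        // hadoop fs -setrep 5 /path/on/hdfs/file
        fs.setReplication(new Path("/path/on/hdfs/file"), (short) 5);
    }
}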

Page 70:

HDFS Admin Commands

•  Administration commands for the HDFS cluster

$ hadoop dfsadmin -safemode enter
$ hadoop dfsadmin -setQuota 100 /path/on/hdfs
$ hadoop dfsadmin -setSpaceQuota 50g /path/on/hdfs

Commands                        Purpose
report                          Display status of the cluster
safemode                        Toggle safemode, which prevents writes when enabled
finalizeUpgrade                 Remove the backup of the cluster made in the last upgrade
refreshNodes                    Re-apply the rules of which hosts can participate in the cluster
setQuota, clrQuota              Manage the object-count quota for directories
setSpaceQuota, clrSpaceQuota    Manage the space ( byte ) quota for directories

Page 71:

HDFS Web UI

•  Default location is:
  –  http://namenode:50070/

•  Summarizes the status of the cluster

•  Read-only view of the file-system

•  DataNodes present status at:
  –  http://datanode:50075/

Page 72:

About The NameNode

Page 73:

Is The NameNode A Bottleneck?

•  No data ever traverses the NameNode
  –  During reads
  –  During writes
  –  During replication
  –  During recovery operations

Page 74:

NameNode Memory Usage

•  All metadata is held in RAM for quick response

•  Each entry in the metadata is referred to as an 'item'

•  An 'item' can be:
  –  A filename
  –  File permissions
  –  Block names
  –  Additional block information

•  Each 'item' consumes roughly 150 to 200 bytes of RAM

Page 75:

NameNode Memory Usage

•  HDFS / the NameNode prefers fewer, larger files as a result of the metadata that needs to be stored

•  Consider a 1GB file with the default blocksize of 64MB ( the arithmetic is sketched below )
  –  Stored as a single 1GB file
     ▪  Name: 1 item
     ▪  Blocks: 16 * default replication factor of 3 = 48 items
     ▪  Total number of items to account for this file: 49
  –  Stored as 1000 individual 1MB files
     ▪  Names: 1000 items
     ▪  Blocks: 1000 * default replication factor of 3 = 3000 items
     ▪  Total number of items to account for these files: 4000
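A quick sketch of the item arithmetic above ( the 150-200 bytes-per-item figure comes from the previous slide; exact sizes vary by version ):

public class NameNodeItems {
    // names + ( blocks x replication ) items
    static long items(long files, long blocksPerFile, int replication) {
        return files + files * blocksPerFile * replication;
    }

    public static void main(String[] args) {
        System.out.println("1 x 1GB file ( 16 blocks ): " + items(1, 16, 3));    // 49
        System.out.println("1000 x 1MB files:           " + items(1000, 1, 3));  // 4000
        // At ~200 bytes per item the 1000-file layout costs ~800KB of NameNode RAM
        // versus ~10KB for the single-file layout.
    }
}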

Page 76:

Lab Exercise Getting Familiar With HDFS

Page 77:

Getting Familiar With HDFS

Lab Exercise

•  Getting 'Help' in HDFS

•  Navigating HDFS

•  Loading files into HDFS

•  Verifying data in HDFS

Page 78:

A NEW PLATFORM FOR A NEW ERA

Page 79:

HDFS – NFS Gateway

Page 80:

NFS Gateway

•  Allows HDFS to be mounted as part of the client's local file system

•  Supports a variety of client platforms
  –  Linux / Unix
  –  Windows
  –  Macintosh
  –  Others

•  Browse HDFS via the local file system
•  Download files from HDFS to the local file system
•  Upload files from the local file system to HDFS
•  Stream data directly to HDFS through the mount point

Page 81:

NFS Gateway Architecture

Page 82:

NFS Gateway Configuration

•  The NFS Gateway machine needs the same things required to run an HDFS client
  –  Hadoop jar files
  –  HADOOP_CONF directory

•  The NFS Gateway can be on the same host as the NameNode, a DataNode, or any HDFS client

Page 83:

HDFS – NFS Lab Exercise

Page 84:

Lab Exercise - NFS Gateway

•  Manipulate files in HDFS using hadoop fs or hdfs dfs commands

•  Manipulate files in HDFS via the NFS mount of HDFS

Page 85:

A NEW PLATFORM FOR A NEW ERA

Page 86:

All About MapReduce

Page 87:

About MapReduce

•  What is MapReduce?

•  Architecture

•  Fault Tolerance

•  How Does MapReduce Work?

•  Running MapReduce Jobs

•  Lab Exercise – Run a MapReduce Job

Page 88:

What is MapReduce?

Page 89:

MapReduce Is . . .

•  MapReduce is a programming model for processing large volumes of data
  –  Record-oriented data processing ( key / value pairs )

•  The goal of MapReduce is automatic parallelization and distribution of tasks

•  Ideally each node processes data stored locally on that node

•  MapReduce incorporates two developer-created phases:
  –  Map phase – performs filtering and sorting
  –  Reduce phase – performs summary operations

•  Between the Map and Reduce phases is the Sort / Shuffle / Merge step, which sends data from the Mappers to the Reducers

Page 90:

MapReduce Is . . .

•  MapReduce programs are generally written in Java ( as is Hadoop itself )
  –  Other language support is available using Hadoop Streaming

•  The MapReduce framework abstracts all housekeeping requirements from the developer

•  Fault tolerant

Page 91:

Architecture

Page 92:

MapReduce v1.0 or MapReduce v2.0 / YARN

•  Both manage compute resources, jobs & tasks

•  The Job API is the same

•  MapReduce v1.0 is proven in production

•  MapReduce v2.0 - a new framework that leverages YARN
  –  Yet Another Resource Negotiator
  –  YARN was created to help Yahoo! scale to 4,000 nodes
     ▪  Still classified as alpha
     ▪  Not needed for all but the largest clusters
  –  YARN is a generic distribution platform
     ▪  MapReduce v2.0 is just one application that runs on YARN

Page 93:

Task Management System Components

YARN – Yet Another Resource Negotiator

MapReduce v1.0             MapReduce v2.0 - YARN               Purpose
Client                     Client                              Submit MapReduce job
JobTracker (single)        Resource Manager (single),          Manages scheduling & resources;
                           Node Managers (multiple)            co-ordinates the job run ( starts,
                                                               monitors progress, completes )
TaskTrackers (multiple)    MapReduce Application Master        Run map & reduce tasks
                           (multiple, one per job)
Distributed File-System    Distributed File-System             Used for shared data between the
(normally HDFS)            (normally HDFS)                     other components

Page 94:

Starting Job – MapReduce v1.0

[Diagram: job start-up in MapReduce v1.0, spanning the client JVM, the JobTracker node and a TaskTracker node; the mapper or reducer runs in a child JVM launched by the TaskTracker.]

1: initiate job
2: request job ID
3: copy job jars, config ( to the shared file-system, e.g. HDFS )
4: submit job
5: initialize job
6: determine input splits
7: heartbeat ( returns task )
8: retrieve job jars, data
9: launch
10: run

Page 95:

Starting Job – MapReduce v2.0

[Diagram: job start-up in MapReduce v2.0, spanning the client JVM, the ResourceManager node and two Node Manager nodes; the MRAppMaster runs in a container on one Node Manager, and the mapper or reducer runs in a YARN child JVM on another.]

1: initiate job
2: request new application
3: copy job jars, config ( to the shared file-system, e.g. HDFS )
4: submit job
5a: start container; 5b: launch MRAppMaster; 5c: initialize job
6: determine input splits
7a: allocate task resources; 7b: start container
8: launch
9: retrieve job jars, data
10: run

Page 96:

Fault Tolerance

Page 97:

Fault Tolerance

•  Task processes periodically send heartbeats up the chain

•  A task that fails to heartbeat within the timeout period is presumed dead
  –  The task's JVM is killed

•  A task that throws an exception is considered failed

•  Tasks that have failed are rescheduled
  –  Ideally on a different node than the one they failed on

•  If a task fails four times the job fails ( the limit is configurable, as sketched below )
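A small sketch ( not from the deck ) of adjusting that four-attempt limit when configuring a job; these are the MRv2-era property names ( MRv1 used mapred.map.max.attempts and mapred.reduce.max.attempts ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);     // default is 4
        conf.setInt("mapreduce.reduce.maxattempts", 4);  // default is 4

        Job job = Job.getInstance(conf, "retry-demo");
        // ... set mapper / reducer / paths, then submit the job
    }
}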

Page 98:

How Does MapReduce Work


MapReduce Terminology

•  A MapReduce job consists of:
–  Inputs
–  Task(s)

•  A task is:
–  A unit of work
–  A job will be broken into many tasks
–  A task will be either a map task or a reduce task

•  Map tasks read input splits
–  Generate 'intermediate' results

•  Reduce tasks process 'intermediate' results
–  Generate 'final' results


Anatomy Of Job Submission

•  A job is submitted from a client
–  The job is assigned a job ID
–  The client determines the number of input splits for the job
–  The client copies the jar file and configuration items to HDFS

•  A map task is created for each input split
–  Periodic heartbeats between slave and master nodes identify the readiness of a slave node to execute tasks
–  The master node assigns tasks to slave nodes with available capacity

•  A JVM is instanced on the slave and the task runs in that JVM
–  Tasks run in relative isolation
–  Task status is periodically sent to the master node


Basic Tenets of MapReduce

•  Thought keys . . .
–  A Mapper processes a single record at a time and emits a key / value pair
–  The intermediate key / value pairs produced by the Mappers are passed to the Reducers
–  The Reducers aggregate the intermediate keys from the Mappers and generate the final output


MapReduce Simplified


Is MapReduce Similar To A Unix Pipeline?

cat input | grep <pattern> | sort | uniq -c | cat > output

Input → Map → Sort, Shuffle & Merge → Reduce → Output
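
To make the analogy concrete, here is a hedged, single-machine equivalent of the WordCount job described later (the file names are hypothetical): tr plays the Mapper, sort plays the shuffle, and uniq -c plays the Reducer.

    # WordCount as a Unix pipeline: split into words, sort, count duplicates
    cat input.txt | tr -s '[:space:]' '\n' | sort | uniq -c > output.txt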


Expanding On Those Thought Keys

•  Each Mapper processes a single input split from HDFS
–  Likely an HDFS block

•  The framework passes one record at a time to the Mapper

•  Each record consists of a key / value pair

•  The Mapper writes intermediate key / value pairs to local disk

•  During Sort / Shuffle / Merge, all values for the same intermediate key are transferred to the same Reducer

•  The Reducer is passed each key and an iterator of values

•  Output from the Reducer is written to HDFS


MapReduce In Detail


WordCount()

•  With the thought keys previously described in mind

•  We will use WordCount() to describe the MapReduce infrastructure
–  WordCount is MapReduce's equivalent of 'Hello World'
–  Not the most elegant of examples
–  A simple and understandable example


The Mapper For WordCount

•  Assume the input is a set of text files

•  The key is the byte offset into the file for a line

•  The value is the line in the file at that offset

Map(key, value) {
    for each word x in value:
        output(x, 1);
}


WordCount Mapper Input

•  Input to the Mapper

2120 see the quick brown fox jump over the lazy dog

2166 the five boxing wizards jump quickly

2202 a mad boxer shot a quick jab to the jaw of his opponent


WordCount Mapper Code

// Imports shown for completeness
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String s = value.toString();
        // Split the line on non-word characters and emit (word, 1) for each word
        for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}


WordCount Mapper Output

•  Intermediate data output by the Mapper

(see,1), (the,1), (quick,1), (brown,1), (fox,1), (jump,1), (over,1), (the,1),
(lazy,1), (dog,1), (the,1), (five,1), (boxing,1), (wizards,1), (jump,1),
(quickly,1), (a,1), (mad,1), (boxer,1), (shot,1), (a,1), (quick,1), (jab,1),
(to,1), (the,1), (jaw,1), (of,1), (his,1), (opponent,1)


Shuffle

•  The process by which Mapper output gets to the Reducers

•  Involves activities on both the Mapper and Reducer side

•  Transparent… until something goes wrong or runs slowly


Shuffle – Map Side

•  Mapper output is written to a circular memory buffer

•  When the buffer is nearly full, it is prepared for writing
–  Pairs are placed in the appropriate partition
–  Partition contents are sorted by key
–  The Combiner (if any) is run against the sorted key/value pairs

•  A new spill file is created on the local file-system

•  Spill files are merged together
–  If there are enough separate files, the Combiner is run again

•  The output file is marked as available
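
Since the Combiner appears twice in the flow above, one concrete note: a Combiner is a mini-Reducer run on map-side output, and for an aggregation like WordCount (where the reduce function is associative and commutative) the Reducer class itself can be reused. A minimal sketch against the JobConf driver shown later in this deck:

    // Sketch: pre-aggregate (word, 1) pairs on the map side before the shuffle.
    // SumReducer is the WordCount Reducer class from this deck.
    conf.setCombinerClass(SumReducer.class);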


Shuffle – Map Side – Configuration

Property                               Use
mapreduce.task.io.sort.mb              Size of the circular buffer
mapreduce.map.sort.spill.percent       When to start spilling the buffer
mapreduce.cluster.local.dir            Spill directory list
mapreduce.task.io.sort.factor          How many spills to merge at once
mapreduce.map.combine.minspills        Minimum spill files to run the Combiner again
mapreduce.map.output.compress          Whether to compress output
mapreduce.map.output.compress.codec    Compression codec to use
mapreduce.tasktracker.http.threads     Threads used for transferring files


Shuffle – Reduce Side

•  Reducers poll the JobTracker, asking for input locations

•  Reducers copy files from the Mappers via HTTP

•  Inputs are held in RAM until thresholds are reached

•  Files from the various Mappers are recursively merged
–  Optimizations minimize the number of disk writes

•  The Reducer is invoked


Shuffle – Reduce Side – Configuration

Property                                        Use
mapreduce.reduce.shuffle.parallelcopies         Threads used to copy input files
mapreduce.reduce.shuffle.input.buffer.percent   Fraction of heap used for buffering input files
mapreduce.reduce.shuffle.merge.percent          How full the buffer must be to trigger a merge
mapreduce.reduce.merge.inmem.threshold          Number of input files required to trigger a merge
mapreduce.task.io.sort.factor                   Merge factor
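
All of the shuffle properties in the two tables above can also be set per job rather than cluster-wide. A minimal sketch with the JobConf API; the property names come from the tables, but the values are illustrative only, not tuning recommendations:

    // Sketch: per-job shuffle tuning (classic mapred API)
    JobConf conf = new JobConf(WordCount.class);
    conf.setInt("mapreduce.task.io.sort.mb", 256);               // larger map-side sort buffer
    conf.setBoolean("mapreduce.map.output.compress", true);      // compress intermediate output
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);  // more reduce-side copy threads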


WordCount Reducer Input

•  Input to the Reducer after the Sort / Shuffle / Merge phase

(a, [1,1]) (boxer, [1]) (boxing, [1]) (brown, [1]) (dog, [1]) (five, [1]) (fox, [1]) (his, [1]) (jab, [1]) (jaw, [1]) (jump, [1,1]) (lazy, [1])

(mad, [1]) (of, [1]) (opponent, [1]) (over, [1]) (quick, [1,1]) (quickly, [1]) (see, [1]) (shot, [1]) (the, [1,1,1,1]) (to, [1]) (wizards, [1])


The Reducer For WordCount

•  The keyword is a word

•  The iterator is a list of values associated with the keyword

Reduce(keyword, iterator) {
    for each x in iterator:
        sum += x;
    final_output(keyword, sum);
}


WordCount Reducer Output

•  Output from the WordCount Reducer written to HDFS

(a, 2) (boxer, 1) (boxing, 1) (brown, 1) (dog, 1) (five, 1) (fox, 1) (his, 1) (jab, 1) (jaw, 1) (jump, 2) (lazy, 1)

(mad, 1) (of, 1) (opponent, 1) (over, 1) (quick, 2) (quickly, 1) (see, 1) (shot, 1) (the, 4) (to, 1) (wizards, 1)


WordCount Reducer Code

// Imports shown for completeness
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int wordCount = 0;
        // Sum all counts emitted for this word
        while (values.hasNext()) {
            IntWritable value = values.next();
            wordCount += value.get();
        }
        output.collect(key, new IntWritable(wordCount));
    }
}


WordCount Driver Code

// Imports shown for completeness
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input dir> <output dir>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("Word Count");

        // Input and output locations in HDFS
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Wire up the Mapper and Reducer classes
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(SumReducer.class);

        // Key / value types for intermediate and final output
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}


How To Run A MapReduce Job


Creating And Running A MapReduce Job

•  Develop the Mapper and Reducer classes
•  Develop the Driver class
•  Compile the Driver, Mapper and Reducer classes

    javac -classpath `hadoop classpath` *.java

•  Create a jar file

    jar cvf myjob.jar *.class

•  Run the hadoop jar command

    hadoop jar myjob.jar <main class> input_dir output_dir


Lab Exercise Running A MapReduce Job


Running a MapReduce Job

Lab Exercise

•  Load data into HDFS

•  Compile the WordCount Driver, Mapper and Reducer

•  Create a jar file

•  Submit the jar file
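
A hedged end-to-end walkthrough of the four lab steps above; the directory and file names are hypothetical, and part-00000 is the default output file name for the classic API:

    hdfs dfs -mkdir -p input                          # create an input directory in HDFS
    hdfs dfs -put shakespeare.txt input               # load data into HDFS
    javac -classpath `hadoop classpath` *.java        # compile Driver, Mapper, Reducer
    jar cvf wordcount.jar *.class                     # create the jar file
    hadoop jar wordcount.jar WordCount input output   # submit the job
    hdfs dfs -cat output/part-00000                   # inspect the results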


A NEW PLATFORM FOR A NEW ERA