XML Parsing with Map Reduce

www.edureka.co/big-data-and-hadoop

Hadoop: the ultimate data storage and processing, together


Objectives

At the end of this module, you will be able to:

» Analyze different use cases where MapReduce is used
» Differentiate between the Traditional way and the MapReduce way
» Learn about Hadoop 2.x MapReduce architecture and components
» Understand the execution flow of a YARN MapReduce application
» Implement basic MapReduce concepts
» Run a MapReduce Program


Where Is MapReduce Used?

HealthCare

Problem Statement:
» De-identify personal health information.

Weather Forecasting

Problem Statement:
» Finding the maximum temperature recorded in a year.
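The maximum-temperature problem statement can be sketched in plain Python, without Hadoop. This is a minimal illustration under stated assumptions: the record layout ("YYYY-MM-DD,temperature") and the sample readings are hypothetical, and a dictionary stands in for the framework's shuffle.

```python
from collections import defaultdict

# Hypothetical sample records: "YYYY-MM-DD,temperature in Celsius".
READINGS = [
    "1949-03-12,11.1",
    "1949-07-30,27.8",
    "1950-01-05,-2.2",
    "1950-08-14,30.0",
    "1950-08-15,29.4",
]

def map_reading(line):
    """Map: one input line -> one (year, temperature) pair."""
    date, temp = line.split(",")
    yield date[:4], float(temp)

def reduce_max(year, temps):
    """Reduce: all temperatures of one year -> the maximum."""
    return year, max(temps)

def max_temperature_per_year(lines):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        for year, temp in map_reading(line):
            grouped[year].append(temp)
    return dict(reduce_max(y, ts) for y, ts in sorted(grouped.items()))

result = max_temperature_per_year(READINGS)
print(result)
```

The mapper never looks at more than one line, which is what would let a framework run many mappers in parallel over different parts of the input.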


Where Is MapReduce Used?

MapReduce

Features
» Large Scale Distributed Model
» Parallel Programming
» A Program Model

Function
» Map
» Reduce

Design Pattern
» Classification, e.g. Top N records
» Analytics, e.g. Join, Selection
» Recommendation, e.g. Sort
» Summarization, e.g. Inverted Index

Used in
» Classification
» Analytics
» Recommendation
» Index and Search

Implemented
» Google
» Apache Hadoop

For
» HDFS
» Pig
» Hive
» HBase


The Traditional Way

[Figure: Very Big Data is cut into several pieces of Split Data; grep runs on each split and produces its own matches; cat combines them into All matches.]
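The split, grep, and cat pipeline of the traditional way can be sketched as follows. This is only an illustration: the helper names and the sample log lines are hypothetical, and the per-split grep runs sequentially here, whereas the point of the slide is that each split can be processed independently.

```python
def split_data(lines, n_splits):
    """Cut the data into at most n_splits contiguous pieces."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def grep(pattern, lines):
    """Keep the lines that contain the pattern (a plain substring test)."""
    return [line for line in lines if pattern in line]

def cat(parts):
    """Concatenate the per-split matches into one list."""
    return [line for part in parts for line in part]

VERY_BIG_DATA = [  # hypothetical stand-in for a large file
    "ok start",
    "error disk full",
    "ok running",
    "error timeout",
    "ok stop",
]

splits = split_data(VERY_BIG_DATA, 3)
matches = [grep("error", s) for s in splits]  # one grep per split
all_matches = cat(matches)
print(all_matches)
```

In this approach the splitting, the distribution of work, and the final concatenation are all the programmer's responsibility.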


MapReduce Way

[Figure: Very Big Data is cut into several pieces of Split Data; inside the MapReduce Framework, MAP runs on each split and REDUCE combines the results into All matches.]


MapReduce Paradigm

The Overall MapReduce Word Count Process

Input:
Deer Bear River
Car Car River
Deer Car Bear

Splitting (K1,V1):
» Deer Bear River
» Car Car River
» Deer Car Bear

Mapping, List(K2,V2):
» Deer, 1; Bear, 1; River, 1
» Car, 1; Car, 1; River, 1
» Deer, 1; Car, 1; Bear, 1

Shuffling, K2,List(V2):
» Bear, (1,1)
» Car, (1,1,1)
» Deer, (1,1)
» River, (1,1)

Reducing:
» Bear, 2
» Car, 3
» Deer, 2
» River, 2

Final Result, List(K3,V3):
Bear, 2
Car, 3
Deer, 2
River, 2
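The word count stages (splitting, mapping, shuffling, reducing) can be traced in a few lines of plain Python. This is a sketch of the data flow only, not Hadoop code: the function names are chosen to match the stage names, and the line number is used as K1 purely for illustration.

```python
from collections import defaultdict

INPUT = "Deer Bear River\nCar Car River\nDeer Car Bear"

def splitting(text):
    """Input -> one (K1, V1) record per line; K1 is the line number here."""
    return list(enumerate(text.splitlines()))

def mapping(record):
    """(K1, V1) -> List(K2, V2): emit (word, 1) for every word."""
    _, line = record
    return [(word, 1) for word in line.split()]

def shuffling(pairs):
    """List(K2, V2) -> K2, List(V2): group the ones by word, sorted by word."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return sorted(grouped.items())

def reducing(word, ones):
    """K2, List(V2) -> (K3, V3): sum the ones of a single word."""
    return word, sum(ones)

records = splitting(INPUT)
mapped = [pair for record in records for pair in mapping(record)]
shuffled = shuffling(mapped)
final = [reducing(word, ones) for word, ones in shuffled]
print(final)
```

Each intermediate variable corresponds to one column of the word count diagram, so printing `mapped` or `shuffled` reproduces the middle stages.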


Anatomy of a MapReduce Program

MapReduce works on (Key, Value) pairs.

Map: (K1, V1) → List (K2, V2)

Reduce: (K2, list (V2)) → List (K3, V3)
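The two signatures can be written down as Python type hints around a tiny in-memory driver. This is a hedged sketch: `run_mapreduce` is a hypothetical helper, not a Hadoop API, and the inverted index used to exercise it is just one possible pair of map and reduce functions over made-up documents.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

def run_mapreduce(
    records: Iterable[tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[tuple[K2, V2]]],
    reducer: Callable[[K2, list[V2]], Iterable[tuple[K3, V3]]],
) -> list[tuple[K3, V3]]:
    """Apply mapper to every record, group by K2, then apply reducer per key."""
    grouped: dict[K2, list[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):  # assumes the K2 keys can be ordered
        output.extend(reducer(k2, grouped[k2]))
    return output

def index_mapper(doc_id, text):
    """(doc id, text) -> list of (word, doc id)."""
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reducer(word, doc_ids):
    """(word, list of doc ids) -> list of (word, sorted doc ids)."""
    yield word, sorted(doc_ids)

DOCS = [("d1", "Deer Bear River"), ("d2", "Car Car River"), ("d3", "Deer Car Bear")]
index = run_mapreduce(DOCS, index_mapper, index_reducer)
print(index)
```

Only the mapper and reducer change from one job to the next; the grouping step in the middle stays the same.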


Why MapReduce?

Two biggest Advantages:

» Taking processing to the data

» Processing data in parallel

[Figure: Map Tasks (labelled a, b, c) and their HDFS Blocks, shown across the Node, Rack, and Data Center levels.]


Hadoop 2.x MapReduce Components

Client
» Submits a MapReduce Job

Resource Manager
» Cluster Level resource manager
» Long Life, High Quality Hardware

Node Manager
» One per Data Node
» Monitors resources on Data Node

Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates

ApplicationMaster
» One per application
» Short life
» Coordinates and Manages MapReduce Jobs
» Negotiates with Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)

Container
» Created by NM when requested
» Allocates certain amount of resources (memory, CPU etc.) on a slave node


YARN – Moving beyond MapReduce

» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave..)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html


MapReduce Application Execution

Executing MapReduce Application on YARN


YARN MR Application Execution Flow

MapReduce Job Execution

» Job Submission

» Job Initialization

» Tasks Assignment

» Memory Assignment

» Status Updates

» Failure Recovery


[Figure: execution flow between the Client (Client JVM, Application Job Object, Run Job), the Resource Manager (Management Node), the Node Managers and App Master (Data Nodes, each holding a Map Block), and HDFS. The numbered steps of the figure are listed here.]

1. Notify Start Application
2. Get New Application ID
3. Prepare the Application submit context
   3.1 App Jar
   3.2 Job Resources (Block locations)
   3.3 User Information
4. Submit Application Context
5. Start AppMaster container / Allocate Context for AppMaster
6. Allocate Container for AppMaster
7. Request Resources
8. Notify with resources Availability
9. Start Container in the worker node
10. NM allocates Container
11. Task gets executed.
12. If the Job has any Reducer, the AppMaster again requests the Node Manager to start and allocate a Container for it.
13. The output of all the Maps is given to the Reducer, and the Reducer gets executed.
14. Once the Job has finished, the Application Master notifies the Resource Manager and the Client Library.
15. Application Master closes.


Hadoop 2.x : YARN Workflow

[Figure: the Resource Manager, made up of the Scheduler and the Applications Manager (AsM), above a grid of Node Managers. AppMaster 1 appears with Container 1.1 and Container 1.2; AppMaster 2 appears with Container 2.1, Container 2.2, and Container 2.3.]


Summary: Application Workflow

Execution Sequence:

1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM asks RM for containers
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application's status
8. AM unregisters with RM

[Figure: sequence diagram with Client, RM, NM, and AM columns, built up step by step from 1 to 8.]

Input Splits

INPUT DATA
» Physical Division: HDFS Blocks
» Logical Division: Input Splits


Relation Between Input Splits and HDFS Blocks

» Logical records do not fit neatly into the HDFS blocks.
» Logical records are lines, and a line can cross the boundary between two blocks.
» The first split contains line 5 even though the line spans two blocks.

[Figure: File Lines 1 to 11 laid over four Block Boundaries and grouped into three Splits.]
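How a split can own a line that crosses a block boundary is easier to see in code. The sketch below follows the rule that, to my understanding, Hadoop's line-oriented record reader applies: every split except the first skips the line in progress at its start, and every split keeps reading until it has finished the last line that starts at or before its end. The file contents and the 32-byte "block size" are hypothetical and deliberately tiny.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (byte offset, line) for the lines owned by [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The line in progress at `start` belongs to the previous split,
        # so skip ahead to the first byte after the next newline.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    # Read every line that starts at or before `end`; the last one may
    # run past the boundary into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode()
        pos = line_end + 1

FILE = "".join(f"line {i}\n" for i in range(1, 12)).encode()
BLOCK_SIZE = 32  # hypothetical; real HDFS blocks are far larger

splits = [
    list(read_split(FILE, start, BLOCK_SIZE))
    for start in range(0, len(FILE), BLOCK_SIZE)
]
for number, records in enumerate(splits, start=1):
    print(number, [line for _, line in records])
```

With these numbers "line 5" starts at byte 28 and ends after byte 32, so it crosses the first boundary and is still read by the first split, while the second split begins with "line 6".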


MapReduce Job Submission Flow

» Input data is distributed to nodes
» Each map task works on a “split” of data
» Mapper outputs intermediate data
» Data exchange between nodes in a “shuffle” process
» Intermediate data of the same key goes to the same reducer
» Reducer output is stored

[Figure: INPUT DATA feeding Map tasks on Node 1 and Node 2, followed by Reduce tasks on Node 1 and Node 2, built up one step per slide.]


Combiner

Block 1: B C D E D B
Mapper: (B,1)(C,1)(D,1)(E,1)(D,1)(B,1)
Combiner: (B,2)(C,1)(D,2)(E,1)

Block 2: D A A C B D
Mapper: (D,1)(A,1)(A,1)(C,1)(B,1)(D,1)
Combiner: (D,2)(A,2)(C,1)(B,1)

Shuffle: (A, [2])(B, [2,1])(C, [1,1])(D, [2,2])(E, [1])

Reducer: (A,2)(B,3)(C,2)(D,4)(E,1)
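The combiner example can be replayed in plain Python using the same two blocks. This is a sketch of the idea, not of Hadoop's implementation: the combiner here is simply a per-mapper sum, and the function names are illustrative.

```python
from collections import Counter, defaultdict

BLOCK_1 = "BCDEDB"
BLOCK_2 = "DAACBD"

def mapper(block):
    """Emit (letter, 1) for every letter of the block."""
    return [(letter, 1) for letter in block]

def combiner(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def shuffle(*outputs):
    """Group the values of all mappers by key, sorted by key."""
    grouped = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    return key, sum(values)

combined_1 = combiner(mapper(BLOCK_1))
combined_2 = combiner(mapper(BLOCK_2))
shuffled = shuffle(combined_1, combined_2)
result = [reducer(key, values) for key, values in shuffled]

pairs_without_combiner = len(mapper(BLOCK_1)) + len(mapper(BLOCK_2))
pairs_with_combiner = len(combined_1) + len(combined_2)
print(result, pairs_without_combiner, pairs_with_combiner)
```

The final counts are the same with or without the combiner; what changes is the number of pairs that have to cross the network during the shuffle (12 versus 8 in this example). This only works because summing can be applied in stages.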


Partitioner – Redirecting Output from Mapper

[Figure: three Map tasks, each followed by a Partitioner that routes its output to one of three Reducers.]
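A partitioner decides which reducer receives each intermediate key. The sketch below uses the common "hash of the key modulo the number of reducers" scheme; `zlib.crc32` is an assumption made here to get a hash that is stable across Python runs, and it is not the hash function Hadoop uses.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer responsible for this key."""
    # crc32 is a stand-in for the key's hash code (assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Distribute one mapper's (key, value) pairs into per-reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

PAIRS = [("Deer", 1), ("Bear", 1), ("River", 1), ("Car", 1), ("Car", 1), ("Deer", 1)]
buckets = route(PAIRS, 3)
for reducer_index, bucket in enumerate(buckets):
    print(reducer_index, bucket)
```

Because the reducer index depends only on the key, every mapper sends a given key to the same reducer, which is what guarantees that all values of a key meet in one place.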


Getting Data to the Mapper

[Figure: two Input Files are divided into four Input splits; each Input split is read by its own RecordReader, which feeds a Mapper; each Mapper produces (intermediates).]


Partition and Shuffle

[Figure: four Mappers each produce (intermediates); a Partitioner after each Mapper assigns them to partitions; the shuffled (intermediates) are delivered to three Reducers.]


Demo

Demo of Word Count Program
To illustrate Default Input Format (Text Input Format)


Input Format

[Figure: the Input Format divides each Input file into Input Splits; a RecordReader reads each Input Split and feeds a Mapper, which produces (Intermediates).]


Input Format – Class Hierarchy

org.apache.hadoop.mapreduce

InputFormat<K,V>
  FileInputFormat<K,V>
    CombineFileInputFormat<K,V>
    TextInputFormat
    KeyValueTextInputFormat
    NLineInputFormat
    SequenceFileInputFormat<K,V>
      SequenceFileAsBinaryInputFormat
      SequenceFileAsTextInputFormat
      SequenceFileInputFilter<K,V>
  ComposableInputFormat<K,V> (<<interface>>)
    CompositeInputFormat<K,V>
  DBInputFormat<T>
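The practical difference between three of the text-oriented input formats is the shape of the records the mapper receives. The plain-Python sketch below imitates that behaviour as I understand it: byte offset and line for TextInputFormat, key and value split at the first tab for KeyValueTextInputFormat, and N lines per split for NLineInputFormat. The function names and the sample file are hypothetical, and none of this is Hadoop code.

```python
FILE = "alpha\t1\nbeta\t2\ngamma\n"

def text_input_format(data):
    """Records as (byte offset of the line, line text)."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

def key_value_text_input_format(data, separator="\t"):
    """Records as (text before the first separator, text after it)."""
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        yield key, value  # no separator: whole line is the key, value is empty

def n_line_splits(data, n=1):
    """Group the lines so that each split holds at most n of them."""
    lines = data.splitlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]

text_records = list(text_input_format(FILE))
kv_records = list(key_value_text_input_format(FILE))
two_line_splits = n_line_splits(FILE, n=2)
print(text_records)
print(kv_records)
print(two_line_splits)
```

The same three lines therefore reach the mapper with different keys depending on the format chosen for the job.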


Output Format

[Figure: each Reducer passes its output to a RecordWriter, provided by the Output Format, which writes an Output file.]


Output Format – Class Hierarchy

org.apache.hadoop.mapreduce

OutputFormat<K,V>
  FileOutputFormat<K,V>
    TextOutputFormat<K,V>
    SequenceFileOutputFormat<K,V>
      SequenceFileAsBinaryOutputFormat
  DBOutputFormat<K,V>
  NullOutputFormat<K,V>
  FilterOutputFormat<K,V>
    LazyOutputFormat<K,V>


Demo

Demo: Custom Input Format
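The idea behind a custom input format for XML is that one record should be one complete element rather than one line. The sketch below shows that idea in plain Python, not the demo's actual code: `xml_records` is a hypothetical record reader that scans for a start tag and its matching end tag, and the catalog document, tag names, and mapper are made up for illustration. It assumes the start tag has no attributes and that elements of that tag are not nested.

```python
import xml.etree.ElementTree as ET

def xml_records(data: str, tag: str):
    """Yield (character offset, xml text) for every <tag>...</tag> block."""
    start_tag, end_tag = f"<{tag}>", f"</{tag}>"
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return
        end = data.find(end_tag, start)
        if end == -1:
            return  # unterminated element: no complete record is left
        end += len(end_tag)
        yield start, data[start:end]
        pos = end

def map_book(offset, record):
    """Map: one <book> element -> (title, price)."""
    book = ET.fromstring(record)
    yield book.findtext("title"), float(book.findtext("price"))

DOC = """<catalog>
<book><title>Hadoop</title><price>30</price></book>
<book><title>Pig</title>
<price>20</price></book>
</catalog>"""

records = list(xml_records(DOC, "book"))
pairs = [pair for offset, record in records for pair in map_book(offset, record)]
print(pairs)
```

The second book spans two lines, so a line-based reader would hand the mapper two fragments that cannot be parsed on their own, while the tag-based reader delivers the element whole.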
