XML Parsing with Map Reduce
Transcript of XML Parsing with Map Reduce
www.edureka.co/big-data-and-hadoop
Hadoop: The Ultimate Data Storage and Processing, Together
Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:
» Analyze different use-cases where MapReduce is used
» Differentiate between the traditional way and the MapReduce way
» Learn about the Hadoop 2.x MapReduce architecture and components
» Understand the execution flow of a YARN MapReduce application
» Implement basic MapReduce concepts
» Run a MapReduce program
Slide 3 www.edureka.co/big-data-and-hadoop
Where is MapReduce Used?
Weather Forecasting
» Problem Statement: Find the maximum temperature recorded in a year.
HealthCare
» Problem Statement: De-identify personal health information.
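The weather problem statement can be sketched as a map step that emits (year, temperature) pairs and a reduce step that keeps the maximum per year. This is a minimal in-memory Python sketch, not Hadoop code; the "year,temperature" record layout, the sample readings, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records in an assumed "year,temperature" layout.
RECORDS = ["1949,111", "1950,0", "1950,22", "1949,78", "1950,-11"]

def map_record(line):
    """Map: one input line -> one (year, temperature) pair."""
    year, temp = line.split(",")
    return year, int(temp)

def reduce_max(year, temps):
    """Reduce: all temperatures of one year -> (year, maximum)."""
    return year, max(temps)

def max_temperature(records):
    grouped = defaultdict(list)  # stands in for the shuffle step
    for line in records:
        year, temp = map_record(line)
        grouped[year].append(temp)
    return dict(reduce_max(year, temps) for year, temps in grouped.items())

result = max_temperature(RECORDS)
print(result)
```

Because the maximum of a group does not depend on the order in which values arrive, the same reduce function would still give the right answer if the records were spread over many map tasks.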
Slide 4 www.edureka.co/big-data-and-hadoop
Where is MapReduce Used?
MapReduce
Features: Large Scale Distributed Model, Parallel Programming, A Program Model
Used in: Classification, Analytics, Recommendation, Index and Search
Function: Map, Reduce
Design Pattern:
» Classification (Eg: Top N records)
» Analytics (Eg: Join, Selection)
» Recommendation (Eg: Sort)
» Summarization (Eg: Inverted Index)
Implemented: Apache Hadoop
For: HDFS, Pig, Hive, HBase
Slide 5 www.edureka.co/big-data-and-hadoop
The Traditional Way
[Diagram: Very Big Data is divided into Split Data pieces; a separate grep runs on each piece and produces its own matches; cat combines them into All matches.]
Slide 6 www.edureka.co/big-data-and-hadoop
MapReduce Way
[Diagram: Very Big Data is divided into Split Data pieces; inside the MapReduce Framework, MAP runs on each piece and REDUCE combines the results into All matches.]
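The two diagrams describe the same search. A rough Python sketch of the MapReduce version follows, with the map function playing the role of grep on one split and the reduce function playing the role of cat. The splits, the pattern, and the function names are assumptions for illustration; a real framework would run the map calls on different nodes.

```python
import re

def map_grep(split, pattern):
    """MAP: runs once per split and emits the matching lines."""
    return [line for line in split if re.search(pattern, line)]

def reduce_concat(per_split_matches):
    """REDUCE: plays the role of `cat`, merging all partial results."""
    return [line for matches in per_split_matches for line in matches]

# Hypothetical splits of the "very big data".
SPLITS = [
    ["error: disk full", "ok"],
    ["ok", "ok"],
    ["error: timeout", "warning"],
]

all_matches = reduce_concat(map_grep(split, r"^error") for split in SPLITS)
print(all_matches)
```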
Slide 7 www.edureka.co/big-data-and-hadoop
MapReduce Paradigm
The Overall MapReduce Word Count Process
Input: Deer Bear River / Car Car River / Deer Car Bear
Splitting (K1,V1): "Deer Bear River" | "Car Car River" | "Deer Car Bear"
Mapping, List(K2,V2): (Deer,1)(Bear,1)(River,1) | (Car,1)(Car,1)(River,1) | (Deer,1)(Car,1)(Bear,1)
Shuffling, K2,List(V2): Bear,(1,1) | Car,(1,1,1) | Deer,(1,1) | River,(1,1)
Reducing, List(K3,V3): Bear,2 | Car,3 | Deer,2 | River,2
Final Result: Bear,2 Car,3 Deer,2 River,2
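The word count walk-through on this slide can be reproduced with a few lines of plain Python. This is only an in-memory sketch of the splitting, mapping, shuffling and reducing stages, not Hadoop code; the function names are assumptions, and the input is the three lines from the slide.

```python
from collections import defaultdict

INPUT = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

def mapper(line):
    """Mapping: one line -> list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffling: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(word, counts):
    """Reducing: (word, [1, 1, ...]) -> (word, total)."""
    return word, sum(counts)

mapped = [pair for line in INPUT for pair in mapper(line)]
shuffled = shuffle(mapped)
final = [reducer(word, counts) for word, counts in shuffled.items()]
print(final)
```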
Slide 8 www.edureka.co/big-data-and-hadoop
Anatomy of a MapReduce Program
MapReduce works on (Key, Value) pairs:
Map: (K1, V1) → List(K2, V2)
Reduce: (K2, List(V2)) → List(K3, V3)
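The two signatures can be written down as Python type hints around a tiny driver. The driver below is a sketch under stated assumptions: `run_mapreduce`, the inverted-index functions, and the sample documents are all hypothetical names and data, and the grouping step assumes keys can be sorted.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

def run_mapreduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], List[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    groups = defaultdict(list)
    for k1, v1 in inputs:                 # Map: (K1, V1) -> List(K2, V2)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)         # group values by K2
    output = []
    for k2 in sorted(groups):             # Reduce: (K2, List(V2)) -> List(K3, V3)
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Example job: a small inverted index (word -> documents containing it).
def index_map(doc_id, text):
    return [(word.lower(), doc_id) for word in text.split()]

def index_reduce(word, doc_ids):
    return [(word, sorted(set(doc_ids)))]

DOCS = [("d1", "Deer Bear"), ("d2", "bear car")]
index = run_mapreduce(DOCS, index_map, index_reduce)
print(index)
```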
Slide 9 www.edureka.co/big-data-and-hadoop
Why MapReduce?
Two biggest Advantages:
» Taking processing to the data
» Processing data in parallel
[Diagram: Map Tasks (labelled a, b, c) and HDFS Blocks placed on Nodes, grouped into Racks, inside a Data Center.]
Slide 10 www.edureka.co/big-data-and-hadoop
Hadoop 2.x MapReduce Components
Client
» Submits a MapReduce Job
Resource Manager
» Cluster Level resource manager
» Long Life, High Quality Hardware
Node Manager
» One per Data Node
» Monitors resources on Data Node
Job History Server
» Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
ApplicationMaster
» One per application
» Short life
» Coordinates and Manages MapReduce Jobs
» Negotiates with Resource Manager to schedule tasks
» The tasks are started by NodeManager(s)
Container
» Created by NM when requested
» Allocates certain amount of resources (memory, CPU etc.) on a slave node
Slide 11 www.edureka.co/big-data-and-hadoop
YARN – Moving beyond MapReduce
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search, Weave..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
Slide 12 www.edureka.co/big-data-and-hadoop
MapReduce Application Execution
Executing MapReduce Application on YARN
Slide 13 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
MapReduce Job Execution
» Job Submission
» Job Initialization
» Tasks Assignment
» Memory Assignment
» Status Updates
» Failure Recovery
Slide 14 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
[Diagram components: Client (Client JVM, Application Job Object), Resource Manager (Management Node), HDFS]
Run Job:
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application submit context
3.1 App Jar
3.2 Job Resources (Block locations)
3.3 User Information
4. Submit Application Context
Slide 15 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
[Diagram components: Client (Client JVM, Application Job Object), Resource Manager (Management Node), Node Manager and App Master (Data Node), HDFS]
Run Job:
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application submit context
3.1 App Jar
3.2 Job Resources (Block locations)
3.3 User Information
4. Submit Application Context
5. Start AppMaster container / Allocate Context for AppMaster
6. Allocate Container for AppMaster
7. Request Resources
8. Notify with resources Availability
Slide 16 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
[Diagram components: Client, Resource Manager (Management Node), Node Manager and App Master (Data Node), Data node-1 and Data node-2 (each with a Node Manager and a Map Block), HDFS]
1. Notify Start Application
2. Get New Application ID
3. Prepare the Application submit context
3.1 App Jar
3.2 Job Resources (Block locations)
3.3 User Information
4. Submit Application Context
5. Start AppMaster container / Allocate Context for AppMaster
6. Allocate Container for AppMaster
7. Request Resources
8. Notify with resources Availability
9. Start Container in the worker node
10. NM allocates Container
Slide 17 www.edureka.co/big-data-and-hadoop
YARN MR Application Execution Flow
11. The tasks get executed.
12. If the job has any reducers, the AppMaster again requests the Node Manager to allocate and start a container for them.
13. The output of all the maps is given to the reducers, and the reducers get executed.
14. Once the job has finished, the Application Master notifies the Resource Manager and the client library.
15. The Application Master is closed.
Slide 18 www.edureka.co/big-data-and-hadoop
Hadoop 2.x : YARN Workflow
[Diagram: a Resource Manager, made up of a Scheduler and an Applications Manager (AsM), above twelve Node Managers. The Node Managers host AppMaster 1 with Containers 1.1 and 1.2, and AppMaster 2 with Containers 2.1, 2.2 and 2.3.]
Slide 26 www.edureka.co/big-data-and-hadoop
Summary: Application Workflow
Execution Sequence:
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM requests containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM
[Diagram: numbered arrows 1 to 8 between Client, RM, NM and AM.]
Slide 27 www.edureka.co/big-data-and-hadoop
Input Splits
INPUT DATA
» Physical Division: HDFS Blocks
» Logical Division: Input Splits
Slide 28 www.edureka.co/big-data-and-hadoop
Relation Between Input Splits and HDFS Blocks
» Logical records do not fit neatly into the HDFS blocks.
» Here the logical records are lines, and some lines cross block boundaries.
» The first split contains line 5 although that line spans blocks.
[Diagram: File Lines 1 to 11 laid out across four Block Boundaries and grouped into three Splits.]
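One way to see how a split can own a line that crosses a block boundary is the rule that line-oriented readers such as Hadoop's LineRecordReader roughly follow: every split except the first discards its partial first line, and every split keeps reading until it has finished the last line that starts at or before its end offset. The Python sketch below mimics that rule on an in-memory byte string. The function names, the 25-byte split size, and the sample lines are assumptions for illustration only.

```python
def next_line_start(data, pos):
    """Offset just past the newline that ends the line containing pos."""
    newline = data.find(b"\n", pos)
    return len(data) if newline == -1 else newline + 1

def read_split(data, start, length):
    """Return the complete lines owned by the split starting at `start`."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split finishes this line, so skip it here.
        pos = next_line_start(data, pos)
    lines = []
    # Read every line that starts at or before `end`, even if it ends later.
    while pos <= end and pos < len(data):
        nxt = next_line_start(data, pos)
        lines.append(data[pos:nxt].rstrip(b"\n").decode())
        pos = nxt
    return lines

# Eleven lines, cut into byte ranges that ignore line boundaries.
DATA = "".join(f"line{i}\n" for i in range(1, 12)).encode()
SPLIT_SIZE = 25
splits = [read_split(DATA, start, SPLIT_SIZE)
          for start in range(0, len(DATA), SPLIT_SIZE)]
print(splits)
```

With these sample values the first boundary falls in the middle of line 5, and the first split still returns lines 1 to 5, while the second split starts at line 6. Every line is read exactly once.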
Slide 34 www.edureka.co/big-data-and-hadoop
MapReduce Job Submission Flow
» Input data is distributed to nodes
» Each map task works on a “split” of data
» Mapper outputs intermediate data
» Data is exchanged between nodes in a “shuffle” process
» Intermediate data of the same key goes to the same reducer
» Reducer output is stored
[Diagram: INPUT DATA feeds a Map task on Node 1 and a Map task on Node 2; their output is shuffled to a Reduce task on Node 1 and a Reduce task on Node 2.]
Slide 35 www.edureka.co/big-data-and-hadoop
Combiner
Block 1: B C D E D B
» Mapper: (B,1)(C,1)(D,1)(E,1)(D,1)(B,1)
» Combiner: (B,2)(C,1)(D,2)(E,1)
Block 2: D A A C B D
» Mapper: (D,1)(A,1)(A,1)(C,1)(B,1)(D,1)
» Combiner: (D,2)(A,2)(C,1)(B,1)
Shuffle: (A, [2])(B, [2,1])(C, [1,1])(D, [2,2])(E, [1])
Reducer: (A,2)(B,3)(C,2)(D,4)(E,1)
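The combiner example on this slide can be traced with a short Python sketch. It is an in-memory illustration, not Hadoop code, and the function names are assumptions; the two blocks are the ones shown on the slide.

```python
from collections import Counter, defaultdict

BLOCKS = ["BCDEDB", "DAACBD"]

def mapper(block):
    """Mapper: emit (letter, 1) for every letter in the block."""
    return [(letter, 1) for letter in block]

def combiner(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(pairs):
    """Shuffle: group the combined values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(key, values):
    return key, sum(values)

combined = [combiner(mapper(block)) for block in BLOCKS]
shuffled = shuffle(pair for part in combined for pair in part)
result = [reducer(key, values) for key, values in shuffled.items()]
print(result)
```

The combiner cuts the number of pairs sent through the shuffle from twelve to eight here. That works because addition gives the same total whether it is applied in one step or in two.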
Slide 36 www.edureka.co/big-data-and-hadoop
Partitioner – Redirecting Output from Mapper
[Diagram: three Map tasks, each followed by a Partitioner that routes its output to one of three Reducers.]
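A partitioner decides which reducer receives each intermediate key. The sketch below follows the hash-then-modulo idea of Hadoop's default HashPartitioner, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. As an assumption for this Python illustration, CRC32 stands in for Java's `hashCode()`, so the partition numbers will not match what Hadoop would produce; the function names and sample pairs are also hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key: stable hash, then modulo."""
    digest = zlib.crc32(key.encode("utf-8"))   # stand-in for hashCode()
    return (digest & 0x7FFFFFFF) % num_reducers

def route(pairs, num_reducers):
    """Send each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

MAP_OUTPUT = [("Deer", 1), ("Bear", 1), ("River", 1),
              ("Car", 1), ("Car", 1), ("Deer", 1)]
buckets = route(MAP_OUTPUT, 3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Whatever hash is used, the property that matters is that equal keys always land in the same bucket, so a single reducer sees every value for a given key.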
Slide 37 www.edureka.co/big-data-and-hadoop
Getting Data to the Mapper
Input File Input File
Input split Input split Input split Input split
RecordReader RecordReader RecordReader RecordReader
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Slide 38 www.edureka.co/big-data-and-hadoop
Partition and Shuffle
Mapper Mapper Mapper Mapper
(intermediates) (intermediates) (intermediates) (intermediates)
Partitioner Partitioner Partitioner Partitioner
(intermediates) (intermediates) (intermediates)
Reducer Reducer Reducer
Slide 39 www.edureka.co/big-data-and-hadoop
Demo
Demo of Word Count Program
To illustrate Default Input Format (Text Input Format)
Slide 40 www.edureka.co/big-data-and-hadoop
Input Format
[Diagram: the Input Format divides each Input file into Input Splits; each Input Split has its own RecordReader, which feeds a Mapper, which emits (Intermediates).]
Slide 41 www.edureka.co/big-data-and-hadoop
Input Format – Class Hierarchy
org.apache.hadoop.mapreduce
InputFormat<K,V>
» FileInputFormat<K,V>
   » CombineFileInputFormat<K,V>
   » TextInputFormat
   » KeyValueTextInputFormat
   » NLineInputFormat
   » SequenceFileInputFormat<K,V>
      » SequenceFileAsBinaryInputFormat
      » SequenceFileAsTextInputFormat
      » SequenceFileInputFilter<K,V>
» DBInputFormat<T>
» <<interface>> ComposableInputFormat<K,V>
   » CompositeInputFormat<K,V>
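The input formats in this hierarchy differ mainly in what key and value they hand to the mapper. As a rough illustration in plain Python (not the Hadoop API), TextInputFormat gives the byte offset of each line as the key and the line as the value, while KeyValueTextInputFormat splits each line at the first separator, a tab by default, and uses the whole line as the key when no separator is present. The function names and sample data below are assumptions.

```python
def text_input_records(data):
    """TextInputFormat style: key = byte offset of the line, value = the line."""
    records, offset = [], 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\r\n").decode()))
        offset += len(raw)
    return records

def key_value_text_records(data, separator="\t"):
    """KeyValueTextInputFormat style: split each line at the first separator."""
    records = []
    for line in data.decode().splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records

SAMPLE = b"a\t1\nbb\t22\nno-tab\n"
print(text_input_records(SAMPLE))
print(key_value_text_records(SAMPLE))
```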
Slide 42 www.edureka.co/big-data-and-hadoop
Output Format
[Diagram: each Reducer writes through a RecordWriter, supplied by the Output Format, to its own Output file.]
Slide 43 www.edureka.co/big-data-and-hadoop
Output Format – Class Hierarchy
org.apache.hadoop.mapreduce
OutputFormat<K,V>
» FileOutputFormat<K,V>
   » TextOutputFormat<K,V>
   » SequenceFileOutputFormat<K,V>
      » SequenceFileAsBinaryOutputFormat
» DBOutputFormat<K,V>
» NullOutputFormat<K,V>
» FilterOutputFormat<K,V>
   » LazyOutputFormat<K,V>
Slide 44 www.edureka.co/big-data-and-hadoop
Demo
Demo: Custom Input Format
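The transcript does not include the demo code. Given the deck title, one plausible custom input format is one whose record reader returns a whole XML element per record instead of one line. The sketch below shows only that record-extraction idea in plain Python. The tag name `record`, the sample document, and the function name are assumptions, and the sketch deliberately ignores start tags with attributes, nested elements of the same name, and elements that cross split boundaries.

```python
def xml_records(data, tag):
    """Yield each <tag>...</tag> span found in data as one record."""
    start_tag, end_tag = f"<{tag}>", f"</{tag}>"
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return                      # no more records
        end = data.find(end_tag, start)
        if end == -1:
            return                      # unterminated element: stop, do not guess
        end += len(end_tag)
        yield data[start:end]
        pos = end

# Hypothetical input document.
SAMPLE = ("<records><record><id>1</id></record>\n"
          "<record><id>2</id></record></records>")
records = list(xml_records(SAMPLE, "record"))
print(records)
```

Each yielded string could then be handed to a mapper as one value and parsed there with an XML parser.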