Performance Issues on Hadoop Clusters
-
Upload
xiao-qin -
Category
Technology
-
view
4.366 -
download
2
description
Transcript of Performance Issues on Hadoop Clusters
![Page 1: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/1.jpg)
Performance Issues onHadoop Clusters
Jiong Xie
Advisor: Dr. Xiao Qin
Committee Members:
Dr. Cheryl Seals
Dr. Dean Hendrix
University Reader:
Dr. Fa Foster Dai
04/11/23 1
![Page 2: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/2.jpg)
Overview of My Research
04/11/23 2
Data Placementon Heterogeneous
Cluster[HCW 10]
Data movementData locality Data shuffling
Prefetching Data from Disk to Memory
[Submit to IPDPS]
Reduce network congest
[To Be Submitted]
![Page 3: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/3.jpg)
Data-Intensive Applications
04/11/23 3
![Page 4: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/4.jpg)
Data-Intensive Applications (cont.)
04/11/23 4
![Page 5: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/5.jpg)
Background
• MapReduce programming model is growing in popularity
• Hadoop is used by Yahoo, Facebook, Amazon.
04/11/23 5
![Page 6: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/6.jpg)
Hadoop Overview--Mapreduce Running System
04/11/23 6
(J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150)
![Page 7: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/7.jpg)
Hadoop Distributed File System
04/11/23 7
(http://lucene.apache.org/hadoop)
![Page 8: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/8.jpg)
Motivations
• MapReduce provides– Automatic parallelization & distribution– Fault tolerance– I/O scheduling– Monitoring & status updates
04/11/23 8
![Page 9: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/9.jpg)
Existing Hadoop Clusters
• Observation 1: Cluster nodes are dedicated– Data locality issues– Data transfer time
• Observation 2: The number of nodes is increased Scalability issues Shuffling overhead goes up
04/11/23 9
![Page 10: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/10.jpg)
Proposed Solutions
04/11/23 10
P3: Preshuffling
P1: Data placement
P2: Prefetching
InputInput
OutputOutput
MapMap
MapMap
MapMap
MapMap
MapMap
ReduceReduce
ReduceReduce
ReduceReduce
![Page 11: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/11.jpg)
Solutions
04/11/23 11
P3: Preshuffling
P1: Data placement
P2: Prefetching
Offline, distributed data, heterogeneous node
Online, data preloading Intermediate data movement, reducing traffic
![Page 12: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/12.jpg)
Improving MapReduce Performance through Data
Placement in Heterogeneous Hadoop Clusters
04/11/23 12
![Page 13: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/13.jpg)
Motivational Example
04/11/23 1313
Time (min)
Node A(fast)
Node B(slow)
Node C(slowest)
2x slower
3x slower
1 task/min
![Page 14: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/14.jpg)
The Native Strategy
04/11/23 14
Node A
Node B
Node C
3 tasks
2 tasks
6 tasks
Loading Transferring Processing
Time (min)
![Page 15: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/15.jpg)
Our Solution--Reducing data transfer time
04/11/23 15
Node A’
Node B’
Node C’
3 tasks
2 tasks
6 tasks
Loading Transferring Processing
Time (min)
Node A
![Page 16: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/16.jpg)
Challenges
04/11/23 16
• Does distribution strategy depend on applications?
• Initialization of data distribution
• The data skew problems– New data arrival– Data deletion – Data updating– New joining nodes
![Page 17: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/17.jpg)
Measure Computing Ratios
04/11/23 17
• Computing ratio
• Fast machines process large data sets
Time
Node A
Node B
Node C
2x slower
3x slower
1 task/min
![Page 18: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/18.jpg)
Measuring Computing Ratios
04/11/23 18
Node Response time(s)
Ratio # of File Fragments
Speed
Node A 10 1 6 Fastest
Node B 20 2 3 Average
Node C 30 3 2 Slowest
1. Run an application, collect response time
2. Set ratio of a node offering the shortest response time as 1
3. Normalize ratios of other nodes
4. Calculate the least common multiple of these ratios
5. Determine the amount of data processed by each node
![Page 19: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/19.jpg)
Initialize Data Distribution
04/11/23 19
Namenode
Datanodes
112233
File1445566
778899
aabb
cc
• Input files split into 64MB blocks
• Round-Robin data distribution algorithm
CBA
Portions 3:2:1
![Page 20: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/20.jpg)
Data Redistribution
04/11/23 2020
1
1.Get network topology, ratio, and utilization
2.Build and sort two lists:under-utilized node list L1
over-utilized node list L2
3. Select the source and destination node from the lists.
4.Transfer data
5.Repeat step 3, 4 until the list is empty.
Namenode
1122
33
4455
66778899
aabbcc
CA
CBA
B
234
L1
L2
Portion 3:2:1
![Page 21: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/21.jpg)
Experimental Environment
04/11/23 21
Five nodes in a Hadoop heterogeneous cluster
Node CPU Model CPU(Hz) L1 Cache(KB)
Node A Intel core 2 Duo 2*1G=2G 204
Node B Intel Celeron 2.8G 256
Node C Intel Pentium 3 1.2G 256
Node D Intel Pentium 3 1.2G 256
Node E Intel Pentium 3 1.2G 256
![Page 22: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/22.jpg)
Benckmarks
• Grep: a tool searching for a regular expression in a text file
• WordCount: a program used to count words in a text file
• Sort: a program used to list the inputs in sorted order.
04/11/23 22
![Page 23: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/23.jpg)
Response Time of Grep andWordcount in Each Node
04/11/23 23
Application dependenceComputing ratio is
Data size independence
![Page 24: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/24.jpg)
Computing Ratio for Two Applications
04/11/23 24
Computing ratio of the five nodes with respective of Grep and Wordcount applications
Computing Node Ratios for Grep Ratios for Wordcount
Node A 1 1
Node B 2 2
Node C 3.3 5
Node D 3.3 5
Node E 3.3 5
![Page 25: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/25.jpg)
Six Data Placement Decisions
04/11/23 25
![Page 26: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/26.jpg)
Impact of data placement on performance of Grep
04/11/23 26
![Page 27: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/27.jpg)
Impact of data placement on performance of WordCount
04/11/23 27
![Page 28: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/28.jpg)
Summary of Data Placement
P1: Data Placement Strategy• Motivation: Fast machines process large data sets• Problem: Data locality issue in heterogeneous
clusters• Contributions: Distribute data according to
computing capability– Measure computing ratio– Initialize data placement– Redistribution
04/11/23 28
![Page 29: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/29.jpg)
Predictive Scheduling and Prefetching for Hadoop clusters
04/11/23 29
![Page 30: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/30.jpg)
Prefetching
• Goal: Improving performance
• Approach– Best effort to guarantee data locality.– Keeping data close to computing nodes– Reducing the CPU stall time
04/11/23 30
![Page 31: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/31.jpg)
Challenges
• What to prefetch?
• How to prefetch?
• What is the size of blocks to be prefetched?
04/11/23 31
![Page 32: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/32.jpg)
Dataflow in Hadoop
04/11/23 32
mapmap
mapmap
reducereduce
reducereduce
HDFSHDFS
Block 1
Block 2
3.Read Input
1.Submit job
2.Schedule
Local FS
Local FS
Local FS
Local FS
4. Run map
5.he
artb
eat
6. N
ext t
ask
7.Read new file
![Page 33: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/33.jpg)
Dataflow in Hadoop
04/11/23 33
mapmap
mapmap
reducereduce
reducereduce
HDFSHDFS
Block 1
Block 2
3.Read Input
1.Submit job
2.Schedule+ more task+ meta data
Local FS
Local FS
Local FS
Local FS
4. Run map
5.he
artb
eat
6. N
ext t
ask
5.1.Read new file
6. N
ext t
ask
4. Run map
![Page 34: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/34.jpg)
Prefetching Processing
04/11/23 34
6
7
8
![Page 35: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/35.jpg)
Software Architecture
04/11/23 35
![Page 36: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/36.jpg)
Grep Performance
04/11/23 36
9.5% 1G8.5% 2G
![Page 37: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/37.jpg)
WordCount Performance
04/11/23 37
8.9% 1G8.1% 2G
![Page 38: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/38.jpg)
Large/Small file in a node
04/11/23 38
9.1% Grep8.3% WordCount
18% Grep24% WordCount
![Page 39: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/39.jpg)
Experiment Setting
04/11/23 39
![Page 40: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/40.jpg)
Large/Small file in cluster
04/11/23 40
![Page 41: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/41.jpg)
Summary
P2: Predictive Scheduler and Prefetching• Goal: Moving data before task assigns• Problem: Synchronization task and data• Contributions: Preloading the required data early
than the task assigned– Predictive scheduler– Prefetching mechanism– Worker thread
04/11/23 41
![Page 42: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/42.jpg)
Adaptive Preshuffling in Hadoop clusters
04/11/23 42
![Page 43: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/43.jpg)
Preshuffling
• Observation 1: Too much data move from Map worker to Reduce worker– Solution1: Map nodes apply pre-shuffling
functions to their local output
• Observation 2: No reduce can start until a map is complete.– Solution2: Intermediate data is pipelined
between mappers and reducers. 04/11/23 43
![Page 44: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/44.jpg)
Preshuffling
• Goal : Minimize data shuffle during Reduce
• Approach– Pipeline– Overlap between map and data movement– Group map and reduce
• Challenges– Synchronize map and reduce– Data locality
04/11/23 44
![Page 45: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/45.jpg)
Dataflow in Hadoop
04/11/23 45
mapmap
mapmap
reducereduce
reducereduce
HDFSHDFS
Block 1
Block 2
3.Read Input
1.Submit job 2.Schedule
Local FS
Local FS
Local FS
Local FS
5.he
artb
eat
6. N
ext t
ask
2. New task
HTTP GET
4. Run map3. Request data
HDFSHDFS
5.Write data
4. Send data
![Page 46: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/46.jpg)
PreShuffle
04/11/23 46
Data request
mapmap
mapmap
reducereduce
reducereduce
![Page 47: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/47.jpg)
In-memory buffer
04/11/23 47
![Page 48: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/48.jpg)
Pipelining – A new design
04/11/23 48
HDFSHDFSHDFSHDFS
Block 1
Block 2
mapmap
mapmap
reducereduce
reducereduce
![Page 49: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/49.jpg)
WordCount Performance
04/12/23 49
230 seconds vs 180 seconds
![Page 50: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/50.jpg)
WordCount Performance
04/12/23 50
![Page 51: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/51.jpg)
Sort Performace
04/12/23 51
![Page 52: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/52.jpg)
Summary
P3: Preshuffling• Goal: Minimize data shuffling during the Reduce• Problem: task distribution and synchronization • Contributions: preshuffling agorithm
– Push data instead of tradition pull– In-memory buffer– Pipeline
04/12/23 52
![Page 53: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/53.jpg)
Conclusion
04/12/23 53
InputInput
Output
Output
P3: Preshuffling
P1: Data placement
P2: Prefetching
Map
Map
Map
Map
Map
Map
Map
Map
Map
Map
Reduce
Reduce
Reduce
Reduce
Reduce
Reduce
Offline, distributed data, heterogeneous node
Online, data preloading, single node
Intermediate data movement, reducing traffic
![Page 54: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/54.jpg)
Future Work
• Extend Pipelining– Implement the pipelining design
• Small files issue– Har file– Sequence file– CombineFileInputFormat
• Extend Data placement
04/12/23 54
![Page 55: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/55.jpg)
Thanks!And Questions?
55
![Page 56: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/56.jpg)
Run Time affected by Network Condition
04/12/23 56
Experiment result conducted by Yixian Yang
![Page 57: Performance Issues on Hadoop Clusters](https://reader036.fdocuments.net/reader036/viewer/2022062405/554f7355b4c9052a518b4591/html5/thumbnails/57.jpg)
Traffic Volume affected by Network Condition
04/12/23 57
Experiment result conducted by Yixian Yang