CPS216: Advanced Database Systems (Data-intensive Computing Systems)
Introduction to MapReduce and Hadoop
Shivnath Babu
Word Count over a Given Set of Web Pages
Input pages: "see bob throw" and "see spot run"
Counts for page 1: see 1, bob 1, throw 1
Counts for page 2: see 1, spot 1, run 1
Combined counts: bob 1, run 1, see 2, spot 1, throw 1
Can we do word count in parallel?
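One way to see why the answer is yes: each page can be counted on its own, and each word can be totalled on its own. The sketch below is plain Python, not Hadoop code; the function names (`map_page`, `reduce_counts`, `word_count`) are made up for illustration, and the two pages are the ones from this slide.

```python
from collections import defaultdict

def map_page(page_text):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page_text.split()]

def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)

def word_count(pages):
    # Pages are independent, so each map_page call could run on a different machine.
    grouped = defaultdict(list)
    for page in pages:
        for word, count in map_page(page):
            grouped[word].append(count)
    # Words are independent too, so each reduce_counts call could run in parallel.
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))

result = word_count(["see bob throw", "see spot run"])
print(result)
```

The only step that needs coordination is the grouping in the middle, which is the part a MapReduce framework takes over.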
The MapReduce Framework (pioneered by Google)
Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task so that one slow task does not slow down the whole job
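The idea of running multiple copies of the same task can be illustrated with a toy sketch using Python threads. This is an assumption-laden illustration, not how any framework implements it: `task` and `run_with_backup_copies` are hypothetical names, the delays simulate a slow and a fast node, and the slower copy is simply ignored rather than killed.

```python
import concurrent.futures
import time

def task(copy_id, delay):
    """Hypothetical task: every copy computes the same answer, some more slowly."""
    time.sleep(delay)  # simulated node speed
    return copy_id, sum(range(1000))

def run_with_backup_copies(delays):
    # Launch one copy of the task per delay and use whichever finishes first.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(task, i, d) for i, d in enumerate(delays)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    pool.shutdown(wait=False)  # do not wait for the straggler
    return next(iter(done)).result()

# Copy 0 runs on a "slow node", copy 1 on a "fast node".
winner, value = run_with_backup_copies([0.5, 0.01])
print("finished first: copy", winner, "result:", value)
```

Because every copy produces the same result, the job can take the first answer and move on; this only works when tasks are deterministic and free of side effects.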
MapReduce in Hadoop (1)
MapReduce in Hadoop (2)
MapReduce in Hadoop (3)
Data Flow in a MapReduce Program in Hadoop
• InputFormat
• Map function
• Partitioner
• Sorting & Merging
• Combiner
• Shuffling
• Merging
• Reduce function
• OutputFormat
1:many
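The stages listed above can be walked through in a small in-memory simulation. This is a sketch in plain Python under simplifying assumptions, not Hadoop's API: one line is one record, the partitioner is a deterministic stand-in for a hash partitioner, the combiner is the same sum as the reducer, and all names (`run_job`, `NUM_REDUCERS`, and so on) are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical setting for this sketch

def input_format(splits):
    # InputFormat: turn each input split into records (here, one line per record).
    return [split.splitlines() for split in splits]

def map_fn(record):
    # Map function: one record in, many (key, value) pairs out.
    return [(word, 1) for word in record.split()]

def partition(key, num_reducers):
    # Partitioner: deterministic stand-in for a hash partitioner.
    return sum(key.encode()) % num_reducers

def combine(sorted_pairs):
    # Combiner: pre-aggregate values per key on the map side.
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_pairs, key=itemgetter(0))]

def reduce_fn(key, values):
    # Reduce function: final aggregation for one key.
    return key, sum(values)

def run_job(splits):
    map_outputs = []
    for records in input_format(splits):          # one map task per split
        parts = defaultdict(list)
        for record in records:
            for k, v in map_fn(record):
                parts[partition(k, NUM_REDUCERS)].append((k, v))
        # Sorting & merging, then the combiner, per partition.
        map_outputs.append({p: combine(sorted(pairs)) for p, pairs in parts.items()})
    output = {}
    for r in range(NUM_REDUCERS):
        # Shuffling: reducer r fetches its partition from every map task.
        fetched = [pair for m in map_outputs for pair in m.get(r, [])]
        merged = sorted(fetched)                  # merging on the reduce side
        output[r] = [reduce_fn(k, [v for _, v in grp])
                     for k, grp in groupby(merged, key=itemgetter(0))]
    return output                                 # OutputFormat: one list per reducer

job_output = run_job(["see bob throw\nsee bob", "see spot run"])
print(job_output)
```

Each key lands in exactly one reducer's output, so concatenating the per-reducer lists gives the complete word count.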
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
Lifecycle of a MapReduce Job
[Figure: Input Splits, Map Wave 1, Reduce Wave 1, Map Wave 2, and Reduce Wave 2 laid out along a Time axis]
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
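A rough back-of-the-envelope model helps with the question above. The sketch below assumes one map task per input split, a fixed split size, and a fixed number of task slots in the cluster; real Hadoop behaviour depends on the input format and scheduler, and the function name `plan_job` and all numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def plan_job(input_bytes, split_bytes, num_reduce_tasks, map_slots, reduce_slots):
    # Assumption: one split per split_bytes of input, one map task per split.
    num_splits = ceil_div(input_bytes, split_bytes)
    return {
        "splits": num_splits,
        "map_tasks": num_splits,
        # A wave is one round of tasks that fills the available slots.
        "map_waves": ceil_div(num_splits, map_slots),
        "reduce_waves": ceil_div(num_reduce_tasks, reduce_slots),
    }

MB = 2**20
plan = plan_job(input_bytes=1024 * MB, split_bytes=64 * MB,
                num_reduce_tasks=8, map_slots=8, reduce_slots=4)
print(plan)
```

With these made-up numbers the job needs two map waves and two reduce waves; changing the split size or the number of reduce tasks changes the shape of the job.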
Job Configuration Parameters
• 190+ parameters in Hadoop
• Set manually or defaults are used
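The "set manually or defaults are used" rule can be pictured as a lookup that checks the job's own settings first and falls back to defaults. The sketch below is plain Python, not Hadoop's configuration API; the three parameter names and default values are as I recall them from Hadoop 0.20-era documentation and should be checked against the version in use, and `job_config` is a hypothetical helper.

```python
from collections import ChainMap

# Assumed defaults (verify against your Hadoop version).
DEFAULTS = {
    "mapred.reduce.tasks": 1,   # number of reduce tasks
    "io.sort.mb": 100,          # map-side sort buffer size in MB
    "io.sort.factor": 10,       # number of streams merged at once
}

def job_config(overrides):
    # Values set manually win; anything not set falls through to the defaults.
    return ChainMap(overrides, DEFAULTS)

conf = job_config({"mapred.reduce.tasks": 8})
print(conf["mapred.reduce.tasks"], conf["io.sort.mb"])
```

Only the one parameter that was set manually differs from the defaults; the rest are inherited silently, which is why untuned jobs often run with settings nobody chose.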
How to sort data using Hadoop?
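One common answer, sketched here under assumptions rather than taken from the slides: let the framework's own sort do the work. Map and reduce pass records through unchanged, and a range partitioner sends each key range to its own reducer, so the reducers' sorted outputs concatenate into a total order. The code is plain Python, and `range_partition`, `mapreduce_sort`, and the boundary values are hypothetical.

```python
import bisect

def range_partition(key, boundaries):
    # Keys below boundaries[0] go to reducer 0, the next range to reducer 1, etc.
    return bisect.bisect_right(boundaries, key)

def mapreduce_sort(records, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in records:  # identity map: emit each key unchanged
        partitions[range_partition(key, boundaries)].append(key)
    # The framework sorts each reducer's input by key; an identity reduce
    # then writes the keys out in that order.
    return [sorted(p) for p in partitions]

parts = mapreduce_sort([42, 7, 19, 88, 3, 61], boundaries=[20, 60])
total_order = [k for p in parts for k in p]
print(parts)
print(total_order)
```

The quality of the boundaries matters: if they split the key space unevenly, one reducer receives most of the data and becomes the slow task.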
Let us look at a complete example MapReduce program in Hadoop