MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data...

22
MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu for R212

Transcript of MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data...

Page 1: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

MapReduce:Simplified Data Processing on Large Clusters

J. Dean, S. Ghemawat, OSDI, 2004.

Review by Mariana Marasoiu for R212

Page 2: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Motivation: Large scale data processing

We want to:

Extract data from large datasets

Run on big clusters of computers

Be easy to program

Page 3: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Solution: MapReduce

A new programming model: Map & Reduce

Provides:Automatic parallelization and distributionFault toleranceI/O schedulingStatus and monitoring

Page 4: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

(1, you are in Cambridge)

(2, I like Cambridge)

(3, we live in Cambridge)

(you, 1)(are, 1)(in, 1)(Cambridge, 1)

(I, 1)(like, 1)(Cambridge, 1)

(we, 1)(live, 1)(in, 1)(Cambridge, 1)

Map

map (in_key, in_value) → list(out_key, intermediate_value)

Page 5: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

(you, 1)(are, 1)(in, 1)(Cambridge, 1)

(I, 1)(like, 1)(Cambridge, 1)

(we, 1)(live, 1)(in, 1)(Cambridge, 1)

Page 6: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Partition

(we, 1)

(you, 1)

(live, 1)

(are, 1)

(Cambridge, 1)(Cambridge, 1)(Cambridge, 1)

(in, 1)(in, 1)

(I, 1)

(like, 1)

(you, 1)(are, 1)(in, 1)(Cambridge, 1)

(I, 1)(like, 1)(Cambridge, 1)

(we, 1)(live, 1)(in, 1)(Cambridge, 1)

Page 7: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Partition Reduce

(we, 1)

(you, 1)

(live, 1)

(are, 1)

(Cambridge, 1)(Cambridge, 1)(Cambridge, 1)

(in, 1)(in, 1)

(I, 1)

(like, 1)

(you, 1)(are, 1)(in, 1)(Cambridge, 1)

(I, 1)(like, 1)(Cambridge, 1)

(we, 1)(live, 1)(in, 1)(Cambridge, 1)

(you, 1)

(are, 1)

(in, 2)

(Cambridge, 3)

(I, 1)

(like, 1)

(we, 1)

(live, 1)

reduce (out_key, list(intermediate_value)) -> list(out_value)

Page 8: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

File 3

UserProgram

Input files

Page 9: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

worker

worker

worker

worker

worker

File 3

UserProgram Master

Input files

fork

forkfork

Page 10: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

worker

worker

worker

worker

worker

File 3

UserProgram Master

Input files

fork

assignmap

assignreduce

Page 11: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

split 0

split 1

split 2

split 3

split 4

worker

worker

worker

worker

worker

File 3

UserProgram Master

Input files

M splits

Mapphase

fork

assignmap

assignreduce

split

read

Page 12: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

split 0

split 1

split 2

split 3

split 4

worker

worker

worker

worker

worker

File 3

UserProgram Master

Input files

M splits

Mapphase

Intermediate files(on local disks)

fork

assignmap

assignreduce

split

readlocalwrite

Page 13: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

split 0

split 1

split 2

split 3

split 4

worker

worker

worker

worker

worker

File 3

UserProgram Master

Input files

M splits

Mapphase

Intermediate files(on local disks)

Reducephase

fork

assignmap

assignreduce

split

readlocalwrite remote

read

Page 14: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

File 1

File 2

split 0

split 1

split 2

split 3

split 4

worker

worker

worker

worker

worker

OutputFile 1

OutputFile 2

File 3

UserProgram Master

Input files

M splits

Mapphase

Intermediate files(on local disks)

Reducephase

R Output files

fork

assignmap

assignreduce

split

readlocalwrite remote

read

write

Page 15: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Fine task granularity

M so that data is between 16MB and 64MBR is small multiple of workersE.g. M = 200,000, R = 5,000 on 2,000 workers

Advantages:dynamic load balancingfault tolerance

Page 16: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Fault tolerance

Workers:Detect failure via periodic heartbeat

Re-execute completed and in-progress map tasks

Re-execute in progress reduce tasks

Task completion committed through master

Master:Not handled - failure unlikely

Page 17: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Refinements

Locality optimizationBackup tasksOrdering guaranteesCombiner functionSkipping bad recordsLocal execution

Page 18: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Performance

Tests run on 1800 machines:Dual 2GHz Intel Xeon processors

with Hyper-Threading enabled4GB of memoryTwo 160GB IDE disksGigabit Ethernet link

2 Benchmarks:MR_Grep 1010 x 100 byte entries, 92k matchesMR_Sort 1010 x 100 byte entries

Page 19: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

MR_Grep

150 seconds run (startup overhead of ~60 seconds)

Page 20: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

MR_Sort Normal execution No backup tasks 200 tasks killed

Page 21: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Experience

Rewrite of the indexing systemfor Google web search

Large scale machine learning

Clustering for Google News

Data extraction for Google Zeitgeist

Large scale graph computations

Page 22: MapReduce - University of Cambridgeey204/teaching/ACS/R212_2014...MapReduce: Simplified Data Processing on Large Clusters J. Dean, S. Ghemawat, OSDI, 2004. Review by Mariana Marasoiu

Conclusions

MapReduce:useful abstractionsimplifies large-scale computationseasy to use

However:expensive for small applicationslong startup time (~1 min)chaining of map-reduce phases?