
MapReduce: Simplified Data Processing on Large Clusters

J. Dean, S. Ghemawat, OSDI, 2004.

Review by Mariana Marasoiu for R212

Motivation: Large scale data processing

We want to:

Extract data from large datasets

Run on big clusters of computers

Be easy to program

Solution: MapReduce

A new programming model: Map & Reduce

Provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

Map

map (in_key, in_value) → list(out_key, intermediate_value)

Input pairs:

(1, you are in Cambridge)
(2, I like Cambridge)
(3, we live in Cambridge)

Map output:

(you, 1) (are, 1) (in, 1) (Cambridge, 1)
(I, 1) (like, 1) (Cambridge, 1)
(we, 1) (live, 1) (in, 1) (Cambridge, 1)
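The map step above can be sketched in Python (a minimal word-count mapper; `map_fn` is an illustrative name, not the paper's API):

```python
# Word-count map function: emits (word, 1) for every word in the input value.
# Signature mirrors the slide: map(in_key, in_value) -> list(out_key, intermediate_value).
def map_fn(in_key, in_value):
    return [(word, 1) for word in in_value.split()]

pairs = map_fn(1, "you are in Cambridge")
# -> [("you", 1), ("are", 1), ("in", 1), ("Cambridge", 1)]
```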

Partition

The intermediate pairs are partitioned by key (by default, hash(key) mod R), so that all values for the same key end up at the same reduce worker:

(we, 1)
(you, 1)
(live, 1)
(are, 1)
(Cambridge, 1) (Cambridge, 1) (Cambridge, 1)
(in, 1) (in, 1)
(I, 1)
(like, 1)
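A minimal sketch of that partitioning scheme, assuming the default hash(key) mod R rule (R = 2 is an arbitrary choice for illustration):

```python
# Route intermediate (key, value) pairs into R partitions so that all
# pairs sharing a key land in the same partition, hence the same reducer.
def partition(pairs, R):
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[hash(key) % R].append((key, value))
    return buckets

pairs = [("you", 1), ("are", 1), ("in", 1), ("Cambridge", 1), ("in", 1)]
buckets = partition(pairs, R=2)
# Both ("in", 1) pairs end up in the same bucket.
```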

Reduce

reduce (out_key, list(intermediate_value)) → list(out_value)

For word count, reduce sums the values grouped under each key:

(you, 1)
(are, 1)
(in, 2)
(Cambridge, 3)
(I, 1)
(like, 1)
(we, 1)
(live, 1)
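The reduce step can be sketched the same way (Python; `groups` stands in for the shuffled output of the partition phase):

```python
# Word-count reduce: for one key, sum its list of intermediate values.
# Mirrors the slide's signature: reduce(out_key, list(intermediate_value)) -> list(out_value).
def reduce_fn(out_key, intermediate_values):
    return [sum(intermediate_values)]

groups = {"in": [1, 1], "Cambridge": [1, 1, 1], "you": [1]}
result = {key: reduce_fn(key, values)[0] for key, values in groups.items()}
# -> {"in": 2, "Cambridge": 3, "you": 1}
```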

Execution overview

(The next slides build up the paper's execution diagram step by step; only the flow is reproduced here.)

1. The user program forks a master process and worker processes.
2. The input files are divided into M splits (split 0 ... split 4 in the diagram); the master assigns map tasks and reduce tasks to idle workers.
3. Each map worker reads its input split and applies the map function (Map phase).
4. Intermediate pairs are written to the map worker's local disk, partitioned into R regions.
5. Reduce workers remote-read the intermediate files from the map workers' local disks (Reduce phase).
6. Each reduce worker applies the reduce function and writes one of the R output files.

Fine task granularity

Choose M so that each input split holds between 16MB and 64MB of data, and R as a small multiple of the number of worker machines. E.g. M = 200,000 and R = 5,000 on 2,000 workers.

Advantages:
- dynamic load balancing
- fault tolerance
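A quick back-of-the-envelope check of the slide's figures (the ~50MB split size is an assumption, chosen from the middle of the 16-64MB range):

```python
# Sanity-check the example granularity numbers from the slide.
workers = 2_000
M = 200_000      # map tasks
R = 5_000        # reduce tasks

# Assuming ~50MB per split, M = 200,000 splits is roughly 10TB of input.
split_mb = 50
input_tb = M * split_mb / 1_000_000

map_tasks_per_worker = M / workers      # each worker runs ~100 map tasks over the job
reduce_tasks_per_worker = R / workers   # R is a small multiple of the worker count
```

Running many more tasks than workers is what enables dynamic load balancing and cheap re-execution on failure.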

Fault tolerance

Workers:
- Failure detected via periodic heartbeat
- Completed and in-progress map tasks on a failed worker are re-executed (their output sits on the failed machine's local disk)
- In-progress reduce tasks are re-executed; completed reduce output is already in the global file system
- Task completion is committed through the master

Master:
- Failure not handled - considered unlikely, as there is a single master

Refinements

- Locality optimization
- Backup tasks
- Ordering guarantees
- Combiner function
- Skipping bad records
- Local execution
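Of these, the combiner function is easy to illustrate: it runs reduce-like pre-aggregation on each map worker's local output before the shuffle, shrinking the intermediate data sent over the network. A minimal sketch:

```python
from collections import defaultdict

# Combiner for word count: pre-aggregate (word, 1) pairs on the map worker
# before they are written to local disk and shuffled to the reducers.
def combine(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

local_output = [("in", 1), ("Cambridge", 1), ("in", 1), ("Cambridge", 1), ("Cambridge", 1)]
combined = combine(local_output)
# -> [("in", 2), ("Cambridge", 3)]: 2 pairs cross the network instead of 5.
```

This works for word count because the reduce function (summation) is associative and commutative.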

Performance

Tests run on a cluster of 1,800 machines, each with:
- Dual 2GHz Intel Xeon processors with Hyper-Threading enabled
- 4GB of memory
- Two 160GB IDE disks
- Gigabit Ethernet link

2 benchmarks:
- MR_Grep: 10^10 x 100-byte entries, 92k matches
- MR_Sort: 10^10 x 100-byte entries

MR_Grep

The MR_Grep run completes in ~150 seconds, including startup overhead of ~60 seconds.

MR_Sort (figure: data transfer rates over time for normal execution, execution with no backup tasks, and execution with 200 tasks killed)

Experience

Rewrite of the indexing system for Google web search

Large scale machine learning

Clustering for Google News

Data extraction for Google Zeitgeist

Large scale graph computations

Conclusions

MapReduce:
- useful abstraction
- simplifies large-scale computations
- easy to use

However:
- expensive for small applications
- long startup time (~1 min)
- chaining of map-reduce phases?