CS 744: MAPREDUCE
Transcript of CS 744: MAPREDUCE
![Page 1: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/1.jpg)
CS 744: MAPREDUCE
Shivaram VenkataramanFall 2020
![Page 2: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/2.jpg)
ANNOUNCEMENTS
• Assignment 1 deliverables– Code (comments, formatting)– Report
• Partitioning analysis (graphs, tables, figures etc.)• Persistence analysis (graphs, tables, figures etc.)• Fault-tolerance analysis (graphs, tables, figures etc.)
• See Piazza for Spark installation
![Page 3: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/3.jpg)
Scalable Storage Systems
Datacenter Architecture
Resource Management
Computational Engines
Machine Learning SQL Streaming Graph
Applications
![Page 4: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/4.jpg)
BACKGROUND: PTHREADSvoid *myThreadFun(void *vargp) {
sleep(1); printf(“Hello World\n"); return NULL;
}
int main() {
pthread_t thread_id_1, thread_id_2; pthread_create(&thread_id_1, NULL, myThreadFun, NULL); pthread_create(&thread_id_2, NULL, myThreadFun, NULL); pthread_join(thread_id_1, NULL); pthread_join(thread_id_2, NULL); exit(0);
}
![Page 5: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/5.jpg)
BACKGROUND: MPIint main(int argc, char** argv) {
MPI_Init(NULL, NULL);
// Get the number of processesint world_size;MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the processint world_rank;MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Print off a hello world messageprintf("Hello world from rank %d out of %d processors\n",
world_rank, world_size);
// Finalize the MPI environment.MPI_Finalize();
}
mpirun -n 4 -f host_file ./mpi_hello_world
![Page 6: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/6.jpg)
MOTIVATION
Build Google Web Search- Crawl documents, build inverted indexes etc.
Need for - automatic parallelization- network, disk optimization- handling of machine failures
![Page 7: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/7.jpg)
OUTLINE
- Programming Model- Execution Overview- Fault Tolerance- Optimizations
![Page 8: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/8.jpg)
PROGRAMMING MODEL
Data type: Each record is (key, value)
Map function:(Kin, Vin) à list(Kinter, Vinter)
Reduce function:(Kinter, list(Vinter)) à list(Kout, Vout)
![Page 9: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/9.jpg)
Example: Word Count
def mapper(line):for word in line.split():
output(word, 1)
def reducer(key, values):output(key, sum(values))
![Page 10: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/10.jpg)
Word Count Execution
the quickbrown fox
the fox ate the mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
Input Map Shuffle & Sort Reduce Output
![Page 11: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/11.jpg)
Word Count Execution
the quickbrown fox
the fox ate the mouse
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown, 2fox, 2how, 1now, 1the, 3
ate, 1cow, 1
mouse, 1quick, 1
the, 1brown, 1
fox, 1
quick, 1
the, 1fox, 1the, 1
how, 1now, 1
brown, 1
ate, 1mouse, 1
cow, 1
Input Map Shuffle & Sort Reduce Output
![Page 12: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/12.jpg)
ASSUMPTIONS
![Page 13: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/13.jpg)
ASSUMPTIONS
1. Commodity networking, less bisection bandwidth2. Failures are common3. Local storage is cheap4. Replicated FS
![Page 14: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/14.jpg)
Word Count Execution
the quickbrown fox
Map Map
the fox ate the mouse
Map
how nowbrown cow
Automatically split work
Schedule taskswith locality
JobTrackerSubmit a Job
![Page 15: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/15.jpg)
Fault RecoveryIf a task crashes:
– Retry on another node– If the same task repeatedly fails, end the job
the quickbrown fox
Map Map
the fox ate the mouse
Map
how nowbrown cow
![Page 16: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/16.jpg)
Fault Recovery
If a node crashes:– Relaunch its current tasks on other nodes
What about task inputs ? File system replication
the quickbrown fox
Map Map
the fox ate the mouse
Map
how nowbrown cow
![Page 17: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/17.jpg)
the quickbrown fox
Map
Fault Recovery
If a task is going slowly (straggler):– Launch second copy of task on another node– Take the output of whichever finishes first
the quickbrown fox
Map
the fox ate the mouse
Map
how nowbrown cow
![Page 18: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/18.jpg)
MORE DESIGN
Master failure
Locality
Task Granularity
![Page 19: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/19.jpg)
MAPREDUCE: SUMMARY
- Simplify programming on large clusters with frequent failures
- Limited but general functional API- Map, Reduce, Sort- No other synchronization / communication
- Fault recovery, straggler mitigation through retries
![Page 20: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/20.jpg)
DISCUSSIONhttps://forms.gle/mAHD4QuMXko7vnjB6
![Page 21: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/21.jpg)
DISCUSSION
List one similarity and one difference between MPI and MapReduce
![Page 22: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/22.jpg)
DISCUSSION
Indexing pipeline where you start with HTML documents. You want to index the documents after removing the most commonly occurring words. 1. Compute most common words.2. Remove them and build the index. What are the main shortcomings of using MapReduce to do this?
![Page 23: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/23.jpg)
![Page 24: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/24.jpg)
Jeff Dean, LADIS 2009
![Page 25: CS 744: MAPREDUCE](https://reader031.fdocuments.net/reader031/viewer/2022022504/621710866276cc342a6c056e/html5/thumbnails/25.jpg)
NEXT STEPS
• Next lecture: Spark• Assignment 1: Use Piazza!