CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744:...

24
CS 744: MAPREDUCE Shivaram Venkataraman Fall 2019

Transcript of CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744:...

Page 1: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

CS 744: MAPREDUCE

Shivaram Venkataraman Fall 2019

Shivaram
Page 2: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

ANNOUNCEMENTS

•  Assignment 1 out •  CloudLab notes on Piazza •  No teams yet?

Page 3: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Scalable Storage Systems

Datacenter Architecture

Resource Management

Computational Engines

Machine Learning SQL Streaming Graph

Applications

Shivaram
Shivaram
Page 4: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

BACKGROUND: PTHREADS void*myThreadFun(void*vargp){sleep(1);printf(“HelloWorld\n");returnNULL;}intmain(){pthread_tthread_id_1,thread_id_2;pthread_create(&thread_id_1,NULL,myThreadFun,NULL);pthread_create(&thread_id_2,NULL,myThreadFun,NULL);pthread_join(thread_id_1,NULL);pthread_join(thread_id_2,NULL);exit(0);}

Shivaram
Shivaram
Page 5: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

BACKGROUND: MPI intmain(intargc,char**argv){MPI_Init(NULL,NULL);//Getthenumberofprocessesintworld_size;MPI_Comm_size(MPI_COMM_WORLD,&world_size);//Gettherankoftheprocessintworld_rank;MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);//Printoffahelloworldmessageprintf("Helloworldfromrank%doutof%dprocessors\n",world_rank,world_size);//FinalizetheMPIenvironment.MPI_Finalize();}

mpirun-n4-fhost_file./mpi_hello_world

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 6: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

MOTIVATION

Build Google Web Search - Crawl documents, build inverted indexes etc.

Need for

- automatic parallelization - network, disk optimization - handling of machine failures

Shivaram
Shivaram
Shivaram
Page 7: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

OUTLINE

-  Programming Model -  Execution Overview -  Fault Tolerance -  Optimizations

Page 8: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

PROGRAMMING MODEL

Data type: Each record is (key, value) Map function:

(Kin, Vin) à list(Kinter, Vinter)

Reduce function: (Kinter, list(Vinter)) à list(Kout, Vout)

Shivaram
Page 9: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Example: Word Count

def mapper(line): for word in line.split(): output(word, 1) def reducer(key, values): output(key, sum(values))

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 10: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Word Count Execution

the quick brown fox

the fox ate the mouse

how now brown cow

Map

Map

Map

Reduce

Reduce

Input Map Shuffle & Sort Reduce Output

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 11: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Word Count Execution

the quick brown fox

the fox ate the mouse

how now brown cow

Map

Map

Map

Reduce

Reduce

brown, 2 fox, 2 how, 1 now, 1 the, 3

ate, 1 cow, 1

mouse, 1 quick, 1

the, 1 brown, 1

fox, 1

quick, 1

the, 1 fox, 1 the, 1

how, 1 now, 1

brown, 1

ate, 1 mouse, 1

cow, 1

Input Map Shuffle & Sort Reduce Output

Page 12: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

ASSUMPTIONS

Page 13: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

ASSUMPTIONS

1.  Commodity networking, less bisection bandwidth 2.  Failures are common 3.  Local storage is cheap 4.  Replicated FS

Shivaram
Shivaram
Shivaram
Shivaram
Page 14: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Word Count Execution

the quick brown fox

Map Map

the fox ate the mouse

Map

how now brown cow

Automatically split work

Schedule tasks with locality

JobTracker Submit a Job

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 15: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Fault Recovery If a task crashes:

–  Retry on another node –  If the same task repeatedly fails, end the job

the quick brown fox

Map Map

the fox ate the mouse

Map

how now brown cow

Page 16: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Fault Recovery

If a node crashes: –  Relaunch its current tasks on other nodes What about task inputs ? File system replication

the quick brown fox

Map Map

the fox ate the mouse

Map

how now brown cow

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 17: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

the quick brown fox

Map

Fault Recovery

If a task is going slowly (straggler): –  Launch second copy of task on another node –  Take the output of whichever finishes first

the quick brown fox

Map

the fox ate the mouse

Map

how now brown cow

Shivaram
Shivaram
Page 18: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

MORE DESIGN

Master failure Locality Task Granularity

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 19: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

REFINEMENTS

- Combiner functions -  Counters

-  Skipping bad records

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 20: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

Jeff Dean, LADIS 2009

Shivaram
Shivaram
Shivaram
Shivaram
Page 21: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

DISCUSSION https://forms.gle/hK8wFDxBDfS6chD28

Page 22: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

DISCUSSION

Indexing pipeline where you start with HTML documents. You want to index the documents after removing the most commonly occurring words. 1.  Compute most common words. 2.  Remove them and build the index. What are the main shortcomings of using MapReduce?

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 23: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

DISCUSSION

Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Shivaram
Page 24: CS 744: MAPREDUCEpages.cs.wisc.edu/~shivaram/cs744-fa19-slides/cs744-mapred-notes.pdfCS 744: MAPREDUCE Shivaram Venkataraman Fall 2019 . ANNOUNCEMENTS • Assignment 1 out • CloudLab

NEXT STEPS

•  Next lecture: Spark •  Assignment 1: Use Piazza! •  Project topics: End of this week