Transcript of Map Reduce slides — ics.uci.edulopes/teaching/cs221W17/slides/MapReduce.…
Map Reduce Information Retrieval
© Crista Lopes, UCI
Example: term frequency
• Given one or more text files, produce the list of words and their frequencies, and display the N (e.g. 25) most frequent words and their corresponding frequencies
• Trivial problem
• How to do this?
• Many styles, i.e. ways of thinking about the problem
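As a baseline, the task itself is small in Python. A minimal sketch using only the standard library (the stop-word list is an assumption; the course assignment supplies its own):

```python
import re
import sys
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def term_frequencies(text, n=25):
    """Return the n most frequent non-stop words in text."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for word, freq in term_frequencies(f.read()):
            print(word, "-", freq)
```

The styles that follow all solve this same problem, differing only in how the work is organized.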
Style 1: Monolithic
• The problem is not subdivided into smaller problems at all; instead, it is solved as a whole. The programming task consists of defining the data and control flow that rule this monolith.
• Main characteristics of this style:
• One monolithic conceptual unit that takes input and produces the desired output
• For all but the simplest problems, the control flow becomes very large
• Potentially small memory footprint
[Flowchart: monolithic control flow — read each line from the file; get the next character; on EOF, display. Track whether we are inside a word using an alphanumeric test: a new word is detected on the first alphanumeric character, and the end of a word is detected (word emitted) on the first non-alphanumeric one. If the emitted word is a stop word, skip it; if it has been seen before, increment the freq of wi and, while Freq(wi) > Freq(wi-1), swap wi and wi-1 to keep the list sorted; otherwise add the word with freq=1. On EOL, read the next line.]
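The flowchart above can be sketched as one big loop; a minimal sketch, assuming a simplified stop-word set and line-by-line input:

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def monolithic_term_frequency(lines, n=25):
    """One monolithic loop mirroring the flowchart: scan characters,
    detect word boundaries, and keep the frequency list sorted as we go."""
    word_freqs = []  # [word, freq] pairs, kept sorted by descending freq
    for line in lines:
        start = None
        for i, c in enumerate(line + " "):  # trailing space terminates a final word
            if start is None:
                if c.isalnum():              # new word detected
                    start = i
            elif not c.isalnum():            # end of word detected; emit word
                word = line[start:i].lower()
                start = None
                if word in STOP_WORDS:
                    continue
                for j, pair in enumerate(word_freqs):
                    if pair[0] == word:      # seen before
                        pair[1] += 1         # increment freq of wi
                        # swap wi, wi-1 while Freq(wi) > Freq(wi-1)
                        while j > 0 and word_freqs[j][1] > word_freqs[j - 1][1]:
                            word_freqs[j], word_freqs[j - 1] = \
                                word_freqs[j - 1], word_freqs[j]
                            j -= 1
                        break
                else:
                    word_freqs.append([word, 1])  # add the word with freq=1
    return word_freqs[:n]
```

Note how the counting, filtering, and sorting logic are all entangled in one control flow, which is exactly what the later styles untangle.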
Style 2: Functional
• Main characteristics of this style:
• Functions take input, produce output, don't have side effects
• There is no shared state
• The large problem is solved by composing functions one after the other, as a faithful reproduction of mathematical function composition f ◦ g ("f after g")
[Pipeline: file → Read File → Filter Chars → Normalize → Scan → Remove Stop Words → Count Frequencies → Sort → display]
sort(frequencies(remove_stop_words(scan(normalize(filter_chars(\
read_file(sys.argv[1])))))))
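A runnable sketch of the functions being composed (the bodies and the stop-word list are assumptions; only the names follow the slide):

```python
import re
import sys

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def read_file(path):
    with open(path) as f:
        return f.read()

def filter_chars(text):
    return re.sub(r"[\W_]+", " ", text)     # punctuation -> spaces

def normalize(text):
    return text.lower()

def scan(text):
    return text.split()

def remove_stop_words(words):
    return [w for w in words if w not in STOP_WORDS]

def frequencies(words):
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return freqs

def sort(freqs):
    return sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)

# Pure composition, as on the slide:
# sort(frequencies(remove_stop_words(scan(normalize(filter_chars(
#     read_file(sys.argv[1])))))))
```

Each function takes input and returns output with no shared state, so the whole program is one nested function application.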
Style 3: Imperative
• The conceptual processing units are similar to those of the functional style; the radical difference is in the way the units share data. In this style, the units all share a pool of state, which they read and modify.
• Main characteristics of this style:
• Units may or may not take input, and don't produce output relevant to the main problem
• Units change the state of global, shared data
• The large problem is solved by issuing commands, one after the other, that change, or add to, the global state
[Diagram: file → Read File → Filter Chars + Normalize → Scan → Remove Stop Words → Count Frequencies → Sort → display, all procedures reading and modifying shared pools: Char Data, Word Data, Word Freqs]

read_file(sys.argv[1])
filter_chars_and_normalize()
scan()
remove_stop_words()
frequencies()
sort()
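A sketch of those procedures mutating the shared pools (bodies and stop-word list are assumptions; names follow the slide):

```python
# Shared pools of state, read and modified by every procedure.
data = []        # Char Data
words = []       # Word Data
word_freqs = []  # Word Freqs
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # assumed list

def read_file(path):
    global data
    with open(path) as f:
        data = list(f.read())

def filter_chars_and_normalize():
    global data
    data = [c.lower() if c.isalnum() else " " for c in data]

def scan():
    global words
    words = "".join(data).split()

def remove_stop_words():
    global words
    words = [w for w in words if w not in STOP_WORDS]

def frequencies():
    for w in words:
        for pair in word_freqs:
            if pair[0] == w:
                pair[1] += 1
                break
        else:
            word_freqs.append([w, 1])

def sort():
    word_freqs.sort(key=lambda pair: pair[1], reverse=True)
```

None of the procedures returns anything useful; the answer accumulates in the global word_freqs as the commands are issued in order.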
Many more styles.....
Map-Reduce, Basic
• The key observation is that this problem of counting words can be done in a divide-and-conquer style over the input data:
• split the input into chunks, and process each chunk independently so as to produce data that can then be reduced into counts of words.
• Main characteristics of this style:
• Two key abstractions:
• (1) map takes chunks of data and applies a given function to each chunk independently, producing a collection of results;
• (2) reduce takes a collection of results and applies a given function to them in order to extract some global knowledge out of those results
• Uses functions as data, i.e. as input to other functions
• The map operation over independent chunks of data can be parallelized, potentially resulting in considerable performance gains
[Diagram: file → Read File → Partition → Map (Split Words) over [chunk1, chunk2, ...] → Reduce (Count Words) over [result1, result2, ...] → Sort → display; Split Words and Count Words are passed as functions]

splits = map(split_words, partition(read_file(sys.argv[1]), 200))
splits.insert(0, [])  # normalize input to reduce
word_freqs = sort(reduce(count_words, splits))
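In Python 3 the slide's code needs functools.reduce and an explicit list() around the lazy map. A runnable sketch, in which the helper bodies, stop-word list, and chunk size are assumptions:

```python
import re
from functools import reduce

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # assumed list

def partition(text, nlines):
    """Split the input into chunks of nlines lines each."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + nlines]) for i in range(0, len(lines), nlines)]

def split_words(chunk):
    """Map step: turn one chunk into a list of (word, 1) pairs."""
    words = re.findall(r"[a-z]{2,}", chunk.lower())
    return [(w, 1) for w in words if w not in STOP_WORDS]

def count_words(pairs1, pairs2):
    """Reduce step: merge two lists of (word, count) pairs."""
    counts = dict(pairs1)
    for word, n in pairs2:
        counts[word] = counts.get(word, 0) + n
    return list(counts.items())

def sort(pairs):
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)

# As on the slide (read_file as in the earlier styles):
# splits = list(map(split_words, partition(read_file(sys.argv[1]), 200)))
# splits.insert(0, [])  # normalize input to reduce
# word_freqs = sort(reduce(count_words, splits))
```

The map over chunks is embarrassingly parallel; the reduce, as written, is a serial left fold over the chunk results.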
Map-Reduce, Hadoop
• The previous style allows for parallelization of the map step, but requires serialization of the reduce step. Google map-reduce and Hadoop use a slight variation that makes the reduce step also potentially parallelizable. The main idea is to regroup, or reshuffle, the list of results from the map step so that the regroupings are amenable to further mapping of a reducible function.
• Main characteristics of this style:
• Similar to basic map-reduce, but with additional regroup (aka shuffling) step, followed by another map that maps a reducible function to a collection of inputs.
[Diagram: file → Read File → Partition → Map (Split Words) over [chunk1, chunk2, ...] → Regroup into [(w1, (...)), (w2, (...)), ...] → Map (Count Words) over the groups, producing [result1, result2, ...] → Sort → display; Split Words and Count Words are passed as functions, and the reduce happens inside Count Words]

splits = map(split_words, partition(read_file(sys.argv[1]), 200))
splits_per_word = regroup(splits)
word_freqs = sort(map(count_words, splits_per_word.items()))
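A sketch of the reshuffled version: regroup keys the pairs by word, so each word's group can be reduced independently (and hence in parallel). Helper bodies and the stop-word list are assumptions; names follow the slide:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # assumed list

def partition(text, nlines):
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + nlines]) for i in range(0, len(lines), nlines)]

def split_words(chunk):
    words = re.findall(r"[a-z]{2,}", chunk.lower())
    return [(w, 1) for w in words if w not in STOP_WORDS]

def regroup(splits):
    """Shuffle step: {word: [(word, 1), (word, 1), ...]}."""
    grouped = {}
    for pairs in splits:
        for word, n in pairs:
            grouped.setdefault(word, []).append((word, n))
    return grouped

def count_words(item):
    """The 'reduce inside': sum the counts for one word's group."""
    word, pairs = item
    return (word, sum(n for _, n in pairs))

def sort(pairs):
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)

# As on the slide (read_file as in the earlier styles):
# splits = list(map(split_words, partition(read_file(sys.argv[1]), 200)))
# splits_per_word = regroup(splits)
# word_freqs = sort(map(count_words, splits_per_word.items()))
```

Because each (word, group) item is reduced independently, the second map is just as parallelizable as the first, which is the essence of the Google/Hadoop variation.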