Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Posted 13-Sep-2014
Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 3 September 8, 2011
Matt Lease
School of Information
University of Texas at Austin
ml at ischool dot utexas dot edu
Jason Baldridge
Department of Linguistics
University of Texas at Austin
Jasonbaldridge at gmail dot com
Acknowledgments
Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of the following excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Today’s Agenda
• Review
• Toward MapReduce “design patterns”
– Building block: preserving state across calls
– In-Map & In-Mapper combining (vs. combiners)
– Secondary sorting (via value-to-key Conversion)
– Pairs and Stripes
– Order Inversion
• Group Work (examples)
– Interlude: scaling counts, TF-IDF
Review
MapReduce: Recap
Required:
map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
All values with the same key are reduced together
Optional:
partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]
Often a simple hash of the key, e.g., hash(k’) mod n
Divides up key space for parallel reduce operations
combine ( K2, list(V2) ) → list ( K2, V2 )
Mini-reducers that run in memory after the map phase
Used as an optimization to reduce network traffic
The execution framework handles everything else…
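The required and optional pieces above can be illustrated with a toy, single-process sketch in Python. This is a stand-in for the framework, not Hadoop itself; names like run_mapreduce are illustrative only.

```python
# Toy sketch of the MapReduce contract: map emits (K2, V2) pairs, the
# framework groups values by key (shuffle), and reduce folds each group.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k1, v1 in records:                    # map phase over (K1, V1) inputs
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)       # shuffle: group values by key
    # reduce phase: one reduce call per distinct intermediate key
    return {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}

# Word count expressed in this contract.
def wc_map(docid, text):
    for word in text.split():
        yield word, 1

def wc_reduce(term, counts):
    return sum(counts)

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```

A partitioner would slot in between the two phases, e.g. assigning key k2 to reducer hash(k2) mod N.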
“Everything Else”
The execution framework handles everything else…
Scheduling: assigns workers to map and reduce tasks
Data distribution: moves processes to data
Synchronization: gathers, sorts, and shuffles intermediate data
Errors and faults: detects worker failures and restarts
Limited control over data and execution flow
All algorithms must be expressed in m, r, c, p
You don’t know:
Where mappers and reducers run
When a mapper or reducer begins or finishes
Which input a particular mapper is processing
Which intermediate key a particular reducer is processing
Diagram: mappers emit (k, v) pairs from their input splits; combiners aggregate each mapper’s output locally; partitioners assign keys to reducers; “Shuffle and Sort” aggregates values by key across the network; reducers produce the final (r, s) output pairs.
Shuffle and Sort
Diagram: map output goes to a circular buffer (in memory), which spills to disk as it fills; spill files are merged on disk, with the Combiner optionally applied, into intermediate files on disk; reducers fetch and merge these files from this mapper and other mappers, while this mapper’s output also feeds other reducers.
Shuffle and 2 Sorts
As map emits values, local sorting runs in tandem (1st sort)
Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)
Partition determines which (logical) reducer Rj each key will go to
Node’s TaskTracker tells JobTracker it has keys for Rj
JobTracker determines node to run Rj based on data locality
When local map/combine/sort finishes, sends data to Rj’s node
Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
For each (K, list(V)) tuple in merged output, call reduce(…)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178
Scalable Hadoop Algorithms: Themes
Avoid object creation
Inherently costly operation
Garbage collection
Avoid buffering
Limited heap size
Works for small datasets, but won’t scale!
• Yet… we’ll talk about patterns involving buffering…
Importance of Local Aggregation
Ideal scaling characteristics:
Twice the data, twice the running time
Twice the resources, half the running time
Why can’t we achieve this?
Synchronization requires communication
Communication kills performance
Thus… avoid communication!
Reduce intermediate data via local aggregation
Combiners can help
Tools for Synchronization
Cleverly-constructed data structures
Bring partial results together
Sort order of intermediate keys
Control order in which reducers process keys
Partitioner
Control which reducer processes which keys
Preserving state in mappers and reducers
Capture dependencies across multiple keys and values
Secondary Sorting
MapReduce sorts input to reducers by key
Values may be arbitrarily ordered
What if we want to sort values as well?
E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…
Solutions?
Swap key and value to sort by value?
What if we use (k,v) as a joint key (and change nothing else)?
Secondary Sorting: Solutions
Solution 1: Buffer values in memory, then sort
Tradeoffs?
Solution 2: “Value-to-key conversion” design pattern
Form composite intermediate key: (k, v1)
Let execution framework do the sorting
Preserve state across multiple key-value pairs
…how do we make this happen?
Secondary Sorting (Lin 57, White 241)
Create composite key: (k,v)
Define a Key Comparator to sort via both
Possibly not needed in some cases (e.g. strings & concatenation)
Define a partition function based only on the (original) key
All pairs with same key should go to same reducer
Multiple keys may still go to the same reduce node; how do you know when the key changes across invocations of reduce()?
• i.e. assume you want to do something with all values associated with a given key (e.g. print all on the same line, with no other keys)
Preserve state in the reducer across invocations
reduce() will be called separately for each pair, but we need to track the current key so we can detect when it changes
Hadoop also provides a Group Comparator to control which keys are grouped into a single reduce() call
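The value-to-key pattern can be sketched in plain Python (illustrative only, not Hadoop API code): form a composite key (k, v), partition on the original key k alone, let sorting order the composite keys, and track the current key in the reducer.

```python
# Sketch of "value-to-key conversion": partition on the original key only,
# sort on the composite key (k, v), and detect key changes in reducer state.
def secondary_sort(pairs, num_reducers=2):
    # Partition on the original key k so all (k, *) reach one reducer.
    partitions = [[] for _ in range(num_reducers)]
    for k, v in pairs:
        partitions[hash(k) % num_reducers].append((k, v))
    out = []
    for part in partitions:
        part.sort()                       # framework sort on composite (k, v)
        current_key, values = None, []    # state preserved across "calls"
        for k, v in part:
            if current_key is not None and k != current_key:
                out.append((current_key, values))
                values = []
            current_key = k
            values.append(v)
        if current_key is not None:
            out.append((current_key, values))
    return out

grouped = secondary_sort([("k1", 8), ("k1", 1), ("k2", 5), ("k1", 3)])
# each key's values come out sorted, e.g. ("k1", [1, 3, 8])
```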
Preserving State in Hadoop
Diagram: one Mapper object per task holds state across its lifecycle: configure (API initialization hook), map (one call per input key-value pair), close (API cleanup hook). Likewise, one Reducer object per task holds state across configure, reduce (one call per intermediate key), and close.
Combiner Design
Combiners and reducers share the same method signature
Sometimes, reducers can serve as combiners
Often, not…
Remember: combiners are optional optimizations
Should not affect algorithm correctness
May be run 0, 1, or multiple times
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
Combiner?
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
MapReduce Algorithm Design
Design Pattern for Local Aggregation
“In-mapper combining”
Fold the functionality of the combiner into the mapper, including preserving state across multiple map calls
Advantages
Speed
Why is this faster than actual combiners?
• Construction/deconstruction, serialization/deserialization
• Guarantee and control use
Disadvantages
Buffering! Explicit memory management required
• Can use disk-backed buffer, based on # items or bytes in memory
• What if multiple mappers running on the same node? Do we know?
Potential for order-dependent bugs
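The pattern can be sketched as a Python stand-in for the Hadoop Mapper lifecycle (configure/map/close); class and method names are illustrative, not Hadoop API code.

```python
# In-mapper combining for word count: the mapper buffers partial counts in
# a dict across map() calls and emits the aggregated pairs once in close().
class InMapperCombiningWordCount:
    def __init__(self):            # plays the role of configure()
        self.counts = {}           # state preserved across map() calls

    def map(self, docid, text):
        for word in text.split():
            self.counts[word] = self.counts.get(word, 0) + 1
        # note: nothing is emitted per call; buffering is explicit

    def close(self):               # emit aggregated pairs once per task
        return sorted(self.counts.items())

m = InMapperCombiningWordCount()
m.map("d1", "a b a")
m.map("d2", "b c")
emitted = m.close()
# emitted == [("a", 2), ("b", 2), ("c", 1)]
```

The buffer is the disadvantage the slide names: a real implementation would have to bound the dict (e.g. flush after N items) to stay within the heap.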
“Hello World”: Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, sum);
Combine = reduce
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
Word Count: in-map combining
Are combiners still needed?
Word Count: in-mapper combining
Are combiners still needed?
Example 2: Compute the Mean (v1)
Why can’t we use reducer as combiner?
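A small numeric illustration of the problem (toy Python, not Hadoop code): the mean of partial means is not the overall mean, so a reducer that divides sum by count is not associative and cannot run as a combiner. Emitting (sum, count) pairs fixes this.

```python
# Why the mean reducer cannot double as a combiner.
partition1 = [1, 2, 3]
partition2 = [10]

# Wrong: combine each partition down to its mean, then average the means.
mean_of_means = (sum(partition1) / len(partition1)
                 + sum(partition2) / len(partition2)) / 2
# mean_of_means == 6.0, but the true mean of [1, 2, 3, 10] is 4.0

# Right: combiners emit (sum, count); the reducer divides once at the end.
s1, c1 = sum(partition1), len(partition1)
s2, c2 = sum(partition2), len(partition2)
true_mean = (s1 + s2) / (c1 + c2)   # == 4.0
```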
Example 2: Compute the Mean (v2)
Why doesn’t this work?
Example 2: Compute the Mean (v3)
Computing the Mean: in-mapper combining
Are combiners still needed?
Example 3: Term Co-occurrence
Term co-occurrence matrix for a text collection
M = N x N matrix (N = vocabulary size)
Mij: number of times i and j co-occur in some context
(for concreteness, let’s say context = sentence)
Why?
Distributional profiles as a way of measuring semantic distance
Semantic distance useful for many language processing tasks
MapReduce: Large Counting Problems
Term co-occurrence matrix for a text collection
= specific instance of a large counting problem
A large event space (number of terms)
A large number of observations (the collection itself)
Goal: keep track of interesting statistics about the events
Basic approach
Mappers generate partial counts
Reducers aggregate partial counts
How do we aggregate partial counts efficiently?
Approach 1: “Pairs”
Each mapper takes a sentence:
Generate all co-occurring term pairs
For all pairs, emit (a, b) → count
Reducers sum up counts associated with these pairs
Use combiners!
Pairs: Pseudo-Code
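The slide shows the pseudo-code as an image; a minimal Python sketch of the pairs approach, assuming context = sentence and counting each co-occurring pair of distinct terms once per sentence:

```python
# "Pairs": the mapper emits ((a, b), 1) for every co-occurring term pair;
# the reducer sums counts per pair.
from collections import defaultdict
from itertools import permutations

def pairs_map(sentence):
    for a, b in permutations(set(sentence), 2):  # all ordered co-occurring pairs
        yield (a, b), 1

def pairs_reduce(pair_counts):
    totals = defaultdict(int)
    for pair, n in pair_counts:
        totals[pair] += n                        # sum partial counts per pair
    return dict(totals)

sentences = [["a", "b"], ["a", "c"], ["a", "b"]]
emitted = [kv for s in sentences for kv in pairs_map(s)]
counts = pairs_reduce(emitted)
# counts[("a", "b")] == 2, counts[("a", "c")] == 1
```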
“Pairs” Analysis
Advantages
Easy to implement, easy to understand
Disadvantages
Lots of pairs to sort and shuffle around (upper bound?)
Not many opportunities for combiners to work
Another Try: “Stripes”
Idea: group together pairs into an associative array
Each mapper takes a sentence:
Generate all co-occurring term pairs
For each term, emit a → { b: countb, c: countc, d: countd … }
Reducers perform element-wise sum of associative arrays
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2
⇒ a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
Stripes: Pseudo-Code
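The slide shows the pseudo-code as an image; a minimal Python sketch of the stripes approach under the same assumptions as the pairs sketch (context = sentence, distinct terms counted once per sentence):

```python
# "Stripes": the mapper emits one associative array (stripe) per term;
# the reducer performs an element-wise sum of stripes.
from collections import defaultdict

def stripes_map(sentence):
    for a in set(sentence):
        yield a, {b: 1 for b in set(sentence) if b != a}

def stripes_reduce(term_stripes):
    totals = defaultdict(lambda: defaultdict(int))
    for term, stripe in term_stripes:
        for b, n in stripe.items():        # element-wise sum
            totals[term][b] += n
    return {t: dict(s) for t, s in totals.items()}

sentences = [["a", "b"], ["a", "c"], ["a", "b"]]
emitted = [kv for s in sentences for kv in stripes_map(s)]
counts = stripes_reduce(emitted)
# counts["a"] == {"b": 2, "c": 1}
```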
“Stripes” Analysis
Advantages
Far less sorting and shuffling of key-value pairs
Can make better use of combiners
Disadvantages
More difficult to implement
Underlying object more heavyweight
Fundamental limitation in terms of size of event space
• Buffering!
Cluster size: 38 cores
Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3),
which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)
Relative Frequencies
How do we estimate relative frequencies from counts?
Why do we want to do this?
How do we do this with MapReduce?
f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)
f(B|A): “Stripes”
Easy!
One pass to compute (a, *)
Another pass to directly compute f(B|A)
a → {b1:3, b2 :12, b3 :7, b4 :1, … }
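With a stripe in hand, both passes collapse into simple arithmetic per key: sum the stripe's counts for the marginal count(a, *), then normalize each entry. A sketch under the slide's notation:

```python
# Turn a stripe of co-occurrence counts into relative frequencies f(B|A).
def stripe_to_relfreq(stripe):
    marginal = sum(stripe.values())          # count(a, *) = sum_b count(a, b)
    return {b: n / marginal for b, n in stripe.items()}

f = stripe_to_relfreq({"b1": 3, "b2": 12, "b3": 7, "b4": 1, "b5": 9})
# marginal == 32, so f["b2"] == 12 / 32 == 0.375
```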
f(B|A): “Pairs”
For this to work:
Must emit extra (a, *) for every bn in mapper
Must make sure all a’s get sent to same reducer (use partitioner)
Must make sure (a, *) comes first (define sort order)
Must hold state in reducer across different key-value pairs
(a, b1) → 3
(a, b2) → 12
(a, b3) → 7
(a, b4) → 1
…
(a, *) → 32
(a, b1) → 3 / 32
(a, b2) → 12 / 32
(a, b3) → 7 / 32
(a, b4) → 1 / 32
…
Reducer holds this value in memory
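The four requirements above can be sketched in Python; here "*" is modeled as the empty string so the composite key (a, *) sorts before every (a, b), and the marginal is held in reducer state. Illustrative only, not Hadoop code.

```python
# Order inversion for "pairs": (a, *) arrives first, the reducer remembers
# the marginal, and every later (a, b) is normalized on the fly.
STAR = ""   # empty string sorts before any term, standing in for "*"

def relfreq_reduce(sorted_pairs):
    out = {}
    marginal = None                      # state held across reduce() calls
    for (a, b), count in sorted_pairs:
        if b == STAR:
            marginal = count             # arrives first thanks to sort order
        else:
            out[(a, b)] = count / marginal
    return out

pairs = sorted({("a", "b1"): 3, ("a", "b2"): 12, ("a", STAR): 15}.items())
f = relfreq_reduce(pairs)
# f[("a", "b2")] == 12 / 15 == 0.8
```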
“Order Inversion”
Common design pattern
Computing relative frequencies requires marginal counts
But the marginal cannot be computed until you see all counts
Buffering is a bad idea!
Trick: get the marginal counts to arrive at the reducer before the joint counts
Optimizations
Apply in-memory combining pattern to accumulate marginal counts
Should we apply combiners?
Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
Sort keys into correct order of computation
Partition key space so that each reducer gets the appropriate set of partial results
Hold state in reducer across multiple key-value pairs to perform computation
Illustrated by the “pairs” approach
Approach 2: construct data structures that bring partial results together
Each reducer receives all the data it needs to complete the computation
Illustrated by the “stripes” approach
Recap: Tools for Synchronization
Cleverly-constructed data structures
Bring data together
Sort order of intermediate keys
Control order in which reducers process keys
Partitioner
Control which reducer processes which keys
Preserving state in mappers and reducers
Capture dependencies across multiple keys and values
Issues and Tradeoffs
Number of key-value pairs
Object creation overhead
Time for sorting and shuffling pairs across the network
Size of each key-value pair
De/serialization overhead
Local aggregation
Opportunities to perform local aggregation vary
Combiners make a big difference
Combiners vs. in-mapper combining
RAM vs. disk vs. network
Group Work (Examples)
Task 5
How many distinct words in the document collection start with each letter?
Note: “types” vs. “tokens”
Task 5
How many distinct words in the document collection start with each letter?
Note: “types” vs. “tokens”
Ways to make more efficient?
Mapper<String,String String,String>
Map(String docID, String document)
for each word in document
emit (first character, word)
Task 5
How many distinct words in the document collection start with each letter?
Note: “types” vs. “tokens”
Ways to make more efficient?
Mapper<String,String String,String>
Map(String docID, String document)
for each word in document
emit (first character, word)
Reducer<String,String String,Integer>
Reduce(String letter, Iterator<String> words):
set of words = empty set;
for each word
add word to set
emit(letter, size of word set)
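The mapper and reducer above can be sketched as runnable Python (function names are illustrative): the reducer's set keeps word types and drops repeated tokens.

```python
# Task 5 sketch: count distinct word types per first letter.
from collections import defaultdict

def map_docs(docs):
    for text in docs:
        for word in text.split():
            yield word[0], word          # (first character, word)

def reduce_distinct(pairs):
    seen = defaultdict(set)
    for letter, word in pairs:
        seen[letter].add(word)           # set keeps types, drops repeat tokens
    return {letter: len(words) for letter, words in seen.items()}

counts = reduce_distinct(map_docs(["apple ant apple", "bat ant"]))
# counts == {"a": 2, "b": 1}: types "apple" and "ant" under "a", "bat" under "b"
```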
Task 5b
How many distinct words in the document collection start with each letter?
How to use in-mapper combining and a separate combiner?
Tradeoffs?
Mapper<String,String String,String>
Map(String docID, String document)
for each word in document
emit (first character, word)
Task 5b
How many distinct words in the document collection start with each letter?
How to use in-mapper combining and a separate combiner?
Tradeoffs?
Mapper<String,String String,String>
Map(String docID, String document)
for each word in document
emit (first character, word)
Combiner<String,String String,String>
Combine(String letter, Iterator<String> words):
set of words = empty set;
for each word
add word to set
for each word in set
emit(letter, word)
Task 6: find median document length
Task 6: find median document length
Mapper<K1,V1 Integer,Integer>
Map(K1 xx, V1 xx)
10,000 / N times
emit( length(generateRandomDocument()), 1)
Task 6: find median document length
conf.setNumReduceTasks(1)
Problems with this solution?
Mapper<K1,V1 Integer,Integer>
Map(K1 xx, V1 xx)
10,000 / N times
emit( length(generateRandomDocument()), 1)
Reducer<Integer,Integer Integer,V3>
Reduce(Integer length, Iterator<Integer> values):
static list lengths = empty list;
for each value
append length to list
Close() { output median }
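The reducer's logic can be sketched in Python: with a single reduce task, the framework delivers the length keys in sorted order with their lists of 1s, so the reducer only accumulates and picks the middle element at the end. (The even-size tie case is ignored for brevity.)

```python
# Single-reducer median sketch: lengths arrive as sorted (length, [1, 1, ...])
# pairs; accumulate them and take the middle element.
def median_from_sorted_counts(length_count_pairs):
    lengths = []
    for length, ones in sorted(length_count_pairs):   # framework sort by key
        lengths.extend([length] * len(ones))
    return lengths[len(lengths) // 2]

m = median_from_sorted_counts([(120, [1, 1]), (80, [1]), (300, [1, 1])])
# sorted lengths: [80, 120, 120, 300, 300] → median 120
```

One problem the slide asks about is visible here: a single reducer must buffer every length in memory, which does not scale.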
Interlude: Scaling counts
Many applications require counts of words in some context.
E.g. information retrieval, vector-based semantics
Counts from frequent words like “the” can overwhelm the signal from content words such as “stocks” and “football”
Two strategies for combating high-frequency words:
Use a stop list that excludes them
Scale the counts so that high-frequency words are downweighted.
Interlude: Scaling counts, TF-IDF
TF-IDF, or term frequency times inverse document frequency, is a standard way of scaling.
Inverse document frequency for a term t is the ratio of the number of documents in the collection to the number of documents containing t: idf_t = N / df_t
TF-IDF is just the term frequency times the idf: tf-idf_t,d = tf_t,d × idf_t
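A small numeric sketch of this scaling, taking idf as the plain ratio the slide defines (many formulations instead take the log of this ratio; the numbers below are made up for illustration):

```python
# TF-IDF with idf defined as the plain ratio N / df_t.
def idf(num_docs, df_t):
    return num_docs / df_t              # N / df_t

def tf_idf(tf, num_docs, df_t):
    return tf * idf(num_docs, df_t)     # tf(t, d) * idf(t)

# "the" appears in nearly every document, so its score is barely scaled up;
# a rarer content word with the same tf gets a much larger score.
score_the = tf_idf(10, 1000, 990)       # ~10.1
score_stocks = tf_idf(10, 1000, 50)     # 200.0
```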
Interlude: Scaling counts using DF
Recall the word co-occurrence counts task from the earlier slides.
mij represents the number of times word j has occurred in the neighborhood of word i.
The row mi gives a vector profile of word i that we can use for tasks like determining word similarity (e.g. using cosine distance)
Words like “the” will tend to have high counts that we want to scale down so they don’t dominate this computation.
The counts in mij can be scaled down using dfj. Let’s create a transformed matrix S where:
Task 7
Compute S, the co-occurrence counts scaled by document
frequency.
• First: do the simplest mapper
• Then: simplify things for the reducer
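A sketch of the transformation Task 7 asks for, under the assumption sij = mij / dfj (each co-occurrence count divided by the document frequency of word j, as described on the previous slide; the formula itself is shown on the slide image):

```python
# Scale one row (stripe) of the co-occurrence matrix by document frequency.
def scale_stripe(stripe, df):
    # stripe: {j: m_ij}; df: {j: df_j}  →  {j: m_ij / df_j}
    return {j: count / df[j] for j, count in stripe.items()}

s_row = scale_stripe({"the": 100, "stocks": 4}, {"the": 1000, "stocks": 8})
# s_row == {"the": 0.1, "stocks": 0.5}: frequent "the" is scaled far down
```

In a MapReduce job, the simplification hinted at above is getting each dfj to the reducer alongside the stripe for row i, e.g. via a side file or an order-inversion style special key.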