Big Data Infrastructure
Jimmy Lin, University of Maryland
Monday, February 23, 2015
Session 4: MapReduce – Structured and Unstructured Data
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Today's Agenda
• Structured data
  - Processing relational data with MapReduce
• Unstructured data
  - Basics of indexing and retrieval
  - Inverted indexing in MapReduce
Relational Databases
• A relational database is comprised of tables
• Each table represents a relation, i.e., a collection of tuples (rows)
• Each tuple consists of multiple fields
OLTP/OLAP Architecture
[Figure: OLTP systems feed an OLAP system via ETL (Extract, Transform, and Load)]

OLTP/OLAP/Hadoop Architecture
[Figure: the same pipeline with Hadoop added between the OLTP sources and the OLAP system]
Structure of Data Warehouses

SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
  INNER JOIN Dim_Date D ON F.Date_Id = D.Id
  INNER JOIN Dim_Store S ON F.Store_Id = S.Id
  INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country;
Source: Wikipedia (Star Schema)
OLAP Cubes
[Figure: OLAP cube with product and store dimensions]
Common operations: slice and dice, roll up/drill down, pivot
MapReduce algorithms for processing relational data
Source: www.flickr.com/photos/stikatphotography/1590190676/
Source: www.flickr.com/photos/7518432@N06/2237536651/
Design Pattern: Secondary Sorting
• MapReduce sorts input to reducers by key
  - Values are arbitrarily ordered
• What if we want to sort the values as well?
  - E.g., k → (v1, r), (v3, r), (v4, r), (v8, r)…

Secondary Sorting: Solutions
• Solution 1:
  - Buffer values in memory, then sort
  - Why is this a bad idea?
• Solution 2:
  - "Value-to-key conversion" design pattern: form a composite intermediate key, (k, v1)
  - Let the execution framework do the sorting
  - Preserve state across multiple key-value pairs to handle processing
  - Anything else we need to do?
Value-to-Key Conversion

Before: k → (v1, r), (v4, r), (v8, r), (v3, r)…
  Values arrive in arbitrary order.

After: (k, v1) → (v1, r); (k, v3) → (v3, r); (k, v4) → (v4, r); (k, v8) → (v8, r)…
  Values arrive in sorted order. Process by preserving state across multiple keys. Remember to partition correctly (on k alone)! A small sketch follows.
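A minimal Python simulation of value-to-key conversion, with hypothetical toy data (in Hadoop this would be a custom composite key type plus a partitioner on k alone):

    # Toy simulation of value-to-key conversion; data is hypothetical.
    from itertools import groupby

    pairs = [(("k1", 4), "r4"), (("k1", 1), "r1"),
             (("k1", 8), "r8"), (("k1", 3), "r3")]

    # A real job would partition on k alone so every (k, v) composite for
    # the same k lands on one reducer; the framework then sorts by (k, v).
    pairs.sort(key=lambda kv: kv[0])                 # framework sort
    for k, group in groupby(pairs, key=lambda kv: kv[0][0]):
        # the reducer preserves state across composite keys sharing k
        print(k, [r for _, r in group])              # k1 ['r1', 'r3', 'r4', 'r8']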
Working Scenario
• Two tables:
  - User demographics (gender, age, income, etc.)
  - User page visits (URL, time spent, etc.)
• Analyses we might want to perform:
  - Statistics on demographic characteristics
  - Statistics on page visits
  - Statistics on page visits by URL
  - Statistics on page visits by demographic characteristic
  - …
Relational Algebra
• Primitives
  - Projection (π)
  - Selection (σ)
  - Cartesian product (×)
  - Set union (∪)
  - Set difference (−)
  - Rename (ρ)
• Other operations
  - Join (⋈)
  - Group by… aggregation
  - …
Projection
[Figure: π applied to tuples R1–R5, yielding the same five tuples with a subset of the attributes]
Projection in MapReduce
• Easy!
  - Map over tuples, emit new tuples with the appropriate attributes (see the sketch below)
  - No reducers, unless for regrouping or resorting tuples
  - Alternatively: perform in the reducer, after some other processing
• Basically limited by HDFS streaming speeds
  - Speed of encoding/decoding tuples becomes important
  - Take advantage of compression when available
  - Semistructured data? No problem!
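A minimal map-only projection sketch in Python; the field names and record format are hypothetical:

    # Projection: emit each input tuple with only the desired attributes.
    def map_project(record, fields=("userid", "url")):   # hypothetical fields
        yield tuple(record[f] for f in fields)

    # Example: list(map_project({"userid": 7, "url": "/a", "ts": 1}))
    # -> [(7, '/a')]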
Selection
[Figure: σ applied to tuples R1–R5, keeping only R1 and R3]
Selection in MapReduce
• Easy!
  - Map over tuples, emit only tuples that meet the criteria (see the sketch below)
  - No reducers, unless for regrouping or resorting tuples
  - Alternatively: perform in the reducer, after some other processing
• Basically limited by HDFS streaming speeds
  - Speed of encoding/decoding tuples becomes important
  - Take advantage of compression when available
  - Semistructured data? No problem!
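The selection analogue, again map-only; the predicate is a hypothetical stand-in:

    # Selection: emit an input tuple only if it satisfies the predicate.
    def map_select(record, predicate=lambda r: r.get("time", 0) > 60):
        if predicate(record):        # hypothetical criterion
            yield record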
Group by… Aggregation
• Example: What is the average time spent per URL?
• In SQL:
  - SELECT url, AVG(time) FROM visits GROUP BY url
• In MapReduce (see the sketch below):
  - Map over tuples, emit time, keyed by url
  - Framework automatically groups values by keys
  - Compute average in reducer
  - Optimize with combiners
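A sketch of the average-time-per-URL job. The combiner emits partial (sum, count) pairs, since averages themselves do not combine correctly:

    # Average time spent per URL, with a combiner-safe representation.
    def map_visit(url, time):
        yield url, (time, 1)                  # (partial sum, partial count)

    def combine(url, pairs):                  # runs map-side, optional
        s = sum(t for t, _ in pairs); c = sum(n for _, n in pairs)
        yield url, (s, c)

    def reduce_avg(url, pairs):
        s = sum(t for t, _ in pairs); c = sum(n for _, n in pairs)
        yield url, s / c                      # final average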
Relational Joins
Source: Microsoft Office Clip Art

Relational Joins
[Figure: tuples R1–R4 joined with tuples S1–S4 on a common key, producing the pairs R1·S2, R2·S4, R3·S1, R4·S3]
Types of Relationships
One-to-One One-to-Many Many-to-Many
Join Algorithms in MapReduce
• Reduce-side join
• Map-side join
• In-memory join
  - Striped variant
  - Memcached variant
Reduce-side Join
• Basic idea: group by join key (see the sketch below)
  - Map over both sets of tuples
  - Emit tuple as value with join key as the intermediate key
  - Execution framework brings together tuples sharing the same key
  - Perform the actual join in the reducer
  - Similar to a "sort-merge join" in database terminology
• Two variants
  - 1-to-1 joins
  - 1-to-many and many-to-many joins
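A reduce-side join sketch in Python with toy record shapes; for the 1-to-many case, value-to-key conversion would guarantee that the R tuple arrives first instead of buffering:

    # Reduce-side join: tag each tuple with its relation, group by join key.
    def map_r(join_key, payload):
        yield join_key, ("R", payload)

    def map_s(join_key, payload):
        yield join_key, ("S", payload)

    def reduce_join(join_key, tagged):
        rs = [p for tag, p in tagged if tag == "R"]   # buffered in memory
        ss = [p for tag, p in tagged if tag == "S"]
        for r in rs:                                  # cross the two sides
            for s in ss:
                yield join_key, (r, s)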
Reduce-side Join: 1-to-1
[Figure: Map emits R1, R4, S2, S3 as values keyed by join key; Reduce brings the matching R and S tuples together]
Note: no guarantee whether R or S will come first
Reduce-side Join: 1-to-many
[Figure: Map emits R1, S2, S3, S9 keyed by join key; Reduce sees R1 together with S2, S3, S9, …]
What's the problem?
Reduce-side Join: V-to-K Conversion
[Figure: with value-to-key conversion, the reducer sees each R tuple first. New key encountered (R1): hold in memory, then cross with the records S2, S3, S9 from the other set. New key encountered (R4): hold in memory, then cross with S3, S7, and so on]
Reduce-side Join: many-to-many
[Figure: the reducer sees R1, R5, R8 (hold in memory), then S2, S3, S9, … (cross with the records from the other set)]
What's the problem?
Map-side Join: Basic Idea
Assume two datasets are sorted by the join key:
[Figure: R1–R4 and S1–S4, both sorted by join key]
A sequential scan through both datasets joins them (called a "merge join" in database terminology)
Map-side Join: Parallel Scans
• If the datasets are sorted by join key, the join can be accomplished by a scan over both datasets
• How can we accomplish this in parallel?
  - Partition and sort both datasets in the same manner
• In MapReduce:
  - Map over one dataset, read from the other dataset's corresponding partition
  - No reducers necessary (unless to repartition or resort)
• Consistently partitioned datasets: realistic to expect?
In-Memory Join
• Basic idea: load one dataset into memory, stream over the other dataset
  - Works if R << S and R fits into memory
  - Called a "hash join" in database terminology
• MapReduce implementation (see the sketch below)
  - Distribute R to all nodes
  - Map over S; each mapper loads R into memory, hashed by join key
  - For every tuple in S, look up the join key in R
  - No reducers, unless for regrouping or resorting tuples
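A hash-join sketch; for simplicity it assumes join keys in R are unique, which is an assumption, not part of the slide:

    # In-memory (hash) join: R is shipped to every mapper (e.g., via the
    # distributed cache) and hashed by join key; mappers stream over S.
    def build_r_table(r_tuples):            # done once, in mapper setup
        return dict(r_tuples)               # assumes unique join keys in R

    def map_join(s_key, s_payload, r_table):
        if s_key in r_table:                # hash lookup, no shuffle needed
            yield s_key, (r_table[s_key], s_payload)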
In-Memory Join: Variants
• Striped variant:
  - R too big to fit into memory?
  - Divide R into R1, R2, R3, … s.t. each Rn fits into memory
  - Perform the in-memory join: ∀n, Rn ⋈ S
  - Take the union of all join results
• Memcached join:
  - Load R into memcached
  - Replace the in-memory hash lookup with a memcached lookup
Memcached: Circa 2008 Architecture
[Figure] Database layer: 800 eight-core Linux servers running MySQL (40 TB user data)
Caching servers: 15 million requests per second, 95% handled by memcache (15 TB of RAM)
Source: Technology Review (July/August, 2008)
Memcached Join
• Memcached join:
  - Load R into memcached
  - Replace the in-memory hash lookup with a memcached lookup
• Capacity and scalability?
  - Memcached capacity >> RAM of an individual node
  - Memcached scales out with the cluster
• Latency?
  - Memcached is fast (basically, speed of the network)
  - Batch requests to amortize latency costs
Source: See the tech report by Lin et al. (2009)
Which join to use?
• In-memory join > map-side join > reduce-side join
  - Why?
• Limitations of each?
  - In-memory join: memory
  - Map-side join: sort order and partitioning
  - Reduce-side join: general purpose
Processing Relational Data: Summary
• MapReduce algorithms for processing relational data:
  - Group by, sorting, and partitioning are handled automatically by shuffle/sort in MapReduce
  - Selection, projection, and other computations (e.g., aggregation) are performed either in the mapper or the reducer
  - Multiple strategies for relational joins
• Complex operations require multiple MapReduce jobs
  - Example: top ten URLs in terms of average time spent
  - Opportunities for automatic optimization
Source: www.flickr.com/photos/7518432@N06/2237536651/
Today's Agenda
• Structured data
  - Processing relational data with MapReduce
• Unstructured data
  - Basics of indexing and retrieval
  - Inverted indexing in MapReduce
First, nomenclature…
• Information retrieval (IR)
  - Focus on textual information (= text/document retrieval)
  - Other possibilities include image, video, music, …
• What do we search?
  - Generically, "collections"
  - Less frequently used, "corpora"
• What do we find?
  - Generically, "documents"
  - Even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.
Information Retrieval Cycle
[Figure: Source Selection (Resource) → Query Formulation (Query) → Search (Results) → Selection (Documents) → Examination (Information) → Delivery, with source reselection feeding back; along the way: system discovery, vocabulary discovery, concept discovery, document discovery]
The Central Problem in Search
[Figure: the Searcher's query terms ("tragic love story") express concepts; the Author's document terms ("fateful star-crossed romance") express concepts. Do these represent the same concepts?]
Abstract IR Architecture
[Figure: online side: a Query passes through a Representation Function to produce a Query Representation. Offline side: Documents (from document acquisition, e.g., web crawling) pass through a Representation Function to produce Document Representations, which are stored in an Index. A Comparison Function matches the query representation against the index to produce Hits]
How do we represent text?
• Remember: computers don't "understand" anything!
• "Bag of words"
  - Treat all the words in a document as index terms
  - Assign a "weight" to each term based on "importance" (or, in the simplest case, presence/absence of the word)
  - Disregard order, structure, meaning, etc. of the words
  - Simple, yet effective!
• Assumptions
  - Term occurrence is independent
  - Document relevance is independent
  - "Words" are well-defined
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。

وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 मे ंसात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 보도를 부인했다.
Sample Document
"McDonald's slims down spuds": fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
14 × McDonalds
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
“Bag of Words”
Counting Words…
[Figure: Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → (counting) → Inverted Index; discarded along the way: syntax, semantics, word knowledge, etc.]
Boolean Retrieval
• Users express queries as a Boolean expression
  - AND, OR, NOT
  - Can be arbitrarily nested
• Retrieval is based on the notion of sets
  - Any given query divides the collection into two sets: retrieved, not-retrieved
  - Pure Boolean systems do not define an ordering of the results
Inverted Index: Boolean Retrieval

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings lists (term → sorted docids):

blue  → 2
cat   → 3
egg   → 4
fish  → 1, 2
green → 4
ham   → 4
hat   → 3
one   → 1
red   → 2
two   → 1
Boolean Retrieval
• To execute a Boolean query:
  - Build the query syntax tree; e.g., (blue AND fish) OR ham becomes OR(AND(blue, fish), ham)
  - For each clause, look up the postings: blue → 2; fish → 1, 2; ham → 4
  - Traverse the postings and apply the Boolean operator (see the sketch below)
• Efficiency analysis
  - Postings traversal is linear (assuming sorted postings)
  - Start with the shortest postings list first
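A small Python sketch evaluating (blue AND fish) OR ham over the toy index above; postings are sorted docid lists, so AND is a linear two-pointer intersection:

    index = {"blue": [2], "fish": [1, 2], "ham": [4]}

    def p_and(p1, p2):
        i = j = 0
        out = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                out.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return out

    def p_or(p1, p2):                 # union, kept simple for the sketch
        return sorted(set(p1) | set(p2))

    print(p_or(p_and(index["blue"], index["fish"]), index["ham"]))  # [2, 4]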
Strengths and Weaknesses
• Strengths
  - Precise, if you know the right strategies
  - Precise, if you have an idea of what you're looking for
  - Implementations are fast and efficient
• Weaknesses
  - Users must learn Boolean logic
  - Boolean logic is insufficient to capture the richness of language
  - No control over the size of the result set: either too many hits or none
  - When do you stop reading? All documents in the result set are considered "equally good"
  - What about partial matches? Documents that "don't quite match" the query may be useful also
Ranked Retrieval
• Order documents by how likely they are to be relevant
  - Estimate relevance(q, di)
  - Sort documents by relevance
  - Display sorted results
• User model
  - Present hits one screen at a time, best results first
  - At any point, users can decide to stop looking
• How do we estimate relevance?
  - Assume a document is relevant if it has a lot of query terms
  - Replace relevance(q, di) with sim(q, di)
  - Compute similarity of vector representations
Vector Space Model
Assumption: Documents that are "close together" in vector space "talk about" the same things
[Figure: documents d1–d5 as vectors in a space with axes t1, t2, t3; angles θ and φ between vectors]
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness")
Similarity Metric
• Represent documents as weight vectors:
  $d_j = [w_{j,1}, w_{j,2}, w_{j,3}, \ldots, w_{j,n}]$
  $d_k = [w_{k,1}, w_{k,2}, w_{k,3}, \ldots, w_{k,n}]$
• Use the "angle" between the vectors:
  $\cos\theta = \frac{d_j \cdot d_k}{|d_j||d_k|}$
  $\mathrm{sim}(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j||d_k|} = \frac{\sum_{i=0}^{n} w_{j,i}\, w_{k,i}}{\sqrt{\sum_{i=0}^{n} w_{j,i}^2}\,\sqrt{\sum_{i=0}^{n} w_{k,i}^2}}$
• Or, more generally, inner products:
  $\mathrm{sim}(d_j, d_k) = d_j \cdot d_k = \sum_{i=0}^{n} w_{j,i}\, w_{k,i}$
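A direct Python rendering of the cosine formula above, for two weight vectors of equal length:

    import math

    def cosine(dj, dk):
        dot = sum(a * b for a, b in zip(dj, dk))
        norm = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
        return dot / norm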
Term Weighting
• Term weights consist of two components
  - Local: how important is the term in this document?
  - Global: how important is the term in the collection?
• Here's the intuition:
  - Terms that appear often in a document should get high weights
  - Terms that appear in many documents should get low weights
• How do we capture this mathematically?
  - Term frequency (local)
  - Inverse document frequency (global)
TF.IDF Term Weighting

$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}$

where:
  $w_{i,j}$: weight assigned to term i in document j
  $\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
  $N$: number of documents in the entire collection
  $n_i$: number of documents with term i
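A sketch computing these weights for the toy collection that follows; the natural log is used here, which is one of several common choices:

    import math
    from collections import Counter

    docs = {1: "one fish two fish", 2: "red fish blue fish",
            3: "cat hat", 4: "green egg ham"}          # stopwords already removed
    N = len(docs)
    tf = {j: Counter(text.split()) for j, text in docs.items()}
    df = Counter(t for c in tf.values() for t in c)    # n_i: docs containing i

    def weight(term, docid):
        return tf[docid][term] * math.log(N / df[term])

    print(weight("fish", 1))    # tf=2, df=2 -> 2 * log(4/2) ≈ 1.386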
Inverted Index: TF.IDF

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings now carry (docno, tf) pairs, with the df stored per term:

term   df  postings (docno, tf)
blue    1  (2, 1)
cat     1  (3, 1)
egg     1  (4, 1)
fish    2  (1, 2), (2, 2)
green   1  (4, 1)
ham     1  (4, 1)
hat     1  (3, 1)
one     1  (1, 1)
red     1  (2, 1)
two     1  (1, 1)
Positional Indexes
• Store term positions in the postings
• Supports richer queries (e.g., proximity)
• Naturally, leads to larger indexes…

Inverted Index: Positional Information

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings now carry (docno, tf, [positions]), with the df stored per term:

term   df  postings (docno, tf, [positions])
blue    1  (2, 1, [3])
cat     1  (3, 1, [1])
egg     1  (4, 1, [2])
fish    2  (1, 2, [2,4]), (2, 2, [2,4])
green   1  (4, 1, [1])
ham     1  (4, 1, [3])
hat     1  (3, 1, [2])
one     1  (1, 1, [1])
red     1  (2, 1, [1])
two     1  (1, 1, [3])
Retrieval in a Nutshell
• Look up the postings lists corresponding to the query terms
• Traverse the postings for each query term
• Store partial query-document scores in accumulators
• Select the top k results to return
Retrieval: Document-at-a-Time
• Evaluate documents one at a time (score all query terms for a document before moving on)
  - Example postings, sorted by docno: fish → (1,2), (9,1), (21,3), (34,1), (35,2), (80,3), …; blue → postings for docs 9, 21, 35, …
  - Accumulators (e.g., a min-heap): is the document's score in the top k? Yes: insert the document score, extract-min if the heap grows too large. No: do nothing.
• Tradeoffs (see the sketch below)
  - Small memory footprint (good)
  - Skipping possible to avoid reading all postings (good)
  - More seeks and irregular data accesses (bad)
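A document-at-a-time sketch over the postings above; the tf values for blue and the sum-of-tfs scoring are assumptions for illustration, not the slide's scoring model:

    import heapq

    postings = {"fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
                "blue": [(9, 1), (21, 1), (35, 1)]}   # blue tfs hypothetical

    def daat(query, k=2):
        # Merge all query terms' postings by docid (they are docid-sorted),
        # so each document's full score is complete before moving on.
        merged = heapq.merge(*(postings[t] for t in query))
        top, cur_doc, cur_score = [], None, 0
        for docid, tf in merged:
            if docid != cur_doc:
                if cur_doc is not None:
                    heapq.heappush(top, (cur_score, cur_doc))
                    if len(top) > k:
                        heapq.heappop(top)      # extract-min keeps top k
                cur_doc, cur_score = docid, 0
            cur_score += tf                     # toy score: sum of tfs
        if cur_doc is not None:
            heapq.heappush(top, (cur_score, cur_doc))
            if len(top) > k:
                heapq.heappop(top)
        return sorted(top, reverse=True)

    print(daat(["fish", "blue"]))   # -> [(4, 21), (3, 80)]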
Retrieval: Query-at-a-Time
• Evaluate documents one query term at a time
  - Usually starting from the rarest term (often with tf-sorted postings)
  - Accumulators (e.g., a hash table) hold the partial scores: Score_{q=x}(doc n) = s
• Tradeoffs
  - Early termination heuristics (good)
  - Large memory footprint (bad), but filtering heuristics are possible
MapReduce it?
• The indexing problem → Perfect for MapReduce!
  - Scalability is critical
  - Must be relatively fast, but need not be real time
  - Fundamentally a batch operation
  - Incremental updates may or may not be important
  - For the web, crawling is a challenge in itself
• The retrieval problem → Uh… not so good…
  - Must have sub-second response time
  - For the web, only need relatively few results
Indexing: Performance Analysis
• Fundamentally, a large sorting problem
  - Terms usually fit in memory
  - Postings usually don't
• How is it done on a single machine?
• How can it be done with MapReduce?
• First, let's characterize the problem size:
  - Size of vocabulary
  - Size of postings
Vocabulary Size: Heaps' Law

$M = kT^b$

where M is vocabulary size, T is collection size (number of tokens), and k and b are constants. Typically, k is between 30 and 100, and b is between 0.4 and 0.6.
• Heaps' Law: linear in log-log space
• Vocabulary size grows unbounded!
Heaps' Law for RCV1
Reuters-RCV1 collection: 806,791 newswire documents (August 20, 1996 to August 19, 1997)
Fit: k = 44, b = 0.49
First 1,000,020 tokens: predicted vocabulary = 38,323; actual = 38,365
Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)
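A quick arithmetic check of the fit above:

    # Heaps' Law prediction for the RCV1 numbers quoted above.
    T, k, b = 1_000_020, 44, 0.49
    print(round(k * T ** b))     # -> 38323 predicted vs. 38365 observed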
Postings Size: Zipf's Law

$\mathrm{cf}_i = \frac{c}{i}$

where cf_i is the collection frequency of the i-th most common term and c is a constant.
• Zipf's Law: (also) linear in log-log space
  - A specific case of power-law distributions
• In other words:
  - A few elements occur very frequently
  - Many elements occur very infrequently
Zipf's Law for RCV1
Reuters-RCV1 collection: 806,791 newswire documents (August 20, 1996 to August 19, 1997)
The fit isn't that good… but good enough!
Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)

Power Laws are everywhere!
Figure from: Newman, M. E. J. (2005) "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46:323–351.
MapReduce: Index Construction
• Map over all documents
  - Emit term as key, (docno, tf) as value
  - Emit other information as necessary (e.g., term position)
• Sort/shuffle: group postings by term
• Reduce
  - Gather and sort the postings (e.g., by docno or tf)
  - Write postings to disk
• MapReduce does all the heavy lifting!
Inverted Indexing with MapReduce
[Figure:
Map:
  Doc 1 (one fish, two fish) → one:(1,1), two:(1,1), fish:(1,2)
  Doc 2 (red fish, blue fish) → red:(2,1), blue:(2,1), fish:(2,2)
  Doc 3 (cat in the hat) → cat:(3,1), hat:(3,1)
Shuffle and Sort: aggregate values by keys
Reduce:
  fish → (1,2), (2,2); one → (1,1); two → (1,1); red → (2,1); blue → (2,1); cat → (3,1); hat → (3,1)]
Inverted Indexing: Pseudo-Code

1: class Mapper
2:   method Map(docid n, doc d)
3:     H ← new AssociativeArray            ▷ histogram to hold term frequencies
4:     for all term t ∈ doc d do           ▷ processes the doc, e.g., tokenization and stopword removal
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(term t, posting ⟨n, H{t}⟩)   ▷ emits individual postings

1: class Reducer
2:   method Reduce(term t, postings [⟨n1, f1⟩ …])
3:     P ← new List
4:     for all ⟨n, f⟩ ∈ postings [⟨n1, f1⟩ …] do
5:       P.Append(⟨n, f⟩)                  ▷ appends postings unsorted
6:     P.Sort()                            ▷ sorts for compression
7:     Emit(term t, postingsList P)

Figure 2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce.

Given an existing single-machine indexer, one simple way to take advantage of MapReduce is to leverage reducers to merge indexes built on local disk. This might proceed as follows: an existing indexer is embedded inside the mapper, and mappers are applied over the entire document collection. Each indexer operates independently and builds an index on local disk for the documents it encounters. Once the local indexes have been built, compressed postings are emitted as values, keyed by the term. In the reducer, postings from each locally-built index are merged into a final index. [Indri is capable of distributed indexing using exactly this approach, albeit outside of the MapReduce framework.]

Another relatively straightforward adaptation of a single-machine indexer is demonstrated by Nutch (http://lucene.apache.org/nutch/). Its algorithm processes documents in the map phase, and emits pairs consisting of docids and analyzed document contents. The sort and shuffle phase in MapReduce is used essentially for document partitioning, and the reducers build each individual index partition independently. In contrast with the above approach, Nutch basically embeds a traditional indexer in the reducers, instead of the mappers. With this approach, the number of reducers specifies the number of document partitions, which limits the degree of parallelization that can be achieved.

We decided not to pursue the two approaches discussed above since they seemed like incremental improvements over existing indexing methods. Instead, we implemented and evaluated two distinct algorithms that make fuller use of the MapReduce programming model. The first is a scalable variant of the baseline inverted indexing algorithm in MapReduce, in which the mappers emit individual postings. The second is an algorithm in which the mappers emit partial lists of postings. The algorithms primarily differ in how postings are sorted: by the execution framework (in the first algorithm) or by the indexing code itself (in the second algorithm). Detailed descriptions of both are provided below, followed by a general discussion of their relative merits.

3.2 Emitting Individual Postings

The starting point of our first algorithm, based on mappers emitting individual postings, is an observation about a significant bottleneck in the baseline algorithm in Figure 2: it assumes that there is sufficient memory to hold all postings associated with the same term before sorting them. Since the MapReduce execution framework makes no guarantees about the ordering of values associated with the same key, the reducer must first buffer all postings and then perform an in-memory sort before the postings can be written out to disk.
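As a cross-check, here is a runnable Python rendering of the Figure 2 baseline; the shuffle is simulated with a dictionary, and real Hadoop code would look different:

    from collections import Counter, defaultdict

    def mapper(docid, doc):
        for term, tf in Counter(doc.split()).items():   # histogram H
            yield term, (docid, tf)                     # individual postings

    def reducer(term, postings):
        return term, sorted(postings)    # buffer, then sort: the bottleneck

    docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "cat hat"}
    shuffle = defaultdict(list)
    for docid, doc in docs.items():
        for term, posting in mapper(docid, doc):
            shuffle[term].append(posting)
    for term in sorted(shuffle):
        print(reducer(term, shuffle[term]))   # e.g., ('fish', [(1, 2), (2, 2)])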
Positional Indexes
[Figure: the same toy MapReduce job, now with positions in the postings:
Map:
  Doc 1 (one fish, two fish) → one:(1,1,[1]), two:(1,1,[3]), fish:(1,2,[2,4])
  Doc 2 (red fish, blue fish) → red:(2,1,[1]), blue:(2,1,[3]), fish:(2,2,[2,4])
  Doc 3 (cat in the hat) → cat:(3,1,[1]), hat:(3,1,[2])
Shuffle and Sort: aggregate values by keys
Reduce:
  fish → (1,2,[2,4]), (2,2,[2,4]); one → (1,1,[1]); two → (1,1,[3]); red → (2,1,[1]); blue → (2,1,[3]); cat → (3,1,[1]); hat → (3,1,[2])]
Inverted Indexing: Pseudo-Code
[Figure 2 repeated: the baseline inverted indexing algorithm shown above.]
What’s the problem?
Scalability Bottleneck
• Initial implementation: terms as keys, postings as values
  - Reducers must buffer all postings associated with a key (to sort them)
  - What if we run out of memory to buffer the postings?
• Uh oh!
Another Try…
[Figure:
Before: (key) fish → (values) (1, 2, [2,4]), (9, 1, [9]), (21, 3, [1,8,22]), (34, 1, [23]), (35, 2, [8,41]), (80, 3, [2,9,76])
After: (keys) (fish, 1), (fish, 9), (fish, 21), (fish, 34), (fish, 35), (fish, 80) → (values) [2,4], [9], [1,8,22], [23], [8,41], [2,9,76]]
How is this different?
• Let the framework do the sorting
• Term frequency implicitly stored (as the number of positions)
Where have we seen this before?
Inverted Indexing: Pseudo-Code

1: class Mapper
2:   method Map(docid n, doc d)
3:     H ← new AssociativeArray             ▷ builds a histogram of term frequencies
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(tuple ⟨t, n⟩, tf H{t})        ▷ emits individual postings, with a tuple as the key

1: class Partitioner
2:   method Partition(tuple ⟨t, n⟩, tf f)
3:     return Hash(t) mod NumOfReducers     ▷ keys with the same term are sent to the same reducer

1: class Reducer
2:   method Initialize
3:     tprev ← ∅
4:     P ← new PostingsList
5:   method Reduce(tuple ⟨t, n⟩, tf [f])
6:     if t ≠ tprev ∧ tprev ≠ ∅ then
7:       Emit(term tprev, postings P)       ▷ emits the postings list of term tprev
8:       P.Reset()
9:     P.Append(⟨n, f⟩)                     ▷ appends postings in sorted order
10:    tprev ← t
11:  method Close
12:    Emit(term t, postings P)             ▷ emits the last postings list from this reducer

Figure 3: Pseudo-code of the inverted indexing algorithm based on emitting individual postings (IP).

Of course, as collections grow in size there may not be sufficient memory to perform this sort (bound by the term with the largest df).

Since the MapReduce programming model guarantees that keys arrive at each reducer in sorted order, we can overcome the scalability bottleneck by letting the execution framework do the sorting. Instead of emitting key-value pairs of the form:

  (term t, posting ⟨docid, f⟩)

we emit intermediate key-value pairs of the form:

  (tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document number, while the value is the term frequency. We need to redefine the sort order so that keys are sorted first by term t, and then by docid n. Additionally, we need a custom partitioner to ensure that all tuples with the same term are shuffled to the same reducer. Having implemented these two changes, the MapReduce execution framework ensures that the postings arrive in the correct order. This, combined with the fact that reducers can hold state across multiple keys, allows compressed postings to be written with minimal memory usage.

The revised MapReduce inverted indexing algorithm is shown in Figure 3. The mapper remains unchanged for the most part, other than differences in the intermediate key-value pairs. The key space of the intermediate output is partitioned by term; that is, all keys with the same term are sent to the same reducer. This is guaranteed by the partitioner. The reducer contains two additional methods:
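A runnable Python simulation of the IP idea (the framework's sort over composite keys is simulated with a list sort; partitioning by term is implicit in the single reducer loop; documents are toy data):

    from collections import Counter

    def mapper(docid, doc):
        for term, tf in Counter(doc.split()).items():
            yield (term, docid), tf                 # tuple key, tf value

    docs = {1: "one fish two fish", 2: "red fish blue fish"}
    pairs = [kv for d, t in docs.items() for kv in mapper(d, t)]
    pairs.sort(key=lambda kv: kv[0])    # framework sorts by (term, docid)

    t_prev, postings = None, []
    for (term, docid), tf in pairs:     # reducer: postings arrive pre-sorted
        if term != t_prev and t_prev is not None:
            print(t_prev, postings)     # emit (could be written incrementally)
            postings = []
        postings.append((docid, tf))
        t_prev = term
    if t_prev is not None:
        print(t_prev, postings)         # Close(): flush the last term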
Inverted Index (Again)
[Figure: the toy TF.IDF inverted index from before, repeated: each term with its df and (docno, tf) postings.]
Chicken and Egg?
[Figure:
(key)       (value)
(fish, 1)   [2,4]
(fish, 9)   [9]
(fish, 21)  [1,8,22]
(fish, 34)  [23]
(fish, 35)  [8,41]
(fish, 80)  [2,9,76]
→ Write postings]
We'd like to store the df at the front of the postings list.
But we don't know the df until we've seen all postings!
Sound familiar?
Getting the df
• In the mapper:
  - Emit "special" key-value pairs to keep track of the df
• In the reducer:
  - Make sure the "special" key-value pairs come first: process them to determine the df
• Remember: proper partitioning!
Getting the df: Modified Mapper
[Figure: input document Doc 1 (one fish, two fish)…
Emit normal key-value pairs:
  (fish, 1) → [2,4]
  (one, 1)  → [1]
  (two, 1)  → [3]
Emit "special" key-value pairs to keep track of the df:
  (fish, ★) → [1]
  (one, ★)  → [1]
  (two, ★)  → [1]]
Getting the df: Modified Reducer
[Figure:
(key)       (value)
(fish, ★)   [63], [82], [27], …   (first, compute the df by summing contributions from all "special" key-value pairs, then write the df)
(fish, 1)   [2,4]
(fish, 9)   [9]
(fish, 21)  [1,8,22]
(fish, 34)  [23]
(fish, 35)  [8,41]
(fish, 80)  [2,9,76]
→ Write postings]
Important: properly define the sort order so that the "special" key-value pairs come first!
Where have we seen this before?
Postings Encoding
Conceptually:
  fish → (1,2), (9,1), (21,3), (34,1), (35,2), (80,3), …
In practice:
  fish → (1,2), (8,1), (12,3), (13,1), (1,2), (45,3), …
• Don't encode docnos, encode gaps (or d-gaps)
• But it's not obvious that this saves space… (see the VByte sketch below)
Overview of Index Compression
• Byte-aligned vs. bit-aligned
• Byte-aligned (and word-aligned) techniques
  - VByte
  - Simple9 and variants
  - PForDelta
• Bit-aligned
  - Unary codes
  - γ codes
  - δ codes
  - Golomb codes (local Bernoulli model)
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
VByte
• Simple idea: use only as many bytes as needed
  - Reserve one bit per byte as the "continuation bit"
  - Use the remaining bits for encoding the value
• Works okay, easy to implement…
[Figure: one byte → 7 payload bits; two bytes → 14 payload bits; three bytes → 21 payload bits, with the flag bit distinguishing continuation from final bytes]
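A VByte sketch over the d-gaps from the previous slide. Conventions differ on whether the flag bit marks continuation or termination and on payload byte order; this sketch sets the high bit on the final byte and stores low-order payload bits first:

    def vbyte_encode(n):
        out = []
        while n >= 128:
            out.append(n & 0x7F)        # continuation byte, high bit clear
            n >>= 7
        out.append(n | 0x80)            # terminator byte, high bit set
        return bytes(out)

    def vbyte_decode(data):
        n, shift, out = 0, 0, []
        for b in data:
            if b & 0x80:
                out.append(n | ((b & 0x7F) << shift))
                n, shift = 0, 0
            else:
                n |= b << shift
                shift += 7
        return out

    gaps = [1, 8, 12, 13, 1, 45]        # d-gaps for docnos 1,9,21,34,35,80
    enc = b"".join(vbyte_encode(g) for g in gaps)
    assert vbyte_decode(enc) == gaps and len(enc) == 6   # one byte each here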
Inverted Indexing: IP
[Figure 3 repeated: the IP algorithm shown above.]
What's the assumption?
Inverted Indexing: LP

1: class Mapper
2:   method Initialize
3:     M ← new AssociativeArray             ▷ holds partial lists of postings
4:   method Map(docid n, doc d)
5:     H ← new AssociativeArray             ▷ builds a histogram of term frequencies
6:     for all term t ∈ doc d do
7:       H{t} ← H{t} + 1
8:     for all term t ∈ H do
9:       M{t}.Add(posting ⟨n, H{t}⟩)        ▷ adds a posting to the partial postings lists
10:    if MemoryFull() then
11:      Flush()
12:  method Flush                           ▷ flushes partial lists of postings as intermediate output
13:    for all term t ∈ M do
14:      P ← SortAndEncodePostings(M{t})
15:      Emit(term t, postingsList P)
16:    M.Clear()
17:  method Close
18:    Flush()

1: class Reducer
2:   method Reduce(term t, postingsLists [P1, P2, …])
3:     Pf ← new List                        ▷ temporarily stores partial lists of postings
4:     R ← new List                         ▷ stores merged partial lists of postings
5:     for all P ∈ postingsLists [P1, P2, …] do
6:       Pf.Add(P)
7:       if MemoryNearlyFull() then
8:         R.Add(MergeLists(Pf))
9:         Pf.Clear()
10:    R.Add(MergeLists(Pf))
11:    Emit(term t, postingsList MergeLists(R))  ▷ emits the fully merged postings list of term t

Figure 4: Pseudo-code of the inverted indexing algorithm based on emitting lists of postings (LP).

…documents). These partial postings lists are emitted as values, keyed by the terms. In our actual implementation, positional information is also encoded in the postings lists, but this detail is omitted from the pseudo-code for presentation purposes.

In the reduce phase, all partial postings lists associated with the same term are brought together by the execution framework. The reducer must then merge all these partial lists (arbitrarily ordered) into a final postings list. For this, we adopted a two-pass approach. In the first pass, the algorithm reads postings lists (let's call them p1, p2, …) into memory until memory is nearly exhausted. These are then merged to create a new postings list (let's call this pa). The partial postings lists are in compressed form, which means we can store quite a few of them in memory. The memory needed for merging is relatively modest for two reasons: First, we know how many postings are in p1, p2, …, so we can compress pa incrementally; very few postings are actually materialized. Second, the d-gaps in pa are smaller post-merging, so compression becomes more efficient. At the end of the first pass, we obtain a smaller number of partial postings lists (pa, pb, … in R), which are then merged in a second pass into a single postings list. This is emitted as the final value, keyed by the term, and written to disk. As in the previous algorithm, the key space is partitioned by term. The final index will be split across r
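A compact Python sketch of the LP mapper's buffer-and-flush behavior; the flush trigger is a simple document count standing in for MemoryFull(), and the data layout is simplified:

    from collections import Counter, defaultdict

    class LPMapper:
        def __init__(self, flush_every=1000):      # stand-in for MemoryFull()
            self.M = defaultdict(list)             # term -> partial postings
            self.seen, self.flush_every = 0, flush_every

        def map(self, docid, doc):
            for term, tf in Counter(doc.split()).items():
                self.M[term].append((docid, tf))
            self.seen += 1
            if self.seen % self.flush_every == 0:
                yield from self.flush()

        def flush(self):                           # emit sorted partial lists
            for term, postings in self.M.items():
                yield term, sorted(postings)
            self.M.clear()

        def close(self):                           # final flush, as in Close()
            yield from self.flush()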
MapReduce it?
• The indexing problem → just covered
  - Scalability is paramount
  - Must be relatively fast, but need not be real time
  - Fundamentally a batch operation
  - Incremental updates may or may not be important
  - For the web, crawling is a challenge in itself
• The retrieval problem → now
  - Must have sub-second response time
  - For the web, only need relatively few results
Retrieval with MapReduce?
• MapReduce is fundamentally batch-oriented
  - Optimized for throughput, not latency
  - Startup of mappers and reducers is expensive
• MapReduce is not suitable for real-time queries!
  - Use separate infrastructure for retrieval…
Important Ideas
• Partitioning (for scalability)
• Replication (for redundancy)
• Caching (for speed)
• Routing (for load balancing)
The rest is just details!
Term vs. Document Partitioning
[Figure: the index is a term-by-document (T × D) matrix. Term partitioning splits it into T1, T2, T3, each covering all documents; document partitioning splits it into D1, D2, D3, each covering all terms.]
Katta Architecture (Distributed Lucene)
http://katta.sourceforge.net/
Source: Wikipedia (Japanese rock garden)
Questions?