Tamer Elsayed, Qatar University
On Large-Scale Retrieval Tasks with Ivory and MapReduce
Nov 7th, 2012
My Field …
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.
● Quite effective (at some things)
● Highly visible (mostly)
● Commercially successful (some of them)
IR is not just "Document Retrieval"
● Clustering and classification
● Question answering
● Filtering, tracking, routing
● Recommender systems
● Leveraging XML and other metadata
● Text mining
● Novelty identification
● Meta-search (multi-collection searching)
● Summarization
● Cross-language mechanisms
● Evaluation techniques
● Multimedia retrieval
● Social media analysis
● …
My Research …
[Figure: large-scale text processing, from collections to user applications: Enron emails (~500,000 documents; identity resolution) and ClueWeb web pages (~1,000,000,000 documents; web search)]
Back in 2009 …
● Before 2009, only small text collections were available (largest: ~1M documents)
● ClueWeb09: crawled by CMU in 2009, ~1B documents! We needed to move to cluster environments
● MapReduce/Hadoop seemed like a promising framework
Ivory (http://ivory.cc)
● End-to-end search toolkit using MapReduce
● Completely designed for the Hadoop environment
● Experimental platform for research
● Supports common text collections + ClueWeb09
● Open-source release
● Implements state-of-the-art retrieval models
MapReduce Framework
● map: (k1, v1) → [(k2, v2)]
● shuffle: group values by key
● reduce: (k2, [v2]) → [(k3, v3)]
[Figure: inputs flow through parallel mappers, are shuffled (grouped by key), and reduced into outputs: (a) Map, (b) Shuffle, (c) Reduce]
Framework handles “everything else” !
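The three phases above can be sketched as a toy, single-process simulation in Python (word counting as the example job; Hadoop runs the same map/shuffle/reduce contract distributed across a cluster):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # (k1, v1) -> [(k2, v2)]: emit (term, 1) for every term occurrence
    return [(term, 1) for term in text.split()]

def reduce_fn(term, counts):
    # (k2, [v2]) -> [(k3, v3)]: sum the partial counts for one term
    return [(term, sum(counts))]

def run_job(inputs, map_fn, reduce_fn):
    # Shuffle: group all mapper output values by key
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce: each key's grouped values are processed independently
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reduce_fn(k2, values))
    return out

docs = [("A", "clinton obama clinton"), ("B", "clinton romney")]
print(run_job(docs, map_fn, reduce_fn))
# [('clinton', 3), ('obama', 1), ('romney', 1)]
```

The "everything else" the framework handles (partitioning, sorting, fault tolerance, data movement) is exactly what this toy runner hides inside `run_job`.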
The IR Black Box
[Figure: a query and a collection of documents go into the black box; hits come out]
Inside the IR Black Box
[Figure: the query and the documents each pass through a representation function, producing a query representation and a document representation; a comparison function matches the two against an index to produce hits. Document processing happens offline; query processing happens online.]
Indexing
Collection (documents, IDs):
● A: Clinton Obama Clinton
● B: Clinton Romney
● C: Clinton Barack Obama
Inverted index (terms, posting lists):
● Clinton → (A, 2), (B, 1), (C, 1)
● Obama → (A, 1), (C, 1)
● Romney → (B, 1)
● Barack → (C, 1)
Indexing in MapReduce: (a) Map, (b) Shuffle, (c) Reduce
[Figure: each mapper tokenizes one document ("Clinton Obama Clinton", "Clinton Romney", "Clinton Barack Obama") and emits per-term pairs; the shuffle groups pairs by term; each reducer assembles one term's posting list]
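The indexing job on this slide can be written as a small single-process sketch (toy collection; Ivory's real Java implementation adds compression, partitioning, and much more):

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    # Map: emit one (term, (doc_id, tf)) pair per distinct term in the document
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(term, postings):
    # Reduce: the shuffle grouped all postings for this term; sort by doc id
    return (term, sorted(postings))

docs = [
    ("A", "clinton obama clinton"),
    ("B", "clinton romney"),
    ("C", "clinton barack obama"),
]

groups = defaultdict(list)          # shuffle: group postings by term
for doc_id, text in docs:
    for term, posting in index_map(doc_id, text):
        groups[term].append(posting)

index = dict(index_reduce(t, ps) for t, ps in groups.items())
print(index["clinton"])   # [('A', 2), ('B', 1), ('C', 1)]
```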
Retrieval Directly from HDFS!
Cute hack: use Hadoop to launch partition servers
● Embed an HTTP server inside each mapper
● Mappers start up, initialize servers, and enter an infinite service loop!
Why do this?
● Unified Hadoop ecosystem
● Simplifies data management issues
[Figure: a search client talks to a retrieval broker, which fans out to partition servers running inside HDFS datanodes, coordinated by the namenode; used at TREC'09 and TREC'10]
Roadmap (all built on Ivory)
● Indexing & Retrieval: batch retrieval [TREC 2009, TREC 2010]; approximate positional indexes [CIKM 2011]
● Pairwise Similarity: monolingual [ACL 2008]; cross-lingual [SIGIR 2011]
● Pseudo Test Collections: training L2R [SIGIR 2011]
● Iterative Processing: iHadoop [CloudCom 2011]
Roadmap: Pairwise Similarity (monolingual, cross-lingual) [SIGIR 2011, ACL 2008]
Abstract Problem
Compute a similarity score for every pair of documents in the collection.
[Figure: documents in the collection and the resulting matrix of pairwise similarity scores]
Applications: clustering, coreference resolution, "more-like-that" queries
Decomposition
The similarity score decomposes over terms; each term contributes a partial score only if it appears in both documents:
sim(d_i, d_j) = Σ_{t ∈ d_i ∩ d_j} w(t, d_i) · w(t, d_j)
Map over each term's posting list to generate partial scores; reduce sums the partial scores for each document pair.
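This decomposition can be sketched directly over an inverted index (toy weights; the map step emits one partial product per co-occurring document pair, the reduce step sums per pair):

```python
from collections import defaultdict
from itertools import combinations

# Inverted index: term -> list of (doc_id, term_weight); toy tf weights
index = {
    "clinton": [("A", 2), ("B", 1), ("C", 1)],
    "obama":   [("A", 1), ("C", 1)],
    "romney":  [("B", 1)],
    "barack":  [("C", 1)],
}

# Map: each term emits one partial product per pair of docs in its postings,
# so a term only contributes to pairs of documents it appears in
partials = []
for term, postings in index.items():
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        partials.append(((d1, d2), w1 * w2))

# Shuffle + reduce: sum partial products per document pair
sims = defaultdict(int)
for pair, p in partials:
    sims[pair] += p

print(dict(sims))  # {('A', 'B'): 2, ('A', 'C'): 3, ('B', 'C'): 1}
```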
Pairwise Similarity in MapReduce: (a) Generate pairs, (b) Group pairs, (c) Sum pairs
[Figure: for each term (Clinton, Barack, Romney, Obama), mappers generate partial-score pairs from the term's posting list; the shuffle groups them by document pair; reducers sum the partial scores]
Terms: Zipfian Distribution
[Figure: document frequency (df) versus term rank follows a Zipfian curve]
● Each term t contributes O(df_t²) partial results
● Very few terms dominate the computation:
  • most frequent term ("said"): 3%
  • most frequent 10 terms: 15%
  • most frequent 100 terms: 57%
  • most frequent 1,000 terms: 95%
● Those dominant terms are only ~0.1% of all terms (a 99.9% df-cut)
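Because a term with document frequency df_t generates O(df_t²) partial results, cutting the few highest-df terms removes most of the work. A minimal sketch of such a df-cut (toy numbers and a hypothetical helper name; the actual system cuts by a percentile of the df-ranked vocabulary):

```python
def apply_df_cut(index, cut_fraction=0.001):
    """Drop the top `cut_fraction` of terms ranked by document frequency.

    A 99.9% df-cut keeps the 99.9% of terms with the lowest df, i.e. it
    drops the 0.1% most frequent terms, which dominate the O(df^2) cost.
    """
    by_df = sorted(index, key=lambda t: len(index[t]), reverse=True)
    n_drop = int(len(by_df) * cut_fraction)
    dropped = set(by_df[:n_drop])
    return {t: ps for t, ps in index.items() if t not in dropped}

# Toy index: "said" appears in far more documents than anything else
index = {"said": [("d%d" % i, 1) for i in range(1000)],
         "obama": [("d1", 1), ("d2", 1)],
         "romney": [("d3", 1)]}

pruned = apply_df_cut(index, cut_fraction=0.34)  # drop top third of 3 terms
print(sorted(pruned))  # ['obama', 'romney'] -- "said" is gone
```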
Efficiency (disk space)
[Figure: intermediate pairs (billions) versus corpus size (%), for no df-cut and df-cuts at 99.999%, 99.99%, 99.9%, and 99%. No df-cut: 8 trillion intermediate pairs; 99.9% df-cut: 0.5 trillion intermediate pairs]
Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk; Aquaint-2 collection, ~906k documents
Effectiveness
Effect of the df-cut on effectiveness (Medline04, 909k abstracts, ad-hoc retrieval):
[Figure: relative P5 (%) versus df-cut (%), from 99.00 to 100.00]
Dropping 0.1% of terms turns the "near-linear" growth into something that fits on disk, at a cost of only 2% in effectiveness.
Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
ACL'08
Cross-Lingual Pairwise Similarity
● Find similar document pairs in different languages
● Applications: multilingual text mining, machine translation, automatic generation of potential "interwiki" language links
● More difficult than the monolingual case!
Vocabulary Space Matching
[Figure: two options for matching a German document A against an English document B. MT approach: machine-translate document A into English, then build its English doc vector. CLIR approach: build the German doc vector, then project it into the English vocabulary space.]
CLIR projection translates term statistics using translation probabilities p(e|f) from German terms f to English terms e:
tf(e, d) = Σ_f p(e|f) · tf(f, d)
df(e) = Σ_f p(e|f) · df(f)
Locality-Sensitive Hashing (LSH)
● The cosine score is a good similarity measure, but expensive!
● LSH is a method for effectively reducing the search space when looking for similar pairs
● Each vector is converted into a compact representation, called a signature
● Vectors close to each other are likely to have similar signatures
● A sliding-window-based algorithm uses these signatures to search for similar articles in the collection
Solution Overview
[Figure: preprocess N_f German articles and N_e English articles; CLIR projection yields N_e + N_f English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>); signature generation (random projection / minhash / simhash) yields N_e + N_f signatures (e.g., 0111000010111100001010); a sliding-window algorithm outputs similar article pairs]
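Signature generation by random projection can be sketched as follows (toy vocabulary and dimensionality, hypothetical function names; each bit records which side of a random hyperplane the vector falls on, so vectors with a small angle between them tend to agree on most bits):

```python
import random

def random_projection_signature(vec, hyperplanes):
    # One bit per random hyperplane: 1 if the vector is on its positive side
    bits = []
    for plane in hyperplanes:
        dot = sum(w * plane.get(term, 0.0) for term, w in vec.items())
        bits.append(1 if dot >= 0 else 0)
    return bits

random.seed(0)
vocab = ["nobel", "prize", "book", "physics"]
# 16 random hyperplanes over the vocabulary, components drawn from N(0, 1)
hyperplanes = [{t: random.gauss(0, 1) for t in vocab} for _ in range(16)]

a = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
b = {"nobel": 0.31, "prize": 0.24, "book": 0.02}   # very similar to a
sig_a = random_projection_signature(a, hyperplanes)
sig_b = random_projection_signature(b, hyperplanes)
hamming = sum(x != y for x, y in zip(sig_a, sig_b))
print(hamming)  # small Hamming distance for similar vectors
```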
MapReduce 1: Table Generation Phase
[Figure: each of Q random permutations p1 … pQ is applied to every signature; sorting the permuted signatures S1 … SQ yields Q sorted tables S1' … SQ']
MapReduce 2: Detection Phase
[Figure: a window slides over each sorted table chunk, comparing each signature only to its nearby neighbors]
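The two phases can be sketched together in a toy single-process form (hypothetical names and tiny signatures; in the real system each of the Q permutations yields one table, and the tables are processed as sorted chunks in parallel):

```python
import random

def detect_similar(signatures, num_tables=4, window=2, seed=42):
    """Phase 1: build Q permuted, sorted tables of signatures.
    Phase 2: within each table, compare each signature only to the
    `window` entries after it (the sliding window)."""
    rng = random.Random(seed)
    nbits = len(next(iter(signatures.values())))
    candidates = set()
    for _ in range(num_tables):
        perm = list(range(nbits))
        rng.shuffle(perm)                       # one random bit permutation
        # Table generation: permute every signature, then sort the table
        table = sorted((tuple(sig[i] for i in perm), doc)
                       for doc, sig in signatures.items())
        # Detection: slide a window over the sorted table
        for i, (_, doc) in enumerate(table):
            for _, other in table[i + 1:i + 1 + window]:
                candidates.add(tuple(sorted((doc, other))))
    return candidates

sigs = {"de1": [0, 1, 1, 0, 1, 0, 1, 1],
        "en1": [0, 1, 1, 0, 1, 0, 1, 0],   # near-duplicate of de1
        "en2": [1, 0, 0, 1, 0, 1, 0, 0]}
print(detect_similar(sigs))
```

Permuting before sorting matters: signatures that agree on most bits end up adjacent under many permutations, so the near-duplicate pair is found without comparing all pairs.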
Evaluation
● Ground truth: sample 1,064 German articles; pairs with cosine score ≥ 0.3
● Compare the sliding-window algorithm with the brute-force approach, which is required for the exact solution and serves as an upper-bound reference for recall and running time
● 95% recall at 39% of the cost
● 99% recall at 62% of the cost
● No free lunch!
Contribution to Wikipedia
Identify links between German and English Wikipedia articles:
● "Metadaten" → "Metadata", "Semantic Web", "File Format"
● "Pierre Curie" → "Marie Curie", "Pierre Curie", "Hélène Langevin-Joliot"
● "Kirgisistan" → "Kyrgyzstan", "Tulip Revolution", "2010 Kyrgyzstani uprising", "2010 South Kyrgyzstan riots", "Uzbekistan"
Results degrade when the two articles differ significantly in length.
SIGIR'11
Roadmap: Approximate Positional Indexes [CIKM 2011]
Approximate Positional Indexes
● Term positions let "learning to rank" models use proximity features, yielding effective ranking functions (√)
● But storing exact positions means a large index and slow query evaluation (X)
● Approximating positions gives a smaller index and faster query evaluation (√ √)
Close enough is good enough?
Variable-Width Buckets
Divide every document into the same number of buckets (here, 5 buckets per document), so bucket width varies with document length; a term's exact position is replaced by its bucket number.
[Figure: two documents d1 and d2 of different lengths, each split into buckets 1 to 5]
Fixed-Width Buckets
Divide every document into buckets of fixed length W, so longer documents get more buckets.
[Figure: the longer d1 split into buckets 1 to 5, the shorter d2 into buckets 1 to 3]
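Both schemes replace an exact term position with a coarse bucket id. A minimal sketch of the two mappings (function names are hypothetical, W = 64 is an illustrative choice):

```python
def variable_width_bucket(position, doc_length, num_buckets=5):
    # Fixed number of buckets per document: width scales with document length
    width = max(1, -(-doc_length // num_buckets))   # ceiling division
    return position // width

def fixed_width_bucket(position, width=64):
    # Buckets of constant length W: longer documents get more buckets
    return position // width

# A term at position 130 in a 300-term document
print(variable_width_bucket(130, 300))  # bucket 2 of 0..4 (width 60)
print(fixed_width_bucket(130))          # bucket 2 (W = 64)
```

Either way, the index stores one small bucket id instead of an exact position, which is what shrinks the index and speeds up query evaluation.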
Effectiveness (CIKM'11)
[Figure omitted]
Roadmap: Pseudo Test Collections (training L2R, evaluation) [SIGIR '11]
Test Collections
● Documents, queries, and relevance judgments
● An important driving force behind IR innovation
● Without test collections, it's impossible to evaluate search systems or to tune ranking functions / train models
● Traditional approaches: exhaustive judging, pooling
● Recent methodologies: behavioral logging (query logs, click logs, etc.), minimal test collections, crowdsourcing
Web Graph (SIGIR 2012)
[Figure: pages P1 through P7 connected by web-search links]
Queries and judgments?
[Figure: anchor text (e.g., from Bing) pointing at target page P5]
● Anchor text lines ≈ pseudo-queries
● Target pages ≈ relevant candidates
● Noise reduction?
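The mapping above (anchor text ≈ pseudo-query, target page ≈ relevant candidate) can be sketched as a tiny extraction step. This is a hypothetical data layout with a crude stop-anchor filter standing in for noise reduction; the real pipeline works over the ClueWeb web graph:

```python
from collections import defaultdict

# Hypothetical web-graph edges: (source_page, anchor_text, target_page)
edges = [
    ("P1", "tulip revolution", "P5"),
    ("P2", "tulip revolution", "P5"),
    ("P3", "kyrgyzstan", "P5"),
    ("P4", "click here", "P7"),        # noisy anchor text
]

STOP_ANCHORS = {"click here", "home", "read more"}   # crude noise reduction

# Anchor text line becomes a pseudo-query; its target pages become
# the relevant candidates judged for that pseudo-query
pseudo_queries = defaultdict(set)
for _src, anchor, target in edges:
    if anchor not in STOP_ANCHORS:
        pseudo_queries[anchor].add(target)

print(dict(pseudo_queries))
# {'tulip revolution': {'P5'}, 'kyrgyzstan': {'P5'}}
```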
[Results figure omitted] SIGIR'11
Roadmap: Iterative Processing with iHadoop [CloudCom 2011]
Iterative MapReduce Applications
● Many machine learning and data mining applications are iterative: PageRank, k-means, HITS, …
● Every iteration has to wait until the previous iteration has completely written its output to the DFS (unnecessary waiting time)
● Every iteration starts by reading back from the DFS what the previous iteration just wrote (wasted CPU time, I/O, and bandwidth)
● MapReduce is not designed to run iterative applications efficiently
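The barrier shows up in the shape of a naive iterative driver loop, sketched below (`run_mapreduce_job` and the DFS paths are hypothetical stand-ins for a real Hadoop job submission):

```python
def iterate(run_mapreduce_job, num_iters, dfs_in="dfs://data/iter0"):
    """Naive iterative driver: each iteration is a full MapReduce job."""
    path = dfs_in
    for i in range(1, num_iters + 1):
        out = "dfs://data/iter%d" % i
        # Barrier: this job must finish writing ALL of its output to the
        # DFS before the next iteration can start reading it back in.
        run_mapreduce_job(input_path=path, output_path=out)
        path = out     # the next iteration re-reads what was just written
    return path

# Toy stand-in for a Hadoop job submission, just to show the call pattern
log = []
def fake_job(input_path, output_path):
    log.append((input_path, output_path))

final = iterate(fake_job, 3)
print(final)  # dfs://data/iter3
```

Each pair in `log` is one full write-then-read round trip through the DFS, which is exactly the overhead an asynchronous pipeline tries to remove.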
Goal
[Figure omitted]
Asynchronous Pipeline (CloudCom'11)
[Figure omitted]
Conclusion
● MapReduce allows large-scale processing over web data
● Ivory: end-to-end open-source IR engine for research, built completely on Hadoop (even retrieval, directly from HDFS)
● Efficiency-effectiveness tradeoffs:
  • Cross-lingual pairwise similarity: efficient implementation using MapReduce, with an efficiency/effectiveness tradeoff
  • Approximate positional indexes: efficient, and as effective as exact positions
  • Pseudo test collections: possible! Effective for training L2R models
● MapReduce is not good for iterative algorithms
http://ivory.cc
Collaborators: Jimmy Lin, Don Metzler, Doug Oard, Ferhan Ture, Nima Asadi, Lidan Wang, Eslam Elnikety, Hany Ramadan
Thank You!
Questions?