Engineering a Set Intersection Algorithm for Information Retrieval
Alex Lopez-Ortiz
UNB / InterNAP
Joint work with Ian Munro and Erik Demaine
Overview
• Web Search Engine Basics
• Algorithms for set operations
• Theoretical Analysis
• Experimental Analysis
• Engineering an Improved Algorithm
• Conclusions
Web Search Engine Basics
• Crawl: sequential gathering process
• Document ID (DocID) for each web page
[Figure: a "Cool sites" page linking to SIGIR, SIGACT and SIGCOMM (URL shown: http://acm.org/home.html); the four pages are numbered 1 to 4.]
• Indexing: list of entries of the form
<word, docID1, docID2, ...>
E.g.
<cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4>
[Figure: the four example pages (Cool sites, SIGIR, SIGACT, SIGCOMM) labeled with DocIDs 1 to 4.]
• Postings set: the set of DocIDs of the documents containing a word or pattern.
SIGACT → {1,3}
SIGCOMM → {1,4}
[Figure: the four example pages with their DocIDs.]
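The indexing and postings-set slides can be illustrated with a toy inverted index. This is a minimal sketch, not the system used in the talk: the page contents in `pages` are assumed (the slides show only the page titles and DocIDs), and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the sorted list of DocIDs that contain it."""
    index = defaultdict(list)
    # Visiting DocIDs in increasing order keeps every postings list sorted.
    for doc_id in sorted(pages):
        for word in set(pages[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)

# Assumed page contents, chosen to match the example entries on the slide.
pages = {
    1: "cool SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}
index = build_index(pages)
print(index["SIGACT"])   # postings set of SIGACT
print(index["SIGCOMM"])  # postings set of SIGCOMM
```

Pattern entries such as <SIG, 1, 2, 3, 4> would need a string matching structure rather than this word-level dictionary.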
Search Engine Basics (cont.)
Postings set stored implicitly/explicitly in a string matching data structure
• PAT tree/array
• Inverted word index
• Suffix trees
• KMP (grep) ...
String Matching Problem
• Different performance characteristics for each solution
• Time/Space tradeoff (empirical)
• Linear time/linear space lower bound [Demaine/L-O, SODA 2001]
Search Engine Basics (cont.)
A user query is of the form:
keyword1 op keyword2 op … op keywordn
where each op is one of {and, or}
E.g.
computer and science or internet
Evaluating a Boolean Query
The interpretation of a boolean query is the mapping:
• keyword → postings set
• and → ∩ (set intersection)
• or → ∪ (set union)
E.g.
{computer} ∩ {science} ∪ {internet}
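A minimal sketch of this mapping, using Python's built-in sets. The postings in `postings` are invented for illustration, `evaluate` is a hypothetical name, and strict left-to-right evaluation is an assumption: the slides do not state an operator precedence.

```python
def evaluate(query, postings):
    """Evaluate 'w1 op w2 op ... wn' left to right, with op in {and, or}."""
    tokens = query.split()
    result = set(postings.get(tokens[0], ()))
    # Tokens alternate keyword, operator, keyword, ...
    for op, word in zip(tokens[1::2], tokens[2::2]):
        operand = set(postings.get(word, ()))
        if op == "and":
            result &= operand   # set intersection
        elif op == "or":
            result |= operand   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Hypothetical postings sets (DocIDs), not data from the talk.
postings = {
    "computer": [1, 2, 5],
    "science": [2, 5, 7],
    "internet": [3, 5],
}
print(evaluate("computer and science or internet", postings))
```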
Set Operations for Web Search Engines
• Average postings set size > 10 million
• Postings sets are sorted
Intersection Time Complexity
• Worst case linear in the size of the postings sets:
Θ(n)
{1,3,5,7} ∩ {1,3,5,7}
• In the size of the output?
{1,3,5,7} ∩ {2,4,6,8}
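The two examples can be checked with a plain merge-style intersection of sorted lists. This is an illustrative sketch; `merge_intersect` is a hypothetical name and the counter treats each three-way comparison as one step.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists; also return the number of comparisons made."""
    i = j = comparisons = 0
    out = []
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, comparisons

print(merge_intersect([1, 3, 5, 7], [1, 3, 5, 7]))
print(merge_intersect([1, 3, 5, 7], [2, 4, 6, 8]))
```

On the interleaved pair the output is empty, yet the merge still walks through both lists, so the running time cannot be bounded by the output size alone.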
Adaptive Algorithms
• Assume the intersection is empty.
What is the min number of comparisons
needed to ascertain this fact?
Examples
{1,2,3,4} ∩ {5,6,7,8}
Much ado About Nothing
A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.
E.g. A = {1,3,5,7}, B = {2,4,6,8}:
a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4
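For two sorted sets, a proof of non-intersection can be read off the merged order: one comparison at every point where the order switches from one set to the other. The sketch below is an illustration under that two-set assumption; `disjointness_proof` is a hypothetical name, and the general k-set case from the talk is not covered.

```python
def disjointness_proof(a, b):
    """Return a list of (x, y) pairs meaning 'x < y' that certifies that
    a and b are disjoint, or None if they share an element."""
    merged = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    proof = []
    for (x, src_x), (y, src_y) in zip(merged, merged[1:]):
        if src_x != src_y:
            if x == y:
                return None       # common element: no such proof exists
            proof.append((x, y))  # boundary between a run of one set and the other
    return proof

print(disjointness_proof([1, 2, 3, 4], [5, 6, 7, 8]))       # one comparison
print(len(disjointness_proof([1, 3, 5, 7], [2, 4, 6, 8])))  # full chain
```

The separated pair {1,2,3,4}, {5,6,7,8} needs a single comparison (4 < 5), while the interleaved pair needs the whole chain of seven.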
Adaptive Algorithms
In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in:
k · | shortest proof of non-intersection |
steps.
Ideal for crawled, “bursty” data sets
How does it work?
• <SIGACT, 1, 3, i, n>
[Figure: the DocID universe set drawn as a line 1, _, 3, ..., i, ..., n, with the postings of SIGACT marked.]
Measuring Performance
• 100MB Web Crawl
• 5000 queries from Google
Baseline Standard Algorithm
• Sort sets by size
• Candidate answer set is smallest set
• For each set S in increasing order by size
  – For each element e in candidate set
    • Binary search for e in S
    • If e is not found remove from candidate set
    • Remove elements before e in S
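The steps above can be sketched as follows. This is one possible reading of the slide, not the code measured in the talk: `baseline_intersect` is a hypothetical name, "removing elements before e" is modelled by a moving lower bound `lo` instead of physical deletion, and the early exit on an empty candidate set is an added convenience.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists of distinct integers, smallest set first."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)      # sort sets by size
    candidates = list(ordered[0])        # candidate answer set = smallest set
    for s in ordered[1:]:
        lo = 0                           # elements of s before lo are discarded
        kept = []
        for e in candidates:
            lo = bisect_left(s, e, lo)   # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                kept.append(e)           # e survives this set
        candidates = kept
        if not candidates:
            break                        # nothing left to test
    return candidates

print(baseline_intersect([[6, 7, 10, 11, 14],
                          [4, 8, 10, 11, 15],
                          [1, 2, 4, 5, 7, 8, 9]]))
```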
Upper Bound: Adaptive/Traditional Two-Smallest Algorithm
Lower Bound: Adaptive/Shortest Proof
Middle Bound: Adaptive/Encoding of Shortest Proof
Side by Side
Lower Bound
Middle Bound
Possible Improvements
• Adaptive performs best on two or three sets
• Traditional algorithm often terminates after first pair of sets
• Galloping seems better than binary search
• Adaptive keeps a dynamic definition of “smallest set”
• Candidate elements aggressively tested
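Galloping (doubling) search probes positions at growing distances from the current one and then finishes with a binary search inside the last gap, so its cost depends on how far the target is rather than on the size of the set. A minimal sketch, with `gallop` and its `factor` parameter as hypothetical names:

```python
from bisect import bisect_left

def gallop(s, target, start=0, factor=2):
    """Return the smallest index i >= start with s[i] >= target
    (len(s) if there is none). s must be sorted."""
    n = len(s)
    if start >= n:
        return n
    if s[start] >= target:
        return start
    lo = start               # invariant: s[lo] < target
    step = 1
    hi = start + step
    while hi < n and s[hi] < target:
        lo = hi
        step *= factor       # probe start+1, start+2, start+4, ...
        hi = start + step
    # The answer lies in (lo, hi]; finish with a binary search in that gap.
    return bisect_left(s, target, lo + 1, min(hi, n))

s = [1, 2, 4, 5, 7, 8, 9]
print(gallop(s, 7))    # index of 7
print(gallop(s, 10))   # past the end: no element >= 10
```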
Example
{6, 7,10,11,14}
{4, 8,10,11,15}
{1, 2, 4, 5, 7, 8, 9}
Experimental Results
Test each possible improvement orthogonally
• Cyclic or Two Smallest
• Symmetric
• Update Smallest
• Advance on Common Element
• Gallop Factor/Binary Search
Binary Search vs. Gallop
Advance on Common Element
Small Adaptive
Combines best of Adaptive and Two-Smallest
• Two-smallest
• Symmetric
• Advance on common element
• Update on smallest
• Gallop with factor 2
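A rough sketch of how these ingredients might fit together. It is an interpretation of the bullet list, not the authors' implementation: the function names are hypothetical, "update on smallest" is modelled by re-ranking the sets by remaining size before each candidate, "advance on common element" by continuing into the remaining sets as soon as a candidate is found, and the "symmetric" option is not modelled.

```python
from bisect import bisect_left

def gallop(s, target, start=0, factor=2):
    """Smallest index i >= start with s[i] >= target (len(s) if none)."""
    n = len(s)
    if start >= n:
        return n
    if s[start] >= target:
        return start
    lo, step, hi = start, 1, start + 1
    while hi < n and s[hi] < target:
        lo = hi
        step *= factor
        hi = start + step
    return bisect_left(s, target, lo + 1, min(hi, n))

def small_adaptive(sets):
    """Intersect sorted lists of distinct integers."""
    lists = [list(s) for s in sets]
    k = len(lists)
    pos = [0] * k                      # first unprocessed index in each set
    result = []
    while k > 0:
        # Update smallest: rank sets by how many elements remain.
        order = sorted(range(k), key=lambda i: len(lists[i]) - pos[i])
        first = order[0]
        if pos[first] >= len(lists[first]):
            break                      # smallest set exhausted
        e = lists[first][pos[first]]   # next candidate element
        pos[first] += 1
        found = True
        for i in order[1:]:
            pos[i] = gallop(lists[i], e, pos[i])
            if pos[i] >= len(lists[i]):
                return result          # a set ran out: no further matches
            if lists[i][pos[i]] != e:
                found = False
                break                  # e is eliminated
            pos[i] += 1                # common so far: advance to the next set
        if found:
            result.append(e)
    return result

print(small_adaptive([[6, 7, 10, 11, 14],
                      [4, 8, 10, 11, 15],
                      [1, 2, 4, 5, 7, 8, 9]]))
```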
Small Adaptive
Small Adaptive
• Small Adaptive is faster than Two-Smallest
• Aggregate speed-up of 2.9x in comparisons
• Faster than Adaptive
Conclusions
• Faster intersection algorithm for Web Search Engines
• Adaptive measure for set operations
• Information theoretic “middle bound”
• Standard speed-up techniques for other settings
THE END
Query Log
[Figure: number of queries for each total size (total number of elements in a query).]
Example
{6, 7,10,11,14}
{4, 8,10,11,15}
{1, 2, 4, 5, 7, 8, 9, 12}