Type Less, Find More: Fast Autocompletion Search with a Succinct Index
Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany
joint work with Ingmar Weber
SIGIR 2006 in Seattle, USA, August 6 - 11
Basic Autocompletion
– saves typing
– no more information than necessary
salton
– find out about formulations used
autocomplete, autocompose
– error correction
autocomplit, autocompleet
It's useful
It's more useful
Complete to phrases
– phrase voronoi diagram → add word voronoi_diagram to index
Complete to subwords
– compound word eigenproblem → add word problem to index
Complete to category names
– author Börkur Sigurbjörnsson → add sigurbjörnson:börkur::author börkur::sigurbjörnson:author
Faceted search
– add ct:conference:sigir
– add ct:author:Börkur_Sigurbjörnson
– add ct:year:2005
all via the same mechanism
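A minimal sketch of how such artificial index words might be generated before indexing. The function names `phrase_words` and `facet_words`, the phrase table, and the sample tokens are illustrative assumptions; only the word formats `voronoi_diagram` and `ct:<facet>:<value>` are taken from the slide.

```python
from typing import Dict, List, Set, Tuple


def phrase_words(tokens: List[str], phrases: Set[Tuple[str, str]]) -> List[str]:
    """Return one extra index word (e.g. 'voronoi_diagram') per known two-word phrase."""
    extra = []
    for first, second in zip(tokens, tokens[1:]):
        if (first, second) in phrases:
            extra.append(f"{first}_{second}")
    return extra


def facet_words(facets: Dict[str, str]) -> List[str]:
    """Return one extra index word per facet, in the 'ct:<facet>:<value>' style."""
    return [f"ct:{name}:{value.replace(' ', '_')}" for name, value in facets.items()]


# Hypothetical document: its ordinary words plus the artificial ones.
tokens = ["the", "voronoi", "diagram", "of", "a", "point", "set"]
index_words = tokens + phrase_words(tokens, {("voronoi", "diagram")})
index_words += facet_words({"conference": "sigir", "year": "2005"})
```

Because the extra words are ordinary index entries, a prefix such as `ct:year:` can be completed by the same mechanism as any other prefix.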
Workshop on Faceted Search on Thursday
Related Engines
Basic Problem Definition
Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of last word)
Answer
– all documents D' from D, containing a word from W
– all words W' from W, contained in a document from D
Extensions (see paper)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
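The problem definition above can be written down directly as a brute-force reference. This is only a sketch for checking faster methods against, not an efficient algorithm; the function name `autocomplete` and the toy documents are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def autocomplete(docs: Dict[int, Set[str]], D: Set[int], W: Set[str]) -> Tuple[Set[int], Set[str]]:
    """Return (D', W'): the documents of D containing a word from W,
    and the words of W contained in a document from D."""
    d_prime: Set[int] = set()
    w_prime: Set[str] = set()
    for doc_id in D:
        found = docs[doc_id] & W      # words of W occurring in this document
        if found:
            d_prime.add(doc_id)
            w_prime |= found
    return d_prime, w_prime


# Toy collection (hypothetical): doc id -> set of words in the document.
docs = {
    3: {"sigir", "salton"},
    8: {"salary"},
    16: {"sigirforum"},
    18: {"sigir", "salton", "salvador"},
}
D = {3, 16, 18}                                   # hits for the first part of the query
W = {"salary", "salesman", "salton", "salvador"}  # potential completions of the last word
d_prime, w_prime = autocomplete(docs, D, W)
```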
First try: inverted index (INV)
Processing 1-word queries with INV
For example, sigir*
D all documents
W all words matching sigir*
Iterate over all words from W
sigir Doc. 18, Doc. 53, Doc. 591, ...
sigir03 Doc. 3, Doc. 66, Doc. 765, ...
sigir04 Doc. 25, Doc. 98, Doc. 221, ...
sigirlist Doc. 67, Doc. 189, Doc. 221, ...
sigirforum Doc. 16, Doc. 110, Doc. 141, ...
Merge the document lists
D' Doc. 3, Doc. 16, Doc. 18, Doc. 25, …
Output all words from range as completions
W' sigir, sigir03, sigir04, sigirlist, …
Expensive!
Trivial for 1-word queries
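The 1-word case can be sketched as follows, assuming the inverted index is a dictionary from word to sorted doc id list and the vocabulary is kept sorted. The postings are truncated to the three ids shown on the slide; the names `prefix_range` and `one_word_query` are illustrative.

```python
import heapq
from bisect import bisect_left
from typing import List, Tuple

# Inverted lists from the slide, truncated to the doc ids that are shown.
INV = {
    "sigir": [18, 53, 591],
    "sigir03": [3, 66, 765],
    "sigir04": [25, 98, 221],
    "sigirforum": [16, 110, 141],
    "sigirlist": [67, 189, 221],
}
VOCAB = sorted(INV)


def prefix_range(vocab: List[str], prefix: str) -> List[str]:
    """All words with the given prefix, found by binary search in the sorted vocabulary."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\U0010ffff")  # just past the last word with this prefix
    return vocab[lo:hi]


def one_word_query(prefix: str) -> Tuple[List[int], List[str]]:
    """Return (D', W') for a 1-word prefix query."""
    words = prefix_range(VOCAB, prefix)
    hits: List[int] = []
    # Merging the sorted lists is the expensive step.
    for doc_id in heapq.merge(*(INV[w] for w in words)):
        if not hits or hits[-1] != doc_id:          # drop duplicates such as Doc. 221
            hits.append(doc_id)
    return hits, words                              # every word in the range is a completion


hits, completions = one_word_query("sigir")
```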
Processing multi-word queries with INV
For example, sigir* sal*
D Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for sigir*)
W all words matching sal*
Iterate over all words from W
salary Doc. 8, Doc. 23, Doc. 291, ...
salesman Doc. 24, Doc. 36, Doc. 165, ...
salton Doc. 3, Doc. 18, Doc. 66, ...
salutation Doc. 56, Doc. 129, Doc. 251, ...
salvador Doc. 18, Doc. 21, Doc. 25, ...
Intersect each list with D, then merge
D' Doc. 3, Doc. 18, Doc. 25, …
Output all words with non-empty intersection
W' salton, salvador
Most intersections are empty, but INV has to compute them all!
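A small sketch of the multi-word case makes the cost visible: one intersection per potential completion, whether or not it turns out empty. The lists are the slide's, truncated to the ids shown; `intersect_sorted` and `multi_word_query` are illustrative names.

```python
from typing import Dict, List, Tuple

# Inverted lists for the words matching sal*, truncated to the ids on the slide.
INV_SAL: Dict[str, List[int]] = {
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
}


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted doc id lists with a two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def multi_word_query(D: List[int], inv: Dict[str, List[int]]) -> Tuple[List[int], List[str], int]:
    """Return (D', W', number of intersections computed)."""
    hit_set = set()
    completions = []
    intersections = 0
    for word, postings in inv.items():
        common = intersect_sorted(D, postings)
        intersections += 1
        if common:                       # only non-empty intersections contribute
            completions.append(word)
            hit_set.update(common)
    return sorted(hit_set), completions, intersections   # merging requires a final sort


D = [3, 16, 18, 25]                      # hits for sigir*, as on the slide
d_prime, w_prime, work = multi_word_query(D, INV_SAL)
```

On this toy input five intersections are computed, and only two of them are non-empty.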
INV — Problems
Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)
Still hard to beat INV in practice
– highly compressible
half the space on disk means half the time to read it
– INV has very good locality of access
the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory
– simple code
instruction cache, branch prediction, etc.
A Hybrid Index (HYB)
But this looks very wasteful
Basic Idea: have lists for ranges of words
salary – salvador Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
Problem: not enough to show completions
Solution: store the word(s) along with each doc id
salary – salvador Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
salary salvador salton salary
salton salvador
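The idea of storing the words alongside the doc ids can be sketched as follows. It assumes a block is a list of (doc id, word) pairs sorted by doc id, and it reuses the postings from the earlier sal* example as hypothetical block content. The set lookup stands in for whatever intersection routine a real implementation would use; `build_block` and `hyb_query` are illustrative names.

```python
from typing import Dict, Iterable, List, Tuple

Block = List[Tuple[int, str]]


def build_block(inv: Dict[str, List[int]]) -> Block:
    """Flatten the lists of one word range into (doc id, word) pairs sorted by doc id."""
    return sorted((doc_id, word) for word, postings in inv.items() for doc_id in postings)


def hyb_query(block: Block, D: Iterable[int], prefix: str) -> Tuple[List[int], List[str]]:
    """Answer a prefix query with a single sequential scan over one block."""
    wanted = set(D)
    hits: List[int] = []
    completions = set()
    for doc_id, word in block:
        if doc_id in wanted and word.startswith(prefix):
            if not hits or hits[-1] != doc_id:   # block is sorted, so duplicates are adjacent
                hits.append(doc_id)
            completions.add(word)
    return hits, sorted(completions)


# Hypothetical content for the block covering the range salary - salvador.
block = build_block({
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
})
hits, completions = hyb_query(block, [3, 16, 18, 25], "sal")
```

One pass over the block replaces the five separate intersections and the final merge of the INV approach.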
HYB — Details
HYB has a block for each word range, conceptually:
doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words: D A C A B A C A D A A B C A C A
Replace doc ids by gaps and words by frequency ranks:
gaps: +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that a value x takes about log2 x bits:
gap codes: +0 → 0, +1 → 10, +2 → 110
rank codes: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
encoded gaps: 10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
These two bit sequences are an actual block of HYB.
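The worked example on this slide can be reproduced with a few lines. The two code tables are copied from the slide and cover only the values in this toy block; they are not a general-purpose code. Breaking frequency ties by first occurrence is an assumption that happens to yield the slide's ranking (D before B).

```python
from collections import Counter
from typing import Dict, List, Tuple

DOC_IDS = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
WORDS = list("DACABACADAABCACA")

# Toy prefix codes from the slide: small values get short codes.
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Replace each doc id by its difference to the previous one (first gap is from 0)."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def to_ranks(words: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Replace each word by its frequency rank within the block (1 = most frequent)."""
    counts = Counter(words)
    first_pos: Dict[str, int] = {}
    for pos, word in enumerate(words):
        first_pos.setdefault(word, pos)
    # Assumption: ties in frequency are broken by first occurrence.
    order = sorted(counts, key=lambda w: (-counts[w], first_pos[w]))
    rank_of = {word: rank for rank, word in enumerate(order, start=1)}
    return [rank_of[w] for w in words], rank_of


gaps = to_gaps(DOC_IDS)
ranks, rank_of = to_ranks(WORDS)
gap_bits = " ".join(GAP_CODE[g] for g in gaps)
rank_bits = " ".join(RANK_CODE[r] for r in ranks)
```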
How well does it compress? Which block size?
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

            MEDICINE          WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions    with positions     no positions
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Nice match of theory and practice
ni = number of documents containing i-th word, n = number of documents
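The two formulas can be evaluated directly to see how small the overhead of HYB is. The sketch below is a plain transcription of the formulas; the function names and the document frequencies are made-up toy values, not data from the experiments.

```python
import math
from typing import Sequence


def entropy_inv(n: int, doc_freqs: Sequence[int]) -> float:
    """Sum over words of ni * (1/ln 2 + log2(n/ni)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(n: int, doc_freqs: Sequence[int], eps: float) -> float:
    """Sum over words of ni * ((1+eps)/ln 2 + log2(n/ni)), in bits."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


# Hypothetical toy collection: n documents, ni = documents containing the i-th word.
n = 1000
doc_freqs = [500, 120, 40, 8, 1]
eps = 0.1
bits_inv = entropy_inv(n, doc_freqs)
bits_hyb = entropy_hyb(n, doc_freqs, eps)
overhead = bits_hyb - bits_inv   # equals eps / ln 2 bits per (word, document) pair
```

The difference between the two sums is ε/ln 2 bits per posting, so for small ε the two indexes have nearly the same size.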
INV vs. HYB — Query Time
Experiment: type ordinary queries from left to right
– sig, sigi, sigir, sigir sal, sigir salt, sigir salto, sigir salton

            MEDICINE              WIKIPEDIA             TREC .GOV
            44,015 docs           2,866,503 docs        25,204,013 docs
            263,817 words         6,700,119 words       25,263,176 words
            5,732 real queries    100 random queries    50 TREC queries
            with proximity        with proximity        no proximity
INV  avg    0.03 secs             0.17 secs             0.58 secs
     max    0.38 secs             2.27 secs             16.83 secs
HYB  avg    0.003 secs            0.05 secs             0.11 secs
     max    0.06 secs             0.49 secs             0.86 secs

HYB better by an order of magnitude
Theoretical analysis: see paper
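The left-to-right typing experiment on this slide can be simulated by expanding each full query into the sequence of prefix queries a user would issue. The sketch below is one way to generate that sequence; the function name and the minimum prefix length of 3 are assumptions read off the example (it starts at sig and jumps from sigir to sigir sal), not a documented setting.

```python
from typing import List


def keystroke_queries(query: str, min_prefix: int = 3) -> List[str]:
    """Expand a full query into the prefix queries issued while typing it."""
    words = query.split()
    out = []
    for i, word in enumerate(words):
        done = " ".join(words[:i])                 # words already typed in full
        start = min(min_prefix, len(word))         # short words are issued once
        for k in range(start, len(word) + 1):
            out.append((done + " " + word[:k]).strip())
    return out


steps = keystroke_queries("sigir salton")
```

Each entry of `steps` is one autocompletion query, and the reported times are per such query.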
System Design — High Level View
Debugging such an application is hell!
Compute Server (C++)
Web Server (PHP)
User Client (JavaScript)
Summary of Results
Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)
Autocompletion and more
– phrase and subword completion, semantic completion, XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod] (NEW)
all with one and the same (efficient) mechanism
INV vs. HYB — Space Consumption
Definition: empirical entropy H = optimal number of bits
Theorem: H(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Perfect match of theory and practice
ni = number of documents containing i-th word, n = number of documents
INV vs. HYB — Space Consumption
Theorem: Entropy(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: Entropy(HYB) = Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Perfect match of theory and practice
We define a notion of empirical entropy in the paper, in terms of
ni = number of documents containing i-th word, n = number of documents
HYB vs. INV — Query Time
            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
INV  avg    0.03 secs         0.17 secs          0.58 secs
     max    0.38 secs         2.27 secs          16.83 secs
HYB  avg    0.003 secs        0.05 secs          0.11 secs
     max    0.06 secs         0.49 secs          0.86 secs
Processing a 1-word Query with INV
Processing a 1-word query, e.g., sigir*
1. Iterate over all words matching sigir*
sigir Doc. 18, Doc. 53, Doc. 591, ...
sigir03 Doc. 3, Doc. 66, Doc. 765, ...
sigir04 Doc. 25, Doc. 98, Doc. 221, ...
sigir05 Doc. 57, Doc. 99, Doc. 110, ...
sigirlist Doc. 67, Doc. 189, Doc. 221, ...
sigirforum Doc. 16, Doc. 110, Doc. 141, ...
2. Merge the document lists
Hits Doc. 3, Doc. 16, Doc. 18, ...
Completions
sigir, sigir03, sigir04, sigir05, ...
Processing sigir* sal with INV
Iterate over all words matching sigir*
sigir Doc. 18, Doc. 53, Doc. 591, ...
sigir03 Doc. 3, Doc. 66, Doc. 765, ...
sigir04 Doc. 25, Doc. 98, Doc. 221, ...
sigirlist Doc. 67, Doc. 189, Doc. 221, ...
sigirforum Doc. 16, Doc. 110, Doc. 141, ...
Merge the document lists
Hits D' Doc. 3, Doc. 16, Doc. 18, …
Output all words from range as completions
Completions W' sigir, sigir03, sigir05, …
Expensive!
Trivial for 1-word queries
Using an Inverted Index (INV)
salary Doc. 18, Doc. 53, Doc. 591, ...
salesman Doc. 3, Doc. 66, Doc. 765, ...
salient Doc. 25, Doc. 98, Doc. 221, ...
salton Doc. 57, Doc. 99, Doc. 110, ...
salutation Doc. 67, Doc. 189, Doc. 221, ...
salvador Doc. 16, Doc. 110, Doc. 141, ...
salvucci Doc. 18, Doc. 25, Doc. 765, ...
salzberg Doc. 53, Doc. 121, Doc. 187, ...
D Doc. 57, Doc. 87, Doc. 110, ...
W salary - salzberg
D' Doc. 57, Doc. 110, ...
W' salton, salvador
Problem 1: one intersection per potential completion
Problem 2: merging of non-empty intersections
HYB — Details
HYB has a block for each word range
one block of HYB:
document ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words: D A C A B A C A D A A B C A C A
gaps: +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks by frequency: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
universal encoding: small gaps/ranks => short codes
gap codes: +0 → 0, +1 → 10, +2 → 110
rank codes: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
encoded gaps: 10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
INV vs. HYB — Query Time
            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
INV  avg    0.03 secs         0.17 secs          0.58 secs
     max    0.38 secs         2.27 secs          16.83 secs
HYB  avg    0.003 secs        0.05 secs          0.11 secs
     max    0.06 secs         0.49 secs          0.86 secs

avg = average time per keystroke
max = maximum time per keystroke (outliers removed)
Start with DEMO
autocomp, sig, sigir
sigir sal, sal
Related Search Engine Features
Complete from precompiled list of queries
– Google Suggest
– AllTheWeb Livesearch
– …
Desktop Search engines
– Apple Spotlight
– Copernic Desktop Search
– …