Indexing LBSC 708A/CMSC 838L Session 7, October 23, 2001 Philip Resnik.
Indexing
LBSC 708A/CMSC 838L
Session 7, October 23, 2001
Philip Resnik
Agenda
• Questions
• Finish up evaluation from last time
• Computational complexity
• Inverted indexes
• Question answering and predictive annotation
• Two minute paper
Supporting the Search Process
[Figure: search process diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Document. IR System components: Indexing, Index, Acquisition, Collection.]
Some Questions for Today
• How long will it take to find a document?
– Is there any work we can do in advance?
• If so, how long will that take?
• How big a computer will I need?
– How much disk space? How much RAM?
• What if more documents arrive?
– How much of the advance work must be repeated?
– Will searching become slower?
– How much more disk space will be needed?
A Cautionary Tale
• Searching is easy - just ask Microsoft!
– “Find” can search my 1 GB disk in 30 seconds
• Well, actually it only looks at the file names...
• How long do you think “Find” would take for
– A 100 GB disk?
– The World Wide Web?
• Computers are getting faster, but…
– How does AltaVista give answers in 5 seconds?
Find “complex” in the dictionary
[Figure: word lists for the lookup exercise. Unordered: marsupial, belligerent, complex. Alphabetical: arcade, astronomical, mastiff, relatively, relaxation, resplendent.]
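The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the word lists are taken from the slide, while the function names and the comparison counter are assumptions made for the example. An unordered list forces a scan of every entry, whereas an alphabetical list allows binary search.

```python
from bisect import bisect_left


def linear_search(words, target):
    """Scan entries one by one; return (index, comparisons made)."""
    comparisons = 0
    for i, word in enumerate(words):
        comparisons += 1
        if word == target:
            return i, comparisons
    return -1, comparisons


def binary_search(sorted_words, target):
    """Halve the search range at each step; needs alphabetical order."""
    i = bisect_left(sorted_words, target)
    if i < len(sorted_words) and sorted_words[i] == target:
        return i
    return -1


unordered = ["marsupial", "belligerent", "complex"]
dictionary = sorted(unordered + [
    "arcade", "astronomical", "mastiff",
    "relatively", "relaxation", "resplendent",
])

print(linear_search(unordered, "complex"))
print(binary_search(dictionary, "complex"))
```

The linear scan does work proportional to the list length, while the binary search does work proportional to its logarithm, which is the point the following slides make precise.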
Computational Complexity
• One question: how long does stuff take?
• Another: how much space do you need?
• Things you need to know:
– What is the size of the input? (usually n)
• What aspects of the input are we paying attention to?
– How is the input represented?
– How is the output represented?
– What are the internal data structures?
– What is the algorithm?
Worst Case Complexity
[Figure: two plots comparing growth. First plot (n from 10 to 40, vertical axis 0 to 4500): 10n, n^2, 100n. Second plot (n from 50 to 350, vertical axis 0 to 140000): 10n, n^2, 100n, 100n+25263.]
10n: O(n)
100n: O(n)
100n+25263: O(n)
n^2: O(n^2)
n^2+45662: O(n^2)
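A small sketch of why constant factors and additive constants drop out of the O() notation: for each linear function on the slide, there is some n beyond which n^2 is always larger. The helper name `crossover` and the search limit are assumptions made for this example.

```python
def crossover(f, g, limit=10_000):
    """Smallest n >= 1 with f(n) > g(n), or None if none up to limit."""
    for n in range(1, limit + 1):
        if f(n) > g(n):
            return n
    return None


def quadratic(n):
    return n * n


# Linear functions from the slide, each O(n) despite different constants.
linear_functions = {
    "10n": lambda n: 10 * n,
    "100n": lambda n: 100 * n,
    "100n+25263": lambda n: 100 * n + 25263,
}

for name, g in linear_functions.items():
    print(name, "is overtaken by n^2 at n =", crossover(quadratic, g))
```

Since both sides are polynomials, once n^2 pulls ahead it stays ahead, so the crossover point only moves; it never disappears.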
Hierarchy of Complexity
• Constant, i.e. O(1)
n doesn’t matter
• Sublinear, e.g. O(log n)
n = 65536, log2 n = 16
• Linear, i.e. O(n)
n = 65536, n = 65536
• Polynomial, e.g. O(n^3)
n = 65536, n^3 = 281,474,976,710,656
• Exponential, e.g. O(2^n)
n = 65536, beyond astronomical
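The figures in this hierarchy can be checked directly. The sketch below is illustrative only (the function name `describe` is an assumption); rather than printing 2^n itself, it counts how many decimal digits that number would have.

```python
import math


def describe(n):
    """Evaluate each complexity class of the hierarchy at input size n."""
    return {
        "log2": math.log2(n),
        "linear": n,
        "cubic": n ** 3,
        # Number of decimal digits in 2**n, computed without building it.
        "digits_in_2_to_n": math.floor(n * math.log10(2)) + 1,
    }


for name, value in describe(65536).items():
    print(name, value)
```

For n = 65536, the exponential entry is a number with nearly twenty thousand digits, which is what “beyond astronomical” amounts to.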
Example: matching URLs
N English URLs:
http://goodstuff.com/eng/menu-en.htm
http://goodstuff.com/eng/prods.htm
http://goodstuff.com/eng/help.htm
…
M French URLs:
http://goodstuff.com/fra/menu-fr.htm
http://goodstuff.com/fra/prods.htm
http://goodstuff.com/fra/help.htm
…
P patterns:
en -> fr
en -> fra
eng -> fr
eng -> fre
… -> …
• Generate alternatives and look them up
– Store French URLs in a hash table, takes O(M) time
– For each English URL, substitute all possible patterns…
http://goodstuff.com/eng/menu-en.htm
http://goodstuff.com/fra/menu-en.htm
http://goodstuff.com/fra/mfru-en.htm
http://goodstuff.com/fra/menu-fr.htm
http://goodstuff.com/fra/mfru-fr.htm
…
Example, cont’d
Exponential! O(N * 2^k)
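A minimal sketch of the generate-and-look-up approach, to show where the blow-up comes from. The function names and the two-pattern list are assumptions for illustration (the slide's pattern list is longer and ends in an ellipsis). At every position where a pattern matches, the code either substitutes or leaves the text alone, so the number of candidates multiplies with each match site.

```python
def alternatives(url, patterns):
    """All strings obtained by substituting, or not, at each match site."""
    def expand(prefix, rest):
        if not rest:
            yield prefix
            return
        # Choice 1: keep the next character unchanged.
        yield from expand(prefix + rest[0], rest[1:])
        # Choice 2: apply any pattern that matches at this position.
        for src, dst in patterns:
            if rest.startswith(src):
                yield from expand(prefix + dst, rest[len(src):])
    return set(expand("", url))


def match_urls(english_urls, french_urls, patterns):
    """Pair each English URL with French URLs found among its alternatives."""
    french = set(french_urls)  # hash-based lookup, built in O(M)
    pairs = {}
    for e in english_urls:
        hits = alternatives(e, patterns) & french
        if hits:
            pairs[e] = sorted(hits)
    return pairs


# Hypothetical pattern list; the slide lists others as well.
patterns = [("eng", "fra"), ("en", "fr")]
english = [
    "http://goodstuff.com/eng/menu-en.htm",
    "http://goodstuff.com/eng/prods.htm",
    "http://goodstuff.com/eng/help.htm",
]
french = [
    "http://goodstuff.com/fra/menu-fr.htm",
    "http://goodstuff.com/fra/prods.htm",
    "http://goodstuff.com/fra/help.htm",
]

print(len(alternatives(english[0], patterns)))
print(match_urls(english, french, patterns))
```

Even with two patterns and three match sites, one short URL already yields a dozen candidates, including spurious ones such as the “mfru” variants shown on the slide.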
Example, cont’d
• Alternative: use clever string matching
– Can compare a URL e to a URL f in O(L^2) time, where L is the length of the URLs in characters (small!)
• Requires O(N*M) comparisons of URLs
– Too slow! (Sometimes N and M are in the thousands…)
• Speculation: O(M+N) algorithm if you store URLs in a trie?
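One Good Trie?
[Figure: a trie over the strings menu-fr, menu-en, mercredi, and merci, branching character by character (m, e, n, u, -, f, r, …). The menu-fr and menu-en entries and the “d i” branch are marked with the number 2; one branch is labeled “en, eng”.]
Looks promising, but the obvious solution still leads to an exponential algorithm.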
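For readers who have not met the data structure, here is a minimal trie sketch over the four strings in the figure. It only shows how shared prefixes are stored once; it does not attempt the O(M+N) matching algorithm the slide speculates about, and the class and method names are assumptions for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_end = False   # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, s):
        node = self._walk(s)
        return node is not None and node.is_end

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for word in ["menu-fr", "menu-en", "mercredi", "merci"]:
    trie.insert(word)

print(trie.contains("merci"), trie.contains("mer"), trie.has_prefix("menu-"))
```

Lookup time depends on the length of the string, not on how many strings are stored, which is what makes the structure attractive here.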
Agenda
• Questions
• Computational complexity
• Inverted indexes
• Question answering and predictive annotation
• Two minute paper
The “Inverted File” Trick
• Organize the bag of words matrix by terms
– You know the terms that you are looking for
• Look up terms like you search dictionaries
– For each letter, jump directly to the right spot
• For terms of reasonable length, this is very fast
– For each term, store the document identifiers
• For every document that contains that term
• At query time, use the document identifiers
– Consult a “postings file”
An Example
Term   Doc: 1 2 3 4 5 6 7 8  Inverted File  Postings
aid         0 0 0 1 0 0 0 1  A  AI          4, 8
all         0 1 0 1 0 1 0 0     AL          2, 4, 6
back        1 0 1 0 0 0 1 0  B  BA          1, 3, 7
brown       1 0 1 0 1 0 1 0     BR          1, 3, 5, 7
come        0 1 0 1 0 1 0 1  C              2, 4, 6, 8
dog         0 0 1 0 1 0 0 0  D              3, 5
fox         0 0 1 0 1 0 1 0  F              3, 5, 7
good        0 1 0 1 0 1 0 1  G              2, 4, 6, 8
jump        0 0 1 0 0 0 0 0  J              3
lazy        1 0 1 0 1 0 1 0  L              1, 3, 5, 7
men         0 1 0 1 0 0 0 1  M              2, 4, 8
now         0 1 0 0 0 1 0 1  N              2, 6, 8
over        1 0 1 0 1 0 1 1  O              1, 3, 5, 7, 8
party       0 0 0 0 0 1 0 1  P              6, 8
quick       1 0 1 0 0 0 0 0  Q              1, 3
their       1 0 0 0 1 0 1 0  T  TH          1, 5, 7
time        0 1 0 1 0 1 0 0     TI          2, 4, 6
The Finished Product
Term    Inverted File  Postings
aid     A  AI          4, 8
all        AL          2, 4, 6
back    B  BA          1, 3, 7
brown      BR          1, 3, 5, 7
come    C              2, 4, 6, 8
dog     D              3, 5
fox     F              3, 5, 7
good    G              2, 4, 6, 8
jump    J              3
lazy    L              1, 3, 5, 7
men     M              2, 4, 8
now     N              2, 6, 8
over    O              1, 3, 5, 7, 8
party   P              6, 8
quick   Q              1, 3
their   T  TH          1, 5, 7
time       TI          2, 4, 6
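The construction can be sketched as follows. This is a toy illustration, not the lecture's implementation: the four documents are made up for the example, a Python dictionary stands in for the inverted file, and each dictionary value plays the role of a postings list. A Boolean AND query is answered by intersecting sorted postings lists.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def boolean_and(index, *terms):
    """Documents containing every query term."""
    if not terms:
        return []
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical toy collection (stopwords already removed).
docs = {
    1: "quick brown fox",
    2: "now time good men",
    3: "quick fox jump lazy dog",
    4: "good men come",
}
index = build_index(docs)
print(index["quick"])
print(boolean_and(index, "good", "men", "come"))
```

The query touches only the postings of the query terms; no document text is scanned at search time.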
What Goes in a Postings File?
• Boolean retrieval
– Just the document number
• Ranked Retrieval
– Document number and term weight (TF*IDF, ...)
• Proximity operators
– Word offsets for each occurrence of the term
• Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
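As a sketch of the proximity case, the postings below store word offsets per document, in the spirit of the “Doc 3 (t17, t36)” example. The two short documents, the function names, and the `near` operator's definition (some pair of occurrences at most k words apart) are assumptions for illustration.

```python
def build_positional_index(docs):
    """Map term -> {doc id -> list of 1-based word offsets}."""
    index = {}
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(offset)
    return index


def near(index, a, b, k):
    """Ids of documents where a and b occur within k words of each other."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(i - j) <= k
               for i in postings_a[doc_id]
               for j in postings_b[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical documents, numbered 3 and 13 to echo the slide's example.
docs = {
    3: "the light bulb was patented after the bulb was shown",
    13: "bulb makers sold every light",
}
index = build_positional_index(docs)
print(index["bulb"])
print(near(index, "light", "bulb", 1))
```

Storing one offset per occurrence, rather than one entry per document, is why this kind of postings file grows so much larger than the Boolean one.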
How Big Is the Postings File?
• Very compact for Boolean retrieval
– About 10% of the size of the documents
• If an aggressive stopword list is used!
• Not much larger for ranked retrieval
– Perhaps 20%
• Enormous for proximity operators
– Sometimes larger than the documents!
• But access is fast - you know where to look
Building an Inverted Index
• Simplest solution is a single sorted array
– Fast lookup using binary search
– But sorting large files on disk is very slow
– And adding one document means starting over
• Tree structures allow easy insertion
– But the worst case lookup time is linear
• Balanced trees provide the best of both
– Fast lookup and easy insertion
– But they require 45% more disk space
Starting a B+ Tree Inverted File
[Figure: B+ tree. Terms stored: now, time, good, all. Index keys: aaaaa, now. Text indexed so far: “Now is the time for all good …”]
Adding a New Term
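[Figure: the B+ tree after adding the term “men”. Terms stored: now, time, good, all, men. Index keys: aaaaa, now; aaaaa, men. Text indexed so far: “Now is the time for all good men …”]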
How Big is the Inverted Index?
• Typically smaller than the postings file
– Depends on number of terms, not documents
• Eventually almost all terms will be indexed
– But the postings file will continue to grow
• Postings dominate asymptotic space complexity
– Linear in the number of documents
• Assuming that the documents remain about the same size
Some Facts About Disks
• It takes a long time to get the first byte
– A Pentium can do 1,000,000 operations in 10 ms
• But you can get 1,000 bytes just about as fast
– 40 MB/sec transfer rates are typical
• So it pays to put related stuff in each “block”
– M-ary B+ trees are better than binary B+ trees
• Time complexity is measured in disk blocks read
– Since computing time is negligible by comparison
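A back-of-the-envelope sketch of these points. The 40 MB/sec rate is from the slide; the 10 ms access time is read off the Pentium comparison, and treating 1 MB as 10^6 bytes, the one-million-term vocabulary, and the fan-out of 100 are assumptions for the example.

```python
def read_time_ms(num_bytes, access_ms=10.0, mb_per_sec=40.0):
    """Time to fetch num_bytes: one access delay plus transfer time."""
    transfer_ms = num_bytes / (mb_per_sec * 1_000_000) * 1000
    return access_ms + transfer_ms


def block_reads(num_terms, fanout):
    """Levels in a full tree: smallest h with fanout**h >= num_terms."""
    height, capacity = 0, 1
    while capacity < num_terms:
        capacity *= fanout
        height += 1
    return height


print(read_time_ms(1), read_time_ms(1000))
print(block_reads(1_000_000, 2), block_reads(1_000_000, 100))
```

Reading a thousand bytes costs almost exactly what reading one byte costs, so a tree whose nodes each fill a block with many keys needs far fewer block reads per lookup than a binary tree.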
Indexing and Searching
• Indexing
– Walk the inverted file, splitting if needed
– Insert into the postings file in sorted order
– Hours or days for large collections
• Query processing
– Walk the inverted file
– Read the postings file
– Manipulate postings based on query
– Seconds, even for enormous collections
Agenda
• Questions
• Computational complexity
• Inverted indexes
• Question answering and predictive annotation
• Two minute paper
Question Answering
• “In what year did Edison patent the light bulb?”
– Is the user really looking for a list of documents?
• Better answer: excerpt containing the answer
… after Edison’s patenting of the light bulb in 1879…
• Question words become very important!
– “How much money did Clinton earn in 1999?”
QA and Predictive Annotation
• Predictive annotation (Prager et al.):
– Infer possible question types within documents
– Index them along with the terms
– At search time, map question words to question types and include them in the query
• Requires
– Identification of entities and their categories
– Indexing multiple items at one location
– Analysis of queries to derive question types
• Works pretty well!
Predictive Annotation, cont’d
Document text: “In reality, at the time of Edison’s 1879 patent, the light bulb had been in existence for some five decades ….”
Annotations in the text: TIME, PERSON
Who patented the light bulb? -> patent light bulb PERSON
When was the light bulb patented? -> patent light bulb TIME
What did Thomas Edison patent? -> ???
In what year was the light bulb patented?
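A toy sketch of the idea, under heavy assumptions: the two-entry entity lexicon stands in for a real entity recognizer, the question-word mapping and stopword list are hypothetical, the example sentence is simplified, and no stemming is done. It shows the two ingredients named above: indexing a type label at the same position as the word it annotates, and turning a question word into a type that joins the query.

```python
QUESTION_TYPES = {"who": "PERSON", "when": "TIME"}      # hypothetical mapping
ENTITY_LEXICON = {"edison": "PERSON", "1879": "TIME"}   # stand-in recognizer
STOPWORDS = {"the", "was", "did", "in", "of", "at"}


def index_document(text):
    """One set of indexed items per word position: the word plus any type."""
    positions = []
    for raw in text.split():
        word = raw.strip(".,?!").lower()
        items = {word}
        if word in ENTITY_LEXICON:
            items.add(ENTITY_LEXICON[word])  # several items at one location
        positions.append(items)
    return positions


def question_to_query(question):
    """Split a question into content terms and an expected answer type."""
    words = [w.strip(".,?!").lower() for w in question.split()]
    qtype = next((QUESTION_TYPES[w] for w in words if w in QUESTION_TYPES), None)
    terms = {w for w in words if w not in QUESTION_TYPES and w not in STOPWORDS}
    return terms, qtype


def answer(question, text):
    """Return the word annotated with the question's type, if all terms match."""
    positions = index_document(text)
    terms, qtype = question_to_query(question)
    vocabulary = set().union(*positions)
    if qtype is None or not (terms | {qtype}) <= vocabulary:
        return None
    for items in positions:
        if qtype in items:
            return next(iter(items - {qtype}))
    return None


text = "Edison patented the light bulb in 1879."
print(answer("Who patented the light bulb?", text))
print(answer("When was the light bulb patented?", text))
print(answer("What did Thomas Edison patent?", text))
```

The last question has no entry in the toy mapping, so it yields no answer type, mirroring the “???” case on the slide.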
Summary
• Slow indexing yields fast query processing
• We use extra disk space to save query time
– Index space is in addition to document space
– Time and space complexity must be balanced
• Disk block reads are the critical resource
– Fast disks are more useful than fast computers
One Minute Paper
• Suppose insertions of new documents are more common than getting new queries. For example, imagine filtering news stories as they arrive, based on a small set of queries that describe long-standing topics of interest. What kind of an index should you build?
• What was the muddiest point in today’s lecture?