Matching LBSC 708A/CMSC 828O March 8, 1999 Douglas W. Oard and Dagobert Soergel.
Matching
LBSC 708A/CMSC 828O
March 8, 1999
Douglas W. Oard and Dagobert Soergel
Agenda
• Bag of words representation
• Boolean matching
• Similarity-based matching
• Probabilistic matching
Retrieval System Model
[Diagram: the User engages in Query Formulation, Selection, and Examination; documents (Docs) pass through Indexing to build the Index, which Detection matches against the query before Delivery to the user]
Detection Component Model
[Diagram: an Information Need passes through Query Formulation (guided by Human Judgment) and a Representation Function (Query Processing) to yield a Query Representation; a Document passes through a Representation Function (Document Processing) to yield a Document Representation; a Comparison Function maps the two representations to a Retrieval Status Value that estimates Utility]
“Bag of Words” Representation
• Simple strategy for representing documents
• Count how many times each term occurs
• A “term” is any lexical item that you choose
– A fixed-length sequence of characters (an “n-gram”)
– A word (delimited by “white space” or punctuation)
– Some standard “root form” of each word (e.g., a stem)
– A phrase (e.g., phrases listed in a dictionary)
• Counts can be recorded in any consistent order
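The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; it uses the "word delimited by punctuation" choice of term from the list above.

```python
import re
from collections import Counter

def bag_of_words(text):
    # A "term" here is a lowercase word delimited by whitespace or
    # punctuation (apostrophes kept, so "dog's" stays one token).
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

doc1 = bag_of_words("The quick brown fox jumped over the lazy dog's back.")
```

Because a `Counter` is a dictionary, the counts are recorded in a consistent order regardless of how the terms are listed.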
Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

Indexed Term   Document 1   Document 2
the                2            2
quick              1            0
brown              1            0
fox                1            0
over               1            0
lazy               1            0
dog                1            0
back               1            0
now                0            1
is                 0            1
time               0            1
for                0            1
all                0            1
good               0            1
men                0            1
to                 0            2
come               0            1
jump               1            0
aid                0            1
of                 0            1
their              0            1
party              0            1

Stopword list: the, is, for, to, of
Boolean “Free Text” Retrieval
• Limit the bag of words to “absent” and “present”
– “Boolean” values, represented as 0 and 1
• Represent terms as a “bag of documents”
– Same representation, but rows rather than columns
• Combine the rows using “Boolean operators”
– AND, OR, NOT
• Any document with a 1 remaining is “detected”
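The "bag of documents" view turns Boolean matching into set operations. A minimal sketch, using hypothetical postings chosen to reproduce the dog/fox results in the example that follows:

```python
# Each term maps to the set of documents that contain it (the "row").
# These postings are illustrative, matching the slide's query results.
index = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
}

# Boolean operators become set operations on the rows:
dog_and_fox = index["dog"] & index["fox"]   # AND -> intersection
fox_not_dog = index["fox"] - index["dog"]   # NOT -> difference
dog_or_fox  = index["dog"] | index["fox"]   # OR  -> union
```

Any document left in the resulting set is "detected".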
Boolean Free Text Example
[Table: 0/1 presence of the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party in Docs 1–8]

• dog AND fox – Doc 3, Doc 5
• dog NOT fox – Empty
• fox NOT dog – Doc 7
• dog OR fox – Doc 3, Doc 5, Doc 7
• good AND party – Doc 6, Doc 8
• good AND party NOT over – Doc 6
Proximity Operator Example
• time AND come – Doc 2
• time (NEAR 2) come – Empty
• quick (NEAR 2) fox – Doc 1
• quick WITH fox – Empty

Term    Doc 1    Doc 2
quick   1 (2)    0
brown   1 (3)    0
fox     1 (4)    0
over    1 (6)    0
lazy    1 (8)    0
dog     1 (9)    0
back    1 (10)   0
now     0        1 (1)
time    0        1 (4)
all     0        1 (6)
good    0        1 (7)
men     0        1 (8)
come    0        1 (10)
jump    1 (5)    0
aid     0        1 (13)
their   0        1 (15)
party   0        1 (16)

(Parenthesized numbers give each term’s word position in the document.)
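A NEAR operator needs word positions, not just presence. A minimal sketch (not from the slides) over a positional index holding the quick/fox and time/come entries from this example:

```python
def near(term1, term2, k, positions):
    """Documents where term1 and term2 occur within k word positions."""
    hits = set()
    docs1, docs2 = positions.get(term1, {}), positions.get(term2, {})
    for doc in docs1.keys() & docs2.keys():
        # Any pair of occurrences within k positions counts as a match.
        if any(abs(p1 - p2) <= k for p1 in docs1[doc] for p2 in docs2[doc]):
            hits.add(doc)
    return hits

# Positional postings: term -> {doc: [word positions]}
positions = {
    "quick": {1: [2]},
    "fox":   {1: [4]},
    "time":  {2: [4]},
    "come":  {2: [10]},
}
```

With these positions, quick (NEAR 2) fox matches Doc 1 (positions 2 and 4), while time (NEAR 2) come fails because the terms are six positions apart.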
Controlled Vocabulary Example
• Canine AND Fox – Doc 1
• Canine AND Political action – Empty
• Canine OR Political action – Doc 1, Doc 2

Document 1 [Canine] [Fox]: The quick brown fox jumped over the lazy dog’s back.
Document 2 [Political action] [Volunteerism]: Now is the time for all good men to come to the aid of their party.

Descriptor         Doc 1   Doc 2
Canine               1       0
Fox                  1       0
Political action     0       1
Volunteerism         0       1
Ranked Retrieval Paradigm
• Exact match retrieval often gives useless sets
– No documents at all, or way too many documents
• Query reformulation is one “solution”
– Manually add or delete query terms
• “Best-first” ranking can be superior
– Select every document within reason
– Put them in order, with the “best” ones first
– Display them one screen at a time
Advantages of Ranked Retrieval
• Closer to the way people think
– Some documents are better than others
• Enriches browsing behavior
– Decide how far down the list to go as you read it
• Allows more flexible queries
– Long and short queries can produce useful results
Ranked Retrieval Challenges
• “Best first” is easy to say but hard to do!
– Probabilistic retrieval tries to approximate it
• How can the user understand the ranking?
– It is hard to use a tool that you don’t understand
• Efficiency may become a concern
– More complex computations take more time
Document Similarity
• How similar are two documents?
– In particular, how similar is their bag of words?
1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term           1   2   3
nuclear        1   0   0
fallout        1   0   0
siberia        0   0   0
contaminated   1   0   0
interesting    0   1   0
complicated    0   0   1
information    0   1   1
retrieval      0   1   1
Similarity-Based Queries
• Treat the query as if it were a document
– Create a query bag-of-words
• Find the similarity of each document
– Using the coordination measure, for example
• Rank order the documents by similarity
– Most similar to the query first
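These three steps can be sketched directly in Python. This is an illustrative implementation, not from the slides, using the three example documents and the coordination measure (count of shared terms):

```python
def coordination(query, document):
    """Coordination measure: number of distinct query terms the document contains."""
    return len(set(query.split()) & set(document.split()))

docs = {
    1: "nuclear fallout contaminated montana",
    2: "information retrieval is interesting",
    3: "information retrieval is complicated",
}

def rank(query):
    # Most similar to the query first; Python's stable sort keeps
    # document-number order among ties.
    return sorted(docs, key=lambda d: coordination(query, docs[d]), reverse=True)
```

For the query "complicated retrieval", document 3 shares two terms and document 2 shares one, so the ranking begins 3, 2.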
A Simple Ranking Strategy

information AND retrieval
– Readings in Information Retrieval
– Information Storage and Retrieval
– Speech-Based Information Retrieval for Digital Libraries
– Word Sense Disambiguation and Information Retrieval

information NOT retrieval
– The State of the Art in Information Filtering

retrieval NOT information
– Inference Networks for Document Retrieval
– Content-Based Image Retrieval Systems
– Video Parsing, Retrieval and Browsing
– An Approach to Conceptual Text Retrieval Using the EuroWordNet …
– Cross-Language Retrieval: English/Russian/French
The Coordination Measure
Query: complicated retrieval → Result: 3, 2
Query: information retrieval → Result: 2, 3
Query: interesting nuclear fallout → Result: 1, 2

Term           1   2   3
nuclear        1   0   0
fallout        1   0   0
siberia        0   0   0
contaminated   1   0   0
interesting    0   1   0
complicated    0   0   1
information    0   1   1
retrieval      0   1   1
The Vector Space Model
• Model similarity as inner product
– Line up query and doc vectors
– Multiply weights for each term
– Add up the results
• With binary term weights, same results as coordination match
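The inner product is a one-liner; a minimal sketch (not from the slides) showing that with binary weights it reduces to the coordination match:

```python
def inner_product(query_vec, doc_vec):
    # Line up the vectors, multiply the weights term by term, add the results.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Binary weights over the terms (complicated, information, retrieval):
query = [1, 0, 1]   # "complicated retrieval"
doc3 = [1, 1, 1]    # "information retrieval is complicated"
score = inner_product(query, doc3)
```

With real-valued weights (e.g., TF*IDF, introduced next) the same function computes a graded similarity.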
Term Frequency
• Terms tell us about documents
– If “rabbit” appears a lot, it may be about rabbits
• Documents tell us about terms
– “the” is in every document, so it is not discriminating
• Documents are most likely described well by rare terms that occur in them frequently
– Higher “term frequency” is stronger evidence
– Low “collection frequency” makes it stronger still
Incorporating Term Frequency
• High term frequency is evidence of meaning
– And high IDF is evidence of term importance
• Recompute the bag-of-words
– Compute TF * IDF for every element
Let N be the total number of documents
Let n_i be the number of the documents that contain term i
Let tf_{i,j} be the number of times term i appears in document j

Then  w_{i,j} = tf_{i,j} × log(N / n_i)
TF*IDF Example
[Table: term frequencies tf_{i,j} for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across Docs 1–4; the corresponding idf_i values (0.301, 0.125, 0.125, 0.125, 0.602, 0.301, 0.000, 0.602); and the resulting weights w_{i,j} = tf_{i,j} × idf_i — for example, tf = 5 with idf = 0.301 gives w = 1.51]
The Document Length Effect
• Humans look for documents with useful parts
– But term weights are computed for the whole
• Document lengths vary in many collections
– So term weight calculations could be inconsistent
• Two strategies
– Adjust term weights for document length
– Divide the documents into equal “passages”
Document Length Normalization
• Long documents have an unfair advantage
– They use a lot of terms
• So they get more matches than short documents
– And they use the same words repeatedly
• So they have much higher term frequencies
• Normalization seeks to remove these effects
– Related somehow to maximum term frequency
– But also sensitive to the number of terms
“Cosine” Normalization
• Compute the length of each document vector
– Multiply each weight by itself
– Add all the resulting values
– Take the square root of that sum
• Divide each weight by that length
Let w_{i,j} be the unnormalized weight of term i in document j
Let w′_{i,j} be the normalized weight of term i in document j

Then  w′_{i,j} = w_{i,j} / sqrt( Σ_i (w_{i,j})² )
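The normalization steps above translate directly into code. A minimal sketch (illustrative, not from the slides), applied to the Doc 4 weights 0.60, 0.38, 0.50 from the worked example:

```python
import math

def cosine_normalize(weights):
    # Length: multiply each weight by itself, add, take the square root.
    length = math.sqrt(sum(w * w for w in weights))
    # Then divide each weight by that length.
    return [w / length for w in weights]

normalized = cosine_normalize([0.60, 0.38, 0.50])
```

After normalization every document vector has length 1, so inner products between normalized vectors depend only on the pattern of word use, not on document length.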
Cosine Normalization Example
[Table: the unnormalized weights w_{i,j} from the TF*IDF example; the document vector lengths 1.70, 0.97, 2.67, and 0.87 for Docs 1–4; and the normalized weights w_{i,j} / length — for example, 0.90 / 1.70 = 0.53]
Why Call It “Cosine”?
[Figure: document vectors d1 and d2 drawn from the origin in term-weight space, with coordinates (w_{1,1}, w_{1,2}) and (w_{2,1}, w_{2,2}) and an angle θ between them]

Let document 1 have unit length with coordinates w_{1,1} and w_{1,2}
Let document 2 have unit length with coordinates w_{2,1} and w_{2,2}

Then  cos(θ) = (w_{1,1} × w_{2,1}) + (w_{1,2} × w_{2,2})
Interpreting the Cosine Measure
• Think of a document as a vector from zero
• Similarity is the angle between two vectors
– Small angle = very similar
– Large angle = little similarity
• Passes some key sanity checks
– Depends on pattern of word use but not on length
– Every document is most similar to itself
The Okapi BM25 Formula
• Discovered mostly through trial and error
– Requires a large IR test collection
Let N be the number of documents in the collection
Let n_t be the number of documents containing term t
Let tf_{t,d} be the frequency of term t in document d
Let w_{t,d} be the contribution of term t to the relevance of document d

Then  w_{t,d} = 0.4 + 0.6 × [ tf_{t,d} / (tf_{t,d} + 0.5 + 1.5 × length(d)/avglen) ] × [ log(N/n_t) / log N ]
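The Okapi weight can be sketched as a small function. This is a hedged reconstruction: the constants 0.4, 0.6, 0.5, and 1.5 appear on the slide, but the grouping of terms is inferred from the surrounding notation, and the sample parameter values are made up for illustration:

```python
import math

def okapi_weight(tf, doc_len, avg_len, n_t, N):
    # Term frequency, damped by document length relative to the average:
    length_adjusted_tf = tf / (tf + 0.5 + 1.5 * (doc_len / avg_len))
    # IDF, scaled into [0, 1] by dividing by log N:
    scaled_idf = math.log(N / n_t) / math.log(N)
    return 0.4 + 0.6 * length_adjusted_tf * scaled_idf

# Hypothetical collection of 1000 documents of average length:
w_rare = okapi_weight(5, 100, 100, 2, 1000)      # rare term
w_common = okapi_weight(5, 100, 100, 900, 1000)  # near-ubiquitous term
```

The tf component saturates as tf grows, so repeating a term many times yields diminishing returns, unlike the raw TF*IDF weight above.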
Summary So Far
• Find documents most similar to the query
• Optionally, obtain query term weights
– Given by the user, or computed from IDF
• Compute document term weights
– Some combination of TF and IDF
• Normalize the document vectors
– Cosine is one way to do this
• Compute inner product of query and doc vectors
– Multiply corresponding elements and then add
Probability Ranking Principle
• Claim:
– Documents should be ranked in order of decreasing probability of relevance to the query
• Binary relevance & independence assumptions
– Each document is either relevant or it is not
– Relevance of one doc reveals nothing about another
Probabilistic Retrieval Strategy
• Estimate how terms contribute to relevance
– Use term frequency as evidence
• Make “binary independence” assumptions
– Listed on the next slide
• Compute document relevance probability
– Combine evidence from all terms
• Order documents by decreasing probability
Inference Networks
• A flexible way of combining term weights
– Boolean model
– Binary independence model
– Probabilistic models with weaker assumptions
• Efficient large-scale implementation
– InQuery text retrieval system from U Mass
Binary Independence Assumptions
• No prior knowledge about any document
• Each document is either relevant or it is not
• Relevance of one doc tells nothing about others
• Presence of one term tells nothing about others
A Binary Independence Network
[Figure: binary independence network — a query node connected to the term nodes bat, cat, fat, hat, mat, pat, rat, sat, and vat, which are in turn connected to the document nodes d1, d2, d3, and d4]
Probability Computation
• Turn on exactly one document at a time
– Boolean: Every connected term turns on
– Binary Ind: Connected terms gain their weight
• Compute the query value
– Boolean: AND and OR nodes use truth tables
– Binary Ind: Fraction of the possible weight
Binary Ind query value:  ( Σ_{t ∈ d} w_{t,d} ) / ( Σ_t w_{t,d} )
A Boolean Inference Net
[Figure: Boolean inference net — the document nodes d1, d2, d3, and d4 connect to the term nodes bat, cat, fat, hat, mat, pat, rat, sat, and vat, which feed OR and AND operator nodes that compute the query]
Critique of Probabilistic Retrieval
• Most of the assumptions are not satisfied!
– Searchers want utility, not relevance
– Relevance is not binary
– Terms are clearly not independent
– Documents are often not independent
• The best known term weights are quite ad hoc
– Unless some relevant documents are known
A Response
• Ranked retrieval paradigm is powerful
– Well suited to human search strategies
• Probability theory has explanatory power
– At least we know where the weak spots are
• Inference networks are extremely flexible
– Easily accommodate newly developed models
• Implemented by InQuery
– Effective, efficient, and large-scale
Summary
• Three fundamental models
– Boolean, probabilistic, vector space
• Common characteristics
– Bag of words representation
– Choice of indexing terms
• Probabilistic & vector space share commonalities
– Ranked retrieval
– Term weights
– Term independence assumption