Matching LBSC 708A/CMSC 828O March 8, 1999 Douglas W. Oard and Dagobert Soergel.
Matching
LBSC 708A/CMSC 828O
March 8, 1999
Douglas W. Oard and Dagobert Soergel
Agenda
• Bag of words representation
• Boolean matching
• Similarity-based matching
• Probabilistic matching
Retrieval System Model
[Diagram: the User engages in Query Formulation, Selection, and Examination; documents (Docs) pass through Indexing to build the Index, which Detection matches against the query before Delivery to the user]
Detection Component Model
[Diagram: an Information Need passes through Query Formulation (guided by Human Judgment) and a Representation Function (Query Processing) to yield a Query Representation; a Document passes through a Representation Function (Document Processing) to yield a Document Representation; a Comparison Function maps the two representations to a Retrieval Status Value that estimates Utility]
“Bag of Words” Representation
• Simple strategy for representing documents
• Count how many times each term occurs
• A “term” is any lexical item that you choose
– A fixed-length sequence of characters (an “n-gram”)
– A word (delimited by “white space” or punctuation)
– Some standard “root form” of each word (e.g., a stem)
– A phrase (e.g., phrases listed in a dictionary)
• Counts can be recorded in any consistent order
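The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; it uses the "word delimited by punctuation" choice of term from the list above.

```python
import re
from collections import Counter

def bag_of_words(text):
    # A "term" here is a lowercase word delimited by whitespace or
    # punctuation (apostrophes kept, so "dog's" stays one token).
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

doc1 = bag_of_words("The quick brown fox jumped over the lazy dog's back.")
```

Because a `Counter` is a dictionary, the counts are recorded in a consistent order regardless of how the terms are listed.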
Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

Indexed Term   Document 1   Document 2
the                2            2
quick              1            0
brown              1            0
fox                1            0
over               1            0
lazy               1            0
dog                1            0
back               1            0
now                0            1
is                 0            1
time               0            1
for                0            1
all                0            1
good               0            1
men                0            1
to                 0            2
come               0            1
jump               1            0
aid                0            1
of                 0            1
their              0            1
party              0            1

Stopword list: the, is, for, to, of
Boolean “Free Text” Retrieval
• Limit the bag of words to “absent” and “present”
– “Boolean” values, represented as 0 and 1
• Represent terms as a “bag of documents”
– Same representation, but rows rather than columns
• Combine the rows using “Boolean operators”
– AND, OR, NOT
• Any document with a 1 remaining is “detected”
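The "bag of documents" view turns Boolean matching into set operations. A minimal sketch, using hypothetical postings chosen to reproduce the dog/fox results in the example that follows:

```python
# Each term maps to the set of documents that contain it (the "row").
# These postings are illustrative, matching the slide's query results.
index = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
}

# Boolean operators become set operations on the rows:
dog_and_fox = index["dog"] & index["fox"]   # AND -> intersection
fox_not_dog = index["fox"] - index["dog"]   # NOT -> difference
dog_or_fox  = index["dog"] | index["fox"]   # OR  -> union
```

Any document left in the resulting set is "detected".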
Boolean Free Text Example
[Table: 0/1 presence of the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party in Docs 1–8]

• dog AND fox – Doc 3, Doc 5
• dog NOT fox – Empty
• fox NOT dog – Doc 7
• dog OR fox – Doc 3, Doc 5, Doc 7
• good AND party – Doc 6, Doc 8
• good AND party NOT over – Doc 6
Proximity Operator Example
• time AND come – Doc 2
• time (NEAR 2) come – Empty
• quick (NEAR 2) fox – Doc 1
• quick WITH fox – Empty

Term    Doc 1    Doc 2
quick   1 (2)    0
brown   1 (3)    0
fox     1 (4)    0
over    1 (6)    0
lazy    1 (8)    0
dog     1 (9)    0
back    1 (10)   0
now     0        1 (1)
time    0        1 (4)
all     0        1 (6)
good    0        1 (7)
men     0        1 (8)
come    0        1 (10)
jump    1 (5)    0
aid     0        1 (13)
their   0        1 (15)
party   0        1 (16)

(Parenthesized numbers give each term’s word position in the document.)
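A NEAR operator needs word positions, not just presence. A minimal sketch (not from the slides) over a positional index holding the quick/fox and time/come entries from this example:

```python
def near(term1, term2, k, positions):
    """Documents where term1 and term2 occur within k word positions."""
    hits = set()
    docs1, docs2 = positions.get(term1, {}), positions.get(term2, {})
    for doc in docs1.keys() & docs2.keys():
        # Any pair of occurrences within k positions counts as a match.
        if any(abs(p1 - p2) <= k for p1 in docs1[doc] for p2 in docs2[doc]):
            hits.add(doc)
    return hits

# Positional postings: term -> {doc: [word positions]}
positions = {
    "quick": {1: [2]},
    "fox":   {1: [4]},
    "time":  {2: [4]},
    "come":  {2: [10]},
}
```

With these positions, quick (NEAR 2) fox matches Doc 1 (positions 2 and 4), while time (NEAR 2) come fails because the terms are six positions apart.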
Controlled Vocabulary Example
• Canine AND Fox – Doc 1
• Canine AND Political action – Empty
• Canine OR Political action – Doc 1, Doc 2

Document 1 [Canine] [Fox]: The quick brown fox jumped over the lazy dog’s back.
Document 2 [Political action] [Volunteerism]: Now is the time for all good men to come to the aid of their party.

Descriptor         Doc 1   Doc 2
Canine               1       0
Fox                  1       0
Political action     0       1
Volunteerism         0       1
Ranked Retrieval Paradigm
• Exact match retrieval often gives useless sets
– No documents at all, or way too many documents
• Query reformulation is one “solution”
– Manually add or delete query terms
• “Best-first” ranking can be superior
– Select every document within reason
– Put them in order, with the “best” ones first
– Display them one screen at a time
Advantages of Ranked Retrieval
• Closer to the way people think
– Some documents are better than others
• Enriches browsing behavior
– Decide how far down the list to go as you read it
• Allows more flexible queries
– Long and short queries can produce useful results
Ranked Retrieval Challenges
• “Best first” is easy to say but hard to do!
– Probabilistic retrieval tries to approximate it
• How can the user understand the ranking?
– It is hard to use a tool that you don’t understand
• Efficiency may become a concern
– More complex computations take more time
Document Similarity
• How similar are two documents?
– In particular, how similar is their bag of words?
1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term           1   2   3
nuclear        1   0   0
fallout        1   0   0
siberia        0   0   0
contaminated   1   0   0
interesting    0   1   0
complicated    0   0   1
information    0   1   1
retrieval      0   1   1
Similarity-Based Queries
• Treat the query as if it were a document
– Create a query bag-of-words
• Find the similarity of each document
– Using the coordination measure, for example
• Rank order the documents by similarity
– Most similar to the query first
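These three steps can be sketched directly in Python. This is an illustrative implementation, not from the slides, using the three example documents and the coordination measure (count of shared terms):

```python
def coordination(query, document):
    """Coordination measure: number of distinct query terms the document contains."""
    return len(set(query.split()) & set(document.split()))

docs = {
    1: "nuclear fallout contaminated montana",
    2: "information retrieval is interesting",
    3: "information retrieval is complicated",
}

def rank(query):
    # Most similar to the query first; Python's stable sort keeps
    # document-number order among ties.
    return sorted(docs, key=lambda d: coordination(query, docs[d]), reverse=True)
```

For the query "complicated retrieval", document 3 shares two terms and document 2 shares one, so the ranking begins 3, 2.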
A Simple Ranking Strategy

information AND retrieval
– Readings in Information Retrieval
– Information Storage and Retrieval
– Speech-Based Information Retrieval for Digital Libraries
– Word Sense Disambiguation and Information Retrieval

information NOT retrieval
– The State of the Art in Information Filtering

retrieval NOT information
– Inference Networks for Document Retrieval
– Content-Based Image Retrieval Systems
– Video Parsing, Retrieval and Browsing
– An Approach to Conceptual Text Retrieval Using the EuroWordNet …
– Cross-Language Retrieval: English/Russian/French
The Coordination Measure
Query: complicated retrieval → Result: 3, 2
Query: information retrieval → Result: 2, 3
Query: interesting nuclear fallout → Result: 1, 2

Term           1   2   3
nuclear        1   0   0
fallout        1   0   0
siberia        0   0   0
contaminated   1   0   0
interesting    0   1   0
complicated    0   0   1
information    0   1   1
retrieval      0   1   1
The Vector Space Model
• Model similarity as inner product
– Line up query and doc vectors
– Multiply weights for each term
– Add up the results
• With binary term weights, same results as coordination match
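The inner product is a one-liner; a minimal sketch (not from the slides) showing that with binary weights it reduces to the coordination match:

```python
def inner_product(query_vec, doc_vec):
    # Line up the vectors, multiply the weights term by term, add the results.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Binary weights over the terms (complicated, information, retrieval):
query = [1, 0, 1]   # "complicated retrieval"
doc3 = [1, 1, 1]    # "information retrieval is complicated"
score = inner_product(query, doc3)
```

With real-valued weights (e.g., TF*IDF, introduced next) the same function computes a graded similarity.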
Term Frequency
• Terms tell us about documents
– If “rabbit” appears a lot, it may be about rabbits
• Documents tell us about terms
– “the” is in every document, so it is not discriminating
• Documents are most likely described well by rare terms that occur in them frequently
– Higher “term frequency” is stronger evidence
– Low “collection frequency” makes it stronger still
Incorporating Term Frequency
• High term frequency is evidence of meaning
– And high IDF is evidence of term importance
• Recompute the bag-of-words
– Compute TF * IDF for every element
Let N be the total number of documents
Let n_i be the number of the documents that contain term i
Let tf_{i,j} be the number of times term i appears in document j

Then  w_{i,j} = tf_{i,j} × log(N / n_i)
TF*IDF Example
[Table: term frequencies tf_{i,j} for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across Docs 1–4; the corresponding idf_i values (0.301, 0.125, 0.125, 0.125, 0.602, 0.301, 0.000, 0.602); and the resulting weights w_{i,j} = tf_{i,j} × idf_i — for example, tf = 5 with idf = 0.301 gives w = 1.51]
The Document Length Effect
• Humans look for documents with useful parts
– But term weights are computed for the whole
• Document lengths vary in many collections
– So term weight calculations could be inconsistent
• Two strategies
– Adjust term weights for document length
– Divide the documents into equal “passages”
Document Length Normalization
• Long documents have an unfair advantage
– They use a lot of terms
• So they get more matches than short documents
– And they use the same words repeatedly
• So they have much higher term frequencies
• Normalization seeks to remove these effects
– Related somehow to maximum term frequency
– But also sensitive to the number of terms
“Cosine” Normalization
• Compute the length of each document vector
– Multiply each weight by itself
– Add all the resulting values
– Take the square root of that sum
• Divide each weight by that length
Let w_{i,j} be the unnormalized weight of term i in document j
Let w′_{i,j} be the normalized weight of term i in document j

Then  w′_{i,j} = w_{i,j} / sqrt( Σ_i (w_{i,j})² )
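The normalization steps above translate directly into code. A minimal sketch (illustrative, not from the slides), applied to the Doc 4 weights 0.60, 0.38, 0.50 from the worked example:

```python
import math

def cosine_normalize(weights):
    # Length: multiply each weight by itself, add, take the square root.
    length = math.sqrt(sum(w * w for w in weights))
    # Then divide each weight by that length.
    return [w / length for w in weights]

normalized = cosine_normalize([0.60, 0.38, 0.50])
```

After normalization every document vector has length 1, so inner products between normalized vectors depend only on the pattern of word use, not on document length.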
Cosine Normalization Example
[Table: the unnormalized weights w_{i,j} from the TF*IDF example; the document vector lengths 1.70, 0.97, 2.67, and 0.87 for Docs 1–4; and the normalized weights w_{i,j} / length — for example, 0.90 / 1.70 = 0.53]
Why Call It “Cosine”?
[Figure: document vectors d1 and d2 drawn from the origin in term-weight space, with coordinates (w_{1,1}, w_{1,2}) and (w_{2,1}, w_{2,2}) and an angle θ between them]

Let document 1 have unit length with coordinates w_{1,1} and w_{1,2}
Let document 2 have unit length with coordinates w_{2,1} and w_{2,2}

Then  cos(θ) = (w_{1,1} × w_{2,1}) + (w_{1,2} × w_{2,2})
Interpreting the Cosine Measure
• Think of a document as a vector from zero
• Similarity is the angle between two vectors
– Small angle = very similar
– Large angle = little similarity
• Passes some key sanity checks
– Depends on pattern of word use but not on length
– Every document is most similar to itself
The Okapi BM25 Formula
• Discovered mostly through trial and error
– Requires a large IR test collection
Let N be the number of documents in the collection
Let n_t be the number of documents containing term t
Let tf_{t,d} be the frequency of term t in document d
Let w_{t,d} be the contribution of term t to the relevance of document d

Then  w_{t,d} = 0.4 + 0.6 × [ tf_{t,d} / (tf_{t,d} + 0.5 + 1.5 × length(d)/avglen) ] × [ log(N/n_t) / log N ]
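The Okapi weight can be sketched as a small function. This is a hedged reconstruction: the constants 0.4, 0.6, 0.5, and 1.5 appear on the slide, but the grouping of terms is inferred from the surrounding notation, and the sample parameter values are made up for illustration:

```python
import math

def okapi_weight(tf, doc_len, avg_len, n_t, N):
    # Term frequency, damped by document length relative to the average:
    length_adjusted_tf = tf / (tf + 0.5 + 1.5 * (doc_len / avg_len))
    # IDF, scaled into [0, 1] by dividing by log N:
    scaled_idf = math.log(N / n_t) / math.log(N)
    return 0.4 + 0.6 * length_adjusted_tf * scaled_idf

# Hypothetical collection of 1000 documents of average length:
w_rare = okapi_weight(5, 100, 100, 2, 1000)      # rare term
w_common = okapi_weight(5, 100, 100, 900, 1000)  # near-ubiquitous term
```

The tf component saturates as tf grows, so repeating a term many times yields diminishing returns, unlike the raw TF*IDF weight above.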
Summary So Far
• Find documents most similar to the query
• Optionally, obtain query term weights
– Given by the user, or computed from IDF
• Compute document term weights
– Some combination of TF and IDF
• Normalize the document vectors
– Cosine is one way to do this
• Compute inner product of query and doc vectors
– Multiply corresponding elements and then add
Probability Ranking Principle
• Claim:
– Documents should be ranked in order of decreasing probability of relevance to the query
• Binary relevance & independence assumptions
– Each document is either relevant or it is not
– Relevance of one doc reveals nothing about another
Probabilistic Retrieval Strategy
• Estimate how terms contribute to relevance
– Use term frequency as evidence
• Make “binary independence” assumptions
– Listed on the next slide
• Compute document relevance probability
– Combine evidence from all terms
• Order documents by decreasing probability
Inference Networks
• A flexible way of combining term weights
– Boolean model
– Binary independence model
– Probabilistic models with weaker assumptions
• Efficient large-scale implementation
– InQuery text retrieval system from U Mass
Binary Independence Assumptions
• No prior knowledge about any document
• Each document is either relevant or it is not
• Relevance of one doc tells nothing about others
• Presence of one term tells nothing about others
A Binary Independence Network
[Figure: binary independence network — a query node connected to the term nodes bat, cat, fat, hat, mat, pat, rat, sat, and vat, which are in turn connected to the document nodes d1, d2, d3, and d4]
Probability Computation
• Turn on exactly one document at a time
– Boolean: Every connected term turns on
– Binary Ind: Connected terms gain their weight
• Compute the query value
– Boolean: AND and OR nodes use truth tables
– Binary Ind: Fraction of the possible weight
Binary Ind query value:  ( Σ_{t ∈ d} w_{t,d} ) / ( Σ_t w_{t,d} )
A Boolean Inference Net
[Figure: Boolean inference net — the document nodes d1, d2, d3, and d4 connect to the term nodes bat, cat, fat, hat, mat, pat, rat, sat, and vat, which feed OR and AND operator nodes that compute the query]
Critique of Probabilistic Retrieval
• Most of the assumptions are not satisfied!
– Searchers want utility, not relevance
– Relevance is not binary
– Terms are clearly not independent
– Documents are often not independent
• The best known term weights are quite ad hoc
– Unless some relevant documents are known
A Response
• Ranked retrieval paradigm is powerful
– Well suited to human search strategies
• Probability theory has explanatory power
– At least we know where the weak spots are
• Inference networks are extremely flexible
– Easily accommodate newly developed models
• Implemented by InQuery
– Effective, efficient, and large-scale
Summary
• Three fundamental models
– Boolean, probabilistic, vector space
• Common characteristics
– Bag of words representation
– Choice of indexing terms
• Probabilistic & vector space share commonalities
– Ranked retrieval
– Term weights
– Term independence assumption