Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · [email protected] Formal...

49
[email protected] Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science [email protected] http://www.cs.odu.edu/˜jbollen January 30, 2003 Page 1

Transcript of Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · [email protected] Formal...

Page 1: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Lecture 2:

IR Models.

Johan BollenOld Dominion University

Department of Computer Science

[email protected]

http://www.cs.odu.edu/˜jbollen

January 30, 2003 Page 1

Page 2: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Structure

1. IR formal characterization

(a) Mathematical framework

(b) Document and keyterm

representations

2. Models

(a) Boolean Model

(b) Vector Space Model

(c) Probabilistic Model

January 30, 2003 Page 2

Page 3: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Formal Characterization of IR systemsAn information retrieval model is defined as the quadruple:

[D,Q ,F ,R (qi ,d j )]where

1. D is a set of document representations (logical views)

2. Q is a set of representations constituting the user’s information needs or query

3. F represents the framework for modeling document representations, queries and their

relationships

4. R (qi ,d j )] is a ranking function which associates a number∈ R with a queryqi ∈ Qand a document representationd j ∈D.

January 30, 2003 Page 3

Page 4: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Formal Characterization of IR systems

January 30, 2003 Page 4

Page 5: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Classic IR concepts

1. Concepts:

(a) Documents represented by in-

dex terms

(b) Index term weights: speci-

ficity for specific document

(c) Same may apply for query

2. Relevant to boolean, vector and

probabilistic models

ki ∈K = {k1, · · · ,kt}: generic index term,

d j : document

wi, j > 0: weight associated with(ki ,d j )

if ki not ind j , wi, j = 0

d j is thus associated with index term vector

~d j = (w1, j ,w2, j , · · · ,wt, j )

functiongi return weight of indexki in ~d j ,

such thatgi(~d j ) = wi, j

January 30, 2003 Page 5

Page 6: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

ExampleEdgar Allen Poe’s The Conqueror Worm, 1843

Last half of last paragraph:

1) And theangels, all pallid andwan,

2) Uprising, unveiling, affirm3) That theplay is thetragedy, ”Man,”

4) And itshero theConqueror Worm .

Hand selected keyterms:

1) affirm 7) play

2) angel 8) tragedy

3) conqueror 9) unveil

4) hero 10) uprise

5) man 11) wan

6) pallid 12) worm

g5(~d3) = 1,g3(~d2) = 0

January 30, 2003 Page 6

Page 7: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Matrix representationw1,1 w2,1 · · · w1,t

w2,1 w2,2 · · · w2,t

· · ·wn,1 wn,1 · · · wn,t

0 1 0 0 0 1 0 0 0 0 1 0

1 0 0 0 0 0 0 0 1 1 0 0

0 0 0 0 1 0 1 1 0 0 0 0

0 0 1 1 0 0 0 0 0 0 0 1

d3 = (0,0,0,0,1,0,1,1,0,0,0,0), ~k4 = (0,0,0,1)

1. Keyterm-document, or in this

case document-keyterm, ma-

trix

2. Represents weight values be-

tween keyterms and docu-

ments

3. Usually sparse, and very large

4. Assumption of independence

or orthogonality: keyterms

are not related!

5. In this specific case: weight

values are binary

January 30, 2003 Page 7

Page 8: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

What are IR models

1. Abstract models of IR techniques:

(a) characteristics

(b) Assumptions

(c) General modus operandi

(d) Properties

2. Do not comprise full blown IR systems

(a) Technical details will be discussedlater

(b) Focus is assumptions and

characteristics

3. Present discussion is not end all be all

of IR

(a) New models can be introduced

(b) Devil is often in details

(c) Large collections require specific

engineering efforts

January 30, 2003 Page 8

Page 9: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Boolean Model

1. Basic principles:

(a) User query is shaped as Booleanexpression

(b) Documents are predicted to beeither relevant or not

2. Advantages:

(a) Neat formalism

(b) Easy and efficient implementation

(c) Widespread adoptation has

accustomed users to Boolean

queries

3. Disadvantages:

(a) No document ranking

(b) More like a data retrieval than

information retrieval model

(c) In spite of user preferences, not

straightforward to formulate

adequate queries

January 30, 2003 Page 9

Page 10: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Boolean Model: Formalizationwi, j ∈ {0,1}

Queryq consists of keyterms connected byNOT, AND andOR.

For example,q = [ka∧ (kb∨¬kc)]or: q = ka&& (kb ‖!kc)

or: “brothers” AND ( “chemical” OR NOT “doobie” )

For every Boolean statement there exists a Disjunctive Normal Form (DNF):

q = [ka∧ (kb∨¬kc)] = (1,0,0)∨ (1,1,0)∨ (1,1,1)

ka kb kc [ka∧ (kb∨¬kc)]

0 0 0 0

0 0 1 0

0 1 0 0

0 1 1 0

1 0 0 1 ka∧¬kb∧¬kc

1 0 1 0

1 1 0 1 ka∧kb∧¬kc

1 1 1 1 ka∧kb∧kc

January 30, 2003 Page 10

Page 11: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

DNF

KaKb

Kc

Ka AND !Kb AND !Kc

Ka AND Kb and !Kc

Ka AND Kb AND Kc

January 30, 2003 Page 11

Page 12: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Boolean Model: Query matchingLet~qdn f be DNF forq

and~qcc be the conjunctive components of~qdn f

then

sim(d j ,q) =

1 ∃~qcc : (~qcc ∈~qdn f)∧ (∀ki ,gi(~d j ) = gi(~qcc))

0 otherwise

∣∣∣∣∣∣In human language, the similarity between a document and a user query is one, when there

exists a conjunctive component such that every keyterm in the query has the same value in

one of its DNF components as it does in the document (1 = present)

January 30, 2003 Page 12

Page 13: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Boolean Model: ExampleQuery: “play” AND (“tragedy” OR NOT “worm”)

translates to DNF:

(“play” AND NOT “tragedy” AND NOT “worm”) = (100)OR (“play” AND “tragedy” AND NOT “worm”) = (110)

OR (“play” AND “tragedy” AND “warm”) = (111)

1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,

7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm0 1 0 0 0 1 0 0 0 0 1 0

1 0 0 0 0 0 0 0 1 1 0 0

0 0 0 0 1 0 1 1 0 0 0 0

0 0 1 1 0 0 0 0 0 0 0 1

d3 = That theplay is thetragedy, ”Man,”

January 30, 2003 Page 13

Page 14: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Boolean Model: Discussion

1. Very popular because of its simplicityand efficiency

2. Does not rank results

3. Strongly dependent on exact use ofkeyterms

4. Not really IR, but rather data retrieval

(a) keyterms are semantically“shallow” handles on files

(b) User queries are intended to match

completely:

i. problems with jargon

ii. synonymy

iii. user has to be aware of exact

keyterms

iv. humans are not very good at

formulating complicated

Boolean queries

January 30, 2003 Page 14

Page 15: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Vector Space Model

1. Keyterm based model

2. Allows partial matching and rankingof documents

3. Basic principles:

(a) Document represented by keytermvector

(b) t dimensional space spanned byset of keyterms

(c) Query represented by keytermvector

(d) Similarity document-keytermcalculated based on vectordistance

4. Requires:

(a) keyterm weighing for document

vectors

(b) keyterm weighing for query

(c) Distance measure for

document-keyterm vectors

5. Features:

(a) Efficient

(b) Successful

(c) Easy to grasp visual representation

(d) Applicable to document-document

matching

January 30, 2003 Page 15

Page 16: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Vector Space ModelDefinition:

Let

wi, j be associated with(ki ,d j ), wi, j ∈ R+

wi,q be associated with(ki ,q), wi, j ∈ R+

Query vector~q = (w1,q,w2,q, · · · ,wt,q)t is total number of index terms in system

Documentd is represented by:~d j = (w1, j ,w2, j , · · · ,wt, j ).

Both documentd j and queryq are thus represented as t-dimensional vectors.

The similarity between document and query is evaluated by the cosine of their angle e.g.:

sim(d j ,q) =~d j ·~q

‖~d j ‖×‖~q‖

= ∑ti=1 wi, j×wi,q√

∑ti=1 w2

i, j×√

∑ti=1 w2

i,q

Based on definition of dot-product:

d j ·q = ‖d j‖‖q‖cos(α)

cosine similarity∈ [0,1] sincewi, j ≥ 0 andwi, j ≥ 0

January 30, 2003 Page 16

Page 17: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Cosine similarity measure

Q

β

α

k 2

k 1

k 3

2d d1

cos(α) = 0: perpendicular vectors cos(α) = 1: vectors co-linear

January 30, 2003 Page 17

Page 18: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Mommy, where do the index term weights comefrom?

1. Goal is to find weight that expresses

how specific a term is to the intended

set of relevant documents

2. Use of frequency

3. Two issues:

(a) term frequency (TF): frequency

within document

(b) document frequency (DF):

frequency within collection

4. Term frequency

(a) Expresses connection between

term and specific document

(b) Term often occurs within

document indicative of its

meaning

5. Document Frequency

(a) Expresses general frequency of

term, e.g. “the”

(b) Characteristic of language

6. Aim is to find keyterms which balance

high TF and relatively low DF

January 30, 2003 Page 18

Page 19: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

DefinitionTerm frequency

Let N be total number of documents in collection

andni number of documents in which index termki appears.

freqi, j is frequency ofki in d j .

Normalized term frequency is then defined as:

fi, j =freqi, j

maxl freql , j

where maxl freql , j is determined over all terms in documentd j .

Inverse document frequency

idf i = log Nni

Index term weight

wi, j = fi, j × log Nni

January 30, 2003 Page 19

Page 20: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Query keyterm weightsBalance DF and query frequency:

wi,q =(

0.5+0.5freqi,q

maxl freql ,q

)× log N

ni

so that freqi,q is frequency of term in query (usually 1)

Document-keyterm matrix:

k1 k2 k3 k4 k5 k6 k7 k8 k9 k10

d1 w1,1 w1,2 · · · w1,10

d2 w2,1 w2,2 · · · w2,10

d3 w3,1 w3,2 · · · w3,10

d4 w4,1 w4,2 · · · w4,10

January 30, 2003 Page 20

Page 21: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

TFIDF:

January 30, 2003 Page 21

Page 22: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

ExampleEdgar Allen Poe’s The Raven (1845)

First paragraph:

1) Once upon amidnight dreary, while I pondered, weak and weary,

2) Over many a quaint and curiousvolumeof forgottenlore,

3) While I nodded, nearly napping, suddenly there came atapping,

4) As of some one gently rapping, rapping at mychamber door.5) “’Tis somevisiter,” I muttered, ”tapping at mychamber door -

6) Only this, andnothing more.”

Hand selected keyterms - document frequency:1) chamber-2 , 2) door-2, 3) lore-1, 4) midnight-1,

5) nothing-1, 6) tap-1, 7) visiter-1, 8) volume-1

document-keyterm matrix:

6×8 matrix whose entrieswi, j ∈ R+

January 30, 2003 Page 22

Page 23: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Example: Document-Keyterm Matrixchamber door lore midnight nothing tap visiter volume

d1 w1,1 w1,2 · · · w1,8

d2 w2,1 w2,2 · · · w2,8

d3 w3,1 w3,2 · · · w3,8

d4 w4,1 w4,2 · · · w4,8

d5 w4,1 w4,2 · · · w5,8

d6 w4,1 w4,2 · · · w6,8

k1 = “chamber”.

Frequency is zero except ford4 andd5.

w1,4 = fi, j × log Nni

or

w1,4 =freqi, j

maxl freql , j× log N

ni

N = 6, n1 = 2, freqi, j =1, maxl freql , j = 1,

w1,4 = 11 × log10

62 = 0.47

January 30, 2003 Page 23

Page 24: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Example: Document-Keyterm MatrixResult:

chamber door lore midnight nothing tap visiter volume

d1 0 0 0 0.78 0 0 0 0

d2 0 0 0.78 0 0 0 0 0.78

d3 0 0 0 0 0 0.78 0 0

d4 0.48 0.48 0 0 0 0 0 0

d5 0.48 0.48 0 0 0 0 0.78 0

d6 0 0 0 0 0.78 0 0 0

Note:

Weight values for chamber and door are lower because less specific to document.

January 30, 2003 Page 24

Page 25: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Example: QueryQuery:

“Visitor at your door”Translation to known keyterms:

{ “visiter”, “door” }

Query weights:

wi,q =(

0.5+0.5freqi,q

maxl freql ,q

)× log N

ni

“visitor” w1,q =(0.5+ 0.5×1

1

)× log 6

1 = 0.195

“door” w1,q =(0.5+ 0.5×1

1

)× log 6

2 = 0.119

Resulting query vector(0,0.119,0,0,0,0,0.195,0)

January 30, 2003 Page 25

Page 26: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

RetrievalQuery - Document similarity:

sim(d j ,q) =~d j ·~q

‖~d j ‖×‖~q‖

Procedure:

1. Calculate Numerator (inner product ofq and document vector) for all documents

(a) Weight matrix - query vector multiplication = vector of inner products

(b) Store resulting vector

2. Calculate Denominator (norm ofq and document vectors)

3. Determine similarities

(a) Normalize vector of inner products by determined vector norms

(b) Rank documents according to values in resulting vector

January 30, 2003 Page 26

Page 27: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

ExampleInner Products:

0 0 0 0.78 0 0 0 0

0 0 0.78 0 0 0 0 0.78

0 0 0 0 0 0.78 0 0

0.48 0.48 0 0 0 0 0 0

0.48 0.48 0 0 0 0 0.78 0

0 0 0 0 0.78 0 0 0

×

0

0.12

0

0

0

0

0.2

0

=

0

0

0

0.06

0.21

0

January 30, 2003 Page 27

Page 28: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

ExampleNorm of query vector:√

(|~q| ·~qt) = 0.233

Norms of document vectors:|~d1|= 0.608, etc.

(0.78,1.103,0.78,0.679,1.034,0.78)

Inner product document-query vectors:

0

0

0

0.06

0.21

0

→ normalize→

0

0

0

0.379

0.871

0

→ rank→

rank doc

1 d5

2 d6

· · ·

January 30, 2003 Page 28

Page 29: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Document to document similaritiesVector space model can be used to calculate similarities between documents in collection.

Instead of query to document vector similarty, document to document.

Procedure:

1. Rather than operating on query to keyterm, document-document vector

2. Use document-keyterm matrix

(a) Document vector in our example: row vectors

(b) Pairwise cosine similarity

3. Procedure is very expensive

(a) Can be optimized for sparse matrices

(b) Efficient in comparison to other methods

In mathematical terms:Let M be document-keyterm matrix.

Matrix of inner productsMi = M×Mt

Norms ofM’s row vectors. etc.

We define sim(di ,d j ) = 1

January 30, 2003 Page 29

Page 30: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Document to document example

Mi = M×Mt =

0.608 0 0 0 0 0

0 1.217 0 0 0 0

0 0 0.608 0 0 0

0 0 0 0.46 0.46 0

0 0 0 0.46 1.069 0

0 0 0 0 0 0.608

Norms of matrix row vectors (denominator for calculation of weight values):

(0.78,1.103,0.78,0.679,1.034,0.78)

Normalize matrix values:

1 0 0 0 0 0

0 1 0 0 0 0

0 0 1 0 0 0

0 0 0 1 0.655 0

0 0 0 0.655 1 0

0 0 0 0 0 1

We find document 4 and 5 to be similar

January 30, 2003 Page 30

Page 31: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Sparse matrices

1. Document-keyterm anddocument-document matrices arenecessarily sparse

(a) Low ratio of non-zero entries overall entries

(b) Expressed as percentage, e.g. 3%density

2. Sparseness is natural

(a) Possible matrix entriesn2

(b) Few “relationships” can besemantically valid

(c) Large, heterogeneous document

collections

(d) Many matrices well below 1%

3. Storage and manipulation techniques

(a) Coordinate system

(b) Compressed Column and

Compressed Row Format

(c) Harwell-Boeing format

(d) Formats only store non-zero

entries

i. dense format:n2

ii. Harwell-Boeing:±n+1+2nz

January 30, 2003 Page 31

Page 32: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Coordinate system

0 2 5 0 0

1 0 0 3 0

0 0 0 0 2

0 1 0 0 0

3 0 0 2 0

Represented by:

row 2 5 1 4 1 2 5 3

col 1 1 2 2 3 4 4 5

values 1 3 2 1 5 3 2 2

January 30, 2003 Page 32

Page 33: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Compressed Column format or Harwell-BoeingFormat

1. Slightly more efficient than coordinate system (about 30% savings)

2. Represent an×mmatrix by three arrays:

(a) Column pointers:m+1 values

(b) Row indexes:nzvalues

(c) Values (non-zero entries)nzvalues

3. Values are retrieved:

(a) Colum pointers lookup

(b) Row index range lookup

(c) Value lookup

January 30, 2003 Page 33

Page 34: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Harwell-Boeing Format

0 2 5 0 0

1 0 0 3 0

0 0 0 0 2

0 1 0 0 0

3 0 0 2 0

Represented by:

colptr 1 3 5 6 8 9

rowindex 2 5 1 4 1 2 5 3

values 1 3 2 1 5 3 2 2

m(1,3) =?

1. Specific:

(a) colptr[3] < rowindex<

colptr[3+1] = 5

(b) rowindex[5] = 1,

(c) corresponding value = 5

2. General

(a) value[i][j] = val[colptr[j] ≤rowindex< colptr[j+1], i]

(b) note: colptr specifies a range of

values in rowindex

(c) colptr: value index of first value to

start a new column

January 30, 2003 Page 34

Page 35: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Perl sparse matrix example# ! / u s r / b in / p e r l

# i n d e x e s s t a r t a t ze ro@colp t r = ( 0 , 2 , 4 , 5 , 7 , 8 ) ;@rowindex = ( 1 , 4 , 0 , 3 , 0 , 1 , 4 , 2 ) ;@values = ( 1 , 3 , 2 , 1 , 5 , 3 , 2 , 2 ) ;

$ i = $ARGV[ 0 ] ;$ j = $ARGV[ 1 ] ;

$ r = $ c o l p t r [ $ j +1]− $ c o l p t r [ $ j ] ;

$va lue = 0 ;f o r ( $n= $ c o l p t r [ $ j ] ; $n<$ c o l p t r [ $ j + 1 ] ; $n ++){

i f ( $rowindex [ $n ] = = $ i ){$va lue = $ v a l u e s [ $n ] ;

}}

p r i n t ” v a l u e : $va lue \n” ;

January 30, 2003 Page 34

Page 36: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Perl sparse matrix - Compressed Row Storage (CRS)# ! / u s r / b in / p e r l

# CRS , i n d e x e s s t a r t a t ze ro

@rowptr = ( 0 , 2 , 4 , 5 , 6 , 8 ) ;@col index = ( 1 , 2 , 0 , 3 , 4 , 1 , 0 , 3 ) ;@values = ( 2 , 5 , 1 , 3 , 2 , 1 , 3 , 2 ) ;

$ i = $ARGV[ 0 ] ;$ j = $ARGV[ 1 ] ;

$va lue = 0 ;f o r ( $n= $rowpt r [ $ i ] ; $n<$rowpt r [ $ i + 1 ] ; $n ++){

i f ( $ c o l i n d e x [ $n ] = = $ j ){$va lue = $ v a l u e s [ $n ] ;

}}

p r i n t ” v a l u e : $va lue \n” ;

January 30, 2003 Page 34

Page 37: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Perl sparse matrix - vector multiplication)# ! / u s r / b in / p e r l

# CRS , i n d e x e s s t a r t a t ze ro

@vec = ( 1 , 2 , 3 , 4 , 5 ) ;@rowptr = ( 0 , 2 , 4 , 5 , 6 , 8 ) ;@col index = ( 1 , 2 , 0 , 3 , 4 , 1 , 0 , 3 ) ;@values = ( 2 , 5 , 1 , 3 , 2 , 1 , 3 , 2 ) ;

$ i = $ARGV[ 0 ] ;

$va lue = 0 ;

f o r ( $ i = 0 ; $ i <5; $ i ++){$sum = 0 ;f o r ( $n= $rowpt r [ $ i ] ; $n<$rowpt r [ $ i + 1 ] ; $n ++){

$sum + = $ v a l u e s [ $n ]∗ $vec [ $ c o l i n d e x [ $n ] ] ;}p r i n t $sum . ”\n” ;

}

January 30, 2003 Page 34

Page 38: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Vector Space Model conclusion

1. Long History: since 1960s

2. Very efficient

(a) Use of sparse matrix methods

(b) Simple linear algebra

(c) Well proven

3. Flexible:

(a) Use in query resolution

(b) Document to document similarity

(c) Clustering (more later)

4. For these reasons, very popular and

often used

5. Disadvantages:

(a) No clearly defined theoretical

framework

(b) Generation of adequate indexes

(c) Assumption of index term

independence

(d) Scalability?

January 30, 2003 Page 35

Page 39: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

The use of probability theory in IR

1. IR is all about uncertainty

(a) Give user query, will results berelevant or not?

(b) System may make estimate, butcertainty can not be obtained

(c) Probability of relevancy is issue

2. Baysesian probability theory:

(a) relates to the collection ofevidence and making informed

guesses

(b) evidence: was document

containing certain keyterms

relevant?

(c) evidence: was document

containing certain keyterms found

not relevant?

(d) Adjust estimate of probability not

on the basis of frequency, but on

the basis of evidence

January 30, 2003 Page 36

Page 40: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Probabilistic Models

1. Attempts to predict page relevance onprobabilistic grounds

2. Rather than compare specific pagefeatures (index terms)

3. Provide a clear and well-groundedframework for IR

(a) Based on statistical principles:very appropriate to IR!

(b) Relevancy odds can be updated totake into account other factorsthan text content

(c) Possibility of user feedback to be

fed into system

4. Basic idea:

(a) Query is associated with ideal

answer set

(b) Properties of answer→ index

terms

(c) Initial guess at answer set

propertiesrightarrow

probabilistic description

(d) Initial retrieval results

(e) User feedback can finetune

relevancy probabilities

January 30, 2003 Page 37

Page 41: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Bayes’ rulehttp://www.cs.berkeley.edu/~murphyk/Bayes/economist.html

Bayesian probability:

Change your beliefs on the basis of evidence.

Example 1:Assume two eventsa (rain) andb (clouds).P(a|b) = P(b|a)×P(a)

P(b)P(a|b) probability it will rain we see clouds

P(b|a) how often have we seen clouds when it rains?

P(a) how often does it rain?

P(b) how often have we seen clouds?

Example 2:a→ ”Person has cancer”

b→ ”Person is a smoker”

P(a|b) = ?

In Belgium: P(b) = 0.3.

Among those diagnosed with cancer, about 80% smokes, soP(b|a) = 0.8.

Overall probability of cancerP(a) = 0.05P(a|b) = P(b|a)×P(a)P(b) = 0.8×0.05

0.3 = 0.13

Quit now!

January 30, 2003 Page 38

Page 42: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Some basic notions of applying probability theoryto IR

Can we do the same for IR?d j is document in collection.

R represents set of relevant documents.

R represents set of non-relevant documents.

Then we can say things like:

P(d j |R) = probability that if a document is relevant, it isd j

We need to determineP(R|d j ) from:

1. P(R)

2. P(d j |C)

Since queries are specified we want to translate these statements, to index term based

statements, for example:

P(ki |R) = probability that document containingki is relevant.

Great! So how do we determine all these probabilities?

January 30, 2003 Page 39

Page 43: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Probabilistic Models: FormalizationBasic probabilistic notionswi, j ∈ {0,1}, wi,q ∈ {0,1}

R is the set of all relevant documents.

R is R’s complement, i.e. the set of all non-relevant documents.

P(R|~d j ) is the probability thatd j is relevant to queryq or probability of relevancygivend j .

P(R|~d j ) is the probability thatd j is not relevant toq.

sim(d j ,q) = P(R|~d j )P(R|~d j )

Bayes’ rule:P(r|e) = P(e|r)×P(R)P(e) :

sim(d j,q) = P(~d j |R)×P(R)P(~d j |R)×P(R)

P(~d j |R) : probability of selectiond j from set of relevant documents.

P(R): probability that randomly selected document is relevant. As book mentions:P(R) and

P(R) are constant for a given collection, so:

sim(d j ,q)∼ P(~d j |R)P(~d j |R)

January 30, 2003 Page 40

Page 44: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Probabilistic Models: Formalization (Continued)Moving to keyterm level:

sim(d j ,q)∼(∏gi (~dj )=1 P(ki |R))×(∏gi (~dj )=0 P(ki |R))

(∏gi (~dj )=1 P(ki |R))×(∏gi (~dj )=0 P(ki |R))

whereP(ki |R) stands for the probability thatki ∈ d j selected from R and

P(ki |R) stands for the probability thatki /∈ ind j selected from R.

P(ki |R) stands for our probability to make a mistake, etc.

Finally the canonical probabilistic information retrieval equation:

sim(d j ,q)∼ ∑ti=1 wi,q×wi, j ×

(log P(ki |R)

1−P(ki |R) + log 1−P(ki |R)P(ki |R)

)Does this work? No! We need an initial estimate ofP(ki |R) and P(ki |R)

Various schemes.

P(ki |R) = 0.5

P(ki |R) = niN

Givenwi,q andwi,q these estimates allow an initial set of documents to be retrieved and

ranked.

January 30, 2003 Page 41

Page 45: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Updating our chancesA bit of recursive reasoning:

Let V be a subset of documents initially retrieved and ranked (e.g. rank above threshold).

Let Vi ⊂V of documents containing documentki .

We can now improve our guesses forP(ki |R) andP(ki |R):

P(k|R) =Vi

V

P(ki |R) =ni −Vi

N−V

So,P(k|R) is updated according to how many of the retrieved set of documents containski .

Similarly, the documents not retrieved indicate our chances of a document containingki not

being relevant.

User assistance can be used to update odds, e.g. user can indicate which documents are

relevant.

January 30, 2003 Page 42

Page 46: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Updating our chances, contd.Problem when retrieved set is small:

Adjustment factors:

P(k|R) =Vi +0.5V +1

P(ki |R) =ni −Vi +0.5N−V +1

This factor increases the “importance” of small retrieval sets.

Constant 0.5 factor may not always be desireable:

P(k|R) =Vi +

niN

V +1

P(ki |R) =ni −Vi +

niN

N−V +1

so that adjustment is is relative to overall number of documents containingki .

January 30, 2003 Page 43

Page 47: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

Conclusion

1. Boolean Model:

(a) Binary index term weights

(b) Disjunctive Normal Factors

(c) No partial matching

(d) No ranking

(e) Simple and very efficient

2. Vector space model

(a) Index term weights:

i. Term Frequencyii. Inverse Document Frequency

(b) Query and document vector

(c) Cosine measure ofquery-document similarity

3. Probabilistic Model:

(a) Based on probabilistic estimate of

relevancy document given certain

keyterms

(b) Allows system to adapt to users

and provides strong formal

framework for retrieval

(c) Rather complicated

(d) Performance:

i. Thought to be less efficient

than vector space model

ii. Problematic initial fine-tuning

phase

iii. Probability: semantics may be

issue

January 30, 2003 Page 44

Page 48: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

How do they stack up?

1. Boolean:

(a) Pro:

i. Efficientii. Easy to implement

iii. Prefered by users in spite ofinability to formulate adequatequeries

(b) Con:

i. No rankingii. No partial matching

iii. Entirely dependent on keytermdefinition and binary weights

2. Vector Space Model:

(a) Pro:

i. Keyterm weighing

ii. Partial Matching

iii. Ranking

iv. Efficient

v. Easy to understand framework

vi. Flexibility

(b) Con:

i. Little formal framework

ii. Assumption of keyterm

independence

3. Probabilistic Model:

(a) Compares poorly to Vector Space

Model

(b) Occam’s razor may apply to IR as

well

(c) Case for many novel models

January 30, 2003 Page 45

Page 49: Lecture 2: IR Models. Johan Bollenjbollen/spring03_IR/cs695_lect2.pdf · jbollen@cs.odu.edu Formal Characterization of IR systems An information retrieval model is defined as the

[email protected]

ReadingsSalton (1988) Term weighting approaches in automatic text retrieval

Wildemuth (1998) Full Text Available Hypertext versus Boolean access to biomedicalinformation: a comparison of effectiveness, efficiency, and user preferences

http://www.cs.odu.edu/~jbollen/spring03_IR/readings/

January 30, 2003 Page 46