Lecture 2:
IR Models.
Johan Bollen
Old Dominion University
Department of Computer Science
http://www.cs.odu.edu/~jbollen
January 30, 2003 Page 1
Structure
1. IR formal characterization
(a) Mathematical framework
(b) Document and keyterm
representations
2. Models
(a) Boolean Model
(b) Vector Space Model
(c) Probabilistic Model
Formal Characterization of IR systems

An information retrieval model is defined as the quadruple [D, Q, F, R(q_i, d_j)] where:

1. D is a set of document representations (logical views)
2. Q is a set of representations of the user's information needs (queries)
3. F is the framework for modeling document representations, queries and their relationships
4. R(q_i, d_j) is a ranking function which associates a number ∈ R with a query q_i ∈ Q and a document representation d_j ∈ D.
Classic IR concepts

1. Concepts:
(a) Documents represented by index terms
(b) Index term weights: specificity for a specific document
(c) The same may apply to the query
2. Relevant to the Boolean, vector and probabilistic models

k_i ∈ K = {k_1, ..., k_t}: generic index term
d_j: document
w_{i,j} > 0: weight associated with (k_i, d_j)
if k_i does not occur in d_j, w_{i,j} = 0

d_j is thus associated with the index term vector
~d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

The function g_i returns the weight of index term k_i in ~d_j, such that g_i(~d_j) = w_{i,j}
Example: Edgar Allan Poe's The Conqueror Worm, 1843

Last half of the final stanza:
1) And the angels, all pallid and wan,
2) Uprising, unveiling, affirm
3) That the play is the tragedy, "Man,"
4) And its hero the Conqueror Worm.

Hand-selected keyterms:
1) affirm 7) play
2) angel 8) tragedy
3) conqueror 9) unveil
4) hero 10) uprise
5) man 11) wan
6) pallid 12) worm

g_5(~d_3) = 1, g_3(~d_2) = 0
Matrix representation

w_{1,1} w_{1,2} ... w_{1,t}
w_{2,1} w_{2,2} ... w_{2,t}
...
w_{n,1} w_{n,2} ... w_{n,t}

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

~d_3 = (0,0,0,0,1,0,1,1,0,0,0,0), ~k_4 = (0,0,0,1)

1. Keyterm-document, or in this case document-keyterm, matrix
2. Represents weight values between keyterms and documents
3. Usually sparse, and very large
4. Assumption of independence or orthogonality: keyterms are not related!
5. In this specific case: weight values are binary
What are IR models?

1. Abstract models of IR techniques:
(a) Characteristics
(b) Assumptions
(c) General modus operandi
(d) Properties
2. Do not comprise full-blown IR systems
(a) Technical details will be discussed later
(b) Focus is on assumptions and characteristics
3. The present discussion is not the be-all and end-all of IR
(a) New models can be introduced
(b) The devil is often in the details
(c) Large collections require specific engineering efforts
Boolean Model

1. Basic principles:
(a) User query is shaped as a Boolean expression
(b) Documents are predicted to be either relevant or not
2. Advantages:
(a) Neat formalism
(b) Easy and efficient implementation
(c) Widespread adoption has accustomed users to Boolean queries
3. Disadvantages:
(a) No document ranking
(b) More a data retrieval than an information retrieval model
(c) In spite of user preferences, it is not straightforward to formulate adequate queries
Boolean Model: Formalization

w_{i,j} ∈ {0,1}

A query q consists of keyterms connected by NOT, AND and OR. For example:
q = [k_a ∧ (k_b ∨ ¬k_c)]
or: q = ka && (kb || !kc)
or: "brothers" AND ( "chemical" OR NOT "doobie" )

For every Boolean statement there exists a Disjunctive Normal Form (DNF):
q = [k_a ∧ (k_b ∨ ¬k_c)] = (1,0,0) ∨ (1,1,0) ∨ (1,1,1)

ka kb kc | [k_a ∧ (k_b ∨ ¬k_c)]
 0  0  0 | 0
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 0
 1  0  0 | 1   k_a ∧ ¬k_b ∧ ¬k_c
 1  0  1 | 0
 1  1  0 | 1   k_a ∧ k_b ∧ ¬k_c
 1  1  1 | 1   k_a ∧ k_b ∧ k_c
DNF

[Venn diagram of Ka, Kb, Kc showing the three conjunctive components:]
Ka AND NOT Kb AND NOT Kc
Ka AND Kb AND NOT Kc
Ka AND Kb AND Kc
Boolean Model: Query matching

Let ~q_dnf be the DNF of q, and ~q_cc the conjunctive components of ~q_dnf. Then:

sim(d_j, q) = 1 if ∃ ~q_cc : (~q_cc ∈ ~q_dnf) ∧ (∀ k_i, g_i(~d_j) = g_i(~q_cc))
sim(d_j, q) = 0 otherwise

In plain language: the similarity between a document and a user query is one when there exists a conjunctive component of the query's DNF in which every keyterm has the same value as it does in the document (1 = present, 0 = absent).
Boolean Model: Example

Query: "play" AND ( "tragedy" OR NOT "worm" )
translates to DNF:
("play" AND NOT "tragedy" AND NOT "worm") = (100)
OR ("play" AND "tragedy" AND NOT "worm") = (110)
OR ("play" AND "tragedy" AND "worm") = (111)

1) affirm, 2) angel, 3) conqueror, 4) hero, 5) man, 6) pallid,
7) play, 8) tragedy, 9) unveil, 10) uprise, 11) wan, 12) worm

0 1 0 0 0 1 0 0 0 0 1 0
1 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 1 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1

d_3 = "That the play is the tragedy, 'Man,'" contains "play" and "tragedy" but not "worm", and thus matches component (110).
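The matching rule above can be sketched as a short script (Python here rather than the lecture's Perl; keyterm indices and binary document vectors follow the Conqueror Worm example):

```python
# Boolean DNF matching over the lecture's binary document-keyterm matrix.
KEYTERMS = ["affirm", "angel", "conqueror", "hero", "man", "pallid",
            "play", "tragedy", "unveil", "uprise", "wan", "worm"]

DOCS = [
    (0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
    (1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0),
    (0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0),
    (0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1),
]

# DNF of: "play" AND ("tragedy" OR NOT "worm"),
# expressed as truth assignments over (play, tragedy, worm)
QUERY_TERMS = ("play", "tragedy", "worm")
DNF = [(1, 0, 0), (1, 1, 0), (1, 1, 1)]

def sim(doc):
    """1 if some conjunctive component agrees with the document
    on every query keyterm, 0 otherwise."""
    idx = [KEYTERMS.index(t) for t in QUERY_TERMS]
    doc_vals = tuple(doc[i] for i in idx)
    return 1 if doc_vals in DNF else 0

for n, d in enumerate(DOCS, start=1):
    print(f"d{n}: {sim(d)}")   # only d3 matches, via component (1,1,0)
```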
Boolean Model: Discussion

1. Very popular because of its simplicity and efficiency
2. Does not rank results
3. Strongly dependent on the exact use of keyterms
4. Not really IR, but rather data retrieval
(a) Keyterms are semantically "shallow" handles on files
(b) User queries are intended to match completely:
i. problems with jargon
ii. synonymy
iii. user has to be aware of exact keyterms
iv. humans are not very good at formulating complicated Boolean queries
Vector Space Model

1. Keyterm-based model
2. Allows partial matching and ranking of documents
3. Basic principles:
(a) Document represented by a keyterm vector
(b) t-dimensional space spanned by the set of keyterms
(c) Query represented by a keyterm vector
(d) Document-query similarity calculated from vector distance
4. Requires:
(a) Keyterm weighting for document vectors
(b) Keyterm weighting for the query
(c) Distance measure for document and query vectors
5. Features:
(a) Efficient
(b) Successful
(c) Easy-to-grasp visual representation
(d) Applicable to document-document matching
Vector Space Model

Definition:
Let
w_{i,j} ∈ R+ be the weight associated with (k_i, d_j)
w_{i,q} ∈ R+ be the weight associated with (k_i, q)

Query vector ~q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the total number of index terms in the system.
Document d_j is represented by ~d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
Both document d_j and query q are thus represented as t-dimensional vectors.

The similarity between document and query is evaluated by the cosine of their angle:

sim(d_j, q) = (~d_j · ~q) / (‖~d_j‖ × ‖~q‖)
            = [Σ_{i=1}^t w_{i,j} × w_{i,q}] / [√(Σ_{i=1}^t w_{i,j}²) × √(Σ_{i=1}^t w_{i,q}²)]

Based on the definition of the dot product:
~d_j · ~q = ‖~d_j‖ ‖~q‖ cos(α)

Cosine similarity ∈ [0,1] since w_{i,j} ≥ 0 and w_{i,q} ≥ 0.
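The cosine formula translates directly into code (a sketch in Python rather than the lecture's Perl; the vectors are illustrative):

```python
from math import sqrt

def cosine(d, q):
    """Cosine similarity between a document vector and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    nd = sqrt(sum(w * w for w in d))
    nq = sqrt(sum(w * w for w in q))
    if nd == 0 or nq == 0:
        return 0.0  # empty vector: define similarity as 0
    return dot / (nd * nq)

# co-linear vectors give 1 (up to rounding), perpendicular vectors give 0
print(cosine((1, 2, 0), (2, 4, 0)))
print(cosine((1, 0, 0), (0, 1, 0)))
```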
Cosine similarity measure

[Figure: query Q and documents d1, d2 as vectors in the space spanned by keyterms k1, k2, k3; α and β are the angles between Q and the document vectors.]

cos(α) = 0: perpendicular vectors; cos(α) = 1: co-linear vectors
Mommy, where do the index term weights come from?

1. Goal is to find a weight that expresses how specific a term is to the intended set of relevant documents
2. Use of frequency
3. Two issues:
(a) term frequency (TF): frequency within a document
(b) document frequency (DF): frequency within the collection
4. Term frequency
(a) Expresses the connection between a term and a specific document
(b) A term occurring often within a document is indicative of its meaning
5. Document Frequency
(a) Expresses the general frequency of a term, e.g. "the"
(b) Characteristic of the language
6. Aim is to find keyterms which balance high TF and relatively low DF
Definition

Term frequency
Let N be the total number of documents in the collection
and n_i the number of documents in which index term k_i appears.
freq_{i,j} is the frequency of k_i in d_j.

Normalized term frequency is then defined as:
f_{i,j} = freq_{i,j} / max_l freq_{l,j}
where max_l freq_{l,j} is determined over all terms in document d_j.

Inverse document frequency
idf_i = log(N / n_i)

Index term weight
w_{i,j} = f_{i,j} × log(N / n_i)
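These tf-idf weights can be sketched as follows (Python rather than the lecture's Perl; the toy corpus is illustrative):

```python
from math import log10

def tfidf_weights(docs):
    """docs: list of token lists. Returns one weight dict per document,
    with w_ij = (freq_ij / max_l freq_lj) * log10(N / n_i)."""
    N = len(docs)
    # n_i: number of documents containing term i
    n = {}
    for doc in docs:
        for term in set(doc):
            n[term] = n.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq = {}
        for term in doc:
            freq[term] = freq.get(term, 0) + 1
        max_freq = max(freq.values())
        weights.append({t: (f / max_freq) * log10(N / n[t])
                        for t, f in freq.items()})
    return weights

docs = [["chamber", "door"], ["door", "tap"], ["lore"]]
w = tfidf_weights(docs)
print(round(w[2]["lore"], 2))   # log10(3/1) = 0.48
```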
Query keyterm weights

Balance DF and query frequency:

w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × log(N / n_i)

where freq_{i,q} is the frequency of the term in the query (usually 1).

Document-keyterm matrix:

     k1       k2       ...  k10
d1   w_{1,1}  w_{1,2}  ...  w_{1,10}
d2   w_{2,1}  w_{2,2}  ...  w_{2,10}
d3   w_{3,1}  w_{3,2}  ...  w_{3,10}
d4   w_{4,1}  w_{4,2}  ...  w_{4,10}
Example: Edgar Allan Poe's The Raven (1845)

First stanza:
1) Once upon a midnight dreary, while I pondered, weak and weary,
2) Over many a quaint and curious volume of forgotten lore,
3) While I nodded, nearly napping, suddenly there came a tapping,
4) As of some one gently rapping, rapping at my chamber door.
5) "'Tis some visiter," I muttered, "tapping at my chamber door -
6) Only this, and nothing more."

Hand-selected keyterms - document frequency:
1) chamber-2, 2) door-2, 3) lore-1, 4) midnight-1,
5) nothing-1, 6) tap-1, 7) visiter-1, 8) volume-1

document-keyterm matrix:
6×8 matrix whose entries w_{i,j} ∈ R+
Example: Document-Keyterm Matrix

     chamber  door     lore  midnight  nothing  tap  visiter  volume
d1   w_{1,1}  w_{1,2}  ...   w_{1,8}
d2   w_{2,1}  w_{2,2}  ...   w_{2,8}
d3   w_{3,1}  w_{3,2}  ...   w_{3,8}
d4   w_{4,1}  w_{4,2}  ...   w_{4,8}
d5   w_{5,1}  w_{5,2}  ...   w_{5,8}
d6   w_{6,1}  w_{6,2}  ...   w_{6,8}

k_1 = "chamber".
Its frequency is zero except for d_4 and d_5.

w_{1,4} = f_{1,4} × log(N / n_1)
or
w_{1,4} = (freq_{1,4} / max_l freq_{l,4}) × log(N / n_1)

N = 6, n_1 = 2, freq_{1,4} = 1, max_l freq_{l,4} = 1, so
w_{1,4} = (1/1) × log10(6/2) = 0.48
Example: Document-Keyterm Matrix

Result:
     chamber  door  lore  midnight  nothing  tap   visiter  volume
d1   0        0     0     0.78      0        0     0        0
d2   0        0     0.78  0         0        0     0        0.78
d3   0        0     0     0         0        0.78  0        0
d4   0.48     0.48  0     0         0        0     0        0
d5   0.48     0.48  0     0         0        0     0.78     0
d6   0        0     0     0         0.78     0     0        0

Note: the weight values for "chamber" and "door" are lower because these terms are less specific: they occur in two documents rather than one.
Example: Query

Query: "Visitor at your door"
Translation to known keyterms: { "visiter", "door" }

Query weights:
w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × log(N / n_i)

"visiter": w_{7,q} = (0.5 + 0.5 × 1/1) × log(6/1) = 0.195
"door":    w_{2,q} = (0.5 + 0.5 × 1/1) × log(6/2) = 0.119

Resulting query vector: (0, 0.119, 0, 0, 0, 0, 0.195, 0)
Retrieval

Query-Document similarity:

sim(d_j, q) = (~d_j · ~q) / (‖~d_j‖ × ‖~q‖)

Procedure:
1. Calculate the numerator (inner product of ~q and document vector) for all documents
(a) Weight matrix - query vector multiplication = vector of inner products
(b) Store the resulting vector
2. Calculate the denominator (norms of ~q and the document vectors)
3. Determine similarities
(a) Normalize the vector of inner products by the vector norms
(b) Rank documents according to the values in the resulting vector
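The procedure can be sketched end-to-end on the lecture's weight matrix and query vector (Python rather than Perl; full precision is kept throughout, so small differences from the slide's rounded intermediate values are expected):

```python
from math import sqrt

# Document-keyterm weight matrix from the Raven example (rows d1..d6)
M = [
    [0,    0,    0,    0.78, 0,    0,    0,    0],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0],
    [0,    0,    0,    0,    0.78, 0,    0,    0],
]
q = [0, 0.119, 0, 0, 0, 0, 0.195, 0]  # query "Visitor at your door"

def norm(v):
    return sqrt(sum(x * x for x in v))

# 1. inner products, 2. norms, 3. normalized similarities
sims = []
for row in M:
    dot = sum(w * wq for w, wq in zip(row, q))
    sims.append(dot / (norm(row) * norm(q)) if dot else 0.0)

# rank documents by similarity, best first
ranking = sorted(range(len(M)), key=lambda i: -sims[i])
print(["d%d" % (i + 1) for i in ranking[:2]])   # ['d5', 'd4']
```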
Example

Inner products: multiply the 6×8 weight matrix by the query vector

0    0    0    0.78 0    0    0    0
0    0    0.78 0    0    0    0    0.78
0    0    0    0    0    0.78 0    0
0.48 0.48 0    0    0    0    0    0
0.48 0.48 0    0    0    0    0.78 0
0    0    0    0    0.78 0    0    0

× (0, 0.12, 0, 0, 0, 0, 0.2, 0)^T = (0, 0, 0, 0.06, 0.21, 0)^T
Example

Norm of the query vector: √(~q · ~q^T) = 0.233

Norms of the document vectors:
(‖~d_1‖, ..., ‖~d_6‖) = (0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Normalize the inner products by the vector norms, then rank:

(0, 0, 0, 0.06, 0.21, 0) → normalize → (0, 0, 0, 0.379, 0.871, 0)

rank  doc
1     d5
2     d4
...
Document to document similarities

The vector space model can also be used to calculate similarities between documents in the collection: instead of query-to-document vector similarity, document-to-document.

Procedure:
1. Rather than operating on a query vector, operate on pairs of document vectors
2. Use the document-keyterm matrix
(a) Document vectors in our example: the row vectors
(b) Pairwise cosine similarity
3. The procedure is very expensive
(a) Can be optimized for sparse matrices
(b) Efficient in comparison to other methods

In mathematical terms: let M be the document-keyterm matrix.
Matrix of inner products: M_i = M × M^t
Normalize by the norms of M's row vectors, so that sim(d_i, d_i) = 1.
Document to document example

M_i = M × M^t =

0.608 0     0     0     0     0
0     1.217 0     0     0     0
0     0     0.608 0     0     0
0     0     0     0.46  0.46  0
0     0     0     0.46  1.069 0
0     0     0     0     0     0.608

Norms of the matrix row vectors (denominator for the calculation of the similarity values):
(0.78, 1.103, 0.78, 0.679, 1.034, 0.78)

Normalized matrix values:

1 0 0 0     0     0
0 1 0 0     0     0
0 0 1 0     0     0
0 0 0 1     0.655 0
0 0 0 0.655 1     0
0 0 0 0     0     1

We find documents 4 and 5 to be similar.
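This computation can be sketched in a few lines (Python rather than Perl; M is the Raven weight matrix from before; full precision gives 0.656 where the slide's rounded intermediates give 0.655):

```python
from math import sqrt

M = [
    [0,    0,    0,    0.78, 0,    0,    0,    0],
    [0,    0,    0.78, 0,    0,    0,    0,    0.78],
    [0,    0,    0,    0,    0,    0.78, 0,    0],
    [0.48, 0.48, 0,    0,    0,    0,    0,    0],
    [0.48, 0.48, 0,    0,    0,    0,    0.78, 0],
    [0,    0,    0,    0,    0.78, 0,    0,    0],
]

n = len(M)
norms = [sqrt(sum(w * w for w in row)) for row in M]

# S[i][j] = cosine similarity between documents i and j
S = [[sum(a * b for a, b in zip(M[i], M[j])) / (norms[i] * norms[j])
      for j in range(n)] for i in range(n)]

print(round(S[3][4], 3))   # d4 vs d5: 0.656
print(round(S[0][0], 3))   # a document is maximally similar to itself
```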
Sparse matrices

1. Document-keyterm and document-document matrices are necessarily sparse
(a) Low ratio of non-zero entries over all entries
(b) Expressed as a percentage, e.g. 3% density
2. Sparseness is natural
(a) Possible matrix entries: n²
(b) Few "relationships" can be semantically valid
(c) Large, heterogeneous document collections
(d) Many matrices are well below 1% density
3. Storage and manipulation techniques
(a) Coordinate format
(b) Compressed Column and Compressed Row format
(c) Harwell-Boeing format
(d) These formats only store non-zero entries
i. dense format: n² entries
ii. Harwell-Boeing: ≈ n + 1 + 2nz entries
Coordinate system

0 2 5 0 0
1 0 0 3 0
0 0 0 0 2
0 1 0 0 0
3 0 0 2 0

Represented by:

row     2 5 1 4 1 2 5 3
col     1 1 2 2 3 4 4 5
values  1 3 2 1 5 3 2 2
Compressed Column format or Harwell-Boeing Format

1. Slightly more efficient than the coordinate format (about 30% savings)
2. Represent an n×m matrix by three arrays:
(a) Column pointers: m+1 values
(b) Row indexes: nz values
(c) Values (non-zero entries): nz values
3. A value m(i,j) is retrieved by:
(a) Column pointer lookup
(b) Row index range lookup
(c) Value lookup
Harwell-Boeing Format

0 2 5 0 0
1 0 0 3 0
0 0 0 0 2
0 1 0 0 0
3 0 0 2 0

Represented by:

colptr    1 3 5 6 8 9
rowindex  2 5 1 4 1 2 5 3
values    1 3 2 1 5 3 2 2

m(1,3) = ?

1. Specific:
(a) value indices for column 3: colptr[3] ≤ index < colptr[3+1], i.e. 5 ≤ index < 6
(b) rowindex[5] = 1, which matches row 1
(c) corresponding value: values[5] = 5
2. General:
(a) m(i,j) = values[index], where colptr[j] ≤ index < colptr[j+1] and rowindex[index] = i
(b) note: colptr specifies a range of positions in rowindex
(c) colptr[j]: index of the first stored value of column j
Perl sparse matrix example

#!/usr/bin/perl
# Harwell-Boeing (compressed column) lookup; indexes start at zero
@colptr   = (0, 2, 4, 5, 7, 8);
@rowindex = (1, 4, 0, 3, 0, 1, 4, 2);
@values   = (1, 3, 2, 1, 5, 3, 2, 2);

$i = $ARGV[0];
$j = $ARGV[1];

# number of non-zero entries in column $j
$r = $colptr[$j + 1] - $colptr[$j];

$value = 0;
for ($n = $colptr[$j]; $n < $colptr[$j + 1]; $n++) {
    if ($rowindex[$n] == $i) {
        $value = $values[$n];
    }
}
print "value: $value\n";
Perl sparse matrix - Compressed Row Storage (CRS)

#!/usr/bin/perl
# CRS; indexes start at zero
@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

$i = $ARGV[0];
$j = $ARGV[1];

$value = 0;
for ($n = $rowptr[$i]; $n < $rowptr[$i + 1]; $n++) {
    if ($colindex[$n] == $j) {
        $value = $values[$n];
    }
}
print "value: $value\n";
Perl sparse matrix - vector multiplication

#!/usr/bin/perl
# CRS; indexes start at zero
@vec      = (1, 2, 3, 4, 5);
@rowptr   = (0, 2, 4, 5, 6, 8);
@colindex = (1, 2, 0, 3, 4, 1, 0, 3);
@values   = (2, 5, 1, 3, 2, 1, 3, 2);

# one inner product per matrix row
for ($i = 0; $i < 5; $i++) {
    $sum = 0;
    for ($n = $rowptr[$i]; $n < $rowptr[$i + 1]; $n++) {
        $sum += $values[$n] * $vec[$colindex[$n]];
    }
    print $sum . "\n";
}
Vector Space Model conclusion

1. Long history: since the 1960s
2. Very efficient
(a) Use of sparse matrix methods
(b) Simple linear algebra
(c) Well proven
3. Flexible:
(a) Use in query resolution
(b) Document to document similarity
(c) Clustering (more later)
4. For these reasons, very popular and often used
5. Disadvantages:
(a) No clearly defined theoretical framework
(b) Generation of adequate indexes
(c) Assumption of index term independence
(d) Scalability?
The use of probability theory in IR

1. IR is all about uncertainty
(a) Given a user query, will the results be relevant or not?
(b) The system may make an estimate, but certainty cannot be obtained
(c) The probability of relevancy is the issue
2. Bayesian probability theory:
(a) relates to the collection of evidence and the making of informed guesses
(b) evidence: was a document containing certain keyterms found relevant?
(c) evidence: was a document containing certain keyterms found not relevant?
(d) Adjust the estimate of probability not on the basis of frequency, but on the basis of evidence
Probabilistic Models

1. Attempt to predict page relevance on probabilistic grounds
2. Rather than compare specific page features (index terms)
3. Provide a clear and well-grounded framework for IR
(a) Based on statistical principles: very appropriate to IR!
(b) Relevancy odds can be updated to take into account factors other than text content
(c) Possibility of user feedback being fed into the system
4. Basic idea:
(a) Query is associated with an ideal answer set
(b) Properties of the answer → index terms
(c) Initial guess at answer set properties → probabilistic description
(d) Initial retrieval results
(e) User feedback can fine-tune the relevancy probabilities
Bayes' rule
http://www.cs.berkeley.edu/~murphyk/Bayes/economist.html

Bayesian probability: change your beliefs on the basis of evidence.

Example 1: Assume two events a (rain) and b (clouds).

P(a|b) = P(b|a) × P(a) / P(b)

P(a|b): probability that it will rain given that we see clouds
P(b|a): how often have we seen clouds when it rains?
P(a): how often does it rain?
P(b): how often have we seen clouds?

Example 2:
a → "Person has cancer"
b → "Person is a smoker"
P(a|b) = ?

In Belgium: P(b) = 0.3.
Among those diagnosed with cancer, about 80% smoke, so P(b|a) = 0.8.
Overall probability of cancer: P(a) = 0.05.

P(a|b) = P(b|a) × P(a) / P(b) = (0.8 × 0.05) / 0.3 = 0.13

Quit now!
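The smoker example works out as follows (a trivial Python check of the arithmetic):

```python
# Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)
p_b = 0.3          # P(smoker) in Belgium
p_b_given_a = 0.8  # P(smoker | cancer)
p_a = 0.05         # P(cancer)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))   # 0.13
```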
Some basic notions of applying probability theory to IR

Can we do the same for IR? Let d_j be a document in the collection.
R represents the set of relevant documents.
R̄ represents the set of non-relevant documents.

Then we can say things like:
P(d_j|R) = probability that, if a document is selected from the relevant set, it is d_j.

We need to determine P(R|d_j) from:
1. P(R)
2. P(d_j|R)

Since queries are specified as index terms, we want to translate these statements to index-term-based statements, for example:
P(k_i|R) = probability that a document selected from R contains k_i.

Great! So how do we determine all these probabilities?
Probabilistic Models: Formalization

Basic probabilistic notions: w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}

R is the set of all relevant documents.
R̄ is R's complement, i.e. the set of all non-relevant documents.
P(R|~d_j) is the probability that d_j is relevant to query q, i.e. the probability of relevancy given ~d_j.
P(R̄|~d_j) is the probability that d_j is not relevant to q.

sim(d_j, q) = P(R|~d_j) / P(R̄|~d_j)

By Bayes' rule, P(r|e) = P(e|r) × P(r) / P(e), so:

sim(d_j, q) = [P(~d_j|R) × P(R)] / [P(~d_j|R̄) × P(R̄)]

P(~d_j|R): probability of selecting d_j from the set of relevant documents.
P(R): probability that a randomly selected document is relevant.

As the book mentions, P(R) and P(R̄) are constant for a given collection, so:

sim(d_j, q) ∼ P(~d_j|R) / P(~d_j|R̄)
Probabilistic Models: Formalization (Continued)

Moving to the keyterm level:

sim(d_j, q) ∼ [(∏_{g_i(~d_j)=1} P(k_i|R)) × (∏_{g_i(~d_j)=0} P(k̄_i|R))] / [(∏_{g_i(~d_j)=1} P(k_i|R̄)) × (∏_{g_i(~d_j)=0} P(k̄_i|R̄))]

where P(k_i|R) stands for the probability that k_i is present in a document selected from R, and
P(k̄_i|R) stands for the probability that k_i is not present in a document selected from R.
P(k_i|R̄) stands for the probability of making a mistake, etc.

Finally, the canonical probabilistic information retrieval equation:

sim(d_j, q) ∼ Σ_{i=1}^t w_{i,q} × w_{i,j} × ( log [P(k_i|R) / (1 − P(k_i|R))] + log [(1 − P(k_i|R̄)) / P(k_i|R̄)] )

Does this work? No! We need initial estimates of P(k_i|R) and P(k_i|R̄).
Various schemes exist, e.g.:

P(k_i|R) = 0.5
P(k_i|R̄) = n_i / N

Given w_{i,q} and w_{i,j}, these estimates allow an initial set of documents to be retrieved and ranked.
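The initial ranking with these estimates can be sketched as follows (Python rather than Perl; binary weights and a toy three-document collection):

```python
from math import log

# Binary document-keyterm matrix (toy data; rows = documents)
docs = [
    (0, 1, 1),
    (1, 1, 0),
    (0, 0, 1),
]
query = (1, 0, 1)                     # binary query weights w_iq
N = len(docs)
n = [sum(col) for col in zip(*docs)]  # n_i: documents containing keyterm i

def sim(doc):
    """Canonical probabilistic ranking score with the initial
    estimates P(ki|R) = 0.5 and P(ki|R-bar) = n_i / N."""
    s = 0.0
    for wq, wd, ni in zip(query, doc, n):
        p_r, p_notr = 0.5, ni / N
        s += wq * wd * (log(p_r / (1 - p_r)) + log((1 - p_notr) / p_notr))
    return s

# rank documents by decreasing score
ranked = sorted(range(N), key=lambda j: -sim(docs[j]))
print(ranked)   # [1, 0, 2]: d2 contains the rarer query keyterm
```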
Updating our chances

A bit of recursive reasoning:
Let V be the subset of documents initially retrieved and ranked (e.g. ranked above a threshold).
Let V_i ⊂ V be the subset of documents containing index term k_i.

We can now improve our guesses for P(k_i|R) and P(k_i|R̄):

P(k_i|R) = V_i / V
P(k_i|R̄) = (n_i − V_i) / (N − V)

So P(k_i|R) is updated according to how many of the retrieved documents contain k_i. Similarly, the documents not retrieved indicate the chances of a document containing k_i not being relevant.

User assistance can be used to update the odds, e.g. the user can indicate which documents are relevant.
Updating our chances, contd.

A problem arises when the retrieved set is small. Adjustment factors:

P(k_i|R) = (V_i + 0.5) / (V + 1)
P(k_i|R̄) = (n_i − V_i + 0.5) / (N − V + 1)

This smooths the estimates for small retrieval sets. The constant 0.5 factor may not always be desirable:

P(k_i|R) = (V_i + n_i/N) / (V + 1)
P(k_i|R̄) = (n_i − V_i + n_i/N) / (N − V + 1)

so that the adjustment is relative to the overall number of documents containing k_i.
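The update step with the 0.5 adjustment factors can be sketched as (Python; the counts in the usage example are illustrative):

```python
def update_estimates(V, Vi, N, ni):
    """Re-estimate P(ki|R) and P(ki|R-bar) from an initial retrieval:
    V = number of retrieved documents, Vi = retrieved documents
    containing ki, N = collection size, ni = documents containing ki.
    Uses the 0.5 adjustment factors."""
    p_r = (Vi + 0.5) / (V + 1)
    p_notr = (ni - Vi + 0.5) / (N - V + 1)
    return p_r, p_notr

# 10 documents retrieved out of 100; ki occurs in 8 of them
# and in 20 documents overall
p_r, p_notr = update_estimates(V=10, Vi=8, N=100, ni=20)
print(round(p_r, 3), round(p_notr, 3))   # 0.773 0.137
```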
Conclusion

1. Boolean Model:
(a) Binary index term weights
(b) Disjunctive Normal Form
(c) No partial matching
(d) No ranking
(e) Simple and very efficient
2. Vector space model:
(a) Index term weights:
i. Term Frequency
ii. Inverse Document Frequency
(b) Query and document vectors
(c) Cosine measure of query-document similarity
3. Probabilistic Model:
(a) Based on a probabilistic estimate of the relevancy of a document given certain keyterms
(b) Allows the system to adapt to users and provides a strong formal framework for retrieval
(c) Rather complicated
(d) Performance:
i. Thought to be less efficient than the vector space model
ii. Problematic initial fine-tuning phase
iii. Probability semantics may be an issue
How do they stack up?

1. Boolean:
(a) Pro:
i. Efficient
ii. Easy to implement
iii. Preferred by users in spite of their inability to formulate adequate queries
(b) Con:
i. No ranking
ii. No partial matching
iii. Entirely dependent on keyterm definition and binary weights
2. Vector Space Model:
(a) Pro:
i. Keyterm weighting
ii. Partial matching
iii. Ranking
iv. Efficient
v. Easy-to-understand framework
vi. Flexibility
(b) Con:
i. Little formal framework
ii. Assumption of keyterm independence
3. Probabilistic Model:
(a) Compares poorly to the Vector Space Model
(b) Occam's razor may apply to IR as well
(c) A case for many novel models
Readings

Salton (1988), Term weighting approaches in automatic text retrieval.
Wildemuth (1998), Hypertext versus Boolean access to biomedical information: a comparison of effectiveness, efficiency, and user preferences.
http://www.cs.odu.edu/~jbollen/spring03_IR/readings/