Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

59
Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen

Transcript of Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Page 1: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 1

Terms and Query Operations

Hsin-Hsi Chen

Page 2: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 2

Lexical Analysis and Stoplists

Page 3: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 3

Lexical Analysis for Automatic Indexing

• Lexical AnalysisConvert an input stream of characters into stream words or token.

• What is a word or a token?Tokens consist of letters.

– digits: Most numbers are not good index terms.counterexamples: case numbers in a legal database, “B6” and “B12” in vitamin database.

– Hyphens• break hyphenated words: state-of-the-art, state of the art• keep hyphenated words as a token: “Jean-Claude”, “F-16”

Page 4: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 4

Lexical Analysis for Automatic Indexing(Continued)

– other punctuation: often used as parts of terms, e.g., OS/2

– Case: usually not significant in index terms

• Issues: recall and precision– breaking up hyphenated terms

increase recall but decrease precision

– preserving case distinctionsenhance precision but decrease recall

– commercial information systemsusually take recall enhancing approach

Page 5: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 5

Lexical Analysis for Query Processing

• Tasks– depend on the design strategies of the lexical analyzer

for automatic indexing (search terms must match index terms)

– distinguish operators like Boolean operators, stemming or truncating operators, and weighting functions

– distinguish grouping indicators like parentheses and brackets

Page 6: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 6

stoplist (negative dictionary)

• Avoid retrieving almost very item in a database regardless of its relevance.

• Example (derived from Brown corpus): 425 wordsa, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...

Page 7: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 7

Implementing Stoplists

• approaches– examine lexical analyzer output and remove any

stopwords

– remove stopwords as part of lexical analysis

Page 8: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 8

Stemming Algorithms

Page 9: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 9

Stemmers• programs that relate morphologically similar indexing and search terms• stem at indexing time

– advantage: efficiency and index file compression– disadvantage: information about the full terms is lost

• example (CATALOG system), stem at search timeLook for: system usersSearch Term: users Term Occurrences

1. user 152. users 13. used 34. using 2

Which terms (0=none, CR=all):

Page 10: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 10

Conflation Methods• manual• automatic (stemmers)

– affix removallongest match vs. simple removal

– successor variety– table lookup– n-gram

• evaluation– correctness– retrieval effectiveness– compression performance

Page 11: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 11

Successor Variety

• Definition (successor variety of a string)the number of different characters that follow it in words in some body of text

• Examplea body of text: able, axle, accident, ape, aboutsuccessor variety of apple1st: 4 (b, x, c, p)2nd: (e)

Page 12: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 12

Successor Variety (Continued)• IdeaThe successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase.

• ExampleTest word: READABLECorpus: ABLE, BEATABLE, FIXABLE, READ,

READABLE, READING, RED, ROPE, RIPEPrefix Successor Variety LettersR 3 E, O, IRE 2 A, DREA 1 DREAD 3 A, I, SREADA 1 BREADAB 1 LREADABL 1 EREADABLE 1 blank

Page 13: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 13

The successor variety stemming process

• Determine the successor variety for a word.• Use this information to segment the word.

– cutoff methoda boundary is identified whenever the cutoff value is reached

– peak and plateau methoda character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it

– complete word methoda segment is a complete word

– entropy method

• Select one of the segments as the stem.

Page 14: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 14

n-gram stemmers

• diagrama pair of consecutive letters

• shared diagram methodassociation measures are calculated between pairs of terms

where A: the number of unique diagrams in the first word, B: the number of unique diagrams in the second, C: the number of unique diagrams shared by A

and B.

SC

A B

2

Page 15: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 15

n-gram stemmers (Continued)

• Examplestatistics => st ta at ti is st ti ic csunique diagrams => at cs ic is st ta tistatistical => st ta at ti is st ti ic ca alunique diagrams => al at ca ic is st ta ti

SC

A B

2 2 6

7 80 80

*.

Page 16: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 16

n-gram stemmers (Continued)

• similarity matrixdetermine the semantic measures for all pairs of terms in the database

word1 word2 word3 ... wordn-1

word1

wrod2 S21

word3 S31 S32

.

.Wordn Sn1 Sn2 Sn3 … Sn(n-1)

• terms are clustered using a single link clustering method

Page 17: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 17

Affix Removal Stemmers

• procedureRemove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem.

• example: plural formsIf a word ends in “ies” but not “eies” or “aies”

then “ies” --> “y”If a word ends in “es” but not “aes”, “ees”, or “oes”

then “es” --> “e”If a word ends in “s”, but not “us” or “ss”

then “s” --> NULL

• ambiguity

Page 18: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 18

Affix Removal Stemmers (Continued)

• longest match stemmerremove the longest possible string of characters from a word according to a set of rules

– recoding: AxC--> AyC, e.g., ki --> ky

– partial matching: only n initial characters of stems are used in comparing

• different versionsLovins, Slaton, Dawson, Porter, …Students can refer to the rules listed in the text book.

Page 19: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 19

Thesaurus Constructions

Page 20: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 20

Thesaurus Construction• IR thesaurus

a list of terms (words or phrases) along with relationships among them physics, EE, electronics, computer and control

• INSPEC thesaurus (1979)

cesium (銫, Cs) USE caesium (the preferred form) computer-aided instructionsee also education (cross-referenced terms) UF teaching machines (a set of alternatives)

BT educational computing (broader terms, cf. NT) TT computer applications (root node/top term)

RT education (related terms)

teachingCC C7810C (subject area)

FC C7810Cf (subject area)

Page 21: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 21

Usage

• IndexingSelect the most appropriate thesaurus entries for representing the document.

• SearchingDesign the most appropriate search strategy.

– If the search does not retrieve enough documents, the thesaurus can be used to expand the query.

– If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.

Page 22: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 22

Features of Thesauri• Coordination Level

– pre-coordination: phrases• phrases are available for indexing and retrieval

• advantage: reducing ambiguity in indexing and searching

• disadvantage: searcher has to be know the phrase formulation rules

– post-coordination: words• phrases are constructed while searching

• advantage: users do not worry about the exact word ordering

• disadvantage: the search precision may fall, e.g.,library school vs. school library

– immediate level: phrases and single words• the higher the level of coordination, the greater the precision of the

vocabulary but the larger the vocabulary size

Page 23: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 23

Features of Thesauri (Continued)

• Term Relationships– Aitchison and Gilchrist (1972)

• equivalence relationships– synonymy: trade names, popular and local usage, superseded

terms

– quasi-synonymy, e.g., harshness and tenderness

• hierarchical relationships, e.g., genus-species

• nonhierarchical relationships, e.g., thing-part, thing-attribute

Page 24: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 24

Features of Thesauri (Continued)

– Wang, Vandendorpe, and Evens (1985)• parts-wholes, e.g., set-element, count-mass

• collocation relations: words that frequently co-occur in the same phrase or sentence

• paradigmatic relations: words that have the same semantic core, e.g., “moon” and “lunar”

• taxonomy and synonymy

• antonymy relations

Page 25: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 25

Features of Thesauri (Continued)

• Number of entries for each term– homographs: words with multiple meanings

– each homograph entry is associated with its own set of relations

– problem: how to select between alternative meanings

• Specificity of vocabulary– the precision associated with the component terms

– a highly specific vocabulary promotes precision in retrieval

Page 26: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 26

Features of Thesauri (Continued)

• Control on term frequency of class members– for statistical thesaurus construction methods

– terms included in the same thesaurus class have roughly equal frequencies

– the total frequency in each class should also be roughly similar

• Normalization of vocabulary– terms should be in noun form

– noun phrases should avoid prepositions unless they are commonly known

– a limited number of adjectives should be used

– ...

Page 27: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 27

Thesaurus Construction

• manual thesaurus construction– define the boundaries of the subject area– collect the terms for each subarea

sources: indexes, encyclopedias, handbooks, textbooks, journal titles and abstracts, catalogues, ...

– organize the terms and their relationship into structures

– review (and refine) the entire thesaurus for consistency

• automatic thesaurus construction– from a collection document items

– by merging existing thesaurus

Page 28: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 28

Thesaurus Construction from Texts

1. Construction of vocabulary normalization and selection of terms phrase construction depending on the coordination level desired2. Similarity computations between terms identify the significant statistical associations between terms3. Organization of vocabulary organize the selected vocabulary into a hierarchy on the basis of the associations computed in step 2.

Page 29: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 29

Construction of Vocabulary

• Objectiveidentify the most informative terms (words and phrases)

• Procedure(1) Identify an appropriate document collection. The document collection should be sizable and representative of the subject area.(2) Determine the required specificity for the thesaurus.(3) Normalize the vocabulary terms. (a) Eliminate very trivial words such as prepositions and conjunctions. (b) Stem the vocabulary. (4) Select the most interesting stems, and create interesting phrases for a higher coordination level.

Page 30: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 30

Stem evaluation and selection

• selection by frequency of occurrence– each term may belong to category of high, medium or

low frequency

– terms in the mid-frequency range are the best for indexing and searching

Page 31: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 31

Stem evaluation and selection (Continued)

• selection by discrimination value (DV)– the more discriminating a term, the higher its value as

an index term

– procedure• compute the average inter-document similarity in the

collection

• Remove the term K from the indexing vocabulary, and recompute the average similarity

• DV(K)=(average similarity without K)-(average similarity with k)

• The DV for good discriminators is positive.

Page 32: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 32

Phrase Construction

• Salton and McGill procedure1. Compute pairwise co-occurrence for high-frequency words.2. If this co-occurrence is lower than a threshold, then do not consider the pair any further.3. For pairs that qualify, compute the cohesion value. COHESION(ti, tj)= co-occurrence-frequency/(sqrt(frequency(ti)*frequency(tj))) COHESION(ti, tj)=size-factor* co-occurrence-frequency/(frequency(ti)*frequency(tj)) where size-factor is the size of thesaurus vocabulary 4. If cohesion is above a second threshold, retain the phrase

Page 33: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 33

Phrase Construction (Continued)

• Choueka Procedure1. Select the range of length allowed for each collocational expression. E.g., 2-6 wsords

2. Build a list of all potential expressions from the collection with the prescribed length that have a minimum frequency.3. Delete sequences that begin or end with a trivial word (e.g., prepositions, pronouns, articles, conjunctions, etc.)

4. Delete expressions that contain high-frequency nontrivial words.5. Given an expression, evaluate any potential sub-expressions for relevance. Discard any that are not sufficiently relevant.6. Try to merge smaller expressions into larger and more meaningful ones.

Page 34: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 34

Similarity Computation

• Cosinecompute the number of documents associated with both terms divided by the square root of the product of the number of documents associated with the first term and the number of documents associated with the second term.

• Dicecompute the number of documents associated with both terms divided by the sum of the number of documents associated with one term and the number associated with the other.

Page 35: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 35

Vocabulary Organization

Assumptions: (1) high-frequency words have broad meaning, while low-frequency words have narrow meaning. (2) if the density functions of twoterms have the same shape, then the two words have similar meaning.1. Identify a set of frequency ranges.2. Group the vocabulary terms into different classes based on their frequencies and the ranges selected in step 1.3. The highest frequency class is assigned level 0, the next, level 1, and so on.4. Parent-child links are determined between adjacent levels as follows. For each term t in level i, compute similarity between t and every term in level i-1. Term t becomes the child of the most similar term in level i-1. If more than one term in level i-1 qualifies for this, then each becomes a parent of t. In other words, a term is allowed to have multiple parents.5. After all terms in level i have been linked to level i-1 terms, check level i-1terms and identify those that have no children. Propagate such terms to level i by creating an identical “dummy” term as its child.6. Perform steps 4 and 5 for each level starting with level.

Page 36: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 36

Merging Existing Thesauri

• simple mergelink hierarchies wherever they have terms in common

• complex merge– link terms from different hierarchies if they are similar

enough.

– similarity is a function of the number of parent and child terms in common

Page 37: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 37

Relevance Feedback and Other Query Modification Techniques

Page 38: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 38

Paraphrase Problem in IR

• Users often input queries containing terms that do not match the terms used to index the majority of the relevant documents.

• relevance feedback and query modification– reweighting of the query terms based on the

distribution of these terms in the relevant and nonrelevant documents retrieved in response to those queries

– changing the actual terms in the query

Page 39: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 39

Early Research

• Rocchio model

– Q0: the vector for the initial query

– Ri: the vector for relevant document i

– Si: the vecotr for nonrelevant document I

– n1: the number of relevant documents

– n2: the number of nonrelevant documents

nn

ii

ii SnRn

QQ21

121101

11

Page 40: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 40

Early Research (Continued)

– Rocchio constraint• only allowing terms to be in Q1 if they either were in Q0 or

occurred in at least half the relevant documents and in more relevant than nonrelevant documents

– Ide constraints• Ricchio formula minus the normalization for the number of

relevant and nonrelevant documents

• allow feedback from relevant documents

• allow limited negative feedback from only the highest-ranked nonrelevant document

• the relevant only strategies worked best for some queries, and other queries did better using negative feedback in addition

Page 41: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 41

Evaluation of relevance feedback

• Standard evaluation (i.e., recall-precision) method is not suitable, because the relevant documents used to reweight the query terms moving to higher ranks.

• The residual collection method– the evaluation of the results compares only the residual

collections, i.e., the initial run is remade minus the documents previously shown the user and this is compared with the feedback run minus the same documents

Page 42: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 42

Research in term reweighting without query expansion

• Distribution of query terms in relevant and nonrelevant documents

– Wij: the term weight for term i in query j

– r: the number of relevant documents for query j having term i

– R: the total number of relevant documents for query j

– n: the number of documents in the collection having term i

– N: the number of documents in the collection

rRnNrnrR

r

Wij

log

2

Nn N-n

relevant R documents

r R-r

n-r N-n-R+r

Page 43: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 43

Research in term reweighting without query expansion (Continued)

• Sparck Jones– a user sees only a few relevant documents in

the initial set of retrieved documents– those documents are the only ones available to

the weighting scheme– add a constant to the above formula to handle

situations in which query terms appeared none of the retrieved documents

Page 44: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 44

• Croft’s methodInitial searchFeedback

• Wijk: the term weight for term i in query j and document kIDFi: the IDF weight for term i in the entire collectionpij: the probability that term i is assigned within the set of relevant documents for query j

qij: the probability that term i is assigned within the set of nonrelevant documents for query j

r: the number of relevant documents for query j having term iR: the total number of relevant documents for query jn: the number of documents in the collection having term iN: the number of documents in the collection

ijk i ikW C IDF f ( ) *

ijkij ij

ij ijikW C

p q

p qf

( log( )

( )) *

1

1

ijp r

R

0 5

100 01

.

..

where r > 0

where r = 0

ijq n r

N R

0 5

10

.

.

Page 45: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 45

Query Expansion without Term Reweighting

• Query expansion can be done using a thesaurus that adds synonyms, broader terms, and other appropriate words

• Query expansion can also be done using list of terms from relevant documents – significant differences between the ranking methods

– too many terms from sorted list decreased performance

– using simulated perfect user selection produces improvement over methods without user selection

Page 46: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 46

Query Expansion with Term Reweighting

• Salton and Buckley methods– Ide Regular– Ide dec-hi– Standard Rocchio

– where Q0: the vector for the initial queryRi: the vector for relevant document iSi: the vector for nonrelevant document in1: the number of relevant documentsn2: the number of nonrelevant documents

1 01 1

1 2Q Q Rn

Sn

ii

ii

1 01

1Q Q Rn

Sii

i

1 011 21

1 2Q Q Rn

n Sn

ni

i

i

i

Page 47: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 47

Relevance Feedback Methodologies

• Issues– the type of retrieval system being used for initial

searching• Boolean-based systems

• systems based on using a vector space model

• systems based on ranking using either an ad hoc combination of term-weighting or using the probabilistic indexing methods

– the type of data that is being used• the length of the documents (short or not short)

• the type of indexing (controlled or full text)

Page 48: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 48

Feedback in Boolean retrieval systems

• options– front-end construction or specially modified

Boolean systems• require major modifications to the entire systems and can

be difficult to tune• offer the greatest improvement in performance

– produce ranked lists of terms for user selection• improve the query by suggesting alternative terms• show the user the term distribution in the retrieved set of

documents

Page 49: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 49

Feedback in retrieval systems based on the vector space model

• Combine term reweighting and query expansion

• Ide de-chi method ( ) is the best general purpose feedback technique

• the use of normalized vector space document weighting is highly recommended

• add only a limited number of terms (mainly to improve response time)

1 01

1Q Q Rn

Sii

i

Page 50: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 50

Feedback in retrieval systems based on the vector space model

(Continued)

– The terms to be added to the query are automatically pulled from a sorted list of new terms taken from relevant documents

– These terms are sorted by their total frequency within all retrieved relevant documents

– The number of terms added is the average number of terms in the retrieved relevant documents

– alternative: substitute user selection from the top of the list rather than adding a fixed number of terms

Page 51: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 51

Feedback in retrieval systems based on other types of statistical ranking

• The term-weighting and query expansion can be viewed as two separate components of feedback, with no specific relationship

Page 52: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 52

Term-weighting scheme

• global importance– the reweighting schemes only affect the

weighting based on the importance of a term within a given document

• local importance– term-weighting based on term importance

within a given document

• Robertson-Jones weighting formula

Page 53: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 53

Initial search ijk i ikW C IDF f ( ) *

ijkij ij

ij ijikW C

p q

p qf

( log( )

( )) *

1

1Feedback

ijp r

R

0 5

100 01

.

..

where r > 0

where r = 0

ijq n r

N R

0 5

10

.

.

freq

freqKKf

k

ikik max

)1( freqik:the frequency of term i in document kmaxfreqk: the maximum frequency in document k

C=0 for automatically indexed collections or for feedback searching (allow IDF or the relevance weighting to be the dominant factor)C>0 for manually indexed collections (allow the mere existence of a term within a document to carry more weight)K=0.3 for initial search of regular length documents (documents having many multiple occurrences of a term)K=0.5 for feedback searchesK=1 for short documents: the within-document frequency is removed (the within-document frequency plays a minimum role)

Page 54: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 54

Query Expansion

• Query expansion by related terms

• Query expansion by terms from relevant documents

Page 55: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 55

Ranking Algorithms

Hsin-Hsi Chen

Page 56: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 56

Ranking Models

• Vector Space Model

tdij: the ith term in the vector for document jtqik: the ith term in the vector for query kn: the number of unique terms in the data set

n

i

n

i

n

i

ikij

ikij

kj

tqtd

tqtdqdsimilarity

1 1

1

22

)(),(

Page 57: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 57

Ranking Models (Continued)

• Probabilistic Model– terms that appear in previously retrieved

documents for a given query should be given a higher weight than if they had not appeared in those relevant documents

Page 58: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 58

Document Relevance

DocumentIndexing

+

-

+ -

r

R-r

R

n-r

N-n-R+r

N-R

n

N-n

N

N: the number of documents in the collectionR: the number of relevant documents for query qn: the number of documents having term tr: the number of relevant documents having term trelative distribution of terms in the relevant and nonrelevant documents

)(

)(log1

NnRr

w )(

)(log2

RNrn

Rr

w

)(

)(log3

nNn

rRr

w

)(

)(log4

rRnNrnrR

r

w

Page 59: Query and Document Operations - 1 Terms and Query Operations Hsin-Hsi Chen.

Query and Document

Operations - 59

(1) Croft and Harper

))(

log(1

Q

i i

ijk

n

nNCsimilarity

Q: the number of matching terms between document j and query kC: a constant for tuning the similarity functionni: the number of documents having term i in the data setN: the number of documents in the data set

(2) Croft

))(1

fIDFCsimilarity ij

Q

iijk