Transcript of cs707_013112

Slide 1/37: Vector Space Model: TF-IDF

Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Slide 2/37: Recap of last lecture

• Collection and vocabulary statistics
  • Heaps' and Zipf's laws
• Dictionary compression for Boolean indexes
  • Dictionary string, blocks, front coding
• Postings compression
  • Gap encoding using prefix-unique codes
  • Variable-Byte and Gamma codes

Slide 3/37: This lecture; Sections 6.2-6.4.3

• Scoring documents
• Term frequency
• Collection statistics
• Weighting schemes
• Vector space scoring

Slide 4/37: Ranked retrieval

• Thus far, our queries have all been Boolean.
  • Documents either match or don't.
• Good for expert users with a precise understanding of their needs and the collection (e.g., library search).
• Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
  • Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  • Most users don't want to wade through 1000s of results (e.g., web search).

Slide 5/37: Problem with Boolean search: feast or famine

• Boolean queries often result in either too few (=0) or too many (1000s) results.
  • Query 1: "standard user dlink 650" → 200,000 hits
  • Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes skill to come up with a query that produces a manageable number of hits.
• With a ranked list of documents, it does not matter how large the retrieved set is.

Slide 6/37: Scoring as the basis of ranked retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.
• How can we rank-order the documents in the collection with respect to a query?
• Assign a score, say in [0, 1], to each document.
• This score measures how well document and query match.

Slide 7/37: Query-document matching scores

• We need a way of assigning a score to a query/document pair.
• Let's start with a one-term query.
  • If the query term does not occur in the document: the score should be 0.
  • The more frequent the query term in the document, the higher the score (should be).
• We will look at a number of alternatives for this.

Slide 8/37: Take 1: Jaccard coefficient

• Recall: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B:
  • jaccard(A, B) = |A ∩ B| / |A ∪ B|
  • jaccard(A, A) = 1
  • jaccard(A, B) = 0 if A ∩ B = ∅
• A and B don't have to be the same size.
• The Jaccard coefficient always assigns a number between 0 and 1.

Slide 9/37: Jaccard coefficient: Scoring example

• What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
  • Query: "ides of march"
  • Document 1: "caesar died in march"
  • Document 2: "the long march"
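A minimal sketch of this Jaccard scoring, assuming simple whitespace tokenization (the function and variable names below are illustrative, not from the slides):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.17 (only "march" is shared)
print(jaccard(query, doc2))  # 1/5 = 0.2
```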

Slide 10/37: Issues with Jaccard for scoring

• It doesn't consider term frequency (how many times a term occurs in a document).
• It doesn't consider document/collection frequency (rare terms in a collection are more informative than frequent terms).
• We need a more sophisticated way of normalizing for length.
• Later in this lecture, we'll use |A ∩ B| / √(|A ∪ B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Slide 11/37: Recall (Lecture 1): Binary term-document incidence matrix

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Slide 12/37: Term-document count matrices

• Consider the number of occurrences of a term in a document:
• Each document is a count vector in ℕ^|V|: a column below.

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              157                 73            0          0       0        0
Brutus                4                157            0          1       0        0
Caesar              232                227            0          2       1        1
Calpurnia             0                 10            0          0       0        0
Cleopatra            57                  0            0          0       0        0
mercy                 2                  0            3          5       5        1
worser                2                  0            1          1       1        0
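A small sketch of how such count vectors can be built from raw text, assuming whitespace tokenization (helper names are illustrative):

```python
from collections import Counter

def count_matrix(docs: dict[str, str]) -> tuple[list[str], dict[str, list[int]]]:
    """Return the sorted vocabulary and one count vector (a column) per document."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocab = sorted({t for c in counts.values() for t in c})
    return vocab, {name: [c[t] for t in vocab] for name, c in counts.items()}

vocab, cols = count_matrix({"d1": "caesar died in march", "d2": "the long march"})
print(vocab)        # ['caesar', 'died', 'in', 'long', 'march', 'the']
print(cols["d1"])   # [1, 1, 1, 0, 1, 0]
print(cols["d2"])   # [0, 0, 0, 1, 1, 1]
```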

Slide 13/37: Bag of words model

• The vector representation doesn't consider the ordering of words in a document.
• "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish these two documents.
• We will look at recovering positional information later in this course.

Slide 14/37: Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
  • A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.
  • But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.

Slide 15/37: Log-frequency weighting

• The log frequency weight of term t in d is:

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0
  w_{t,d} = 0                    otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

• The score is 0 if none of the query terms is present in the document.
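A minimal sketch of the log-frequency weight and the resulting overlap score (tokenization and names are illustrative):

```python
import math
from collections import Counter

def log_tf(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query: str, doc: str) -> float:
    """Sum of log-tf weights over query terms that occur in the document."""
    tf = Counter(doc.lower().split())
    return sum(log_tf(tf[t]) for t in set(query.lower().split()))

print(log_tf(1), log_tf(2), log_tf(10), log_tf(1000))   # 1.0, ~1.3, 2.0, 4.0
print(overlap_score("best car insurance", "car insurance auto insurance"))
# car: 1.0, insurance: 1 + log10(2) ≈ 1.3, best: absent -> total ≈ 2.3
```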

Slide 16/37: Document frequency

• Rare terms are more informative than frequent terms.
  • Recall stop words.
• Consider a term in the query that is rare in the collection (e.g., arachnocentric).
• A document containing this term is very likely to be relevant to the query "arachnocentric".
• We want a higher weight for rare terms like arachnocentric.

Slide 17/37: Document frequency, continued

• Consider a query term that is frequent in the collection (e.g., high, increase, line).
• A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
• For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
• We will use document frequency (df) to capture this in the score.
• df (≤ N) is the number of documents that contain the term.

Slide 18/37: idf weight

• df_t is the document frequency of t: the number of documents that contain t.
• df is a measure of the informativeness of t.
• We define the idf (inverse document frequency) of t by:

  idf_t = log10(N / df_t)

• We use log(N/df_t) instead of N/df_t to dampen the effect of idf.

It will turn out that the base of the log is immaterial.

Slide 19/37: idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia           1     6
animal            100     4
sunday          1,000     3
fly            10,000     2
under         100,000     1
the         1,000,000     0

There is one idf value for each term t in a collection.
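The table can be reproduced with a one-line idf function (a sketch; N and the df values are the ones given above):

```python
import math

def idf(N: int, df: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(N, df))   # 6, 4, 3, 2, 1, 0
```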

Slide 20/37: Collection vs. document frequency

• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
• Which word is a better search term (and should get a higher weight)?

Word        Collection frequency   Document frequency
insurance         10440                  3997
try               10422                  8760

Slide 21/37: tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

• Best known weighting scheme in information retrieval.
  • Note: the "-" in tf-idf is a hyphen, not a minus sign!
  • Alternative names: tf.idf, tf × idf.
• Increases with the number of occurrences within a document.
• Increases with the rarity of the term in the collection.
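A sketch of the tf-idf weight as defined above (a term that is absent from the document gets weight 0):

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df)."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring twice in a document and in 1,000 of 1,000,000 documents:
print(tf_idf(tf=2, df=1_000, N=1_000_000))  # (1 + 0.30) * 3.0 ≈ 3.9
```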

Slide 22/37: Binary → count → weight matrix

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             5.25               3.18            0          0       0       0.35
Brutus             1.21               6.1             0          1       0       0
Caesar             8.59               2.54            0          1.51    0.25    0
Calpurnia          0                  1.54            0          0       0       0
Cleopatra          2.85               0               0          0       0       0
mercy              1.51               0               1.9        0.12    5.25    0.88
worser             1.37               0               0.11       4.15    0.25    1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Slide 23/37: Documents as vectors

• So we have a |V|-dimensional vector space.
• Terms are axes of the space.
• Documents are points or vectors in this space.
• Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.
• These are very sparse vectors: most entries are zero.

Slide 24/37: Queries as vectors

• Key idea 1: Do the same for queries: represent them as vectors in the space.
• Key idea 2: Rank documents according to their proximity to the query in this space.
  • proximity = similarity of vectors
  • proximity ≈ inverse of distance
• Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than less relevant documents.

Slide 25/37: Formalizing vector space proximity

• First cut: distance between two points
  • (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . .
• . . . because Euclidean distance is large for vectors of different lengths.

Slide 26/37: Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Slide 27/37: Use angle instead of distance

• Thought experiment: take a document d and append it to itself. Call this document d′.
• Semantically d and d′ have the same content.
• The Euclidean distance between the two documents can be quite large.
• The angle between the two documents is 0, corresponding to maximal similarity.
• Key idea: Rank documents according to angle with query.

Slide 28/37: From angles to cosines

• The following two notions are equivalent:
  • Rank documents in decreasing order of the angle between query and document.
  • Rank documents in increasing order of cosine(query, document).
• Cosine is a monotonically decreasing function on the interval [0°, 180°].

Slide 29/37: Length normalization

• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

  ||x||_2 = √( Σ_i x_i² )

• Dividing a vector by its L2 norm makes it a unit (length) vector.
• Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
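A sketch of L2 length-normalization; appending a document to itself doubles its raw counts, so the two normalized vectors coincide (raw counts are used here for simplicity):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Divide each component by the vector's L2 norm, giving a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

d = [3.0, 4.0]           # ||d||_2 = 5
d_prime = [6.0, 8.0]     # d appended to itself: every count doubles
print(l2_normalize(d))        # [0.6, 0.8]
print(l2_normalize(d_prime))  # [0.6, 0.8] -- identical after normalization
```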

Slide 30/37: cosine(query, document)

  cos(q, d) = (q · d) / (|q| |d|)
            = (q / |q|) · (d / |d|)
            = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

where q · d is the dot product and q/|q|, d/|d| are unit vectors.

• q_i is the tf-idf weight of term i in the query.
• d_i is the tf-idf weight of term i in the document.
• cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
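A direct sketch of the cosine formula above, for two weight vectors over the same vocabulary:

```python
import math

def cosine(q: list[float], d: list[float]) -> float:
    """cos(q, d): dot product divided by the product of the L2 norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine([1.0, 1.0, 0.0], [2.0, 0.0, 1.0]))  # ≈ 0.63
```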

Slide 31/37: Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS    PaP    WH
affection   115     58    20
jealous      10      7    11
gossip        2      0     6
wuthering     0      0    38

Slide 32/37: 3 documents example contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
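The three cosine values can be reproduced from the raw counts (the WH count for "jealous", 11, is recovered from its log weight 2.04 above):

```python
import math

def log_tf_vector(counts: list[int]) -> list[float]:
    """Apply the log-frequency weight 1 + log10(tf) componentwise (0 stays 0)."""
    return [1 + math.log10(c) if c > 0 else 0.0 for c in counts]

def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# counts over (affection, jealous, gossip, wuthering)
sas = normalize(log_tf_vector([115, 10, 2, 0]))
pap = normalize(log_tf_vector([58, 7, 0, 0]))
wh  = normalize(log_tf_vector([20, 11, 6, 38]))

dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # unit vectors, so dot = cosine
print(round(dot(sas, pap), 2), round(dot(sas, wh), 2), round(dot(pap, wh), 2))
# 0.94 0.79 0.69
```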

Slide 33/37: Computing cosine scores
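The transcript does not reproduce the algorithm shown on this slide; below is a sketch of the usual term-at-a-time cosine scoring it corresponds to, assuming an in-memory postings map from term to (doc_id, w_{t,d}) pairs and a precomputed document-length table (all names are illustrative):

```python
import heapq

def cosine_scores(query_weights: dict[str, float],
                  postings: dict[str, list[tuple[int, float]]],
                  length: dict[int, float],
                  k: int = 10) -> list[tuple[float, int]]:
    """Accumulate w_{t,d} * w_{t,q} term by term, divide by document length, return top K."""
    scores: dict[int, float] = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    return heapq.nlargest(k, ((s / length[d], d) for d, s in scores.items()))
```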

Slide 34/37: tf-idf weighting has many variants

Columns headed "n" are acronyms for weight schemes.

Why is the base of the log in idf immaterial?

Slide 35/37: Weighting may differ in queries vs documents

• Many search engines allow for different weightings for queries vs documents.
• To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.
• Example: ltn.lnc means:
  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.
  • Document: logarithmic tf, no idf, and cosine normalization.

Is this a bad idea?

Slide 36/37: tf-idf example: ltn.lnc

Query: best car insurance
Document: car insurance auto insurance

                 ----------- Query -----------      -------- Document --------
Term             tf-raw  tf-wt  df       idf   wt   tf-raw  tf-wt  wt    n'lized    Prod
auto             0       0      5,000    2.3   0    1       1      1     0.52       0
best             1       1      50,000   1.3   1.3  0       0      0     0          0
car              1       1      10,000   2.0   2.0  1       1      1     0.52       1.04
insurance        1       1      1,000    3.0   3.0  2       1.3    1.3   0.68       2.04

Exercise: what is N, the number of docs?
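A sketch reproducing the score for this example: the query is weighted ltn (log tf × idf, no normalization) and the document lnc (log tf, no idf, cosine-normalized), as defined on the previous slide:

```python
import math

def log_tf(tf: int) -> float:
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # implied by the df/idf pairs in the table (e.g. df=1,000 -> idf=3.0)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

# ltn query weights: log tf * idf, no normalization
q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# lnc document weights: log tf, no idf, L2 (cosine) normalization
d_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d = {t: w / norm for t, w in d_raw.items()}

score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(score, 2))  # ≈ 3.07 (with the table's rounded weights: 1.04 + 2.04 ≈ 3.08)
```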

Slide 37/37: Summary: vector space ranking

• Represent the query as a weighted tf-idf vector.
• Represent each document as a weighted tf-idf vector.
• Compute the cosine similarity score for the query vector and each document vector.
• Rank documents with respect to the query by score.
• Return the top K (e.g., K = 10) to the user.