Transcript of cs707_013112

Slide 1/37: Vector Space Model: TF-IDF

Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Slide 2/37: Recap of last lecture

• Collection and vocabulary statistics
  • Heaps' and Zipf's laws
• Dictionary compression for Boolean indexes
  • Dictionary string, blocks, front coding
• Postings compression
  • Gap encoding using prefix-unique codes
  • Variable-Byte and Gamma codes

Slide 3/37: This lecture; Sections 6.2-6.4.3

• Scoring documents
• Term frequency
• Collection statistics
• Weighting schemes
• Vector space scoring

Slide 4/37: Ranked retrieval

• Thus far, our queries have all been Boolean.
  • Documents either match or don't.
• Good for expert users with a precise understanding of their needs and the collection (e.g., library search).
• Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users.
  • Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  • Most users don't want to wade through 1000s of results (e.g., web search).

Slide 5/37: Problem with Boolean search: feast or famine

• Boolean queries often result in either too few (=0) or too many (1000s) results.
  • Query 1: "standard user dlink 650" → 200,000 hits
  • Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes skill to come up with a query that produces a manageable number of hits.
• With a ranked list of documents, it does not matter how large the retrieved set is.

Slide 6/37: Scoring as the basis of ranked retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.
• How can we rank-order the documents in the collection with respect to a query?
• Assign a score, say in [0, 1], to each document.
• This score measures how well document and query match.

Slide 7/37: Query-document matching scores

• We need a way of assigning a score to a query/document pair.
• Let's start with a one-term query.
  • If the query term does not occur in the document: the score should be 0.
  • The more frequent the query term in the document, the higher the score (should be).
• We will look at a number of alternatives for this.

Slide 8/37: Take 1: Jaccard coefficient

• Recall: the Jaccard coefficient is a commonly used measure of the overlap of two sets A and B:
  • jaccard(A, B) = |A ∩ B| / |A ∪ B|
  • jaccard(A, A) = 1
  • jaccard(A, B) = 0 if A ∩ B = ∅
• A and B don't have to be the same size.
• The Jaccard coefficient always assigns a number between 0 and 1.

Slide 9/37: Jaccard coefficient: Scoring example

• What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
  • Query: "ides of march"
  • Document 1: "caesar died in march"
  • Document 2: "the long march"
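A minimal sketch of this Jaccard scoring, assuming simple whitespace tokenization (the function and variable names below are illustrative, not from the slides):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.17 (only "march" is shared)
print(jaccard(query, doc2))  # 1/5 = 0.2
```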

Slide 10/37: Issues with Jaccard for scoring

• It doesn't consider term frequency (how many times a term occurs in a document).
• It doesn't consider document/collection frequency (rare terms in a collection are more informative than frequent terms).
• We need a more sophisticated way of normalizing for length.
• Later in this lecture, we'll use |A ∩ B| / √(|A ∪ B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Slide 11/37: Recall (Lecture 1): Binary term-document incidence matrix

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|.

Slide 12/37: Term-document count matrices

• Consider the number of occurrences of a term in a document:
• Each document is a count vector in ℕ^|V|: a column below.

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              157                 73            0          0       0        0
Brutus                4                157            0          1       0        0
Caesar              232                227            0          2       1        1
Calpurnia             0                 10            0          0       0        0
Cleopatra            57                  0            0          0       0        0
mercy                 2                  0            3          5       5        1
worser                2                  0            1          1       1        0
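A small sketch of how such count vectors can be built from raw text, assuming whitespace tokenization (helper names are illustrative):

```python
from collections import Counter

def count_matrix(docs: dict[str, str]) -> tuple[list[str], dict[str, list[int]]]:
    """Return the sorted vocabulary and one count vector (a column) per document."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocab = sorted({t for c in counts.values() for t in c})
    return vocab, {name: [c[t] for t in vocab] for name, c in counts.items()}

vocab, cols = count_matrix({"d1": "caesar died in march", "d2": "the long march"})
print(vocab)        # ['caesar', 'died', 'in', 'long', 'march', 'the']
print(cols["d1"])   # [1, 1, 1, 0, 1, 0]
print(cols["d2"])   # [0, 0, 0, 1, 1, 1]
```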

Slide 13/37: Bag of words model

• The vector representation doesn't consider the ordering of words in a document.
• "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish these two documents.
• We will look at recovering positional information later in this course.

Slide 14/37: Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
  • A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.
  • But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.

Slide 15/37: Log-frequency weighting

• The log frequency weight of term t in d is:

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0
  w_{t,d} = 0                    otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

• The score is 0 if none of the query terms is present in the document.
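A minimal sketch of the log-frequency weight and the resulting overlap score (tokenization and names are illustrative):

```python
import math
from collections import Counter

def log_tf(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query: str, doc: str) -> float:
    """Sum of log-tf weights over query terms that occur in the document."""
    tf = Counter(doc.lower().split())
    return sum(log_tf(tf[t]) for t in set(query.lower().split()))

print(log_tf(1), log_tf(2), log_tf(10), log_tf(1000))   # 1.0, ~1.3, 2.0, 4.0
print(overlap_score("best car insurance", "car insurance auto insurance"))
# car: 1.0, insurance: 1 + log10(2) ≈ 1.3, best: absent -> total ≈ 2.3
```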

Slide 16/37: Document frequency

• Rare terms are more informative than frequent terms.
  • Recall stop words.
• Consider a term in the query that is rare in the collection (e.g., arachnocentric).
• A document containing this term is very likely to be relevant to the query "arachnocentric".
• We want a higher weight for rare terms like arachnocentric.

Slide 17/37: Document frequency, continued

• Consider a query term that is frequent in the collection (e.g., high, increase, line).
• A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
• For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
• We will use document frequency (df) to capture this in the score.
• df (≤ N) is the number of documents that contain the term.

Slide 18/37: idf weight

• df_t is the document frequency of t: the number of documents that contain t.
• df is a measure of the informativeness of t.
• We define the idf (inverse document frequency) of t by:

  idf_t = log10(N / df_t)

• We use log(N/df_t) instead of N/df_t to dampen the effect of idf.

It will turn out that the base of the log is immaterial.

Slide 19/37: idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia           1     6
animal            100     4
sunday          1,000     3
fly            10,000     2
under         100,000     1
the         1,000,000     0

There is one idf value for each term t in a collection.
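The table can be reproduced with a one-line idf function (a sketch; N and the df values are the ones given above):

```python
import math

def idf(N: int, df: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(N, df))   # 6, 4, 3, 2, 1, 0
```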

Slide 20/37: Collection vs. document frequency

• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
• Which word is a better search term (and should get a higher weight)?

Word        Collection frequency   Document frequency
insurance         10440                  3997
try               10422                  8760

Slide 21/37: tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

• Best known weighting scheme in information retrieval.
  • Note: the "-" in tf-idf is a hyphen, not a minus sign!
  • Alternative names: tf.idf, tf × idf.
• Increases with the number of occurrences within a document.
• Increases with the rarity of the term in the collection.
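A sketch of the tf-idf weight as defined above (a term that is absent from the document gets weight 0):

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    """tf-idf weight: (1 + log10 tf) * log10(N / df)."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring twice in a document and in 1,000 of 1,000,000 documents:
print(tf_idf(tf=2, df=1_000, N=1_000_000))  # (1 + 0.30) * 3.0 ≈ 3.9
```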

Slide 22/37: Binary → count → weight matrix

term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             5.25               3.18            0          0       0       0.35
Brutus             1.21               6.1             0          1       0       0
Caesar             8.59               2.54            0          1.51    0.25    0
Calpurnia          0                  1.54            0          0       0       0
Cleopatra          2.85               0               0          0       0       0
mercy              1.51               0               1.9        0.12    5.25    0.88
worser             1.37               0               0.11       4.15    0.25    1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Slide 23/37: Documents as vectors

• So we have a |V|-dimensional vector space.
• Terms are axes of the space.
• Documents are points or vectors in this space.
• Very high-dimensional: hundreds of millions of dimensions when you apply this to a web search engine.
• These are very sparse vectors: most entries are zero.

Slide 24/37: Queries as vectors

• Key idea 1: Do the same for queries: represent them as vectors in the space.
• Key idea 2: Rank documents according to their proximity to the query in this space.
  • proximity = similarity of vectors
  • proximity ≈ inverse of distance
• Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than less relevant documents.

Slide 25/37: Formalizing vector space proximity

• First cut: distance between two points
  • (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . .
• . . . because Euclidean distance is large for vectors of different lengths.

Slide 26/37: Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Slide 27/37: Use angle instead of distance

• Thought experiment: take a document d and append it to itself. Call this document d′.
• Semantically d and d′ have the same content.
• The Euclidean distance between the two documents can be quite large.
• The angle between the two documents is 0, corresponding to maximal similarity.
• Key idea: Rank documents according to angle with query.

Slide 28/37: From angles to cosines

• The following two notions are equivalent:
  • Rank documents in decreasing order of the angle between query and document.
  • Rank documents in increasing order of cosine(query, document).
• Cosine is a monotonically decreasing function on the interval [0°, 180°].

Slide 29/37: Length normalization

• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

  ||x||_2 = √( Σ_i x_i² )

• Dividing a vector by its L2 norm makes it a unit (length) vector.
• Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
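A sketch of L2 length-normalization; appending a document to itself doubles its raw counts, so the two normalized vectors coincide (raw counts are used here for simplicity):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Divide each component by the vector's L2 norm, giving a unit-length vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

d = [3.0, 4.0]           # ||d||_2 = 5
d_prime = [6.0, 8.0]     # d appended to itself: every count doubles
print(l2_normalize(d))        # [0.6, 0.8]
print(l2_normalize(d_prime))  # [0.6, 0.8] -- identical after normalization
```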

Slide 30/37: cosine(query, document)

  cos(q, d) = (q · d) / (|q| |d|)
            = (q / |q|) · (d / |d|)
            = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

where q · d is the dot product and q/|q|, d/|d| are unit vectors.

• q_i is the tf-idf weight of term i in the query.
• d_i is the tf-idf weight of term i in the document.
• cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
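A direct sketch of the cosine formula above, for two weight vectors over the same vocabulary:

```python
import math

def cosine(q: list[float], d: list[float]) -> float:
    """cos(q, d): dot product divided by the product of the L2 norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine([1.0, 1.0, 0.0], [2.0, 0.0, 1.0]))  # ≈ 0.63
```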

Slide 31/37: Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS    PaP    WH
affection   115     58    20
jealous      10      7    11
gossip        2      0     6
wuthering     0      0    38

Slide 32/37: 3 documents example contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
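The three cosine values can be reproduced from the raw counts (the WH count for "jealous", 11, is recovered from its log weight 2.04 above):

```python
import math

def log_tf_vector(counts: list[int]) -> list[float]:
    """Apply the log-frequency weight 1 + log10(tf) componentwise (0 stays 0)."""
    return [1 + math.log10(c) if c > 0 else 0.0 for c in counts]

def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# counts over (affection, jealous, gossip, wuthering)
sas = normalize(log_tf_vector([115, 10, 2, 0]))
pap = normalize(log_tf_vector([58, 7, 0, 0]))
wh  = normalize(log_tf_vector([20, 11, 6, 38]))

dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # unit vectors, so dot = cosine
print(round(dot(sas, pap), 2), round(dot(sas, wh), 2), round(dot(pap, wh), 2))
# 0.94 0.79 0.69
```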

Slide 33/37: Computing cosine scores
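The transcript does not reproduce the algorithm shown on this slide; below is a sketch of the usual term-at-a-time cosine scoring it corresponds to, assuming an in-memory postings map from term to (doc_id, w_{t,d}) pairs and a precomputed document-length table (all names are illustrative):

```python
import heapq

def cosine_scores(query_weights: dict[str, float],
                  postings: dict[str, list[tuple[int, float]]],
                  length: dict[int, float],
                  k: int = 10) -> list[tuple[float, int]]:
    """Accumulate w_{t,d} * w_{t,q} term by term, divide by document length, return top K."""
    scores: dict[int, float] = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    return heapq.nlargest(k, ((s / length[d], d) for d, s in scores.items()))
```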

Slide 34/37: tf-idf weighting has many variants

Columns headed "n" are acronyms for weight schemes.

Why is the base of the log in idf immaterial?

Slide 35/37: Weighting may differ in queries vs documents

• Many search engines allow for different weightings for queries vs documents.
• To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.
• Example: ltn.lnc means:
  • Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.
  • Document: logarithmic tf, no idf, and cosine normalization.

Is this a bad idea?

Slide 36/37: tf-idf example: ltn.lnc

Query: best car insurance
Document: car insurance auto insurance

                 ----------- Query -----------      -------- Document --------
Term             tf-raw  tf-wt  df       idf   wt   tf-raw  tf-wt  wt    n'lized    Prod
auto             0       0      5,000    2.3   0    1       1      1     0.52       0
best             1       1      50,000   1.3   1.3  0       0      0     0          0
car              1       1      10,000   2.0   2.0  1       1      1     0.52       1.04
insurance        1       1      1,000    3.0   3.0  2       1.3    1.3   0.68       2.04

Exercise: what is N, the number of docs?
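A sketch reproducing the score for this example: the query is weighted ltn (log tf × idf, no normalization) and the document lnc (log tf, no idf, cosine-normalized), as defined on the previous slide:

```python
import math

def log_tf(tf: int) -> float:
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # implied by the df/idf pairs in the table (e.g. df=1,000 -> idf=3.0)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

# ltn query weights: log tf * idf, no normalization
q = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# lnc document weights: log tf, no idf, L2 (cosine) normalization
d_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d = {t: w / norm for t, w in d_raw.items()}

score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(score, 2))  # ≈ 3.07 (with the table's rounded weights: 1.04 + 2.04 ≈ 3.08)
```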

Slide 37/37: Summary: vector space ranking

• Represent the query as a weighted tf-idf vector.
• Represent each document as a weighted tf-idf vector.
• Compute the cosine similarity score for the query vector and each document vector.
• Rank documents with respect to the query by score.
• Return the top K (e.g., K = 10) to the user.