Classic IR Models - Dalhousie University (web.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf)
Classic IR Models
5/6/2012 1
Classic IR Models
• Idea
– Each document is represented by index terms.
– An index term is basically a word whose semantics help convey the meaning of the document.
– Not all index terms are equally useful for describing the document content.
– The importance of each index term to a document is captured by assigning a weight to that term in the document.
Definition
• Let
– $t$ be the number of index terms in the corpus (or system).
– $k_i$ be a generic index term.
– $K = \{k_1, k_2, \ldots, k_t\}$ be the set of index terms.
– $w_{i,j} > 0$ be the weight associated with index term $k_i$ in a document $d_j$.
– $w_{i,j} = 0$ if $k_i$ does not appear in $d_j$.
– Each document $d_j$ is associated with an index term vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
– $g_i$ is a function that returns the weight associated with index term $k_i$ in $d_j$: $g_i(d_j) = w_{i,j}$.
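The definition above can be illustrated with a small sketch. It assumes binary weights (1 if the term occurs, 0 otherwise), which is only one possible weighting; the names document_vector and g, and the sample vocabulary, are hypothetical.

```python
def document_vector(doc_terms, index_terms):
    """Return (w_1j, ..., w_tj) using binary weights (an assumption):
    1 if index term k_i appears in the document, 0 otherwise."""
    present = set(doc_terms)
    return [1 if k in present else 0 for k in index_terms]


def g(i, d):
    """g_i(d_j) = w_ij: the weight of the i-th index term (0-based here) in d."""
    return d[i]


# K: the set of index terms, in a fixed order (hypothetical vocabulary).
K = ["database", "java", "oo", "programming", "sql"]
d1 = document_vector(["programming", "java"], K)
print(d1)        # [0, 1, 0, 1, 0]
print(g(1, d1))  # 1, the weight of "java" in d1
```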
IR Models
• User task
– Retrieval: ad hoc, filtering
– Browsing
• Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext
Basic Idea (Boolean Model)
• Document: set of terms
• Query: Boolean expression over terms
• Satisfying:
– A document evaluates to "true" on a single-term query if it contains that term
– Evaluate a document on an expression query as you would any Boolean expression
– A document satisfies the query if it evaluates to true on the query
Credit: Princeton
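The satisfaction rules above can be sketched as a small recursive evaluator. This is a minimal illustration, not the course's implementation: the nested-tuple query representation, the function name evaluate, and the sample document are all assumptions made for the example.

```python
def evaluate(query, doc_terms):
    """Evaluate a Boolean query against a document given as a set of terms.

    A query is either a single term (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2), or ("NOT", q).
    """
    if isinstance(query, str):
        # Single-term query: true if the document contains the term.
        return query in doc_terms
    op = query[0]
    if op == "AND":
        return evaluate(query[1], doc_terms) and evaluate(query[2], doc_terms)
    if op == "OR":
        return evaluate(query[1], doc_terms) or evaluate(query[2], doc_terms)
    if op == "NOT":
        return not evaluate(query[1], doc_terms)
    raise ValueError(f"unknown operator: {op!r}")


doc = {"java", "programming"}          # hypothetical document
q = ("AND", "java", ("NOT", "sql"))    # java AND NOT sql
print(evaluate(q, doc))                # True
```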
Satisfying a Query in the Boolean Model
• What determines if a document satisfies a query?
– That depends ….
• Document model
• Query model
• Start simple
– Better understanding
– Use components of the simple model later
Boolean Model Example
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”
• Query:
– (principles AND knowledge) OR (science AND engineering)
Boolean Model Example: Doc 1
• Term values in Doc 1: principles = 0, knowledge = 1, science = 1, engineering = 0
• (0 AND 1) OR (1 AND 0) = 0 OR 0 = 0
• Doc 1: FALSE
Boolean Model Example: Doc 2
• Term values in Doc 2: principles = 1, knowledge = 0, science = 1, engineering = 1
• (1 AND 0) OR (1 AND 1) = 0 OR 1 = 1
• Doc 2: TRUE
Exercise
• Use Doc 1 and Doc 2
• (principles OR knowledge) AND (science AND NOT(engineering))
Implementation Example (Boolean Model)
• Suppose we have a data set of three documents as follows:
– D1 = Programming in Java
– D2 = OO Programming
– D3 = Databases and SQL Programming
• "in" and "and" are dropped (stop words)
Implementation Example (Boolean Model)
• Primary Index
      Database  Java  OO  Programming  SQL
D1    0         1     0   1            0
D2    0         0     1   1            0
D3    1         0     0   1            1
• Inverted Index
Term         Freq.  Pointer → Postings List
Database     1      D3
Java         1      D1
OO           1      D2
Programming  3      D1, D2, D3
SQL          1      D3
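Both structures can be derived from the three documents with a few lines of code. The sketch below is one possible way to do it; the whitespace tokenizer and lowercasing are simplifying assumptions, and because no stemming is applied the term appears as "databases" rather than the "Database" shown on the slide.

```python
STOP_WORDS = {"in", "and"}

docs = {
    "D1": "Programming in Java",
    "D2": "OO Programming",
    "D3": "Databases and SQL Programming",
}


def terms_of(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


# Inverted index: term -> list of documents containing it (in docID order).
inverted = {}
for doc_id in sorted(docs):
    for term in set(terms_of(docs[doc_id])):
        inverted.setdefault(term, []).append(doc_id)

vocabulary = sorted(inverted)

# Incidence matrix: one 0/1 row per document, one column per term.
incidence = {
    doc_id: [1 if doc_id in inverted[t] else 0 for t in vocabulary]
    for doc_id in docs
}

for term in vocabulary:
    # term, document frequency, postings list
    print(term, len(inverted[term]), inverted[term])
```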
Term-Document Incidence Boolean Model
Query: Brutus AND Caesar BUT NOT Calpurnia
Entries are 1 if the play contains the word, 0 otherwise.
           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0
Incidence Vectors
• So we have a 0/1 vector for each term.
• To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
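The same computation can be sketched with Python integers as bit vectors. The bit order follows the columns of the incidence table (leftmost bit = Antony and Cleopatra); the variable names are illustrative only.

```python
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# Incidence vectors from the table, leftmost bit = first play.
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Python integers are unbounded, so mask the complement to 6 bits.
mask = (1 << len(plays)) - 1
not_calpurnia = ~vectors["Calpurnia"] & mask   # 101111

result = vectors["Brutus"] & vectors["Caesar"] & not_calpurnia
bits = format(result, "06b")
matches = [play for play, bit in zip(plays, bits) if bit == "1"]

print(bits)     # 100100
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```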
Answers to Query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Exercise 1
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
Exercise 2
0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Retrieval Evaluation
• User Evaluation
– Relevant
– Not relevant
• System Evaluation
– Retrieved
– Not Retrieved
           Rel.  Not Rel.
Ret.       a     b
Not Ret.   c     d
• Recall: R = a / (a + c)
• Precision: P = a / (a + b)
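As a worked example of the two formulas, the sketch below computes precision and recall from a retrieved set and a relevant set. The function name, the document IDs, and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) using the contingency-table counts a, b, c."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(retrieved - relevant)   # retrieved but not relevant
    c = len(relevant - retrieved)   # relevant but not retrieved
    precision = a / (a + b) if retrieved else 0.0   # convention for empty sets
    recall = a / (a + c) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: a = 2, b = 2, c = 3.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```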
Drawing of Recall-Precision
http://ralphlosey.files.wordpress.com http://ilab.cs.ucsb.edu/
Advantages of Boolean IR Modeling
• The Boolean Model
– Fast to implement
– Fast to process a query
– Simple
Boolean Modeling Pitfalls
• Retrieval is based on binary decision criteria, with no notion of partial matching.
• No ranking of the documents is provided (absence of a grading scale).
• The information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.
Always Remember!
• We care about modeling.
• Implementation can be done in different ways.
• Which way should you select?
– It depends.
• You can go with a hash table or hash tree. When?
• You can use a B-tree. When?
• More about this in assignment 2.
• The Boolean model has extended forms.
• The Boolean model does not take care of ranking (deciding the order in which results are presented).
The Inverted Index
Boolean Model Continued
Example from last class
Implementation Example (Boolean Model)
• Suppose we have a data set of three documents as follows:
– D1 = Programming in Java
– D2 = OO Programming
– D3 = Databases and SQL Programming
• "in" and "and" are dropped (stop words)
Implementation Example (Boolean Model)
• Primary Index
      Database  Java  OO  Programming  SQL
D1    0         1     0   1            0
D2    0         0     1   1            0
D3    1         0     0   1            1
• Inverted Index
Term         Freq.  Pointer → Postings List
Database     1      D3
Java         1      D1
OO           1      D2
Programming  3      D1, D2, D3
SQL          1      D3
Also, look at Google’s paper: “The Anatomy of a Large-Scale Hypertextual Search Engine”
More Detailed Inverted Index
Introduction to Information Retrieval
Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
• What happens if the word Caesar is added to document 14?
Sec. 1.2
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable-length arrays
– Some tradeoffs in size/ease of insertion
Dictionary → Postings (each docID in a list is a posting)
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
• Sorted by docID (more later on why).
Sec. 1.2
Inverted index construction
• Documents to be indexed: Friends, Romans, countrymen.
• Tokenizer → token stream: Friends Romans Countrymen
• Linguistic modules → modified tokens: friend roman countryman
• Indexer → inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
• More on the tokenizer and linguistic modules later.
Sec. 1.2
Indexer steps: Token sequence
• Sequence of (modified token, document ID) pairs.
• Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
• Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
• Sort by terms, and then by docID
• Core indexing step. Why?
Sec. 1.2
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Doc. frequency information is added.
Why frequency? Will discuss later.
Sec. 1.2
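The three indexer steps (token sequence, sort, dictionary and postings) can be sketched end to end on the two example documents. This is a minimal illustration: the regular-expression tokenizer and lowercasing stand in for the tokenizer and linguistic modules and are assumptions, as are the variable names.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [
    (token, doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[a-z']+", text.lower())
]

# Step 2: sort by term, then by docID (tuples sort lexicographically).
pairs.sort()

# Step 3: merge repeated entries, then split into dictionary and postings.
dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    postings[term] = doc_ids
    dictionary[term] = len(doc_ids)

print(dictionary["caesar"], postings["caesar"])  # 2 [1, 2]
print(dictionary["killed"], postings["killed"])  # 1 [1]
```

Sorting first is what lets a single pass with groupby collect all pairs for a term, since equal terms end up adjacent.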
Where do we pay in storage?
• Terms and counts
• Pointers
• Lists of docIDs
Later in the course:
• How do we index efficiently?
• How much storage do we need?
Sec. 1.2
The index we just built
How do we process a query?
Later - what kinds of queries can we process?
Sec. 1.3
Query processing: AND
• Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary; retrieve its postings.
– Locate Caesar in the Dictionary; retrieve its postings.
– “Merge” the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3
The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8
• If the list lengths are x and y, the merge takes O(x+y) operations.
• Crucial: postings sorted by docID.
Sec. 1.3
Intersecting two postings lists (a “merge” algorithm)
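The algorithm figure for this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch of the usual two-pointer intersection of docID-sorted postings lists; the function name intersect and the use of Python lists are assumptions, not necessarily what the slide showed.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # p1 is behind: advance it
        else:
            j += 1   # p2 is behind: advance it
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is why the running time is linear in the combined length and why the lists must be sorted by docID.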
Boolean queries: Exact match
• In the Boolean retrieval model, a query is a Boolean expression:
– Boolean queries use AND, OR and NOT to join query terms
– Views each document as a set of words
– Is precise: a document matches the condition or not.
• Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
Sec. 1.3
Example: WestLaw http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• Tens of terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
– /3 = within 3 words, /S = in same sentence
Sec. 1.4
Example: WestLaw http://www.westlaw.com/
• Another example query:
– Requirements for disabled people to be able to access a workplace
– disabl! /p access! /s work-site work-place (employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators; incrementally developed; not like web search
• Many professional searchers still like Boolean search
– You know exactly what you are getting
• But that doesn’t mean it actually works better….
Sec. 1.4
Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
What can we achieve?
Sec. 1.3
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in “linear” time?
Linear in what?
Can we do better?
Sec. 1.3
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3
Query optimization example
• Process in order of increasing freq:
– Start with the smallest set, then keep cutting further.
– This is why we kept document freq. in the dictionary.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
• Execute the query as (Calpurnia AND Brutus) AND Caesar.
Sec. 1.3
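The ordering heuristic can be sketched as follows: sort the postings lists by length (the document frequency) and intersect from shortest to longest. The helper names and the early exit on an empty intermediate result are illustrative choices, not taken from the slides.

```python
def intersect(p1, p2):
    """Two-pointer intersection of docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together several postings lists, processing the shortest first."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break   # nothing left to match, so stop early
        result = intersect(result, postings)
    return result


index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
terms = ["Brutus", "Calpurnia", "Caesar"]
order = sorted(terms, key=lambda t: len(index[t]))
result = intersect_many([index[t] for t in terms])

print(order)   # ['Calpurnia', 'Brutus', 'Caesar']
print(result)  # [16]
```

Starting with the smallest list keeps every intermediate result no larger than that list, so later intersections have less work to do.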
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
The question is which operands of the AND to process first so that the query runs faster.
Get doc. freq.’s for all terms.
Estimate the size of each OR by the sum of its doc. freq.’s (conservative).
Process in increasing order of OR sizes.
Sec. 1.3
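A minimal sketch of this estimate for an AND of OR groups follows. The document frequencies are made-up numbers chosen only to show the ordering; the function and variable names are hypothetical.

```python
# Hypothetical document frequencies, for illustration only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}

# (madding OR crowd) AND (ignoble OR strife), as a list of OR groups.
query = [("madding", "crowd"), ("ignoble", "strife")]


def estimated_size(or_group):
    """Conservative (upper-bound) estimate of an OR's result size:
    the sum of the document frequencies of its terms."""
    return sum(doc_freq[term] for term in or_group)


# Process the OR groups in increasing order of estimated size.
order = sorted(query, key=estimated_size)

for group in order:
    print(group, estimated_size(group))
# ('ignoble', 'strife') 45
# ('madding', 'crowd') 510
```

The sum is an upper bound because documents containing more than one term of the group are counted more than once.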
Exercise
Recommend a query processing order for
Term Freq
eyes 213312
kaleidoscope 87009
marmalade 107913
skies 271658
tangerine 46653
trees 316812
(tangerine OR trees) AND
(marmalade OR skies) AND
(kaleidoscope OR eyes)
Classic IR Models
The Vector Space Model
The Vector Space Model
• Document: bag of terms
• Query: list of terms
• Satisfying:
– Each document is scored as to the degree it satisfies the query (non-negative real number)
– A document satisfies the query if its score is > 0
– Documents are returned in a sorted list, decreasing by score (hints for implementation):
• Include only non-zero scores
• Include only the highest n documents, for some n
How to compute score? Basic Assumptions
• There is a dictionary (aka lexicon) of all terms, numbering t in all
– Number the terms 1, …, t
• Change the model of a document (temporarily):
– A document is a t-dimensional vector
– The ith entry of the vector is the weight (importance) of term i in the document
The Vector Space
How to compute score, continued
• Calculate a vector function of the document vector and the query vector to get the score of the document with respect to the query.
• Choices:
– Measure the distance between the vectors:
• $\mathrm{Dist}(d, q) = \sqrt{\sum_{i=1}^{t} (d_i - q_i)^2}$
• Is a dissimilarity measure
• Not normalized: Dist ranges over $[0, \infty)$
• Fix: use $e^{-\mathrm{Dist}}$, range $(0, 1]$
• Is it the right sense of difference?
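A small numeric sketch of the distance-based score, assuming Dist is the Euclidean distance between the document and query vectors; the function names and sample vectors are illustrative only.

```python
import math


def dist(d, q):
    """Euclidean distance between two equal-length vectors (assumed form of Dist)."""
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))


def dist_similarity(d, q):
    """Map the distance to a bounded score with e^(-Dist): 1 means identical."""
    return math.exp(-dist(d, q))


print(dist([3, 0], [0, 4]))             # 5.0
print(dist_similarity([1, 2], [1, 2]))  # 1.0
```

Note that a long document and a short query on the same topic can be far apart in this sense even though they point in a similar direction, which is one reason to question whether distance is the right notion of difference.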
How to compute score, cont’d
• Measure the angle between the vectors:
• Dot product: $d \cdot q = \sum_{i=1}^{t} d_i q_i$
• Is a similarity measure
– Not normalized: dot product ranges over (−∞, ∞)
– Fix: use normalized dot product, range [−1, 1]
• $\mathrm{sim}(d, q) = \dfrac{d \cdot q}{|d|\,|q|}$, aka cosine similarity
• In practice vector components are nonnegative, so the range is [0, 1]
• This is the most commonly used function for scoring.
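The dot product and its normalized form can be sketched as follows. The helper names are hypothetical, and returning 0 for a zero-length vector is an assumption made here to avoid division by zero; the slides do not specify that case.

```python
import math


def dot(d, q):
    """Un-normalized dot product of two equal-length vectors."""
    return sum(di * qi for di, qi in zip(d, q))


def norm(v):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(v, v))


def cosine_similarity(d, q):
    """Normalized dot product; in [0, 1] when all components are nonnegative."""
    denom = norm(d) * norm(q)
    if denom == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot(d, q) / denom


# Hypothetical vectors: q points in the same direction as d but is twice as long.
d = [1.0, 2.0, 0.0]
q = [2.0, 4.0, 0.0]

print(dot(d, q))                # 10.0 (grows with vector length)
print(cosine_similarity(d, q))  # approximately 1.0 (same direction)
```

Unlike the dot product, the cosine does not change when either vector is rescaled, so a long document is not favored merely for its length.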
![Page 55: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/55.jpg)
Cosine Similarity
• Cosine similarity is a measure of similarity between two n-dimensional vectors, obtained by finding the cosine of the angle between them.
• Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using the dot product and magnitude as
$\cos(\theta) = \dfrac{A \cdot B}{\|A\|\,\|B\|}$
(numerator: dot product; denominator: product of the magnitudes)
Credit: http://www10.org/cdrom/papers/519/node12.html
[Figure: Cosine geometrically; vectors A and B separated by the angle θ]
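A small worked case may help connect the formula to the geometric picture. The vectors below are hypothetical, chosen so that the angle is easy to verify by hand.

```latex
A = (1, 0), \qquad B = (1, 1)
\\
A \cdot B = 1 \cdot 1 + 0 \cdot 1 = 1, \qquad
\|A\| = 1, \qquad \|B\| = \sqrt{1^2 + 1^2} = \sqrt{2}
\\
\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{1}{\sqrt{2}} \approx 0.707
\quad\Longrightarrow\quad \theta = 45^{\circ}
```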
![Page 56: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/56.jpg)
How to Compute Weights of Documents
• The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval.
• It is used to evaluate how important a word is to a document in a collection.
• Two factors:
– How frequent the term is in the document (more frequent, more important)
– How frequent the term is in the collection of documents (less frequent, more important to the current document)
![Page 57: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/57.jpg)
tf (Term Frequency)
• The term count in the given document is simply the number of times a given term appears in that document.
• Usually normalized (why?)
– To prevent a bias towards longer documents.
– e.g. divide by the number of all terms in the document.
• tf is computed as follows: $tf_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$
• $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document $d_j$.
• The denominator is the sum of the number of occurrences of all terms in document $d_j$.
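The length-normalized term frequency can be sketched like this. Splitting on whitespace is an assumption made only for the example (the slides do not describe tokenization), and the function name and sample text are hypothetical.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return tf for each term: its count divided by the total number of terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Hypothetical five-word document.
tokens = "computer science and computer engineering".split()
tf = term_frequencies(tokens)

print(tf["computer"])  # 0.4 (2 occurrences out of 5 terms)
print(tf["science"])   # 0.2
```

Because every count is divided by the document length, the tf values of a document sum to 1 regardless of how long the document is.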
![Page 58: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/58.jpg)
idf (Inverse Document Frequency)
• idf is a measure of the general importance of the term.
• Can be computed as follows:
• Document frequency: $df_i = \dfrac{|\{d : t_i \in d\}|}{|D|} = \dfrac{\#\,\text{docs where } t_i \text{ appears}}{\#\,\text{of all docs}}$
• $idf_i = \log \dfrac{|D|}{|\{d : t_i \in d\}| + 1}$
• Where:
– $|D|$ is the total number of documents in the corpus.
– $|\{d : t_i \in d\}|$ is the number of documents where the term $t_i$ appears.
– “1” is usually added to the denominator to prevent division by zero.
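A minimal sketch of the idf computation with the +1 in the denominator follows. Representing each document as a set of terms and using the natural logarithm are assumptions made for the example; the toy corpus is hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """idf = log(|D| / (df + 1)), where df counts the documents containing term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (df + 1))


# Hypothetical corpus of five documents, each reduced to its set of terms.
documents = [
    {"science", "knowledge"},
    {"science", "engineering", "principles"},
    {"music"},
    {"language"},
    {"genome"},
]

idf_science = inverse_document_frequency("science", documents)  # log(5 / 3)
idf_music = inverse_document_frequency("music", documents)      # log(5 / 2)
idf_unseen = inverse_document_frequency("physics", documents)   # log(5 / 1)

print(idf_science, idf_music, idf_unseen)
```

The rarer term receives the larger idf, and a term that appears nowhere still yields a finite value. One side effect of the +1 is that a term occurring in every document gets log(|D| / (|D| + 1)), which is slightly below zero.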
![Page 60: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/60.jpg)
tf (Term Frequency), another way to normalize
• Let:
• $N$ be the total number of documents in the dataset
• $n_i$ be the total number of documents that contain term $i$
• $freq_{i,j}$ be the total number of occurrences of term $i$ in doc. $j$
• $IDF_i = \log_2 \dfrac{N}{n_i} = \log_2 N - \log_2 n_i$
• Now,
• $W_{i,j} = tf_{i,j} \cdot idf_i = freq_{i,j} \cdot (\log N - \log n_i)$
We have to normalize, why?
• $W_{i,j} = tf_{i,j} \cdot idf_i = \dfrac{freq_{i,j}}{\max_L freq_{L,j}} \cdot (\log N - \log n_i)$
• $\max_L freq_{L,j}$ is the frequency of the most frequent term $L$ in document $j$
• We may also need to add (+1) to: $\log N - \log n_i$
Why?
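The max-normalized weighting can be sketched as below. This is an illustration only: the function name, the document, and the document-frequency counts are hypothetical, and base-2 logarithms are used to match the IDF definition on this slide.

```python
import math
from collections import Counter


def max_normalized_weights(doc_tokens, doc_freq, num_docs):
    """W = (freq / max freq in the document) * (log2 N - log2 n_i)."""
    counts = Counter(doc_tokens)
    if not counts:
        return {}
    max_freq = max(counts.values())  # frequency of the most frequent term
    weights = {}
    for term, freq in counts.items():
        idf = math.log2(num_docs) - math.log2(doc_freq[term])
        weights[term] = (freq / max_freq) * idf
    return weights


# Hypothetical document and collection statistics (N = 8 documents).
doc = ["science", "science", "engineering", "principles"]
doc_freq = {"science": 4, "engineering": 1, "principles": 2}

weights = max_normalized_weights(doc, doc_freq, num_docs=8)
print(weights)  # {'science': 1.0, 'engineering': 1.5, 'principles': 1.0}
```

Dividing by the largest frequency in the document keeps the tf factor in (0, 1], so the most frequent term always has tf = 1 no matter how long the document is.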
![Page 61: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/61.jpg)
tf-idf
• The weight of a term in a document is:
$W_{i,j} = tf_{i,j} \cdot idf_i$
• What does it do?
• It usually filters out common terms.
• What about ranking?
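One way to see the filtering effect is to look at a term that occurs in every document. The derivation below assumes the idf variant $idf_i = \log(N / n_i)$ without the +1 adjustment; the numbers in the second line are hypothetical.

```latex
W_{i,j} = tf_{i,j} \cdot \log\frac{N}{n_i}, \qquad
n_i = N \;\Longrightarrow\; \log\frac{N}{N} = 0 \;\Longrightarrow\; W_{i,j} = 0
\\
\text{By contrast, for } N = 5,\ n_i = 1:\quad
W_{i,j} = tf_{i,j} \cdot \ln 5 \approx 1.61 \cdot tf_{i,j}
```

A term found in every document therefore contributes nothing to any score, however often it occurs, while rare terms are amplified.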
![Page 62: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/62.jpg)
Vector Space Model Example
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific media and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”
• Frequencies: science 1; knowledge 1; principles 0; engineering 0
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”
• Frequencies: science 2; knowledge 0; principles 1; engineering 1
![Page 63: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/63.jpg)
Example, cont’d
• Consider having 5 documents in the collection.
• The idf values for the terms in the previous example are:
– science: ln(5/3) ≈ 0.51
– engineering, principles, knowledge: ln(5/1) ≈ 1.6
![Page 64: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/64.jpg)
Ranking
• Term by Doc. Table: $freq_{j,d} \cdot \log(N / n_j)$

| Term | Doc 1 | Doc 2 | Query |
|---|---|---|---|
| science | 0.51 | 1.02 | 0.51 |
| engineering | 0 | 1.6 | 1.6 |
| principles | 0 | 1.6 | 1.6 |
| knowledge | 1.6 | 0 | 1.6 |

• Using un-normalized dot product for query: science, engineering, knowledge, principles
• Also, using 0/1 query vector, we get:
– Cosine (Doc1, Q) = 0.589
– Cosine (Doc2, Q) = 0.807
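The un-normalized dot-product ranking can be sketched directly from the table values. The variable names are hypothetical; the weights are copied from the Doc 1, Doc 2, and Query columns above, in the order science, engineering, principles, knowledge.

```python
def dot(d, q):
    """Un-normalized dot product of two equal-length weight vectors."""
    return sum(di * qi for di, qi in zip(d, q))


# Weights taken from the term-by-document table
# (order: science, engineering, principles, knowledge).
doc1 = [0.51, 0.0, 0.0, 1.6]
doc2 = [1.02, 1.6, 1.6, 0.0]
query = [0.51, 1.6, 1.6, 1.6]

scores = {"Doc 1": dot(doc1, query), "Doc 2": dot(doc2, query)}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # Doc 1 is about 2.82, Doc 2 is about 5.64
print(ranking)  # ['Doc 2', 'Doc 1']
```

Doc 2 ranks first because it matches three of the four query terms, two of them with high idf, whereas Doc 1 matches only science and knowledge.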
![Page 65: Classic IR Models - Dalhousie Universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · Classic IR Models • Idea –Each document is represented by index terms. –An index term](https://reader030.fdocuments.net/reader030/viewer/2022040421/5e0f7305ad76392e972164a8/html5/thumbnails/65.jpg)
Vector Space Model (Summary)
• Advantages
– The concept of Ranking.
– Not difficult to implement
– Shown to be effective
• Disadvantages
– What threshold to choose?
– Term Independence
– Term Weights