1
CS 430: Information Discovery
Lecture 5
Ranking
2
Course Administration
• The course readings marked optional are just that: optional. Read them if you wish. Some may require a visit to a library!
• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.
• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report must be individual work.
3
Course Administration
Hints on Assignment 1
• You are not building a production system!!!
• The volume of test data is quite small.
Therefore:
• Choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B trees).
• Consider batch loading of data (e.g., no need to provide for incremental update).
• User interface can be minimal (e.g., single letter commands).
To save typing, we will provide the arrays char_class and convert_class from Frakes, Chapter 7.
4
Term Frequency
Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.
5
Term Frequency
Suppose term j appears fij times in document i
Simple method (as illustrated in Lecture 4) is to use fij as the term frequency.
Standard method
Scale fij relative to the other terms in the document. This partially corrects for variations in the length of the documents.
Let mi = maxj (fij), i.e., mi is the maximum frequency of any term in document i.
Term frequency (tf):
tfij = fij / mi
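As a small sketch of the scaled term frequency above (the function name and toy document are illustrative, not from the lecture):

```python
from collections import Counter

def term_frequencies(tokens):
    """Scaled term frequency: tf_ij = f_ij / m_i, where m_i is the
    frequency of the most frequent term in document i."""
    counts = Counter(tokens)
    m_i = max(counts.values())
    return {term: f / m_i for term, f in counts.items()}

# A toy document in which "cat" appears twice, so m_i = 2.
tf = term_frequencies(["cat", "sat", "cat"])
# tf["cat"] = 2/2 = 1.0, tf["sat"] = 1/2 = 0.5
```

Note that the most frequent term in a document always gets tf = 1, regardless of document length; this is the partial length correction mentioned above.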
6
Inverse Document Frequency
Concept
A term that occurs in only a few documents is likely to be a better discriminator than a term that appears in most or all documents.
7
Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is dj.
Simple method is to use n/dj as the inverse document frequency.
Standard method
The simple method over-emphasizes small differences. Therefore use a logarithm. Inverse document frequency (idf):
idfj = log2 (n/dj) + 1, for dj > 0
8
Example of Inverse Document Frequency
Example: n = 1,000 documents

term j      dj    idfj
A          100    4.32
B          500    2.00
C          900    1.13
D        1,000    1.00
From: Salton and McGill
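The idf formula can be checked directly against the example table for terms A, B, and D (a minimal sketch; the function name is illustrative):

```python
import math

def inverse_document_frequency(n, d_j):
    """idf_j = log2(n / d_j) + 1, defined for d_j > 0."""
    return math.log2(n / d_j) + 1

n = 1000
# Reproducing the Salton and McGill example for terms A, B, and D:
idf_A = inverse_document_frequency(n, 100)   # log2(10) + 1 ≈ 4.32
idf_B = inverse_document_frequency(n, 500)   # log2(2) + 1  = 2.00
idf_D = inverse_document_frequency(n, 1000)  # log2(1) + 1  = 1.00
```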
9
Standard Version of tf.idf Weighting
Combining tf and idf:
(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is proportional to the logarithm of the reciprocal of the number of documents that contain the term.
Notation
wij is the weight given to term j in document i
fij is the frequency with which term j appears in document i
dj is the number of documents that contain term j
mi is the maximum frequency of any term in document i
n is the total number of documents
10
Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
(Weight of term j in document i)
= (Term frequency) * (Inverse document frequency)
The standard tf.idf weighting scheme is:
wij = tfij * idfj
= (fij / mi) * (log2 (n/dj) + 1)
Frakes, Chapter 14, discusses many variations on this basic scheme.
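Putting the two parts together, the standard tf.idf scheme can be sketched as follows (the function name and toy documents are illustrative, not from the lecture):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = (f_ij / m_i) * (log2(n / d_j) + 1) for each term j in doc i."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    # d_j: number of documents that contain term j
    d = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        m_i = max(c.values())
        weights.append({t: (f / m_i) * (math.log2(n / d[t]) + 1)
                        for t, f in c.items()})
    return weights

docs = [["cat", "sat", "cat"], ["dog", "sat"]]
w = tfidf_weights(docs)
# "cat" occurs in 1 of 2 docs: tf = 1.0, idf = log2(2)+1 = 2 -> weight 2.0
# "sat" occurs in both docs:  tf = 0.5, idf = log2(1)+1 = 1 -> weight 0.5
```

As expected, the rarer term ("cat") receives a higher weight than the term that appears in every document ("sat").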
11
Ranking Based on Reference Patterns
With term weighting (e.g., tf.idf) documents are ranked depending on how well they match a specific query.
With ranking by reference patterns, documents are ranked based on the references among them. The ranking of a set of documents is independent of any specific query.
In journal literature, references are called citations.
On the web, references are called links or hyperlinks.
12
13
Citation Graph
[Figure: a paper in a citation graph; it cites earlier papers and is cited by later papers.]
Note that journal citations always refer to earlier work.
14
Bibliometrics
Techniques that use citation analysis to measure the similarity of journal articles or their importance
Bibliographic coupling: two papers that cite many of the same papers
Co-citation: two papers that were cited by many of the same papers
Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period
15
Graphical Analysis of Hyperlinks on the Web
[Figure: a directed graph of six web pages, numbered 1 to 6. One page links to many other pages; many pages link to another page.]
16
Matrix Representation
          P1  P2  P3  P4  P5  P6   Number
     P1                    1          1
     P2    1       1                  2
     P3    1   1       1              3
     P4    1   1           1   1      4
     P5    1                          1
     P6                    1          1
 Number    4   2   1   1   3   1

Rows are the cited page (to); columns are the citing page (from). A 1 in row Pi, column Pj means that page Pj cites page Pi. The right-hand margin counts citations received; the bottom row counts links out of each page.
17
PageRank Algorithm (Google)
Concept:
The rank of a web page is higher if many pages link to it.
Links from highly ranked pages are given greater weight than links from less highly ranked pages.
18
Intuitive Model
A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Step 2 a very large number of times
Pages are ranked according to the relative frequency with which they are visited.
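The random-surfer model above can be simulated directly on the six-page example (the link structure is reconstructed from the citation matrix slide; this is a sketch, not the PageRank computation itself):

```python
import random

# Outgoing links for the six-page example: page -> pages it links to.
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

def random_surfer(links, steps, seed=0):
    """Follow random hyperlinks and count how often each page is visited."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = rng.choice(list(links))          # 1. start at a random page
    for _ in range(steps):
        page = rng.choice(links[page])      # 2. follow a random hyperlink
        visits[page] += 1                   # 3. repeat many times
    return visits

visits = random_surfer(links, steps=100_000)
# The walk is absorbed by the cycle among P2, P3, and P4,
# so those pages accumulate almost all of the visits.
```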
19
Basic Algorithm: Normalize by Number of Links from Page
          P1    P2   P3   P4   P5    P6
     P1                        0.33
     P2   0.25        1
     P3   0.25  0.5        1
     P4   0.25  0.5            0.33   1
     P5   0.25
     P6                        0.33
 Number    4     2    1    1    3     1

= B, the normalized link matrix: each column of the citation matrix is divided by the number of links out of the citing page, so every column sums to 1.
20
Basic Algorithm: Weighting of Pages
Initially all pages have weight 1:
w1 = (1, 1, 1, 1, 1, 1)
Recalculate weights:
w2 = Bw1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
21
Basic Algorithm: Iterate
Iterate: wk = Bwk-1

w1     w2     w3     w4            w (converged)
1      0.33   0.08   0.03   ...    0.00
1      1.25   1.83   2.80   ...    2.39
1      1.75   2.79   2.06   ...    2.39
1      2.08   1.12   1.05   ...    1.19
1      0.25   0.08   0.02   ...    0.00
1      0.33   0.08   0.03   ...    0.00

w1 -> w2 -> w3 -> w4 -> ... converges to ... w
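The iteration can be checked with a few lines of code; B below is the normalized link matrix of the six-page example, and the loop is plain power iteration (a sketch, not production PageRank code):

```python
# Normalized link matrix B of the six-page example: entry B[i][j] is
# 1/(outlinks of page j+1) if page j+1 links to page i+1, else 0.
B = [
    [0,    0,   0, 0, 1/3, 0],
    [0.25, 0,   1, 0, 0,   0],
    [0.25, 0.5, 0, 1, 0,   0],
    [0.25, 0.5, 0, 0, 1/3, 1],
    [0.25, 0,   0, 0, 0,   0],
    [0,    0,   0, 0, 1/3, 0],
]

def iterate(B, w, steps):
    """Apply w_k = B w_{k-1} the given number of times."""
    for _ in range(steps):
        w = [sum(B[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]
    return w

w1 = [1.0] * 6
w2 = iterate(B, w1, 1)    # ≈ (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
w = iterate(B, w1, 200)   # ≈ (0.00, 2.40, 2.40, 1.20, 0.00, 0.00)
```

After many iterations the weights of pages 1, 5, and 6 decay to zero, and the remaining weight settles on the cycle among pages 2, 3, and 4, matching the converged vector on the slide (up to rounding).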
22
Google PageRank with Damping
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Step 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with which they are visited.
23
The PageRank Iteration
The basic method iterates using the normalized link matrix, B.
wk = Bwk-1
This w is the principal eigenvector (the eigenvector with the largest eigenvalue) of B.
Google iterates using a damping factor. The method iterates using a matrix B', where:
B' = pN + (1 - p)B
N is the matrix with every element equal to 1/n. p is a constant found by experiment.
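The damped iteration can be sketched on the same six-page example; the value p = 0.15 is an illustrative choice, not a value given in the lecture:

```python
# Damped iteration with B' = pN + (1 - p)B, where N has every entry 1/n.
n, p = 6, 0.15   # p = 0.15 is an assumed, illustrative damping constant
B = [
    [0,    0,   0, 0, 1/3, 0],
    [0.25, 0,   1, 0, 0,   0],
    [0.25, 0.5, 0, 1, 0,   0],
    [0.25, 0.5, 0, 0, 1/3, 1],
    [0.25, 0,   0, 0, 0,   0],
    [0,    0,   0, 0, 1/3, 0],
]
B_damped = [[p / n + (1 - p) * B[i][j] for j in range(n)] for i in range(n)]

w = [1.0] * n
for _ in range(200):
    w = [sum(B_damped[i][j] * w[j] for j in range(n)) for i in range(n)]
# Unlike the undamped iteration, every page keeps a small positive weight.
```

Iterating with B' corresponds to the random-surfer model with jump probability p: the random jumps prevent the weight of pages outside the central cycle from decaying to zero.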
24
Google: PageRank
The Google PageRank algorithm is usually written with the following notation:
If page A has pages T1, ..., Tn pointing to it:
– d: damping factor
– C(A): number of links out of A
Iterate until convergence:
P(A) = (1 - d) + d (P(T1)/C(T1) + ... + P(Tn)/C(Tn))
25
Information Retrieval Using PageRank
Simple Method
Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal.
Display the hits ranked by PageRank.
The disadvantage of this method is that it gives no attention to how closely a document matches a query.
26
Reference Pattern Ranking using Dynamic Document Sets
PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.
Concept. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.
With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.
27
Reference Pattern Ranking using Dynamic Document Sets
Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)
1. Search using conventional term weighting. Rank the hits using similarity between query and documents.
2. Select the highest ranking hits (e.g., top 5,000 hits).
3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order of the reference patterns calculated.
28
Combining Term Weighting with Reference Pattern Ranking
Combined Method
1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the query and document j is sj.
3. The rank of document j using PageRank or other reference pattern ranking is pj.
4. Calculate a combined rank cj = λsj + (1- λ)pj, where λ is a constant.
5. Display the hits ranked by cj.
This method is used in several commercial systems, but the details have not been published.
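The combined method above can be sketched in a few lines; the scores and λ = 0.7 are made-up illustrative values, since the commercial details are unpublished:

```python
# Combined rank c_j = lam * s_j + (1 - lam) * p_j for each hit j.
lam = 0.7  # assumed, illustrative value of the constant lambda
similarity = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}  # s_j: query-document similarity
pagerank   = {"doc1": 0.1, "doc2": 0.8, "doc3": 0.5}  # p_j: reference-pattern rank

combined = {j: lam * similarity[j] + (1 - lam) * pagerank[j] for j in similarity}
hits = sorted(combined, key=combined.get, reverse=True)
# doc1: 0.63 + 0.03 = 0.66; doc3: 0.49 + 0.15 = 0.64; doc2: 0.28 + 0.24 = 0.52
```

A larger λ favors term-weighting similarity; a smaller λ favors the reference-pattern rank. Here doc1 wins on similarity despite its low PageRank.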
29
Cornell Note
Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, both theoretical work and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).