1
Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 5
April 23, 2006
http://www.ee.technion.ac.il/courses/049011
2
Ranking Algorithms
3
PageRank [Page, Brin, Motwani, Winograd 1998]
Motivating principles:
• The rank of p should be proportional to the rank of the pages that point to p.
  Compare: recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva.
• The rank of p should depend on the number of pages "co-cited" with p.
  Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth.
4
PageRank, Attempt #1
B = normalized adjacency matrix: B_{q,p} = 1/out-deg(q) if q → p, and 0 otherwise.
Rank equation: r(p) = Σ_{q → p} r(q)/out-deg(q), i.e., r = rB (r a row vector).
Additional conditions:
• r is non-negative: r ≥ 0
• r is normalized: ||r||_1 = 1
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1.
5
PageRank, Attempt #1
Solution exists only if B has eigenvalue 1.
Problem: B may not have 1 as an eigenvalue, because some of its rows are all 0 (sink pages with no outlinks).
Example: a single page with no outlinks gives B = (0), whose only eigenvalue is 0.
6
PageRank, Attempt #2
Rank equation: r(p) = λ · Σ_{q → p} r(q)/out-deg(q), i.e., r = λ·rB
λ = normalization constant
Then: r is a non-negative normalized left eigenvector of B with eigenvalue 1/λ.
7
PageRank, Attempt #2
Any nonzero eigenvalue μ of B may give a solution:
• λ = 1/μ
• r = any non-negative normalized left eigenvector of B with eigenvalue μ
Which solution to pick? Pick a "principal eigenvector" (i.e., one corresponding to the maximal eigenvalue).
How to find a solution? Power iterations.
8
PageRank, Attempt #2
Problem #1: The maximal eigenvalue may have multiplicity > 1.
• Several possible solutions.
• Happens, for example, when the graph is disconnected.
Problem #2: Rank accumulates at sinks.
• Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass.
9
PageRank, Final Definition
e = "rank source" vector. Standard setting: e(p) = ε/n for all p (ε < 1).
1 = the all-1's vector
Rank equation: r = λ · r(B + 1e^T)
Then: r is a non-negative normalized left eigenvector of (B + 1e^T) with eigenvalue 1/λ.
10
PageRank, Final Definition
Any nonzero eigenvalue of (B + 1e^T) may give a solution.
Pick r to be a principal left eigenvector of (B + 1e^T).
Will show:
• The principal eigenvalue has multiplicity 1, for any graph.
• There exists a non-negative left eigenvector.
Hence, PageRank always exists and is uniquely defined.
Due to the rank source vector, rank no longer accumulates at sinks.
11
An Alternative View of PageRank: The Random Surfer Model
When visiting a page p, a "random surfer":
• With probability 1 − d, selects a random outlink p → q and goes to visit q ("focused browsing").
• With probability d, jumps to a random web page q ("loss of interest").
If p has no outlinks, assume it has a self loop.
P: probability transition matrix: P_{p,q} = (1 − d)·B_{p,q} + d/n (with B taken to include the self-loops at sinks).
12
PageRank & Random Surfer Model
Suppose: e(p) = ε/n for all p, and d = ε/(1 + ε).
Then: P = (1/(1 + ε))·(B + 1e^T), i.e., P is a positive scalar multiple of (B + 1e^T).
Therefore, r is a principal left eigenvector of (B + 1e^T) if and only if it is a principal left eigenvector of P.
13
PageRank & Markov Chains
The PageRank vector is the normalized principal left eigenvector of (B + 1e^T).
Hence, the PageRank vector is also a principal left eigenvector of P.
Conclusion: PageRank is the unique stationary distribution of the random surfer Markov Chain.
PageRank(p) = r(p) = probability of the random surfer visiting page p in the limit.
Note: the "random jump" guarantees the Markov Chain is ergodic.
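To make the random-surfer formulation concrete, here is a minimal power-iteration sketch of PageRank in Python (my illustration, not from the slides); the toy graph, the function name pagerank, and the parameter values are assumptions.

```python
import numpy as np

def pagerank(adj, eps=0.15, iters=100):
    """Power-iteration sketch of PageRank on a small dense graph.

    adj[p, q] = 1 if page p links to page q.  eps is the rank-source
    parameter from the slides; the random-jump probability is then
    d = eps / (1 + eps).
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # B = normalized adjacency matrix, with a self-loop at each sink (as on the slide)
    B = np.zeros((n, n))
    for p in range(n):
        if out_deg[p] == 0:
            B[p, p] = 1.0
        else:
            B[p] = adj[p] / out_deg[p]
    d = eps / (1 + eps)                # random-jump probability
    P = (1 - d) * B + d / n            # random-surfer transition matrix
    r = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(iters):
        r = r @ P                      # one surfer step (left multiplication)
        r /= r.sum()                   # keep ||r||_1 = 1
    return r

# toy 3-page graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 1, 0]], dtype=float)
print(pagerank(adj))
```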
14
HITS: Hubs and Authorities [Kleinberg, 1997]
HITS: Hyperlink Induced Topic Search
Main principle: every page p is associated with two scores:
• Authority score: how "authoritative" a page is about the query's topic.
  Ex: query "IR"; authorities: scientific IR papers.
  Ex: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites.
• Hub score: how good the page is as a "resource list" about the query's topic.
  Ex: query "IR"; hubs: surveys and books about IR.
  Ex: query "automobile manufacturers"; hubs: KBB, car link lists.
15
Mutual Reinforcement
HITS principles:
• p is a good authority if it is linked to by many good hubs.
• p is a good hub if it points to many good authorities.
16
HITS: Algebraic Form
a: authority vector; h: hub vector; A: adjacency matrix (A_{p,q} = 1 iff p links to q)
Then: a = A^T·h and h = A·a (up to normalization).
Therefore: a = A^T A·a and h = A A^T·h, hence
• a is a principal eigenvector of A^T A
• h is a principal eigenvector of A A^T
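A minimal sketch (mine, not from the slides) of the mutual-reinforcement iteration a ← A^T·h, h ← A·a with normalization; the function name hits and the toy graph are assumptions.

```python
import numpy as np

def hits(A, iters=50):
    """Iterate a <- A^T h, h <- A a with L2 normalization (HITS sketch)."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h

# toy graph: pages 0 and 1 are hubs pointing at authorities 2 and 3
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
a, h = hits(A)
# a converges to the principal eigenvector of A^T A,
# h to the principal eigenvector of A A^T
```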
17
Co-Citation and Bibliographic Coupling
A^T A: co-citation matrix
• (A^T A)_{p,q} = # of pages that link to both p and q.
• Thus: authority scores propagate through co-citation.
A A^T: bibliographic coupling matrix
• (A A^T)_{p,q} = # of pages that both p and q link to.
• Thus: hub scores propagate through bibliographic coupling.
18
Principal Eigenvector Computation
E: n × n matrix
|λ1| > |λ2| ≥ |λ3| ≥ … ≥ |λn|: eigenvalues of E
Suppose λ1 > 0
v1,…,vn: corresponding eigenvectors; the eigenvectors form an orthonormal basis
Input:
• the matrix E
• a unit vector u, which is not orthogonal to v1
Goal: compute λ1 and v1
19
The Power Method
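The body of this slide (the algorithm itself) did not survive extraction; the following is a sketch of the standard power method under the assumptions on the previous slide (the function name, iteration count, and example matrix are mine).

```python
import numpy as np

def power_method(E, u, iters=100):
    """Estimate the principal eigenpair (lambda_1, v_1) of E.

    Assumes a unique dominant eigenvalue lambda_1 > 0 and a unit start
    vector u that is not orthogonal to v_1.
    """
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w                      # multiply by E ...
        w = w / np.linalg.norm(w)      # ... and renormalize
    lam = w @ (E @ w)                  # Rayleigh quotient estimates lambda_1
    return lam, w

# example: a symmetric matrix, so the eigenvectors form an orthonormal basis
E = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v1 = power_method(E, np.array([1.0, 0.0]))
```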
20
Why Does It Work?
Theorem: As t → ∞, w_t → c · v1 (c is a constant)
• Convergence rate: proportional to (λ2/λ1)^t
• The larger the "spectral gap" |λ1| − |λ2|, the faster the convergence.
21
Spectral Methods in
Information Retrieval
22
Outline
Motivation: synonymy and polysemy
Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD)
LSI via SVD
Why does LSI work?
HITS and SVD
23
Synonymy and Polysemy
Synonymy: multiple terms with (almost) the same meaning
• Ex: cars, autos, vehicles
• Harms recall
Polysemy: a term with multiple meanings
• Ex: java (programming language, coffee, island)
• Harms precision
24
Traditional Solutions
Query expansion
• Synonymy: OR on all synonyms
  Manual/automatic use of thesauri
  Too few synonyms: recall still low
  Too many synonyms: harms precision
• Polysemy: AND on the term and additional specializing terms
  Ex: +java +"programming language"
  Too broad terms: precision still low
  Too narrow terms: harms recall
25
Syntactic Space
D: document collection, |D| = n
T: term space, |T| = m
A_{t,d}: "weight" of t in d (e.g., TF-IDF)
A is an m × n matrix: rows correspond to terms, columns to documents.
A^T A: pairwise document similarities
A A^T: pairwise term similarities
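A tiny illustration (mine, not from the slides) of the syntactic space; the three-term, four-document corpus is an assumption.

```python
import numpy as np

# rows = terms ("car", "auto", "java"), columns = 4 toy documents
A = np.array([[2.0, 0.0, 1.0, 0.0],   # "car"
              [0.0, 1.0, 1.0, 0.0],   # "auto"
              [0.0, 0.0, 0.0, 3.0]])  # "java"

doc_sim  = A.T @ A   # n x n pairwise document similarities (A^T A)
term_sim = A @ A.T   # m x m pairwise term similarities (A A^T)
```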
26
Syntactic Indexing
Index keys: terms
Limitations:
• Synonymy: (near)-identical rows
• Polysemy
• Space inefficiency: the matrix usually is not full rank
Gap between syntax and semantics: the information need is semantic, but the index and query are syntactic.
27
Semantic Space
C: concept space, |C| = r
B_{c,d}: "weight" of c in d
B is an r × n matrix: rows correspond to concepts, columns to documents.
Change of basis; compare to wavelet and Fourier transforms.
28
Latent Semantic Indexing (LSI) [Deerwester et al. 1990]
Index keys: concepts
Documents & queries: mixtures of concepts
Given a query, finds the most similar documents
Bridges the syntax-semantics gap
Space-efficient:
• Concepts are orthogonal
• The matrix is full rank
Questions:
• What is the concept space?
• What is the transformation from the syntax space to the semantic space?
• How to filter out "noise concepts"?
29
Singular Values
A: m×n real matrix
Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv.
u and v are called singular vectors.
Ex: σ = ||A||_2 = max_{||x||_2 = 1} ||Ax||_2. Corresponding singular vectors: the x that maximizes ||Ax||_2 and y = Ax / ||A||_2.
Note: A^T A v = σ²v and A A^T u = σ²u
• σ² is an eigenvalue of A^T A and of A A^T
• v is an eigenvector of A^T A; u is an eigenvector of A A^T
30
Singular Value Decomposition (SVD)
Theorem: For every m×n real matrix A, there exists a singular value decomposition:
A = U Σ V^T
• σ1 ≥ … ≥ σr > 0 (r = rank(A)): singular values of A
• Σ = Diag(σ1,…,σr)
• U: column-orthonormal m×r matrix (U^T U = I)
• V: column-orthonormal n×r matrix (V^T V = I)
31
Singular Values vs. Eigenvalues
A = U Σ V^T
σ1,…,σr: singular values of A
σ1²,…,σr²: non-zero eigenvalues of A^T A and A A^T
u1,…,ur: columns of U
• orthonormal basis for span(columns of A)
• left singular vectors of A
• eigenvectors of A A^T
v1,…,vr: columns of V
• orthonormal basis for span(rows of A)
• right singular vectors of A
• eigenvectors of A^T A
32
LSI as SVD
A = U Σ V^T  ⇒  U^T A = Σ V^T
u1,…,ur: concept basis
B = Σ V^T: the LSI matrix
Ad: d-th column of A; Bd: d-th column of B
Bd = U^T Ad, so Bd[c] = uc^T Ad
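A short sketch (my illustration) of the LSI construction via SVD for the toy matrix used earlier; note that numpy's svd returns V^T directly.

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T
B = U.T @ A                                        # LSI matrix
assert np.allclose(B, np.diag(s) @ Vt)             # B = Sigma V^T
# B[:, d] is the concept mixture of document d: B_d[c] = u_c^T A_d
```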
33
Noisy Concepts
B = U^T A = Σ V^T
Bd[c] = σc · vc[d]
If σc is small, then Bd[c] is small for all d.
k = largest i s.t. σi is "large"
For all c = k+1,…,r and for all d, c is a low-weight concept in d.
Main idea: filter out all concepts c = k+1,…,r
• Space-efficient: # of index terms = k (vs. r or m)
• Better retrieval: noisy concepts are filtered out across the board
34
Low-rank SVD
B = U^T A = Σ V^T
Uk = (u1,…,uk)
Vk = (v1,…,vk)
Σk = upper-left k×k sub-matrix of Σ
Ak = Uk Σk Vk^T
Bk = Σk Vk^T
rank(Ak) = rank(Bk) = k
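A sketch (mine) of the rank-k truncation defined above; the helper name truncate_svd and the choice of k are assumptions.

```python
import numpy as np

def truncate_svd(A, k):
    """Return Ak = Uk Sigma_k Vk^T and the reduced LSI matrix Bk = Sigma_k Vk^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vkt = U[:, :k], s[:k], Vt[:k, :]
    Ak = Uk @ np.diag(sk) @ Vkt        # rank-k approximation of A
    Bk = np.diag(sk) @ Vkt             # k-dimensional document representations
    return Ak, Bk

A = np.random.rand(6, 5)
Ak, Bk = truncate_svd(A, k=2)
```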
35
Low Dimensional Embedding
Frobenius norm: ||A||_F = sqrt(Σ_{i,j} A_{i,j}²)
Fact: ||A − Ak||_F² = σ_{k+1}² + … + σ_r²
Therefore, if σ_{k+1}² + … + σ_r² is small, then for "most" d,d', (Ak)_d^T (Ak)_{d'} ≈ A_d^T A_{d'}.
Ak preserves pairwise similarities among documents at least as well as A does, so it is at least as good for retrieval.
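A quick numerical check (mine, not part of the slides) of the stated fact about the Frobenius error of the rank-k truncation.

```python
import numpy as np

A = np.random.rand(6, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
lhs = np.linalg.norm(A - Ak, 'fro') ** 2   # ||A - Ak||_F^2
rhs = np.sum(s[k:] ** 2)                   # sigma_{k+1}^2 + ... + sigma_r^2
assert np.isclose(lhs, rhs)
```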
36
Computing SVD
Compute the singular values of A by computing the eigenvalues of A^T A.
Compute V and U by computing the eigenvectors of A^T A and A A^T, respectively.
Running time is not too good: O(m²n + mn²); not practical for huge corpora.
Sub-linear time algorithms exist for estimating Ak [Frieze, Kannan, Vempala 1998].
37
HITS and SVD
A: adjacency matrix of a web (sub-)graph G
a: authority vector; h: hub vector
a is a principal eigenvector of A^T A; h is a principal eigenvector of A A^T
Therefore: a and h give A1, the rank-1 SVD of A.
Generalization: using Ak, we can get k authority vectors and k hub vectors, corresponding to other topics in G.
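To illustrate the connection (my check, not from the slides): the HITS vectors coincide, up to sign and normalization, with the top singular vectors of A, and together with σ1 they give the rank-1 SVD A1.

```python
import numpy as np

A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A)
h_svd = U[:, 0]                        # principal left singular vector  ~ hub vector
a_svd = Vt[0, :]                       # principal right singular vector ~ authority vector
A1 = s[0] * np.outer(h_svd, a_svd)     # rank-1 SVD of A
```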
38
Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]
LSI summary:
• Documents are embedded in a low dimensional space (m → k)
• Pairwise similarities are preserved
• More space-efficient
But why is retrieval better?
• Synonymy
• Polysemy
39
Generative Model
A corpus model M = (T, C, W, D)
• T: term space, |T| = m
• C: concept space, |C| = k
  A concept is a distribution over terms.
• W: topic space
  A topic is a distribution over concepts.
• D: document distribution
  A distribution over W × N.
A document d is generated as follows (see the sketch below):
• Sample a topic w and a length n according to D.
• Repeat n times:
  Sample a concept c from C according to w.
  Sample a term t from T according to c.
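A minimal sketch (mine) of sampling one document from such a corpus model; the specific concepts, topics, and probabilities are invented for illustration.

```python
import random

# illustrative model: 2 concepts (distributions over terms),
# 2 topics (distributions over concepts), and a document distribution D
concepts = {
    "motoring": {"car": 0.5, "auto": 0.3, "road": 0.2},
    "software": {"java": 0.6, "code": 0.4},
}
topics = {
    "cars":   {"motoring": 0.9, "software": 0.1},
    "coding": {"motoring": 0.1, "software": 0.9},
}

def sample(dist):
    """Draw one key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_document():
    w = sample({"cars": 0.5, "coding": 0.5})   # sample a topic w according to D ...
    n = random.randint(5, 10)                  # ... and a document length n
    terms = []
    for _ in range(n):
        c = sample(topics[w])                  # sample a concept c according to w
        terms.append(sample(concepts[c]))      # sample a term t according to c
    return w, terms

print(generate_document())
```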
40
Simplifying Assumptions
Every document has a single topic (W = C).
For every two concepts c, c', ||c − c'|| ≥ 1 − ε.
The probability of every term under a concept c is at most some small constant τ.
41
LSI Works
A: m×n term-document matrix, representing n documents generated according to the model
Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d':
• If topic(d) = topic(d'), then (Ak)_d and (Ak)_{d'} are parallel.
• If topic(d) ≠ topic(d'), then (Ak)_d and (Ak)_{d'} are orthogonal.
42
Proof
For simplicity, assume ε = 0. Want to show:
• If topic(d) = topic(d'), then (Ak)_d || (Ak)_{d'}
• If topic(d) ≠ topic(d'), then (Ak)_d ⊥ (Ak)_{d'}
Dc: documents whose topic is the concept c
Tc: terms in supp(c)
Since ||c − c'|| = 1, Tc ∩ Tc' = Ø.
A has non-zeroes only in blocks B1,…,Bk, where Bc is the sub-matrix of A with rows in Tc and columns in Dc.
A^T A is a block-diagonal matrix with blocks B1^T B1,…, Bk^T Bk.
The (i,j)-th entry of Bc^T Bc is the term similarity between the i-th and j-th documents whose topic is the concept c.
Bc^T Bc: adjacency matrix of a bipartite (multi-)graph Gc on Dc.
43
Proof (cont.)
Gc is a "random" graph.
The first and second eigenvalues of Bc^T Bc are well separated.
For all c, c', the second eigenvalue of Bc^T Bc is smaller than the first eigenvalue of Bc'^T Bc'.
Therefore, the top k eigenvalues of A^T A are the principal eigenvalues of Bc^T Bc for c = 1,…,k.
Let u1,…,uk be the corresponding eigenvectors.
For every document d on topic c, Ad is orthogonal to all of u1,…,uk except uc.
Hence, (Ak)_d is a scalar multiple of uc.
44
Extensions [Azar et al. 2001]
A more general generative model.
Also explains the improved treatment of polysemy.
45
End of Lecture 5