Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link...

44
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Transcript of Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link...

Page 1: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Data Mining TechniquesCS 6220 - Section 2 - Spring 2017

Lecture 9: Link AnalysisJan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Page 2: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

• Human-curated (e.g. Yahoo, Looksmart) • Hand-written descriptions • Wait time for inclusion

• Text-search(e.g. WebCrawler, Lycos) • Prone to term-spam

Web search before PageRank

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 3: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Links as Votes

• Pages with more inbound links are more important • Inbound links from important pages carry more weight

Not all pages are equally important

Manyinboundlinks

Few/noinboundlinks

Links fromunimportantpages

Links fromimportantpages

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 4: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Example: PageRank Scores

B 38.4 C

34.3

E 8.1

F 3.9

D 3.9

A 3.3

1.61.6 1.6 1.6 1.6

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 5: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Recursive Formulation

• A link’s vote is proportional to the importance of its source page

• If page j with importance rj has n out-links, each link gets rj / n votes

• Page j’s own importance is the sum of the votes on its in-links

j

ki

rj/3

rj/3rj/3rj = ri/3+rk/4

ri/3 rk/4

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 6: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Equivalent Formulation: Random Surfer

• At time t a surfer is on some page i • At time t+1 the surfer follows a

link to a new page at random • Define rank ri as fraction of time

spent on page i

j

ki

rj/3

rj/3rj/3rj = ri/3+rk/4

ri/3 rk/4

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 7: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: The “Flow” Model

• 3 equations, 3 unknowns • Impose constraint: ry + ra + rm = 1• Solution: ry = 2/5, ra = 2/5, rm = 1/5

∑→

=ji

ij

rrid

y

mara/2

ra/2

rm

ry/2

ry/2

“Flow” equations:

ry = ry /2 + ra /2ra = ry /2 + rm

rm = ra /2

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 8: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: The “Flow” Model

∑→

=ji

ij

rrid

“Flow” equations:

ry = ry /2 + ra /2ra = ry /2 + rm

rm = ra /2

r = M·r

Matrix M is stochastic (i.e. columns sum to one) (adapted from:: Mining of Massive Datasets, http://www.mmds.org)

y

mara/2

ra/2

rm

ry/2

ry/2

Page 9: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Eigenvector Problem

• PageRank: Solve for eigenvector r = M r with eigenvalue λ = 1

• Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ ai = Σ bi)

• Problem: There are billions of pages on the internet. How do we solve for eigenvector with order 1010 elements?

Page 10: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Power IterationModel for random Surfer:

• At time t = 0 pick a page at random • At each subsequent time t follow an

outgoing link at random

Probabilistic interpretation:

Page 11: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Power Iteration

y

maa/2

y/2a/2

m

y/2

pt converges to r. Iterate until |pt - pt -1| < ε(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 12: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Intermezzo: Markov Chains

Ergodicity

Irreducibility

Stationary distribution

(for ergodic chains)

Markov Property

Page 13: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

• PageRank is assumes a random walkmodel for individual surfers

• Equivalent assumption: flow modelin which equal fractions of surfers follow each link at every time

• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk

Aside: Ergodicity

Page 14: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Aside: Ergodicity• PageRank is assumes a random walk

model for individual surfers • Equivalent assumption: flow model

in which equal fractions of surfers follow each link at every time

• Ergodicity: The equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk

Page 15: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Problems

Dead end

1. Dead Ends • Nodes with no outgoing links. • Where do surfers go next?

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Spider trap

2. Spider Traps • Subgraph with no outgoing

links to wider graph • Surfers are “trapped” with

no way out.Not irreducible

Page 16: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Dead Ends

y

maa/2

y/2a/2

y/2

Probability not conserved(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 17: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Dead Ends

y

maa/2

y/2a/2

y/2

Fixes “probability sink” issue

(teleport at dead ends)

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 18: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Spider Traps

y

maa/2

y/2a/2

y/2

Probability accumulates in traps (surfers get stuck)

m

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 19: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Solution: Random TeleportsModel for teleporting random surfer:

• At time t = 0 pick a page at random • At each subsequent time t

• With probability β follow an outgoing link at random

• With probability 1-β teleport to a new initial location at random

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

rj =X

i! j

�ri

di+ (1� �) 1

N

PageRank Equation [Page & Brin 1998]

Page 20: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Teleports

y

ma (can use power iteration as normal)

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 21: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Teleports

y

ma (can use power iteration as normal)

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 22: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Power Iteration: Teleports

y

ma (can use power iteration as normal)

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 23: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Computing PageRank

• M is sparse - only store nonzero entries • Space proportional roughly to number of links • Say 10N, or 4*10*1 billion = 40GB • Still won’t fit in memory, but will fit on disk

0 3 1, 5, 71 5 17, 64, 113, 117, 2452 2 13, 23

source node degree destination nodes

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 24: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Block-based Update Algorithm• Break r new into k blocks that fit in memory • Scan M and r old once for each block

0 4 0, 1, 3, 51 2 0, 52 2 3, 4

src degree destination

01

23

45

012345

rnew rold

M

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 25: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Block-Stripe Update Algorithm

0 4 0, 11 3 02 2 1

src degree destination

01

23

45

012345

rnew

rold

0 4 51 3 52 2 4

0 4 32 2 3

Break M into stripes: Each stripe contains only destination nodes in the corresponding block of r new

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 26: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Problems: Term Spam• How do you make your page appear to be

about movies? • (1) Add the word movie 1,000 times to your page

• Set text color to the background color, so only search engines would see it

• (2) Or, run the query “movie” on your target search engine • See what page came first in the listings • Copy it into your page, make it “invisible”

• These and similar techniques are term spam

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 27: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Google’s Solution to Term Spam• Believe what people say about you, rather

than what you say about yourself • Use words in the anchor text (words that appear

underlined to represent the link) and its surrounding text

• PageRank as a tool to measure the “importance” of Web pages

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 28: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Problems 2: Link Spam• Once Google became the dominant search

engine, spammers began to work out ways to fool Google

• Spam farms were developed to concentrate PageRank on a single page

• Link spam: • Creating link structures that

boost PageRank of a page

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 29: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Link Spamming• Three kinds of web pages from a

spammer’s point of view • Inaccessible pages • Accessible pages

• e.g., blog comments pages (spammer can post links to his pages)

• Owned pages • Completely controlled by spammer • May span multiple domain names

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 30: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Link Farms• Spammer’s goal:

• Maximize the PageRank of target page t

• Technique: • Get as many links from accessible pages

as possible to target page t • Construct “link farm” to get PageRank

multiplier effect

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 31: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Link Farms

Inaccessible

t

Accessible Owned

1

2

M

Oneofthemostcommonandeffective organizationsforalinkfarm

Millions of farm pages

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 32: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

PageRank: Extensions

• Topic-specific PageRank: • Restrict teleportation to some set S

of pages related to a specific topic • Set p0i = 1/|S| if i ∈ S, p0i = 0 otherwise

• Trust Propagation • Use set S of trusted pages

as teleport set

Page 33: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Hidden Markov Models

Page 34: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Time Series with Distinct States

Page 35: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Can we use a Gaussian Mixture Model?Time Series Histogram

MixturePosterior on states

Page 36: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Time Series Histogram

MixturePosterior on states

Can we use a Gaussian Mixture Model?

Page 37: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Estimate from GMM

Hidden Markov ModelsEstimate from HMM

• Idea: Mixture model + Markov chain for states • Can model correlation between subsequent states

(more likely to be in same state than different state)

Page 38: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Reminder: Random Surfers in PageRank

y

maa/2

y/2a/2

m

y/2

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

Page 39: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Reminder: Random Surfers in PageRank

y

maa/2

y/2a/2

m

y/2

(adapted from:: Mining of Massive Datasets, http://www.mmds.org)

A = M>

Page 40: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Hidden Markov ModelsGaussian Mixture Gaussian HMM

Page 41: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Review: Gaussian MixturesExpectation Maximization1. Update cluster probabilities

2. Update parameters

Page 42: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Forward-backward AlgorithmExpectation step for HMM

z1 ⇠ Discrete(⇡)

zt+1|zt = k ⇠ Discrete(Ak)

xt|zt = k ⇠ Normal(µk,�k)

�t,k = p(zt = k |x1:T ,✓)

=p(x1:t, zt)p(xt+1:T |zt)

p(x1:T )

/ ↵t,k�t,k

↵t,l := p(x1:t, zt)

=X

k

p(xt|µl,�l)Akl↵t�1,k

�t,k := p(xt+1:T | zt)

=X

l

�t+1,l p(xt+1|µl,�l)Akl

↵t,l := p(x1:t, zt)

=X

k

p(xt|µl,�l)Akl↵t�1,k

�t,k := p(xt+1:T | zt)

=X

l

�t+1,l p(xt+1|µl,�l)Akl

Page 43: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Parameter UpdatesE-step: Posterior probabilities on Transitions

M-step updates

Page 44: Data Mining Techniques...Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec

Other Examples for HMMsHandwritten Digits

• State 1: Sweeping arc • State 2: Horizontal line

13.2. Hidden Markov Models 615

Figure 13.11 Top row: examples of on-line handwrittendigits. Bottom row: synthetic digits sam-pled generatively from a left-to-right hid-den Markov model that has been trainedon a data set of 45 handwritten digits.

One of the most powerful properties of hidden Markov models is their ability toexhibit some degree of invariance to local warping (compression and stretching) ofthe time axis. To understand this, consider the way in which the digit ‘2’ is writtenin the on-line handwritten digits example. A typical digit comprises two distinctsections joined at a cusp. The first part of the digit, which starts at the top left, has asweeping arc down to the cusp or loop at the bottom left, followed by a second more-or-less straight sweep ending at the bottom right. Natural variations in writing stylewill cause the relative sizes of the two sections to vary, and hence the location of thecusp or loop within the temporal sequence will vary. From a generative perspectivesuch variations can be accommodated by the hidden Markov model through changesin the number of transitions to the same state versus the number of transitions to thesuccessive state. Note, however, that if a digit ‘2’ is written in the reverse order, thatis, starting at the bottom right and ending at the top left, then even though the pen tipcoordinates may be identical to an example from the training set, the probability ofthe observations under the model will be extremely small. In the speech recognitioncontext, warping of the time axis is associated with natural variations in the speed ofspeech, and again the hidden Markov model can accommodate such a distortion andnot penalize it too heavily.

13.2.1 Maximum likelihood for the HMMIf we have observed a data set X = {x1, . . . ,xN}, we can determine the param-

eters of an HMM using maximum likelihood. The likelihood function is obtainedfrom the joint distribution (13.10) by marginalizing over the latent variables

p(X|θ) =!

Z

p(X,Z|θ). (13.11)

Because the joint distribution p(X,Z|θ) does not factorize over n (in contrast to themixture distribution considered in Chapter 9), we cannot simply treat each of thesummations over zn independently. Nor can we perform the summations explicitlybecause there are N variables to be summed over, each of which has K states, re-sulting in a total of KN terms. Thus the number of terms in the summation grows

RNA splicing

• State 1: Exon (relevant) • State 2: Splice site • State 3: Intron (ignored)