Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 9: Link Analysis
Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)
Web search before PageRank
• Human-curated directories (e.g. Yahoo, Looksmart)
  • Hand-written descriptions
  • Wait time for inclusion
• Text search (e.g. WebCrawler, Lycos)
  • Prone to term spam
(adapted from: Mining of Massive Datasets, http://www.mmds.org)
PageRank: Links as Votes
• Pages with more inbound links are more important
• Inbound links from important pages carry more weight
Not all pages are equally important
(Figure: pages with many inbound links vs. few/no inbound links, and links from important pages vs. links from unimportant pages)
Example: PageRank Scores
(Figure: example graph with PageRank scores: B: 38.4, C: 34.3, E: 8.1, F: 3.9, D: 3.9, A: 3.3, and five peripheral pages with 1.6 each)
PageRank: Recursive Formulation
• A link’s vote is proportional to the importance of its source page
• If page j with importance rj has n out-links, each link gets rj / n votes
• Page j’s own importance is the sum of the votes on its in-links
(Figure: page j receives rj = ri/3 + rk/4, since page i has 3 out-links and page k has 4; each of j's own 3 out-links carries rj/3 votes)
Equivalent Formulation: Random Surfer
• At time t, a surfer is on some page i
• At time t+1, the surfer follows a link to a new page at random
• Define rank ri as the fraction of time spent on page i
(Figure: page j receives rj = ri/3 + rk/4, since page i has 3 out-links and page k has 4; each of j's own 3 out-links carries rj/3 votes)
PageRank: The “Flow” Model
"Flow" equation: rj = Σi→j ri / di, where di is the out-degree of page i

(Figure: three-page graph with pages y, a, m; y links to {y, a}, a links to {y, m}, m links to {a})

"Flow" equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

• 3 equations, 3 unknowns
• Impose constraint: ry + ra + rm = 1
• Solution: ry = 2/5, ra = 2/5, rm = 1/5
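The solution above can be checked numerically. A minimal sketch using NumPy (an assumption; the slides show no code), encoding the same three-page graph as a column-stochastic matrix and extracting the eigenvector with eigenvalue 1:

```python
import numpy as np

# Column-stochastic link matrix M for the 3-page graph (y, a, m):
# y links to {y, a}, a links to {y, m}, m links to {a}.
M = np.array([
    [0.5, 0.5, 0.0],   # flow into y: ry/2 + ra/2
    [0.5, 0.0, 1.0],   # flow into a: ry/2 + rm
    [0.0, 0.5, 0.0],   # flow into m: ra/2
])

# Solve r = M r subject to sum(r) = 1: take the eigenvector of M
# with eigenvalue 1 and normalize it to sum to one.
vals, vecs = np.linalg.eig(M)
r = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
r = r / r.sum()
print(r)  # [0.4, 0.4, 0.2], i.e. ry = 2/5, ra = 2/5, rm = 1/5
```

Normalizing by the sum also fixes the arbitrary sign of the eigenvector returned by the solver.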
PageRank: The “Flow” Model
"Flow" equation: rj = Σi→j ri / di

"Flow" equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

In matrix form: r = M·r, with M[j, i] = 1/di if i links to j:

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    1 ]
  m [  0   1/2   0 ]

Matrix M is stochastic (i.e. columns sum to one)
PageRank: Eigenvector Problem
• PageRank: solve for the eigenvector r = M·r with eigenvalue λ = 1
• An eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (if a = M·b, then Σi ai = Σi bi)
• Problem: there are billions of pages on the internet. How do we solve for an eigenvector with on the order of 10^10 elements?
PageRank: Power Iteration

Model for random surfer:
• At time t = 0, pick a page at random
• At each subsequent time t, follow an outgoing link at random

Probabilistic interpretation: start from p0 = (1/N, …, 1/N) and iterate pt+1 = M·pt
(Figure: power-iteration updates on the three-page graph y, a, m)

pt converges to r. Iterate until |pt − pt−1| < ε.
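The iteration just described can be sketched in a few lines of NumPy (an assumption; the slides show no code). The stopping rule uses the L1 distance between successive iterates, as on the slide:

```python
import numpy as np

def power_iteration(M, eps=1e-8, max_iter=1000):
    """Iterate p_{t+1} = M p_t from the uniform distribution until
    |p_t - p_{t-1}| < eps. M must be column-stochastic."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)          # surfer starts on a page chosen at random
    for _ in range(max_iter):
        p_next = M @ p               # one step of the random surfer
        if np.abs(p_next - p).sum() < eps:
            return p_next
        p = p_next
    return p

# Same three-page graph (y, a, m) as in the flow-model slides.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))   # converges to [0.4, 0.4, 0.2]
```

This graph is irreducible and aperiodic, so the iteration converges to the same r = (2/5, 2/5, 1/5) found by solving the flow equations directly.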
Intermezzo: Markov Chains
Key concepts: Markov property, irreducibility, ergodicity, stationary distribution (for ergodic chains)

• PageRank assumes a random-walk model for individual surfers
• Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time step
• Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution of an individual random walk
PageRank: Problems
1. Dead ends
• Nodes with no outgoing links
• Where do surfers go next?

2. Spider traps
• Subgraph with no outgoing links to the wider graph
• Surfers are "trapped" with no way out (the chain is not irreducible)
Power Iteration: Dead Ends
(Figure: the three-page graph with m's out-link removed)

Probability is not conserved.
Power Iteration: Dead Ends
(Figure: the dead end at m replaced by teleports to all pages)

Teleporting at dead ends fixes the "probability sink" issue.
Power Iteration: Spider Traps
(Figure: the three-page graph where m links only to itself)

Probability accumulates in traps (surfers get stuck).
Solution: Random Teleports

Model for teleporting random surfer:
• At time t = 0, pick a page at random
• At each subsequent time t:
  • With probability β, follow an outgoing link at random
  • With probability 1−β, teleport to a new location at random
PageRank Equation [Page & Brin 1998]:

rj = Σi→j β·ri/di + (1−β)·1/N
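The teleporting update can be sketched directly from the equation above (a NumPy sketch, not the slides' own code; β = 0.85 is a conventional choice, not one the slides state):

```python
import numpy as np

def pagerank(M, beta=0.85, eps=1e-8, max_iter=1000):
    """Power iteration for r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N.
    M is the column-stochastic link matrix (M[j, i] = 1/d_i if i links to j)."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    teleport = np.full(n, (1.0 - beta) / n)   # uniform teleport term
    for _ in range(max_iter):
        r_next = beta * (M @ r) + teleport
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Same three-page graph (y, a, m) as before.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(pagerank(M))  # close to [0.4, 0.4, 0.2], smoothed toward uniform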
Power Iteration: Teleports
(Figure: the three-page graph with teleport links added; can use power iteration as normal)
Computing PageRank
• M is sparse: only store nonzero entries
• Space is roughly proportional to the number of links
• Say 10 links per page: with N = 1 billion pages and 4-byte entries, 4 × 10 × 10^9 ≈ 40 GB
• Still won't fit in memory, but will fit on disk
source node | degree | destination nodes
0 | 3 | 1, 5, 7
1 | 5 | 17, 64, 113, 117, 245
2 | 2 | 13, 23
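One pass of the update using this sparse layout might look as follows (a Python sketch; the node ids and links mirror the illustrative table above, which is itself made up for the slide):

```python
# Sparse on-disk layout from the slide: source -> (out-degree, destinations).
edges = {
    0: (3, [1, 5, 7]),
    1: (5, [17, 64, 113, 117, 245]),
    2: (2, [13, 23]),
}

def sparse_step(r_old, edges, n, beta=0.85):
    """One update r_new = beta * M r_old + (1 - beta)/n, scanning the
    edge list once instead of materializing the dense matrix M."""
    r_new = [(1.0 - beta) / n] * n
    for src, (degree, dests) in edges.items():
        share = beta * r_old[src] / degree   # each out-link gets an equal share
        for dst in dests:
            r_new[dst] += share
    return r_new

n = 246                    # highest destination id above is 245
r_old = [1.0 / n] * n
r_new = sparse_step(r_old, edges, n)
```

Each iteration reads M and r_old sequentially and writes r_new, which is the access pattern that makes the disk-resident layout workable.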
Block-based Update Algorithm
• Break r_new into k blocks that fit in memory
• Scan M and r_old once for each block

src | degree | destination
0 | 4 | 0, 1, 3, 5
1 | 2 | 0, 5
2 | 2 | 3, 4

(Figure: r_new split into blocks of entries {0, 1}, {2, 3}, {4, 5}; r_old and M are scanned once per block)
Block-Stripe Update Algorithm
Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new.

Stripe for block {0, 1}:
src | degree | destination
0 | 4 | 0, 1
1 | 3 | 0
2 | 2 | 1

Stripe for block {2, 3}:
src | degree | destination
0 | 4 | 3
2 | 2 | 3

Stripe for block {4, 5}:
src | degree | destination
0 | 4 | 5
1 | 3 | 5
2 | 2 | 4
Problems: Term Spam
• How do you make your page appear to be about movies?
• (1) Add the word "movie" 1,000 times to your page
  • Set the text color to the background color, so only search engines see it
• (2) Or, run the query "movie" on your target search engine
  • See which page comes first in the listings
  • Copy it into your page and make it "invisible"
• These and similar techniques are term spam
Google's Solution to Term Spam
• Believe what people say about you, rather than what you say about yourself
  • Use the words in the anchor text (the words that appear underlined to represent the link) and its surrounding text
• Use PageRank as a tool to measure the "importance" of web pages
Problems 2: Link Spam
• Once Google became the dominant search engine, spammers began to work out ways to fool it
• Spam farms were developed to concentrate PageRank on a single page
• Link spam: creating link structures that boost the PageRank of a page
Link Spamming
• Three kinds of web pages from a spammer's point of view:
  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages (the spammer can post links to his pages)
  • Owned pages
    • Completely controlled by the spammer
    • May span multiple domain names
Link Farms
• Spammer's goal: maximize the PageRank of target page t
• Technique:
  • Get as many links as possible from accessible pages to target page t
  • Construct a "link farm" to get a PageRank multiplier effect
Link Farms
(Figure: inaccessible and accessible pages link to target page t; t links to millions of owned farm pages 1, 2, …, M, which link back to t. One of the most common and effective organizations for a link farm.)
PageRank: Extensions
• Topic-specific PageRank:
  • Restrict teleportation to a set S of pages related to a specific topic
  • Set p0,i = 1/|S| if i ∈ S, p0,i = 0 otherwise
• Trust propagation:
  • Use a set S of trusted pages as the teleport set
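Topic-specific teleportation is a one-line change to the teleport vector. A NumPy sketch (an assumption; the teleport set S = {0} and β = 0.85 are illustrative choices, not from the slides):

```python
import numpy as np

def topic_pagerank(M, S, beta=0.85, iters=100):
    """Topic-specific PageRank sketch: teleports land only on the
    topic set S, instead of uniformly on all pages."""
    n = M.shape[0]
    teleport = np.zeros(n)
    teleport[list(S)] = (1.0 - beta) / len(S)   # p0_i = 1/|S| for i in S, else 0
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + teleport
    return r

# Same three-page graph (y, a, m); teleport only to page y.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = topic_pagerank(M, S={0})
print(r)  # page y (index 0) is boosted relative to ordinary PageRank
```

Trust propagation uses the identical code with S chosen as a set of trusted pages.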
Hidden Markov Models
Time Series with Distinct States
Can we use a Gaussian Mixture Model?

(Figures: time series, histogram, fitted mixture, posterior on states, and the resulting state estimate from the GMM)

Hidden Markov Models

(Figure: state estimate from the HMM)
• Idea: mixture model + Markov chain for states
• Can model correlation between subsequent states (more likely to be in the same state than a different state)
Reminder: Random Surfers in PageRank
(Figure: the three-page random-surfer graph y, a, m)

A = Mᵀ: the HMM transition matrix is the transpose of PageRank's column-stochastic matrix M, so its rows sum to one.
Hidden Markov Models: Gaussian Mixture vs. Gaussian HMM

Review: Gaussian Mixtures
Expectation Maximization:
1. Update cluster probabilities
2. Update parameters
Forward-backward Algorithm
Expectation step for HMM

Model:
z1 ∼ Discrete(π)
zt+1 | zt = k ∼ Discrete(Ak)
xt | zt = k ∼ Normal(μk, σk)

Posterior on states:
γt,k = p(zt = k | x1:T, θ) = p(x1:t, zt) p(xt+1:T | zt) / p(x1:T) ∝ αt,k βt,k

Forward recursion:
αt,l := p(x1:t, zt = l) = p(xt | μl, σl) Σk Akl αt−1,k

Backward recursion:
βt,k := p(xt+1:T | zt = k) = Σl βt+1,l p(xt+1 | μl, σl) Akl
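The recursions above can be sketched directly for a Gaussian HMM (a NumPy sketch, not the course's own code; no rescaling is applied, so it is only suitable for short sequences, and the example data and parameters are made up):

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Univariate normal density p(x | mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def forward_backward(x, pi, A, mu, sigma):
    """E-step for a Gaussian HMM. A[k, l] = p(z_{t+1} = l | z_t = k)."""
    T, K = len(x), len(pi)
    e = gaussian(x[:, None], mu[None, :], sigma[None, :])   # emissions, T x K
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * e[0]
    for t in range(1, T):           # alpha_{t,l} = e_{t,l} * sum_k A_{kl} alpha_{t-1,k}
        alpha[t] = e[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):  # beta_{t,k} = sum_l beta_{t+1,l} e_{t+1,l} A_{kl}
        beta[t] = A @ (beta[t + 1] * e[t + 1])
    gamma = alpha * beta            # gamma_{t,k} proportional to alpha_{t,k} beta_{t,k}
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Toy two-state sequence: three low observations, then three high ones.
x = np.array([-1.0, -1.1, -0.9, 1.0, 1.1, 0.9])
gamma = forward_backward(x, pi=np.array([0.5, 0.5]),
                         A=np.array([[0.9, 0.1], [0.1, 0.9]]),
                         mu=np.array([-1.0, 1.0]), sigma=np.array([0.5, 0.5]))
print(gamma.argmax(axis=1))  # most likely state at each time: [0 0 0 1 1 1]
```

A production implementation would rescale alpha and beta at each step (or work in log space) to avoid underflow on long sequences.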
Parameter Updates
• E-step: posterior probabilities on transitions
• M-step updates
Other Examples for HMMs

Handwritten Digits
• State 1: Sweeping arc
• State 2: Horizontal line
Figure 13.11: Top row: examples of on-line handwritten digits. Bottom row: synthetic digits sampled generatively from a left-to-right hidden Markov model that has been trained on a data set of 45 handwritten digits.
One of the most powerful properties of hidden Markov models is their ability to exhibit some degree of invariance to local warping (compression and stretching) of the time axis. To understand this, consider the way in which the digit '2' is written in the on-line handwritten digits example. A typical digit comprises two distinct sections joined at a cusp. The first part of the digit, which starts at the top left, has a sweeping arc down to the cusp or loop at the bottom left, followed by a second more-or-less straight sweep ending at the bottom right. Natural variations in writing style will cause the relative sizes of the two sections to vary, and hence the location of the cusp or loop within the temporal sequence will vary. From a generative perspective such variations can be accommodated by the hidden Markov model through changes in the number of transitions to the same state versus the number of transitions to the successive state. Note, however, that if a digit '2' is written in the reverse order, that is, starting at the bottom right and ending at the top left, then even though the pen tip coordinates may be identical to an example from the training set, the probability of the observations under the model will be extremely small. In the speech recognition context, warping of the time axis is associated with natural variations in the speed of speech, and again the hidden Markov model can accommodate such a distortion and not penalize it too heavily.
13.2.1 Maximum likelihood for the HMM

If we have observed a data set X = {x1, …, xN}, we can determine the parameters of an HMM using maximum likelihood. The likelihood function is obtained from the joint distribution (13.10) by marginalizing over the latent variables

p(X|θ) = Σ_Z p(X,Z|θ).    (13.11)

Because the joint distribution p(X,Z|θ) does not factorize over n (in contrast to the mixture distribution considered in Chapter 9), we cannot simply treat each of the summations over zn independently. Nor can we perform the summations explicitly because there are N variables to be summed over, each of which has K states, resulting in a total of K^N terms. Thus the number of terms in the summation grows
RNA splicing
• State 1: Exon (relevant) • State 2: Splice site • State 3: Intron (ignored)