Transcript of: Clustering in kernel embedding spaces and organization of documents
-
Clustering in kernel embedding spaces and organization of documents
Stéphane Lafon
Collaborators:
Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)
-
Motivation
Data everywhere:
• Scientific databases: biomedical/physics
• The web (web search, ads targeting, ads spam/fraud detection…)
• Books: Conventional libraries + {Amazon, Google Print…}
• Military applications (Intelligence, ATR, undersea mapping, situational awareness…)
• Corporate and financial data
• …
-
Motivation
Tremendous need for automated ways of
• Organizing this information (clustering, parametrization of data sets, adapted metrics on the data)
• Extracting knowledge (learning, clustering, pattern recognition, dimensionality reduction)
• Making it useful to end-users (dimensionality reduction)
All of this in the most efficient way.
-
Challenges
• Data sets are high-dimensional: need for dimension reduction
• Data sets are nonlinear: need to learn these nonlinearities
• Data are likely to be noisy: need for robustness
• Data sets are massive.
-
Highlights of this talk:
• Review the diffusion framework: parametrization of data sets, dimensionality reduction, diffusion metric
• Present a few results on clustering in the diffusion space
• Extend these results to non-symmetric graphs: beyond diffusions.
-
From data sets to graphs
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set.
Construct a graph $G = (X, W)$ where each point corresponds to a node, and every two nodes $x$ and $y$ are connected by an edge with a non-negative weight $w(x, y)$.
-
Choice of the weight matrix
The choice of the weight is crucial and application-driven.
The quantity $w(x, y)$ should reflect the degree of similarity or interaction between $x$ and $y$.
Examples:
• Take a correlation matrix, and throw away values that are not close to 1.
• If we have a distance metric $d(x, y)$ on the data, consider using $w(x, y) = e^{-d(x, y)^2 / \varepsilon}$.
• Sometimes, the data already comes in the form of a graph (e.g. social networks).
In practice, a difficult problem! Feature selection, dynamical data sets...
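As an illustration of the Gaussian choice above, here is a minimal sketch (Python/NumPy, with an assumed toy data set and an arbitrary scale $\varepsilon$) of building the weight matrix:

```python
import numpy as np

def gaussian_weights(X, eps):
    """Pairwise affinities w(x, y) = exp(-d(x, y)^2 / eps) for the rows of X.

    The scale eps is application-dependent; a common heuristic (an
    assumption here, not prescribed by the slides) is the median of the
    squared pairwise distances.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / eps)

# Toy example: 5 points in the plane, two loose groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0]])
W = gaussian_weights(X, eps=0.5)   # symmetric, non-negative, ones on the diagonal
```

Nearby points receive weights close to 1 and distant points weights close to 0, which is exactly the "degree of similarity" the slide asks for.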
-
Markov chain on the data
Define the degree of node $x$ as $d(x) = \sum_{z \in X} w(x, z)$.
Form the $n \times n$ matrix $P$ with entries $p(x, y) = \frac{w(x, y)}{d(x)}$; in other words, $P = D^{-1} W$.
Because $\sum_{y \in X} p(x, y) = 1$ and $p(x, y) \ge 0$, $P$ is the transition matrix of a Markov chain on the graph of the data.
$I - P$ is called the "normalized graph Laplacian" [Chung'97].
Notation: the entries of $P^t$ are denoted $p_t(x, y)$.
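The row normalization $P = D^{-1} W$ just described is a one-liner; a sketch on an assumed toy weight matrix:

```python
import numpy as np

# Normalize a symmetric non-negative weight matrix W into the Markov
# transition matrix P = D^{-1} W, where D = diag(d) holds the degrees.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])
d = W.sum(axis=1)            # degrees d(x)
P = W / d[:, None]           # p(x, y) = w(x, y) / d(x)
```

Each row of `P` sums to 1, so `P` is a valid Markov transition matrix.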
-
Powers of P
Main idea: the behavior of the Markov chain over different time scales will reveal geometric structures of the data.
-
Time parameter t
$p_t(x, y)$ is the probability of transition from $x$ to $y$ in $t$ time steps. Therefore, it is close to 1 if $y$ is easily reachable from $x$ in $t$ steps. This happens if there are many paths connecting these two points.
$t$ defines the granularity of the analysis. Increasing the value of $t$ is a way to integrate the local geometric information: transitions are likely to happen between similar data points and occur rarely otherwise.
-
Assumptions on the weights
Our object of interest: the powers $P^t$ of $P$.
We now assume that the weight function is symmetric, i.e., $w(x, y) = w(y, x)$, so that the graph is symmetric.
-
Spectral decomposition of P
With these conditions, $P$ has a sequence of eigenvalues $1 = \lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{n-1}$ and a collection of corresponding (right) eigenvectors $\{\psi_m\}$ satisfying $P^t \psi_m = \lambda_m^t \psi_m$.
In addition, it can be checked that $\lambda_0 = 1$ and $\psi_0 \equiv \mathbf{1} = (1, 1, \ldots, 1)^\top$.
-
Diffusion coordinates
Each eigenvector $\psi_m$ is a function defined on the data points. Therefore, we can think of the (right) eigenvectors $\{\psi_m\}$ as forming a set of coordinates on the data set $X$.
For any choice of $t \ge 0$, define the mapping
$$\Psi_t : x \mapsto \begin{pmatrix} \lambda_1^t \psi_1(x) \\ \lambda_2^t \psi_2(x) \\ \vdots \\ \lambda_{n-1}^t \psi_{n-1}(x) \end{pmatrix}$$
[Figure: $\Psi_t$ maps the original space into the diffusion space.]
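A sketch of computing these diffusion coordinates numerically, assuming the symmetric kernel of the previous slides. It uses the symmetric conjugate $S = D^{-1/2} W D^{-1/2}$, which shares its eigenvalues with $P = D^{-1} W$, so a symmetric eigensolver applies:

```python
import numpy as np

def diffusion_map(W, t, n_coords):
    """Diffusion coordinates x -> (lambda_m^t psi_m(x)), m = 1..n_coords.

    The right eigenvectors of P = D^{-1} W are psi_m = D^{-1/2} v_m,
    where v_m are the (orthonormal) eigenvectors of S = D^{-1/2} W D^{-1/2}.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(S)             # ascending eigenvalues
    order = np.argsort(vals)[::-1]             # lambda_0 >= lambda_1 >= ...
    vals, vecs = vals[order], vecs[:, order]
    psi = d_inv_sqrt[:, None] * vecs           # right eigenvectors of P
    # Skip m = 0 (lambda_0 = 1, constant eigenvector); keep n_coords terms.
    return (vals[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]

# Toy usage: 6 points on a line, embedded in 2 diffusion coordinates.
W = np.exp(-((np.arange(6)[:, None] - np.arange(6)[None, :]) ** 2) / 4.0)
emb = diffusion_map(W, t=2, n_coords=2)        # one 2-D coordinate per point
```

Since $|\lambda_m| \le 1$, the coordinates shrink as $t$ grows, which is what makes the truncation of the next slides possible.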
-
Over the past 5 years, new techniques have emerged for manifold learning
• Isomap [Tenenbaum-DeSilva-Langford’00]
• L.L.E. [Roweis-Saul’00]
• Laplacian eigenmaps [Belkin-Niyogi’01]
• Hessian eigenmaps [Donoho-Grimes’03]
• …
They all aim at finding coordinates on data sets by computing the eigenfunctions of a psd matrix.
-
Diffusion coordinates
• Data set: collection of unordered images of handwritten digits "1".
• Weight kernel: $w(x, y) = e^{-\|x - y\|^2 / \varepsilon}$, where $\|x - y\|$ is the $L^2$ distance between two images $x$ and $y$.
-
Diffusion coordinates
-
Diffusion distance
Meaning of the mapping $\Psi_t$: two data points $x$ and $y$ are mapped to $\Psi_t(x)$ and $\Psi_t(y)$ so that the distance between them is equal to the so-called "diffusion distance":
$$\|\Psi_t(x) - \Psi_t(y)\| = D_t(x, y) = \|p_t(x, \cdot) - p_t(y, \cdot)\|_{L^2(X, 1/\pi)}$$
Two points are close in this metric if they are highly connected in the graph.
[Figure: Euclidean distance in the diffusion space corresponds to the diffusion metric in the original space.]
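This identity can be checked numerically. A sketch, under the assumptions that $\pi(z) = d(z)/\sum_x d(x)$ is the stationary distribution of $P$ and that all $n - 1$ non-trivial coordinates are kept:

```python
import numpy as np

# Toy data and Markov matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq)
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()                       # stationary distribution of P
t = 3

# Full diffusion embedding: all eigenvectors, normalized in L^2(pi).
d_is = 1.0 / np.sqrt(d)
S = d_is[:, None] * W * d_is[None, :]
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
vals = vals[order]
psi = d_is[:, None] * vecs[:, order]
psi /= np.sqrt((pi[:, None] * psi ** 2).sum(axis=0))   # ||psi_m||_{L^2(pi)} = 1
emb = (vals ** t) * psi

# Diffusion distance between two points, straight from the definition.
Pt = np.linalg.matrix_power(P, t)
x, y = 0, 5
diff_dist = np.sqrt((((Pt[x] - Pt[y]) ** 2) / pi).sum())
embed_dist = np.linalg.norm(emb[x] - emb[y])   # matches diff_dist
```

The constant eigenvector contributes nothing to the difference, so keeping or dropping the $m = 0$ coordinate does not change the distance.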
-
Diffusion distance
The diffusion metric measures proximity in terms of connectivity in the graph. In particular,
• It is useful to detect and characterize clusters.
• It allows one to develop learning algorithms based on the preponderance of evidence.
• It is very robust to noise, unlike the geodesic distance.
[Figure: a noisy two-cluster data set with marked points A and B; histograms of the geodesic distance vs. the diffusion distance between A and B show that the diffusion distance is far more stable under noise.]
Resistance distance, mean commute time, Green’s function…
-
Dimension reduction
Since $1 = \lambda_0 \ge \lambda_1 \ge \cdots \ge 0$, not all terms are numerically significant in
$$\Psi_t : x \mapsto \begin{pmatrix} \lambda_1^t \psi_1(x) \\ \lambda_2^t \psi_2(x) \\ \vdots \\ \lambda_{n-1}^t \psi_{n-1}(x) \end{pmatrix}$$
Therefore, for a given value of $t$, one only needs to embed using the coordinates for which $\lambda_m^t$ is non-negligible.
The key to dimensionality reduction: decay of the spectrum of the powers of the transition matrix.
[Figure: eigenvalue spectra of $P$, $P^2$, $P^4$, $P^8$, $P^{16}$; the spectrum decays increasingly fast as the power grows.]
-
Example: image analysis
1. Take an image
2. Form the set of all 7x7 patches
3. Compute the diffusion maps
Parametrization
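Step 2 above can be sketched as follows (a toy random image stands in for a real one):

```python
import numpy as np

# Collect all 7x7 patches of an image as rows of a data matrix:
# one 49-dimensional data point per patch.
rng = np.random.default_rng(1)
img = rng.random((32, 32))             # placeholder for a real image
k = 7
patches = np.array([
    img[i:i + k, j:j + k].ravel()
    for i in range(img.shape[0] - k + 1)
    for j in range(img.shape[1] - k + 1)
])                                     # shape: ((32-7+1)^2, 49)
```

The diffusion maps of step 3 are then computed on this patch cloud exactly as on any other data set.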
-
Example: image analysis
[Figure: 3D diffusion embeddings of the patch set, regular diffusion maps vs. density-normalized diffusion maps.]
Now, if we drop some patches…
[Figure: the same embeddings after dropping patches, regular vs. density-normalized diffusion maps.]
-
Example: Molecular dynamics (with Raphy R. Coifman, Ioannis Kevrekidis, Mauro Maggioni and Boaz Nadler)
Output of the simulation of a molecule of 7 atoms in a solvent. The molecule evolves in a potential, and is also subject to interactions with the solvent.
Data set: millions of snapshots of the molecule (coordinates of atoms).
-
Graph matching and data alignment(with Yosi Keller)
Suppose we have two data sets $X$ and $Y$ with approximately the same geometry.
Question: how can we match/align/register $X$ and $Y$?
Since we have coordinates for $X$ and $Y$, we can embed both sets, and align them in diffusion space.
-
Graph matching and data alignment
Illustration: moving heads.
The heads of 3 subjects wearing strange masks are recorded in 3 movies.
We obtain 3 data sets where each frame is a data point.
Because this is a constrained system (head-neck mechanical articulation), all 3 sets exhibit approximately the same geometry.
-
Graph matching and data alignment
We compute 3 density-invariant embeddings.
[Figure: 3D diffusion embeddings of the yellow, red, and black data sets.]
We then align the 3 data sets in diffusion spaceusing a limited number of landmarks.
-
Graph matching and data alignment
-
Science News articles (reloaded ☺)
• Data set: collection of documents from Science News.
• Data modeling: we form a mutual information feature matrix $Q$. For each document $i$ and word $j$, we compute the information $Q(i, j)$ shared by $i$ and $j$:
$$Q(i, j) = \log\left( \frac{N_{i,j} / N_i}{N_j / N} \right)$$
where
$N_{i,j}$: number of occurrences of word $j$ in document $i$,
$N_i$: total number of words in document $i$,
$N_j$: total number of occurrences of word $j$,
$N$: total number of words in all documents.
• We can work on the rows of $Q$ in order to organize documents, or on its columns to organize words. ($Q$ has one row per document $i$ and one column per word $j$, with entry $Q(i, j)$.)
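A sketch of building $Q$ from a document-by-word count matrix. The counts are toy values, and how to handle zero counts is an implementation choice the slides do not specify; here zero-count entries are simply set to 0:

```python
import numpy as np

# Mutual-information feature matrix Q(i, j) = log( (N_ij / N_i) / (N_j / N) ).
counts = np.array([[4, 0, 1],
                   [1, 3, 0],
                   [0, 1, 5]], dtype=float)   # rows: documents, cols: words
N_i = counts.sum(axis=1, keepdims=True)       # total words per document
N_j = counts.sum(axis=0, keepdims=True)       # total occurrences of each word
N = counts.sum()                              # total words in all documents
with np.errstate(divide="ignore"):
    Q = np.log((counts / N_i) / (N_j / N))
Q[counts == 0] = 0.0                          # convention for absent words
```

Positive entries mark words overrepresented in a document relative to the whole collection, which is what makes the rows of $Q$ useful document signatures.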
-
Clustering and data subsampling
[TODO: Matlab demo]
Organizing documents: we work on the rows of $Q$.
Automatic extraction of themes, and organization of documents according to these themes. We can work at different levels of granularity by changing the number of diffusion coordinates considered.
-
Clustering of documents in diffusion space (with Ann B. Lee)
In the diffusion space, we can use classical geometric objects:
• can use hyperplanes
• can use Euclidean distance
• can use balls
• can apply usual geometric algorithms: k-means, coverings…
We now show that clustering in the diffusion space is equivalent to compressing the diffusion matrix P.
-
Learning nonlinear structures
[Figure: k-means in the original space vs. k-means in the diffusion space on data with nonlinear cluster structure (clusters labeled 1, 2, 3).]
-
Coarsening of the graph and Markov chain lumping
Given an arbitrary partition $\{S_i\}_{1 \le i \le k}$ of the nodes into $k$ subsets, we coarsen the graph $G$ into a new graph $\widetilde{G}$ with $k$ meta-nodes. Each meta-node is identified with a subset $S_i$.
• The new edge-weight function is defined by
$$\widetilde{w}(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y)$$
• We form a new Markov chain on this new graph:
$$\widetilde{d}(S_i) = \sum_{1 \le j \le k} \widetilde{w}(S_i, S_j), \qquad \widetilde{p}(S_i, S_j) = \frac{\widetilde{w}(S_i, S_j)}{\widetilde{d}(S_i)}$$
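The coarsening above amounts to two matrix products with the indicator matrix of the partition; a sketch on an assumed toy 4-node graph:

```python
import numpy as np

def coarsen(W, labels, k):
    """Coarsen weights under a partition: w~(S_i, S_j) = sum of w(x, y)
    over x in S_i, y in S_j; the coarse chain renormalizes by row sums."""
    Q = np.eye(k)[labels]                  # Q[x, i] = 1 iff node x is in S_i
    W_c = Q.T @ W @ Q                      # w~(S_i, S_j)
    P_c = W_c / W_c.sum(axis=1, keepdims=True)   # p~(S_i, S_j)
    return W_c, P_c

W = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])            # S_0 = {0, 1}, S_1 = {2, 3}
W_c, P_c = coarsen(W, labels, k=2)
```

Here $\widetilde{w}(S_0, S_0) = 4$, $\widetilde{w}(S_0, S_1) = 2$, $\widetilde{w}(S_1, S_1) = 6$, and each row of the coarse transition matrix again sums to 1.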
-
Coarsening of the graph and Markov chain lumping
[Figure: a graph with edge weights $w(x, y)$ is coarsened into meta-nodes $S_1$, $S_2$, $S_3$ with aggregated weights such as $\widetilde{w}(S_1, S_3)$.]
-
Equivalence with matrix approximation problem
The new Markov matrix $\widetilde{P}$ is $k \times k$, and it is an approximation of $P$. Are the spectral properties of $P$ preserved by the coarsening operation?
Define the approximate eigenvectors $\widetilde{\psi}_l$ of $\widetilde{P}$ by
$$\widetilde{\psi}_l(S_i) = \frac{\sum_{x \in S_i} \pi(x) \psi_l(x)}{\sum_{x \in S_i} \pi(x)}$$
These are not actual eigenvectors of $\widetilde{P}$, but only approximations. The quality of the approximation depends on the choice of the partition $\{S_i\}$.
-
Error of approximation
Proposition: for all $0 \le l \le k - 1$,
$$\left\| \widetilde{P} \widetilde{\psi}_l - \lambda_l \widetilde{\psi}_l \right\| \le D$$
where $D$ is the distortion in the diffusion space:
$$D = \left( \sum_{1 \le i \le k} \widetilde{\pi}(S_i) \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)}{\widetilde{\pi}(S_i)} \frac{\pi(z)}{\widetilde{\pi}(S_i)} \left\| \Psi_t(x) - \Psi_t(z) \right\|^2 \right)^{1/2}$$
This is the quantization distortion induced by $\{S_i\}$, where each point $x$ carries a mass $\pi(x)$.
This result justifies the use of k-means in the diffusion space (Meila-Shi) and of coverings for data clustering. Any clustering scheme that aims at minimizing this distortion preserves the spectral properties of the Markov chain.
-
Clustering of words: automatic lexicon organization
Organizing words: we work on the columns of $Q$. We can subsample the word collection by quantizing the diffusion space.
Clusters can be interpreted as "wordlets", or semantic units. The centers of the clusters are representative of a concept.
Automatic extraction of concepts from a collection of documents.
-
Clustering and data subsampling
-
Clustering and data subsampling
-
Going multiscale
[Figure: data sets and their diffusion embeddings $(\lambda_1^t \Psi_1, \lambda_2^t \Psi_2, \lambda_3^t \Psi_3)$ at $t = 1$, $t = 16$, and $t = 64$; the embedding reveals coarser structure as $t$ increases.]
-
Beyond diffusions
• The previous ideas can be generalized to the wider setting of a directed graph, with signed edge-weights $w(\cdot, \cdot)$ and a set of positive node-weights $\pi(\cdot)$.
• This time, we consider oriented interactions between data points (e.g. source-target). No spectral theory available!
• The node-weight can be used to rescale the relative importance of each data point (e.g. a density estimate).
• This framework contains diffusion matrices as a special case: there, the node-weight is equal to the out-degree of each node.
-
Coarsening
This time we have no spectral embedding.
However, for a given partitioning $\{S_i\}$ of the nodes, the same coarsening procedure can still be applied to the graph:
$$\widetilde{w}(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y) \quad \text{and} \quad \widetilde{\pi}(S_i) = \sum_{x \in S_i} \pi(x)$$
[Figure: a directed graph with edge weights $w(x, y)$ is coarsened into meta-nodes $S_1$, $S_2$, $S_3$ with weights such as $\widetilde{w}(S_1, S_2)$.]
-
Beyond spectral embeddings
Idea: form the matrix $A = \Pi^{-1} W$ and view each row $a(x, \cdot)$ as a feature vector representing $x$. The graph $G$ now lives as a cloud of $n$ points in $\mathbb{R}^n$.
Similarly, for the coarser graph $\widetilde{G}$, form $\widetilde{A} = \widetilde{\Pi}^{-1} \widetilde{W}$ and obtain a cloud of $k$ points in $\mathbb{R}^k$.
Question: how do these point sets compare?
-
Comparison between graphs
$A$ is $n \times n$, whereas $\widetilde{A}$ is $k \times k$. They can be related by the $n \times k$ matrix $Q$ whose rows are the indicators of the partitioning: $\widetilde{f}^\top = f^\top Q$ maps functions on the fine graph to functions on the coarse graph.
In other words, this diagram should approximately commute:
$$Q \widetilde{A} \approx A Q$$
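This near-commutation can be checked directly; a sketch on an assumed toy graph and partition (the node weights are taken to be the out-degrees, i.e. the diffusion special case, and the same code applies verbatim to directed weight matrices):

```python
import numpy as np

# Commuting-diagram test Q A~ vs. A Q for a coarsened graph.
W = np.array([[0, 1, 2, 0],
              [1, 0, 0, 2],
              [2, 0, 0, 1],
              [0, 2, 1, 0]], dtype=float)
pi = W.sum(axis=1)                     # node weights (here: out-degrees)
A = W / pi[:, None]                    # A = Pi^{-1} W

labels = np.array([0, 0, 1, 1])        # S_0 = {0, 1}, S_1 = {2, 3}
Q = np.eye(2)[labels]                  # rows of Q are membership indicators
W_c = Q.T @ W @ Q                      # coarse weights w~(S_i, S_j)
pi_c = Q.T @ pi                        # coarse node weights pi~(S_i)
A_c = W_c / pi_c[:, None]              # A~ = Pi~^{-1} W~

mismatch = np.abs(Q @ A_c - A @ Q).max()
```

For this particular partition, every node of a subset has the same aggregated row, so the diagram commutes exactly (`mismatch` is 0); for a generic partition the mismatch is bounded by the distortion of the next slide.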
-
Error of approximation
We now have a similar result as before.
Proposition:
$$\sup_{f : \|f\|_{1/\pi} = 1} \left\| f^\top \left( Q \widetilde{A} - A Q \right) \right\| \le D$$
where
$$D = \left( \sum_{1 \le i \le k} \widetilde{\pi}(S_i) \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)}{\widetilde{\pi}(S_i)} \frac{\pi(z)}{\widetilde{\pi}(S_i)} \left\| a(x, \cdot) - a(z, \cdot) \right\|^2 \right)^{1/2}$$
is the distortion resulting from quantizing the set of rows of $A$, with mass distribution $\pi$.
-
Antarctic food web
[Figure: the Antarctic food web graph, with nodes: Squid, Krill, Zooplankton, Phytoplankton, Cormorant, Large fish, Gull, Petrel, Sheathbill, Sperm whale, CO2, Sun, Minke whale, Blue whale, Albatross, Tern, Right whale, Fulmar, Fin whale, Leopard seal, Killer whale, Weddell seal, Ross seal, Small fish, Adelie penguin, Gentoo penguin, Emperor penguin, Macaroni penguin, King penguin, Chinstrap penguin, Sei whale, Humpback whale, Octopus, Elephant seal, Crabeater seal, Skua.]
-
Partitioning the forward graph
[Figure: clusters obtained by partitioning the forward graph:]
Zooplankton
Krill, Small fish
Squid
Gull, King penguin
Large fish, Leopard seal
Octopus, Weddell seal
Ross seal
Adelie penguin, Chinstrap penguin, Emperor penguin, Gentoo penguin, Macaroni penguin
CO2, Sun
Cormorant, Albatross, Crabeater seal, Elephant seal, Fin whale, Fulmar, Blue whale, Humpback whale, Southern right whale, Minke whale, Sei whale, Sperm whale, Killer whale, Petrel, Phytoplankton, Sheathbill, Skua, Tern
-
Partitioning the reverse graph
[Figure: clusters obtained by partitioning the reverse graph:]
CO2, Sun, Phytoplankton, Zooplankton
Krill
Gull, Elephant seal, Petrel, Sheathbill
Adelie penguin, Albatross, Fulmar, Chinstrap penguin, Crabeater seal, Emperor penguin, Gentoo penguin, King penguin, Large fish, Octopus, Macaroni penguin, Ross seal, Cormorant, Skua, Squid, Tern, Sperm whale, Weddell seal
Blue whale, Fin whale, Humpback whale, Southern right whale, Minke whale, Small fish, Sei whale
Killer whale, Leopard seal
-
Conclusion
• Very flexible framework providing coordinates on data sets
• Diffusion distance relevant to data mining and machine learning applications
• Clustering meaningful in this embedding space
• Can be extended to directed graphs
• Multiscale bases and analyses on graphs
Thank you!