Plato, 427-347 BC - University of Patras · Geometrical Data Analysis...
Geometrical Data Analysis, Nikos Laskaris, for the ΠΜΣ-ΗΕΠ postgraduate program
Plato, 427-347 BC
N. Laskaris
Algorithms for Geometrical Data Analysis
Philosophical Inquiries
- Where does this course belong? (e.g. machine learning/vision, pattern recognition)
- What is it about? (multivariate data, multi-dimensional signals)
- Why is this course necessary? (generic character, simplicity, efficiency, user's idiosyncrasy)
- Scope of this short course & goals (How? vs. Why?)
What isn't Geometrical Data Analysis?
Statistical Data Analysis
Hypothesis Driven methodologies
A-priori (Top-Down) Data Modeling
Parametric (model fitting) approaches
A little Motivation
(from my point of view)
Information-Geometry
vs. Informative-Geometry
Roger Shepard (1929 - ) Prof. Emeritus of Social Science,
Stanford University
A cognitive scientist (Ph.D. in psychology, 1955) and author of ‘‘Toward a Universal Law of Generalization for Psychological Science’’
He is considered the father of spatial relations
Science, vol. 237, Sept. 1987
Does psychological science have any hope of achieving a law
that is comparable in generality (if not in predictive accuracy) to Newton's universal law of gravitation?
Michael Kirby
Professor of Mathematics and Computer Science
Graduate Program Director, Colorado State University
An Empirical Approach to Dimensionality Reduction
and the Study of Patterns
Thus, researchers today are confronted with a modern dilemma.
Presumably the more information available concerning a phenomenon the better.
Yet a massive data set storing the information is, in and of itself, a potentially significant barrier to the investigation.
A time-honored approach for the investigation of unexplained phenomena is to attempt to infer laws, or explain processes, from the patterns present in collected data.
‘‘Our phenomenal ability to acquire data has outstripped our ability to analyze it’’
The book describes several mathematical tools for overcoming problems associated with analyzing high-dimensional and massive data sets.
Kirby's approach is geometric in nature and the main tools are dimensionality-reducing mappings.
These mappings are required for the analysis and representation of information (patterns) in large data sets generated by physical or numerical experiments.
Sir Ronald Aylmer Fisher (1890-1962)
‘‘Let the Data Speak for itself’’
Basics of Geometrical Data Analysis
Introduction
Feature Extraction
Distance measure
Structure description
Embedding in Feature-Space
Partitional Clustering
Outlier
Hierarchical Clustering
Graph-theoretic Clustering
Feature selection
Feature normalization
Elementary Geometrical Data Analysis
Vector Quantization & Prototyping
Distances & Visualization
Ordering & Novelty/Outlier Detection
Clustering
Dimensionality reduction
Manifold Learning
From Patterns to Distances
- Given an ensemble of N patterns, a p-dimensional vector $\mathbf{x}_i$, $i = 1, 2, \dots, N$, is extracted from each one:
$$\mathbf{x}_i = [\, x_i(1) \;\; x_i(2) \;\; \dots \;\; x_i(p) \,]$$
- With the feature-extraction step, the set of patterns is represented by a set of row vectors.
- The N vectors are gathered in the so-called data matrix $X^{data}$:
$$
X^{data}_{[N \times p]} =
\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_N \end{pmatrix}
=
\begin{pmatrix}
x_1(1) & x_1(2) & \dots & x_1(p) \\
x_2(1) & x_2(2) & \dots & x_2(p) \\
\vdots & \vdots & & \vdots \\
x_N(1) & x_N(2) & \dots & x_N(p)
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & & \vdots \\
x_{N1} & x_{N2} & \dots & x_{Np}
\end{pmatrix}
$$
Two simple transformations of the data matrix $X^{data}_{[N \times p]}$ (with rows $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$) are:
- standardization of each one of the p variates: after subtraction of its mean, each variate is normalized by its standard deviation (std); alternatively, the data are whitened based on PCA;
- normalization of each one of the N vectors by dividing by its norm, i.e. replacement of $\mathbf{x}_i$ with $\mathbf{X}_i = \mathbf{x}_i / \|\mathbf{x}_i\|$, where $\|\mathbf{x}_i\| = [\, x_i(1)^2 + x_i(2)^2 + \dots + x_i(p)^2 \,]^{1/2}$.
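The two transformations can be sketched in a few lines of NumPy. This is only an illustration: the function names, the sample matrix and the handling of zero variance or zero norm are assumptions, not part of the slides.

```python
import numpy as np

def standardize_columns(X):
    """Subtract the mean of each variate (column) and divide by its std."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # assumption: constant variates are only centered
    return centered / std

def normalize_rows(X):
    """Divide each pattern (row vector) by its Euclidean norm."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: all-zero vectors are left unchanged
    return X / norms

# Hypothetical data matrix with N = 4 patterns and p = 3 variates.
X_data = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 7.0],
                   [0.0, 1.0, 5.0],
                   [3.0, 3.0, 1.0]])
Z = standardize_columns(X_data)   # zero mean, unit std per column
Xn = normalize_rows(X_data)       # unit norm per row
```

Standardization acts on columns (variates), while normalization acts on rows (patterns); the two are not interchangeable.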
Note
The role of feature extraction & transformations in the subsequent mining of information from the input patterns:
for instance, the normalization trick is employed when dealing with time-series patterns and we want to highlight shape (phase) similarities during the subsequent computation of Euclidean distances.
The geometrical consideration (patterns → points in feature space) is very useful in order:
- to conceptualize morphological relationships between patterns (similar patterns are mapped onto nearby points),
- to search for natural groupings inside the sample of patterns.
The geometrical distance between vectors is measured as a means of quantifying (inversely) the similarity between the corresponding patterns. A small distance means great similarity between two patterns, and this can be interpreted as common signal/information content.
The Euclidean distance in p-D space
$$d(\mathbf{x}_1,\mathbf{x}_2) = \|\mathbf{x}_1-\mathbf{x}_2\| = \left[ (x_1(1)-x_2(1))^2 + (x_1(2)-x_2(2))^2 + \dots + (x_1(p)-x_2(p))^2 \right]^{1/2}$$
For computational considerations, usually its squared form is utilized, i.e.
$$d(\mathbf{x}_1,\mathbf{x}_2) = \|\mathbf{x}_1-\mathbf{x}_2\|^2$$
For an ensemble of N patterns $\{\mathbf{x}_i\}_{i=1:N}$, all the pairwise distances are gathered in the so-called (N×N) distance matrix $D_{[N \times N]}$.
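A fast computation of this (symmetric) matrix of squared distances is given via:
$$D = \mathrm{diag}(A)\,E + E\,\mathrm{diag}(A) - 2A \qquad (1)$$
where
$$
A_{(N \times N)} = X X^T, \qquad
E_{(N \times N)} = \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, \qquad
X = X^{data}_{[N \times p]} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_N \end{bmatrix}
$$
and $\mathrm{diag}(A)$ is the diagonal matrix holding the diagonal of A, so that $D_{ij} = \|\mathbf{x}_i\|^2 + \|\mathbf{x}_j\|^2 - 2\,\mathbf{x}_i \cdot \mathbf{x}_j$.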
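A minimal NumPy sketch of equation (1), assuming D holds squared Euclidean distances and reading diag(A) as the diagonal matrix built from the diagonal of A. The function name and the three sample points are illustrative only.

```python
import numpy as np

def squared_distance_matrix(X):
    """All pairwise squared Euclidean distances via D = diag(A)E + E diag(A) - 2A."""
    X = np.asarray(X, dtype=float)
    A = X @ X.T                      # Gram matrix, A_ij = x_i . x_j
    E = np.ones_like(A)              # N x N matrix of ones
    dA = np.diag(np.diag(A))         # diagonal matrix with entries ||x_i||^2
    D = dA @ E + E @ dA - 2.0 * A
    return np.maximum(D, 0.0)        # clip tiny negative values caused by round-off

# Three collinear points in the plane (hypothetical example).
X_data = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = squared_distance_matrix(X_data)
```

In practice the two products with E are usually replaced by broadcasting the vector of squared norms along rows and columns, which gives the same matrix without the extra N×N multiplications.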
From Patterns to Distances
Summary
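Note
If the normalized versions $\{\mathbf{X}_i\}$ of the vectors $\{\mathbf{x}_i\}$ replace them in the data matrix, then the corresponding pairwise (squared) Euclidean distances become
$$d(\mathbf{X}_i,\mathbf{X}_j) = 2\,\big(1-\rho(\mathbf{x}_i,\mathbf{x}_j)\big)$$
where $\rho(\mathbf{x}_i,\mathbf{x}_j)$ is the correlation coefficient between the two vectors:
$$\rho(\mathbf{x}_i,\mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\|_2\,\|\mathbf{x}_j\|_2} = \mathbf{X}_i \cdot \mathbf{X}_j$$
Strictly, this ratio is the cosine similarity; it coincides with the Pearson correlation coefficient when each vector has zero mean.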
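The relation can be checked numerically. The sketch below assumes zero-mean vectors, so that the normalized inner product equals the Pearson correlation; the random data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50))                        # two hypothetical patterns
x = x - x.mean(axis=1, keepdims=True)               # remove the mean of each vector
Xn = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm versions X_i

d = np.sum((Xn[0] - Xn[1]) ** 2)                    # squared distance between X_1 and X_2
rho = np.corrcoef(x[0], x[1])[0, 1]                 # Pearson correlation of x_1 and x_2
gap = abs(d - 2.0 * (1.0 - rho))                    # should vanish up to round-off
```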
Insight into the structural information contained in the distance matrix can be obtained via a simple visualization scheme.
It is an efficient procedure for unmasking possible outliers: the corresponding rows/columns appear as white stripes in the produced layout.
Relating topological descriptors of point sets with the data
The description of a set of patterns through the topology of their representing points can lead to simple descriptors with a ready geometric interpretation, without losing the connection with the conventional approach for studying the data (statistics).
Geometrical concepts like the ‘local point-density’ or the outline/skeleton of a point swarm can be utilized in building tools for understanding and handling multi-D data.
The interpoint distances & the gravitational centre of a point set
The dispersion J expresses the compactness of a point set. It is the average (squared) distance from the geometrical mean:
$$J\big(\{\mathbf{x}_i\}_{i=1:N}\big) = \frac{1}{N-1}\sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{x}_{ave}\|^2, \qquad \mathbf{x}_{ave} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$
Note: it is the p-dimensional analogue of the (squared) standard deviation for a set of scalars.
It can be expressed as a summation of pairwise (squared) distances:
$$J\big(\{\mathbf{x}_i\}_{i=1:N}\big) = \frac{1}{2N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}\|\mathbf{x}_i-\mathbf{x}_j\|^2$$
and estimated via a simple matrix operation:
$$J\big(\{\mathbf{x}_i\}_{i=1:N}\big) = \frac{\mathbf{u}\,D\,\mathbf{u}^T}{2N(N-1)}, \qquad \mathbf{u}_{[1 \times N]} = [1, 1, \dots, 1]$$
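As a quick check that the two expressions of the dispersion agree, here is a small sketch. It assumes D contains squared Euclidean distances and uses the 1/(N-1) normalization; the function names and the four sample points are hypothetical.

```python
import numpy as np

def dispersion_from_points(X):
    """J as the sum of squared distances from the mean, divided by N - 1."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    x_ave = X.mean(axis=0)
    return np.sum((X - x_ave) ** 2) / (N - 1)

def dispersion_from_distances(D):
    """J = u D u^T / (2 N (N - 1)), with D the squared-distance matrix."""
    N = D.shape[0]
    u = np.ones(N)
    return (u @ D @ u) / (2.0 * N * (N - 1))

# Corners of a square with side 2 (hypothetical point set).
X_data = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
diff = X_data[:, None, :] - X_data[None, :, :]
D = np.sum(diff ** 2, axis=2)

J_points = dispersion_from_points(X_data)
J_dist = dispersion_from_distances(D)
```

The second form needs only the distance matrix, which is useful when the patterns themselves are not available.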
1. The dispersion is a measure of ‘noise’ in the data:
$$J\big(\{\mathbf{x}_i\}_{i=1:N}\big) = \frac{1}{N-1}\sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{x}_{ave}\|^2$$
2. From the pairwise form
$$J\big(\{\mathbf{x}_i\}_{i=1:N}\big) = \frac{1}{2N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}\|\mathbf{x}_i-\mathbf{x}_j\|^2$$
the contribution of the i-th vector to the total dispersion is the sum of its distances to the rest of the points, i.e. the i-th row-sum of the D matrix:
$$dist(\mathbf{x}_i) = d(i,1) + d(i,2) + \dots + d(i,N)$$
It is a simple gauge for unmasking outlying points, and therefore spotting unusual patterns.
3. Conversely, the notion of Vector Median (the point with the smallest aggregate distance) can be introduced.
Vector-Ordering schemes
Unmasking Outliers
Using simple functionals whose arguments are the pairwise distances, we build 1D-mappings that are informative about the ‘‘distinctiveness’’ of the corresponding patterns:
1. map each vector to a scalar,
2. locate the vectors with images lying at the extremes of the obtained scalar distributions,
3. identify the corresponding vectors and make a final judgment about the corresponding patterns.
In the case of Reduced Ordering, the mapping is based on the aggregate distance:
- the estimated scalars are ordered,
$$[\, dist_1 \;\; dist_2 \;\; dist_3 \;\dots\; dist_N \,] \xrightarrow{\text{ordering}} [\, dist_{[1]} \;\; dist_{[2]} \;\; dist_{[3]} \;\dots\; dist_{[N]} \,]$$
- this ordering defines the ordering of the corresponding vectors,
$$[\, \mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \mathbf{x}_3 \;\dots\; \mathbf{x}_N \,] \xrightarrow{\text{Reduced ordering}} [\, \mathbf{x}_{[1]} \;\; \mathbf{x}_{[2]} \;\; \mathbf{x}_{[3]} \;\dots\; \mathbf{x}_{[N]} \,]$$
- a ranked list of patterns has been formed, in which the elements that deserve further consideration (due to their non-typicality) lie at one end (e.g. novelty detection).
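Reduced ordering amounts to sorting the row-sums of the distance matrix. The following sketch assumes squared Euclidean distances and a toy point set with one obvious outlier; the names are illustrative.

```python
import numpy as np

def reduced_ordering(D):
    """Rank patterns by aggregate distance (row-sum of D), smallest first."""
    dist = D.sum(axis=1)
    order = np.argsort(dist, kind="stable")
    return order, dist

# Four points on a unit square plus one distant point (hypothetical data).
X_data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
diff = X_data[:, None, :] - X_data[None, :, :]
D = np.sum(diff ** 2, axis=2)

order, dist = reduced_ordering(D)
most_typical = int(order[0])     # smallest aggregate distance
most_atypical = int(order[-1])   # candidate outlier / novelty
```

How many patterns to flag at the far end of the list remains a judgment call for the analyst.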
Note: How many patterns to disregard/underline?
From Patterns to Ordered-lists
Summary
Ranking in $R^p$
Alternative Vector-Ordering schemes
1.
2. Radial-ordering
3. Graph-theoretic (MST)
4. Manifold Ranking
5. Diffusion-network
Cluster Analysis
Clusters everywhere
Clusters within our mind
Gestalt psychology (Berlin School) is a theory of mind and brain that proposes that the operational principle of the brain is holistic, parallel, and analog, with self-organizing tendencies.
The Gestalt effect refers to the form-forming capability of our senses, particularly with respect to the visual recognition of figures and whole forms instead of just a collection of simple lines and curves.
Gestalt is a German word meaning shape or form.
Emergence is explained in this way
The most basic rule of Gestalt is the law of prägnanz:
‘‘we try to experience things in as good a gestalt way as possible’’
In this sense, "good" can mean several things, such as regular, orderly, simplistic, symmetrical, etc.
So, there is an inherent tendency in humans to perform clustering.
What is clustering?
The Art of identifying homogeneous groups in the data
Are there many algorithms?
‘‘There are as many clustering algorithms as there are (potential) users’’
Is there a single best one?
Can I design the ‘‘perfect’’ algorithm?
The clustering of clustering algorithms (Metaclustering)
Hierarchical, Partitional & Graph-Theoretic
Probabilistic, Possibilistic, Deterministic
Static, Adaptive, Dynamic
Statistical, Neuronal, Heuristic
{Stochastic vs Batch-mode} {Parallel vs Serial}
Sampling clustering algorithms
Hierarchical Clustering Algorithms
- they work with a dissimilarity matrix (i.e. without using the patterns themselves)
- and have a deterministic character (e.g. the Single-linkage algorithm)
The end output is a Dendrogram.
Given the distance matrix D[N x N]:
1. Pair the two points k and l with the smallest distance.
2. Delete the rows (& columns) in D corresponding to k & l.
3. Insert a new row (and the corresponding column) containing the distances of the new cluster (k,l) to the remaining N-2 points: $D_{(kl)\,i} = \min(D_{ki}, D_{li}),\; i \neq k,l$
4. Repeat the procedure from step 1 for the new [(N-1) x (N-1)] distance matrix.
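The single-linkage steps can be sketched as below. Instead of deleting both rows and inserting a new one, this version overwrites row/column k with the minimum and deletes row/column l, which gives the same matrix. The function name, the merge-history format and the five sample points are assumptions made for illustration.

```python
import numpy as np

def single_linkage(D):
    """Agglomerate until one cluster remains; return [(cluster_a, cluster_b, distance), ...]."""
    D = np.array(D, dtype=float)
    clusters = [(i,) for i in range(D.shape[0])]
    merges = []
    while len(clusters) > 1:
        # Step 1: closest pair (k, l), ignoring the zero diagonal.
        masked = D.copy()
        np.fill_diagonal(masked, np.inf)
        k, l = np.unravel_index(np.argmin(masked), masked.shape)
        k, l = sorted((int(k), int(l)))
        merges.append((clusters[k], clusters[l], float(D[k, l])))
        # Step 3: D_(kl)i = min(D_ki, D_li), stored in row/column k.
        merged = np.minimum(D[k], D[l])
        D[k, :] = merged
        D[:, k] = merged
        D[k, k] = 0.0
        # Step 2: drop row/column l; the matrix shrinks by one.
        D = np.delete(np.delete(D, l, axis=0), l, axis=1)
        clusters[k] = clusters[k] + clusters[l]
        del clusters[l]
    return merges

# Five points on a line (hypothetical); any dissimilarity matrix would do.
pos = np.array([0.0, 1.0, 5.0, 6.0, 20.0])
D = np.abs(pos[:, None] - pos[None, :])
merges = single_linkage(D)
heights = [m[2] for m in merges]   # merge distances, i.e. dendrogram heights
```

The sequence of merge distances is what a dendrogram draws as branch heights; a large jump between consecutive heights suggests where to cut the tree.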
How do we define the number of clusters ?
dendrogram
Sampling clustering algorithms
Partitional Clustering Algorithms
- they work with a Data matrix (i.e. using the patterns themselves)
- and have a stochastic character (e.g. the C-means algorithm)
Minimization (or maximization) of an objective (cost) function that expresses the compactness (or separability) of the produced groups.
Prototypes emerge naturally.
Fast execution.
Large data sets can be handled.
The partition matrix U is used to tabulate the results. It is a [C x N] matrix, with each row devoted to one of the C produced clusters.
The indicator function $u_{ji}$ has the value 1 if $\mathbf{x}_i$ belongs to the j-th cluster; otherwise it is set to 0.
crisp clustering vs fuzzy clustering
In the case of the C-means algorithm, the objective function is the total intra-cluster dispersion:
$$E = \sum_{j=1}^{C}\sum_{i=1}^{N} u_{ji}\,\|\mathbf{x}_i-\mathbf{o}_j\|^2, \qquad \mathbf{o}_j = \frac{\sum_{i=1}^{N} u_{ji}\,\mathbf{x}_i}{\sum_{i=1}^{N} u_{ji}}$$
In matrix notation, the above cost function reads:
$$E = \sum_{j=1}^{C} \frac{\mathbf{u}_j\,D\,\mathbf{u}_j^T}{2\,pop_j}, \qquad pop_j = \sum_{i=1}^{N} u_{ji}$$
or $E = \mathrm{trace}(U D U^T)$, once the j-th row of U has been scaled by $1/\sqrt{2\,pop_j}$.
Here D is the (squared) distance matrix, $\mathbf{u}_j$ the j-th row of U and $pop_j$ the population of the j-th cluster.
The C-means (or K-means) algorithm starts by partitioning the input points into C initial sets, either at random or using some heuristic.
It then calculates the mean point, or centroid, of each set.
It constructs a new partition by associating each point with the closest centroid.
Then the centroids are recalculated for the new clusters,
and the algorithm is repeated by alternate application of these two steps until convergence.
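A compact sketch of the alternating scheme follows. As a simplification it starts from C randomly chosen data points used as initial centroids, rather than from an initial partition; the function name, parameters and sample data are hypothetical.

```python
import numpy as np

def c_means(X, C, n_iter=100, seed=0):
    """Crisp C-means: alternate nearest-centroid assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial prototypes are C distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=C, replace=False)]
    labels = None
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)              # closest centroid for each point
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # partition unchanged: converged
        labels = new_labels
        for j in range(C):
            members = X[labels == j]
            if len(members) > 0:                    # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    E = ((X - centroids[labels]) ** 2).sum()        # total intra-cluster dispersion
    return labels, centroids, E

# Two small, well-separated groups (hypothetical data).
X_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, E = c_means(X_data, C=2)
_, _, E_one = c_means(X_data, C=1)                  # a single cluster: E is the total scatter
```

Because the result depends on the random start, the routine is usually run a few times with different seeds and the partition with the smallest E is kept.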
Some remarks on Partitional Clustering
1. Since these algorithms always produce grouped data, a critical issue is whether their use really contributes to the understanding of the true point distribution. One way to check this is to compare the measure E with the corresponding dispersion of the overall point set.
2. To alleviate the problem of initialization & insufficient convergence, the iterative algorithms are usually applied a few times and the best partition matrix is the final outcome.
3. Outlying points tend to obscure the convergence and the accuracy of the resulting partition. It is suggested that they be isolated from the beginning.
4. ‘‘How many clusters are there’’ in the point set? A simple strategy for estimating the number of clusters C is to apply the algorithm for increasing values of C and, by plotting the corresponding values of E as a function of C, to decide the critical number $C_0$. Notice that E is by default a monotonically decreasing function of C, with its absolute minimum at C=N, i.e. each point forming its own cluster.
5. The objective function has been modified many times in the Pattern Recognition literature, e.g. so as to bias the creation of highly populated clusters, or to favor specific cluster shapes.
Sampling clustering algorithms
Subtractive Clustering
Mountain-clustering for delineating cores in a multimodal point distribution
A simple loop:
1. Detection of the most significant mode &
2. Subtraction of the subset of points that come from that mode.
The technique of Potential Functions is used so as to construct a mountain, with height proportional to the local point density:
$$PD(\mathbf{x}_i) = \frac{1}{N\,(2\pi)^{p/2}\,r_o^{\,p}} \sum_{j=1}^{N} \exp\!\left[-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{2\,r_o^2}\right]$$
Remarks:
1. A mapping $\mathbf{x}_i \rightarrow PD(\mathbf{x}_i)$
2. $r_o$: radius of influence
3. $PD(\mathbf{x}_i)$ can be estimated using D-matrix elements
Mode-detection: the point of the set that lies closest to the dominant mode is identified as the point $\mathbf{x}_{max}$ of maximum local point density $PD(\mathbf{x}_{max})$.
Mode-delineation: points in the vicinity of $\mathbf{x}_{max}$ are collected.
Each point $\mathbf{x}_i$ in the point set is ordered according to its distance $d(\mathbf{x}_i, \mathbf{x}_{max})$, i.e. the closer to $\mathbf{x}_{max}$ the point is, the lower its rank [i] will be.
A portion of the lower-ranked points will be averaged:
$$\mathbf{x}_{sel} = \frac{1}{j_0}\sum_{i=[1]}^{[j_0]} \mathbf{x}_i$$
This subset is removed and the procedure is repeated from the detection step.
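The detection/delineation/subtraction loop can be sketched as follows, assuming a Gaussian potential function with the usual Parzen-type normalization. The parameter names r0 and j0 mirror the radius of influence and the number of averaged points; the function names, the number of modes and the toy data are hypothetical.

```python
import numpy as np

def point_density(D, r0, p):
    """Local point density PD(x_i) from the squared-distance matrix D."""
    N = D.shape[0]
    norm = N * (2.0 * np.pi) ** (p / 2.0) * r0 ** p
    return np.exp(-D / (2.0 * r0 ** 2)).sum(axis=1) / norm

def subtractive_clustering(X, r0, j0, n_modes):
    """Repeatedly detect the densest point, average its j0 nearest points, remove them."""
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    prototypes = []
    for _ in range(n_modes):
        if len(remaining) == 0:
            break
        P = X[remaining]
        D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
        i_max = int(np.argmax(point_density(D, r0, X.shape[1])))  # mode detection
        ranked = np.argsort(D[i_max], kind="stable")              # rank by distance to x_max
        selected = ranked[:j0]                                    # mode delineation
        prototypes.append(P[selected].mean(axis=0))               # x_sel
        remaining = np.delete(remaining, selected)                # subtraction
    return np.array(prototypes)

# A denser group near the origin and a sparser one near (5, 5) (hypothetical data).
X_data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
                   [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
prototypes = subtractive_clustering(X_data, r0=1.0, j0=3, n_modes=2)
```

The radius r0 controls how smooth the mountain is: a small value yields many narrow peaks, a large one merges neighbouring modes.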
The role of $r_o$
$PD(\mathbf{x}_{max})$
Classical References
A. K. Jain, R. C. Dubes, ‘‘Algorithms for Clustering Data’’, Prentice-Hall, 1988
L. Kaufman, P. J. Rousseeuw, ‘‘Finding Groups in Data: An Introduction to Cluster Analysis’’, Wiley Series in Probability and Statistics, 1990
Related issues
Model-validation
external vs. internal validation
Comparing two different clustering outputs (MI, Hubert-test, etc.)
Comparing a clustering output with a given classification
Automating selection of cluster numbers (Gap-statistic, BIC, MDL, MDE, etc.)
It's an Ever Expanding field
# 33 issue
Clustering Ensembles
Clustering Dynamics
1. Raw-data. 2. Feature-space. 3. Models
Kernel-based Clustering
Randomization
Class-projects
Marc, Yuko, Naru, Araik, Arman, Vahe, Armen, Aska, Yuka, Mihai, Horhe …. & Groupies
‘‘The Group that groups’’
‘‘Okinawa-blues’’
Nonlinear Dimensionality Reduction & Data-summarization
Why?
The curse of dimensionality
Human-machine interactions
‘‘Less is More’’
Multidimensional Scaling (MDS)
motivations
- sometimes only proximity data are available (e.g. data from psychophysics / behavioral experiments)
- to take advantage of the ‘‘human gift for pattern-recognition tasks’’, like determining modes in a point distribution and recognizing trends in the data, when these are presented in the form of point diagrams
MDS: definition
Any procedure that, given a dissimilarity matrix corresponding to a set of patterns, configures points in a low-dimensional space (usually 2-D) as images of the patterns, in a way that the interpoint distances approximate as much as possible the original pairwise dissimilarities.
MDS results in a 2-D ‘‘projection’’ of the objects, where neighboring relationships / clustering trends are prominent.
MDS: categories
metric vs. nonmetric MDS
Metric MDS is applied via eigenvector analysis and has an analytical expression.
Nonmetric MDS algorithms are iterative in nature and computationally demanding, but usually (slightly to moderately) superior to the metric ones.
Metric-MDS [Torgerson, 1952]
Given a distance matrix D[NxN] for a set of N objects (patterns):
- Negation: $A_{[N \times N]} = -\tfrac{1}{2}\,D_{[N \times N]}$ (with D holding the squared distances)
- Centering: $B_{ij} = A_{ij} - A_{i\cdot} - A_{\cdot j} + A_{\cdot\cdot}$ (row mean, column mean and grand mean of A)
- Eigenanalysis of B[NxN]: the first (largest) r characteristic roots $\lambda_1, \lambda_2, \dots, \lambda_r$ & the associated vectors $\mathbf{v}_1$ [Nx1], $\mathbf{v}_2, \dots, \mathbf{v}_r$ are computed
- Normalization of each $\mathbf{v}_i$ so that $\mathbf{v}_i^T \mathbf{v}_i = \lambda_i$, and gathering in an [N x r] matrix $V_{[N \times r]} = [\, \mathbf{v}_1 \;\; \mathbf{v}_2 \;\dots\; \mathbf{v}_r \,]$
- Output: the i-th row of this matrix contains the coordinates of the i-th point in the new r-dimensional space (r = 1, 2 or 3):
$$
\mathrm{X}^{data}_{[N \times r]} = V_{[N \times r]} =
\begin{pmatrix} \boldsymbol{\chi}_1 \\ \boldsymbol{\chi}_2 \\ \vdots \\ \boldsymbol{\chi}_N \end{pmatrix}
=
\begin{pmatrix}
\chi_{11} & \chi_{12} & \dots & \chi_{1r} \\
\chi_{21} & \chi_{22} & \dots & \chi_{2r} \\
\vdots & \vdots & & \vdots \\
\chi_{N1} & \chi_{N2} & \dots & \chi_{Nr}
\end{pmatrix}
$$
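The metric-MDS recipe can be sketched in NumPy as below. The sketch assumes that D holds squared Euclidean distances and uses the customary factor -1/2 before double centering; the function name and the rectangle example are illustrative.

```python
import numpy as np

def metric_mds(D, r=2):
    """Torgerson-style metric MDS from a matrix D of squared distances."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    A = -0.5 * D
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = H @ A @ H                                # B_ij = A_ij - A_i. - A_.j + A_..
    eigvals, eigvecs = np.linalg.eigh(B)         # B is symmetric
    top = np.argsort(eigvals)[::-1][:r]          # indices of the r largest roots
    lam = np.clip(eigvals[top], 0.0, None)       # guard against small negative roots
    return eigvecs[:, top] * np.sqrt(lam)        # columns scaled so that v_i^T v_i = lambda_i

# Corners of a 3 x 4 rectangle (hypothetical); only their distances are given to MDS.
X_data = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = ((X_data[:, None, :] - X_data[None, :, :]) ** 2).sum(axis=2)

Y = metric_mds(D, r=2)
D_new = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # distances in the new space
```

The recovered configuration matches the original only up to rotation, reflection and translation, which is why the comparison is made on distances rather than coordinates.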
MDS-quality
The normalized total discrepancy serves as a measure of mapping credibility:
$$Stress = \frac{\sum_{i<j}\left(D_{ij} - \Delta_{ij}\right)}{\sum_{i<j} D_{ij}}$$
where Δ is the matrix of interpoint distances $\Delta_{ij} = \|\boldsymbol{\chi}_i - \boldsymbol{\chi}_j\|^2$ in the new space.
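A small sketch of the stress measure, read as the sum of (D - Δ) over all pairs i<j divided by the sum of D, with both matrices holding squared distances. The function name and the rectangle example are assumptions for illustration.

```python
import numpy as np

def stress(D, Y):
    """Normalized total discrepancy between D and the squared distances of the rows of Y."""
    Y = np.asarray(Y, dtype=float)
    Delta = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    iu = np.triu_indices(D.shape[0], k=1)        # index pairs with i < j
    return (D[iu] - Delta[iu]).sum() / D[iu].sum()

# Corners of a 3 x 4 rectangle (hypothetical data).
X_data = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = ((X_data[:, None, :] - X_data[None, :, :]) ** 2).sum(axis=2)

stress_2d = stress(D, X_data)           # exact configuration
stress_1d = stress(D, X_data[:, :1])    # only the first coordinate is kept
```

Dropping the second coordinate discards the longer side of the rectangle, so most of the pairwise distance is lost and the stress is large.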
Note
- Possible outliers in the set tend to “dominate” the projection.
- A refined image can be obtained after their isolation and removal.
MDS: classical example
With standard psychophysical experimental procedures, the perceptual similarity (PS) between 14 selected colors was estimated and tabulated in a [14 x 14] matrix.
The 14 entries correspond to 14 different ‘hues’ with wavelengths:
Wavelength = [434, 445, 465, 472, 490, 504, 537, 555, 584, 600, 610, 628, 651, 674]
bluish hue: wavelength = 472, reddish hue: wavelength = 674
A point diagram was produced by applying the MDS algorithm to the distance matrix with entries
d(i,j) = 1-PS(i,j), i,j=1:14.
The ‘homeomorphism’ of this plot with the well-known color-disk shown on the right is remarkable
Related issues
Sammon mapping
Procrustes Analysis
Correspondence-problem
Treating Graphs
MST-planing
Projection Pursuit
Data-Manifold Learning
Manifolds
• What is a ‘‘Manifold’’?
OXFORD Dictionary: n (techn) a pipe or an enclosed space with several openings that connects with other parts, e.g. for taking gases into or out of cylinders in a car engine: the exhaust/inlet manifold
Bernhard Riemann (1826-1866)
The name manifold comes from Riemann's original German term, Mannigfaltigkeit, which W. Clifford translated as "manifoldness".
In his Göttingen inaugural lecture, Riemann described the set of all possible values of a variable with certain constraints as a Mannigfaltigkeit.
(light) Mathematical Definition
A manifold is a space which, in a close-up view, resembles the spaces described by Euclidean geometry, but which may have a more complicated structure when viewed as a whole.
Manifold Examples
‘‘Swiss Roll’’
Our own Manifold ….
The Manifold Ways of Perception!
Human Cognition: ‘‘The Manifold Ways of Perception’’
H. Seung and D. Lee, Science, 22 Dec 2000, Vol. 290(5500), pp. 2268-2269
Recently (Science, vol. 290, Dec. 2000), the interest in manifolds has been renewed and extended well beyond the mathematicians' community:
(1) Tenenbaum et al., ‘‘A global geometric framework for nonlinear dimensionality reduction’’
(2) Roweis & Saul, ‘‘Nonlinear Dimensionality Reduction by Locally Linear Embedding’’
‘‘Manifold learning’’
Nowadays, Manifold Learning has become a scientific branch in its own right.
A well-informed web site is: http://www.cse.msu.edu/~lawhiu/manifold/
In a nutshell
A manifold is ‘a constrained (multidimensional) surface’.
This implies the existence of an ambient (vector) space in which the available data lie in a restricted way.
The famous Swiss-Roll
What is a DATA-Manifold & what is there to learn about it?
When the available data are multivariate observations from a high-dimensional space,
the high dimensionality usually obscures the useful information,
and constitutes one of the major components of the ‘curse of dimensionality’.
For Data-Analysts: ‘‘Less is Better’’
- efficient techniques for handling high-dimensional data
- methods for data abstraction and summarization
Visualization schemes are highly popular, since some insight into the data can be gained immediately by the user through low-dimensional plots and graphs.
Do we really need to learn the DATA-Manifold?
YES!!! ……. for Moonwalking!
Radial-Ordering
Ranking on Manifold
Results on a subset of the USPS data set [Zhou et al., 2004]. The top left-hand image is the query; the 99 top-ranked images are shown.
Manifolds are everywhere ….
Minimal Spanning Tree (MST)
A graph-theoretic tool to parameterize Data-Manifolds
Graph-Theoretic terminology
A graph is a set of nodes and a set of node pairs called edges.
An edge-weighted graph is a graph with a real number, called weight, assigned to each edge.
A connected graph has a path between any two distinct nodes.
A Spanning Tree is a connected graph that includes all the nodes and has no loops.
The MST is the spanning tree of minimum total weight.
How is Graph Theory applied in feature space?
A node is dedicated to each data point, and the corresponding pairwise distances (generalised dissimilarities) are assigned as weights to the formed edges.
The MST is the connected graph that emerges from the collection of exactly (N-1) edges having minimum total length.
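One common way to build the MST from a full distance matrix is Prim's algorithm. The slides do not say which algorithm is used, so the sketch below is simply one possible choice; the function name and the four sample points are hypothetical.

```python
import numpy as np

def minimal_spanning_tree(W):
    """Prim's algorithm on a full weight matrix W; returns the N-1 edges as (i, j, weight)."""
    W = np.asarray(W, dtype=float)
    N = W.shape[0]
    in_tree = np.zeros(N, dtype=bool)
    in_tree[0] = True                      # grow the tree from node 0
    best = W[0].copy()                     # cheapest known link of each node to the tree
    parent = np.zeros(N, dtype=int)        # tree node realizing that link
    edges = []
    for _ in range(N - 1):
        cand = np.where(in_tree, np.inf, best)
        j = int(np.argmin(cand))           # closest node outside the tree
        edges.append((int(parent[j]), j, float(W[parent[j], j])))
        in_tree[j] = True
        closer = (W[j] < best) & ~in_tree  # nodes that are now closer via j
        best[closer] = W[j][closer]
        parent[closer] = j
    return edges

# Four points on a line (hypothetical); weights are the pairwise distances.
pos = np.array([0.0, 1.0, 3.0, 7.0])
W = np.abs(pos[:, None] - pos[None, :])
edges = minimal_spanning_tree(W)
total_length = sum(w for _, _, w in edges)
```

For collinear points the MST simply chains neighbouring points, which illustrates why it can serve as a skeleton of the point set.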
A realistic example
What is magic about the MST?
- It contains the NN-graph.
- It can be used for ranking in $R^p$ (i.e. MST-ordering).
- It can be used for visualizing the skeleton of pattern variation.
ISOMAP
A hybrid tool for visualizing Data-Manifolds
ISOMAP = Graph theory in feature space + Multidimensional Scaling
ISOMAP algorithm
Isomap comprises simple algorithmic steps that transform the original distance matrix D to GD, which contains the geodesic interpoint distances.
1. The nearest-neighbors graph over the given point sample is constructed.
2. The geodesic interpoint distances are computed as the shortest paths (on this graph) between each pair of points.
3. MDS is then applied: Y = MDS(GD)
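The three Isomap steps can be sketched as follows. The sketch assumes a symmetrised k-nearest-neighbours graph, Floyd-Warshall shortest paths and metric MDS on the squared geodesic distances, and it requires the neighbourhood graph to be connected. The function name, the parameters and the half-circle example are illustrative.

```python
import numpy as np

def isomap(D, n_neighbors=2, r=2):
    """Isomap sketch; D holds plain (not squared) pairwise distances."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    # Step 1: k-nearest-neighbours graph; inf marks a missing edge.
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(N):
        nn = np.argsort(D[i], kind="stable")[1:n_neighbors + 1]  # position 0 is the point itself
        G[i, nn] = D[i, nn]
        G[nn, i] = D[i, nn]
    # Step 2: geodesic distances as shortest paths (Floyd-Warshall).
    for k in range(N):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Step 3: metric MDS on the squared geodesic distances.
    H = np.eye(N) - np.ones((N, N)) / N
    B = H @ (-0.5 * G ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:r]
    Y = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
    return Y, G

# Nine points on a half circle: a 1-D manifold embedded in the plane (hypothetical data).
theta = np.linspace(0.0, np.pi, 9)
P = np.column_stack([np.cos(theta), np.sin(theta)])
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2))

Y, GD = isomap(D, n_neighbors=2, r=1)
steps = np.diff(Y[:, 0])   # consecutive differences of the 1-D coordinates
```

Along the arc the geodesic distance between the two end points is clearly larger than the straight-line distance, and the 1-D embedding unrolls the arc so that the points keep their order.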
It is an efficient graph-flattening technique and can learn a broad class of nonlinear manifolds.
MDS
ISOMAP
A very interesting example with many potential applications in computer vision (e.g. morphing)
ISOMAP restrictions
While Isomap is a very competent procedure for learning nonlinear manifolds, it is restricted by the computational demands of the geodesic-distance estimations.
The handling of more than a few thousand multidimensional points (i.e. patterns) becomes problematic.
A possible solution is the marriage of Isomap with unsupervised learning techniques (e.g. Kohonen Maps).
As a preprocessing step, efficient techniques can first be applied to reform the ensemble of patterns into data chunks, which are then summarized via prototypes that are in turn fed to the ISOMAP routine.
Vector Quantization (VQ) based on Neural-Gas Network
VQ encodes the data manifold in the ambient (high-d) space by utilizing only a finite set of reference vectors, the code vectors.
It actually performs a parcellation of the ambient space known as Voronoi Tessellation.
A Voronoi region is defined around each code vector: this is a section of the original space comprising all the points closer to a specific code vector than to any other.
A realistic example
The ‘Neural Gas’ algorithm
The codebook design is the most critical part in VQ. For this step the ‘‘neural-gas’’ algorithm is employed.
Neural Gas is an artificial neural network model which converges efficiently to a small, user-defined number C<N of codebook vectors.
It is an extension of Kohonen's self-organizing maps that shares some characteristics with Fuzzy C-means. Its name stems from the physics of the underlying optimization scheme.
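A minimal sketch of codebook design with a neural-gas style update rule: for every sample, all code vectors are ranked by distance and moved toward the sample with a strength that decays with rank. The learning-rate and neighbourhood schedules, the parameter names and the sample data are assumptions, not values taken from the slides.

```python
import numpy as np

def neural_gas(X, C, n_epochs=50, eps=(0.5, 0.01), lam=None, seed=0):
    """Rank-based soft competitive learning of C code vectors."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if lam is None:
        lam = (C / 2.0, 0.1)                       # assumed neighbourhood range: start, end
    W = X[rng.choice(len(X), size=C, replace=False)].copy()
    t, t_max = 0, n_epochs * len(X)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = t / t_max
            e = eps[0] * (eps[1] / eps[0]) ** frac   # decaying step size
            l = lam[0] * (lam[1] / lam[0]) ** frac   # decaying neighbourhood range
            d2 = ((W - X[i]) ** 2).sum(axis=1)
            rank = np.argsort(np.argsort(d2))        # 0 for the closest code vector
            W += e * np.exp(-rank / l)[:, None] * (X[i] - W)
            t += 1
    return W

def voronoi_labels(X, W):
    """Assign each point to its nearest code vector (its Voronoi region)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# 200 random points in the unit square (hypothetical data), summarized by 4 code vectors.
X_data = np.random.default_rng(1).uniform(size=(200, 2))
W = neural_gas(X_data, C=4, n_epochs=20)
labels = voronoi_labels(X_data, W)
```

The resulting code vectors can then stand in for the full data set, for example as the input points of the ISOMAP routine.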
Learning Dynamic-Manifolds from continuous stream data
‘Neural-Gas’ based dynamic prediction
Related issues
Laplacian Eigenmaps, LLE, etc.
Ranking on Manifolds
Semisupervised Learning
Class-projects
Thank U
The End