Latent Variable Spaces
For The Construction Of Topology
Preserving Mappings
Marian Pena
A thesis submitted in partial fulfillment of the
requirements of the University of Paisley for
the degree of Doctor of Philosophy
June 21, 2007
Abstract
This thesis is dedicated to a family of topology preserving mappings similar to the
Generative Topographic Map (GTM) [8]. These techniques can be considered as a
non-linear projection from the input or data space to the output or latent space
(usually 2D or 3D), plus a clustering algorithm that updates the prototypes. The key
difference of the new models explained in this document is that, instead of includ-
ing a neighborhood function in the learning rule in order to maintain the topology,
we project points from an existing latent space to the data space. In doing so, we
separate the clustering (inner loop) from the projection (outer loop) into two different
steps. A common framework based on the GTM structure can be used with different
clustering techniques, giving new properties to the algorithms.
Thus we have two versions of the Harmonic Topographic Mapping (HaToM)
that utilise the K-Harmonic Means (KHM) [91, 92, 95] clustering, and the faster
Topographic Neural Gas (ToNeGas), with the inclusion of Neural Gas in the inner
loop. We compare these with a fruitless attempt to combine the SOM learning
algorithm with K-Harmonic Means.
We first review in Chapter three the Topographic Product of Experts (ToPoE)
[30], which includes the GTM structure, but with the Product of Experts instead
of the Mixture of Experts and gradient descent learning instead of the
Expectation-Maximisation algorithm used in the GTM. ToPoE, like the GTM, is more suitable for
continuous data. We extend its theory by investigating properties such as the local
variance of the model, the projection to latent space and convergence properties. We
introduce local kernels instead of the better known Gaussian kernel. We also study
the use of the magnification factors as a tool for measuring topology preservation.
In the fourth chapter we introduce the theory underlying the new algorithm, HaToM,
with its two versions, analyse their parameters, and compare the main differences.
The experimental sections are dedicated to the experiments with several datasets,
which illustrate the projection capabilities of these algorithms, explaining in detail
the use of different parameters, different kernels, and how they affect the results.
Chapter 5 introduces the last of the new algorithms, ToNeGas, and compares it with
the previous two by analysing differences using experiments with the same data.
We also compare their topology preservation properties and the convergence speed,
comparing as well with the results from the Self-Organizing Map. Finally, the three
new algorithms are evaluated together and compared with each other, and also
against SOM and GTM.
Resumen (English translation)
This thesis is dedicated to a family of topology preserving algorithms similar to the
Generative Topographic Map (GTM) [8]. These techniques can be considered as a
non-linear projection from the data space to the projection or latent space (usually
2D or 3D), plus a clustering algorithm that updates the centres. The key difference of
the models explained in this document is that, instead of including a neighbourhood
function in the learning rule to maintain the topology, we project the points from the
projection space to the data space. In doing so we separate the clustering phase (inner
loop) from the projection phase (outer loop) into two steps. A common framework based
on the GTM structure can be used with various clustering techniques, giving different
properties to the algorithm.
Accordingly, we have two versions of the Harmonic Topographic Mapping (HaToM),
which uses K-Harmonic Means (KHM) [91, 92, 95] for the clustering, and the faster
Topographic Neural Gas (ToNeGas), which includes Neural Gas in the inner loop. We
compare these algorithms with an unsuccessful attempt to combine SOM and
K-Harmonic Means.
We first review in Chapter three the Topographic Product of Experts (ToPoE) [30],
which includes the GTM structure, but where the product of experts replaces the
mixture of experts of the GTM, and gradient descent is the learning algorithm instead
of the Expectation-Maximisation used in the GTM. We extend its theory by
investigating properties such as the local variance of the model, the projection to
latent space, and the convergence properties. We introduce local kernels in place of
the better known Gaussian kernel. We also study the use of magnification factors as a
tool for measuring topology preservation.
In Chapter four we introduce the theory underlying the new algorithm, HaToM, in its
two versions, analyse its parameters, and compare the most important differences. The
experimental sections show the application to several datasets, illustrating the
projection properties of these algorithms and explaining in detail the use of different
parameters and different kernels, and how all this affects the results.
Chapter five introduces the last of the algorithms, ToNeGas, which is compared with
the previous two by analysing the differences on the same datasets. We also compare
the topology preservation and the convergence speed, comparing as well with SOM.
Finally, the three new algorithms are evaluated jointly against each other and against
SOM and GTM.
Acknowledgments
I would like to thank all the PhD students and staff in the Computing Department
of the University of Paisley, who made my stay a great experience to remember.
Special thanks to Gayle Leen, with whom I shared every step of the doctorate
and a new life in Scotland. Thank you also to Cesar García Osorio, who helped
me a great deal, especially during my first months at the University of Paisley. I
am grateful also to Jos Koetsier, Donald McDonald, Ying (Hannah) Han, Lina
Petrakieva, Oleksiy Dekhtyarenko, Andreas Loengarov, Nicolas García Pedrajas,
Benoit Chaperot, Wesam Barbakh, and Ian Miller.
This PhD would not have been possible without the close supervision of my Director
of Studies, Colin Fyfe. His expertise in the area, and in supervising PhD students, made
the process much easier. Thanks also to my co-supervisors Stephen McGlinchey and
Daniel Livingston.
Finally, thanks to my sister and brother, my family and friends, for supporting
me in this challenge. But most of all, thanks to my mother, who always believes in
me; this thesis is dedicated to her.
Agradecimientos (English translation)
First of all, I would like to thank all the PhD students of the Computing Department
of the University of Paisley for making my stay a pleasant experience to remember.
Special thanks to Gayle Leen, with whom I have shared every step of the doctorate and
a new life in Scotland. Thanks also to Cesar García Osorio, who helped me a great
deal, especially in my first months at the University of Paisley. I am also grateful to
Jos Koetsier, Donald McDonald, Ying (Hannah) Han, Lina Petrakieva, Oleksiy
Dekhtyarenko, Andreas Loengarov, Nicolas García Pedrajas, Benoit Chaperot, Wesam
Barbakh, and Ian Miller.
This thesis would not have been possible without the supervision of my director of
studies, Colin Fyfe. His experience in the area and in supervising postgraduate studies
has made this doctorate a much more bearable process. Thanks also to my
co-supervisors Stephen McGlinchey and Daniel Livingston.
Finally, thanks to my sister and brother, family and friends, for supporting me in this
challenge. But above all, thanks to my mother, who always believes in me; this thesis
is dedicated to her.
List of symbols
• d is the dimensionality of the data space.
• q is the dimensionality of the latent space.
• x_i is a datapoint in data space.
• y_i is the projection of x_i in latent space.
• t_k is a latent point in latent space.
• m_k is the projection of t_k in data space; it is also named prototype, centre or
centroid of the cluster.
• W is the matrix of weights associated with the prototypes.
• Φ is a matrix where each row is the response of the basis functions to one
latent point, or alternatively each column of Φ is the response of one of the
basis functions to the set of latent points.
• r_ik is the responsibility of the kth latent point for the ith data point.
• d_ik is the distance from the kth latent point to the ith data point.
• γ is the width of the responsibilities.
• K is the number of neurons in the two-dimensional grid.
• β is the noise variance.
• N is the number of data points.
Contents
1 Introduction
1.1 Clustering
1.2 Dimensionality reduction
1.3 Topology Preservation
1.4 Structure of the Thesis
1.5 Contribution of the Research
2 Literature review
2.1 Dimensionality reduction
2.1.1 Principal Component Analysis
2.1.2 Independent Component Analysis
2.2 Norms and metrics
2.2.1 The Minkowski Metric
2.2.2 The Mahalanobis distance
2.2.3 Metrics in high dimensional spaces
2.3 Cluster analysis
2.3.1 K-Means
2.3.2 K-Means++
2.3.3 Harmonic Averages
2.3.4 K-Harmonic Means
2.3.5 Neural Gas
2.4 Topology preserving mappings
2.4.1 MultiDimensional Scaling
2.4.2 Self-Organizing Map
2.4.3 Generative Topographic Map
2.4.4 Probabilistic Principal Surface
2.4.5 Topology Representing Network
2.4.6 Growing Neural Gas
2.4.7 Topology preserving Elastic net
2.5 Software tools
2.5.1 SOM Toolbox
2.5.2 GTM Toolbox
2.5.3 Netlab Toolbox
2.5.4 ICALAB
3 The Topographic Product of Experts
3.1 Topographic Product of Experts
3.1.1 Product of Experts
3.1.2 The Topographic Product of Experts
3.1.3 Comparison with the GTM
3.2 Responsibility Estimation
3.3 The Actual Variance
3.3.1 Simulations
3.4 Cost functions and Convergence
3.4.1 A simplified model
3.4.2 The full model
3.4.3 Projections of the latent points
3.5 Magnification Factors and Dimensionality Estimation
3.6 Simulations
3.6.1 Experiment 1: 1D Artificial Data
3.6.2 Experiment 2: 2D Artificial Data
3.6.3 Experiment 3: The Animals data set
3.6.4 Experiment 4: Bank Notes Data
3.6.5 Experiment 5: The Fundamental Clustering Problems Suite
3.6.6 Experiment 6: The Algae data set
3.7 Conclusions
4 The Harmonic Topographic Mapping
4.1 The Harmonic Self-Organising Map
4.1.1 HSOM Simulations
4.2 Harmonic Topographic Map
4.2.1 Data-driven HaToM
4.2.2 Model-driven HaToM
4.2.3 Generalised Harmonic Topographic Map
4.3 HaToM Simulations
4.3.1 Experiment 1: 1D Artificial Data
4.3.2 Experiment 2: 2D Artificial Data
4.3.3 Experiment 3: The Animals data set
4.3.4 Experiment 4: The Fundamental Clustering Problems Suite
4.3.5 Experiment 5: The Algae data set
4.4 G-HaToM Simulations
4.4.1 Experiment 1: Crabs Data
4.4.2 Experiment 2: Bank Notes Data
4.4.3 Experiment 3: Oil Data
4.4.4 Experiment 4: Algae Data
4.5 Conclusion
5 The Topographic Neural Gas & Algorithms comparison
5.1 The Topographic Neural Gas
5.1.1 Simulations
5.1.2 Experiment 1: The Fundamental Clustering Problems Suite
5.1.3 Experiment 2: The Algae data set
5.2 Topology preservation
5.2.1 Experiment 1: Algae dataset
5.2.2 Experiment 2: Gene dataset
5.3 Experiment Comparisons
5.4 Comparison of the algorithms
5.4.1 Growing and Pruning
5.4.2 One-to-one comparisons
5.5 Benefits from separating clustering from projection
5.6 Conclusions
6 Summary and Future work
6.1 Summary
6.2 Major contributions of this thesis
6.3 Future Work
Chapter 1
Introduction
Topology preserving mappings can be considered as a combination of three functions:
clustering, projection to a space of lower dimensionality (dimensionality reduction),
and topology preservation. Together these functions give a better visualisation of the
data in a two or three dimensional space, and a representation of the dataset by a
small set of prototypes. The property that makes these techniques stand out is
topology preservation: the low-dimensional visualisation retains the arrangement that
the datapoints have in the high dimensional space.
In this thesis we investigate the separation into two steps of the projection and the
clustering functions, which are usually combined in the learning rule. But first we
examine the three functions separately.
1.1 Clustering
One of the main properties of such mappings is the clustering or quantization of
the data samples. Data clustering is a common technique in unsupervised learning
(learning without a “teacher”, or without using class information), which is used
in many fields, including machine learning, data mining, pattern recognition and
image analysis. Clustering is a division of data into different groups or clusters so
that the data in each cluster has similarity in one trait, often proximity according
to some defined distance measure.
Data clustering algorithms can be hierarchical [41, 42] or partitional. Hierarchical
algorithms find successive clusters using previously established clusters, whereas
partitional algorithms determine all clusters at once. Partitional clustering like K-
Means (see below) can be further subdivided into relocation algorithms and density-
based algorithms. Relocation algorithms try to discover clusters by relocating the
prototypes in successive iterations. Density-based algorithms search for areas with
high population of data. In this thesis we consider only partitional clustering of the
relocation type. More information about all the clustering techniques can be found
in [4].
In clustering, each cluster may have a prototype or associated centre. These
prototypes are usually the average of all the datapoints related to that particular
prototype. The association with a prototype depends on the property analysed,
which is often the Euclidean distance, so that each datapoint belongs to the closest
prototype. This membership can be to a unique prototype (hard membership) or to
several prototypes (soft membership). The first one is the case for K-Means while
the second one is used in algorithms like K-Harmonic Means (see Section 2.3.4).
Representing data with a limited set of prototypes necessarily loses information,
but achieves simplification that allows for the modeling of the data.
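The difference between hard and soft membership can be illustrated with a minimal sketch. It is not taken from the algorithms of this thesis: the function names are hypothetical, and the soft membership uses normalised Gaussian weights with an assumed `width` parameter purely for illustration, which is not the weighting rule of K-Harmonic Means.

```python
import math

def centroid(points):
    """Prototype as the coordinate-wise average of its datapoints."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def distances(x, prototypes):
    """Euclidean distance from one datapoint to every prototype."""
    return [math.dist(x, m) for m in prototypes]

def hard_membership(x, prototypes):
    """Index of the single closest prototype (K-Means-style assignment)."""
    d = distances(x, prototypes)
    return d.index(min(d))

def soft_membership(x, prototypes, width=1.0):
    """One weight per prototype, summing to one (illustrative Gaussian weights)."""
    d = distances(x, prototypes)
    w = [math.exp(-(dk ** 2) / (2.0 * width ** 2)) for dk in d]
    total = sum(w)
    return [wk / total for wk in w]

prototypes = [(0.0, 0.0), (4.0, 0.0)]
x = (1.0, 0.0)
print("hard:", hard_membership(x, prototypes))
print("soft:", soft_membership(x, prototypes))
```

With hard membership the datapoint contributes to one prototype only; with soft membership every prototype receives a share, the nearest one receiving the largest.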
1.2 Dimensionality reduction
When working with large databases, it is often wise to reduce the dimensionality, both
to make the data more manageable and for visualisation purposes. However, the
user should be aware of the reduction of information that this implies. Techniques
such as Principal Component Analysis (PCA) control this by keeping the maximum
variance in the projected data. The projection is linear in this case, but to make the
model more general, a nonlinear projection may be more suitable. Dimensionality
reduction is an “attribute transformation” [4], that is, a representation of all the
attributes of the data samples with a small set of new attributes that are the result
of a function applied to the original attributes. This process can be considered as a
projection from a high-dimensional data space, where the dimensions are the original
attributes, to a low dimensional data space, usually two or three dimensions, with
the new attributes. Those new attributes are often called “hidden causes” or “latent
variables”, and allow for the reduction of noise or redundancy in the data.
Dimensionality reduction can be classified as hard or soft, depending on the
number of dimensions reduced, i.e. a reduction from a very high dimensional space
to a two dimensional space is a hard reduction, while a reduction of just two or three
dimensions is a soft reduction; there is also dimensionality reduction for visualisation,
where the goal is not finding the intrinsic dimension of the data (the number of
independent variables that satisfactorily explain the dataset), but to project the
data to two or three dimensions to visualise the data in a low dimensional space.
In some dimensionality reduction techniques, the intrinsic dimension has to be
given by the user, which calls for a trial-and-error process to find a suitable result
without under- or over-fitting the data. When the aim is to visualise in two or three
dimensions as it is in topology preserving mappings, the intrinsic dimensionality is
ignored and the projection is to two or three dimensional spaces.
More information about dimensionality reduction can be found in [14].
1.3 Topology Preservation
In mathematics, topology began with the investigation of certain questions in
geometry¹. General topology is the branch of topology which studies properties of
topological spaces and structures defined on them such as connectedness, compact-
ness and continuity. In topology the geometry is not analysed through shape but
considering the way the objects are put together. For example, the square and the
circle have many properties in common: they are both one dimensional objects (from
a topological point of view) and both separate the plane into two parts, the part
inside and the part outside; a circle is topologically equivalent to an ellipse and a
sphere is equivalent to an ellipsoid. If two objects have the same topological proper-
ties, they are said to be homeomorphic. The objects of topology are formally defined
as topological spaces. To be homeomorphic, the objects have to preserve their topo-
logical properties after deformations like twisting and stretching; the connectivity of
the objects has to be preserved.
The inclusion of topology preservation in a clustering technique was first implemented
by Kohonen [48] with the Self-Organizing Map (SOM). This implementation
gave for the first time an important property to the algorithms: points closer in the
projected mapping or latent space are also closer in data space. This property
allows the user to visualise the organisation of the high-dimensional space in a low-
dimensional projection, commonly a 2D mapping. The SOM has been extensively
used in many applications [44, 63], and has inspired a large number of related
algorithms. It has a few drawbacks, such as the lack of an objective function, but it is
still the most widely used topology preserving mapping.
Some topology preserving techniques have been defined as examples of the so
called latent variable technique [13], where the projections can follow different cri-
teria depending on the objectives:
¹ From Wikipedia: http://en.wikipedia.org/wiki/Main_Page
• minimising the reconstruction error,
• maximising the variance preservation,
• maximising the distance preservation,
• decorrelating the observed variables,
• making the estimated latent variables as independent as possible.
One possible criterion for topology preservation is the third one, while the last
is the basis for Independent Component Analysis (see Section 2.1.2).
When considering topology preservation it is very important to keep in mind
that the dimensionality of the data manifold (or intrinsic dimension of the data)
may very likely be different from two, so projecting onto a 2-dimensional latent
space does not give the right projection in terms of dimensionality reduction. This
dimensionality of the projection, determined by the user, is called the embedding
dimensionality, and is only for visualisation purposes.
1.4 Structure of the Thesis
In the second chapter we review techniques related to our algorithms such as the
dimensionality reduction models, different metrics used to estimate the closest pro-
totype, clustering techniques, and some of the existing topology preserving mappings
(TPM). We finally enumerate some useful toolboxes for TPM available on the in-
ternet.
Chapter three reviews the Topographic Product of Experts (ToPoE) algorithm
and develops its theory, by analysing its properties through the examination of the
local variance, topology preservation with the calculation of magnification factors,
convergence properties, and theory related to the projection of the datapoints to the
low-dimensional space for visualisation purposes.
In the fourth and fifth chapters we describe the theory of two new algorithms,
the Harmonic Topographic Mapping (two versions) and the Topographic Neural
Gas, whose properties are discussed. The former makes use of K-Harmonic Means,
while the latter is an extension of Neural Gas as a topology preserving mapping.
Note that, even if Neural Gas is named as a topology preserving mapping in many
publications, they are really referring to the Topology Representing Network (TRN),
which is a combination of Neural Gas and competitive Hebbian learning, created by
the same author as the original Neural Gas [58]. In the Topographic Neural Gas
(ToNeGas) we strictly use the clustering version of Neural Gas, and apply our new
projection method to maintain the topology preservation.
All the algorithms are jointly compared in Chapter 5. The same datasets have
been used with all the algorithms, and the results are reported in the corresponding
chapters, although only the experiments that reveal relevant information about each
algorithm are shown. The general comparison in Chapter 5 includes two more in-depth
experiments that investigate clustering and topology preservation, a comparison
across experiments, and one-to-one comparisons between algorithms.
1.5 Contribution of the Research
The contributions of the research presented in this thesis are:
• The Topographic Product of Experts (ToPoE) was presented in [30]. In this
thesis we extend the algorithm by
– including new kernels in the calculation of responsibilities,
– analysing the local variance of the model, the projection of datapoints to
the latent space and convergence properties,
– investigating the degree of topology preservation through magnification
factors,
– applying the algorithm to several datasets to compare it with the other
algorithms presented in this thesis.
• K-Harmonic Means (KHM) clustering has been shown to overcome some of the
drawbacks of the K-Means algorithm. We study the inclusion of an underlying
latent space with KHM.
• We investigate the separation into two steps of the projection and the clustering
functions for topology preserving mappings, which are usually combined in the
learning rule. This naturally leads to a family of algorithms that share the
projection technique but differ in the clustering method.
• We develop two versions of the Harmonic Topographic Mapping (HaToM),
which impose the model on the data more or less strongly, depending on
the nature of the data.
• We further extend HaToM with the Generalised Harmonic Topographic
Mapping (G-HaToM), which proved to reduce the computational time
and helps to model more difficult data.
• The Topographic Neural Gas (ToNeGas) is the last algorithm of this family; it
was created specifically to reduce the computational time while keeping
all the necessary functionality.
• We investigate the application of fractional distance metrics to the new
algorithms for high-dimensional data.
This work has appeared in the following publications:
1. Pena, M. and Fyfe, C. The Harmonic Topographic Map. In proceedings of
The Irish conference on Artificial Intelligence and Cognitive Science, AICS05.
pages 245-254. 2005
2. Pena, M. and Fyfe, C. Tight Clusters and Smooth Manifolds with the Harmonic
Topographic Map. In proceedings of the 5th WSEAS International Conference
on Simulation, Modeling And Optimization, WSEAS SMO ’05.
pages 508-513. 2005.
3. Pena, M. and Fyfe, C. Model- and Data-driven Harmonic Topographic Maps.
WSEAS Transactions On Computers. Volume 4, number 9 pages 1033-1044.
September 2005.
4. Pena, M. and Fyfe, C. Faster clustering of complex data with the Gener-
alised Harmonic Topographic Mapping (G-HaToM). In proceedings of the 5th
WSEAS International Conference on Applied Informatics And Communica-
tions, WSEAS AIC ’05. Pages 270-275. 2005.
5. Pena, M. and Fyfe, C. Developments of the Generalised Harmonic Topographic
Mapping. WSEAS Transactions On Computers. Volume 4, number 11, pages
1548-1555. November 2005.
6. Pena, M. and Fyfe, C. The Harmonic Topographic Map. Technical report
Number 35. School of Computing, University of Paisley.
http://cis.paisley.ac.uk/research/reports/
7. Pena, M. and Fyfe, C. Outlier Identification with the Harmonic Topographic
Mapping. In proceedings of the European Symposium on Artificial Neural
Networks, ESANN’06. Pages 289-295. April 2006.
8. McGlinchey, S. and Pena, M. and Fyfe, C. Quantization Errors in the Harmonic
Topographic Mapping. In proceedings of The 9th WSEAS International Conference
on applied mathematics, MATH 06. Pages 105-110. May
2006.
9. McGlinchey, S. and Pena, M. and Fyfe, C. Comparison of Quantization Errors
in the Model- and Data-driven Harmonic Topographic Mappings. WSEAS
Transactions On Computers. Volume 5, number 7, pages 1562-1570. July
2006. Pages 241-249.
10. Pena, M. and Fyfe, C. The Topographic Neural Gas. In proceedings of the
7th International Conference on Intelligent Data Engineering and Automated
Learning, IDEAL06. September 2006.
11. Pena, M. and Fyfe, C. Forecasting with topology preserving maps: Harmonic
Topographic Map and Topographic product of experts application. In proceedings
of the First International Conference on Multidisciplinary Information
Sciences and Technologies, InSciT2006. Pages 42-46. October 2006.
12. Pena, M. and Fyfe, C. The Topographic Neural Gas. Journal of Computing
and Information Systems. Volume 10, number 3, pages 6-14. 2006. ISSN
1352-9404.
13. Pena, M. and Fyfe, C. Principal Manifolds for Data Visualisation and Dimen-
sion Reduction. Book chapter: Topology-preserving Mappings for data visu-
alisation. Lecture Notes in Computational Science and Engineering. Springer.
2007.
Chapter 2
Literature review
Topographic mappings are a class of dimensionality reduction techniques that seek to
preserve some of the structure of the data in the geometric structure of the mapping.
The term “geometric structure” refers to the relationships between distances in data
space and the distances in the projection to the topographic map. In some cases all
distance relationships between data points are important, which implies a desire for
global isometry between the data space and the map space. Alternatively, it may
only be considered important that local neighbourhood relationships are maintained,
which is referred to as topological ordering [84]. When the topology is preserved, if
the projections of two points are close, it is because, in the original high dimensional
space, the two points were close. The closeness criterion is usually the Euclidean
distance between the data patterns.
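The closeness property described above can be checked numerically. The following is only an illustrative sketch under our own assumptions: the function `neighbourhood_agreement` and the toy data are hypothetical, and this score is not one of the topology preservation measures used later in the thesis. It counts how often a point's nearest neighbour in the projection is also one of its `k` nearest neighbours in data space, using Euclidean distance in both spaces.

```python
import math

def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))

def k_nearest(i, points, k):
    """Set of indices of the k points closest to points[i]."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def neighbourhood_agreement(data, projection, k=2):
    """Fraction of points whose nearest neighbour in the projection
    is also among their k nearest neighbours in data space."""
    hits = 0
    for i in range(len(data)):
        if nearest_neighbour(i, projection) in k_nearest(i, data, k):
            hits += 1
    return hits / len(data)

# Four points on a line in 3D and two candidate 1D projections.
data = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0)]
ordered = [(0.0,), (1.0,), (2.0,), (3.0,)]
scrambled = [(0.0,), (2.0,), (3.5,), (0.9,)]
print("ordered:  ", neighbourhood_agreement(data, ordered))
print("scrambled:", neighbourhood_agreement(data, scrambled))
```

A projection that keeps the ordering of the points scores higher than one that scrambles it, which is the behaviour expected of a topology preserving mapping.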
One clear example of a topographic mapping is a Mercator projection of the
spherical earth into two dimensions; the visualisation is improved, but some of the
distances in certain areas are distorted. These projections imply a loss of some of the
information which inevitably gives some inaccuracy but they are an invaluable tool
for visualisation and data analysis, e.g. for cluster detection. Two previous works
in this area have been the Self-Organizing Map (SOM) [48] and the Generative
Topographic Map (GTM) [8].
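The distortion of the Mercator projection mentioned above can be made concrete with a short sketch. It assumes a unit sphere and the standard Mercator formulas; the function names are our own. One degree of longitude always covers the same width on the map, while the true east-west distance on the sphere shrinks with the cosine of the latitude, so the local scale factor is 1/cos(latitude).

```python
import math

def mercator(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to Mercator (x, y) on a unit sphere."""
    lat = math.radians(lat_deg)
    x = math.radians(lon_deg)
    y = math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y

def scale_factor(lat_deg):
    """Local stretching of map distances relative to distances on the sphere."""
    return 1.0 / math.cos(math.radians(lat_deg))

for lat in (0, 30, 60, 80):
    # Map width of one degree of longitude at this latitude.
    width_on_map = mercator(lat, 1.0)[0] - mercator(lat, 0.0)[0]
    # True east-west length of that degree on the unit sphere.
    width_on_sphere = math.radians(1.0) * math.cos(math.radians(lat))
    print(lat, round(width_on_map / width_on_sphere, 3), round(scale_factor(lat), 3))
```

Near the equator the map is almost faithful, while at high latitudes distances are strongly exaggerated, which is the kind of local inaccuracy that any projection to a lower-dimensional map may introduce.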
In this chapter we detail the best known techniques in topology preservation,
such as the Self-Organizing Map and the Generative Topographic Mapping. We first
review the most common preprocessing techniques for reducing the dimensionality
of the data, a relevant step that reduces the computational time of the topology
preserving mappings by eliminating the redundancy between several variables. Then,
we review clustering techniques without topology preservation, which in most cases
form the basis of the topology preserving mappings described in the corresponding
section of this chapter. Finally, we present some of the software tools for topology
preservation available on the internet.
Figure 2.1: Data preprocessing with clustering and reduction of the dimensionality [49].
2.1 Dimensionality reduction
Databases keep growing, and algorithms have to deal with large amounts of data,
which leads to longer computation times. Thus, any preprocessing step that reduces
the dimensionality or the amount of data improves their performance. Linear and
nonlinear principal component analysis, whitening or sphering, and independent
component analysis are widely used for this task. The objectives of preprocessing,
depicted in Figure 2.1, are:
• To reduce the dimensionality of the data, i.e. to convert many (N) high-dimensional
(d) data points into many (N) low-dimensional (q < d) data points
• To represent the data by a limited set of prototypes, i.e. to convert many (N)
high-dimensional (d) data points into few (K < N) high-dimensional (d) data points
2.1.1 Principal Component Analysis
Principal component analysis (PCA) [55, 36], also known as the Karhunen-Loeve
transform, is a classical statistical method based on the reduction of correlation
between the variables. Variables that are correlated have a redundancy in informa-
tion, so by eliminating this redundancy we reduce the number of variables, and thus
the dimensionality of the data, keeping at the same time the maximum amount of
information. If the data is from a Gaussian distribution, the information is propor-
tional to the variance, so by projecting the data to the directions with maximum
variance we are preserving as well maximum information. Choosing a number of
major components may be enough to represent the whole data.
Calculating the Principal Components (PCs)
From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal
basis by finding its eigenvalues and eigenvectors. The eigenvectors $e_i$ and
the corresponding eigenvalues $\lambda_i$ are the solutions of the equation
\[ C_x e_i = \lambda_i e_i, \quad i = 1, \cdots, n \tag{2.1} \]
where $C_x = E\{(x - \mu_x)(x - \mu_x)^T\}$ is the covariance matrix and $\mu_x$ is the mean
value of x. The components of $C_x$, denoted by $c_{ij}$, represent the covariances between
the random variable components $x_i$ and $x_j$. The component $c_{ii}$ is the variance of the
component $x_i$.
By ordering the eigenvectors in the order of descending eigenvalues (largest first),
we find an ordered orthogonal basis with the first eigenvector having the direction
of largest variance of the data.
Let B be a matrix whose rows are the eigenvectors of the covariance matrix.
Projecting the data vector x on the coordinate axes defined by the orthogonal
basis, we get the latent variables
\[ y = B(x - \mu_x) \tag{2.2} \]
which is a point in the orthogonal coordinate system defined by the eigenvectors.
The components of y are the coordinates in the orthogonal basis. We can reconstruct
the original data vector x from y by
\[ x = B^T y + \mu_x \tag{2.3} \]
Instead of using all the eigenvectors of the covariance matrix, we may represent
the data in terms of only a few basis vectors of the orthogonal basis. If we denote
the matrix having the first K eigenvectors as rows by $B_K$, we obtain the projection
\[ y = B_K(x - \mu_x) \tag{2.4} \]
and the approximate reconstruction
\[ x \approx B_K^T y + \mu_x \tag{2.5} \]
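The projection and reconstruction steps in (2.1) to (2.5) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `pca_fit`, `pca_project` and `pca_reconstruct` and the synthetic data are assumptions made for the example, not part of the original text.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean, the top eigenvectors (as rows) and all sorted eigenvalues."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)           # covariance matrix C_x
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh suits symmetric matrices (ascending order)
    order = np.argsort(eigvals)[::-1]          # reorder: largest eigenvalue first
    B_K = eigvecs[:, order[:n_components]].T   # eigenvectors stored as rows, as in the text
    return mu, B_K, eigvals[order]

def pca_project(X, mu, B_K):
    return (X - mu) @ B_K.T                    # y = B_K (x - mu), one row per sample

def pca_reconstruct(Y, mu, B_K):
    return Y @ B_K + mu                        # x = B_K^T y + mu

# Hypothetical correlated data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])
mu, B, lam = pca_fit(X, 3)
X_full = pca_reconstruct(pca_project(X, mu, B), mu, B)        # all components kept
mu2, B2, _ = pca_fit(X, 2)
X_approx = pca_reconstruct(pca_project(X, mu2, B2), mu2, B2)  # two components kept
```

With all eigenvectors retained the reconstruction is exact up to rounding; with fewer components it is only an approximation whose error depends on the discarded eigenvalues.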
Whitening or Sphering
Another preprocessing technique is sphering or whitening of the data [38]; even
algorithms that do not necessarily need sphering, often converge better with sphered
data. The data is also assumed to be centered, i.e., made zero-mean.
Sphering means that the observed variable x is linearly transformed to a
variable
\[ v = Qx \tag{2.6} \]
such that the covariance matrix of v is the identity matrix: $E\{vv^T\} = I$. This
transformation is always possible and is very often performed with PCA, which
allows for data compression as well as reduction of Gaussian noise.
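One common way to obtain a sphering matrix Q from PCA is to rescale each principal direction by the inverse square root of its eigenvalue. The sketch below assumes this particular choice of Q; the function name `whiten` and the small constant `eps` (added to avoid division by zero) are illustrative assumptions.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Centre X and return (V, Q) with V = Xc Q^T having identity covariance."""
    Xc = X - X.mean(axis=0)                    # the data is assumed to be zero-mean
    C = np.cov(Xc, rowvar=False)
    eigvals, E = np.linalg.eigh(C)
    # Q = Lambda^{-1/2} E^T; eps is a hypothetical safeguard for tiny eigenvalues
    Q = np.diag(1.0 / np.sqrt(eigvals + eps)) @ E.T
    return Xc @ Q.T, Q

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.3, 0.0, 1.0]])                # hypothetical mixing matrix
X = rng.normal(size=(500, 3)) @ A
V, Q = whiten(X)
cov_V = np.cov(V, rowvar=False)                # should be close to the identity
```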
2.1.2 Independent Component Analysis
In many cases the data is assumed to follow a Gaussian distribution (also called
the normal distribution), which simplifies the application of algorithms. A Gaussian
distribution is described by its mean and variance (first and second moments, see
Table 2.1). In other cases like Independent Component Analysis, higher-order statis-
tics such as kurtosis (fourth moment) are required to eliminate mutual information
between the variables that decorrelation is not able to eliminate, giving signals as
independent as possible. We define $y_1$ and $y_2$ to be independent if and only if the
joint pdf is factorizable in the following way: $p(y_1, y_2) = p_1(y_1)\,p_2(y_2)$. A weaker
form of independence is uncorrelatedness. Two random variables $y_1$ and $y_2$ are said to
be uncorrelated if their covariance is zero: $E(y_1 y_2) - E(y_1)E(y_2) = 0$. If the variables
are independent, they are uncorrelated, but the converse does not hold in general.
A data distribution is skewed if it is not symmetric; leptokurtic or super-Gaussian
distributions are distributions which are more kurtotic than a Gaussian
distribution, and conversely platykurtic or sub-Gaussian distributions are less
kurtotic than a Gaussian distribution.
Table 2.1: Order moments of a Gaussian distribution.
The Gaussian distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$
The first order (mean): $\mu_X = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
The second order (variance): $\sigma_X^2 = Var(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$
The third order (skewness): $E[(X-\mu)^3]$
The fourth order (kurtosis): $E[(X-\mu)^4]$
Independent Component Analysis (ICA) [39, 79] is a technique with applica-
tions in data analysis, source separation, and feature extraction. It can be used as
a preprocessing technique, where fewer reference vectors are used to represent all
the samples. PCA, as we have seen, is a technique that decorrelates the data by
finding the highest variance projections, and thus depends on second-order statis-
tics. Independent component analysis on the other hand also reduces higher-order
statistical dependencies such as kurtosis, so that all mutual information between
variables is eliminated. ICA represents a multidimensional random vector as a
linear combination of non-Gaussian random variables that are as independent as
possible. Quantitative measures of non-Gaussianity (such as kurtosis, negentropy,
mutual information) are used as measures of independence, allowing for different
algorithms for ICA. There are also ICA models where the objective is to decorrelate
a non-linear combination of the signals.
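Kurtosis as a measure of non-Gaussianity can be illustrated numerically. The sketch below uses the normalised excess kurtosis (fourth central moment divided by the squared variance, minus 3), which is an assumption of this example rather than the definition in Table 2.1; it is zero for a Gaussian, positive for super-Gaussian and negative for sub-Gaussian data.

```python
import numpy as np

def excess_kurtosis(x):
    """Normalised fourth central moment minus 3 (zero for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
n = 200_000
k_gauss = excess_kurtosis(rng.normal(size=n))            # close to zero
k_laplace = excess_kurtosis(rng.laplace(size=n))         # super-Gaussian: positive
k_uniform = excess_kurtosis(rng.uniform(-1, 1, size=n))  # sub-Gaussian: negative
```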
2.2 Norms and metrics
In linear algebra, functional analysis and related areas of mathematics, a norm is
a function which assigns a positive length or size to all vectors in a vector space,
other than the zero vector¹. A simple example is the 2-dimensional Euclidean space
equipped with the Euclidean norm. Elements in this vector space are usually drawn
as arrows in a 2-dimensional Cartesian coordinate system. The Euclidean norm
assigns to each vector the length of its arrow. A vector space with a norm is called
a normed vector space.
¹ From Wikipedia: http://en.wikipedia.org/wiki/Main_Page
2.2.1 The Minkowski Metric
The Minkowski metrics are a family of distance measures given by:
\[ d_{ij} = \left\{ \sum_{k=1}^{d} \left| x_i^{(k)} - x_j^{(k)} \right|^r \right\}^{1/r} \tag{2.7} \]
where $d_{ij}$ is the distance between the d-dimensional entities i and j, $x_i^{(k)}$ is the value
of the kth variable for entity i, $x_j^{(k)}$ is the value of the kth variable for entity j, and
r > 0. The most common distance measure is the Euclidean or L2 norm, where
r = 2. The Euclidean distance is used in K-Means (see below), where the objective
function is the sum of squares of distances between datapoints and prototypes, which
in statistics is the total intra-cluster variance.
When r = 1 we have the Taxicab norm or Manhattan norm, where the name
relates to the distance a taxi has to drive in a rectangular street grid to get from the
origin to the point x. The distance that returns the maximum of absolute difference
in coordinates corresponds to r = ∞. Figure 2.2 shows a representation of several
of these distances.
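As a small worked example of (2.7), the following sketch evaluates the Minkowski distance for several values of r, including the limiting case r = ∞. The function name `minkowski` and the two sample points are illustrative assumptions.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r > 0 between two equal-length sequences."""
    if r <= 0:
        raise ValueError("r must be positive")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)                 # limit r -> infinity: maximum coordinate difference
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = (0.0, 0.0), (3.0, 4.0)
d_manhattan = minkowski(x, y, 1)          # 3 + 4
d_euclidean = minkowski(x, y, 2)          # sqrt(9 + 16)
d_max = minkowski(x, y, math.inf)         # max(3, 4)
d_fractional = minkowski(x, y, 0.5)       # a fractional order, r < 1
```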
2.2.2 The Mahalanobis distance
In statistics, the Mahalanobis distance is a distance measure introduced by P. C.
Mahalanobis in 1936. It is based on correlations between variables by which different
patterns can be identified and analysed. It is a useful way of determining similarity
of an unknown sample set to a known one. It differs from Euclidean distance in that
it takes into account the correlations of the data set and is scale-invariant.
Mahalanobis distance is defined as the dissimilarity measure between two random
vectors x and y of the same distribution with the covariance matrix Σ :
\[ d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)} \tag{2.8} \]
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces
to the Euclidean distance. If the covariance matrix is diagonal, then the resulting
distance measure is called the normalized Euclidean distance:
\[ d(x, y) = \sqrt{\sum_{k=1}^{d} \frac{(x^{(k)} - y^{(k)})^2}{\sigma_k^2}} \tag{2.9} \]
where $\sigma_k$ is the standard deviation of the kth variable $x^{(k)}$ over the sample set.
Figure 2.2: First quadrant plot of unit length loci in two dimensions from the origin with various Lr norms (L0.3, L1, L2 and L5) [20].
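The two special cases mentioned above (identity and diagonal covariance) can be checked with a short sketch of (2.8). The function name `mahalanobis` and the numbers are illustrative assumptions; the linear system is solved rather than forming the inverse explicitly.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # solve(cov, diff) computes cov^{-1} diff without an explicit inverse
    return float(np.sqrt(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)))

x, y = [2.0, 1.0], [0.0, 0.0]
d_identity = mahalanobis(x, y, np.eye(2))             # equals the Euclidean distance
d_diagonal = mahalanobis(x, y, np.diag([4.0, 1.0]))   # normalised Euclidean distance (2.9)
```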
2.2.3 Metrics in high dimensional spaces
The importance of the metric used to calculate distances or dissimilarities in gen-
eral is widely recognised, in topology preserving mappings in particular. However,
familiarity makes the Euclidean norm the most used distance, without always con-
sidering the implications of this choice. In [1, 20, 26] this selection is analysed for
high dimensional data, with normalised or raw data, and different noise models
respectively.
In [20] Doherty et al. note that “in a learning context when measuring dissimilarities
between two entities, the use of a fractional norm reduces the impact of
extreme individual attribute differences when compared to the equivalent Euclidean
measurements. Conversely, the higher-order norms emphasise the larger attribute
dissimilarities between the two entities and taken to the limit, L∞ reports the dis-
tance based on the single attribute with the maximum dissimilarity”. However, they
do not find clear benefits from using any particular norm, which rather leaves us with
the use of the familiar Euclidean distance.
As shown in [5], when the dimensionality of the data increases, the difference between
the maximum and minimum distances to a given point grows more slowly than the
distance to its nearest point. This gives a poor discrimination between nearest
and furthest neighbour in high-dimensional space, making the distance meaningless.
They define as criterion of meaningfulness
\[ \frac{\mathrm{Dmax}^{(d)}_k - \mathrm{Dmin}^{(d)}_k}{\mathrm{Dmin}^{(d)}_k} \tag{2.10} \]
which in [1] is called “relative contrast”. After studying this measure, the authors
of [1] argue that the Manhattan norm is preferable to the Euclidean norm in
high-dimensional space.
After getting better results with r = 1 in high dimensional space, they experiment
with r < 1, which they called “fractional distance metrics”. They prove that in this
case, the smaller the fraction, the greater the rate of absolute divergence between
the maximum and minimum distance. They also give results with different datasets
and K-Means, always getting better results with fractional distances.
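The relative contrast in (2.10) is easy to estimate empirically. The sketch below is only an illustration under assumed conditions (uniformly distributed synthetic data, one random query point, hypothetical function name `relative_contrast`); it does not reproduce the experiments of [1] or [5].

```python
import numpy as np

def relative_contrast(X, query, r):
    """(Dmax - Dmin) / Dmin for distances of order r from `query` to the rows of X."""
    dist = np.sum(np.abs(X - query) ** r, axis=1) ** (1.0 / r)
    return (dist.max() - dist.min()) / dist.min()

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 20, 200):                       # increasing dimensionality
    X = rng.uniform(size=(1000, d))
    q = rng.uniform(size=d)
    contrast[d] = {r: relative_contrast(X, q, r) for r in (0.5, 1.0, 2.0)}
```

Inspecting `contrast` for growing d shows how the gap between the nearest and the furthest neighbour shrinks relative to the nearest distance.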
In [26] Francois et al. suggest choosing the metric according to the shape of the
noise that is assumed on the data. While it is known that the Euclidean metric is
optimal in the presence of white Gaussian noise, it is shown that other types of noise
require other metrics. They give the example of impulse or burst noise, present in
many high-dimensional data; burst noise is a noise which affects only a minority of
the components of the data elements, but in a significant way.
The experiments carried out with fractional metrics applied to the new algo-
rithms presented in this thesis did not give better results. This could be due to the
improvement in the clustering techniques used.
2.3 Cluster analysis
2.3.1 K-Means
K-Means clustering is an algorithm to divide samples based on attributes/features
into K groups. K is a positive integer that has to be given in advance. The group-
ing is done by minimizing the sum of squares of distances between data and the
corresponding prototypes mk.
The performance function for K-Means may be written as
\[ J = \sum_{i=1}^{N} \min_{k=1,\dots,K} \| x_i - m_k \|^2 \tag{2.11} \]
which we wish to minimise by moving the prototypes to the appropriate positions.
Note that in (2.11) each data point contributes only through its closest prototype,
and it is this assignment that determines the clustering.
Any prototype which is still far from the data is not utilised and does not enter any
calculation of the performance function, which may result in dead prototypes
that never represent any cluster. Thus the initialisation of the prototypes
can have a large effect on the result of K-Means.
The algorithm has the following steps:
• Step 1. Begin with a decision on the value of K, the number of clusters.
• Step 2. Choose any initial partition that divides the data into K clusters, e.g. randomly.
• Step 3. Take each sample in sequence and compute its distance from the
prototype of each of the clusters. If a sample is not currently in the cluster
with the closest prototype, switch this sample to that cluster and update the
prototypes of the cluster gaining the new sample and the cluster losing the
sample.
• Step 4. Repeat step 3 until convergence is achieved, that is until a pass through
the training samples causes no new assignments.
The main problem of the K-Means algorithm is getting trapped in local minima.
There are several extensions of this algorithm [32], including K-Harmonic Means,
explained below.
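A compact sketch of K-Means is given below. Note that it uses the batch (Lloyd-style) variant, in which all samples are reassigned before the prototypes are recomputed, rather than the sample-by-sample switching of Step 3, and that it initialises the prototypes on randomly chosen data points instead of a random partition. The function name `kmeans` and the test data are assumptions of this example.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Batch K-Means; returns prototypes, labels and the performance J of (2.11)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # no new assignments: converged
        labels = new_labels
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:                   # an empty cluster keeps its old ("dead") prototype
                m[k] = members.mean(axis=0)
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    return m, d2.argmin(axis=1), d2.min(axis=1).sum()

# Two hypothetical, well separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
m, labels, J = kmeans(X, K=2)
```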
2.3.2 K-Means++
In order to overcome the initialisation problem of K-Means, Arthur and Vassilvitskii
[3] modified the algorithm by substituting the random allocation of the prototypes
with a seeding technique.
The K-Means algorithm begins with an arbitrary set of cluster prototypes. They
propose a specific way of choosing these prototypes. At any given time, let D(x)
denote the shortest distance from a data point x to the closest prototype already
chosen. Then, the K-Means++ algorithm is as follows:
1. Choose an initial prototype m1 uniformly at random from the dataset X.
2. Choose the next prototype $m_k$, selecting $m_k = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.
3. Repeat from Step 2 until we have chosen a total of K prototypes.
4. Proceed as with the standard K-Means algorithm.
They give experimental results that show the advantage in time and accuracy of
this technique.
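The seeding steps 1 to 3 above can be sketched as follows. The function name `kmeans_pp_seeds` and the test data are illustrative assumptions; the returned prototypes would then be passed to a standard K-Means routine (step 4).

```python
import numpy as np

def kmeans_pp_seeds(X, K, seed=0):
    """Choose K initial prototypes from the rows of X with D(x)^2 weighting."""
    rng = np.random.default_rng(seed)
    N = len(X)
    first = rng.integers(N)                        # step 1: uniform choice
    prototypes = [X[first]]
    d2 = ((X - X[first]) ** 2).sum(axis=1)         # D(x)^2 for every data point
    for _ in range(1, K):
        idx = rng.choice(N, p=d2 / d2.sum())       # step 2: probability proportional to D(x)^2
        prototypes.append(X[idx])
        # D(x) is the distance to the closest prototype chosen so far
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(prototypes)

# Three hypothetical tight clusters centred at x = 0, 10 and 20.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([c, 0.0], 0.01, size=(30, 2)) for c in (0.0, 10.0, 20.0)])
seeds = kmeans_pp_seeds(X, K=3)
```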
2.3.3 Harmonic Averages
Harmonic Means or Harmonic Averages are defined for spaces of derivatives. For
example, if you travel 1/2 of a journey at 10 km/hour and the other 1/2 at 20 km/hour,
and d is the length of each half, your total time taken is $\frac{d}{10} + \frac{d}{20}$ and so the
average speed is $\frac{2d}{\frac{d}{10} + \frac{d}{20}} = \frac{2}{\frac{1}{10} + \frac{1}{20}} \approx 13.3$ km/hour. In
general, the Harmonic Average of K points, $a_1, ..., a_K$, is defined as
\[ HA(\{a_i, i = 1, \cdots, K\}) = \frac{K}{\sum_{k=1}^{K} \frac{1}{a_k}} \tag{2.12} \]
This average is used in K-Harmonic Means to overcome the problem of local
minima.
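The journey example can be verified directly from (2.12). The function name `harmonic_average` is an assumption of this short sketch.

```python
def harmonic_average(values):
    """Harmonic average K / sum(1 / a_k) of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return len(values) / sum(1.0 / v for v in values)

# Half the journey at 10 km/hour, the other half at 20 km/hour.
avg_speed = harmonic_average([10.0, 20.0])
arithmetic_mean = (10.0 + 20.0) / 2.0      # for comparison: larger than the harmonic average
```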
2.3.4 K- Harmonic Means
The harmonic average has recently [95] been used to make the K-Means algorithm more
robust. Zhang et al. have developed an algorithm based on the harmonic average which
converges to a better solution than the standard algorithm. The algorithm calculates the
Euclidean distance between the ith data point and the kth prototype as $d(x_i, m_k)$.
Using gradient descent on the performance function
\[ J_{HA} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{d(x_i, m_k)^2}} \tag{2.13} \]
we get
\[ m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{2.14} \]
where $d_{ik}$ is $d(x_i, m_k)$.
In [95] extensive simulations show that this algorithm converges to a better solution
(less prone to finding a local minimum because of poor initialisation) than either
standard K-Means or a mixture of experts trained using the Expectation Maximisation
algorithm.
With this learning rule on a latent space model similar to the GTM, we get a
mapping which has elements of topology preservation in the HaToM algorithm (see
later).
Zhang subsequently developed a generalised version of the algorithm [91, 92, 93]
that includes the pth power of the L2 distance², which creates a “dynamic weighting
function” that determines how data points participate in the next iteration in the
calculation of the new prototypes $m_k$. The weight is bigger for data points further
away from the prototypes, so that their participation is boosted in the next iteration.
This makes the algorithm insensitive to initialisation and also prevents one cluster
from taking more than one prototype.
² Note that this distance is different from the Minkowski distance presented in Section 2.2.1, where Lp is applied.
The aim of K-Harmonic Means was to improve the winner-takes-all partitioning
strategy of K-Means that gives a very strong relation between each datapoint and
its closest prototype, so that the change in membership is not allowed until another
prototype is closer. The transition of prototypes between areas of high density
is more continuous in K-Harmonic Means due to the distribution of associations
between prototypes and datapoints. To explain this we consider a general formula
for K-Means and K-Harmonic Means for the updating of the prototypes
\[ m_k \leftarrow \frac{\sum_{i=1}^{N} mem(m_k|x_i)\, weight(x_i)\, x_i}{\sum_{i=1}^{N} mem(m_k|x_i)\, weight(x_i)} \tag{2.15} \]
where
• $weight(x_i) > 0$ is the weighting function that defines how much influence a
data point $x_i$ has in recomputing the prototype parameters $m_k$ in the next
iteration.
• $mem(m_k|x_i)$, with $mem(m_k|x_i) \geq 0$ and $\sum_{k=1}^{K} mem(m_k|x_i) = 1$, is the
membership function that decides the portion of $weight(x_i)\, x_i$ associated with
$m_k$.
K-Means has a hard membership, that is each datapoint is related to just one
prototype, thus
\[ mem(m_l|x_i) = \begin{cases} 1 & \text{if } l = \arg\min_k \| x_i - m_k \|^2 \\ 0 & \text{otherwise} \end{cases} \tag{2.16} \]
and the weighting function is
\[ weight(x_i) = 1 \tag{2.17} \]
with weight > 0 ∀i. The soft membership in K-Harmonic Means, on the other hand,
\[ mem(m_k|x_i) = \frac{\| x_i - m_k \|^{-p-2}}{\sum_{l=1}^{K} \| x_i - m_l \|^{-p-2}} \tag{2.18} \]
allows the data points to belong partly to all prototypes.
The boosting properties for the generalised version of K-Harmonic Means (p > 2)
are given by the weighting function [32]:
\[ weight(x_i) = \frac{\sum_{k=1}^{K} \| x_i - m_k \|^{-p-2}}{\left( \sum_{k=1}^{K} \| x_i - m_k \|^{-p} \right)^2} \tag{2.19} \]
where the dynamic function gives a variable influence to data in clustering in a
similar way to boosting [27], since the effect of any particular data point on the
re-calculation of a prototype is $O(\| x_i - m_k \|^{2p-p-2})$, which for p > 2 has greatest
effect for larger distances.
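One prototype update of generalised K-Harmonic Means can be sketched by combining the general rule (2.15) with the membership (2.18) and the weighting function (2.19). The function name `khm_update`, the clamping constant `eps` (which avoids division by zero when a prototype coincides with a data point) and the synthetic data are assumptions of this example. For p = 2 the product of membership and weight reduces to the coefficient appearing in (2.14).

```python
import numpy as np

def khm_update(X, m, p=2.0, eps=1e-8):
    """One K-Harmonic Means prototype update following (2.15), (2.18) and (2.19)."""
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)   # (N, K) distances d_ik
    d = np.maximum(d, eps)                                      # hypothetical safeguard
    mem = d ** (-p - 2)
    mem = mem / mem.sum(axis=1, keepdims=True)                  # soft membership (2.18)
    weight = (d ** (-p - 2)).sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2   # (2.19)
    coef = mem * weight[:, None]                                # mem(m_k|x_i) * weight(x_i)
    return (coef.T @ X) / coef.sum(axis=0)[:, None]             # general update (2.15)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(3.0, 0.3, size=(40, 2))])
m0 = rng.normal(1.5, 1.0, size=(2, 2))       # hypothetical initial prototypes
m1 = khm_update(X, m0, p=2.0)                # single update
m = m0.copy()
for _ in range(30):                          # a few iterations of the update
    m = khm_update(X, m, p=2.0)
```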
An algorithm that uses the clustering properties of K-Harmonic Means for image
segmentation is the Spatial Kernel-based KHM (SKKHM) algorithm [52]. A kernel-
induced metric substitutes the classic Euclidean intensity distance, reducing the
effect of outliers and noise.
2.3.5 Neural Gas
Vector quantization methods encode a set of data points in N-dimensional space
with a smaller set of reference or prototype vectors $m_k$, k = 1, ..., K. Neural Gas
(NG) [57] is a vector quantization technique with soft competition between the units;
it is called the Neural Gas algorithm because the prototypes of the clusters move
around in the data space similar to the Brownian movement of gas molecules in a
closed container. In each training step, the squared Euclidean distances
\[ d_{ik} = \| x_i - m_k \|^2 = (x_i - m_k)^T (x_i - m_k) \tag{2.20} \]
between a randomly selected input vector $x_i$ from the training set and all reference
vectors $m_k$ are computed; the vector of these distances is $\mathbf{d}$. Each prototype k
is assigned a rank $r_k(\mathbf{d}) = 0, ..., K-1$, where a rank of 0 indicates the closest and a
rank of K-1 the most distant prototype to $x_i$. The learning rule is then
\[ m_k = m_k + \varepsilon\, h_\rho[r_k(\mathbf{d})]\, (x_i - m_k) \tag{2.21} \]
The function
\[ h_\rho(r_k(\mathbf{d})) = e^{-r_k(\mathbf{d})/\rho} \tag{2.22} \]
is a monotonically decreasing function of the ranking that adapts not only the closest
prototype, but all the prototypes, with a factor exponentially decreasing with their
rank. The width of this influence is determined by the neighborhood range ρ. The
learning rule is also affected by a global learning rate ε. The values of ρ and ε
decrease exponentially from an initial positive value (ρ(0), ε(0)) to a smaller final
positive value (ρ(T), ε(T)) according to
\[ \rho(t) = \rho(0) \left[ \rho(T)/\rho(0) \right]^{t/T} \tag{2.23} \]
and
\[ \varepsilon(t) = \varepsilon(0) \left[ \varepsilon(T)/\varepsilon(0) \right]^{t/T} \tag{2.24} \]
where t is the time step and T the total number of training steps, forcing more local
changes with time.
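The Neural Gas training loop defined by (2.20) to (2.24) can be sketched as below. The function name `neural_gas`, the default values of the initial and final learning rate and neighbourhood range, and the initialisation on random data points are all assumptions chosen for illustration.

```python
import numpy as np

def neural_gas(X, K, T=2000, eps0=0.5, epsT=0.005, rho0=None, rhoT=0.01, seed=0):
    """Online Neural Gas with exponentially decaying learning rate and range."""
    rng = np.random.default_rng(seed)
    rho0 = K / 2.0 if rho0 is None else rho0          # hypothetical default
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for t in range(T):
        frac = t / T
        rho = rho0 * (rhoT / rho0) ** frac            # neighbourhood range (2.23)
        eps = eps0 * (epsT / eps0) ** frac            # learning rate (2.24)
        x = X[rng.integers(len(X))]                   # randomly selected input vector
        d = ((x - m) ** 2).sum(axis=1)                # squared distances (2.20)
        ranks = np.argsort(np.argsort(d))             # rank 0 = closest prototype
        h = np.exp(-ranks / rho)                      # rank-based neighbourhood (2.22)
        m += eps * h[:, None] * (x - m)               # learning rule (2.21)
    return m

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 2))
m = neural_gas(X, K=10)
```

Because each update is a convex combination of a prototype and a data point, the prototypes stay inside the region covered by the data.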
The updating rule in NG is very similar to the Self-Organizing Map (SOM) rule
(see Section 2.4.2), the difference being the neighborhood function that SOM uses:
\[ h = \exp\left( -\frac{\| t_k - t_j \|^2}{\sigma^2} \right) \tag{2.25} \]
where $t_k$ is the position of the kth latent point in latent space.
In contrast to the NG, the neighborhood function of SOM is evaluated in the
latent space. The advantage of the SOM is the ordered topological structure of
neurons. In contrast, in the original NG, such an order is not given. One can
extend the NG to the topology representing network (TRN) such that topological
relations between neurons are installed during learning, although generally they do
not achieve the simple structure as in SOM lattices [58].
There is also a Growing version of Neural Gas [28] that learns the topology of the
data by combining NG with Competitive Hebbian Learning (CHL), which is then
closer to the SOM algorithm.
2.4 Topology preserving mappings
The interest in feature maps stems directly from their biological importance. A
feature map uses the “physical layout” of the output neurons to model some feature
of the input space. In particular, if two output neurons ya and yb are close together
with respect to some distance measure in the output layer, then the corresponding
inputs xa and xb which cause ya and yb to fire must be close together in the input
space. Such maps are also called topology preserving maps (TPM).
As explained in [31], preserving the distances in a TPM means that:
1. Nearby data points give nearby projection points.
2. Distant data points give distant projection points.
3. If the projections of two data points are close, it is because, in the original
high dimensional space, the two data points were close.
4. If the projections are distant, the original data points were distant.
However the techniques presented in this thesis (and also SOM and GTM) only
guarantee the second and third properties.
There are several ways of creating feature maps; the most popular are Kohonen’s
SOM and the GTM. Kohonen’s Self-Organizing Map (SOM) is a neural network
that produces a topology-preserving map. It takes into consideration the physical
arrangement of the nodes: nodes that are “close” together interact
differently from nodes that are “far” apart. This TPM is by far the most popular,
and it is not uncommon to substitute the term topology preserving maps with
self-organizing maps.
The GTM was developed by Bishop as a probabilistic version of the SOM, in
order to overcome some of the problems of this map, especially the lack of an
objective function.
Further information about ordering, convergence properties, energy functions
and topology preservation in self-organizing maps is provided in [22, 23, 47, 53, 54,
56, 76].
2.4.1 MultiDimensional Scaling
Multidimensional scaling (MDS) [50, 77] is an exploratory technique used to visualize
dissimilarities in a low dimensional space, usually Euclidean, in which the distances
in the projection $d_{ij}$ match, as well as possible, the original dissimilarities $\delta_{ij}$.
The dissimilarities may themselves be distances or any other proximity measure; indeed,
any kind of relation between a pair of entities that can be translated into a proximity
measure, or conversely into a dissimilarity measure, can be used.
Classical scaling, that treats dissimilarities directly as Euclidean distances, and
least squares scaling, which matches distances dij to transformed dissimilarities
f(δij), are known as metric MDS, where metric refers to the type of transforma-
tion. They can be shown to be special cases of principal components analysis. With
Non-metric MDS, the metric nature is abandoned, and only the rank order of dis-
similarities has to be preserved by the transformation.
The coordinates in the distance function ($x_{ia}$, i = 1, ..., N with N = number
of entities, a = 1, ..., d with d = number of dimensions) and the function f, which
transforms the dissimilarities into distances, are estimated by minimising the
following badness of fit function (usually called stress or S-function in the context
of MDS):
\[ S = \left( \frac{\sum_{i=1}^{N} \sum_{j>i}^{N} (f(\delta_{ij}) - d_{ij})^2}{\sum_{i=1}^{N} \sum_{j>i}^{N} d_{ij}^2} \right)^{1/2} \tag{2.26} \]
Metric Multidimensional Scaling
The simplest case of multidimensional scaling considers quantitative data. In
classical scaling the dissimilarities are treated directly as distances. In metric scaling
two properties have to hold³: these are called non-degeneracy and the triangle inequality.
Non-degeneracy states that $d_{ij} = 0 \Rightarrow i = j$ and the triangle inequality means
that $d_{ij} + d_{jk} \geq d_{ik}$ for all (i, j, k). The matrix obtained after pre-processing is
labeled D. It can be shown that the elements of the double centered matrix of squared
dissimilarities equal minus two times the scalar products:
\[ d_{ij}^2 - \frac{1}{N}\sum_{j=1}^{N} d_{ij}^2 - \frac{1}{N}\sum_{i=1}^{N} d_{ij}^2 + \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}^2 = -2 \sum_{a=1}^{d} x_{ia} x_{ja} \tag{2.27} \]
with i the row index, j the column index, N the number of objects and d the
number of dimensions. Then the matrix of scalar products is:
\[ B = -\frac{1}{2} \left[ I - \frac{1}{N} \mathbf{i}\mathbf{i}^T \right] D^2 \left[ I - \frac{1}{N} \mathbf{i}\mathbf{i}^T \right] \tag{2.28} \]
where I is the N by N identity matrix, $\mathbf{i}$ is a vector of ones of length N, and $D^2$
is the matrix of squared dissimilarities $d_{ij}^2$.
To obtain the original X values, the singular value decomposition (SVD) is performed,
$B = V \Lambda V^T$; defining $B = XX^T$ we get the matrix of the coordinates with
$X = V \Lambda^{1/2}$. To project into a space of lower dimensionality we retain only the first
r eigenvectors: this implies that the summation over a in equation (2.27) runs from
1 to r instead of d.
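The classical scaling procedure (double centring, decomposition, and truncation to r dimensions) can be sketched as follows. The function name `classical_mds` and the test points are assumptions of this example; since B is symmetric, the sketch uses a symmetric eigendecomposition in place of a general SVD.

```python
import numpy as np

def classical_mds(D, r):
    """Classical scaling: coordinates in r dimensions from a distance matrix D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centring matrix I - (1/N) i i^T
    B = -0.5 * J @ (D ** 2) @ J                  # matrix of scalar products (2.28)
    eigvals, V = np.linalg.eigh(B)               # B is symmetric
    order = np.argsort(eigvals)[::-1][:r]        # keep the r largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negative values
    return V[:, order] * np.sqrt(lam)            # X = V Lambda^{1/2}

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 2))                                  # hypothetical 2-D points
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)     # their pairwise distances
X2 = classical_mds(D, 2)
D_rec = np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=2)
```

The recovered coordinates differ from the original points by a rotation, reflection and translation, but the pairwise distances are reproduced.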
2.4.2 Self-Organizing Map
Kohonen’s algorithm [47, 48] is an algorithm used to visualize and interpret large
high-dimensional data sets. The map consists of a regular grid of processing units,
“neurons” in a (usually) 2-layer network and competition takes place between the
output neurons. The disposition of the neurons in that grid can be different, as
shown in Figure 2.3 but the one used in this thesis (unless specified otherwise) for
SOM and the other topology preserving mappings is the rectangular lattice.
Figure 2.3: Examples of Map shapes available for the SOM mapping [87].
³ From http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm
The map attempts to represent all the available observations with optimal accuracy
using a restricted set of models. At the same time the models become ordered
on the grid so that similar models are close to each other and dissimilar models far
from each other. Fitting of the model vectors is usually carried out by a sequential
regression process: for each sample x(t), where t = 1, 2, ... is the step index, first the
winner index BMU (Best Matching Unit) is identified by the condition
\[ \forall k \quad \| x(t) - m_{BMU(x)}(t) \| \leq \| x(t) - m_k(t) \| \tag{2.29} \]
After that, all model vectors or a subset of them that belong to nodes centered
around node BMU are updated as
\[ m_k(t+1) = m_k(t) + h(BMU(x), k)\,(x(t) - m_k(t)) \tag{2.30} \]
Here h(BMU(x), k) is the neighbourhood function, a decreasing function of the
distance between the kth and BMU nodes on the map grid. Typical functions are
the Gaussian function, and the Difference of Gaussians function shown below; thus
if unit k is at point $t_k$ in the output layer and $k^*$ denotes the BMU, then
\[ h(k, k^*) = a \exp\left( -\frac{\| t_k - t_{k^*} \|^2}{2\sigma_1^2} \right) - b \exp\left( -\frac{\| t_k - t_{k^*} \|^2}{2\sigma_2^2} \right) \]
This regression is usually iterated over the available samples. See Figure 2.4⁴ to
visualise the ordering process of the neurons in a two-dimensional mapping.
Figure 2.4: Weight vectors during the ordering process in a two dimensional mapping.
⁴ Adapted from http://www-users.cs.york.ac.uk/~sok/IML/
The number of neighbours and how much each weight can learn decrease over
time. This whole process is repeated a large number of times, usually more than
1000 times.
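A minimal sketch of the sequential SOM training described above is given below, using a rectangular lattice and a single Gaussian neighbourhood function. The function name `train_som`, the linear decay schedules for the learning rate and neighbourhood width, and the parameter values are assumptions of this example rather than the settings used in this thesis.

```python
import numpy as np

def train_som(X, rows, cols, n_iter=1000, lr0=0.5, sigma0=None, sigma_end=0.5, seed=0):
    """Sequential SOM on a rows x cols rectangular lattice."""
    rng = np.random.default_rng(seed)
    # lattice positions t_k of the units in the output layer
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0 if sigma0 is None else sigma0   # hypothetical default
    m = X[rng.choice(len(X), size=rows * cols, replace=True)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                          # learning rate decreases over time
        sigma = sigma0 + (sigma_end - sigma0) * frac     # neighbourhood shrinks over time
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((x - m) ** 2).sum(axis=1))      # winner, condition (2.29)
        g2 = ((grid - grid[bmu]) ** 2).sum(axis=1)       # squared lattice distance to the BMU
        h = lr * np.exp(-g2 / (2.0 * sigma ** 2))        # Gaussian neighbourhood function
        m += h[:, None] * (x - m)                        # update rule (2.30)
    return m, grid

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 2))
m, grid = train_som(X, rows=4, cols=4)
```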
There are some drawbacks to the SOM algorithm [24, 64] like the lack of an ob-
jective function and no proof of convergence. However, it is a widely used technique
[44, 63], that has been the basis for many algorithms [51].
There are many extensions of the SOM algorithm, like ViSOM [89, 90], or
visualisation-induced SOM, that splits the updating force of each winner in two:
the first force pointing from the winner to the input data xi. It adapts the neu-
ron towards the input in a direction orthogonal to the tangent plane of the winner.
The second force is a lateral force bringing the neighbour neuron to the winner
neuron. The ViSOM constrains the lateral contraction forces between neurons and
hence regularises the inter-neuron distances so that distances between neurons in
the data space are in proportion to those in the map space, preserving the distance
information on the map, along with the topology.
U-Matrix
There are different methods for visualising the results of a SOM mapping [85], one
of the most common being the U-Matrix (unified distance matrix). The U-matrix
allows us to visualize the distances between the neurons. The distance in data space
between the adjacent neurons is calculated and illustrated with different colourings
between the adjacent nodes. A red colouring between the neurons corresponds to a
large distance and thus a gap between the codebook values in the input space. A
blue colouring between the neurons signifies that the codebook vectors are close to
each other in the input space. Light areas can be thought of as clusters and dark
areas as cluster separators. This can be a helpful presentation when one tries to
find clusters in the input data without having any a priori information about the
clusters.
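The idea behind the U-matrix can be sketched with a simplified version that stores, for each node of a rectangular lattice, the average data-space distance to its lattice neighbours. This is an assumption made for brevity: common implementations also store the individual distances between adjacent nodes in intermediate cells. The function name `u_matrix` and the toy prototypes are hypothetical.

```python
import numpy as np

def u_matrix(m, rows, cols):
    """Average data-space distance from each node to its 4-connected lattice neighbours."""
    W = m.reshape(rows, cols, -1)                 # prototypes arranged on the lattice
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols:
                    dists.append(np.linalg.norm(W[i, j] - W[a, b]))
            U[i, j] = np.mean(dists)
    return U

# Toy 2 x 3 map with one-dimensional prototypes: the two rows are far apart.
m = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
U = u_matrix(m, rows=2, cols=3)
```

Large entries of `U` mark nodes whose neighbours are far away in the data space, that is, candidate cluster separators.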
In Figure 2.5 we can see the U-Matrix representation for the iris dataset that
reveals two clusters in the upper right corner and lower part, separated by an area
of higher distance.
2.4.3 Generative Topographic Map
The Generative Topographic Mapping (GTM) [8, 9, 11, 80, 83] is a non-linear latent
variable model for modeling continuous low-dimensional probability distributions
embedded in high-dimensional spaces. It is a probabilistic extension of the
self-organizing map that has an objective function and a new visualisation technique.
Two limitations of the basic GTM model are the computational effort required,
which grows exponentially with the intrinsic dimensionality of the density model, and
the initialisation of the parameters, which can lead the algorithm to a local minimum.
It is also a more complex algorithm.
Figure 2.5: U-matrix representation of the Self-Organizing Map for the iris data.
The GTM defines a non-linear, parametric mapping m(t; W ) from a q-dimensional
latent space to a d-dimensional data space x ∈ Rd, where normally q < d. The map-
ping is defined to be continuous and differentiable. m(t; W ) maps every point in the
latent space to a point in the data space. Since the latent space is q-dimensional,
these points will be confined to a q-dimensional manifold non-linearly embedded into
the d-dimensional data space. If we define a probability distribution over the latent
space, p(t), this will induce a corresponding probability distribution into the data
space. Strictly confined to the q-dimensional manifold, this distribution would be
singular, so it is convolved with an isotropic Gaussian noise distribution, given by
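\[
p(x \mid t, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left\{ -\frac{\beta}{2} \sum_{s=1}^{d} \left( x^{(s)} - m^{(s)}(t; W) \right)^2 \right\}
\tag{2.31}
\]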
where x is a point in the data space and β denotes the noise inverse variance. By
integrating out the latent variable, we get the probability distribution in the data
space expressed as a function of the parameters β and W ,
\[
p(x \mid W, \beta) = \int p(x \mid t, W, \beta)\, p(t)\, dt \tag{2.32}
\]
Choosing p(t) as a set of K equally weighted delta functions on a regular grid,
\[
p(t) = \frac{1}{K} \sum_{k=1}^{K} \delta(t - t_k) \tag{2.33}
\]
the integral in (2.32) becomes a sum,
\[
p(x \mid W, \beta) = \frac{1}{K} \sum_{k=1}^{K} p(x \mid t_k, W, \beta) \tag{2.34}
\]
Each delta function centre maps into the centre of a Gaussian which lies in the
manifold embedded in the data space, as illustrated in Figure 2.6. This algorithm
defines a constrained mixture of Gaussians [40, 43], since the centres of the mixture
components cannot move independently of each other, but all depend on the mapping
m(t; W). Moreover, all components of the mixture share the same variance,
and the mixing coefficients are all fixed at 1/K. Given a finite set of independent
and identically distributed (i.i.d.) data points, $\{x_i\}_{i=1}^{N}$, the log-likelihood function
of this model is maximized by means of the Expectation Maximisation algorithm
with respect to the parameters of the mixture, namely W and β. The form of the
mapping m(t; W) is defined as a generalized linear regression model
\[
m(t; W) = \phi(t)^T W
\]
where the elements of φ(t) consist of M fixed basis functions $\{\phi_i(t)\}_{i=1}^{M}$, and W is
an M × d matrix.
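As a rough illustration of (2.34), the following NumPy sketch evaluates the log-likelihood of a data set under the constrained mixture. The function name, the argument layout and the log-sum-exp stabilisation are assumptions made for this example; they are not taken from any GTM toolbox.

```python
import numpy as np

def gtm_log_likelihood(X, Phi, W, beta):
    """Log-likelihood of X under the constrained Gaussian mixture (2.34).

    X:    (N, d) data points
    Phi:  (K, M) basis activations, row k is phi(t_k)
    W:    (M, d) weight matrix
    beta: inverse noise variance shared by all components
    """
    N, d = X.shape
    K = Phi.shape[0]
    centres = Phi @ W                                   # m(t_k; W), shape (K, d)
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    # log p(x_i | t_k, W, beta) from (2.31)
    log_p = 0.5 * d * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * sq_dist
    # log[(1/K) sum_k p(x_i | t_k)], computed stably via log-sum-exp
    mx = log_p.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.exp(log_p - mx).sum(axis=1)) - np.log(K)
    return log_px.sum()
```

An EM implementation would monitor this quantity after each update of W and β to check that it does not decrease.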
There are several extensions of the initial algorithm like the Locally Linear Gen-
erative Topographic Mapping [86], Hierarchical GTM [11, 83], and a combination of
SOM and GTM [46].
2.4.4 Probabilistic Principal Surface
Principal surfaces (curves) [19, 33, 45] are nonlinear generalizations of principal
subspaces that formalize the notion of a low-dimensional manifold passing through
the middle of a dataset in high-dimensional space. The probabilistic principal surface
(PPS) [15, 16], a generalization of the generative topographic mapping (GTM), is a
parametric approximation of principal surfaces. The PPS generalizes the GTM
model by building a unified model and shares the same formulation as the GTM,
except for an oriented covariance structure for the nodes. Data points projecting near
a principal surface node have higher influences on that node than points projecting
far away from it. This is illustrated in Figure 2.7.

Figure 2.6: In order to formulate a tractable non-linear latent variable model, we
consider a prior distribution p(t) consisting of a superposition of delta functions,
located at the nodes of a regular grid in latent space. Each node $t_k$ is mapped to a
corresponding point $m(t_k; W)$ in data space, and forms the centre of a corresponding
Gaussian distribution. (Adapted from [80]).
Therefore, each node m(t; W), $t \in \{t_k\}_{k=1}^{K}$, has covariance
\[
\Sigma_t = \frac{\alpha}{\beta} \sum_{i=1}^{q} e_i(t)\, e_i^T(t)
+ \frac{d - \alpha q}{\beta (d - q)} \sum_{j=q+1}^{d} e_j(t)\, e_j^T(t),
\qquad 0 < \alpha < \frac{d}{q} \tag{2.35}
\]
where
• $\{e_i(t)\}_{i=1}^{q}$ is the set of orthonormal vectors tangential to the manifold at m(t; W),
• $\{e_j(t)\}_{j=q+1}^{d}$ is the set of orthonormal vectors orthogonal to the manifold at
m(t; W).
Figure 2.7: (a) Under a spherical Gaussian model of the GTM, points 1 and 2 have
equal influences on the centre node m(t). (b) The PPS has an oriented covariance matrix,
so point 1 is probabilistically closer to the centre node m(t) than point 2. (Adapted
from [78]).

The parameter α is a clamping factor and determines the orientation of the covariance
matrix; this orientation gives the self-consistency property, i.e. every point of
the curve is the average of the data points projecting onto that point of the curve
[33]. The PPS model reduces to GTM for α = 1 and to the manifold-aligned GTM
for α > 1:
\[
\Sigma_t =
\begin{cases}
\perp \text{ to the manifold} & 0 < \alpha < 1 \\
I_d \text{ or spherical} & \alpha = 1 \\
\parallel \text{ to the manifold} & 1 < \alpha < d/q
\end{cases}
\tag{2.36}
\]
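The oriented covariance (2.35) can be assembled directly from the tangent vectors. The sketch below is illustrative only: the function name is hypothetical, and it assumes the tangent vectors are supplied as orthonormal columns, so that the sum over the orthogonal directions equals the identity minus the tangential projector.

```python
import numpy as np

def pps_covariance(tangent, alpha, beta):
    """Covariance of one PPS node, following (2.35).

    tangent: (d, q) matrix whose orthonormal columns are e_1(t), ..., e_q(t)
    alpha:   clamping factor, 0 < alpha < d/q
    beta:    inverse variance
    """
    d, q = tangent.shape
    if not 0 < alpha < d / q:
        raise ValueError("alpha must satisfy 0 < alpha < d/q")
    P_par = tangent @ tangent.T        # sum_i e_i e_i^T, projector onto the tangent space
    P_perp = np.eye(d) - P_par         # sum_j e_j e_j^T over the orthogonal complement
    return (alpha / beta) * P_par + ((d - alpha * q) / (beta * (d - q))) * P_perp
```

For α = 1 this returns the spherical GTM covariance $I_d / \beta$, and for any admissible α the trace stays at d/β, so α only redistributes variance between the tangential and orthogonal directions.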
The EM algorithm can be used to estimate the PPS parameters W and β. The
clamping factor is fixed by the user and is assumed to be constant during the EM
iterations.
Chang proposes in [15] the use of a three dimensional latent space disposed as
a spherical manifold for the application of the PPS with nodes {tk}Kk=1 arranged
regularly on the surface of a sphere in R3 latent space. After a PPS model is fitted
to the data, the data themselves are projected into the latent space as points onto
a sphere (Figure 2.8).
The latent manifold coordinates $y_i$ of each data point $x_i$ are computed as in the
GTM,
\[
y_i = \sum_{k=1}^{K} r_{ik}\, t_k \tag{2.37}
\]
where $r_{ik}$ are the responsibilities defined as
\[
r_{ik} = p(t_k \mid x_i) = \frac{p(x_i \mid t_k)\, P(t_k)}{\sum_{k'=1}^{K} p(x_i \mid t_{k'})\, P(t_{k'})} \tag{2.38}
\]
Figure 2.8: (a) The spherical manifold in R3 latent space. (b) The spherical manifold
in R3 data space. (c) Projection of data points t onto the latent spherical manifold.
(Adapted from [78]).

For a spherical manifold, $\|t_k\| = 1$ for k = 1, . . . , K and $\sum_{k=1}^{K} r_{ik} = 1$ for i =
1, . . . , N; thus Equation 2.37 implies that these coordinates lie within a unit sphere,
i.e. $\|y_i\| \le 1$. In projecting the fitted PPS to this spherical manifold, the clusters
are much less overlapped for several experiments presented in [15].
2.4.5 Topology Representing Network
This method [58] is a straightforward combination of neural gas and Competitive
Hebbian Learning (CHL). The same description, however, also applies to the growing
neural gas model described later. Topology Representing Networks (TRN) are artificial neural
networks which use unsupervised algorithms to configure a topological representation
of input data. More properly, let $x_i$ be an input datapoint from a finite data set X:
$X = \{x_1, x_2, \cdots, x_N\}$, and $T = \{t_1, t_2, \cdots, t_K\}$ a topology representing network
composed of K neurons with reference vectors $m_k$, k = 1, ..., K; then the set $V_l$ of
all points in X which have $m_l$ as the closest reference vector is called the Voronoi region (or
Voronoi polygon) of $m_l$:
\[
V_l = \{ x_i \in X : l = \arg\min_{k \in \{1, \ldots, K\}} \| x_i - m_k \| \} \tag{2.39}
\]
Hence, the partition of the input manifold induced by the set of K reference
vectors of a given net T is called the Voronoi tessellation of input space: $V =
\{V_1, V_2, \cdots, V_K\}$ and
\[
X \subseteq \bigcup_{i=1}^{K} V_i \tag{2.40}
\]
Finally, by connecting all pairs mi, mj whose Voronoi polygons Vi, Vj share
an edge, we get the corresponding Delaunay triangulation. This simply means that
TRN forms connectivity structures which are topology preserving with respect to
input data (i.e. neighbouring inputs tend to be mapped into neighbouring neurons);
such structure evolves as samples xi are sequentially presented to the net, thus giving
the possibility of acquiring further information about the process under examination.
The steps of the TRN algorithm are then:
1. Assign initial values to the $m_k$, k = 1, ..., K, and set all the connections to zero,
$c_{jk} = 0$.
2. Select an input pattern $x_i$.
3. Calculate a rank $r_k(d) \in \{0, ..., K - 1\}$ for each prototype k, where a rank of 0
indicates the closest and a rank of K − 1 the most distant prototype to $x_i$.
4. Adapt the $m_k$ according to the neural gas algorithm
\[
m_k^{new} = m_k^{old} + \varepsilon(t) \exp\left(-r_k(d)/\rho(t)\right) (x_i - m_k^{old}) \tag{2.41}
\]
where
\[
\rho(t) = \rho(0) \left[\rho(T)/\rho(0)\right]^{t/T} \tag{2.42}
\]
and
\[
\varepsilon(t) = \varepsilon(0) \left[\varepsilon(T)/\varepsilon(0)\right]^{t/T} \tag{2.43}
\]
where t is the time step and T the total number of training steps, forcing more
local changes with time.
5. If it does not exist already, create a connection $c_{i_0 i_1}$ between the prototypes
ranked 0, $i_0$, and ranked 1, $i_1$, and set the age between $i_0$ and $i_1$ to 0 (“refresh”
the age).
6. Increase the age of all connections of $i_0$ by setting $age_{i_0 i_j} = age_{i_0 i_j} + 1$ for all
$i_j$ with $c_{i_0 i_j} > 0$.
7. Remove those connections of $i_0$ the age of which exceeds $T_{max}$ by setting $c_{i_0 i_j} =
0$ for all $i_j$ with $c_{i_0 i_j} > 0$ and $age_{i_0 i_j} > T_{max}$.
8. Increase the time parameter t = t + 1. If t < T continue with 2.
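Steps 3 to 7 for a single input can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name, the symmetric connection and age matrices, and all default parameter values (ε(0), ε(T), ρ(0), ρ(T), T_max) are assumptions chosen for the example.

```python
import numpy as np

def trn_step(x, m, C, age, t, T, eps=(0.5, 0.005), rho=(10.0, 0.01), t_max=20):
    """Apply steps 3-7 of the TRN algorithm for one input x.

    m:   (K, d) float array of reference vectors (updated in place)
    C:   (K, K) symmetric 0/1 connection matrix (updated in place)
    age: (K, K) symmetric matrix of connection ages (updated in place)
    eps, rho: (initial, final) values of the annealed schedules (assumed defaults)
    """
    frac = t / T
    eps_t = eps[0] * (eps[1] / eps[0]) ** frac      # (2.43)
    rho_t = rho[0] * (rho[1] / rho[0]) ** frac      # (2.42)

    # Step 3: rank prototypes by distance to x (rank 0 = closest)
    order = np.argsort(np.linalg.norm(m - x, axis=1))
    rank = np.empty(len(m), dtype=int)
    rank[order] = np.arange(len(m))

    # Step 4: neural gas update (2.41)
    m += eps_t * np.exp(-rank / rho_t)[:, None] * (x - m)

    # Step 5: connect the two closest prototypes and refresh the age
    i0, i1 = order[0], order[1]
    C[i0, i1] = C[i1, i0] = 1
    age[i0, i1] = age[i1, i0] = 0

    # Step 6: age every connection of the winner
    nbrs = C[i0] > 0
    age[i0, nbrs] += 1
    age[nbrs, i0] += 1

    # Step 7: drop the winner's connections that are too old
    old = nbrs & (age[i0] > t_max)
    C[i0, old] = 0
    C[old, i0] = 0
    return m, C, age
```

A training run would call this once per randomly selected input, with t running from 0 to T − 1, which corresponds to steps 2 and 8.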
The updating of the prototypes in TRN is then done solely by the NG clustering
technique. The CHL algorithm does not modify their positions, but just finds the
topology preserving map according to those positions, by connecting them depending
on their distance.
2.4.6 Growing Neural Gas
The Growing Neural Gas (GNG) algorithm [28] is an unsupervised clustering
algorithm that can be considered a variation of the previous Topology Representing
Network. It uses the clustering properties of the Neural Gas algorithm, and the
variation is in the topology preservation of the mapping. The induced Delaunay
triangulation is in charge of the topology preservation, but the neighbourhood in-
formation is maintained by a variant of competitive Hebbian learning (CHL) [56];
that is, for each input signal xi an edge is inserted between the two closest nodes,
measured by Euclidean distance. GNG is an adaptive algorithm in the sense that
if the input distribution slowly changes over time, GNG is able to adapt, that is to
move the nodes so as to cover the new distribution. Starting with two nodes the
algorithm constructs a graph in which nodes are considered neighbours if they are
connected by an edge.
The growing version means that it is not necessary to decide on the number of
nodes to use a priori since nodes are added incrementally during execution. The
increment in the number of nodes stops when a user defined performance criterion is
met or if a maximum network size has been reached.
The GNG algorithm assumes that each node k consists of the following:
• mk - a reference vector or node in input space.
• errork - a local accumulated error variable.
• A set of edges defining the topological neighbours of node k.
Each new unit is inserted near the unit which has accumulated most error locally
(see Figure 2.9). As in TRN, each edge has an age variable used to decide when to
remove old edges in order to keep the topology updated. The nodes are moved by
NG again, and CHL is responsible for generating the topology.
There is a similar algorithm presented in [18], where the above growth is applied
to K-Means creating the Growing K-Means algorithm, which they present as simpler
and faster than GNG.
Figure 2.9: An illustration of the GNG algorithm. (i) The state of the GNG
algorithm after 500 iterations (2 nodes): one node is located in the left most data cluster, the
other node is oscillating between the top most and bottom most data clusters. (ii)
After 1000 iterations, a third node has been inserted and the nodes now cover the
three data clusters. (iii) After 50000 iterations, 50 nodes are spread out over the
three data clusters matching the topology. [37]
2.4.7 Topology preserving Elastic net
The elastic net, introduced in [21], was firstly applied to the traveling salesman prob-
lem (i.e. visiting a number of cities in the most efficient way), using an optimization
approach.
The energy function is:
\[
E(\{m_k\}, Z) = -\alpha Z \sum_{i=1}^{N} \log \sum_{k=1}^{K}
\exp\left( -\frac{\|x_i - m_k\|^2}{2 Z^2} \right)
+ \frac{\beta}{2} \sum_{k=1}^{K-1} \|m_k - m_{k+1}\|^2 \tag{2.44}
\]
where $x_i$ are the cities, $m_k$ the neurons that represent the tour stops, α and β control
the influence of the two terms, and Z is a simulated-annealing term that decreases
to a pre-specified value.
For each Z, find the $m_k$ that approximate all tour points with the minimum length
mapping. Then decrease Z. Finally a solution close to the global minimum of the
traveling salesman combinatorial problem is obtained by iterative optimization.
Adapt $m_k$ according to gradient descent:
\[
\Delta m_k = -Z \frac{\partial E}{\partial m_k}
= \alpha \sum_{i=1}^{N} w_{ik} (x_i - m_k) + \beta Z (m_{k+1} + m_{k-1} - 2 m_k) \tag{2.45}
\]
Each neuron is subjected to two forces, one that attracts the neuron towards the
datapoints, and the elastic tension that pulls it towards the mid-point between
its neighbouring neurons. The weights are calculated as
\[
w_{ik} = \frac{\exp\left( -\frac{\|x_i - m_k\|^2}{2 Z^2} \right)}
{\sum_{l=1}^{K} \exp\left( -\frac{\|x_i - m_l\|^2}{2 Z^2} \right)} \tag{2.46}
\]
Repeat until Z is small enough or the $m_k$ represent a good enough approximation for
the $x_i$.
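One update of the form (2.45) might be sketched as below. The sketch assumes the Gaussian weighting $\exp(-\|x_i - m_k\|^2 / (2Z^2))$ of the original elastic net formulation and an open chain of neurons, so the two end neurons feel tension from a single neighbour only; the function name and the default values of α and β are hypothetical.

```python
import numpy as np

def elastic_net_step(X, m, Z, alpha=0.2, beta=2.0):
    """Return the neuron positions after one update of the form (2.45).

    X: (N, d) cities, m: (K, d) neurons with K >= 2, Z: current annealing scale.
    """
    sq = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)      # (N, K) squared distances
    logits = -sq / (2.0 * Z ** 2)
    logits -= logits.max(axis=1, keepdims=True)                  # stabilise the exponentials
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                            # weights w_ik, each row sums to 1

    # Attraction towards the cities: sum_i w_ik (x_i - m_k)
    attract = w.T @ X - w.sum(axis=0)[:, None] * m

    # Elastic tension: m_{k+1} + m_{k-1} - 2 m_k for interior neurons
    tension = np.zeros_like(m)
    tension[1:-1] = m[2:] + m[:-2] - 2.0 * m[1:-1]
    tension[0] = m[1] - m[0]          # end neurons have one neighbour (sum runs to K-1)
    tension[-1] = m[-2] - m[-1]

    return m + alpha * attract + beta * Z * tension
```

In use, this step would be repeated for a fixed Z until the neurons settle, after which Z is reduced and the process continues.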
Tereshko [82, 81] developed the topology preserving elastic net which combines
both lateral and synaptic interactions to obtain topologically ordered representa-
tions (receptive fields) of an external stimulus. The author affirms that existing
neural models that preserve the topology by utilizing lateral interactions, such as
the Kohonen map, and by utilizing synaptic interactions, such as cortical mapping
and elastic net, appear as limiting cases of this model.
2.5 Software tools
Most of the algorithms presented in this thesis can be found implemented in different
programming languages on the internet. In this section we discuss three toolboxes
that include many of those techniques. Other useful addresses are:
• Kmeans++ code is available online at www.stanford.edu/Darthur/Kmeanspptst.zip
• A Java implementation of Hard Competitive Learning, Neural Gas, TRN, and
GNG can be accessed at:
http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html,
where it is embedded as a Java applet into a Web page, but is also available to
download.
• The implementation of Probabilistic Principal Surfaces is available in the
IDEAL (formerly LANS) toolbox at http://www.mathworks.com/matlabcentral.
2.5.1 SOM Toolbox
The SOM Toolbox [62] is a function package for Matlab implementing the Self-
Organizing Map (SOM) algorithm and more. You can train SOM with different
network topologies and learning parameters, compute different error and quality
measures for the SOM, visualize data projections using U-matrices, component
planes, cluster colour coding and colour linking between the SOM and other vi-
sualization methods, and do correlation and cluster analysis with SOM. The SOM
Toolbox also features other data analysis methods related to vector quantization,
clustering, dimension reduction, and proximity preserving projections, e.g., data pre-
processing tools, K-Means, K-Nearest Neighbor classifier and LVQ (Learning Vector
Quantizer), agglomerative hierarchical clustering and dendrograms, principal com-
ponent analysis (PCA), Sammon’s projection, and Curvilinear Component Analysis
(CCA).
2.5.2 GTM Toolbox
This implementation of the GTM [6] runs under Matlab and includes all the neces-
sary machinery to create and experiment with GTM, including data visualisation;
it also includes a demo. It comes as a set of Matlab functions and scripts, together
with two short routines in C, which may be compiled into mex-files which can be
called directly from Matlab, provided you have a C-compiler supported by Matlab.
However, the toolbox can also be used as a pure Matlab implementation. In terms
of documentation, there is a User’s Guide in postscript format, which contains a
reference section for all the functions and scripts in the toolbox. The reference in-
formation is also available as Matlab help comments and as a set of html-files, which
can be viewed with a browser. Note that the documentation does not cover any of
the underlying theory of the GTM, for which you are referred to the papers on the
GTM.
2.5.3 Netlab Toolbox
The Netlab toolbox [7] has the advantage of having an accompanying text book
[61] published by Springer in their series Advances in Pattern Recognition. It is
widely used and many authors use Netlab as the basis for the programming of
new algorithms so that this toolbox is needed to use the new algorithms. Netlab is
designed to provide the central tools necessary for the simulation of theoretically well
founded neural network algorithms and related models for use in teaching, research
and applications development.
It consists of a toolbox of Matlab functions and scripts based on the approach
and techniques described in [12], but also including more recent developments in
the field. The functions come with Matlab on-line help, and further explanation
is available via HTML files. The software has been written by Ian Nabney and
Christopher Bishop.
The Netlab library includes software implementations of a wide range of data
analysis techniques, many of which are not yet available in standard neural network
simulation packages.
2.5.4 ICALAB
ICALAB toolboxes are presented with an easy-to-use interface that allows for easy
application of many different techniques, including preprocessing and postprocessing
tools. ICALAB for Signal Processing and ICALAB for Image Processing [25] are two
independent demo packages for MATLAB that implement a number of algorithms
for ICA employing higher order statistics, blind source separation employing second
order statistics and linear prediction, and blind signal extraction employing various
methods.
This package can also be used for multidimensional independent component
analysis and non independent blind source separation.
Preprocessing tools include principal component analysis, pre-whitening, high
pass filtering, low pass filtering, and subband filters.
Postprocessing tools include deflation and reconstruction of original raw data by
removing undesirable components, noise or artifacts.
The algorithms can also be used for other techniques such as Blind Source Separation,
Factor Analysis and any other possible matrix factorization of the form X = HS + N.
Chapter 3
The Topographic Product of Experts
3.1 Topographic Product of Experts
This is the first of the family of algorithms within the topology preserving maps
category presented in this thesis that share a common property: all of them are
based on the Generative Topographic Mapping model. The general structure is
similar to the GTM, so that the underlying structure of the data can be represented
by K latent points, $t_1, t_2, \ldots, t_K$. To allow local and non-linear modeling, those
latent points are mapped through a set of M basis functions, $f_1(), f_2(), \ldots, f_M()$.
This gives a matrix Φ where $\phi_{km} = f_m(t_k)$. Thus each row of Φ is the response
of the basis functions to one latent point, or alternatively each column of Φ is the
response of one of the basis functions to the set of latent points. Typically these
functions are Gaussians centered in the latent space. The outputs of these functions
are then mapped by a set of weights, W, into data space. W is M × d and is the
sole parameter which is changed during training. We use $w_m$ to represent the mth row of
W and $\Phi_k$ to represent the row vector of the mapping of the kth latent point. Thus
each latent point is mapped to a point in data space, $m_k = (\Phi_k W)^T$.
3.1.1 Product of Experts
Hinton introduced the Product of Experts (PoE) in [35] with
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K} p(x_i \mid k) \tag{3.1}
\]
where Θ is the set of current parameters in the model. The base model using a
Gaussian distribution is
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K} \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left( -\frac{\beta}{2} \|x_i - m_k\|^2 \right) \tag{3.2}
\]
where $m_k$ are the neurons in data space, β the inverse variance, and $x_i$ the
datapoints. The model was generalised into “products of Gaussian pancakes” in
[88]. The Gaussian case gives
\[
p(x_i \mid \Theta) \propto \exp\left\{ -\frac{1}{2} \sum_{k=1}^{K} (x_i - m_k)^T C_k^{-1} (x_i - m_k) \right\} \tag{3.3}
\]
where $C_k$ is the covariance matrix of the kth Gaussian expert. The covariance matrix
of the product can easily be shown to be related to the harmonic average of the
individual experts, i.e.
\[
C_\Pi^{-1} = \sum_{k=1}^{K} C_k^{-1} \in \mathbb{R}^{d \times d} \tag{3.4}
\]
The mean of the product of experts can be shown to be
\[
\mu_\Pi = C_\Pi \left( \sum_{k=1}^{K} C_k^{-1} m_k \right) \in \mathbb{R}^{d} \tag{3.5}
\]
where a special case is $C_k = \frac{1}{\beta} I$.
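A short NumPy sketch of (3.4) and (3.5) is given below; the function name is an assumption for illustration. It combines the precisions of the experts and uses them to weight the expert means.

```python
import numpy as np

def product_of_gaussians(means, covs):
    """Mean and covariance of the normalised product of K Gaussian experts.

    means: sequence of K vectors m_k of length d
    covs:  sequence of K (d, d) covariance matrices C_k
    """
    precisions = [np.linalg.inv(C) for C in covs]
    C_prod = np.linalg.inv(sum(precisions))                               # (3.4)
    mu_prod = C_prod @ sum(P @ m for P, m in zip(precisions, means))      # (3.5)
    return mu_prod, C_prod
```

In the special case where every expert has covariance I/β, the product has covariance I/(Kβ) and its mean is the plain average of the expert means, which shows how a product sharpens the density as experts are added.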
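To fit this model to the data, the cost function is defined as the negative logarithm
of the probabilities of the data so that
\[
J = \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{\beta}{2} \|x_i - m_k\|^2 \tag{3.6}
\]
from which the learning rule for the weights is derived as
\[
\Delta w_{mj} \propto -\frac{\partial J}{\partial w_{mj}}
= \sum_{k=1}^{K} \beta \left( x_i^{(j)} - m_k^{(j)} \right) \frac{\partial m_k^{(j)}}{\partial w_{mj}}
= \sum_{k=1}^{K} \beta \left( x_i^{(j)} - m_k^{(j)} \right) \phi_{km}
\]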
3.1.2 The Topographic Product of Experts
Fyfe [30] introduced responsibilities into the Gaussians:
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K} \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left( -\frac{\beta}{2} \|x_i - m_k\|^2 r_{ik} \right) \tag{3.7}
\]
where rik is the responsibility of the kth expert for the data point, xi. Thus all
the experts are acting in concert to create the data points but some will take more
responsibility than others. In contrast to the following algorithms presented in
Chapters 4 and 5, in ToPoE the responsibilities are calculated not only at the end,
as part of the visualisation step, but repeatedly on every iteration; this makes the
responsibilities more crucial in this model. The initial situation is a product of
experts situation, where all nodes have responsibilities for all datapoints, in contrast
to the mixture of experts where the responsibility regions are split between the
experts. The situation can progress in different ways however, always depending
on the modelling of the data, so that the final situation tends to be a mixture of
local products of experts. To prevent a situation where a data point has no expert
associated, if no expert has responsibility for a data point, they all are given equal
responsibility for that data point.
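This model may be written as
\[
p(x_i \mid \Theta) \propto \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left( -\frac{\beta}{2} \sum_{k=1}^{K} \left( \|x_i - m_k\|^2 r_{ik} \right) \right) \tag{3.8}
\]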
The objective is to maximise the likelihood of the data set X = {xi : i =
1, · · · , N} under this model. The ToPoE learning rule (3.10) is derived from the
minimisation of − log(p(xi|Θ)) with respect to a set of parameters which generate
the mk.
To change W in online learning, a data point is randomly selected, say xi. The
calculation of the current responsibility of the kth latent point for this data point is,
\[
r_{ik} = \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)} \tag{3.9}
\]
where $d_{ik} = \|x_i - m_k\|$, the Euclidean distance between the ith data point and the
projection of the kth latent point in data space (through the basis functions and then
multiplied by W). γ is known as the width of the responsibilities. If no prototypes
are close to the data point (the denominator of (3.9) is zero), we set $r_{ik} = \frac{1}{K}, \forall k$.
To maximise (3.7) so that the data is most likely under this model, the − log()
of that probability is minimised. Define $m_k^{(d)} = \sum_{m=1}^{M} \phi_{km} w_{md}$, i.e. $m_k^{(d)}$ is the
projection of the kth latent point on the dth dimension in data space. Similarly let
$x_i^{(d)}$ be the dth coordinate of $x_i$. These are used in the update rule
\[
\Delta_i w_{md} = \sum_{k=1}^{K} \eta\, \phi_{km} \left( x_i^{(d)} - m_k^{(d)} \right) r_{ik} \tag{3.10}
\]
where $\Delta_i$ signifies the change due to the presentation of the ith data point, $x_i$, so
that the changes due to each latent point’s response to the data points are summed,
and η is the learning rate. Note that the Φ matrix is not changed during training
at all, and that β has been integrated in the responsibilities.
Since $-\log(p(x_i \mid \Theta)) \propto \sum_{k=1}^{K} \|x_i - m_k\|^2 r_{ik}$, the maximisation of that
probability is equivalent to minimising the weighted mean square error.
The algorithm steps are then
1. Initialise the W weights randomly and spread the centres of the M basis
functions uniformly in latent space.
2. Initialise the K latent points uniformly in latent space.
(a) count=0
(b) Calculate the projection of the latent points to data space. This gives
the K prototypes, $m_k$.
(c) Select a random data point $x_i$ and calculate $d_{ik} = \|x_i - m_k\|$.
(d) Calculate responsibilities that the kth latent point has for the ith data
point with (3.9).
(e) Recalculate W using (3.10).
(f) If count<MAXCOUNT, count= count +1 and return to (2b).
If we wish to use the mapping for visualisation, we must map data points into
latent space using the responsibilities; the new data point is placed at $y_i$ where
\[
y_i = \sum_{k=1}^{K} r_{ik}\, t_k \tag{3.11}
\]
where $t_k$ is the position of the kth latent point in latent space.
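The online procedure and the projection (3.11) can be put together in a compact NumPy sketch. All function names, the Gaussian basis width, the scale of the random initial weights and the fixed seed are assumptions made for this illustration; the update line is the matrix form of (3.10).

```python
import numpy as np

def gaussian_basis(T, centres, width):
    """Phi[k, m] = f_m(t_k) for Gaussian basis functions centred in latent space."""
    sq = ((T[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def responsibilities(x, m, gamma):
    """r_ik of (3.9); equal shares if the denominator underflows to zero."""
    e = np.exp(-gamma * ((x - m) ** 2).sum(axis=1))
    s = e.sum()
    return e / s if s > 0 else np.full(len(m), 1.0 / len(m))

def train_topoe(X, Phi, gamma=20.0, eta=0.1, max_count=10000, seed=0):
    """Online ToPoE training; returns the (M, d) weight matrix W."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((Phi.shape[1], X.shape[1]))   # step 1 (assumed scale)
    for _ in range(max_count):
        m = Phi @ W                                   # (b) prototypes m_k
        x = X[rng.integers(len(X))]                   # (c) random data point
        r = responsibilities(x, m, gamma)             # (d) responsibilities (3.9)
        W += eta * Phi.T @ (r[:, None] * (x - m))     # (e) update rule (3.10)
    return W

def project(X, Phi, W, T, gamma):
    """Latent coordinates y_i = sum_k r_ik t_k of (3.11) for every row of X."""
    m = Phi @ W
    return np.array([responsibilities(x, m, gamma) @ T for x in X])
```

Because each $y_i$ is a convex combination of the latent points, the projections always fall inside the convex hull of the latent grid.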
3.1.3 Comparison with the GTM
The Generative Topographic Mapping (GTM) [8] is a mixture of experts model
which treats the data as having been generated by a set of latent points. These
K latent points are also mapped through a set of M basis functions and a set of
adjustable weights to the data space. The parameters of the combined mapping are
adjusted to make the data as likely as possible under this mapping. The GTM is
a probabilistic formulation so that if we define m = ΦW = Φ(t)W , where t is the
vector of latent points, the probability of the data is determined by the position of
the projections of the latent points in data space and so we must adjust this position
to increase the likelihood of the data. More formally, let
\[
m_k = \Phi(t_k) W \tag{3.12}
\]
be the projections of the latent points into the feature space. Then, if we assume
that each of the latent points has equal probability,
\[
p(x_i) = \sum_{k=1}^{K} P(k)\, p(x_i \mid k)
= \sum_{k=1}^{K} \frac{1}{K} \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left( -\frac{\beta}{2} \|x_i - m_k\|^2 \right) \tag{3.13}
\]
where d is the dimensionality of the data space, i.e. all the data is assumed to
be noisy versions of the mapping of the latent points. This equation should be
compared with (3.7) and (3.8).
In the GTM, the parameters W and β are updated using the EM algorithm
though the authors do state that they could use gradient ascent. Indeed, in the
ToPoE, the calculation of the responsibilities may be thought of as being a partial
E-step while the weight update rule is a partial M-step. The GTM has been described
as a “principled alternative to the SOM” however it may be criticised on two related
issues:
1. it is optimising the parameters with respect to each latent point independently.
Clearly the latent points interact.
2. using this criterion and optimising the parameters with respect to each latent
point individually does not necessarily give us a globally optimal mapping from
the latent space to the data space.
The ToPoE will overcome some of these shortcomings in that all data points are
acting together. Specifically if no latent point accepts responsibility for a data
point, the responsibility is shared equally amongst all the latent points.
The GTM, however, does have the advantage that it can optimise with respect
to β as well as W . However note that, in (3.7) and (3.8), the variance of each expert
is dependent on its distance from the current data point via the hyper-parameter,
γ. Thus we may define
\[
(\beta_k)|_{x = x_i} = \beta r_{ik} = \beta \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)} \tag{3.14}
\]
Therefore the responsibilities are adapting the width of each expert locally dependent
on both the expert’s current projection into data space and the data point for which
responsibility must be taken. Initially, $r_{ik} = \frac{1}{K}, \forall k, i$ and so we have the standard
product of experts. However during training, the responsibilities are redefined so
that individual latent points take more responsibility for specific data points. We
may view this as the model softening from a true product of experts to something
between that and a mixture of experts.
A model based on products of experts has some advantages and disadvantages.
The major disadvantage is that no efficient EM algorithm exists for optimising pa-
rameters. [35] suggests using Gibbs sampling but even with the method discussed
in that paper, the simulation times are excessive. Thus Fyfe [30] opted for gradient
descent as the parameter optimisation method.
The major advantage which a product of experts method has is that it is possible
to get very much sharper probability density functions with a product rather than
a sum of experts.
In the next sections we extend the ToPoE algorithm by including new kernels in
the responsibility calculation, analysing the local variance of the model, the projec-
tion of datapoints to the latent space and convergence properties, and investigating
the topology preservation through the magnification factors. We finally apply the
algorithm to several datasets in the simulation section.
3.2 Responsibility Estimation
Even though Gaussian kernels are the most often used, there are various other
possible kernels as shown in Table 3.1.
Kernel          K(u)
Uniform         (1/2) I(|u| ≤ 1)
Triangle        (1 − |u|) I(|u| ≤ 1)
Epanechnikov    (3/4) (1 − u^2) I(|u| ≤ 1)
Quartic         (15/16) (1 − u^2)^2 I(|u| ≤ 1)
Triweight       (35/32) (1 − u^2)^3 I(|u| ≤ 1)
Gaussian        (1/√(2π)) exp(−u^2/2)
Cosinus         (π/4) cos(πu/2) I(|u| ≤ 1)
Tri-cube        (1 − u^3)^3 I(|u| ≤ 1)
Table 3.1: Kernel functions for kernel estimation, where I is an indicator function.

If $y_i$ is the point in latent space corresponding to $x_i$, we have
\[
y_i = \sum_{k=1}^{K} r_{ik}\, t_k \in L \tag{3.15}
\]
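where $t_k$ is the coordinate of the kth latent point and
\[
r_{ik} = \frac{\exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} \tag{3.16}
\]
is determined in data space. (3.15) recalls the Nadaraya-Watson kernel estimator
\[
\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)} \tag{3.17}
\]
Table 3.1 shows alternative kernels that can be used in this situation. Two that
we use are the Epanechnikov quadratic kernel [34]
\[
D_\lambda(i, k) = \frac{d_{ik}^2}{\lambda}
\quad \text{and} \quad
C_\lambda(i, k) =
\begin{cases}
\frac{3}{4} \left( 1 - D_\lambda(i, k)^2 \right) & \text{if } \|D_\lambda(i, k)\| < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.18}
\]
and the Tri-cube function with
\[
C_\lambda(i, k) =
\begin{cases}
\left( 1 - D_\lambda(i, k)^3 \right)^3 & \text{if } \|D_\lambda(i, k)\| < 1 \\
0 & \text{otherwise}
\end{cases}
\tag{3.19}
\]
both of which have compact support with
\[
r_{ik} = \frac{C_\lambda(i, k)}{\sum_{l=1}^{K} C_\lambda(i, l)}
\quad \text{or} \quad
y_i = \frac{\sum_{k=1}^{K} t_k\, C_\lambda(i, k)}{\sum_{k=1}^{K} C_\lambda(i, k)}
\]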
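The compact-support responsibilities can be sketched as follows. The function names are hypothetical, and the fallback to equal responsibilities when every kernel value is zero is carried over from the rule stated for the Gaussian case, which is an assumption for these kernels.

```python
import numpy as np

def epanechnikov(D):
    """Kernel of (3.18) applied to the scaled squared distance D."""
    return np.where(np.abs(D) < 1, 0.75 * (1.0 - D ** 2), 0.0)

def tricube(D):
    """Kernel of (3.19) applied to the scaled squared distance D."""
    return np.where(np.abs(D) < 1, (1.0 - np.abs(D) ** 3) ** 3, 0.0)

def kernel_responsibilities(x, m, lam, kernel=epanechnikov):
    """Responsibilities r_ik = C(i, k) / sum_l C(i, l) for one data point x.

    m: (K, d) prototypes, lam: the kernel width lambda.
    """
    D = ((x - m) ** 2).sum(axis=1) / lam      # D_lambda(i, k) = d_ik^2 / lambda
    C = kernel(D)
    s = C.sum()
    # Assumed fallback: share responsibility equally if no prototype is in range
    return C / s if s > 0 else np.full(len(m), 1.0 / len(m))
```

Unlike the Gaussian kernel, prototypes with $D_\lambda \ge 1$ receive exactly zero responsibility, so each data point only influences prototypes within a finite radius.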
3.3 The Actual Variance
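Let us write the ToPoE model as
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K} \exp\left( -\beta \|x_i - m_k\|^2 r_{ik} \right) \tag{3.20}
\]
Inserting the Gaussian responsibilities gives us
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K}
\exp\left( -\|x_i - m_k\|^2 \frac{\beta \exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} \right) \tag{3.21}
\]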
If we write
\[
\alpha_i = \frac{1}{\beta} \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \tag{3.22}
\]
so that $\alpha_i$ is dependent only on the position of the ith data point, then we may
write (3.21) as
\[
p(x_i \mid \Theta) \propto \prod_{k=1}^{K}
\exp\left( -\frac{\|x_i - m_k\|^2}{\alpha_i \exp(\gamma \|x_i - m_k\|^2)} \right) \tag{3.23}
\]
so that we may see that the local variance of the model around the ith data point
due to the kth Gaussian expert is
\[
\sigma_{ik} = \alpha_i \exp(\gamma \|x_i - m_k\|^2) \tag{3.24}
\]
Note that this means that experts whose representation in data space, mk, is far
from the current data point are estimating a large variance whereas experts whose
representation is close to the data point estimate a much smaller variance locally
and so are able to put far more of their probability mass around the data point.
Therefore using (3.4) we see that the local variance from the whole model at the
ith data point is
\[
\sigma_i = \frac{1}{\sum_{k=1}^{K} \frac{1}{\sigma_{ik}}}
= \frac{\alpha_i}{\sum_{k=1}^{K} \frac{1}{\exp(\gamma \|x_i - m_k\|^2)}} \tag{3.25}
\]
Further we may predict where the model will put its estimate of the maximum
likelihood position of the ith data point, i.e. its estimate of where a denoised estimate
lies, as
\[
\mu_i = \sigma_i \left( \sum_{k=1}^{K} \frac{1}{\sigma_{ik}} m_k \right) \tag{3.26}
\]
Again we note that points far from the data point will have little effect on this
position since they are estimating large variance while points closer will contribute
greatly to this estimate since their estimate of the local variance is much smaller.
Examination of (3.26) shows that it leads to an identical projection of data points
as the previously used value of $\sum_{k=1}^{K} r_{ik} m_k$, which was intuitively satisfying but now
has a more complete rationale.
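The equivalence can be checked in a few lines by substituting (3.24) and (3.25) into (3.26), writing $d_{ik} = \|x_i - m_k\|$; the factor $\alpha_i$ cancels and the Gaussian responsibilities (3.16) reappear:

```latex
\begin{align*}
\frac{1}{\sigma_{ik}} &= \frac{\exp(-\gamma d_{ik}^2)}{\alpha_i},
\qquad
\sigma_i = \frac{\alpha_i}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)}, \\
\mu_i &= \sigma_i \sum_{k=1}^{K} \frac{m_k}{\sigma_{ik}}
      = \frac{\sum_{k=1}^{K} \exp(-\gamma d_{ik}^2)\, m_k}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)}
      = \sum_{k=1}^{K} r_{ik}\, m_k .
\end{align*}
```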
3.3.1 Simulations
To illustrate the above, we generated 60 two dimensional data points, (x1, x2), from
the function x2 = x1 + 1.25 sin(x1) + µ where µ is noise from a uniform distribution.
For the first and the last 20 data points we draw µ from the uniform distribution in
[0,0.3] while for the central 20 data points we draw µ from the uniform distribution
in [0,2] (see Figure 3.1). We show in Figure 3.1 the result of a simulation in which we
have 20 latent points deemed to be equally spaced in a one dimensional latent space,
passed through 5 Gaussian basis functions and then mapped to the data space by the
linear mapping W which is the only parameter we adjust. We use 10000 iterations
of the learning rule (randomly sampling with replacement from the data set) with
γ = 20, η = 0.1. The final placement of the projections of the latent points is shown
by the asterisks in the figure and we clearly see that the one dimensional nature of
the data has been identified. Also, the prototypes are placed along this manifold
in the order in which they appear in the latent space showing that a topographic
projection has been created.

Figure 3.1: Data and projections of latent points. The data are shown by red ’+’ marks.
The projections of the latent points with γ = 20 are shown as blue ’*’s.
In Figure 3.2, we show with asterisks the positions which the model estimates that
the data has come from. The model identifies a continuous distribution following
the manifold previously found. The datapoints are closer in the areas with higher
density in the original dataset. Finally in Figure 3.3, we show the responsibilities
adopted by the model for the data points. We see that the latent points in the centre
are sharing responsibility for the data points far more widely than those at either
end.
We see from these figures that the final responsibilities of the latent points for
the data points can be very narrow: often one latent point assumes much higher
responsibility than all the rest and typically non-zero probability is only assigned
[Figure 3.2: Data and estimated positions from the ToPoE model. The data and the estimated positions of the data from the model with γ = 20.]
[Figure 3.3: The responsibilities of the 20 latent points for generating the 60 data points with γ = 20. Axes: latent points, data points, responsibilities.]
[Figure 3.4: Data and centres with γ = 2. The data are shown by red '+' marks. The projections of the latent points when the ToPoE model is used with γ = 2 are shown as blue '*'s.]
to 2 or 3 latent points in the low noise region and no more than 4 or 5 in the high
noise region. Another way of stating this is to say that many of the latent experts
are simply saying “I don’t know” when confronted with a data point. The ToPoE
method enables latent points which are far from the data to simply state that, as
far as they are concerned, the data points could have a high probability. The actual
probability of the data point is calculated from the experts whose projections are
close to the data point. This is rather more like a mixture of experts than a product
of experts; however, the final probability is calculated as a product, and so, while the
training does move the model closer to a mixture, it is still firmly in the product of
experts camp.
However, we may change the projections by changing the value of γ. For Figures
3.4, 3.5 and 3.6, we use γ = 2. We see that there is a tendency for the responsibilities
to be more widely shared and so the map is drawn towards the centre. Also there
is less of an ability to denoise the data. The estimates of the data points’ positions
are pulled to their true positions rather than to the underlying manifold. Since
the responsibilities are inversely proportional to the variances, we may equally well
state that Figure 3.6 illustrates the fact that noisy regions of the data manifold will
exhibit a greater variance than less noisy regions.
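
The effect of γ on how widely the responsibilities are shared can be seen in a small numerical sketch. The prototype positions and the data point below are hypothetical values chosen for illustration, and a one dimensional data space is assumed to keep the code short.

```python
import math

def responsibilities(x, protos, gamma):
    """Normalised exp(-gamma * (x - m_k)^2) over all prototypes."""
    e = [math.exp(-gamma * (x - m) ** 2) for m in protos]
    s = sum(e)
    return [v / s for v in e]

def entropy(r):
    """Shannon entropy of a responsibility vector; larger means more widely shared."""
    return -sum(p * math.log(p) for p in r if p > 0.0)

protos = [0.2 * k for k in range(20)]   # hypothetical prototype positions
x = 1.93                                # hypothetical data point

sharp = responsibilities(x, protos, gamma=20.0)
broad = responsibilities(x, protos, gamma=2.0)

if __name__ == "__main__":
    for name, r in (("gamma=20", sharp), ("gamma=2", broad)):
        active = sum(1 for p in r if p > 0.01)
        print(name, "prototypes with r > 0.01:", active, "entropy:", round(entropy(r), 3))
```

With the larger γ only a handful of prototypes receive non-negligible responsibility, whereas the smaller γ spreads it over many more, which is the behaviour described for Figures 3.3 and 3.6.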
[Figure 3.5: Data and estimated positions. The data in red '+' marks and the estimated positions of the data from the model with γ = 2 in blue '*'s.]
[Figure 3.6: The responsibilities of the 20 latent points for generating the 60 data points with γ = 2. Axes: latent points, data points, responsibilities.]
3.4 Cost functions and Convergence
We note that since m = ΦW,
$$\frac{\partial m}{\partial t} = \Phi \frac{\partial W}{\partial t} \tag{3.27}$$
In particular, for a given latent point,
$$\frac{\partial m_k}{\partial t} = \sum_{m=1}^{M} \phi_{km} \frac{\partial w_m^T}{\partial t} \tag{3.28}$$
where wm represents the mth row of W i.e. the weights from the mth basis function.
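Since $\phi_{km} > 0, \forall m, k$ (footnote 1),
$$\frac{\partial m_k}{\partial t} = 0 \iff \frac{\partial w_m^T}{\partial t} = 0, \ \forall m \tag{3.29}$$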
We will consider a cost function
$$J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \| x_i - m_k \|^2 \, r_{ik} \tag{3.30}$$
and show that the ToPoE learning rule can be considered to be approximately
minimising this cost function.
3.4.1 A simplified model
Consider a simplified model in which at each presentation of the data, we find which
projection of the latent points is closest to the data point and update only the
weights associated with the error due to this latent point’s projection. Let k∗ be the
latent point whose projection is closest to the data point, xi; then
$$\Delta_i w_{md} = \eta \, \phi_{k^* m} \left( x_i^{(d)} - m_{k^*}^{(d)} \right) \tag{3.31}$$
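Let $\Lambda_k = \{ x_i : k = \arg\min_l \| x_i - m_l \| \}$. Then $J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \| x_i - m_k \|^2 r_{ik}$
becomes
$$J_1 = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \| x_i - m_k \|^2 \, I(i, k) \tag{3.32}$$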
[Footnote 1: In practice, some, but not all, φkm may be 0 because of the representation of floating point numbers in the computer, but this does not change the basis of the argument.]
[Figure 3.7: The cut down ToPoE in which the projections are adjusted due to the effect on a single latent point. Alternating symbols and colors show the areas affected by different latent points.]
where I(i, k) is the indicator function which returns 1 if mk is the closest projection
to xi, and 0 otherwise. Thus
$$J_1 = \frac{1}{2} \sum_{k=1}^{K} \sum_{x_i \in \Lambda_k} \| x_i - m_k \|^2 \tag{3.33}$$
and $\partial J_1 / \partial m_k = 0 \iff m_k = \langle x \rangle_k$, where $\langle x \rangle_k$ is the average value of $x_i$ taken over $\Lambda_k$. In
the limit, as $i \to \infty$, the value of $m_k \to E(x_i : x_i \in \Lambda_k)$. We illustrate this model
on artificial data in Figure 3.7 where we have shown the data points in consecutive
Λk with alternating symbols.
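
The stationary condition of J1 can be illustrated with a short sketch. It works directly on the prototypes and ignores the constraint m = ΦW, which is a simplifying assumption; the function names are ours and the data are invented for the example.

```python
def nearest(x, protos):
    """Index k* of the prototype closest to x."""
    return min(
        range(len(protos)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, protos[k])),
    )

def regions(data, protos):
    """Lambda_k: the data points whose closest prototype is m_k."""
    lam = [[] for _ in protos]
    for x in data:
        lam[nearest(x, protos)].append(x)
    return lam

def j1(data, protos):
    """J1 = 1/2 * sum over data of the squared distance to the closest prototype."""
    return 0.5 * sum(
        sum((a - b) ** 2 for a, b in zip(x, protos[nearest(x, protos)])) for x in data
    )

def stationary_prototypes(data, protos):
    """Move each m_k to the mean of Lambda_k; prototypes with an empty region stay put."""
    out = []
    for m, pts in zip(protos, regions(data, protos)):
        if pts:
            out.append(tuple(sum(c) / len(pts) for c in zip(*pts)))
        else:
            out.append(tuple(m))
    return out

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
    protos = [(1.0, 1.0), (3.0, 3.0)]
    print(j1(data, protos), stationary_prototypes(data, protos))
```

Moving every prototype to the mean of its own region never increases J1, which is the familiar K-Means argument.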
3.4.2 The full model
Returning to the full model, we have the cost function
$$J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \| x_i - m_k \|^2 \, r_{ik} \tag{3.34}$$
so that
$$\frac{\partial J}{\partial m_k} = -\sum_{i=1}^{N} (x_i - m_k) r_{ik} + \frac{1}{2} \sum_{i=1}^{N} \| x_i - m_k \|^2 \frac{\partial r_{ik}}{\partial m_k} \tag{3.35}$$
Now, since
$$r_{ik} = \frac{\exp(-\gamma \| x_i - m_k \|^2)}{\sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2)} \tag{3.36}$$
then
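$$\begin{aligned}
\frac{\partial r_{ik}}{\partial m_k} = {} & \frac{\sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \, \exp(-\gamma \| x_i - m_k \|^2) \, 2\gamma \| x_i - m_k \|}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} \\
& - \frac{\exp(-\gamma \| x_i - m_k \|^2) \, \exp(-\gamma \| x_i - m_k \|^2) \, 2\gamma \| x_i - m_k \|}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2}
\end{aligned} \tag{3.37}$$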
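Now if $r_{ik}$ is large, the first term in the numerator is approximately equal to the
second term and so $\partial r_{ik} / \partial m_k \approx 0$. If $r_{ik}$ is small, the second term in the numerator is
approximately 0 and
$$\frac{\partial r_{ik}}{\partial m_k} = \frac{\sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \, \exp(-\gamma \| x_i - m_k \|^2) \, 2\gamma \| x_i - m_k \|}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} = \frac{\exp(-\gamma \| x_i - m_k \|^2) \, 2\gamma \| x_i - m_k \|}{\sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2)} \approx 0$$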
Thus the ToPoE learning rule can be derived as an approximation to the minimisa-
tion of the cost function (3.30). At convergence,
$$\sum_{i=1}^{N} (x_i - m_k) r_{ik} = 0 \tag{3.38}$$
and so
$$m_k = \frac{\sum_{i=1}^{N} x_i r_{ik}}{\sum_{i=1}^{N} r_{ik}} \tag{3.39}$$
a weighted average of the data, rather like a Parzen window based approximation
of the mean of the data.
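
Conditions (3.38) and (3.39) can be checked numerically by iterating the weighted average until it stops changing. The sketch below assumes a one dimensional data space and treats the prototypes as free parameters, ignoring the constraint m = ΦW; the batch fixed-point iteration is used here only as a convenient way of reaching a stationary point and is not the online learning rule of the text.

```python
import math

def resp(x, protos, gamma):
    """Responsibilities r_ik of (3.36) for a scalar data point x."""
    d2 = [(x - m) ** 2 for m in protos]
    lo = min(d2)  # shift for numerical stability; it cancels in the ratio
    e = [math.exp(-gamma * (v - lo)) for v in d2]
    s = sum(e)
    return [v / s for v in e]

def fixed_point_step(data, protos, gamma):
    """One application of m_k <- sum_i x_i r_ik / sum_i r_ik, as in (3.39)."""
    R = [resp(x, protos, gamma) for x in data]
    new = []
    for k in range(len(protos)):
        den = sum(r[k] for r in R)
        num = sum(r[k] * x for r, x in zip(R, data))
        new.append(num / den if den > 0.0 else protos[k])
    return new

def residual(data, protos, gamma):
    """Left-hand side of (3.38) for every prototype."""
    R = [resp(x, protos, gamma) for x in data]
    return [
        sum((x - protos[k]) * r[k] for r, x in zip(R, data))
        for k in range(len(protos))
    ]

if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]   # invented data
    protos = [0.3, 0.9]                     # invented starting prototypes
    for _ in range(200):
        protos = fixed_point_step(data, protos, gamma=5.0)
    print(protos, residual(data, protos, 5.0))
```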
3.4.3 Projections of the latent points
The learning rule
$$\Delta_i w_{md} = \sum_{k=1}^{K} \eta \, \phi_{km} \left( x_i^{(d)} - m_k^{(d)} \right) r_{ik} \tag{3.40}$$
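is defined in terms of the weights. But since m = ΦW, $\frac{\partial m}{\partial t} = \Phi \frac{\partial W}{\partial t}$, or $\frac{\partial W}{\partial t} = (\Phi^T \Phi)^{-1} \Phi^T \frac{\partial m}{\partial t}$, we may investigate the convergence of the projections of the latent
points. Consider the lth latent point.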
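$$\sum_{m=1}^{M} \Delta_i \phi_{lm} w_{md} = \eta \sum_{m=1}^{M} \phi_{lm} \sum_{k=1}^{K} \phi_{km} \left( x_i^{(d)} - m_k^{(d)} \right) r_{ik}$$
i.e.
$$\Delta_i m_l = \eta \sum_{k=1}^{K} \left( x_i^{(d)} - m_k^{(d)} \right) r_{ik} \sum_{m=1}^{M} \phi_{lm} \phi_{km}$$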
There are several consequences of this:
1. The last term, $\sum_{m=1}^{M} \phi_{lm} \phi_{km}$, is constant for any particular l and is the basis
of topology preservation in the feature space: by construction, rows with large
scalar product with each other correspond to points in latent space close to
each other.
2. It is however maximal for terms in the centre of the latent space (unless we
have wrap around in the latent space). Therefore points in the centre of the
latent space will converge first.
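3. There is no objective function for this dynamics in this coordinate representation
of latent space. This follows from
$$\frac{\partial \Delta m_l}{\partial m_r} = -r_{ir} \sum_{m=1}^{M} \phi_{lm} \phi_{rm} + \sum_{k=1}^{K} \left( x_i^{(d)} - m_k^{(d)} \right) \frac{\partial r_{ik}}{\partial m_r} \sum_{m=1}^{M} \phi_{lm} \phi_{km} \neq \frac{\partial \Delta m_r}{\partial m_l} \tag{3.41}$$
Equality is a necessary and sufficient condition for the existence of an objective
function.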
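
The first two consequences can be checked numerically. The sketch below assumes a one dimensional latent space on [0, 1] with 20 latent points, 5 Gaussian basis functions and a basis width of 0.25 (illustrative choices, not values taken from the text). It builds Φ and then the matrix G whose entries are G[l][k] = Σ_m φ_lm φ_km.

```python
import math

def basis_matrix(K=20, M=5, width=0.25):
    """Phi[k][m]: activation of basis function m at latent point k (assumed layout)."""
    latent = [k / (K - 1) for k in range(K)]
    centres = [j / (M - 1) for j in range(M)]
    return [
        [math.exp(-((t - c) ** 2) / (2 * width ** 2)) for c in centres]
        for t in latent
    ]

def gram(Phi):
    """G[l][k] = sum_m phi_lm * phi_km, the coupling term in the update of m_l."""
    return [[sum(a * b for a, b in zip(rl, rk)) for rk in Phi] for rl in Phi]

Phi = basis_matrix()
G = gram(Phi)

if __name__ == "__main__":
    print("coupling of point 10 with points 11, 15, 19:", G[10][11], G[10][15], G[10][19])
    print("self-coupling at the centre and at the end:", G[10][10], G[0][0])
```

With these settings the coupling decays as the two latent points move apart, and the self-coupling of a central latent point is larger than that of an end point.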
3.5 Magnification Factors and Dimensionality Estimation
In [10] the magnification factors of the GTM are discussed; the authors define them as the
ratio of the area, dA′, of the space traced out by the mapping on the manifold in
data space to the corresponding area, dA, traced out in latent space. It is readily
shown that this is equal to the determinant of the Jacobian of the transformation
between the coordinates in the two spaces, which in turn equals the square root of
the determinant of the metric tensor.
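$$\frac{dA'}{dA} = J = \sqrt{g} = \left| \delta_{kl} \frac{\partial e_m^k}{\partial e_t^i} \frac{\partial e_m^l}{\partial e_t^j} \right|^{\frac{1}{2}} \tag{3.42}$$
where $e_m^i$ is the ith basis vector of the manifold in data space and $e_t^i$ is the ith basis
vector of the latent space.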
This is used to illustrate the separation of clusters on the manifold: points which
are well separated in data space but adjacent on the projection will be identifiable
by having locally a large value of this ratio. Points close in both basis systems will
have a relatively small value of the ratio. In practice, the ratio is calculated as
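$$\frac{dA'}{dA} = \left| \Xi^T W^T W \Xi \right|^{\frac{1}{2}} \tag{3.43}$$
where $\Xi_{kj} = \frac{\partial \phi_{kj}}{\partial t_k} \propto (c_j - t_k) \phi_{kj}$ and $c_j$ is the centre of the jth Gaussian mapping
the latent space to the feature space.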
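
For a one dimensional latent space the determinant in (3.43) reduces to the length of the tangent vector dm/dt = Σ_j (∂φ_j/∂t) w_j. The sketch below computes that length analytically for Gaussian basis functions of the form exp(-(t - c_j)²/(2s²)); the parameterisation of the basis functions, the width s and the weight matrix W are assumptions made for illustration, with W stored as M rows of D values to match m = ΦW.

```python
import math

def phi(t, centres, width):
    """Gaussian basis activations at latent position t."""
    return [math.exp(-((t - c) ** 2) / (2 * width ** 2)) for c in centres]

def dphi_dt(t, centres, width):
    """Derivative of each activation: ((c_j - t) / width^2) * phi_j(t)."""
    return [((c - t) / width ** 2) * p for c, p in zip(centres, phi(t, centres, width))]

def prototype(t, centres, width, W):
    """m(t) = phi(t) W, a point in data space."""
    p = phi(t, centres, width)
    return [sum(pj * w[d] for pj, w in zip(p, W)) for d in range(len(W[0]))]

def magnification(t, centres, width, W):
    """Length of dm/dt, i.e. the local stretch of the 1D latent space."""
    xi = dphi_dt(t, centres, width)
    dm = [sum(xj * w[d] for xj, w in zip(xi, W)) for d in range(len(W[0]))]
    return math.sqrt(sum(v * v for v in dm))

if __name__ == "__main__":
    centres = [0.0, 0.25, 0.5, 0.75, 1.0]
    W = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [3.0, 2.5], [4.0, 4.0]]  # hypothetical weights
    for t in (0.1, 0.5, 0.9):
        print(t, magnification(t, centres, 0.25, W))
```

A finite-difference estimate of the distance between m(t + h) and m(t - h), divided by 2h, gives the same value up to discretisation error, which is a convenient sanity check.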
However, the above is based on the assumption that the data actually lie on a
manifold, which they may not. Also, the method will not identify folds in the manifold.
For these purposes, we must consider the mapping from data space to latent
space (where the points will actually be visualised; see footnote 2):
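$$x_i \to y_i = \sum_{k=1}^{K} r_{ik} t_k = \sum_{k=1}^{K} \frac{\exp(-\gamma \| x_i - m_k \|^2)}{\sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2)} t_k \tag{3.44}$$
when using the Gaussian kernel function or
$$x_i \to y_i = \sum_{k=1}^{K} r_{ik} t_k = \frac{\sum_{k=1}^{K} C(i, k) t_k}{\sum_{k=1}^{K} C(i, k)} \tag{3.45}$$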
in general, with C(.,.) defined as in e.g. (3.18) or (3.19). In both (3.44) and
(3.45), we are using the latent points, tk, as basis vectors. This is liable to be an
overcomplete basis since typically K >> L, the dimensionality of the latent space.
Therefore the representation of each point in this basis is non-unique.
[Footnote 2: For simplicity in the notation we consider a one dimensional latent space.]
Assuming (3.44) for responsibilities, we find that
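$$\frac{\partial y_i}{\partial x_i} = \frac{2\gamma \sum_{k=1}^{K} \exp(-\gamma \| x_i - m_k \|^2) \left\{ \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \left[ x_i - m_k + x_i - m_l \right] \right\} \cdot t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} \tag{3.46}$$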
which may be better written as
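$$\frac{\partial y_i}{\partial x_i} = \frac{4\gamma \sum_{k=1}^{K} \exp(-\gamma \| x_i - m_k \|^2) \left\{ \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \left[ x_i - \frac{m_k + m_l}{2} \right] \right\} \cdot t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} \tag{3.47}$$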
Let us assume that one responsibility dominates the others i.e. that xi is much
closer to mk∗ than it is to any other projected latent point. Then
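$$\frac{\partial y_i}{\partial x_i} \approx \frac{4\gamma \left\{ \exp(-\gamma \| x_i - m_{k^*} \|^2) \right\}^2 \left[ x_i - m_{k^*} \right] \cdot t_{k^*}}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} \tag{3.48}$$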
i.e. the closer xi is to the prototype, mk∗, the lower the rate of change of its position
on the projected manifold. This gives a fine tuning effect within the centres of
clusters. Now let xi be close to a small number of projected prototypes indexed by
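k ∈ T. Letting $\exp(-\gamma \| x_i - m_k \|^2) = d_k$, (3.47) becomes
$$\frac{\partial y_i}{\partial x_i} = \frac{4\gamma \sum_{k \in T} d_k \left\{ \sum_{l \in T} d_l \left[ x_i - \frac{m_k + m_l}{2} \right] \right\} \cdot t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \| x_i - m_l \|^2) \right)^2} \tag{3.49}$$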
In particular, if T = {k1, k2}, (3.49) becomes
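$$\frac{\partial y_i}{\partial x_i} \propto d_{k_1}^2 (x_i - m_{k_1}) t_{k_1} + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] (t_{k_1} + t_{k_2}) + d_{k_2}^2 (x_i - m_{k_2}) t_{k_2} \tag{3.50}$$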
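At convergence, $\partial y_i / \partial x_i = 0$ and so
$$\left\{ d_{k_1}^2 (x_i - m_{k_1}) + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] \right\} t_{k_1} = -\left\{ d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] + d_{k_2}^2 (x_i - m_{k_2}) \right\} t_{k_2}$$
If L ≥ 2, this basis is not overcomplete and so each of these coordinates must
independently equal 0:
$$d_{k_1}^2 (x_i - m_{k_1}) + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] = 0$$
$$d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] + d_{k_2}^2 (x_i - m_{k_2}) = 0$$
and so
$$d_{k_1}^2 (x_i - m_{k_1}) = d_{k_2}^2 (x_i - m_{k_2}) \tag{3.51}$$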
Then
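$$\frac{(x_i - m_{k_1})}{(x_i - m_{k_2})} = \frac{d_{k_2}^2}{d_{k_1}^2} = \frac{\left( \exp(-\gamma (x_i - m_{k_2})^2) \right)^2}{\left( \exp(-\gamma (x_i - m_{k_1})^2) \right)^2} = \frac{\left( \exp(\gamma (x_i - m_{k_1})^2) \right)^2}{\left( \exp(\gamma (x_i - m_{k_2})^2) \right)^2} \tag{3.52}$$
i.e. the effect that $x_i$ has on each prototype is inversely proportional to the square
of the exponential of the distance from $x_i$ to the prototype.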
3.6 Simulations
In this section we apply ToPoE to artificial and real datasets. It should be noted
that, when results are shown for different kernels (Gaussian, Epanechnikov or Tri-
cube), this kernel is applied just in the visualisation step of the algorithm, in order
to compare it with the other algorithms introduced in this document, which use
the responsibilities only in the visualisation step. The responsibilities calculation of
the training step is always as presented above, with Gaussian kernels. Experiments
carried out with different kernels in the training phase did not give significant changes
in the outputs.
3.6.1 Experiment 1: 1D Artificial Data
Figure 3.8 shows the result of a simulation in which we have up to 20 latent points
deemed to be equally spaced in a one dimensional latent space, passed through 5
Gaussian basis functions and then mapped to the data space by the linear mapping
W . We generated a two dimensional dataset, (x1, x2), from the function x2 =
x1 + 1.25 sin(x1) + µ where µ is noise from a uniform distribution in [0,1].
In fact, ToPoE does not need a growing procedure of this kind, which is one of the appeals of this
model. The algorithms presented in Chapters 4 and 5 do, however, require a
growing version.
[Figure 3.8: The ToPoE topology preserving mappings of the 1D data (in red '+') with 2, 4, 8 and 20 latent points (in blue connected '+' marks). All the latent mk prototypes stay in the manifold.]
3.6.2 Experiment 2: 2D Artificial Data
Using an artificial data set with four clusters spaced out evenly on a line, we examine the
results for ToPoE. This is the first of the experiments in which we compare the use of
different kernels in the calculation of responsibilities for visualisation purposes.
The locations of the prototypes in data space are necessarily the same in all three cases,
since the kernels differ only in the visualisation step.
We can notice the continuity in their location, and that they are not spread over all clusters
but lie just in the centre of the figure. The Gaussian kernel does not separate
the four clusters well (Figure 3.9); both Epanechnikov and Tri-cube do it perfectly. The
reason for this difference is that, with the Epanechnikov and Tri-cube kernels, only
the neurons that are within a certain distance of the datapoint are used for the
projection into latent space. The Gaussian kernel, however, considers all the neurons,
and if there are many latent points with a very small responsibility, the projection
can be moved towards them. The Epanechnikov and Tri-cube kernels only consider
the neurons with higher responsibility, projecting only towards their position in the
two dimensional grid.
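
The pull exerted by many small responsibilities can be reproduced in a toy setting. The kernels of (3.18) and (3.19) are not restated in this section, so the sketch below assumes the standard textbook forms of the Epanechnikov and Tri-cube kernels with a support radius called lam; the prototypes, latent positions, data point and parameter values are all hypothetical, and a one dimensional data and latent space is used for brevity.

```python
import math

def gaussian(d, gamma):
    """Gaussian kernel: never exactly zero, so every prototype contributes."""
    return math.exp(-gamma * d * d)

def epanechnikov(d, lam):
    """Standard Epanechnikov kernel: zero beyond distance lam."""
    u = d / lam
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def tricube(d, lam):
    """Standard Tri-cube kernel: zero beyond distance lam."""
    u = abs(d) / lam
    return (1.0 - u ** 3) ** 3 if u <= 1.0 else 0.0

def project_to_latent(x, protos, latent, kernel):
    """y = sum_k C_k t_k / sum_k C_k, with C_k = kernel(|x - m_k|)."""
    c = [kernel(abs(x - m)) for m in protos]
    total = sum(c)
    if total == 0.0:
        return None  # no prototype lies within the kernel's support
    return sum(ck * t for ck, t in zip(c, latent)) / total

protos = [3.0 + 0.1 * k for k in range(11)]    # tightly packed prototypes (hypothetical)
latent = [-1.0 + 0.2 * k for k in range(11)]   # their latent positions
x = 3.0                                        # data point sitting on the first prototype

y_gauss = project_to_latent(x, protos, latent, lambda d: gaussian(d, gamma=1.0))
y_epan = project_to_latent(x, protos, latent, lambda d: epanechnikov(d, lam=0.25))
y_tri = project_to_latent(x, protos, latent, lambda d: tricube(d, lam=0.25))

if __name__ == "__main__":
    print(y_gauss, y_epan, y_tri)
```

Although the data point coincides with the prototype whose latent position is -1, the Gaussian projection is dragged towards the middle of the latent space by the remaining prototypes, while the two compact kernels keep it close to -1.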
[Figure 3.9: Projection of the 4 clusters data for ToPoE with 10*10 latent points and Gaussian kernel (top), Epanechnikov kernel (middle) and Tri-cube kernel (bottom). Left: data space in which blue '*' shows the positions of the mk prototypes and datapoints are in red '+'. Right: latent space with yi projections.]
[Figure 3.10: ToPoE projection of the animals dataset: M=5*5, K=10*10. Legend groups: Dove, Hen, Duck, Goose; Owl, Hawk, Eagle; Fox, Dog, Wolf; Cat; Tiger, Lion; Horse, Zebra, Cow.]
3.6.3 Experiment 3: The Animals data set
In this case we use ToPoE to map a higher dimensional dataset. This animal dataset
was used in [64] with GTM and SOM, so we can compare our results with those
models. This is a 13-dimensional dataset with 6 classes of animals (see footnote 3). The animals are classified
in six groups according to attributes such as the number of legs, size, and whether they swim, run or fly.
Figure 3.10 shows the ToPoE projection with 5*5 basis functions,
10*10 latent points and 0.1 for their width. We have birds and 4-legged animals
separated as in the GTM and SOM examples, but the clusters are better defined.
3.6.4 Experiment 4: Bank Notes Data
The bank data set has 200 observations of six properties of banknotes. Observations
were made for sets of 100 forged and 100 genuine banknotes.
ToPoE is able to completely separate forged bank notes, in the lower half, from genuine
bank notes, in the upper half (Figure 3.11).
[Footnote 3: http://student.ifs.tuwien.ac.at/animals.tar.gz]
[Figure 3.11: ToPoE projection of the bank note data. Forged notes denoted by red asterisks, genuine notes by green circles.]
3.6.5 Experiment 5: The Fundamental Clustering Problems Suite
To illustrate the different kernels in Section 3.2, we use some of the datasets that
appear in The Fundamental Clustering Problems Suite (FCPS) [85]; these datasets
are all low-dimensional, but some algorithms like K-Means have difficulty in clustering
them, so they are suitable for illustrating the different visualisation capabilities
of the different kernels. Specifically, we use the Hepta and the Target datasets; the
first one has clusters with different densities while the second one includes several
outliers.
We see in Figure 3.12 that the Gaussian kernel does not give a good impression
of the correct topology for the Hepta dataset, so that it is not possible to identify
which is the central cluster (in red) from the projection. Using the other kernels,
however, this cluster appears in the middle, and the Epanechnikov kernel gives a better
visualisation. In Figure 3.13 we see that the location of the prototypes in data space
is similar, and that it is only the different calculation of the responsibilities after the
training that gives a different projection. ToPoE has been applied to other datasets
with clusters of different densities, as in the Hepta dataset, and we found that
ToPoE tends to separate the denser clusters into a different projection
area when using the Gaussian kernel. The reason could again be the number of
[Figure 3.12: ToPoE projection of the hepta data (top left), with the Gaussian kernel (top right), Epanechnikov kernel λ=18 (bottom left) and Tri-cube kernel λ=18 (bottom right).]
latent points with very small responsibility that are pulling the projection towards
them. This could be useful when the purpose is to isolate areas of high density in
the data. If the aim is to maintain the topology, we have to use the right kernel for
visualisation.
Similarly, with the Target dataset (Figure 3.14), the Gaussian kernel does not
maintain the topology of the clusters, while the other kernels are able to do so,
especially the Epanechnikov kernel, which is the only one to give an intuitively satisfying projection.
Figure 3.15 shows again the prototypes in data space and the projections for two of the kernels.
The first thing we notice is that ToPoE has imposed a structure on the location
of the prototypes. They all remain within a region, keeping a common direction,
both for the Gaussian and Epanechnikov kernels; this reflects the tension between
the shape of the data and the shape of the map in latent space. The Epanechnikov
kernel, however, is able to project the right topology in latent space.
As noted for the Hepta dataset, the Gaussian projection for ToPoE separates
the tighter cluster from the other six, which are of similar density. The
localisation of prototypes with ToPoE for the Hepta dataset (Figure 3.13) is also
within a fixed structure. They all remain within a continuous shape (similar to a
[Figure 3.13: ToPoE projection of the hepta data with the Gaussian kernel (left) and the Epanechnikov kernel (right) at the bottom; the corresponding top figures show the position of the mk prototypes in data space.]
fan for this 3D experiment).
3.6.6 Experiment 6: The Algae data set
This is a set of 118 samples from a scientific study of various forms of algae, some of
which have been manually identified. Each sample is recorded as an 18 dimensional
vector representing the magnitudes of various pigments. 72 samples have been identified
as belonging to a specific class of algae; these are labelled from 1 to 9. 46
samples have yet to be classified and these are labelled 0. The results of ToPoE
training are depicted in Figure 3.16. Figure 3.17 zooms into the central and left areas
of the previous figure, in order to better visualise the separation of clusters in the
projection space. Finally, Figure 3.18 adds the projection of the unclassified datapoints,
according to the previous training. To classify those new points we would
need an additional technique like K-Nearest Neighbours in order to associate each
datapoint with a particular class.
It is of interest to compare the GTM on the same data: we use a two dimensional
latent space with a 10×10 grid for comparison. The results are shown in Figures
3.19 and 3.20. The GTM makes a very confident classification: we see that the
[Figure 3.14: Target data (top left), ToPoE projection of the data with the Gaussian kernel (top right), with the Epanechnikov kernel λ=18 (bottom left) and Tri-cube kernel λ=18 (bottom right).]
responsibilities for data points are very confidently assigned in that individual classes
tend to be allocated to a single latent point. This, however, works against the GTM
in that, even when zooming in to the map, one sometimes cannot disambiguate
two different classes, such as at the points (1,-1) and (1,1). This was not alleviated
by using regularisation in the GTM, though we should point out that we have a very
powerful model for a rather small data set.
In fact, we can control the level of quantisation in ToPoE by changing the γ
parameter in (3.9). For example, by lowering γ, we share the responsibilities more
equally and so the map contracts to the centre of the latent space to get results
such as those shown in Figure 3.21; the different clusters can still be identified but rather
less easily. Alternatively, by increasing γ, one tends to get the data clusters confined
to a single node, the one which has sole responsibility for that cluster, as in the
Self-Organizing Map.
[Figure 3.15: ToPoE projection of the target data with the Gaussian kernel (left) and the Epanechnikov kernel (right) at the bottom; the corresponding top figures show the position of the mk prototypes in data space.]
[Figure 3.16: Projection of the 9 classes by the ToPoE. Legend: Classes 1 to 9.]
[Figure 3.17: Left: zooming in on the central portion. Right: zooming in on the left side.]
[Figure 3.18: The projection of the whole data set by the ToPoE. Legend: Classes 1 to 9 and unclassified.]
[Figure 3.19: The projection of the 9 classes of algae data given by the GTM. Legend: Classes 1 to 9.]
[Figure 3.20: The projection of the algae data given by the GTM. Legend: Classes 1 to 9 and unclassified.]
[Figure 3.21: By lowering the γ parameter, the ToPoE map is contracted. Legend: Classes 1 to 9 and unclassified.]
3.7 Conclusions
We have reviewed the Topographic Product of Experts introduced in [30]. This is
the first of the algorithms in this thesis with the underlying model of the GTM.
ToPoE replaces the mixture of experts by a product of experts, and the EM algorithm by
gradient descent. We have extended the algorithm by investigating its local variance,
the projection to latent space and its convergence properties. We have also studied the use of
the magnification factors as a tool for measuring topology preservation.
We have applied ToPoE to several artificial and real datasets, analysing the
projections in latent space and the location of the prototypes. We have seen that the
projections are in general good with a Gaussian kernel, but sometimes need a different
kernel to better visualise the topology of the data. Real and higher dimensional
datasets were correctly mapped with this algorithm, which for some of them gave clusters
that are more clearly separated in the projections than with the GTM algorithm.
Chapter 4
The Harmonic Topographic
Mapping
This is the second of the family of algorithms within the topology preserving maps
category presented in this thesis, which share the Generative Topographic Mapping
model. Here, however, there is the important variation that the projection to the
lower-dimensional space is separate from the clustering step, so that the two are not
included in the same learning rule. The underlying structure is the same as in ToPoE,
and thus the same as in the GTM. The clustering technique, though, is based on the
K-Harmonic Means.
The first attempt was to apply the K-Harmonic Means (KHM) algorithm developed
by Zhang [91, 92, 95] to the Self-Organising Map (Section 4.1); the results were
unsatisfactory, which suggested using KHM with a different frame of centres.
We then focused our attention on the GTM algorithm and the Topographic Product
of Experts (Section 3.1). We found its non-linear projection more suitable for our
purpose (allowing us to separate projection from clustering in two different steps),
creating the Harmonic Topographic Mapping (HaToM) (Section 4.2). We developed
two versions of this algorithm, which give better projections in different situations.
4.1 The Harmonic Self-Organising Map
As seen in the literature review (Chapter 2), the Harmonic Average of K points,
a1, ..., aK , is defined as
$$HA(\{a_i, i = 1, \cdots, K\}) = \frac{K}{\sum_{k=1}^{K} \frac{1}{a_k}} \tag{4.1}$$
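
As a small numerical illustration of (4.1), the harmonic average can be written in one line of Python; the function name and the sample values are ours. The second call shows the property that matters for clustering: the result is dominated by the smallest value, since it can never exceed K times the minimum.

```python
def harmonic_average(values):
    """K divided by the sum of reciprocals, as in (4.1); values must be non-zero."""
    return len(values) / sum(1.0 / a for a in values)

if __name__ == "__main__":
    print(harmonic_average([1.0, 2.0, 4.0]))    # 3 / (1 + 1/2 + 1/4)
    print(harmonic_average([0.01, 100.0]))      # stays below 2 * 0.01
```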
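Using this, Zhang et al developed the K-Harmonic Means algorithm whose recursive
formula to update the prototypes is
$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{4.2}$$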
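
A batch implementation of the update (4.2) might look as follows. This is a sketch rather than Zhang's reference code: the function names are ours, and the small constant eps that guards against a zero distance is an added assumption.

```python
def khm_weights(data, protos, eps=1e-12):
    """q[i][k] = 1 / (d_ik^4 * (sum_l 1 / d_il^2)^2), the weights appearing in (4.2)."""
    Q = []
    for x in data:
        # squared distances d_ik^2, floored at eps to avoid division by zero (assumption)
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, m)), eps) for m in protos]
        s = sum(1.0 / v for v in d2)
        Q.append([1.0 / (v * v * s * s) for v in d2])
    return Q

def khm_update(data, protos):
    """One pass of (4.2): every prototype becomes a weighted mean of all the data."""
    Q = khm_weights(data, protos)
    D = len(data[0])
    new = []
    for k in range(len(protos)):
        den = sum(q[k] for q in Q)
        new.append(
            tuple(sum(q[k] * x[d] for q, x in zip(Q, data)) / den for d in range(D))
        )
    return new

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    protos = [(4.0, 4.0), (6.0, 6.0)]   # deliberately poor starting positions
    for _ in range(50):
        protos = khm_update(data, protos)
    print(protos)
```

For a data point that is much closer to one prototype than to the others, the weight for that prototype is close to 1 and the weights for the rest are close to 0, so each prototype ends up near the mean of its own cluster.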
In an attempt to improve the SOM algorithm using the clustering properties of
K-Harmonic Means, we included this recursive formula in the update rule for the
prototypes of the SOM, creating the Harmonic SOM (HSOM),
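$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i \Lambda_i(k, k^*)}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{4.3}$$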
where Λi(k, k∗) denotes the value of the neighbourhood function when k∗ is the
winning neuron when the network is presented with the ith data point, xi.
4.1.1 HSOM Simulations
Uniform distribution artificial data
We first illustrate the Harmonic SOM on the standard data set which is drawn
from a uniform distribution on [0,1]×[0,1]. In Figure 4.1, we show the results of
a simulation in which (4.3) was performed 200 times for each data point. In fact,
there was very little change after the first 20 iterations and so the map found is very
stable. The prototypes were initialised randomly near the mean of the dataset. We
see that nearby prototypes are in fact positioned close to one another in data space
but, because the dimensionality of the map is not the same as that of the data, some
data points which are close to one another are not quantised to prototypes which
are adjacent on the map.
2D artificial data
Now we investigate the ability of the Harmonic SOM to enable users to visualise
manifolds of lower dimensionality embedded in a higher dimensionality dataset, projecting
into a two dimensional space. To illustrate this, we generated 1000 two
dimensional data points, (x1, x2), from the function x2 = x1 +1.25 sin(x1)+µ where
µ is noise from a uniform distribution in [0,1]. Thus this two dimensional data exists
[Figure 4.1: The prototypes of a Harmonic SOM trained on data from a uniform distribution on [0,1]×[0,1].]
close to a one dimensional manifold. With randomly initialised prototypes, we get
the results shown in Figure 4.2: the prototypes approximately find the manifold but
there is a twist in the mapping. This would inhibit a user from identifying the shape
of the manifold on which the data lies. Extensive simulations have shown that such
twists are very difficult to remove.
We therefore repeat this experiment but initialise the prototypes to lie on the
first principal component of the data and so the prototypes initially lie on a straight
line in data space spanning the direction of greatest spread. Now (Figure 4.3 left)
the trained map lies on the manifold in a way that enables a user to identify the
manifold merely from the quantisation of the data to the prototypes.
However there is one continuing failing in the mapping which is that the mapping
does not stick to the centre of the manifold but moves from side to side across the
manifold: the mapping is responding too finely to the actual positions of the data
set i.e. to the noise in the data set. This indeed was one of the positive aspects of
the original application of harmonic means to K-Means but is less helpful when we
wish to use the map as a data visualisation tool.
To solve this in this particular case we had to reduce the number of iterations to
two (Figure 4.3 right), which may or may not be sufficient for a real data visualisation
problem, but certainly suggests a lack of stability in the mapping.
[Figure 4.2: The map found by the HSOM is centered in the data but has a twist (data in red '+', prototypes in blue '*').]
[Figure 4.3: The Harmonic SOM (data in red '+', prototypes in blue '*') follows the manifold more smoothly with 2 iterations (right) than with 10 (left).]
4.2 Harmonic Topographic Map
In the light of the good results obtained with ToPoE, and wishing to apply K-
Harmonic Means within a different projection structure, we applied this clustering
technique to the same underlying structure as ToPoE and GTM. We thus remove the
probabilistic underpinnings in the Generative Topographic Mapping. The Harmonic
Topographic Map (HaToM) [59, 60, 67, 68, 69, 70, 71, 72] is then the sum of the GTM
projection plus the K-Harmonic Means, this time separated into two different steps.
This separation allows for the creation of two different variations of the algorithm,
as shown below: Data-driven HaToM (D-HaToM) and Model-driven HaToM (M-
HaToM). The HaToM has the same structure as the GTM, but the similarity ends
there because the objective function is not the GTM one, nor is it optimised with
the Expectation-Maximization (EM) algorithm. Instead, the HaToM uses the
well-proven clustering abilities of the K-Means algorithm, improved by using harmonic
means to make it insensitive to initialisation [95].
The basic batch algorithm often exhibited twists, such as are well-known in
the SOM [48], so we developed a growing method that prevents the mapping from
developing these twists. The growing also provides a number of advantages discussed
below.
One of the main attractions of the HaToM compared with ToPoE is that the
algorithm does not require responsibilities. These are only used when we are using
HaToM to visualise data, i.e. when we are working in latent space.
4.2.1 Data-driven HaToM
In [67] we opted to begin with a small value of K (for one dimensional latent spaces,
K=2; for two dimensional latent spaces and a square grid, K=2*2) and grow the
mapping. However we do not randomise W each time we augment K. The current
value of W is approximately correct and so we need only continue training from this
current value. Also we use a pseudo-inverse method for the calculation of W from
the positions of the prototypes.
The algorithm in [67] was
1. Initialise K to 2. Initialise the W weights randomly and spread the centres of
the M basis functions uniformly in latent space.
2. Initialise the K latent points uniformly in latent space.
Chapter 4: HaToM
3. Calculate the projection of the latent points to data space. This gives the K
prototypes, mk.
(a) count=0.
(b) Randomly select a datapoint xi; calculate dik = ||xi − mk|| for the K
prototypes.
(c) Recalculate prototypes using (4.2).
(d) If count<MAXCOUNT, count= count +1 and return to 3b
4. Recalculate W using
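$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$
where $\Xi$ is the matrix containing the K prototypes, $I$ is the identity matrix and
$\delta$ is a small constant, necessary because initially K < M and so the matrix
$\Phi^T \Phi$ is singular.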
5. If K < Kmax, K = K + 1 and return to 2.
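The weight update of step 4 and the growing step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: Gaussian basis functions (as in the GTM), a 1D latent space, and the names `gaussian_basis`, `update_weights`, the basis width and the toy prototypes are all hypothetical rather than taken from [67].

```python
import numpy as np

def gaussian_basis(latent, centres, width=1.0):
    """Phi[k, j]: activation of basis function j at latent point k (K x M)."""
    d2 = ((latent[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def update_weights(Phi, Xi, delta=1e-3):
    """Step 4: regularised solve while K < M, plain least squares otherwise."""
    K, M = Phi.shape
    A = Phi.T @ Phi
    if K < M:                       # Phi^T Phi is singular, so add delta * I
        A = A + delta * np.eye(M)
    return np.linalg.solve(A, Phi.T @ Xi)

# Growing in a 1D latent space: W is kept and re-used on the finer grid.
centres = np.linspace(-1, 1, 5)[:, None]            # M = 5 basis centres
latent_2 = np.linspace(-1, 1, 2)[:, None]           # K = 2 latent points
Xi_2 = np.array([[0.0, 0.0], [1.0, 1.0]])           # toy prototypes after clustering
W = update_weights(gaussian_basis(latent_2, centres), Xi_2)

latent_3 = np.linspace(-1, 1, 3)[:, None]           # K grows to 3
m_3 = gaussian_basis(latent_3, centres) @ W         # starting prototypes for K = 3
```

Because W is carried over, the new prototypes start on the curve already fitted for the smaller K instead of at random positions.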
If we wish to use the mapping for visualisation, we must map data points into
latent space. To do this, we define the responsibility that the kth latent point has
for the ith data point as
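$$r_{ik} = \frac{\exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} = \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)} \qquad (4.4)$$
and the new data point is placed at $y_i$ in latent space where
$$y_i = \sum_{k=1}^{K} r_{ik} t_k \qquad (4.5)$$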
where tk is the position of the kth latent point in latent space.
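A minimal NumPy sketch of (4.4) and (4.5), assuming the prototypes and the latent grid are available as arrays. The function name and the default value of gamma are illustrative only.

```python
import numpy as np

def project_to_latent(X, prototypes, latent_points, gamma=1.0):
    """Responsibilities r_ik (4.4) and latent positions y_i (4.5)."""
    # Squared distances d_ik^2 between data points and prototypes, shape (N, K).
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum leaves the ratio in (4.4) unchanged
    # but avoids underflow of every term for large gamma.
    R = np.exp(-gamma * (d2 - d2.min(axis=1, keepdims=True)))
    R /= R.sum(axis=1, keepdims=True)
    Y = R @ latent_points           # y_i = sum_k r_ik t_k
    return R, Y
```

A larger gamma narrows the responsibilities, so each data point is pulled towards the latent position of its nearest prototype.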
4.2.2 Model-driven HaToM
The steps for the M-HaToM model are:
1. Initialise K to 2. Initialise the W weights randomly and spread the centres of
the M basis functions uniformly in latent space.
2. Initialise the K latent points uniformly in latent space. Set count=0.
3. Calculate the projection of the latent points to data space. This gives the K
prototypes, $m_k = \phi_k^T W$.
4. Randomly select a datapoint xi; calculate dik = ||xi −mk|| for the K proto-
types.
5. Recalculate prototypes using (4.2).
6. Recalculate W using
$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$
with the same notation as before.
7. If count<MAXCOUNT, count= count +1 and return to 3
8. If K < Kmax, K = K + 1 and return to 2.
The projection method is the same as above. The difference with the first version
is that the projection is more constrained by the non-linear model, which is con-
stantly imposed inside the inner loop. This ensures a smoother manifold as shown
in Section 4.3.1.
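The only structural difference between the two versions is where W is recalculated. The sketch below shows this for a fixed K, leaving out the growing outer loop. It is an illustration under assumptions: `cluster_step` stands for whatever prototype update is used (for example the harmonic-means update), and `np.linalg.pinv` is used as a simple stand-in for the regularised pseudo-inverse of the algorithm steps.

```python
import numpy as np

def fit_fixed_k(X, Phi, W, cluster_step, max_count=20, model_driven=False):
    """Inner loop for a fixed number of latent points K.

    Phi is K x M (one row per latent point), W is M x D.
    cluster_step(X, m) returns updated prototypes (hypothetical callable).
    """
    m = Phi @ W                             # project latent points to data space
    for _ in range(max_count):
        m = cluster_step(X, m)              # clustering moves the prototypes
        if model_driven:
            # M-HaToM: re-impose the non-linear model at every iteration
            W = np.linalg.pinv(Phi) @ m
            m = Phi @ W
    if not model_driven:
        # D-HaToM: W is recalculated only once the clustering has finished
        W = np.linalg.pinv(Phi) @ m
    return W, m
```

With `model_driven=True` the returned prototypes always lie on the manifold spanned by the basis functions, while in the data-driven case they are wherever the clustering left them.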
4.2.3 Generalised Harmonic Topographic Map
The generalised version of both D-HaToM and M-HaToM, the Generalised Harmonic
Topographic Map or G-HaToM [65, 66] uses the generalised version of KHM. The
only change from the HaToM is in the recalculation of the prototypes, which in this
case is:
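$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^{p+2} \left( \sum_{l=1}^{K} \frac{1}{d_{il}^{p}} \right)^{2}} \, x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^{p+2} \left( \sum_{l=1}^{K} \frac{1}{d_{il}^{p}} \right)^{2}}} \qquad (4.6)$$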
so that p determines the power of the L2 distance used in the algorithm. This version
has not only the soft membership already present in the ungeneralised versions, which
allows for a continuous transition from areas of high density of data, but
also boosting properties due to the dynamic weighting function, since the effect of
any particular data point on the re-calculation of a prototype is $O(\|x_i - m_k\|^{2p-p-2})$,
which for p > 2 has greatest effect for larger distances. Again, the generalised M-HaToM
imposes the model to a greater extent, while the generalised D-HaToM
leaves more freedom for the prototypes to move.

Figure 4.4: The D-HaToM topology preserving mappings of the 1D data with 2, 4, 8 and 20 latent points (data in red '+', prototypes in blue '*').
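The prototype update (4.6) can be sketched in NumPy as below. This is a batch version written directly from the formula; the function name and the small floor `eps` on the distances (to avoid division by zero when a prototype coincides with a data point) are assumptions added for the illustration.

```python
import numpy as np

def generalised_khm_update(X, m, p=2.0, eps=1e-12):
    """One batch re-calculation of the K prototypes following (4.6)."""
    # d_ik: Euclidean distance between data point i and prototype k, shape (N, K).
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
    d = np.maximum(d, eps)
    # Inner sum over prototypes, sum_l 1 / d_il^p, shape (N, 1).
    s = (d ** -p).sum(axis=1, keepdims=True)
    # Weight of data point i in the update of prototype k.
    w = 1.0 / (d ** (p + 2) * s ** 2)
    return (w.T @ X) / w.sum(axis=0)[:, None]
```

With a single prototype the weight reduces to $d_{ik}^{p-2}$, so p = 2 gives the plain mean of the data and p > 2 gives more influence to distant points, which is the boosting behaviour described above.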
4.3 HaToM Simulations
4.3.1 Experiment 1: 1D Artificial Data
Figure 4.4 shows the result of D-HaToM applied to the 1D case. We see that, for
a sufficiently small number of latent points, the one dimensional nature of the data
set is revealed but when the number of latent points exceeds 15, the manifold found
begins to wander across the true manifold.
Figure 4.5 shows how the M-HaToM algorithm solves this problem, i.e. we
can increment the number of latent points as long as we want, without losing the
manifold shape.
Figure 4.5: The M-HaToM topology preserving mappings of the 1D data with 2, 4, 8 and 20 latent points (data in red '+', prototypes in blue '*'). All the latent points stay in the centre of the manifold.

The reason for the creation of the smooth manifold compared to the D-HaToM
algorithm is twofold:
1. The δI is a regularising term which ensures that the manifold does not wander
about the data space but sticks closely to the manifold.
2. However, even when we remove this term (for K ≥ M), the regularisation
continues since we are compressing the re-construction of Ξ into M dimensions:
$$\Xi = \Phi^T W$$
where $W = (\Phi \Phi^T)^{-1} \Phi \Xi$. Therefore
$$\Xi = \Phi^T (\Phi \Phi^T)^{-1} \Phi \Xi$$
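The compression argument can be checked numerically. The sketch below assumes Φ in the M × K orientation implied by these formulas (one column per latent point) and fills it with random values purely for illustration. It checks that $\Phi^T (\Phi \Phi^T)^{-1} \Phi$ is an orthogonal projector of rank M, so the reconstructed prototypes are confined to an M-dimensional subspace even without the δI term.

```python
import numpy as np

def reconstruction_projector(Phi):
    """P = Phi^T (Phi Phi^T)^{-1} Phi for Phi of shape (M, K) with K >= M."""
    return Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)

rng = np.random.default_rng(0)
M, K = 4, 30
Phi = rng.normal(size=(M, K))           # illustrative random basis activations
P = reconstruction_projector(Phi)       # K x K matrix acting on the prototypes

idempotent = np.allclose(P @ P, P)      # projecting twice changes nothing
rank = int(round(np.trace(P)))          # the trace of a projector equals its rank
```

Any component of Ξ outside this subspace is discarded at each recalculation of W, which acts as a smoothing constraint on the prototypes.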
4.3.2 Experiment 2: 2D Artificial Data
Using an artificial data set with four clusters spaced out evenly we see how the D-
HaToM keeps all the mk prototypes inside the clusters (Figure 4.6 Top left) but it
maps the clusters too close together in latent space, impeding the users’ ability to
differentiate between them (Figure 4.6 Top right); whereas the M-HaToM has a few
mk prototypes outside the clusters but gives a good clustering for visualisation in
latent space (Figure 4.6 Bottom). Note that Figure 4.6 (left) shows the projections
of the latent points in data space while Figure 4.6 (right) shows the positions which
the data points assume in latent space.
To be able to separate the clusters in latent space for the D-HaToM, the user
has to tune the parameters. In Figure 4.7 (top) we see the D-HaToM projection of
the same data after narrowing the width of the responsibility function. Now the
clusters are well separated. The same parameters in the M-HaToM
also give a tighter clustering (Figure 4.7 bottom).
With the M-HaToM, the W changes with each iteration which re-imposes the
model structure on the mapping in every iteration, while with the D-HaToM, the
W is only changed when the number of latent points is changed which leaves the
data in charge within the inner loop; so in the first case the model forces the mk
prototypes to have a smooth change (so as not to leave a big space between clusters)
while in the second case the data is in charge and keeps the mk prototypes only
where the data is.
Figure 4.6: Projection of the 4 clusters data with 10*10 latent points. Left: data space in which '.' shows the positions of the mk prototypes, and the datapoints are colored according to clusters. Right: Latent space with projections using the same clusters' coloring. Top: D-HaToM. Bottom: M-HaToM.

Figure 4.7: Projection of the 4 clusters data with 10*10 latent points and narrower responsibilities. Left: data space in which '.' shows the positions of the mk prototypes, and the datapoints are colored according to clusters. Right: Latent space with projections using the same clusters' coloring. Top: D-HaToM. Bottom: M-HaToM.

Figure 4.8: M-HaToM projection of the animals dataset: M=5, K=10. Legend groups: Dove, Hen, Duck, Goose; Owl, Hawk, Eagle; Fox, Dog, Wolf; Cat; Tiger, Lion; Horse, Zebra, Cow.
4.3.3 Experiment 3: The Animals data set
Figure 4.8 shows the M-HaToM projection with 5*5 basis functions, 10*10 latent
points and 0.1 for the width of the Gaussians. We have birds and 4-legged animals
separated as in the GTM and SOM examples, but the clusters are better defined.
Figure 4.9 shows the clustering with the same parameters as above for the nor-
malised data, with D-HaToM. M-HaToM produces a better clustering with and
without normalising, and seems in this case to be a more robust algorithm, im-
proving also upon the projection done by ToPoE in Section 3.6.3, by separating the
different classes better in the projection figure.
4.3.4 Experiment 4: The Fundamental Clustering Problems Suite
To illustrate the use of the three kernels explained in Section 3.2, we again use the
datasets from The Fundamental Clustering Problems Suite (FCPS) [85]. We also
calculate the magnification factors used for the GTM [10], which are the same for all
the algorithms introduced in this thesis because of the common structure underlying
all of them.
We see in Figure 4.10 that the Gaussian kernel in both D- and M-HaToM does not
give a good impression of the correct topology for the hepta dataset, so that it is not
possible to identify which is the central cluster (in red) from the projection. Figure
4.11 shows the improvement achieved with other kernels for M-HaToM. Similarly
with the target dataset (Figure 4.12), the Gaussian kernel does not properly separate
the outliers from the rest of the data for M-HaToM (D-HaToM projects correctly
this time), while the other kernels are able to do so (Figure 4.13).

Figure 4.9: D-HaToM projection of the normalised animals dataset: M=5, K=10. Legend groups: Dove, Hen, Duck, Goose; Owl, Hawk, Eagle; Fox, Dog, Wolf; Cat; Tiger, Lion; Horse, Zebra, Cow.

Figure 4.10: D-HaToM (left) and M-HaToM (right) projection of the hepta data into latent space with Gaussian kernels (bottom figures); the corresponding top figures show the position of the mk prototypes in data space.

Figure 4.11: M-HaToM projection of the hepta data (top left) with the Gaussian (top right), Epanechnikov λ=2 (bottom left) and Tri-cube λ=2 (bottom right) kernels.

Figure 4.12: D-HaToM (left) and M-HaToM (right) projection of the target data into latent space with Gaussian kernels (bottom figures); the corresponding top figures show the position of the mk prototypes in data space.

Figure 4.13: M-HaToM projection of the target data (top left) with the Gaussian (top right), Epanechnikov λ=2 (bottom left) and Tri-cube λ=2 (bottom right) kernels.

Figure 4.14: D-HaToM projection with 30*30 latent points of the target data with Gaussian (left), and Epanechnikov λ=4 (right). The magnification factors for each node are depicted by green circles, with the size proportional to its value.
The author considers that the projections for D- and M-HaToM with the Gaussian
kernel are better than with ToPoE, but the outliers were only properly separated
with D-HaToM. From Figure 4.12 we also see that the reason for this is the location
of the mk prototypes in data space: while for M-HaToM they all stay within
the main data, D-HaToM allocates some of the prototypes to the outlying regions.
The HaToM allocation of prototypes (Figure 4.11) shows the influence of K-Harmonic
Means, which keeps the prototypes in data space where the clusters
are. The M-HaToM algorithm, due to the more frequent imposition of the non-linear
projection, always finds a smoother manifold, and the prototypes within the data
clusters are always connected by prototypes in between the clusters; the low-density
areas with outliers are not well covered by this version, though. D-HaToM
allocates most of the neurons within the main clusters, but also four of them to the
four outlier areas, which helps to correctly project the topology of this dataset.
As with ToPoE, the Epanechnikov and Tri-cube kernels help to properly visualise
the data with M-HaToM. We can see the similarity between ToPoE and the model
version of HaToM that gives greater weight to the underlying model.
With respect to the magnification factors for each node for the D-HaToM (Figure
4.14) and M-HaToM (Figure 4.15), we see how the former is more sensitive to out-
liers, while the latter does not give a proper visualisation with the Gaussian kernel,
the areas with outliers being magnified to excess.
Figure 4.15: M-HaToM projection with 30*30 latent points of the target data with Gaussian (left), and Epanechnikov λ=4 (right). The magnification factors for each node are depicted by green circles, with the size proportional to its value.
The D-HaToM version, while being more sensitive to the visualisation parameters
as seen for the two-dimensional experiment, and creating a less smooth manifold
(i.e. the prototypes wander more around the data, though always inside the data
clusters), locates the centres always near the data so that the use of a specific
kernel has less influence on the visualisation, as the responsibilities will be high for
near datapoints and small for far away data. Also, the magnification factors show
equal magnification for data points and outliers, indicating a consistent relationship
between distances in data space and distances in latent space.
In the case of the M-HaToM version, the Gaussian kernel for the responsibilities
shows a bigger magnification area where the outliers are, which indicates that the
algorithm, trying to fix all data points into a smooth manifold, and being more
influenced by the model than the previous D-HaToM, has allocated prototypes in
between clusters, but not in the outliers area. The Epanechnikov kernel however,
seems to overcome this problem and gives a proper separation between data points
and outliers, due to its local properties, and shows again equal magnification factors
for all nodes around the data.
This shows that the different versions of the algorithm are better for different
kinds of data, e.g. D-HaToM is more suitable for finding outliers, while M-HaToM
finds smoother manifolds.
4.3.5 Experiment 5: The Algae data set
In this section we apply D- and M-HaToM to the algae dataset. In the author's view,
the D-HaToM algorithm with 6*6 basis functions has given a better visualisation of
the clusters of the algae data. Figure 4.16 shows a clustering with a nearly
complete separation of the different classes. With M-HaToM more basis functions
are required (see Figure 4.17), and again we see a tightening of the projection due
to model reinforcement.

Figure 4.16: The D-HaToM projection of the 9 labelled algae classes and 1 unlabelled class (0) on a harmonic mapping with a 2 dimensional set of 10*10 latent points (M=6).
Additional tightening is possible, as seen above, by reducing the width of the
responsibilities (Figures 4.18 and 4.19). The magnification factors depicted in the
same figures are smaller in the data areas (indicating tighter clusters in
high-dimensional space). In Figure 4.18, however, we see that the Magnification Factors
(MF) are higher for the areas of classes 9 and 6 (red diamonds and pink down triangles),
reflecting the fact that the differences between the distances in the two spaces are
higher for these two clusters.
It is worth noting that this mapping was achieved with 20 iterations of the
algorithms, while for ToPoE we need at least 100,000 iterations. The time that ToPoE
saves by not growing the map is offset by its longer convergence
time.
Figure 4.17: The M-HaToM projection of the 9 labelled algae classes and 1 unlabelled class (0) on a harmonic mapping with a 2 dimensional set of 10*10 latent points (M=12).

Figure 4.18: D-HaToM projection of the algae data with narrower Gaussian (left). The Magnification factors for each node are depicted on the right side (M=6).

Figure 4.19: M-HaToM projection of the algae data with narrower Gaussian (left). The Magnification factors for each node are depicted on the right side (M=12).

4.4 G-HaToM Simulations
Using the pth power of the L2 distances we are better able to separate into clusters
high dimensional and also more complex data, such as the crabs or the oil data (see
below). Also, the boosting-like weighting allows the algorithm to achieve a faster
clustering on the algae data, as we will see, compared with both previous versions of
HaToM. We call the models G-D-HaToM for the generalised D-HaToM and G-M-HaToM
for the generalised M-HaToM.
4.4.1 Experiment 1: Crabs Data
This is a 5 dimensional dataset (footnote 1) on the morphology of rock crabs of genus
Leptograpsus, with 50 specimens of each sex of each of two colour forms, blue and orange.
This data is used by Svensen [80] to show the GTM projection of the four clusters
into latent space. We illustrate the
results of the G-D-HaToM algorithm in Figure 4.20, using the L2 distance to the
third power, 20*20 latent points and 20 iterations, over non-normalised data (unlike
the GTM, which needed normalising first). The projection keeps the two
female clusters together in the lower part of the figure, while the male clusters are
at the top; only the two sexes of the blue form stay closer to each other.
4.4.2 Experiment 2: Bank Notes Data
G-D-HaToM is able to completely separate forged from genuine bank notes (see
Figure 4.21), without normalising first as was needed for the non-generalised version.
Footnote 1: http://www.stats.ox.ac.uk/pub/PRNN/crabs.dat

Figure 4.20: G-D-HaToM projection of the two species of crabs with equal proportion of both sexes: p=3. Legend: ∗ Male blue form, o Female blue form, + Male orange form, . Female orange form.

Figure 4.21: G-D-HaToM projection of the bank note data with 25 basis functions and 4*4 latent points; p=2. Forged notes denoted by red asterisks, genuine notes by green circles.

Figure 4.22: G-D-HaToM projection of the oil data: p=5.
4.4.3 Experiment 3: Oil Data
The oil flow dataset (footnote 2) consists of 1000 points classified into three flow
configurations. This is synthetic data modelling non-intrusive measurements on a pipe-line
transporting a mixture of oil, water and gas. The flow in the pipe takes one out of
three possible configurations: horizontally stratified, nested annular or homogeneous
mixture flow. The data lives in a 12-dimensional measurement space, but for each
configuration there are only two degrees of freedom: the fraction of water and the
fraction of oil. (The fraction of gas is redundant, since the three fractions must sum
to one.) Hence, the data lives on a number of 'sheets' which locally are approximately
2-dimensional. Its high dimensionality makes it suitable for
showing the capabilities of an algorithm to visualise complex data
sets. This data is used to check the hierarchical GTM in [83].
In this case again the pth power of the L2 distance separated the clusters better
than D-HaToM. Figure 4.22 shows the projection onto a 2
dimensional map with 60 by 60 latent points, 40 iterations and 20*20 basis functions,
with the L2 distance raised to the power of 5. Augmenting the number of centre points to
40*40 (Figure 4.23) gives a better separation of the three kinds of mixture.
The advantage in comparison with the hierarchical GTM is the simplicity and
the corresponding lower computational cost.
Footnote 2: http://www.ncrg.aston.ac.uk/GTM/

Figure 4.23: G-D-HaToM projection of the oil data with more basis functions, p=5.
4.4.4 Experiment 4: Algae Data
HaToM gives a very good clustering of the algae data as we showed above (Figures
4.16 and 4.17). To illustrate the improvement with the G-HaToM (both versions),
we reduce the number of latent points to the minimum: G-D-HaToM is able to
cluster this data with only 4*4 latent points and p=3, as shown in Figure 4.24,
while the G-M-HaToM needs at least 10*10 latent points, p=6 with M=6, or p=3
with M=26 (see Figures 4.25 and 4.26). A clear difference with the generalisation
of both versions of HaToM is the increment in separation of clusters (reflected by
the increment in the MF), even intra-cluster in some cases. An example is Class
3 (blue stars), that seems to have two subclasses, always differentiated in these
projections. Further analyses within the clusters could be carried out with G-D-
and G-M-HaToM.
Figure 4.24: G-D-HaToM projection of the algae data: p=3 and 4*4 latent points, and Magnification Factors below.

Figure 4.25: G-M-HaToM projection of the algae data: p=6, M=6 and 10*10 latent points, and Magnification Factors below.

Figure 4.26: G-M-HaToM projection of the algae data: p=3, M=26 and 10*10 latent points, and Magnification Factors below.

4.5 Conclusion
This Chapter presents the second of the algorithms sharing a common structure
with GTM, the Harmonic Topographic Mapping (HaToM). The main property of
this algorithm is the use of the K-Harmonic Means clustering method that overcomes
some of the drawbacks of K-Means. Another relevant characteristic is the separation
of the clustering and projection steps, allowing the inclusion or not of the projection
in the inner loop, with the clustering steps. Using that characteristic we develop
two versions of HaToM, depending on the necessity to reinforce the model (M-HaToM)
or the clustering (D-HaToM). Both of them can be further extended using
the generalised version of K-Harmonic Means, which gives additional properties to
the clustering.
We have also applied the algorithms in Chapters 3 and 4 to other tasks such as
Blind Source Separation (BSS) and forecasting [71]. These however are somewhat
separate from the main uses of topology preserving maps such as feature extraction,
clustering and visualisation, and so we do not report details here.
Chapter 5
The Topographic Neural Gas &
Algorithms comparison
This chapter introduces the last of the new algorithms presented in this thesis.
The reason for a new clustering technique within the same underlying model as
ToPoE and HaToM was the search for a faster method. Both HaToM and ToPoE
have proved to be good alternatives to the Self-Organizing Map and the Generative
Topographic Mapping, with artificial and real data, but they both require longer
training time than ToNeGas. The new algorithm makes use of the Neural Gas
technique, and it is thus called the Topographic Neural Gas.
This chapter also discusses a comparison between the three algorithms, and also
against the SOM. We comment on the differences in theory and in the experiments
already presented, and add new experiments that study the topology preservation
and quantization of the algorithms, for which we employ some of the functions in
the somtoolbox [62].
5.1 The Topographic Neural Gas
The Topographic Neural Gas (ToNeGas) [73] unifies the underlying structure of the
GTM for topology preservation, with the technique of Neural Gas which clusters
the prototypes in data space.
The Topographic Neural Gas gains advantages from the Neural Gas clustering as
well as from the GTM-like structure. Three main advantages of the NG model are [74]:
1. faster convergence to low distortion errors,
Chapter 5: ToNeGas & Comparisons
2. lower distortion error than that resulting from K-Means clustering, maximum-
entropy clustering and Kohonen’s self-organizing map algorithm [47],
3. obeying a stochastic gradient descent on an explicit energy surface.
From the non-linear projection from latent space to data space, the algorithm obtains
topology preservation as well as visualisation on a low dimensional grid.
As we have seen in Section 2.3.1, in K-Means a set of example vectors is clustered
into a few prototypes iteratively such that a distortion measure is continuously
minimized. Every prototype has its mean vector. In a K-Means iteration, every
example vector is assigned to the prototype with the closest mean vector. After that
every mean vector is replaced by the average of all vectors that have been assigned
to it. The neural-gas algorithm is a generalization of the K-Means algorithm. The
difference is that every example vector is assigned not to a single prototype but
to more than one prototype: to the closest prototype with a
high weight and to other prototypes with smaller weights. After an iteration, the
mean of a prototype is replaced by the weighted average of all assigned vectors. In
this way, the neural gas algorithm is smoother, and every prototype gets to see all
data (some with a very low weight). The neural-gas algorithm uses a temperature
value t which defines what weight will be given to the closest prototype, to the
second closest prototype, etc. The initially higher values of the temperature enable
bigger changes in the values (and uphill moves), so that local minima can be escaped
from, reaching other minima or regions of high density of data. Progressively the
temperature diminishes and the changes allowed become much smaller (the probability
of uphill moves being smaller as well), keeping the values around the minimum
selected. The weights decay exponentially with the increasing distance-rank of the
prototypes, such that the nth-closest prototype gets a weight of $\exp(-(n-1)/t)$.
In ToNeGas we usually use a very low temperature and decrease the temperature
every iteration such that in the end it is virtually zero, and the neural-gas algorithm
resembles the K-Means algorithm.
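The rank-based weighting described here can be sketched as one batch iteration in NumPy. This follows the description in this paragraph (weighted average with weights $\exp(-(n-1)/t)$) rather than the SOM Toolbox code, and the function names are illustrative.

```python
import numpy as np

def rank_weights(X, m, t):
    """Weight exp(-(n-1)/t) of each prototype for each example, shape (N, K)."""
    d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
    # Double argsort turns distances into ranks: 0 for the closest prototype,
    # 1 for the second closest, and so on (this is n - 1 in the text).
    ranks = d.argsort(axis=1).argsort(axis=1)
    return np.exp(-ranks / t)

def batch_neural_gas_step(X, m, t):
    """Replace each prototype by the rank-weighted average of all examples."""
    h = rank_weights(X, m, t)
    # Note: at very low t a prototype that is never the closest one gets
    # (numerically) zero total weight; this sketch does not guard against that.
    return (h.T @ X) / h.sum(axis=0)[:, None]
```

As t approaches zero only the closest prototype keeps a non-zero weight, and the step reduces to a K-Means update.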
The algorithm has been implemented based on the Neural Gas algorithm code
included in the SOM Toolbox for Matlab [62]. We develop the same growing model
as in HaToM, starting with 2*2 latent points in a square grid, and using the previous
W value when incrementing the number of latent points. The steps of the algorithm
are as follows:
1. Initialise K to 2. Initialise the W weights randomly and spread the centres of
the M basis functions uniformly in latent space.
2. Initialise the K latent points uniformly in latent space. Set t=0.
3. Calculate the projection of the latent points to data space. This gives the K
prototypes, $m_k = \Phi(t_k)^T W$.
4. Randomly select a datapoint.
5. Calculate the distances between the datapoint selected and all the prototypes.
6. Calculate the rank $r_k(d)$ of each prototype depending on the distances, and
the neighborhood function $h_\rho(r_k(d)) = e^{-r_k(d)/\rho(t)}$ with
$$\rho(t) = \rho(0) \, [\rho(T)/\rho(0)]^{t/T}. \qquad (5.1)$$
7. Recalculate prototypes using the learning rule $m_k = m_k + \epsilon(t) \, h_\rho[r_k(d)] \, (x - m_k)$ with
$$\epsilon(t) = \epsilon(0) \, [\epsilon(T)/\epsilon(0)]^{t/T}. \qquad (5.2)$$
8. If t<MAXCOUNT, t=t+1 and return to 4.
9. Recalculate W using
$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$
10. If K < Kmax, K = K + 1 and return to 2.
11. For every data point, xi, calculate the Euclidean distance between the ith data
point and the kth prototype as dik = ||xi −mk||.
12. Calculate the responsibilities that the kth latent point has for the ith data point
and the projections of each datapoint in latent space
$$r_{ik} = \frac{C_\lambda(i,k)}{\sum_{l=1}^{K} C_\lambda(i,l)} \quad \text{and} \quad y_i = \sum_{k=1}^{K} r_{ik} t_k \qquad (5.3)$$
where $t_k$ is the position of the kth latent point in latent space, and $C_\lambda(i,k)$ the
Epanechnikov kernel
$$D_\lambda(i,k) = \frac{d_{ik}^2}{\lambda} \quad \text{and} \quad C_\lambda(i,k) = \begin{cases} \frac{3}{4}\left(1 - D_\lambda(i,k)^2\right) & \text{if } \|D_\lambda(i,k)\| < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$
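Steps 4 to 8 (the Neural Gas inner loop) can be sketched as follows. This is an illustration written from the steps above, not the SOM Toolbox implementation; the start and end values of ρ and ε are hypothetical defaults, and T plays the role of MAXCOUNT.

```python
import numpy as np

def decay(start, end, t, T):
    """Exponential schedule used in (5.1) and (5.2): start * (end / start)^(t / T)."""
    return start * (end / start) ** (t / T)

def neural_gas_inner_loop(X, m, T, rho=(1.0, 0.01), eps=(0.5, 0.005), rng=None):
    """Online rank-based update of the prototypes m (K x D) for T iterations."""
    rng = np.random.default_rng() if rng is None else rng
    m = np.array(m, dtype=float)                 # work on a float copy
    for t in range(T):
        x = X[rng.integers(len(X))]              # step 4: random datapoint
        d = np.linalg.norm(m - x, axis=1)        # step 5: distances
        ranks = d.argsort().argsort()            # step 6: rank of each prototype
        h = np.exp(-ranks / decay(rho[0], rho[1], t, T))
        # step 7: move every prototype towards x, scaled by its rank weight
        m += decay(eps[0], eps[1], t, T) * h[:, None] * (x - m)
    return m
```

After this loop the prototypes would be passed to step 9 as the matrix Ξ for the recalculation of W.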
The visualisation is provided by the projection of each datapoint to latent space
yi, using the responsibilities of all the prototypes for each data point rik, and the
fixed prototypes in latent space tk. The responsibilities include the Epanechnikov
kernel that proved to be better also for HaToM [72]. One of the advantages of this
algorithm is that the Neural Gas part is independent of the non-linear projection,
thus the clustering efficiency is not limited by the topology preservation restriction.
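The projection with Epanechnikov responsibilities, (5.3) and (5.4), can be sketched in NumPy as below. The function name is illustrative, and the handling of data points that fall outside the support of every kernel (they receive zero responsibilities and are placed at the latent origin) is an assumption of this sketch, since the text does not specify that case.

```python
import numpy as np

def epanechnikov_responsibilities(X, m, latent_points, lam=2.0):
    """Responsibilities r_ik and latent projections y_i from (5.3) and (5.4)."""
    # Squared distances d_ik^2, shape (N, K), scaled by lambda as in (5.4).
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    D = d2 / lam
    # Kernel with compact support: zero for prototypes that are too far away.
    C = np.where(np.abs(D) < 1.0, 0.75 * (1.0 - D ** 2), 0.0)
    total = C.sum(axis=1, keepdims=True)
    # Normalise row-wise; rows with no active kernel stay at zero (assumption).
    R = np.divide(C, total, out=np.zeros_like(C), where=total > 0)
    return R, R @ latent_points
```

Because distant prototypes contribute exactly zero, each projection depends only on the prototypes near the data point, which is the local property that distinguishes this kernel from the Gaussian.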
As with HaToM, it is possible to include the non-linear mapping in the inner
loop, which would create the Model-driven version of ToNeGas, but the property
sought here is a reduction in convergence time.
5.1.1 Simulations
5.1.2 Experiment 1: The Fundamental Clustering Problems Suite
For both target and hepta datasets the Topographic Neural Gas separates the clusters
well, projecting the right topology into the latent space (Figure 5.1 and Figure
5.2). The prototypes (bottom left of the Figures) are mainly located within the
clusters.
The Harmonic Topographic Mapping also proved to be good at separating these
datasets (see Section 4.3.4) (footnote 1). To illustrate how much the clustering speed
of NG improves ToNeGas over HaToM, we evaluate the time of convergence for
both algorithms and four datasets in Table 5.1. The difference in time is noticeable,
especially when the number of datapoints is large.
Another possible criterion for comparison is the reduction in the Mean Quan-
tisation Error (MQE) while growing the map. In this experiment we calculate the
MQE every time we add new latent points to the map, that is after finishing each
run of the clustering technique (K-Harmonic Means for HaToM and Neural Gas for
ToNeGas). We can see in Figure 5.3 that both techniques reduce the MQE, but the
1Note: We are comparing ToNeGas vs HaToM with the Epanechnikov kernel, because this isthe one used in ToNeGas.
101
Chapter 5: ToNeGas & Comparisons
−4−2
02
4
−4
−2
0
2
4−4
−2
0
2
4
−4−2
02
4
−4
−2
0
2
4−4
−2
0
2
4
−0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
Figure 5.1: Original data (top), 2 prototypes and red dots for datapoints in dataspace (bottom left), and ToNeGas projection of the hepta data (bottom right).
Table 5.1: Convergence time (seconds) for HaToM and ToNeGas.
Dataset       Four clusters   Algae   Hepta   Target
No. samples   800             118     212     770
Dim           2               19      3       2
HaToM         174.47          7.07    17.19   155.19
ToNeGas       20.21           6.10    7.24    19.23
Figure 5.2: Original data (top), prototypes (squares) and datapoints (red dots) in data space (bottom left), and ToNeGas projection of the target data (bottom right).
Figure 5.3: Mean quantisation error over time for the Harmonic Topographic Mapping (top) and the Topographic Neural Gas (bottom) with the algae data. The horizontal axes show the moment that new latent points were added to the mapping.
Figure 5.4: ToNeGas projection of the 9 labelled algae classes and 1 unlabelled class (0). Legend: one symbol per class, Classes 1 to 9 and Class 0.
change is much faster and to a lower final value for ToNeGas.
In conclusion, the clustering speed of Neural Gas gives an important improvement
over the previously developed algorithm, the Harmonic Topographic Mapping, and
has also proved to reduce the mean quantisation error much more than the latter.
5.1.3 Experiment 2: The Algae data set
ToNeGas is able to cluster this data correctly (Figure 5.4). Here we used wider
responsibilities (controlled by the λ value) to spread the clusters but, as
with HaToM, the projection depicts tighter clusters with narrower responsibilities.
The important difference is that this algorithm is the only one so far to
project Classes 5 and 7 (blue squares and red triangles respectively) into separate
clusters.
5.2 Topology preservation
The rest of this chapter is dedicated to the comparison between SOM, ToPoE,
HaToM and ToNeGas. In this section we quantify how the topology is actually
preserved in these algorithms with two datasets, the algae dataset presented above,
and an extreme dataset in which we have only 40 samples of 3036 dimensionality.
We calculate measures of quantization and topology, and apply different techniques
Figure 5.5: Responsibilities for ToPoE, HaToM and ToNeGas for the algae data.
to study the results. For several of the results below we have utilised the SOM
Toolbox [62] with default values.
5.2.1 Experiment 1: Algae dataset
We apply a 10 × 10 mapping for all the algorithms (SOM, ToPoE, the Data-driven
version of HaToM, and ToNeGas) in order to compare them. We have already seen the
projections of the algorithms using the responsibilities. In the following subsections
we look at the responsibilities and at measures of clustering and topology preservation.
Responsibilities
The responsibilities of each latent point for each datapoint are shown in Figure
5.5. We can see how the responsibilities are spread over several neurons for all
datapoints in HaToM and ToNeGas. The responsibilities for ToPoE are really small
(notice the axis of the figure), which shows the high distances from the prototypes
to the datapoints. This is also reflected below in the U-Matrix distances range and
in the value of the Mean Quantisation Error. Only a few neurons have meaningful
responsibilities in this case, as is also depicted in the corresponding hit histogram.
Figure 5.6: Hit histogram and U-matrix for SOM with the algae data.
U-matrix, Hit Histograms and Distance Matrix
The U-Matrix assigns to each cell the average distance to all of its neighbors. This
enables the identification of regions of similarity using different colors for different
ranges of distances.
The hit histograms are formed by taking a data set, finding the Best Matching
Unit (BMU) of each data sample from the map, and increasing a counter in a map
unit each time it is the BMU. The hit histogram shows the distribution of the data
set on the map. Here, the hit histogram for the whole data set is calculated and
visualized with the U-matrix (Figures 5.6, 5.7, 5.8, 5.9).
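The two visualisations described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the SOM Toolbox implementation: the function names are hypothetical, the lattice is assumed rectangular with four neighbours per unit, and the units are assumed to be stored row by row.

```python
import math

def bmu_index(x, prototypes):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))

def hit_histogram(data, prototypes):
    """Count how many samples choose each map unit as their BMU."""
    hits = [0] * len(prototypes)
    for x in data:
        hits[bmu_index(x, prototypes)] += 1
    return hits

def u_matrix(prototypes, rows, cols):
    """Average data-space distance from each unit to its grid neighbours.

    Assumption: unit (r, c) is stored at index r * cols + c and has up to
    four neighbours (left, right, up, down).
    """
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(math.dist(prototypes[r * cols + c],
                                           prototypes[rr * cols + cc]))
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u
```

A unit with a large U-matrix value and no hits typically lies between clusters, which is how the two plots are read together.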
All the hit histograms and U-Matrices show good separations of the classes in
different areas of the grid, but while ToPoE uses only a few neurons to do so, HaToM
shows clusters spread all over the grid. Both SOM and ToNeGas show intermediate
cases.
In the surface plot of the distance matrix (Figure 5.10), both color and z-coordinate
indicate the average distance to neighboring map units. This is closely related to the
U-matrix.
Figure 5.7: Hit histogram and U-matrix for ToPoE with the algae data.
Figure 5.8: Hit histogram and U-matrix for HaToM with the algae data.
Figure 5.9: Hit histogram and U-matrix for ToNeGas with the algae data.
Figure 5.10: Distance matrix for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.
Figure 5.11: Quantization errors of the 118 data points for SOM, ToPoE, HaToM, and ToNeGas with the algae data.
The quality of the map
Any topology preserving map requires a few parameters (such as size and topology
of the map or the learning parameters) to be chosen a priori, and this influences the
quality of the mapping. Typically two evaluation criteria are used: resolution and
topology preservation. If the dimension of the data set is higher than the dimension
of the map grid, these usually become contradictory goals.
We first show the quantization error for each datapoint with the distance to its
Best Matching Unit (BMU) in Figure 5.11.
The mean quantization error qe is the average distance between each data vector
and its BMU. It measures the resolution of the mapping.
\[ q_e = \frac{1}{N} \sum_{i=1}^{N} \| x_i - m_{\mathrm{BMU}(i)} \| \tag{5.5} \]
The distortion measure which measures the deviation between the data and the
quantizers is defined as:
\[ E = \sum_{i=1}^{N} \sum_{k=1}^{K} h(\mathrm{BMU}(i), k) \, \| x_i - m_k \|^2 \tag{5.6} \]
We first calculate the total distortion for each unit, and then average for the
Table 5.2: Quantization error and topology preservation error with Topology-preserving Mappings for the algae data.
Algorithm                                        SOM      ToPoE    HaToM    ToNeGas
Mean Quantization Error                          0.0526   2.8445   0.0147   0.0162
Average total distortion for each unit (e+003)   0.0862   0.4530   0.2443   0.2071
Topology preservation error                      0.0593   0.6915   0.5000   0.4237
total number of neurons.
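As a minimal sketch of Equations (5.5) and (5.6), the two resolution measures can be computed as follows. The function names and the neighbourhood function `h` passed as an argument are assumptions made for illustration; the thesis results were obtained with the SOM Toolbox, not with this code.

```python
import math

def bmu_index(x, prototypes):
    """Index of the prototype closest to x (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))

def mean_quantization_error(data, prototypes):
    """Equation (5.5): average distance from each sample to its BMU."""
    total = sum(math.dist(x, prototypes[bmu_index(x, prototypes)]) for x in data)
    return total / len(data)

def distortion(data, prototypes, h):
    """Equation (5.6): neighbourhood-weighted sum of squared distances.

    h(b, k) is a caller-supplied (hypothetical) neighbourhood weight
    between the BMU b and unit k.
    """
    total = 0.0
    for x in data:
        b = bmu_index(x, prototypes)
        for k, m in enumerate(prototypes):
            total += h(b, k) * math.dist(x, m) ** 2
    return total
```

Dividing the result of `distortion` by the number of units gives an average per unit, which is one possible reading of the averaging step described above.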
Another important measure of the quality of the mapping is the topology preser-
vation te. In this case we calculate the topographic error, i.e. the proportion of all
data vectors for which first and second BMUs are not adjacent units.
\[ t_e = \frac{1}{N} \sum_{i=1}^{N} u(x_i) \tag{5.7} \]
where u(x_i) is equal to 1 if the first and second BMUs are not adjacent and 0 otherwise.
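The topographic error of Equation (5.7) can be sketched as below. The sketch assumes a rectangular lattice whose units are stored row by row, and treats two units as adjacent only when they are one horizontal or vertical step apart; the function names are hypothetical.

```python
import math

def two_best_units(x, prototypes):
    """Indices of the first and second BMU of x."""
    order = sorted(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))
    return order[0], order[1]

def topographic_error(data, prototypes, cols):
    """Equation (5.7): proportion of samples whose two BMUs are not adjacent.

    Assumption: unit index k sits at grid position divmod(k, cols), and
    adjacency means Manhattan grid distance 1 (diagonals do not count).
    """
    errors = 0
    for x in data:
        a, b = two_best_units(x, prototypes)
        ra, ca = divmod(a, cols)
        rb, cb = divmod(b, cols)
        if abs(ra - rb) + abs(ca - cb) != 1:
            errors += 1
    return errors / len(data)
```

On a 2 × 2 map whose prototypes follow the grid order the error is zero, while swapping two prototypes so that the map folds in data space makes some pairs of BMUs non-adjacent.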
The highest values of all the clustering errors are for ToPoE, which means that its
neurons are further away from the data. The topology error is also highest for that
algorithm. Both HaToM and ToNeGas have a lower mean quantisation error than
SOM. SOM, however, gives the lowest topology preservation error.
PCA projections
To project into a two dimensional space, we used the responsibilities $r_{ik}$, but a
principal component analysis can also be used, projecting both the clusters and the
data onto the first two eigenvectors of the data. Figure 5.12 shows the PCA
projection of the data with the same class symbols as in the previous experiments
with the algae data.
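The point of this projection is that the prototypes are centred with the mean of the data and projected with the eigenvectors of the data, so both live in the same plane. A minimal sketch, assuming a covariance matrix with a clear gap between its leading eigenvalues, is given below; the power iteration with deflation is a stand-in chosen to keep the example dependency-free, and all names are hypothetical.

```python
import math
import random

def top_two_eigenvectors(cov, iters=500, seed=0):
    """Two leading eigenvectors of a symmetric matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    d = len(cov)
    a = [row[:] for row in cov]
    vectors = []
    for _component in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _step in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(d) for j in range(d))
        vectors.append(v)
        # Deflation: remove the component just found before the next search.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return vectors

def pca_project(data, prototypes):
    """Project data and prototypes onto the first two eigenvectors of the data."""
    n, d = len(data), len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(d)]
    centred = [[x[j] - mean[j] for j in range(d)] for x in data]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(d)] for i in range(d)]
    e1, e2 = top_two_eigenvectors(cov)

    def project(p):
        c = [p[j] - mean[j] for j in range(d)]
        return (sum(c[j] * e1[j] for j in range(d)),
                sum(c[j] * e2[j] for j in range(d)))

    return [project(x) for x in data], [project(m) for m in prototypes]
```

The signs of the eigenvectors are arbitrary, so a plot may appear mirrored from one run or library to another without changing the relative positions.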
We can see in Figure 5.13 that the reason for the worse clustering in ToPoE is
the location of the neurons, which in some cases are projected outside the data area.
Figure 5.14 depicts the map formed with each algorithm.
Figure 5.12: The PCA projection for the algae data. Legend: one symbol per class, Classes 1 to 9 and Class 0.
5.2.2 Experiment 2: Gene dataset
We use a dataset containing the results of a high-throughput experimental technology
application in molecular biology (microarray data; footnote 2). The dataset contains
only 40 observations of high-dimensional data (3036 dimensions) and there are three
types of bladder cancer: T1, T2+ and Ta. The data is examined in the gene space (40
rows of 3036 variables). This is a demanding test for our algorithms because we have
very few observations of very high dimensionality. The dataset has been preprocessed
to have zero mean; also, in the original dataset some data was missing and these
values have been filtered out.
Projections in latent space
The three algorithms give a good visualisation of the three types of cancer in the
projection (see Figures 5.15, 5.16, and 5.17), but ToPoE needs to run for 100,000
iterations while the others need only 20 passes. The Data-driven version of
HaToM (Figure 5.16) gives a good projection but ToNeGas (Figure 5.17) separates
the clusters faster. In all cases, one datapoint of the Ta class (blue ’*’) is separated
from the others, and seems to be an outlier.
2 http://www.math.le.ac.uk/PEOPLE/ag153/homepage/PrincManLeicAug2006.htm
Figure 5.13: PCA projection of the datapoints (in blue) and the prototypes (in red) for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.
Figure 5.14: Colored PCA projection of the prototypes for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.
Figure 5.15: The ToPoE projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.
Figure 5.16: The HaToM projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.
Figure 5.17: The ToNeGas projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.
Responsibilities
The responsibilities of each latent point for each datapoint are shown in Figure
5.18. The responsibilities are much smaller for ToPoE, meaning that the prototypes
are further away from the datapoints and/or that more prototypes are responsible
for each datapoint (but prototypes localised in the same region of the map). In both
HaToM and ToNeGas the responsibilities are less widely spread between the prototypes
and of higher value, which reflects the mixture of experts model applied
and the closer proximity to the datapoints.
U-matrix, Hit Histograms and Distance Matrix
The hit histogram for the whole data set is calculated and visualized with the
U-matrix (Figures 5.19, 5.20, 5.21, and 5.22). Again we can see a good separation of
classes in the hit histogram, with the corresponding distances in data space shown
in the U-Matrix.
The surface plot of the distance matrix is shown in Figure 5.23.
The quality of the map
The quantization errors for the gene dataset are shown in Figure 5.24. ToNeGas
clearly reduces the values further for most of the datapoints.
Figure 5.18: Responsibilities for ToPoE (top left), HaToM (top right), and ToNeGas (bottom) with the gene data.
Figure 5.19: Hit histogram and U-matrix for SOM with the gene data.
Figure 5.20: Hit histogram and U-matrix for ToPoE with the gene data.
Figure 5.21: Hit histogram and U-matrix for HaToM with the gene data.
Figure 5.22: Hit histogram and U-matrix for ToNeGas with the gene data.
Figure 5.23: Distance matrix for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.
Figure 5.24: Quantization errors of the 40 data points for SOM, ToPoE, HaToM, and ToNeGas with the gene data.
In this case, to analyse the influence of each clustering technique on its own, we
also calculate the quantization error of K-Means, KHM, and Neural Gas with
this dataset and 12 prototypes (see Table 5.3). We can clearly see the improvement
achieved by NG. KHM gives a larger value, which shows the need to apply
the generalised version when using high-dimensional data. Zhang suggested using
$p = \sqrt{d}$, where d is the dimensionality of the dataset [91, 92]. In this experiment,
however, it was enough to use p = 3 for HaToM.
Table 5.3: Quantization error with clustering techniques for the gene data.
Algorithm                 K-Means   KHM       Neural Gas
Mean Quantization Error   22.8778   31.7438   16.0784
The te does not consider diagonal neighbors, thus the hexagonal case always gives
lower values of te, due to its six neighbors for each unit in comparison to the four in
the rectangular mapping. We are using a rectangular lattice for all algorithms, so
in order to see the right topology error considering all neighbors we need a different
topographic error, such as the Alfa error [2]. The formula for the Alfa error is as
follows:
\[ \mathrm{Alfa} = \frac{1}{N} \sum_{i=1}^{N} \alpha(x_i) \tag{5.8} \]
where α(x_i) is equal to 1 if the first and second BMUs are neither adjacent nor diagonal neighbours
Table 5.4: Quantization error and topology preservation error with Topology-preserving Mappings for the gene data.
TPM                                           SOM      ToPoE    HaToM    ToNeGas
Mean Quantization Error                       11.881   22.410   22.308   8.843
Mean total distortion for each unit (e+003)   0.597    0.892    1.357    0.851
Topology preservation error                   0.025    0.250    0.750    0.200
Alfa error                                    0        0.025    0.675    0.1
and 0 otherwise. This error gives lower values than te for all the algorithms (see
Table 5.4).
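The only change from te to the Alfa error is the definition of neighbourhood on the lattice. A minimal sketch, assuming a rectangular grid stored row by row and a hypothetical function name, uses the Chebyshev grid distance so that the eight surrounding units all count as neighbours.

```python
import math

def alfa_error(data, prototypes, cols):
    """Equation (5.8): proportion of samples whose first and second BMUs
    are neither adjacent nor diagonal neighbours on the grid.

    Assumption: unit index k sits at grid position divmod(k, cols).
    """
    errors = 0
    for x in data:
        order = sorted(range(len(prototypes)),
                       key=lambda k: math.dist(x, prototypes[k]))
        r1, c1 = divmod(order[0], cols)
        r2, c2 = divmod(order[1], cols)
        # Chebyshev grid distance 1 covers the 8 surrounding units.
        if max(abs(r1 - r2), abs(c1 - c2)) != 1:
            errors += 1
    return errors / len(data)
```

Because every pair of units that is adjacent in the four-neighbour sense is also a neighbour here, this measure can never exceed te on the same map.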
As with the previous data set, ToNeGas gives a lower mean quantisation error than
SOM, and here it is the lowest of all the algorithms; thus the mapping is more flexible
and adjusts better to the data, as seen also in the distance matrix (Figure 5.23). The
topology preservation in this case is however similar for ToPoE and ToNeGas. The Alfa
measure reduces the values for topology error in all the algorithms.
PCA projections
With this dataset, all algorithms have neurons projected within the data area (Figure
5.25). The two dimensional grid is similar for both ToNeGas and ToPoE (Figure
5.26).
5.3 Experiment Comparisons
This section summarises the experiments carried out with the same datasets for all
the algorithms in the previous chapters, and the last two in this chapter. We have
seen that when one needs to impose the model and find a smooth manifold, ToPoE
and M-HaToM are the best choices. This was shown with the one-dimensional data
and the animals dataset. On the other hand, D-HaToM and ToNeGas are the best
options for clustering. The generalised version of D-HaToM with a suitable value
of p should be adequate for a high-dimensional dataset, but ToNeGas would do a
good job as well and in a shorter time (Table 5.1). G-HaToM is able to construct
a topology preserving mapping with fewer nodes and can also be useful when further
investigation of subclasses is required (Algae data).
Figure 5.25: PCA projection of the datapoints (in blue) and the centers (in red) for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.
Figure 5.26: Colored PCA projection of the prototypes for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.
ToPoE was not always able to locate the prototypes in data space near the data
(e.g. four clusters data). The more restrictive structure of the prototypes in this
model positions them along a continuous shape, while D-HaToM and ToNeGas locate
them near the data and not in between clusters (four clusters and FCPS data). We have
to take into account, however, that most of the experiments were carried out with
clustering data, which is thus more suitable for D-HaToM and ToNeGas. The projections
were nevertheless satisfactory, with a proper separation of clusters (Algae data),
thanks to the visualisation system used; this system is based on responsibilities that
take into account the distances between the datapoints and the prototypes. A
visualisation process similar to that of SOM, where the datapoints are just projected
to the nearest node, would have failed in this task.
In some cases ToPoE and HaToM required the use of local kernels, which give
weights only to the nearest prototypes in the visualisation process (Hepta and Target
datasets). Those kernels were also a good trick for M-HaToM to project outliers in
the target data.
We have also studied the possibility of tightening the projections by:
• applying Model-driven algorithms like ToPoE or M-HaToM (Algae data),
• reducing the width of the responsibilities (four clusters with D-HaToM).
An investigation of the magnification factors proved to be a useful tool to in-
terpret the projections found by the algorithm, and to estimate the real relative
distances in data space.
Finally we analysed in more detail clustering and topology preservation, finding
adequate values of MQE and topology errors, especially with ToNeGas.
5.4 Comparison of the algorithms
In this family of algorithms, as in the basic GTM, the algorithm assumes
hyper-spherical, Gaussian clusters, whose flexibility and accuracy in approximating
the data can be traded off by adjusting the parameters [64]: the flexibility should not
be too high, to prevent over-fitting, and a high accuracy will increase the
computational cost. For the SOM there is no clear separation between the parameters
that control both properties. For the GTM and our algorithms, these are the number
of latent points, the width of the Gaussian functions, and the width of the
responsibilities. The number of latent points, as in the GTM, controls the accuracy of the
clustering; the width of the Gaussian functions controls the topology preservation of
the clusters; the $r_{ik}$ are not calculated until the end of the algorithm (in all except
ToPoE, which is more similar to GTM), to get the projection into latent space, but
their width controls the ability of the algorithm to separate the clusters, especially
for the D-HaToM. The number of basis functions M controls, in all of them, the
flexibility in the shape of the manifold.
As a visualisation technique these algorithms have one advantage over the stan-
dard SOM: the projections of the data onto the grid need not be solely to the grid
nodes. If we project each data point to that node which has highest responsibility
for the data point, we get a similar quantisation to that of the SOM. However if we
project each data point $x_i$ onto $\sum_k r_{ik} t_k$, we get a mapping onto the manifold at
intermediate points.
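The two projections just described differ only in how the responsibilities of one data point are used. The following sketch assumes that the responsibilities of each data point are normalised to sum to one and that the latent points $t_k$ are given as coordinate tuples; the function names are hypothetical.

```python
def project_hard(r_i, latent_points):
    """Latent point with the highest responsibility (SOM-like quantisation)."""
    k = max(range(len(r_i)), key=lambda j: r_i[j])
    return latent_points[k]

def project_soft(r_i, latent_points):
    """Responsibility-weighted mean of the latent points: sum_k r_ik * t_k."""
    dim = len(latent_points[0])
    return tuple(sum(r * t[j] for r, t in zip(r_i, latent_points))
                 for j in range(dim))
```

A data point shared equally between two neighbouring latent points is placed halfway between them by `project_soft`, whereas `project_hard` has to choose one of the two nodes.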
5.4.1 Growing and Pruning
Another advantage of this family of topology preserving mappings is that we can
easily grow a net: we train a net with a small number of latent points and then
increase the number of latent points. Thus we have to recalculate the Φ matrix but
need not change the W matrix of weights which can simply continue to learn from
its current values.
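Growing works because W has one row per basis function, not per latent point, so its shape does not depend on K. A minimal sketch, assuming Gaussian basis functions with fixed centres and a common width (the names `phi_matrix`, `map_prototypes`, `centres` and `sigma` are illustrative), is:

```python
import math

def phi_matrix(latent_points, centres, sigma):
    """K x M matrix of Gaussian basis activations, one row per latent point."""
    return [[math.exp(-math.dist(t, mu) ** 2 / (2.0 * sigma ** 2)) for mu in centres]
            for t in latent_points]

def map_prototypes(phi, w):
    """m = Phi W: map every latent point into data space (K x D result)."""
    return [[sum(p * w_row[d] for p, w_row in zip(phi_row, w))
             for d in range(len(w[0]))]
            for phi_row in phi]

# Growing: only Phi is rebuilt for the larger set of latent points;
# the M x D weight matrix W is reused unchanged.
centres = [(-1.0,), (1.0,)]
w = [[0.0, 0.0], [2.0, 2.0]]
small = [(-1.0,), (1.0,)]
grown = [(-1.0,), (0.0,), (1.0,)]
m_small = map_prototypes(phi_matrix(small, centres, 1.0), w)
m_grown = map_prototypes(phi_matrix(grown, centres, 1.0), w)
```

Latent points that are present both before and after growing keep exactly the same prototypes, so training can simply continue from the current state.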
Equally we may question the completed map to investigate whether any latent
point is being mapped to a part of the data space which has no data nearby. If a
latent point does not have the greatest responsibility for any data point, it can be
deleted from the map. This technique is illustrated in Figure 5.27. In each diagram
the red ’+’s show the positions of the data points: the data consists of 4 distinct
clusters. The trained map is shown on the left: the projections of the 20 latent
points map cover the data set but some are placed in positions in which there is
no data for which they need take responsibility. Such points are excluded and the
map continues to learn to get the situation in the right diagram: only 10 latent
points remain. It must be emphasised that we do not alter the positions of either
the latent points (in latent space) or the basis functions when we continue training.
These remain at their original situations.
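The pruning rule can be sketched directly from the responsibilities. The function name is hypothetical, and `responsibilities[i][k]` is assumed to hold $r_{ik}$, the responsibility of latent point k for data point i.

```python
def surviving_latent_points(responsibilities):
    """Indices of latent points with the greatest responsibility for at
    least one data point; all other latent points can be pruned."""
    winners = {max(range(len(r_i)), key=lambda k: r_i[k])
               for r_i in responsibilities}
    return sorted(winners)
```

The returned indices select which latent points (and the matching rows of Φ) are kept; the positions of the kept latent points and of the basis functions are left untouched, as stated above.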
5.4.2 One-to-one comparisons
The following comparisons between algorithms are summarised in Table 5.5.
Figure 5.27: In both diagrams the data set is shown by the red ’+’s. Left: the projections of the 20 latent points are shown with ’*’s. Right: after pruning, the 10 remaining latent points may continue training.
Table 5.5: Properties of each of the algorithms
Columns: Clustering | Insens. Initial. | Preserves Topology | Cost function | No Need No. clusters | Separates clust.&proj.
Kmeans: √ √
KHM: √ √ √
NG: √ √ √
SOM: √ √ √
GTM: √ √ √ √
ToPoE: √ √ √ √
HaToM: √ √ √ √ √ √
ToNeGas: √ √ √ √ √ √
ToPoE vs SOM
ToPoE uses the responsibilities within the learning rule to update the prototypes
based on the distances between prototypes and datapoints. The topology is pre-
served through the non-linear projection of those prototypes. SOM on the other
hand keeps the preservation using the distances between latent points in the output
layer in the learning rule.
HaToM vs K-Harmonic Means
The main difference between K-Harmonic Means (KHM) and HaToM is the projection
of the clusters into a non-linear mapping which represents the main properties
of the clusters, such as distances, and makes it easier to visualise them [17]. Thus,
HaToM has all the advantages of KHM, and also the topology preservation and
visualisation of the GTM.
Another important property of the HaToM method is that it is not necessary to
determine the number of clusters a priori, as in K-Means and KHM; a stopping
criterion for the growing can be applied (and has been shown to work [71]), such as a
small change in Mean Quantisation Error.
HaToM vs SOM
KHM is less prone to being trapped in a local minimum due to the continuous movement
of weights. The visualisation in HaToM can be done by projecting directly to the
winning neurons as in SOM, or by using responsibilities as in the GTM. SOM has a single
learning rule for clustering and topology preservation, while HaToM is composed of
an inner loop for the clustering, plus an outer non-linear projection applied every
time a new latent point is added. The latter can be included in the inner loop if a
model-driven version is desired.
HaToM vs GTM
HaToM shares the structure of the GTM, with a latent space projected non-linearly
through a feature space into data space. But the HaToM has a cost function based
on the K-Harmonic means which is optimised with gradient descent. This provides
the algorithm with a strong clustering supported with a good visualisation given by
the latent space. GTM requires a careful initialisation to self-organise [46]. HaToM
overcomes this problem using Harmonic Means. The separation of clustering and
topology preservation is again unique for HaToM.
HaToM vs ToPoE
HaToM was initially inspired by ToPoE but with a clustering-focused purpose. Both
HaToM and ToPoE have a similar structure to GTM: a non-linear projection from
latent space to data space. ToPoE starts with a product of experts that adapts
progressively to the data during training, becoming more similar to the mixture of
experts of GTM as the responsibilities sharpen and some latent points lose
responsibility for some data points. Gradient descent optimises the process,
in contrast to EM for GTM. HaToM on the other hand gets its insensitivity to
initialisation from the K-Harmonic Means, which clusters the prototypes in
data space.
The main difference in visualisation is that the responsibilities in HaToM are only
calculated once the clustering is done, just to project the data to latent space, while
for ToPoE this calculation is done on every iteration. In the HaToM algorithm no
responsibility is fixed, so that it is the data that determines which experts are
responsible and which are not. Both HaToM and ToPoE permit the data to control the
responsibilities of each expert, but while ToPoE gives equal responsibilities to all
when a datapoint has no expert, HaToM, due to the K-Harmonic clustering, moves the
$m_k$ prototypes more freely to the data (the soft assignment of membership allows
for a continuous transition of prototypes between areas of high density of data), and
thus there is no data point for which no prototype takes responsibility.
ToPoE, like the GTM, has a fixed structure in which the mk prototypes have
limited movement in the manifold, always responding to the position of the nearest
prototypes. In HaToM however, the mk prototypes are much more flexible and can
move towards the clusters in data space, while still keeping a smooth manifold by
making the W and mk follow the latent space model, especially in the M-HaToM.
The big advantage of ToPoE though is that it is not necessary to grow the
network, and using the same number of prototypes that HaToM finally reaches
gives similar results to those of the growing HaToM. However, it has to be noted
that for the high dimensional cases presented in this thesis, ToPoE required 100,000
iterations, while HaToM and ToNeGas showed good results with just 20.
As we have seen, the main difference between the two HaToM models is the
updating of W and the $m_k$ prototypes. In D-HaToM this is only done when we grow
the latent space with another point K, so that the data conveys the $m_k$ prototypes
to the clusters. In M-HaToM this updating is done in every iteration, so that
the $m_k$ prototypes are forced to follow the model more strictly and therefore give
a smoother manifold; this forces the $m_k$ prototypes to be outside the clusters when
the data is not continuous (see Section 4.3.2), but still gives a perfect clustering in
the latent space. The Model-driven version is thus closer to ToPoE.
HaToM vs ToNeGas
As we saw in Section 5.1.2, ToNeGas reduces the mean quantization error faster and
to a lower value. In the experiments tried, this algorithm showed some of the best
results. The clustering properties are similar though: both HaToM and ToNeGas
are insensitive to the initialisation of the prototypes. The former uses the continuous
movement of prototypes between areas of high density of data, while the latter makes
use of the temperature value to allow for changes between “valleys” at the beginning,
looking for the global minimum, and reduces the temperature progressively to keep the
values around the deepest valley.
The boosting-like properties of the generalised version of KHM could indicate
that HaToM would be better for difficult data, but so far, and probably thanks
to the greater and faster reduction of the Mean Quantisation Error, ToNeGas has
proved to be at least as good as HaToM with high dimensional and complex data.
ToNeGas vs K-Means
ToNeGas gets its clustering properties from Neural Gas, so comparing ToNeGas
and K-Means clustering is like comparing NG and K-Means. The Neural-Gas algo-
rithm is quite similar to the K-Means clustering method, but with K-Means there
is no neighborhood involved. Instead only the winner is updated. K-Means is then
a winner takes all method, or hard competitive learning, whereas the Neural-Gas
method belongs to the soft competitive learning or winner takes most methods. We
have seen that K-Means is susceptible to the initial position of the weights, while
“the location of the Neural-Gas nodes after a few iterations is independent from the
initialization, due to the very fast gas-like movement of the nodes in the beginning”
[75].
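The contrast between the two update schemes can be made concrete with a single online step. This is a sketch under assumptions: K-Means is shown in its online (one sample at a time) form, the Neural Gas weighting is the usual rank-based exponential exp(-rank/λ), and the names `eps` (learning rate) and `lam` (neighbourhood range) are chosen here for illustration.

```python
import math

def kmeans_online_step(x, prototypes, eps):
    """Winner takes all: only the closest prototype moves towards x."""
    win = min(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))
    new = [list(m) for m in prototypes]
    new[win] = [m + eps * (xi - m) for m, xi in zip(prototypes[win], x)]
    return new

def neural_gas_step(x, prototypes, eps, lam):
    """Winner takes most: every prototype moves, weighted by its distance rank."""
    order = sorted(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))
    new = [list(m) for m in prototypes]
    for rank, k in enumerate(order):
        h = math.exp(-rank / lam)  # 1 for the winner, decaying with rank
        new[k] = [m + eps * h * (xi - m) for m, xi in zip(prototypes[k], x)]
    return new
```

With a large `lam` at the start of training all prototypes follow the data, which gives the gas-like movement quoted above; as `lam` shrinks, the step approaches the winner-takes-all update.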
The Neural-Gas algorithm reaches faster convergence with similar clustering re-
sults and minimizes a global cost function. The reason for the faster convergence in
NG is that the oscillations of the cost functions are small.
The other ToNeGas properties, topology preservation and dimensionality reduc-
tion, are not included in K-Means.
ToNeGas vs SOM
It is not possible to define a cost function for the Kohonen network, while there is a
well-defined cost function for the Neural-Gas algorithm [57]. The learning procedure
of Neural Gas resembles the simulated annealing approach [75], which is a well-known
global optimization technique. Decreasing the temperature-like decay variable helps to
escape from local minima at the beginning, when the temperature is high, updating the
weights gradually towards the global minimum.
If two nodes are close in a SOM map, they have higher probability of being
similar. In NG this kind of neighborhood information is not available. However,
combined with projection methods, like the GTM structure, the ToNeGas algorithm
has at least equal visualization power to the Kohonen maps.
The main difference between NG and SOM is the neighborhood concept: in the
Kohonen algorithm, the weights of the nodes are adapted based on the distance of
the nodes to the BMU in latent space, whereas in NG the distances are calculated in
data space, and this is used simply to give weights to each neuron in the updating of
the prototypes for a particular datapoint, without considering topology preservation.
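The difference in the neighbourhood concept can be seen by computing the update weights of both methods for the same data point. The sketch below assumes a Gaussian neighbourhood on the grid for the SOM and a rank-based exponential for NG; the function names and the parameters `sigma` and `lam` are illustrative.

```python
import math

def som_weights(x, prototypes, grid, sigma):
    """SOM: Gaussian of the latent-space (grid) distance to the BMU."""
    bmu = min(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))
    return [math.exp(-math.dist(grid[k], grid[bmu]) ** 2 / (2.0 * sigma ** 2))
            for k in range(len(prototypes))]

def ng_weights(x, prototypes, lam):
    """NG: exponential of the data-space distance rank; no grid is involved."""
    order = sorted(range(len(prototypes)), key=lambda k: math.dist(x, prototypes[k]))
    weights = [0.0] * len(prototypes)
    for rank, k in enumerate(order):
        weights[k] = math.exp(-rank / lam)
    return weights
```

For a map that is folded in data space, the SOM still favours the grid neighbour of the BMU even if it is far from the data point, while NG favours whichever prototype is actually second closest.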
The separation of the topology preservation step from the clustering allows for
a faster clustering, while the topology preservation is still kept at reasonable levels.
ToNeGas vs GTM and ToPoE
ToNeGas, ToPoE and GTM share the underlying structure, but ToNeGas is faster
due to the gas-like clustering. As in HaToM, the continuous transition of prototypes
between clusters due to the soft competition, gives those prototypes in this algorithm
freedom to find the clusters faster, compared with ToPoE.
5.5 Benefits from separating clustering from pro-
jection
As seen in this thesis, the two newly developed algorithms (HaToM and ToNeGas)
have in common the separation of the clustering and projection steps. This has a
number of benefits, detailed below:
• the clustering and topology preservation efficiency do not interfere with each
other, as occurs when both are in the same learning rule.
• using the same underlying structure, we can use several clustering techniques,
each of which is better suited to different situations.
• we have greater control of the parameters, since we can identify which one
influences each property of the algorithm.
• we can give more or less weight to the clustering or the non-linear projection
by
1. inserting the projection in the inner loop or not.
2. giving more or fewer iterations to the clustering step.
The non-linear projection is also responsible for the topology preservation, as
long as the mapping function m(x; w) is smooth and continuous [13]. Separating
clustering from topology preservation gives us the choice of whether to
give more weight to the reduction of the Mean Quantisation Error or to
the reduction of the Topology Error.
• depending on the above, we can make the algorithm more or less sensitive to
outliers, according to the objectives.
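The separation of the two steps listed above can be sketched as an outer projection loop around an inner clustering loop. This is only an illustration under stated assumptions: `kmeans_step` is a plain batch K-Means update standing in for K-Harmonic Means or Neural Gas, the regularised least-squares refit of W and the `ridge` value are assumed, and the names `rbf_design` and `fit` are hypothetical.

```python
import numpy as np

def rbf_design(latent, centres, width):
    """K x M matrix of Gaussian basis activations for K latent points."""
    d2 = ((latent[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def kmeans_step(X, m):
    """One batch K-Means update (stand-in for the inner clustering rule)."""
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_m = m.copy()
    for k in range(m.shape[0]):
        members = X[labels == k]
        if len(members):
            new_m[k] = members.mean(axis=0)
    return new_m

def fit(X, Phi, W, n_outer=10, n_inner=5, ridge=1e-3):
    """Outer loop: projection. Inner loop: clustering without neighborhood."""
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    for _ in range(n_outer):
        m = Phi @ W                        # project latent points to data space
        for _ in range(n_inner):           # clustering only
            m = kmeans_step(X, m)
        W = np.linalg.solve(A, Phi.T @ m)  # refit the smooth mapping
    return W, Phi @ W
```

Raising `n_inner` gives more weight to the clustering (lower quantisation error), while lowering it keeps the prototypes closer to the smooth mapping and so favours topology preservation.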
HaToM and ToNeGas have been developed in a growing version. This has
advantages [29] such as
• the possibility of using error measures to locate new nodes.
• the possibility of using a performance criterion to stop growing the map, thus
having fewer parameters to tune.
• the possibility of pruning the prototypes that have no great responsibility, as
seen above.
Growing does not prevent continuous training: W is not re-initialised at random
each time we increment K. Instead, its previous value is used in the next non-linear
projection m = φW, where φ now includes the new number of latent points.
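A small sketch may make the warm start concrete. It assumes that φ is a K x M matrix of Gaussian basis activations and W an M x D weight matrix, so that only φ changes shape when K grows; the helper names, the 1-D latent grid and the numerical values are illustrative assumptions.

```python
import numpy as np

def rbf_design(latent, centres, width):
    """K x M matrix of Gaussian basis activations for K latent points."""
    d2 = ((latent[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def prototypes_for(K, W, centres, width=0.5):
    """Project a 1-D latent grid of K points to data space with a fixed W."""
    latent = np.linspace(-1.0, 1.0, K)[:, None]
    Phi = rbf_design(latent, centres, width)   # shape (K, M): grows with K
    return Phi @ W                             # shape (K, D)

rng = np.random.default_rng(0)
centres = np.linspace(-1.0, 1.0, 5)[:, None]   # M = 5 basis centres (assumed)
W = rng.normal(size=(5, 3))                    # M x D, reused at every growth step

# Growing K changes only Phi; the same W defines the same underlying mapping.
grown = {K: prototypes_for(K, W, centres) for K in (4, 8, 16)}
```

Because W is kept, latent points that exist at two growth stages (here the end points of the grid) map to the same place in data space, so training continues from the previous mapping rather than restarting.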
ToPoE does not separate clustering from topology preservation, but shares a common
non-linear projection with HaToM and ToNeGas. Having information about
the data is always a plus: the more we know about the data, the better we can
choose the algorithm needed. In this case, having a whole family of related
algorithms with different properties allows a wider selection. Both ToPoE and
M-HaToM should be selected when the model needs to be reinforced and for
continuous distributions. D-HaToM and ToNeGas are better suited to discrete data
and to cases where convergence time is a constraint.
5.6 Conclusions
The first part of this Chapter introduced the last of the algorithms sharing the GTM
structure, the Topographic Neural Gas, which makes use of the Neural Gas technique.
In ToNeGas, as well as in HaToM, the projection to a lower-dimensional space and
the clustering technique are not included in the same learning rule, which gives
more control over the parameter selection and the accuracy of the results. This
algorithm is faster than the previous HaToM.
In the second part we compared the three algorithms, ToPoE, HaToM and
ToNeGas, with each other and with standard topology preserving maps such as SOM.
Firstly we evaluated the clustering and topology preservation using two
different datasets. Then we summarised all the experimental results. Finally
we made a one-to-one comparison of the different models. We also enumerated the
benefits of separating clustering and projection into two steps, and of having a
family of related algorithms.
Chapter 6
Summary and Future work
6.1 Summary
We have presented a family of topology preserving mappings based on the
Generative Topographic Mapping. ToPoE, HaToM and ToNeGas share a common
projection technique from latent space to data space, which allows for the
preservation of topology and gives a visualisation of the data space in two or three
dimensions. The visualisation model is also common to all the algorithms, and is
based on a continuous projection to the output space.
ToPoE includes a Product of Experts model, with responsibilities embedded in the
learning process. The stronger imposition of the model in this algorithm makes it
more suitable for manifold searching cases.
HaToM and ToNeGas are also based on the same underlying model as GTM
and ToPoE, so they also keep the topology in the output space. The main new
property of these algorithms is the separation of the clustering and projection steps.
This provides a number of advances, including the development of two versions, for
model or data reinforcement. Both K-Harmonic Means and Neural Gas overcome
the local minima curse by allowing hill-climbing changes in the gradient descent
optimisation.
6.2 Major contributions of this thesis
We have further investigated ToPoE’s theory, including the change of local variance,
the magnification factors, and the projection model.
We introduced two new algorithms that belong to the same family of models
with additional properties and more control of the parameter tuning thanks to the
separation of clustering and topology preservation steps.
We have compared the algorithms on different artificial and real datasets, finding
the weaknesses and strengths of each, and the applications for which each is more
suitable, whether clustering or modelling of the distributions. We have seen how the
use of responsibilities, as in the GTM algorithm, allows us to project in a continuous
way, instead of having discrete projections exclusively onto the latent point positions
in the prefixed output space. We also noticed that too many small responsibilities
from latent points far away from the data can sometimes project some of the
datapoints to the wrong side of the latent space. This can be overcome by pruning, or
by using kernels such as the Tri-cube or Epanechnikov, which consider only points
within a certain distance of the datapoint. This is particularly true for ToPoE, which
tends to have few high responsibilities and a large number of latent points with very
small responsibility for all datapoints; the projections are nevertheless adequate due
to the characteristics of the visualisation process. In HaToM and ToNeGas the
corresponding clustering technique locates the prototypes near the data faster, so
their responsibilities are high for some of the datapoints and small for others.
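The effect of a compact-support kernel on the continuous projection can be sketched as follows. The responsibilities here are simple normalised kernel weights of the data-space distances, which is an assumption made for illustration; the function names, the `bandwidth` parameter and the nearest-prototype fallback are likewise hypothetical.

```python
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2)

def epanechnikov(u):
    # Zero outside |u| <= 1, so distant prototypes get no weight at all.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def tricube(u):
    return np.where(np.abs(u) <= 1.0, (1.0 - np.abs(u) ** 3) ** 3, 0.0)

def project(X, prototypes, latent, kernel=gaussian, bandwidth=1.0):
    """Project each datapoint as the responsibility-weighted mean of latent points."""
    d = np.sqrt(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2))
    r = kernel(d / bandwidth)
    # Assumed fallback: if no prototype is within reach, use the nearest one.
    rows = np.flatnonzero(r.sum(axis=1) == 0)
    r[rows, d[rows].argmin(axis=1)] = 1.0
    r = r / r.sum(axis=1, keepdims=True)
    return r @ latent
```

With the Gaussian kernel every latent point contributes a little to every projection, whereas the Epanechnikov and Tri-cube kernels remove the contribution of prototypes further away than `bandwidth`.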
We have investigated their clustering and their preservation of the topology nu-
merically and in comparison with SOM, one of the most used topology preserving
maps.
6.3 Future Work
The algorithms presented in this thesis could be extended in several ways:
• In Neural Gas the rank of the distances from the neurons to the datapoints is
used for the updating of the prototypes. The temperature value gives more or
less weight to the influence of the prototypes based on their rank order. In
a similar way ToPoE makes use of the responsibilities to update within the
learning rule. We have so far used a fixed value for the width but, as in NG,
a decreasing parameter proportional to the rank order could be used, so that
higher values in the first stages allow for hill-climbing, while the reduced
values towards the end of the training secure the convergence to the
closest minimum.
• We could also use the local variance in ToPoE to see how good the mapping
is. In between clusters the variance should be more spread, while intra-cluster
cases should lead to sharper variance.
• We have used the underlying structure of the GTM as the projection technique
to preserve the topology. The mapping, however, is usually two dimensional, so
the topology is only truly preserved when the manifold of the data is
also two dimensional. In the literature review, we reviewed an alternative
model for this purpose, used in the Topology Representing Networks: Competitive
Hebbian learning. This technique finds the true topology of the manifold (once
Neural Gas has positioned the prototypes within the clusters), without assuming
any prefixed dimensionality or topology of the manifold. This interesting technique
could be used in combination with other clustering techniques, such as K-Means
and K-Harmonic Means, to form new Topology Preserving Mappings.
• We could also grow the mapping in the same way as GNG, with the location
of the nodes depending on the error of the map, instead of having a fixed
structure.
• As proposed for GTM [80], Bayesian techniques and/or Cross-validation could
be applied for parameter selection in any of these new models.
• Chang [15, 16] uses a spherical manifold with PPS, reducing the overlapping
of clusters. This and other arrangements of the latent points in two or three
dimensional shapes could be applied to the TPM algorithms.
• We could introduce a “conscience” term so that frequent winners get a “bad
conscience” for winning so often.
• Chang [15, 16] developed the Probabilistic Principal Surfaces, which are closely
related to the GTM. Thus, an interesting study would be the relationship
between our algorithms and Principal Curves/Surfaces.
• The author has a background in Marine sciences, particularly in Fisheries
acoustics. One of the future tasks will be the application of TPM and other
data mining techniques to marine (oceanographical and fisheries) data.
• Applications to other real data and to other tasks such as forecasting or
regression clustering (see [94]).
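The Competitive Hebbian learning rule mentioned in the list above can be sketched in a few lines: for each datapoint, the two closest prototypes are connected. This is a minimal batch illustration under the assumption that the prototypes have already been positioned by some clustering technique; the function name is hypothetical and edge ageing, as used in growing variants, is omitted.

```python
import numpy as np

def competitive_hebbian_edges(X, prototypes):
    """Connect, for every datapoint, its two nearest prototypes."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    two_nearest = np.argsort(d2, axis=1)[:, :2]
    # Store each undirected edge once, as a sorted pair of prototype indices.
    return {tuple(sorted(pair)) for pair in two_nearest.tolist()}

# Toy example: data on a line segment, four prototypes along it.
X = np.stack([np.linspace(0.0, 3.0, 31), np.zeros(31)], axis=1)
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
edges = competitive_hebbian_edges(X, prototypes)
```

For data lying on a line the recovered graph is a chain, so the connectivity follows the dimensionality of the data rather than that of a prefixed latent grid.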
Bibliography
[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of
distance metrics in high dimensional space. Lecture Notes in Computer Science,
1973:420–434, 2001.
[2] E. Arsuaga-Uriarte and F. Díaz-Martín. Topology preservation in SOM.
Transactions On Engineering, Computing And Technology, 15:1305–5313, 2006.
[3] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding.
In Bay Area Theory Symposium, BATS 06, 2006.
[4] P. Berkhin. Survey of clustering data mining techniques. Technical report,
Accrue Software, 2002.
[5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest
neighbor” meaningful? Lecture Notes in Computer Science, 1540:217–235,
1999.
[6] Neural Computing Research Group. Aston University. Birmingham. Gtm tool-
box. http://www.ncrg.aston.ac.uk/gtm/.
[7] Neural Computing Research Group. Aston University. Birmingham. Netlab
toolbox. http://www.ncrg.aston.ac.uk/netlab/index.php.
[8] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: The generative
topographic mapping. Neural Computation, 10(1):215–234, 1998.
[9] C. M. Bishop, M. Svensen, and C. K. I. Williams. Magnification Factors for
the GTM Algorithm. In Proceedings of the IEE 5th International Conference
on Artificial Neural Networks, Cambridge, U.K., 64-69P, 1997.
[10] C. M. Bishop, M. Svensen, and C. K. I. Williams. Magnification factors for the
gtm algorithm. Technical Report NCRG/97/006, Neural Computing Research
Group, Aston University, 1997.
[11] C. M. Bishop, M. Svensen, and C. K. I. Williams. Developments of the gener-
ative topographic mapping. Neurocomputing, 21(1):203–224, 1998.
[12] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford:Clarendon
Press, 1995.
[13] C.M. Bishop. Latent variable models. Learning in Graphical Models, MIT
Press, pages 371–403, 1999.
[14] M. A. Carreira-Perpinan. A review of dimension reduction techniques. Technical
Report CS-96-09, Dept. of Computer Science. University of Sheffield, 1997.
[15] K. Chang. Nonlinear Dimensionality Reduction Using Probabilistic Principal
Surfaces. PhD thesis, The University of Texas at Austin, Department of Elec-
trical & Computer Engineering, May 2000.
[16] K. Chang and J. Ghosh. A unified model for probabilistic principal surfaces.
IEEE Trans. Pattern Anal. Mach. Intell., 23(1):22–41, 2001.
[17] S. Dasgupta. Experiments with random projection. In UAI ’00: Proceedings
of the 16th Conference on Uncertainty in Artificial Intelligence, pages 143–151,
San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[18] M. Daszykowski, B. Walczak, and D. L. Massart. On the optimal partitioning
of data with k-means, growing k-means, neural gas, and growing neural gas. J.
Chem. Inf. Comput. Sci, 42(6):1378 – 1389, 2002.
[19] P. Delicado. Another look at principal curves and surfaces. Journal of Multi-
variate Analysis, 77:84–116, 2001.
[20] K. Doherty, R. Adams, and N. Davey. Unsupervised learning with normalised
data and non-euclidean norms. Applied Soft Computing, 7(1):203–210, 2007.
[21] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman
problem using an elastic net method. Nature, pages 689–691, 1987.
[22] E. Erwin, K. Obermayer, and K. Schulten. Self-organising maps:ordering, con-
vergence properties and energy functions. Biological Cybernetics, 67:47–55,
1992.
[23] E. Erwin, K. Obermayer, and K. Schulten. Self-organising maps:stationary
states, metastability and convergence rate. Biological Cybernetics, 67:35–45,
1992.
[24] A. Flexer. Limitations of self-organizing maps for vector quantization and mul-
tidimensional scaling. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors,
Advances in Neural Information Processing Systems 9. Proceedings of the 1996
Conference, pages 445–51. MIT Press, London, UK, 1997.
[25] Laboratory for Advanced Brain Signal Processing. Riken. Saitama. Japan.
Icalab toolbox. http://www.bsp.brain.riken.go.jp/icalab/.
[26] D. Francois, V. Wertz, and M. Verleysen. Non-euclidean metrics for similarity
search in noisy datasets. In Proc. of the European Symposium on Artificial
Neural Networks, ESANN 2005, pages 339–344, Bruges, Belgium, 2005.
[27] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a sta-
tistical view of boosting. Annals of Statistics, 28:337–374, 2000.
[28] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S.
Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing
Systems 7, pages 625–632. MIT Press, Cambridge MA, 1995.
[29] B. Fritzke. Growing self-organizing networks – why? In ESANN’96: European
Symposium on Artificial Neural Networks, pages 61–72, 1996.
[30] C. Fyfe. Topographic product of experts. In International Conference on Ar-
tificial Neural Networks, ICANN2005, pages 397–402, 2005.
[31] C. García-Osorio. Data Mining And Visualization. PhD thesis, School of
Computing, University of Paisley, 2005.
[32] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find
better clusterings. In CIKM ’02: Proceedings of the eleventh international
conference on Information and knowledge management, pages 600–607, New
York, NY, USA, 2002. ACM Press.
[33] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical
Association, 84(406):502–516, June 1989.
[34] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, 2001.
[35] G. E. Hinton. Training products of experts by minimizing contrastive diver-
gence. Technical Report GCNU TR 2000-004, Gatsby Computational Neuro-
science Unit, University College, London, 2000.
[36] J. Hollmen. Process Modeling Using the Self-Organizing Map. PhD thesis,
Helsinki University of Technology, 1996.
[37] J. Holmstrom. Growing neural gas: Experiments with gng, gng with utility
and supervised gng. Masters thesis in computer science, Uppsala University.
Department of Information Technology, 2002.
[38] A. Hyvarinen. Survey on independent component analysis. Neural Computing
Surveys, 2:94–128, 1999.
[39] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis.
Wiley, 2001.
[40] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures
of local experts. Neural Computation, 3:79–87, 1991.
[41] N. Jardine and R. Sibson. The construction of hierarchic and non-hierarchic
classifications. The Computer Journal, 11:177–184, 1968.
[42] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254,
1967.
[43] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the em
algorithm. Neural Computation, 6:181–214, 1994.
[44] S. Kaski, J. Kangas, and T. Kohonen. Bibliography of self-organizing map
(som) papers: 1981-1997. Neural Computing Surveys, 1:102–350, 1998.
[45] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger. Learning and design of princi-
pal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(3):281–297, 2000.
[46] K. Kiviluoto and E. Oja. S-map: A network with a simple self-organization
algorithm for generative topographic mappings. In NIPS, 1997.
[47] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 1984.
[48] T. Kohonen. Self-Organising Maps. Springer, 1995.
[49] B. Krose. Projection and clustering. ASCI Advanced Pattern Recognition
Course. Computer Science Department, University of Amsterdam, May 2002.
[50] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.
[51] K. Van Laerhoven. Combining the self-organizing map and k-means clustering
for on-line classification of sensor data. In ICANN, pages 464–469, 2001.
[52] Q. Li, N. Mitianoudis, and T. Stathaki. Spatial kernel k-harmonic means clus-
tering for multi-spectral image thresholding. IEE Vision, Image and Signal
Processing Journal.
[53] Z.-P. Lo and B. Bavarian. On the rate of convergence in topology preserving
neural networks. Biological Cybernetics, 65:11:55–63, 1991.
[54] S. P. Luttrell. Code vector density in topographic mappings: Scalar case. IEEE
Transactions on Neural Networks, 2(4):427–436, July 1991.
[55] K. V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic
Press, 1979.
[56] T.M. Martinetz. Competitive hebbian learning rule forms perfectly topology
preserving maps. In Stan Gielen and Bert Kappen, editors, ICANN93, pages
427–434. Springer Verlag, 1993.
[57] T.M. Martinetz, S.G. Berkovich, and K.J. Schulten. ’neural-gas’ network for
vector quantization and its application to time-series prediction. IEEE Trans-
actions on Neural Networks., 4(4):558–569, 1993.
[58] T.M. Martinetz and K.J. Schulten. Topology representing networks. Neural
Networks., 7:507–522, 1994.
[59] S. McGlinchey, M. Pena, and C. Fyfe. Comparison of quantization errors in the
model- and data-driven harmonic topographic mappings. WSEAS Transactions
On Computers, 5(7):1562–1570, July 2006.
[60] S. McGlinchey, M. Pena, and C. Fyfe. Quantization errors in the harmonic
topographic mapping. In The 9th WSEAS International Conference on applied
mathematics, MATH 06, pages 105–110, May 2006.
[61] I. T. Nabney. NETLAB: Algorithms for pattern recognition. Springer-Verlag
New York, Inc., New York, NY, USA, 2002.
[62] Neural Networks Research Centre. Helsinki University of Technology. Som tool-
box. www.cis.hut.fi/projects/somtoolbox.
[63] M. Oja, S. Kaski, and T. Kohonen. Bibliography of self-organizing map (som)
papers: 1998-2001 addendum. Neural Computing Surveys, 3:1–156, 2003.
[64] E. Pampalk. Limitations of the som and the gtm.
url:citeseer.ist.psu.edu/670342.html. 2001.
[65] M. Pena and C. Fyfe. Developments of the generalised harmonic topographic
mapping. WSEAS Transactions On Computers, 4(11):1548–1555, November
2005.
[66] M. Pena and C. Fyfe. Faster clustering of complex data with the generalised
harmonic topographic mapping (g-hatom). In 5th WSEAS International Con-
ference on Applied Informatics And Communications, WSEAS AIC ’05, pages
270–275, 2005.
[67] M. Pena and C. Fyfe. The harmonic topographic map. In The Irish conference
on Artificial Intelligence and Cognitive Science, AICS05, pages 245–254, 2005.
[68] M. Pena and C. Fyfe. The harmonic topographic map. Tech-
nical Report 35, School of Computing, University of Paisley,
http://cis.paisley.ac.uk/research/reports/, 2005.
[69] M. Pena and C. Fyfe. Model- and data-driven harmonic topographic maps.
WSEAS Transactions On Computers, 4(9):1033–1044, September 2005.
[70] M. Pena and C. Fyfe. Tight clusters and smooth manifolds with the harmonic
topographic map. In 5th WSEAS International Conference on Simulation, Mod-
eling And Optimization, WSEAS SMO ’05, pages 508–513, 2005.
[71] M. Pena and C. Fyfe. Forecasting with topology preserving maps: Harmonic
topographic map and topographic product of experts application. In First
International Conference on Multidisciplinary Information Sciences and
Technologies, InSciT2006, pages 42–46, October 2006.
[72] M. Pena and C. Fyfe. Outlier identification with the harmonic topographic
mapping. In European Symposium on Artificial Neural Networks, ESANN’06,
pages 289–295, April 2006.
[73] M. Pena and C. Fyfe. The topographic neural gas. In 7th International Con-
ference on Intelligent Data Engineering and Automated Learning, IDEAL06.,
pages 241–249, September 2006.
[74] A.K. Qin and P.N. Suganthan. Enhanced neural gas network for
prototype-based clustering. Pattern Recognition, 38(8):1275–1288, August 2005.
[75] F. Questier, Q. Guo, B. Walczak, D.L. Massart, C. Boucon, and S. de Jong. The
neural-gas network for classifying analytical data. Chemometrics and Intelligent
Laboratory Systems, 61/1-2:105–121, 2002.
[76] H. Ritter and T. Kohonen. Self-organising semantic maps. Biological Cybernet-
ics, 61:241–254, 1989.
[77] R. N. Shepard, A. K. Romney, and S. B. Nerlove. Multidimensional Scaling:
Theory and Applications in the Behavioral Sciences., volume 1. Seminar Press,
Inc., 1972.
[78] A. Staiano, R. Tagliaferri, and L. De Vinco. High-d data visualization methods
via probabilistic principal surfaces for data mining applications. In Proceedings
Trim, 2004.
[79] J. V. Stone. Independent Component Analysis. A tutorial introduction. A
bradford book. The MIT Press, 2004.
[80] M. Svensen. GTM: The Generative Topographic Mapping. PhD thesis, Aston
University, Birmingham, UK, 1998.
[81] V. Tereshko. Topology-preserving elastic nets. J. Mira and A. Prieto (Eds.):
Connectionist Models of Neurons, Learning Processes and Artificial Intelligence,
LNCS. Springer-Verlag:Berlin, 2084:551–557, 2001.
[82] V. Tereshko. Deriving cortical maps and elastic nets from topology-preserving
maps. J. Cabestany, A. Prieto, and D.F. Sandoval (Eds.): IWANN’2005,
LNCS. Springer-Verlag: Berlin/Heidelberg, 3512:326–332, 2005.
[83] P. Tino and I. Nabney. Hierarchical GTM: constructing localized non-linear
projection manifolds in a principled way. (IEEE) Transactions on Pattern
Analysis and Machine Intelligence., 24(5):639–656, 2001.
[84] M.E. Tipping. Topographic Mappings and Feed-Forward Neural Networks. PhD
thesis, The University of Aston in Birmingham, 1996.
[85] A. Ultsch. Clustering with som: U*c. In Proc. Workshop on Self-Organizing
Maps, pages 75–82, 2005.
[86] J.J. Verbeek, N. Vlassis, and B. Krose. Locally linear generative topographic
mapping. In Proc. of 12th Belgian-Dutch conf. on Machine Learning, 2002.
[87] J. Vesanto, J. Himberg, E. Alhoniemi, and
J. Parhankangas. Som toolbox for matlab 5.
http://www.cis.hut.fi/projects/somtoolbox/package/papers/techrep.pdf.
[88] C.K.I. Williams and F. V. Agakov. Products of gaussians and probabilistic
minor components analysis. Technical Report EDI-INF-RR-0043, University of
Edinburgh, 2001.
[89] H. Yin. Visom-a novel method for multivariate data projection and structure vi-
sualization. IEEE TRANSACTIONS ON NEURAL NETWORKS, 13(1):237–
243, 2002.
[90] H. Yin. Nonlinear multidimensional data projection and visualisation. Lecture
Notes in Computing Sciences, 2690:377–388, 2003.
[91] B. Zhang. Generalized k-harmonic means – boosting in unsupervised learning.
Technical Report HPL-2000-137, HP Laboratories, Palo Alto, October 2000.
[92] B. Zhang. Generalized k-harmonic means– dynamic weighting of data in unsu-
pervised learning. First SIAM international Conference on Data Mining, 2001.
[93] B. Zhang. Comparison of the performance of center-based clustering algorithms.
In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Con-
ference, PAKDD 2003, pages 63–74, 2003.
[94] B. Zhang. Regression clustering. In Third IEEE International Conference on
Data Mining (ICDM’03), page 451. IEEE Computer Society, 2003.
[95] B. Zhang, M. Hsu, and U. Dayal. K-harmonic means - a data clustering algo-
rithm. Technical Report HPL-1999-124, HP Laboratories, Palo Alto, October
1999.