
Latent Variable Spaces

For The Construction Of Topology

Preserving Mappings

Marian Pena

A thesis submitted in partial fulfillment of the

requirements of the University of Paisley for

the degree of Doctor of Philosophy

June 21, 2007


Abstract

This PhD is dedicated to a family of topology preserving mappings similar to the

Generative Topographic Map (GTM) [8]. These techniques can be considered as a

non-linear projection from input or data space to the output or latent space (usually

2D or 3D), plus a clustering algorithm that updates the prototypes. The key

difference of the new models explained in this document is that, instead of includ-

ing a neighborhood function in the learning rule in order to maintain the topology,

we project points from an existing latent space to the data space. In doing so, we

separate the clustering (inner loop) from the projection (outer loop) in two different

steps. A common frame based on the GTM structure can be used with different

clustering techniques, giving new properties to the algorithms.

Thus we have two versions of the Harmonic Topographic Mapping (HaToM)

that utilise the K-Harmonic Means (KHM) [91, 92, 95] clustering, and the faster

Topographic Neural Gas (ToNeGas), with the inclusion of Neural Gas in the inner

loop. We compare these with a fruitless attempt to combine the SOM learning

algorithm with K-Harmonic Means.

In Chapter 3 we first review the Topographic Product of Experts (ToPoE)

[30], which includes the GTM structure, but with the Product of Experts instead

of the Mixture of Experts and gradient descent learning instead of the Expectation-

Maximisation algorithm used in the GTM. ToPoE, like the GTM, is more suitable for

continuous data. We extend its theory by investigating properties such as the local

variance of the model, the projection to latent space and convergence properties. We

introduce local kernels instead of the better known Gaussian kernel. We also study

the use of the magnification factors as a tool for measuring topology preservation.

In the fourth chapter we introduce theory underlying the new algorithm, HaToM,

with its two versions, analyse their parameters, and compare the main differences.

The experimental sections are dedicated to the experiments with several datasets,

which illustrate the projection capabilities of these algorithms, explaining in detail

the use of different parameters, different kernels, and how they affect the results.

Chapter 5 introduces the last of the new algorithms, ToNeGas, and compares it with

the previous two by analysing differences using experiments with the same data.

We also compare their topology preservation properties and the convergence speed,

comparing as well with the results from the Self-Organizing Map. Finally, the three


new algorithms are evaluated together and compared with each other, and also

against SOM and GTM.


Summary

This thesis is dedicated to a family of topology preserving algorithms
similar to the Generative Topographic Map (GTM) [8]. These techniques can be
considered as a non-linear projection from the data space to the projection
space or latent space (usually 2D or 3D), plus a clustering algorithm that
updates the centres. The key difference of the models explained in this
document is that, instead of including a neighbourhood function in the
learning rule in order to maintain the topology, we project the points from
the projection space to the data space. In doing so we separate the
clustering phase (inner loop) from the projection phase (outer loop) into two
steps. A common framework based on the GTM structure can be used with
different clustering techniques, giving different properties to the algorithm.

Accordingly, we have two versions of the Harmonic Topographic Mapping
(HaToM), which uses K-Harmonic Means (KHM) [91, 92, 95] for the clustering,
and the faster Topographic Neural Gas (ToNeGas), which includes Neural Gas in
the inner loop. We compare these algorithms with an unsuccessful attempt to
combine the SOM and K-Harmonic Means.

We first review in Chapter three the Topographic Product of Experts (ToPoE)
[30], which includes the GTM structure, but where the product of experts
replaces the mixture of experts of the GTM, and gradient descent is the
learning algorithm instead of the Expectation-Maximisation used in the GTM.
We extend its theory by investigating properties such as the local variance
of the model, the projection to latent space, and the convergence properties.
We introduce local kernels in place of the better known Gaussian kernel. We
also study the use of the magnification factors as a tool for measuring
topology preservation.

In Chapter four we introduce the theory underlying the new algorithm, HaToM,
in its two versions, analyse their parameters, and compare the most important
differences. The experimental sections show the application to several
datasets, illustrating the projection properties of these algorithms and
explaining in detail the use of different parameters and different kernels,
and how all of this affects the results.

Chapter five introduces the last of the algorithms, ToNeGas, which is
compared with the previous two by analysing the differences on the same
datasets. We also compare the topology preservation and the convergence
speed, comparing with the SOM as well. Finally, the three new algorithms are
evaluated together, against each other and against the SOM and the GTM.


Acknowledgments

I would like to thank all the PhD students and staff in the Computing Department

of the University of Paisley, who made my stay a great experience to remember.

Special thanks to Gayle Leen, with whom I shared every step of the doctorate

and a new life in Scotland. Thank you also to Cesar García Osorio, who helped

me a great deal, especially during my first months at the University of Paisley. I

am grateful also to Jos Koetsier, Donald McDonald, Ying (Hannah) Han, Lina

Petrakieva, Oleksiy Dekhtyarenko, Andreas Loengarov, Nicolas García Pedrajas,

Benoit Chaperot, Wesam Barbakh, and Ian Miller.

This PhD would not have been possible without the close supervision of my Director of

Studies, Colin Fyfe. His expertise in the area and in supervising PhD students made

the process much easier. Thanks also to my co-supervisors Stephen McGlinchey and

Daniel Livingston.

Finally, thanks to my sister and brother, my family and friends, for supporting

me in this challenge. But most of all, thanks to my mother, who always believes in

me; this thesis is dedicated to her.


Acknowledgments (translated from the Spanish)

First of all, I would like to thank all the PhD students of the Computing
Department of the University of Paisley for making my stay a pleasant
experience to remember. Special thanks to Gayle Leen, with whom I have shared
every step of the doctorate and a new life in Scotland. Thanks also to Cesar
García Osorio, who helped me a great deal, especially during my first months
at the University of Paisley. I am also grateful to Jos Koetsier, Donald
McDonald, Ying (Hannah) Han, Lina Petrakieva, Oleksiy Dekhtyarenko, Andreas
Loengarov, Nicolas García Pedrajas, Benoit Chaperot, Wesam Barbakh, and Ian
Miller.

This thesis would not have been possible without the supervision of my
Director of Studies, Colin Fyfe. His experience in the area and in
supervising postgraduate studies has made this doctorate a much more bearable
process. Thanks also to my co-supervisors Stephen McGlinchey and Daniel
Livingston.

Finally, thanks to my sister and brother, family and friends, for supporting
me in this challenge. But above all, thanks to my mother, who always believes
in me; this thesis is dedicated to her.


List of symbols

• d is the dimensionality of the data space.

• q is the dimensionality of the latent space.

• x_i is a datapoint in data space.

• y_i is the projection of x_i in latent space.

• t_k is a latent point in latent space.

• m_k is the projection of t_k in data space; it is also named prototype, centre or

centroid of the cluster.

• W is the matrix of weights associated with the prototypes.

• Φ is a matrix where each row is the response of the basis functions to one

latent point, or alternatively each column of Φ is the response of one of the

basis functions to the set of latent points.

• r_ik is the responsibility of the kth latent point for the ith data point.

• d_ik is the distance from the kth latent point to the ith data point.

• γ is the width of the responsibilities.

• K is the number of neurons in the two-dimensional grid.

• β is the noise variance.

• N is the number of data points.

Contents

1 Introduction
1.1 Clustering
1.2 Dimensionality reduction
1.3 Topology Preservation
1.4 Structure of the Thesis
1.5 Contribution of the Research

2 Literature review
2.1 Dimensionality reduction
2.1.1 Principal Component Analysis
2.1.2 Independent Component Analysis
2.2 Norms and metrics
2.2.1 The Minkowski Metric
2.2.2 The Mahalanobis distance
2.2.3 Metrics in high dimensional spaces
2.3 Cluster analysis
2.3.1 K-Means
2.3.2 K-Means++
2.3.3 Harmonic Averages
2.3.4 K-Harmonic Means
2.3.5 Neural Gas
2.4 Topology preserving mappings
2.4.1 MultiDimensional Scaling
2.4.2 Self-Organizing Map
2.4.3 Generative Topographic Map
2.4.4 Probabilistic Principal Surface
2.4.5 Topology Representing Network
2.4.6 Growing Neural Gas
2.4.7 Topology preserving Elastic net
2.5 Software tools
2.5.1 SOM Toolbox
2.5.2 GTM Toolbox
2.5.3 Netlab Toolbox
2.5.4 ICALAB

3 The Topographic Product of Experts
3.1 Topographic Product of Experts
3.1.1 Product of Experts
3.1.2 The Topographic Product of Experts
3.1.3 Comparison with the GTM
3.2 Responsibility Estimation
3.3 The Actual Variance
3.3.1 Simulations
3.4 Cost functions and Convergence
3.4.1 A simplified model
3.4.2 The full model
3.4.3 Projections of the latent points
3.5 Magnification Factors and Dimensionality Estimation
3.6 Simulations
3.6.1 Experiment 1: 1D Artificial Data
3.6.2 Experiment 2: 2D Artificial Data
3.6.3 Experiment 3: The Animals data set
3.6.4 Experiment 4: Bank Notes Data
3.6.5 Experiment 5: The Fundamental Clustering Problems Suite
3.6.6 Experiment 6: The Algae data set
3.7 Conclusions

4 The Harmonic Topographic Mapping
4.1 The Harmonic Self-Organising Map
4.1.1 HSOM Simulations
4.2 Harmonic Topographic Map
4.2.1 Data-driven HaToM
4.2.2 Model-driven HaToM
4.2.3 Generalised Harmonic Topographic Map
4.3 HaToM Simulations
4.3.1 Experiment 1: 1D Artificial Data
4.3.2 Experiment 2: 2D Artificial Data
4.3.3 Experiment 3: The Animals data set
4.3.4 Experiment 4: The Fundamental Clustering Problems Suite
4.3.5 Experiment 5: The Algae data set
4.4 G-HaToM Simulations
4.4.1 Experiment 1: Crabs Data
4.4.2 Experiment 2: Bank Notes Data
4.4.3 Experiment 3: Oil Data
4.4.4 Experiment 4: Algae Data
4.5 Conclusion

5 The Topographic Neural Gas & Algorithms comparison
5.1 The Topographic Neural Gas
5.1.1 Simulations
5.1.2 Experiment 1: The Fundamental Clustering Problems Suite
5.1.3 Experiment 2: The Algae data set
5.2 Topology preservation
5.2.1 Experiment 1: Algae dataset
5.2.2 Experiment 2: Gene dataset
5.3 Experiment Comparisons
5.4 Comparison of the algorithms
5.4.1 Growing and Pruning
5.4.2 One-to-one comparisons
5.5 Benefits from separating clustering from projection
5.6 Conclusions

6 Summary and Future work
6.1 Summary
6.2 Major contributions of this thesis
6.3 Future Work

Chapter 1

Introduction

Topology preserving mappings can be considered as a combination of three func-

tions: clustering, projection to a space of lower dimensionality or dimensionality

reduction, and topology preservation. Those functions allow for a better visuali-

sation of the data in a two or three dimensional space, and representation of the

dataset with a small set of prototypes; but the property that stands out from these

techniques is topology preservation, which lets us visualise the data in a space of

smaller dimensionality and provides a similar representation of the disposition of the

datapoints in the high dimensional space.

In this thesis we investigate the separation of the projection and the clustering

functions, which are usually included together in the learning rule, into two steps. But

first we examine the three functions separately.

1.1 Clustering

One of the main properties of such mappings is the clustering or quantization of

the data samples. Data clustering is a common technique in unsupervised learning

(learning without a “teacher”, or without using classes’ information), which is used

in many fields, including machine learning, data mining, pattern recognition and

image analysis. Clustering is a division of data into different groups or clusters so

that the data in each cluster has similarity in one trait, often proximity according

to some defined distance measure.

Data clustering algorithms can be hierarchical [41, 42] or partitional. Hierarchical

algorithms find successive clusters using previously established clusters, whereas

partitional algorithms determine all clusters at once. Partitional clustering like K-

Means (see below) can be further subdivided into relocation algorithms and density-


based algorithms. Relocation algorithms try to discover clusters by relocating the

prototypes in successive iterations. Density-based algorithms search for areas with

high population of data. In this thesis we consider only partitional clustering of the

relocation type. More information about all the clustering techniques can be found

in [4].

In clustering, each cluster may have a prototype or associated centre. These

prototypes are usually the average of all the datapoints related to that particular

prototype. The association with a prototype depends on the property analysed,

that is often the Euclidean distance, so that each datapoint belongs to the closest

prototype. This membership can be to a unique prototype (hard membership) or to

several prototypes (soft membership). The first one is the case for K-Means while

the second one is used in algorithms like K-Harmonic Means (see Section 2.3.4).

Representing data with a limited set of prototypes necessarily loses information,

but achieves simplification that allows for the modeling of the data.
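
The difference between hard and soft membership can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the inverse-distance weighting used for the soft case is a generic choice made here for clarity, not the K-Harmonic Means formula of Section 2.3.4.

```python
import numpy as np

def hard_membership(X, M):
    """Assign each datapoint to its closest prototype (Euclidean distance)."""
    # d[i, k] = distance from datapoint i to prototype k
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return d.argmin(axis=1)

def soft_membership(X, M, power=2.0, eps=1e-12):
    """Share each datapoint among all prototypes with inverse-distance weights.

    The weighting d**(-power) is an illustrative assumption, not a formula
    taken from the thesis; each row of the result sums to one.
    """
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    w = (d + eps) ** (-power)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])   # three datapoints
M = np.array([[0.0, 0.0], [4.0, 0.0]])               # two prototypes

labels = hard_membership(X, M)    # one prototype index per datapoint
weights = soft_membership(X, M)   # one weight per (datapoint, prototype) pair
```

With hard membership the middle datapoint belongs entirely to the first prototype; with soft membership it contributes mostly to the first prototype but also, with a smaller weight, to the second.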

1.2 Dimensionality reduction

When working with large databases, it is always wise to reduce the dimensionality in

order to make it more manageable, and also for visualisation purposes. However, the

user should be aware of the reduction of information that this implies. Techniques

such as Principal Component Analysis (PCA) control this by keeping the maximum

variance in the projected data. The projection is linear in this case, but to make the

model more general, a nonlinear projection may be more suitable. Dimensionality

reduction is an “attribute transformation” [4], that is, a representation of all the

attributes of the data samples with a small set of new attributes that are the result

of a function applied to the original attributes. This process can be considered as a

projection from a high-dimensional data space, where the dimensions are the original

attributes, to a low dimensional data space, usually two or three dimensions, with

the new attributes. Those new attributes are often called “hidden causes” or “latent

variables”, and allow for the reduction of noise or redundancy in the data.

Dimensionality reduction can be classified as hard or soft, depending on the

number of dimensions reduced, i.e. a reduction from a very high dimensional space

to a two dimensional space is a hard reduction, while a reduction of just two or three

dimensions is a soft reduction; there is also dimensionality reduction for visualisation,

where the goal is not finding the intrinsic dimension of the data (the number of

independent variables that satisfactorily explain the dataset), but to project the



data to two or three dimensions to visualise the data in a low dimensional space.

In some dimensionality reduction techniques, the intrinsic dimension has to be

given by the user, which calls for a trial-and-error process to find a suitable result

without under- or over-fitting the data. When the aim is to visualise in two or three

dimensions as it is in topology preserving mappings, the intrinsic dimensionality is

ignored and the projection is to two or three dimensional spaces.

More information about dimensionality reduction can be found in [14].

1.3 Topology Preservation

In mathematics, topology began with the investigation of certain questions in

geometry¹. General topology is the branch of topology which studies properties of

topological spaces and structures defined on them such as connectedness, compact-

ness and continuity. In topology the geometry is not analysed through shape but

considering the way the objects are put together. For example, the square and the

circle have many properties in common: they are both one dimensional objects (from

a topological point of view) and both separate the plane into two parts, the part

inside and the part outside; a circle is topologically equivalent to an ellipse and a

sphere is equivalent to an ellipsoid. If two objects have the same topological proper-

ties, they are said to be homeomorphic. The objects of topology are formally defined

as topological spaces. To be homeomorphic, the objects have to preserve their topo-

logical properties after deformations like twisting and stretching; the connectivity of

the objects has to be preserved.

The inclusion of topology preservation in a clustering technique was first

implemented by Kohonen [48] with the Self-Organizing Map (SOM). This implementation

gave the algorithms, for the first time, an important property: points that are close in the

projected mapping or latent space are also close in data space. This property

allows the user to visualise the organisation of the high-dimensional space in a low-

dimensional projection, commonly a 2D mapping. The SOM has been extensively

used in many applications [44, 63], and has inspired a large number of related

algorithms. It has a few drawbacks, such as the lack of an objective function, but it is still

the most widely used topology preserving mapping.

Some topology preserving techniques have been defined as examples of the

so-called latent variable technique [13], where the projections can follow different

criteria depending on the objectives:

¹ From Wikipedia: http://en.wikipedia.org/wiki/Main_Page



• minimising the reconstruction error,

• maximising the variance preservation,

• maximising the distance preservation,

• decorrelating the observed variables,

• making the estimated latent variables as independent as possible.

One possible criterion for topology preservation is the third one, while the last

is the basis for Independent Component Analysis (see Section 2.1.2).

When considering topology preservation it is very important to keep in mind

that the dimensionality of the data manifold (or intrinsic dimension of the data)

may very likely be different from two, so projecting onto a 2-dimensional latent

space does not give the right projection in terms of dimensionality reduction. This

dimensionality of the projection, determined by the user, is called the embedding

dimensionality, and is only for visualisation purposes.

1.4 Structure of the Thesis

In the second chapter we review techniques related to our algorithms such as the

dimensionality reduction models, different metrics used to estimate the closest pro-

totype, clustering techniques, and some of the existing topology preserving mappings

(TPM). We finally enumerate some useful toolboxes for TPM available on the in-

ternet.

Chapter three reviews the Topographic Product of Experts (ToPoE) algorithm

and develops its theory, by analysing its properties through the examination of the

local variance, topology preservation with the calculation of magnification factors,

convergence properties, and theory related to the projection of the datapoints to the

low-dimensional space for visualisation purposes.

In the fourth and fifth chapters we describe the theory of two new algorithms,

the Harmonic Topographic Mapping (two versions) and the Topographic Neural

Gas, whose properties are discussed. The former makes use of K-Harmonic Means,

while the latter is an extension of Neural Gas as a topology preserving mapping.

Note that, even though Neural Gas is named as a topology preserving mapping in many

publications, these are really referring to the Topology Representing Network (TRN),

which is a combination of Neural Gas and competitive Hebbian learning, created by

the same author as the original Neural Gas [58]. In the Topographic Neural Gas



(ToNeGas) we strictly use the clustering version of Neural Gas, and apply our new

projection method to maintain the topology preservation.

All the algorithms are jointly compared in Chapter 5. The same datasets have

been used with all the algorithms (though only the experiments showing relevant

information about each algorithm appear) and the results are included in the

corresponding chapter. The general comparison in Chapter 5 includes two

more in-depth experiments that investigate clustering and topology preservation,

a comparison of the experiments, and one-to-one comparisons between algorithms.

1.5 Contribution of the Research

The contributions of the research presented in this thesis are:

• The Topographic Product of Experts (ToPoE) was presented in [30]. In this

thesis we extend the algorithm by

– including new kernels in the calculation of responsibilities,

– analysing the local variance of the model, the projection of datapoints to

the latent space and convergence properties,

– investigating the degree of topology preservation through magnification

factors,

– applying the algorithm to several datasets to compare it with the other

algorithms presented in this thesis.

• K-Harmonic Means (KHM) clustering has been shown to

overcome some of the drawbacks of the K-Means algorithm. We studied the

inclusion of an underlying latent space with KHM.

• We investigate the separation in two steps of the projection and the clustering

functions for topology preserving mappings, that are usually included together

in the learning rule. This naturally leads to a family of algorithms that share

the projection technique, but that are different in the clustering method.

• We develop two versions of the Harmonic Topographic Mapping (HaToM),

which allow for a major or minor imposition of the modelling, depending on

the nature of the data.



• We further extend HaToM with the Generalised Harmonic Topographic

Mapping (G-HaToM), which proved to reduce the computational time

and helps to model more difficult data.

• The Topographic Neural Gas (ToNeGas) is the last algorithm of this family;

it was created specifically to reduce the computational time, while keeping

all the necessary functionality.

• We investigated the use of fractional distance metrics in the new algorithms

for high-dimensional data.

This work has appeared in the following publications:

1. Pena, M. and Fyfe, C. The Harmonic Topographic Map. In proceedings of

The Irish conference on Artificial Intelligence and Cognitive Science, AICS05.

Pages 245-254. 2005.

2. Pena, M. and Fyfe, C. Tight Clusters and Smooth Manifolds with the Harmonic

Topographic Map. In proceedings of the 5th WSEAS International Conference

on Simulation, Modeling And Optimization, WSEAS SMO ’05.

Pages 508-513. 2005.

3. Pena, M. and Fyfe, C. Model- and Data-driven Harmonic Topographic Maps.

WSEAS Transactions On Computers. Volume 4, number 9 pages 1033-1044.

September 2005.

4. Pena, M. and Fyfe, C. Faster clustering of complex data with the Gener-

alised Harmonic Topographic Mapping (G-HaToM). In proceedings of the 5th

WSEAS International Conference on Applied Informatics And Communica-

tions, WSEAS AIC ’05. Pages 270-275. 2005.

5. Pena, M. and Fyfe, C. Developments of the Generalised Harmonic Topographic

Mapping. WSEAS Transactions On Computers. Volume 4, number 11, pages

1548-1555. November 2005.

6. Pena, M. and Fyfe, C. The Harmonic Topographic Map. Technical report

Number 35. School of Computing, University of Paisley.

http://cis.paisley.ac.uk/research/reports/

7. Pena, M. and Fyfe, C. Outlier Identification with the Harmonic Topographic

Mapping. In proceedings of the European Symposium on Artificial Neural

Networks, ESANN’06. Pages 289-295. April 2006.



8. McGlinchey, S. and Pena, M. and Fyfe, C. Quantization Errors in the Harmonic

Topographic Mapping. In proceedings of The 9th WSEAS International Conference

on applied mathematics, MATH 06. Pages 105-110. May

2006.

9. McGlinchey, S. and Pena, M. and Fyfe, C. Comparison of Quantization Errors

in the Model- and Data-driven Harmonic Topographic Mappings. WSEAS

Transactions On Computers. Volume 5, number 7, pages 1562-1570. July

2006. Pages 241-249.

10. Pena, M. and Fyfe, C. The Topographic Neural Gas. In proceedings of the

7th International Conference on Intelligent Data Engineering and Automated

Learning, IDEAL06. September 2006.

11. Pena, M. and Fyfe, C. Forecasting with topology preserving maps: Harmonic

Topographic Map and Topographic product of experts application. In proceedings

of the First International Conference on Multidisciplinary Information

Sciences and Technologies, InSciT2006. Pages 42-46. October 2006.

12. Pena, M. and Fyfe, C. The Topographic Neural Gas. Journal of Computing

and Information Systems. Volume 10, number 3, pages 6-14. 2006. ISSN

1352-9404.

13. Pena, M. and Fyfe, C. Principal Manifolds for Data Visualisation and Dimen-

sion Reduction. Book chapter: Topology-preserving Mappings for data visu-

alisation. Lecture Notes in Computational Science and Engineering. Springer.

2007.


Chapter 2

Literature review

Topographic mappings are a class of dimensionality reduction techniques that seek to

preserve some of the structure of the data in the geometric structure of the mapping.

The term “geometric structure” refers to the relationships between distances in data

space and the distances in the projection to the topographic map. In some cases all

distance relationships between data points are important, which implies a desire for

global isometry between the data space and the map space. Alternatively, it may

only be considered important that local neighbourhood relationships are maintained,

which is referred to as topological ordering [84]. When the topology is preserved, if

the projections of two points are close, it is because, in the original high dimensional

space, the two points were close. The closeness criterion is usually the Euclidean

distance between the data patterns.
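
The idea that close projections should correspond to close datapoints can be checked numerically. The sketch below is an illustrative assumption rather than a measure used in this thesis (which studies magnification factors for this purpose): the hypothetical function `neighbourhood_overlap` reports, for each point, the fraction of its k nearest neighbours in the projection that are also among its k nearest neighbours in data space, using the Euclidean distance in both spaces.

```python
import numpy as np

def knn_indices(P, k):
    """Indices of the k nearest neighbours of every row of P (self excluded)."""
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]

def neighbourhood_overlap(X, Y, k=3):
    """Mean fraction of projection-space neighbours that are also data-space neighbours."""
    nx, ny = knn_indices(X, k), knn_indices(Y, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nx, ny)]
    return float(np.mean(shared)) / k

# Datapoints lying on a straight line in 3D, projected to 1D in two ways.
t = np.linspace(0.0, 1.0, 10)
X = np.stack([t, 2 * t, -t], axis=1)
good = t[:, None]                                     # ordering along the line is kept
bad = t[[0, 5, 1, 6, 2, 7, 3, 8, 4, 9]][:, None]      # ordering is scrambled

score_good = neighbourhood_overlap(X, good, k=2)
score_bad = neighbourhood_overlap(X, bad, k=2)
```

A projection that keeps the ordering along the line preserves every neighbourhood, whereas the scrambled projection places points side by side that were far apart in data space and therefore scores lower.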

One clear example of a topographic mapping is a Mercator projection of the

spherical earth into two dimensions; the visualisation is improved, but some of the

distances in certain areas are distorted. These projections imply a loss of some of the

information, which inevitably gives some inaccuracy, but they are an invaluable tool

for visualisation and data analysis, e.g. for cluster detection. Two previous works

in this area have been the Self-Organizing Map (SOM) [48] and the Generative

Topographic Map (GTM) [8].

In this chapter we detail the best known techniques in topology preservation,

such as the Self-Organizing Map and the Generative Topographic Mapping. We first

review the most common preprocessing techniques for reducing the dimensionality

of the data, a relevant step that reduces the computational time of the topology

preserving mappings by eliminating the redundancy between several variables. Then,

we review clustering techniques without topology preservation, which in most cases are the basis of a topology preserving mapping described in the corresponding section of this chapter. Finally, we present some of the software tools for topology preservation available on the internet.

Figure 2.1: Data preprocessing with clustering and reduction of the dimensionality [49].

2.1 Dimensionality reduction

Databases keep growing, and algorithms have to deal with large amounts of data, which increases computational time. Thus, any preprocessing step that reduces the dimensionality or the amount of data improves their performance. Linear and nonlinear principal component analysis, whitening or sphering, and independent component analysis are widely used for this task. The objectives of preprocessing, depicted in Figure 2.1, are:

• To reduce the dimensionality of the data, i.e. converting many (N) high-

dimensional (d) data, to many (N) low-dimensional (q < d) data

• To represent data by a limited set of prototypes, i.e. converting many (N)

high-dimensional data, to few (K < N) high-dimensional (d) data

2.1.1 Principal Component Analysis

Principal component analysis (PCA) [55, 36], also known as the Karhunen-Loeve

transform, is a classical statistical method based on the reduction of correlation



between the variables. Variables that are correlated have a redundancy in informa-

tion, so by eliminating this redundancy we reduce the number of variables, and thus

the dimensionality of the data, keeping at the same time the maximum amount of

information. If the data is from a Gaussian distribution, the information is propor-

tional to the variance, so by projecting the data to the directions with maximum

variance we are preserving as well maximum information. Choosing a number of

major components may be enough to represent the whole data.

Calculating the Principal Components (PCs)

From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors $e_i$ and the corresponding eigenvalues $\lambda_i$ are the solutions of the equation

$$C_x e_i = \lambda_i e_i, \quad i = 1, \cdots, n \tag{2.1}$$

where $C_x = E\{(x - \mu_x)(x - \mu_x)^T\}$ is the covariance matrix. The components of $C_x$, denoted by $c_{ij}$, represent the covariances between the random variable components $x_i$ and $x_j$. The component $c_{ii}$ is the variance of the component $x_i$, and $\mu_x$ is the mean value.

By ordering the eigenvectors in the order of descending eigenvalues (largest first),

we find an ordered orthogonal basis with the first eigenvector having the direction

of largest variance of the data.

Let B be a matrix consisting of eigenvectors of the covariance matrix as the row

vectors.

Projecting the data vector x on the coordinate axes defined by the orthogonal

basis, we get the latent variables

$$y = B(x - \mu_x) \tag{2.2}$$

which is a point in the orthogonal coordinate system defined by the eigenvectors.

The components of y are the coordinates in the orthogonal basis. We can reconstruct

the original data vector x from y by

$$x = B^T y + \mu_x \tag{2.3}$$

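Instead of using all the eigenvectors of the covariance matrix, we may represent the data in terms of only a few basis vectors of the orthogonal basis. If we denote the matrix having the first K eigenvectors as rows by $B_K$, we obtain the projection

$$y = B_K(x - \mu_x) \tag{2.4}$$

and the reconstruction, which is now only an approximation of the original vector,

$$x \approx B_K^T y + \mu_x \tag{2.5}$$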
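As a minimal sketch of equations (2.1) to (2.5), the following code finds the first principal component of a toy data set and uses it to project and reconstruct one vector. It assumes K = 1 and replaces a full eigensolver with power iteration, which is a swapped-in technique chosen only to keep the example self-contained; all function names and the toy data are illustrative.

```python
import math

def mean_vector(data):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]

def covariance_matrix(data, mu):
    """Sample covariance C_x = E{(x - mu)(x - mu)^T}, normalised by N."""
    n, d = len(data), len(mu)
    centred = [[row[j] - mu[j] for j in range(d)] for row in data]
    return [[sum(c[a] * c[b] for c in centred) / n for b in range(d)]
            for a in range(d)]

def leading_eigenvector(C, iterations=200):
    """Power iteration (assumed stand-in for an eigensolver).

    Assumes C is non-zero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    d = len(C)
    e = [1.0] * d
    for _ in range(iterations):
        e = [sum(C[a][b] * e[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in e))
        e = [v / norm for v in e]
    return e

def project(x, mu, e):
    """Equation (2.4) with K = 1: a single latent coordinate y."""
    return sum(ei * (xi - mi) for ei, xi, mi in zip(e, x, mu))

def reconstruct(y, mu, e):
    """Equation (2.5) with K = 1."""
    return [y * ei + mi for ei, mi in zip(e, mu)]

# Toy data lying exactly on a line through (1, 3) with direction (1, 2).
data = [[1.0 + t, 3.0 + 2.0 * t] for t in (-2, -1, 0, 1, 2)]
mu = mean_vector(data)
e1 = leading_eigenvector(covariance_matrix(data, mu))
y = project([2.0, 5.0], mu, e1)
x_hat = reconstruct(y, mu, e1)
```

Because the toy data has all of its variance along one direction, a single component reconstructs the vector without loss; with real data the reconstruction error grows as fewer components are kept.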

Whitening or Sphering

Another preprocessing technique is sphering or whitening of the data [38]; even

algorithms that do not necessarily need sphering, often converge better with sphered

data. The data is also assumed to be centered, i.e., made zero-mean.

Sphering means that the observed variable x is linearly transformed to a variable

$$v = Qx \tag{2.6}$$

such that the covariance matrix of v is the identity matrix: $E\{vv^T\} = I$. This transformation is always possible and is very often performed with PCA, which also allows for data compression and the reduction of Gaussian noise.
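As a hedged illustration of how PCA can supply the sphering matrix, one possible choice is sketched below. It is an assumption of this example, since the text does not fix Q: B is the matrix with the eigenvectors of $C_x$ as rows (Section 2.1.1), $\Lambda$ is the diagonal matrix of the corresponding eigenvalues, assumed strictly positive, and x is centered so that $E\{xx^T\} = C_x = B^T \Lambda B$.

$$Q = \Lambda^{-1/2} B, \qquad E\{vv^T\} = Q C_x Q^T = \Lambda^{-1/2} B \left(B^T \Lambda B\right) B^T \Lambda^{-1/2} = \Lambda^{-1/2} \Lambda \Lambda^{-1/2} = I$$

The middle step uses $B B^T = I$, which holds because the rows of B are orthonormal.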

2.1.2 Independent Component Analysis

In many cases the data is assumed to follow a Gaussian distribution (also called

the normal distribution), which simplifies the application of algorithms. A Gaussian

distribution is described by its mean and variance (first and second moments, see

Table 2.1). In other cases like Independent Component Analysis, higher-order statis-

tics such as kurtosis (fourth moment) are required to eliminate mutual information

between the variables that decorrelation is not able to eliminate, giving signals as

independent as possible. We define $y_1$ and $y_2$ to be independent if and only if the joint pdf is factorizable in the following way: $p(y_1, y_2) = p_1(y_1)p_2(y_2)$. A weaker form of independence is uncorrelatedness. Two random variables $y_1$ and $y_2$ are said to be uncorrelated if their covariance is zero: $E(y_1 y_2) - E(y_1)E(y_2) = 0$. If the variables are independent, they are uncorrelated, but the converse does not hold in general.
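The difference between the two notions can be checked on a small discrete example. The sketch below assumes a toy distribution that is not taken from the text: $y_1$ is uniform on {-1, 0, 1} and $y_2 = y_1^2$. Exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Assumed toy distribution: y1 is uniform on {-1, 0, 1} and y2 = y1 ** 2.
outcomes = [(-1, 1), (0, 0), (1, 1)]
prob = Fraction(1, 3)  # each outcome is equally likely

def expectation(f):
    """Expected value of f(y1, y2) under the toy distribution."""
    return sum(prob * f(y1, y2) for y1, y2 in outcomes)

# Covariance E(y1 y2) - E(y1) E(y2)
covariance = (expectation(lambda y1, y2: y1 * y2)
              - expectation(lambda y1, y2: y1) * expectation(lambda y1, y2: y2))

# Check the factorisation p(y1, y2) = p1(y1) p2(y2) at the point (0, 0).
p_joint = sum(prob for y1, y2 in outcomes if (y1, y2) == (0, 0))
p1 = sum(prob for y1, _ in outcomes if y1 == 0)
p2 = sum(prob for _, y2 in outcomes if y2 == 0)

uncorrelated = covariance == 0
factorises_at_origin = p_joint == p1 * p2
```

Here the covariance is exactly zero, yet the joint probability at (0, 0) is 1/3 while the product of the marginals is 1/9, so the variables are uncorrelated but not independent.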

A data distribution is skewed if it is not symmetric; leptokurtotic or super-Gaussian distributions are distributions which are more kurtotic than a Gaussian distribution, and conversely platykurtotic or sub-Gaussian distributions are less kurtotic than a Gaussian distribution.



Table 2.1: Order Moments of a Gaussian distribution.

The Gaussian distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$

The first order (mean): $\mu_x = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$

The second order (variance): $\sigma_X^2 = Var(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$

The third order (skewness): $E[(X - \mu)^3]$

The fourth order (kurtosis): $E[(X - \mu)^4]$

Independent Component Analysis (ICA) [39, 79] is a technique with applica-

tions in data analysis, source separation, and feature extraction. It can be used as

a preprocessing technique, where fewer reference vectors are used to represent all

the samples. PCA, as we have seen, is a technique that decorrelates the data by

finding the highest variance projections, and thus depends on second-order statis-

tics. Independent component analysis on the other hand also reduces higher-order

statistical dependencies such as kurtosis, so that all mutual information between

variables is eliminated. ICA represents a multidimensional random vector as a

linear combination of non-Gaussian random variables that are as independent as

possible. Quantitative measures of non-Gaussianity (such as kurtosis, negentropy,

mutual information) are used as measures of independence, allowing for different

algorithms for ICA. There are also ICA models where the objective is to decorrelate

a non-linear combination of the signals.

2.2 Norms and metrics

In linear algebra, functional analysis and related areas of mathematics, a norm is

a function which assigns a positive length or size to all vectors in a vector space,

other than the zero vector¹. A simple example is the 2-dimensional Euclidean space

equipped with the Euclidean norm. Elements in this vector space are usually drawn

as arrows in a 2-dimensional Cartesian coordinate system. The Euclidean norm

¹ From Wikipedia: http://en.wikipedia.org/wiki/Main_Page



assigns to each vector the length of its arrow. A vector space with a norm is called

a normed vector space.

2.2.1 The Minkowski Metric

The Minkowski metrics are a family of distance measures given by:

$$d_{ij} = \left\{ \sum_{k=1}^{d} \left| x_i^{(k)} - x_j^{(k)} \right|^r \right\}^{1/r} \tag{2.7}$$

where $d_{ij}$ is the distance between the d-dimensional entities i and j, $x_i^{(k)}$ is the value of the kth variable for entity i, $x_j^{(k)}$ is the value of the kth variable for entity j, and r > 0. The most common distance measure is the Euclidean or $L_2$ norm, where r = 2. The Euclidean distance is used in K-Means (see below), where the objective function is the sum of squares of distances between datapoints and prototypes, which in statistics is the total intra-cluster variance.

When r = 1 we have the Taxicab norm or Manhattan norm, where the name

relates to the distance a taxi has to drive in a rectangular street grid to get from the

origin to the point x. The distance that returns the maximum of absolute difference

in coordinates corresponds to r = ∞. Figure 2.2 shows a representation of several

of these distances.
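Equation (2.7) and its special cases can be sketched in a few lines. The function name minkowski is an illustrative choice, and the limit r = ∞ is handled explicitly as the maximum absolute coordinate difference.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of equation (2.7) between two vectors, r > 0."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: maximum absolute coordinate difference
        return max(diffs)
    return sum(v ** r for v in diffs) ** (1.0 / r)

x, y = [0, 0], [3, 4]
d_manhattan = minkowski(x, y, 1)          # Taxicab or Manhattan norm
d_euclidean = minkowski(x, y, 2)          # Euclidean or L2 norm
d_max = minkowski(x, y, math.inf)         # maximum norm
d_fractional = minkowski(x, y, 0.5)       # a fractional value, r < 1
```

For this pair of points the distance shrinks as r grows, from the fractional value through the Manhattan and Euclidean distances down to the maximum coordinate difference.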

2.2.2 The Mahalanobis distance

In statistics, the Mahalanobis distance is a distance measure introduced by P. C.

Mahalanobis in 1936. It is based on correlations between variables by which different

patterns can be identified and analysed. It is a useful way of determining similarity

of an unknown sample set to a known one. It differs from Euclidean distance in that

it takes into account the correlations of the data set and is scale-invariant.

Mahalanobis distance is defined as the dissimilarity measure between two random

vectors x and y of the same distribution with the covariance matrix Σ :

$$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)} \tag{2.8}$$

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces

to the Euclidean distance. If the covariance matrix is diagonal, then the resulting

distance measure is called the normalized Euclidean distance:

$$d(x, y) = \sqrt{\sum_{k=1}^{d} \frac{(x^{(k)} - y^{(k)})^2}{\sigma_k^2}} \tag{2.9}$$

where $\sigma_k$ is the standard deviation of the kth variable $x^{(k)}$ over the sample set.

Figure 2.2: First quadrant plot of unit length loci in two dimensions from the origin with various $L_r$ norms ($L_{0.3}$, $L_1$, $L_2$, $L_5$) [20].
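The normalized Euclidean distance of equation (2.9) and its scale invariance can be illustrated as follows. This is a sketch under simple assumptions: the standard deviations are estimated from the sample set with normalisation by N, every variable has non-zero spread, and the function names and toy samples are illustrative.

```python
import math

def std_devs(samples):
    """Per-variable standard deviation over the sample set (normalised by N)."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(d)]
    return [math.sqrt(sum((s[k] - means[k]) ** 2 for s in samples) / n)
            for k in range(d)]

def normalised_euclidean(x, y, sigma):
    """Equation (2.9): Mahalanobis distance for a diagonal covariance matrix."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, sigma)))

samples = [[0.0, 0.0], [2.0, 10.0], [4.0, 20.0]]
d_original = normalised_euclidean(samples[0], samples[2], std_devs(samples))

# Rescale the second variable by a factor of 10: the distance should not change.
rescaled = [[a, 10.0 * b] for a, b in samples]
d_rescaled = normalised_euclidean(rescaled[0], rescaled[2], std_devs(rescaled))

# With unit standard deviations the measure reduces to the Euclidean distance.
d_unit = normalised_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])
```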

2.2.3 Metrics in high dimensional spaces

The importance of the metric used to calculate distances or dissimilarities is widely recognised in general, and in topology preserving mappings in particular. However, familiarity makes the Euclidean norm the most used distance, often without considering the implications of this choice. In [1, 20, 26] this selection is analysed for high dimensional data, with normalised or raw data, and with different noise models, respectively.

In [20] Doherty et al note that “in a learning context when measuring dissim-

ilarities between two entities, the use of a fractional norm reduces the impact of

extreme individual attribute differences when compared to the equivalent Euclidean

measurements. Conversely, the higher-order norms emphasise the larger attribute

dissimilarities between the two entities and taken to the limit, L∞ reports the distance based on the single attribute with the maximum dissimilarity”. However, they do not find clear benefits from using any particular norm, which leaves us with the familiar Euclidean distance.

As shown in [5], when the dimensionality of the data increases, the difference between the maximum and minimum distances to a given point grows more slowly than the distance to the nearest point. This gives poor discrimination between the nearest and furthest neighbour in high-dimensional space, making the distance meaningless. They define as criterion of meaningfulness

$$\frac{Dmax^{(d)}_k - Dmin^{(d)}_k}{Dmin^{(d)}_k} \tag{2.10}$$

which in [1] is called “relative contrast”. After studying this measure, the authors of [1] defend the Manhattan norm as preferable to the Euclidean norm in high-dimensional space.
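The behaviour of the relative contrast (2.10) can be observed numerically. The sketch below is only an illustration under assumptions that are not taken from [1] or [5]: the data are drawn uniformly from the unit hypercube, the query point is the origin, and the function names and sample sizes are arbitrary.

```python
import random

def relative_contrast(points, query, r=2.0):
    """(Dmax - Dmin) / Dmin for the L_r distances from query to points.

    Assumes no point coincides with the query, so that Dmin > 0.
    """
    dists = [sum(abs(a - b) ** r for a, b in zip(p, query)) ** (1.0 / r)
             for p in points]
    return (max(dists) - min(dists)) / min(dists)

def contrast_for_dimension(d, n_points=200, r=2.0, seed=0):
    """Relative contrast of uniform random data in d dimensions (assumed setup)."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    return relative_contrast(points, [0.0] * d, r)

contrast_low = contrast_for_dimension(2)
contrast_high = contrast_for_dimension(200)
```

Under these assumptions the contrast in 2 dimensions is far larger than in 200 dimensions, where the nearest and furthest points lie at almost the same distance from the query.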

After getting better results with r = 1 in high dimensional space, they experiment

with r < 1, which they called “fractional distance metrics”. They prove that in this

case, the smaller the fraction, the greater the rate of absolute divergence between

the maximum and minimum distance. They also give results with different datasets

and K-Means, always getting better results with fractional distances.



In [26] Francois et al suggest choosing the metric according to the shape of the

noise that is assumed on the data. While it is known that the Euclidean metric is

optimal in the presence of white Gaussian noise, it is shown that other types of noise

require other metrics. They give the example of impulse or burst noise, present in

many high-dimensional data; burst noise is a noise which affects only a minority of

the components of the data elements, but in a significant way.

The experiments carried out with fractional metrics applied to the new algo-

rithms presented in this thesis did not give better results. This could be due to the

improvement in the clustering techniques used.

2.3 Cluster analysis

2.3.1 K-Means

K-Means clustering is an algorithm to divide samples based on attributes/features

into K groups. K is a positive integer that has to be given in advance. The group-

ing is done by minimizing the sum of squares of distances between data and the

corresponding prototypes mk.

The performance function for K-Means may be written as

$$J = \sum_{i=1}^{N} \min_{k=1}^{K} \| x_i - m_k \|^2 \tag{2.11}$$

which we wish to minimise by moving the prototypes to the appropriate positions.

Note that (2.11) detects only the prototypes closest to data points and then dis-

tributes them to give the minimum performance which determines the clustering.

Any prototype which is still far from data is not utilised and does not enter any calculation to determine minimum performance, which may result in dead prototypes, which are never appropriate for any cluster. Thus initializing prototypes appropriately can have a big effect on K-Means.

The algorithm has the following steps:

• Step 1. Begin with a decision on the value of K, the number of clusters.

• Step 2. Choose any initial partition that divides the data into K clusters, for example at random.

• Step 3. Take each sample in sequence and compute its distance from the

prototype of each of the clusters. If a sample is not currently in the cluster

with the closest prototype, switch this sample to that cluster and update the



prototype of the cluster gaining the new sample and the cluster losing the

sample.

• Step 4. Repeat step 3 until convergence is achieved, that is until a pass through

the training samples causes no new assignments.
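The steps above can be sketched as follows. Note that this sketch uses the batch (Lloyd) variant, in which all samples are reassigned before the prototypes are recomputed, rather than the sample-by-sample switching of Step 3; it also starts from given prototypes instead of an initial partition. The function names and the toy data are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def k_means(data, prototypes, max_iter=100):
    """Batch K-Means; returns prototypes, assignments and performance J (2.11)."""
    prototypes = [list(m) for m in prototypes]
    assignment = None
    for _ in range(max_iter):
        # Assign every sample to its closest prototype
        new_assignment = [
            min(range(len(prototypes)), key=lambda k: sq_dist(x, prototypes[k]))
            for x in data
        ]
        if new_assignment == assignment:
            break  # a full pass caused no new assignments: convergence
        assignment = new_assignment
        for k in range(len(prototypes)):
            members = [x for x, a in zip(data, assignment) if a == k]
            if members:  # a prototype without members (dead) is left unchanged
                prototypes[k] = [sum(col) / len(members) for col in zip(*members)]
    J = sum(sq_dist(x, prototypes[a]) for x, a in zip(data, assignment))
    return prototypes, assignment, J

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

# Good initialisation: one prototype near each group
good_m, good_a, good_J = k_means(data, [(0.0, 0.0), (10.0, 10.0)])

# Poor initialisation: the second prototype is far from all data and stays dead
bad_m, bad_a, bad_J = k_means(data, [(0.0, 0.0), (100.0, 100.0)])
```

The second run illustrates the dead prototype problem described above: the distant prototype never wins a sample, so it never moves, and the performance J is much worse than in the first run.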

The main problem of the K-Means algorithm is getting trapped in local minima.

There are several extensions of this algorithm [32], including K-Harmonic Means,

explained below.

2.3.2 K-Means++

In order to overcome the initialisation problem of K-Means, Arthur and Vassilvitskii

[3] modified the algorithm by substituting the random allocation of the prototypes

with a seeding technique.

The K-Means algorithm begins with an arbitrary set of cluster prototypes. They

propose a specific way of choosing these prototypes. At any given time, let D(x)

denote the shortest distance from a data point x to the closest prototype already

chosen. Then, the K-Means++ algorithm is as follows:

1. Choose an initial prototype m1 uniformly at random from the dataset X.

2. Choose the next prototype $m_k$, selecting $m_k = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.

3. Repeat from Step 2 until we have chosen a total of K prototypes.

4. Proceed as with the standard K-Means algorithm.

They give experimental results that show the advantage in time and accuracy of

this technique.
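Steps 1 to 3 of the seeding can be sketched as below; Step 4 would then hand the seeds to a standard K-Means routine. The sketch assumes that the data set contains at least K distinct points, and the function name kmeans_pp_seeds and the toy data are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_pp_seeds(data, K, rng=None):
    """Choose K initial prototypes with D(x)^2 weighting (Steps 1 to 3)."""
    rng = rng or random.Random()
    prototypes = [rng.choice(data)]          # Step 1: uniform random choice
    while len(prototypes) < K:               # Step 3: repeat until K are chosen
        # D(x)^2: squared distance from x to the closest prototype chosen so far
        d2 = [min(sq_dist(x, m) for m in prototypes) for x in data]
        # Step 2: sample the next prototype with probability D(x')^2 / sum D(x)^2
        prototypes.append(rng.choices(data, weights=d2, k=1)[0])
    return prototypes

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeans_pp_seeds(data, 3, random.Random(0))
```

Points that are already prototypes have D(x) = 0 and therefore zero selection probability, which is why the seeds tend to spread over the data.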

2.3.3 Harmonic Averages

Harmonic Means or Harmonic Averages are defined for spaces of derivatives. For example, if you travel $\frac{1}{2}$ of a journey at 10 km/hour and the other $\frac{1}{2}$ at 20 km/hour, and d is the length of each half, your total time taken is $\frac{d}{10} + \frac{d}{20}$ and so the average speed is $\frac{2d}{\frac{d}{10} + \frac{d}{20}} = \frac{2}{\frac{1}{10} + \frac{1}{20}}$. In general, the Harmonic Average of K points, $a_1, ..., a_K$, is defined as

$$HA(\{a_i, i = 1, \cdots, K\}) = \frac{K}{\sum_{k=1}^{K} \frac{1}{a_k}} \tag{2.12}$$

This average is used in K-Harmonic Means to overcome the problem of local minima.
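A short numerical check of equation (2.12) on the journey example may help; the function name harmonic_average is illustrative.

```python
def harmonic_average(values):
    """Harmonic Average of equation (2.12); all values must be non-zero."""
    return len(values) / sum(1.0 / a for a in values)

speeds = [10.0, 20.0]
average_speed = harmonic_average(speeds)      # 2 / (1/10 + 1/20)
arithmetic_mean = sum(speeds) / len(speeds)
```

The average speed is 40/3 km/hour, which is below the arithmetic mean of 15 km/hour because more time is spent on the slower half of the journey. The harmonic average is dominated by the smallest values, which is the property exploited in K-Harmonic Means.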



2.3.4 K-Harmonic Means

The harmonic average has recently [95] been used to make the K-Means algorithm more robust. Zhang

et al have developed an algorithm based on the harmonic average which converges

to a better solution than the standard algorithm. The algorithm calculates the

Euclidean distance between the ith data point and the kth prototype as d(xi,mk).

Using gradient descent in the performance function

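$$J_{HA} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{d(x_i, m_k)^2}} \tag{2.13}$$

we get

$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{2.14}$$

where $d_{ik}$ is $d(x_i, m_k)$.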
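One application of the update rule (2.14) can be sketched as follows. The small constant eps that floors the distances is an assumption added here to avoid division by zero when a prototype coincides with a data point; it is not part of the rule as stated. The function name khm_update and the toy data are illustrative.

```python
import math

def khm_update(data, prototypes, eps=1e-12):
    """One K-Harmonic Means update of all prototypes, equation (2.14)."""
    dim = len(data[0])
    # d[i][k] = Euclidean distance d(x_i, m_k), floored at eps (assumption)
    d = [[max(math.dist(x, m), eps) for m in prototypes] for x in data]
    new_prototypes = []
    for k in range(len(prototypes)):
        numerator = [0.0] * dim
        denominator = 0.0
        for x, d_i in zip(data, d):
            s = sum(1.0 / d_il ** 2 for d_il in d_i)
            w = 1.0 / (d_i[k] ** 4 * s ** 2)   # weight of x_i for prototype k
            denominator += w
            for j in range(dim):
                numerator[j] += w * x[j]
        new_prototypes.append([v / denominator for v in numerator])
    return new_prototypes

data = [[0.0], [10.0]]
updated = khm_update(data, [[1.0], [9.0]])
```

In this toy case each prototype is pulled almost entirely onto its nearby data point, while the distant point still contributes a small, non-zero weight; this soft contribution is what distinguishes the rule from the hard assignment of K-Means.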

In [95] extensive simulations show that this algorithm converges to a better solu-

tion (less prone to finding a local minimum because of poor initialisation) than both

standard K-Means or a mixture of experts trained using the Expectation Maximi-

sation algorithm.

With this learning rule on a latent space model similar to the GTM, we get a

mapping which has elements of topology preservation in the HaToM algorithm (see

later).

Zhang subsequently developed a generalised version of the algorithm [91, 92, 93]

that includes the pth power of the L2 distance², which creates a “dynamic weighting

function” that determines how data points participate in the next iteration in the

calculation of the new prototypes mk. The weight is bigger for data points further

away from the prototypes, so that their participation is boosted in the next iteration.

This makes the algorithm insensitive to initialisation and also prevents one cluster

from taking more than one prototype.

The aim of K-Harmonic Means was to improve the winner-takes-all partitioning

strategy of K-Means that gives a very strong relation between each datapoint and

its closest prototype, so that the change in membership is not allowed until another

prototype is closer. The transition of prototypes between areas of high density

is more continuous in K-Harmonic Means due to the distribution of associations

² Note that this distance is different from the Minkowski distance presented in Section 2.2.1, where Lp is applied.



between prototypes and datapoints. To explain this we consider a general formula

for K-Means and K-Harmonic Means for the updating of the prototypes

$$m_k \leftarrow \frac{\sum_{i=1}^{N} mem(m_k|x_i) \, weight(x_i) \, x_i}{\sum_{i=1}^{N} mem(m_k|x_i) \, weight(x_i)} \tag{2.15}$$

where

• $weight(x_i) > 0$ is the weighting function that defines how much influence a data point $x_i$ has in recomputing the prototype parameters $m_k$ in the next iteration.

• $mem(m_k|x_i)$, with $mem(m_k|x_i) \ge 0$ and $\sum_{k=1}^{K} mem(m_k|x_i) = 1$, is the membership function that decides the portion of $weight(x_i) \, x_i$ associated with $m_k$.

K-Means has a hard membership, that is each datapoint is related to just one

prototype, thus

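$$mem(m_l|x_i) = \begin{cases} 1 & \text{if } l = \arg\min_k \| x_i - m_k \|^2 \\ 0 & \text{otherwise} \end{cases} \tag{2.16}$$

and the weighting function is

$$weight(x_i) = 1 \tag{2.17}$$

with $weight(x_i) > 0 \ \forall i$. The soft membership in K-Harmonic Means, on the other hand,

$$mem(m_k|x_i) = \frac{\| x_i - m_k \|^{-p-2}}{\sum_{l=1}^{K} \| x_i - m_l \|^{-p-2}} \tag{2.18}$$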

allows the data points to belong partly to all prototypes.

The boosting properties for the generalised version of K-Harmonic Means (p > 2) are given by the weighting function [32]:

$$weight(x_i) = \frac{\sum_{k=1}^{K} \| x_i - m_k \|^{-p-2}}{\left( \sum_{k=1}^{K} \| x_i - m_k \|^{-p} \right)^2} \tag{2.19}$$

where the dynamic function gives a variable influence to data in clustering in a similar way to boosting [27], since the effect of any particular data point on the re-calculation of a prototype is $O(\| x_i - m_k \|^{2p-p-2}) = O(\| x_i - m_k \|^{p-2})$, which for p > 2 has greatest effect for larger distances.
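As a consistency check, which is a derivation added here rather than part of the cited work, the product of the membership (2.18) and the weighting function (2.19) gives the coefficient of $x_i$ in the general update (2.15), because the sum in the numerator of (2.19) cancels the denominator of (2.18):

$$mem(m_k|x_i) \, weight(x_i) = \frac{\| x_i - m_k \|^{-p-2}}{\left( \sum_{l=1}^{K} \| x_i - m_l \|^{-p} \right)^2}$$

Setting p = 2 and writing $d_{ik} = \| x_i - m_k \|$ gives $\frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}$, which is exactly the coefficient that appears in equation (2.14).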

An algorithm that uses the clustering properties of K-Harmonic Means for image

segmentation is the Spatial Kernel-based KHM (SKKHM) algorithm [52]. A kernel-

induced metric substitutes the classic Euclidean intensity distance, reducing the

effect of outliers and noise.

2.3.5 Neural Gas

Vector quantization methods encode a set of data points in N-dimensional space

with a smaller set of reference or prototype vectors mk, k = 1, ..., K. Neural Gas

(NG) [57] is a vector quantization technique with soft competition between the units;

it is called the Neural Gas algorithm because the prototypes of the clusters move

around in the data space similar to the Brownian movement of gas molecules in a

closed container. In each training step, the squared Euclidean distances

$$d_{ik} = \| x_i - m_k \|^2 = (x_i - m_k)^T (x_i - m_k) \tag{2.20}$$

between a randomly selected input vector $x_i$ from the training set and all reference vectors $m_k$ are computed; the vector of these distances is d. Each prototype k is assigned a rank $r_k(d) = 0, ..., K-1$, where a rank of 0 indicates the closest and a rank of K-1 the most distant prototype to $x_i$. The learning rule is then

$$m_k = m_k + \varepsilon \, h_\rho[r_k(d)] \, (x_i - m_k) \tag{2.21}$$

The function

$$h_\rho(r_k(d)) = e^{-r_k(d)/\rho} \tag{2.22}$$

is a monotonically decreasing function of the ranking that adapts not only the closest

prototype, but all the prototypes, with a factor exponentially decreasing with their

rank. The width of this influence is determined by the neighborhood range ρ. The

learning rule is also affected by a global learning rate ε. The values of ρ and ε

decrease exponentially from an initial positive value (ρ(0), ε(0)) to a smaller final

positive value (ρ(T ), ε(T )) according to

$$\rho(t) = \rho(0) \, [\rho(T)/\rho(0)]^{t/T} \tag{2.23}$$

and

$$\varepsilon(t) = \varepsilon(0) \, [\varepsilon(T)/\varepsilon(0)]^{t/T} \tag{2.24}$$


where t is the time step and T the total number of training steps, forcing more local

changes with time.
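The learning rule (2.21) with the rank function (2.22) and the schedules (2.23) and (2.24) can be sketched as follows. The initial and final values of ρ and ε, the function names and the toy data are illustrative assumptions, not values recommended in [57].

```python
import math
import random

def decay(v0, vT, t, T):
    """Equations (2.23) and (2.24): exponential decay from v0 to vT."""
    return v0 * (vT / v0) ** (t / T)

def neural_gas_step(x, prototypes, eps, rho):
    """Update all prototypes in place for one input x, equations (2.20)-(2.22)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    order = sorted(range(len(prototypes)), key=lambda k: d[k])
    for rank, k in enumerate(order):       # rank 0 is the closest prototype
        h = math.exp(-rank / rho)
        prototypes[k] = [m + eps * h * (a - m) for a, m in zip(x, prototypes[k])]

def train_neural_gas(data, prototypes, T, rho=(2.0, 0.01), eps=(0.5, 0.005), seed=0):
    """Run T training steps with randomly selected inputs (assumed settings)."""
    rng = random.Random(seed)
    for t in range(T):
        x = rng.choice(data)
        neural_gas_step(x, prototypes,
                        decay(eps[0], eps[1], t, T), decay(rho[0], rho[1], t, T))
    return prototypes

# A single step: the closest prototype moves most, the other one less
step_demo = [[1.0, 0.0], [3.0, 0.0]]
neural_gas_step([0.0, 0.0], step_demo, eps=0.5, rho=1.0)

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
trained = train_neural_gas(data, [[4.0, 4.0], [6.0, 6.0]], T=500)
```

Unlike the SOM, the neighbourhood here is defined by the rank of the distances in data space, so no latent grid is needed.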

The updating rule in NG is very similar to the Self-Organizing Map (SOM) rule (see Section 2.4.2), the difference being the neighborhood function that SOM uses:

$$h = \exp\left( -\frac{\| t_k - t_j \|^2}{\sigma^2} \right) \tag{2.25}$$

where $t_k$ is the position of the kth latent point in latent space.

In contrast to the NG, the neighborhood function of SOM is evaluated in the

latent space. The advantage of the SOM is the ordered topological structure of

neurons. In contrast, in the original NG, such an order is not given. One can

extend the NG to the topology representing network (TRN) such that topological

relations between neurons are installed during learning, although generally they do

not achieve the simple structure as in SOM lattices [58].

There is also a Growing version of Neural Gas [28] that learns the topology of the

data by combining NG with Competitive Hebbian Learning (CHL), which is then

closer to the SOM algorithm.

2.4 Topology preserving mappings

The interest in feature maps stems directly from their biological importance. A

feature map uses the “physical layout” of the output neurons to model some feature

of the input space. In particular, if two output neurons ya and yb are close together

with respect to some distance measure in the output layer, then the corresponding

inputs x1 and x2 which cause ya and yb to fire must be close together in the input

space. Such maps are also called topology preserving maps (TPM).

As explained in [31], preserving the distances in a TPM means that:

1. Nearby data points give nearby projection points.

2. Distant data points give distant projection points.

3. If the projections of two data points are close, it is because, in the original

high dimensional space, the two data points were close.

4. If the projections are distant, the original data points were distant.

However the techniques presented in this thesis (and also SOM and GTM) only

guarantee the second and third properties.



There are several ways of creating feature maps - the most popular are Kohonen’s

SOM and the GTM. Kohonen’s Self-Organizing Map (SOM) is a Neural Network

map called a topology-preserving map. It takes into consideration the physical

arrangement of the nodes. Nodes that are “close” together are going to interact

differently from nodes that are “far” apart. This TPM is by far the most popular,

and it is not uncommon to substitute the term topology preserving maps with self-

organizing maps.

The GTM was developed by Bishop as a probabilistic version of the SOM, in order to overcome some of the problems of this map, especially the lack of an objective function.

Further information about ordering, convergence properties, energy functions

and topology preservation in self-organizing maps is provided in [22, 23, 47, 53, 54,

56, 76].

2.4.1 MultiDimensional Scaling

Multidimensional scaling (MDS) [50, 77] is an exploratory technique used to visualize dissimilarities in a low dimensional space, usually Euclidean, in which the distances in the projection $d_{ij}$ match, as well as possible, the original dissimilarities $\delta_{ij}$. These may be distances or any other proximity measure; indeed, any kind of relation between a pair of entities that can be translated into a proximity measure, or conversely into a dissimilarity measure, can be used.

Classical scaling, that treats dissimilarities directly as Euclidean distances, and

least squares scaling, which matches distances dij to transformed dissimilarities

f(δij), are known as metric MDS, where metric refers to the type of transforma-

tion. They can be shown to be special cases of principal components analysis. With

Non-metric MDS, the metric nature is abandoned, and only the rank order of dis-

similarities has to be preserved by the transformation.

The coordinates in the distance function ($x_{ia}$, i = 1, ..., N with N = number of entities, a = 1, ..., d with d = number of dimensions) and the function f which transforms the dissimilarities into distances are estimated by minimising the following badness of fit function (usually called stress or S-function in the context of MDS):

$$S = \left( \frac{\sum_{i=1}^{N} \sum_{j>i}^{N} (f(\delta_{ij}) - d_{ij})^2}{\sum_{i=1}^{N} \sum_{j>i}^{N} d_{ij}^2} \right)^{1/2} \tag{2.26}$$



Metric Multidimensional Scaling

The most simple case of multidimensional scaling considers quantitative data. In

classical scaling the dissimilarities are treated directly as distances. In metric scaling

two properties have to hold³: these are called non-degeneracy and triangular inequality. Non-degeneracy states that $d_{ij} = 0 \Rightarrow i = j$ and the triangular inequality means that $d_{ij} + d_{jk} \ge d_{ik}$ for all (i, j, k). The matrix obtained after pre-processing is labeled D. It can be shown that the elements of the double centered dissimilarity matrix D equal minus two times the scalar products:

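$$d_{ij}^2 - \frac{\sum_{j=1}^{N} d_{ij}^2}{N} - \frac{\sum_{i=1}^{N} d_{ij}^2}{N} + \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^2}{N^2} = -2 \sum_{a=1}^{d} x_{ia} x_{ja} \tag{2.27}$$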

with i the row index, j the column index, N the number of objects and d the

number of dimensions. Then the matrix of scalar products is:

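$$B = -\frac{1}{2} \left[ I - \frac{1}{N} i i^T \right] D^2 \left[ I - \frac{1}{N} i i^T \right] \tag{2.28}$$

where I is the N by N identity matrix, i is a vector of ones of length N, and $D^2$ is the matrix of squared dissimilarities.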

To obtain the original X values, the singular value decomposition (SVD) is performed, $B = V \Lambda V^T$; defining $B = XX^T$ we get the matrix of the coordinates with $X = V \Lambda^{1/2}$. To project into a space of lower dimensionality we retain only the first r eigenvectors: this implies that the summation over a in equation (2.27) runs from 1 to r instead of d.
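The double centering of equation (2.28) can be verified numerically on a small configuration. The sketch below assumes that the coordinates are centered at the origin, so that the recovered matrix B should equal the matrix of scalar products $XX^T$; the eigendecomposition step is omitted, and the function names and toy points are illustrative.

```python
def squared_distances(points):
    """Matrix of squared Euclidean distances between all pairs of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def double_centre(D2):
    """B = -1/2 J D2 J with J = I - (1/N) 1 1^T, written element by element."""
    n = len(D2)
    row = [sum(r) / n for r in D2]
    col = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[-0.5 * (D2[i][j] - row[i] - col[j] + grand) for j in range(n)]
            for i in range(n)]

# Four points whose mean is the origin (assumed toy configuration)
X = [(-1.0, 0.0), (1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
B = double_centre(squared_distances(X))
gram = [[sum(a * b for a, b in zip(p, q)) for q in X] for p in X]
max_error = max(abs(B[i][j] - gram[i][j]) for i in range(4) for j in range(4))
```

The agreement between B and the scalar products is what allows the coordinates to be recovered from the dissimilarities alone.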

2.4.2 Self-Organizing Map

Kohonen’s algorithm [47, 48] is an algorithm used to visualize and interpret large

high-dimensional data sets. The map consists of a regular grid of processing units,

“neurons” in a (usually) 2-layer network and competition takes place between the

output neurons. The disposition of the neurons in that grid can be different, as

shown in Figure 2.3 but the one used in this thesis (unless specified otherwise) for

SOM and the other topology preserving mappings is the rectangular lattice.

The map attempts to represent all the available observations with optimal accu-

racy using a restricted set of models. At the same time the models become ordered

on the grid so that similar models are close to each other and dissimilar models far

from each other. Fitting of the model vectors is usually carried out by a sequential

³ From http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm

Figure 2.3: Examples of Map shapes available for the SOM mapping [87].

regression process: for each sample x(t), where t = 1, 2, ... is the step index, first the winner index BMU (Best Matching Unit) is identified by the condition

$$\forall k \quad \| x(t) - m_{BMU(x)} \| \le \| x(t) - m_k(t) \|. \tag{2.29}$$

After that, all model vectors or a subset of them that belong to nodes centered around node BMU are updated as

$$m_k(t+1) = m_k(t) + h(BMU(x), k) \, (x(t) - m_k(t)). \tag{2.30}$$

Here h(BMU(x), k) is the neighbourhood function, a decreasing function of the

distance between the kth and BMU nodes on the map grid. Typical functions are

the Gaussian function, and the Difference of Gaussians function shown below; thus

if unit k is at point $t_k$ in the output layer then

$$h(k, k^*) = a \exp\left( -\frac{\| t_k - t_{k^*} \|^2}{2\sigma_1^2} \right) - b \exp\left( -\frac{\| t_k - t_{k^*} \|^2}{2\sigma_2^2} \right)$$

This regression is usually iterated over the available samples. See Figure 2.4⁴ to visualise the ordering process of the neurons in a two-dimensional mapping.

⁴ Adapted from http://www-users.cs.york.ac.uk/ sok/IML/

Figure 2.4: Weight vectors during the ordering process in a two dimensional mapping.


The number of neighbours and how much each weight can learn decrease over time. This whole process is repeated a large number of times, usually more than 1000 times.
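One step of the sequential regression, equations (2.29) and (2.30), can be sketched as follows. The sketch assumes a Gaussian neighbourhood function with the learning rate alpha folded into h; the parameter names alpha and sigma, the function name som_step and the toy map are illustrative. In practice both alpha and sigma would be decreased over time, as described above.

```python
import math

def som_step(x, prototypes, grid, alpha, sigma):
    """One SOM update for input x; prototypes are modified in place.

    grid[k] is the fixed position t_k of unit k in the map space.
    Returns the index of the Best Matching Unit.
    """
    # Equation (2.29): the winner is the closest prototype in data space
    bmu = min(range(len(prototypes)),
              key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))
    for k, t_k in enumerate(grid):
        # Neighbourhood evaluated in the map space (assumed Gaussian form)
        g2 = sum((a - b) ** 2 for a, b in zip(t_k, grid[bmu]))
        h = alpha * math.exp(-g2 / (2.0 * sigma ** 2))
        # Equation (2.30)
        prototypes[k] = [m + h * (a - m) for a, m in zip(x, prototypes[k])]
    return bmu

grid = [(0.0,), (1.0,), (2.0,)]            # a one-dimensional map with 3 units
prototypes = [[0.0], [1.0], [2.0]]
winner = som_step([0.0], prototypes, grid, alpha=0.5, sigma=1.0)
```

The unit next to the winner on the grid is pulled towards the input more strongly than the unit two positions away, which is what produces the ordering of the map.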

There are some drawbacks to the SOM algorithm [24, 64] like the lack of an ob-

jective function and no proof of convergence. However, it is a widely used technique

[44, 63], that has been the basis for many algorithms [51].

There are many extensions of the SOM algorithm, like ViSOM [89, 90], or visualisation-induced SOM, which splits the updating force of each winner in two. The first force points from the winner to the input data $x_i$; it adapts the neuron towards the input in a direction orthogonal to the tangent plane of the winner. The second force is a lateral force bringing the neighbour neuron to the winner neuron. The ViSOM constrains the lateral contraction forces between neurons and

hence regularises the inter-neuron distances so that distances between neurons in

the data space are in proportion to those in the map space, preserving the distance

information on the map, along with the topology.

U-Matrix

There are different methods for visualising the results of a SOM mapping [85], one

of the most common being the U-Matrix (unified distance matrix). The U-matrix

allows us to visualize the distances between the neurons. The distance in data space

between the adjacent neurons is calculated and illustrated with different colourings

between the adjacent nodes. A red colouring between the neurons corresponds to a

large distance and thus a gap between the codebook values in the input space. A

blue colouring between the neurons signifies that the codebook vectors are close to

each other in the input space. Light areas can be thought of as clusters and dark

areas as cluster separators. This can be a helpful presentation when one tries to

find clusters in the input data without having any a priori information about the

clusters.
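The distance computation behind the U-Matrix can be sketched as below. This is a simplified variant assumed for illustration: it stores, for each neuron of a rectangular lattice, the average data-space distance to its horizontal and vertical neighbours, whereas full implementations such as the one in [85] may also show the individual distances between pairs of nodes. The function name u_matrix and the toy map are illustrative, and the map is assumed to have at least two neurons.

```python
import math

def u_matrix(prototypes):
    """prototypes[r][c] is the data-space vector of the neuron at cell (r, c)."""
    rows, cols = len(prototypes), len(prototypes[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Distances in data space to the adjacent neurons on the grid
            dists = [math.dist(prototypes[r][c], prototypes[rr][cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < rows and 0 <= cc < cols]
            u[r][c] = sum(dists) / len(dists)
    return u

# A 1 x 3 map: the first two neurons are close, the third is far away
u = u_matrix([[[0.0], [1.0], [5.0]]])
```

Large values mark a gap between codebook vectors and therefore a possible cluster border; small values mark neurons inside a cluster.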

In Figure 2.5 we can see the U-Matrix representation for the iris dataset that

reveals two clusters in the upper right corner and lower part, separated by an area

of higher distance.

2.4.3 Generative Topographic Map

The Generative Topographic Mapping (GTM) [8, 9, 11, 80, 83] is a non-linear latent

variable model for modeling continuous low-dimensional probability distributions,

Figure 2.5: U-matrix representation of the Self-Organizing Map for the iris data.

embedded in high-dimensional spaces. It is a probabilistic extension of the self-

organizing map, that has an objective function and new visualisation technique.

Two limitations of the basic GTM model are the computational effort required,

that grows exponentially with the intrinsic dimensionality of the density model, and

the initialisation of the parameters, that can lead the algorithm to a local minimum.

It is also a more complex algorithm.

The GTM defines a non-linear, parametric mapping m(t; W) from a q-dimensional latent space to a d-dimensional data space $x \in \mathbb{R}^d$, where normally q < d. The mapping is defined to be continuous and differentiable. m(t; W) maps every point in the latent space to a point in the data space. Since the latent space is q-dimensional, these points will be confined to a q-dimensional manifold non-linearly embedded into the d-dimensional data space. If we define a probability distribution over the latent space, p(t), this will induce a corresponding probability distribution into the data space. Strictly confined to the q-dimensional manifold, this distribution would be singular, so it is convolved with an isotropic Gaussian noise distribution, given by

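$$p(\mathbf{x} \mid \mathbf{t}, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{d/2} \exp\left\{-\frac{\beta}{2}\sum_{s=1}^{d}\left(x^{(s)} - m^{(s)}(\mathbf{t}, W)\right)^2\right\} \qquad (2.31)$$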

where x is a point in the data space and β denotes the noise inverse variance. By

integrating out the latent variable, we get the probability distribution in the data

space expressed as a function of the parameters β and W ,

$$p(\mathbf{x} \mid W, \beta) = \int p(\mathbf{x} \mid \mathbf{t}, W, \beta)\, p(\mathbf{t})\, d\mathbf{t} \qquad (2.32)$$

Choosing p(t) as a set of K equally weighted delta functions on a regular grid,

$$p(\mathbf{t}) = \frac{1}{K}\sum_{k=1}^{K}\delta(\mathbf{t} - \mathbf{t}_k) \qquad (2.33)$$

the integral in (2.32) becomes a sum,

$$p(\mathbf{x} \mid W, \beta) = \frac{1}{K}\sum_{k=1}^{K} p(\mathbf{x} \mid \mathbf{t}_k, W, \beta) \qquad (2.34)$$

Each delta function centre maps into the centre of a Gaussian which lies in the

manifold embedded in the data space, as illustrated in Figure 2.6. This algorithm

defines a constrained mixture of Gaussians [40, 43], since the centres of the mixture

components can not move independently of each other, but all depend on the map-

ping m(t; W ). Moreover, all components of the mixture share the same variance,

and the mixing coefficients are all fixed at 1/K. Given a finite set of independent
and identically distributed (i.i.d.) data points, $\{\mathbf{x}_i\}_{i=1}^{N}$, the log-likelihood function
of this model is maximized by means of the Expectation Maximisation algorithm
with respect to the parameters of the mixture, namely W and β. The form of the
mapping m(t; W) is defined as a generalized linear regression model

$$\mathbf{m}(\mathbf{t}; W) = \phi(\mathbf{t})^T W$$

where the elements of φ(t) consist of M fixed basis functions $\{\phi_i(\mathbf{t})\}_{i=1}^{M}$, and W is
an M × d matrix.
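As an illustration of equations (2.31) and (2.34), the following minimal sketch evaluates the constrained mixture density for a one-dimensional latent grid. The Gaussian form of the basis functions, their width, and all numerical values are assumptions made only for this example; the function names are hypothetical and do not come from any GTM toolbox.

```python
import math

def rbf_basis(t, centres, width):
    """phi(t): M Gaussian basis functions evaluated at a 1-D latent point t."""
    return [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centres]

def gtm_map(t, W, centres, width):
    """m(t; W) = phi(t)^T W, with W stored as an M x d list of rows."""
    phi = rbf_basis(t, centres, width)
    d = len(W[0])
    return [sum(phi[m] * W[m][s] for m in range(len(phi))) for s in range(d)]

def gaussian_noise_model(x, centre, beta):
    """Equation (2.31): isotropic Gaussian with inverse variance beta."""
    d = len(x)
    sq_dist = sum((xs - cs) ** 2 for xs, cs in zip(x, centre))
    return (beta / (2.0 * math.pi)) ** (d / 2.0) * math.exp(-0.5 * beta * sq_dist)

def gtm_density(x, latent_grid, W, centres, width, beta):
    """Equation (2.34): equally weighted mixture over the K latent grid points."""
    total = sum(gaussian_noise_model(x, gtm_map(t, W, centres, width), beta)
                for t in latent_grid)
    return total / len(latent_grid)

def gtm_log_likelihood(data, latent_grid, W, centres, width, beta):
    """Log-likelihood of i.i.d. data, the quantity that EM would maximise."""
    return sum(math.log(gtm_density(x, latent_grid, W, centres, width, beta))
               for x in data)

# Illustrative values only: 10 latent points, 3 basis functions, d = 2.
latent_grid = [k / 9.0 for k in range(10)]
centres = [0.0, 0.5, 1.0]
W = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
data = [[0.5, 0.8], [1.5, 1.6], [2.0, 1.2]]
print(gtm_log_likelihood(data, latent_grid, W, centres, width=0.3, beta=4.0))
```

Because every mixture centre is produced by the same W, changing one row of W moves all centres at once, which is the sense in which the mixture is constrained.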

There are several extensions of the initial algorithm like the Locally Linear Gen-

erative Topographic Mapping [86], Hierarchical GTM [11, 83], and a combination of

SOM and GTM [46].

Figure 2.6: In order to formulate a tractable non linear latent variable model, we consider a prior distribution p(t) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node tk is mapped to a corresponding point m(tk; W) in data space, and forms the centre of a corresponding Gaussian distribution. (Adapted from [80]).

2.4.4 Probabilistic Principal Surface

Principal surfaces (curves) [19, 33, 45] are nonlinear generalizations of principal
subspaces that formalize the notion of a low-dimensional manifold passing through
the middle of a dataset in high-dimensional space. The probabilistic principal surface
(PPS) [15, 16], a generalization of the Generative Topographic Mapping (GTM), is a
parametric approximation of principal surfaces. The PPS generalizes the GTM
by building a unified model that shares the same formulation as the GTM,
except for an oriented covariance structure for the nodes. Data points projecting near
a principal surface node have a higher influence on that node than points projecting
far away from it. This is illustrated in Figure 2.7.

Therefore, each node $\mathbf{m}(\mathbf{t}; W)$, $\mathbf{t} \in \{\mathbf{t}_k\}_{k=1}^{K}$, has covariance

$$\Sigma_{\mathbf{t}} = \frac{\alpha}{\beta}\sum_{i=1}^{q}\mathbf{e}_i(\mathbf{t})\mathbf{e}_i^T(\mathbf{t}) + \frac{(d - \alpha q)}{\beta(d - q)}\sum_{j=q+1}^{d}\mathbf{e}_j(\mathbf{t})\mathbf{e}_j^T(\mathbf{t}), \qquad 0 < \alpha < \frac{d}{q} \qquad (2.35)$$

where

• $\{\mathbf{e}_i(\mathbf{t})\}_{i=1}^{q}$ is the set of orthonormal vectors tangential to the manifold at m(t; W),

• $\{\mathbf{e}_j(\mathbf{t})\}_{j=q+1}^{d}$ is the set of orthonormal vectors orthogonal to the manifold at m(t; W).

The parameter α is a clamping factor and determines the orientation of the covariance
matrix; this orientation gives the self-consistency property, i.e. every point of
the curve is the average of the data points projecting onto that point of the curve
[33]. The PPS model reduces to the GTM for α = 1 and to the manifold-aligned GTM
for α > 1:

$$\Sigma_{\mathbf{t}} = \begin{cases} \perp \text{ to the manifold} & 0 < \alpha < 1 \\ I_d \text{ or spherical} & \alpha = 1 \\ \parallel \text{ to the manifold} & 1 < \alpha < d/q \end{cases} \qquad (2.36)$$

Figure 2.7: (a) Under a spherical Gaussian model of the GTM, points 1 and 2 have equal influences on the centre node m(t). (b) PPS have an oriented covariance matrix so point 1 is probabilistically closer to the centre node m(t) than point 2. (Adapted from [78]).
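To make the role of the clamping factor concrete, the sketch below builds the covariance of (2.35) for the simplest case d = 2, q = 1, where the tangent direction is given by an angle. The function names and the restriction to two dimensions are assumptions made for illustration only.

```python
import math

def outer(u, v):
    """Outer product u v^T of two vectors given as lists."""
    return [[ui * vj for vj in v] for ui in u]

def pps_covariance_2d(theta, alpha, beta):
    """Equation (2.35) for d = 2, q = 1 with tangent direction at angle theta."""
    d, q = 2, 1
    if not 0.0 < alpha < d / q:
        raise ValueError("alpha must satisfy 0 < alpha < d/q")
    e_tan = [math.cos(theta), math.sin(theta)]      # tangential to the manifold
    e_orth = [-math.sin(theta), math.cos(theta)]    # orthogonal to the manifold
    a = alpha / beta
    b = (d - alpha * q) / (beta * (d - q))
    T, O = outer(e_tan, e_tan), outer(e_orth, e_orth)
    return [[a * T[i][j] + b * O[i][j] for j in range(d)] for i in range(d)]

def variance_along(S, u):
    """Variance u^T S u of the Gaussian along the unit direction u."""
    return sum(u[i] * S[i][j] * u[j] for i in range(2) for j in range(2))

S = pps_covariance_2d(theta=0.3, alpha=0.5, beta=1.0)
tangent = [math.cos(0.3), math.sin(0.3)]
normal = [-math.sin(0.3), math.cos(0.3)]
print(variance_along(S, tangent), variance_along(S, normal))
```

With α = 0.5 the variance along the tangent is smaller than the variance along the normal, so the Gaussian is elongated orthogonally to the manifold, as in the first case of (2.36). The trace of the matrix equals d/β for every admissible α, so α only redistributes the variance between the two groups of directions.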

The EM algorithm can be used to estimate the PPS parameters W and β. The

clamping factor is fixed by the user and is assumed to be constant during the EM

iterations.

Chang proposes in [15] the use of a three dimensional latent space disposed as

a spherical manifold for the application of the PPS with nodes {tk}Kk=1 arranged

regularly on the surface of a sphere in R3 latent space. After a PPS model is fitted

to the data, the data themselves are projected into the latent space as points onto

a sphere (Figure 2.8).

The latent manifold coordinates yi of each data point xi are computed as in the

GTM,

$$\mathbf{y}_i = \sum_{k=1}^{K} r_{ik}\mathbf{t}_k \qquad (2.37)$$

where $r_{ik}$ are the responsibilities defined as

$$r_{ik} = p(\mathbf{t}_k \mid \mathbf{x}_i) = \frac{p(\mathbf{x}_i \mid \mathbf{t}_k)\,P(\mathbf{t}_k)}{\sum_{k'=1}^{K} p(\mathbf{x}_i \mid \mathbf{t}_{k'})\,P(\mathbf{t}_{k'})} \qquad (2.38)$$

Figure 2.8: (a) The spherical manifold in R3 latent space. (b) The spherical manifold in R3 data space. (c) Projection of data points t onto the latent spherical manifold. (Adapted from [78]).

For a spherical manifold $\|\mathbf{t}_k\| = 1$ for k = 1, ..., K and $\sum_{k=1}^{K} r_{ik} = 1$ for i = 1, ..., N;
thus Equation 2.37 implies that these coordinates lie within a unit sphere,
i.e. $\|\mathbf{y}_i\| \le 1$. In several of the experiments presented in Chang's thesis [15], the clusters
overlap much less when the fitted PPS is projected onto this spherical manifold.
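The bound on the latent coordinates follows from the projection being a convex combination of unit vectors. The sketch below checks this numerically under simplifying assumptions: equal priors P(t_k) = 1/K, isotropic Gaussian components (the GTM case rather than the oriented PPS covariance), six latent nodes on the coordinate axes, and a made-up mapping that places each node centre at twice its latent position.

```python
import math

def responsibilities(x, centres, beta):
    """Equation (2.38) for equal priors and isotropic Gaussians.

    The common normalising constant cancels, so only exponents are needed.
    """
    log_p = [-0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, m)) for m in centres]
    top = max(log_p)                      # subtract the maximum for stability
    w = [math.exp(lp - top) for lp in log_p]
    total = sum(w)
    return [wk / total for wk in w]

def latent_projection(r, latent_nodes):
    """Equation (2.37): y_i = sum_k r_ik t_k."""
    dim = len(latent_nodes[0])
    return [sum(rk * t[j] for rk, t in zip(r, latent_nodes)) for j in range(dim)]

# Six latent nodes on the unit sphere in R^3 (illustrative choice).
latent_nodes = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
# Hypothetical node centres in data space: the latent nodes scaled by 2.
centres = [[2.0 * c for c in t] for t in latent_nodes]

x = [1.5, 0.2, 0.1]
r = responsibilities(x, centres, beta=1.0)
y = latent_projection(r, latent_nodes)
print(y, math.sqrt(sum(c * c for c in y)))
```

The printed norm never exceeds one; it approaches one only when a single node takes almost all of the responsibility.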

2.4.5 Topology Representing Network

This method [58] is a straight-forward combination of neural gas and Competitive

Hebbian Learning (CHL). This, however, would apply also to the growing neural gas

model described later. Topology Representing Networks (TRN) are artificial neural

networks which use unsupervised algorithms to configure a topological representation

of input data. More properly, let xi be an input datapoint from a finite data set X:

X = {x1,x2, · · · ,xN}, and T = {t1, t2, · · · , tK} a topology representing network

composed of K neurons with reference vectors mk, k = 1, ..., K; then the set Vk of

all points in X which have mk as the closest vector, is called the Voronoi region (or

Voronoi polygon) of mk:

$$V_l = \left\{\mathbf{x}_i \in X : l = \arg\min_{k \in \{1,\ldots,K\}} \|\mathbf{x}_i - \mathbf{m}_k\|\right\} \qquad (2.39)$$

Hence, the partition of the input manifold induced by the set of K reference

vectors of given net T is called the Voronoi tessellation of input space: V =

{V1,V2, · · · ,VK} and

$$X \subseteq \bigcup_{i=1}^{K} V_i \qquad (2.40)$$



Finally, by connecting all pairs mi, mj whose Voronoi polygons Vi, Vj share

an edge, we get the corresponding Delaunay triangulation. This simply means that

TRN forms connectivity structures which are topology preserving with respect to

input data (i.e. neighbouring inputs tend to be mapped into neighbouring neurons);

such structure evolves as samples xi are sequentially presented to the net, thus giving

the possibility of acquiring further information about the process under examination.

The steps of the TRN algorithm are then:

1. Assign initial values to the $\mathbf{m}_k$, k = 1, ..., K, and set all the connections to zero, $c_{jk} = 0$.

2. Select an input pattern $\mathbf{x}_i$.

3. Calculate a rank $r_k(d) \in \{0, \ldots, K-1\}$ for each prototype k, where a rank of 0 indicates the closest and a rank of K − 1 the most distant prototype to $\mathbf{x}_i$.

4. Adapt the $\mathbf{m}_k$ according to the neural gas algorithm

$$\mathbf{m}_k^{new} = \mathbf{m}_k^{old} + \varepsilon(t)\exp\left(-r_k(d)/\rho(t)\right)\left(\mathbf{x}_i - \mathbf{m}_k^{old}\right) \qquad (2.41)$$

where

$$\rho(t) = \rho(0)\left[\rho(T)/\rho(0)\right]^{t/T} \qquad (2.42)$$

and

$$\varepsilon(t) = \varepsilon(0)\left[\varepsilon(T)/\varepsilon(0)\right]^{t/T} \qquad (2.43)$$

where t is the time step and T the total number of training steps, forcing more local changes with time.

5. If it does not exist already, create a connection $c_{i_0 i_1}$ between the prototypes ranked 0, $i_0$, and ranked 1, $i_1$, and set the age between $i_0$ and $i_1$ to 0 ("refresh" the age).

6. Increase the age of all connections of $i_0$ by setting $age_{i_0 i_j} = age_{i_0 i_j} + 1$ for all $i_j$ with $c_{i_0 i_j} > 0$.

7. Remove those connections of $i_0$ whose age exceeds $T_{max}$ by setting $c_{i_0 i_j} = 0$ for all $i_j$ with $c_{i_0 i_j} > 0$ and $age_{i_0 i_j} > T_{max}$.

8. Increase the time parameter t = t + 1. If t < T continue with step 2.
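The eight steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the parameter defaults, the initialisation from random data points, the dictionary used to store edge ages, and the toy two-cluster data are all assumptions.

```python
import math
import random

def schedule(start, end, t, T):
    """Equations (2.42)/(2.43): start * (end / start) ** (t / T)."""
    return start * (end / start) ** (t / T)

def train_trn(data, K, T, eps=(0.5, 0.01), rho=(3.0, 0.1), t_max=20, seed=0):
    rng = random.Random(seed)
    # Step 1: prototypes start at randomly chosen data points; no connections.
    protos = [list(rng.choice(data)) for _ in range(K)]
    age = {}  # (i, j) with i < j -> age; an edge exists iff its key is present
    for t in range(T):
        x = rng.choice(data)                                   # step 2
        dist2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in protos]
        order = sorted(range(K), key=lambda k: dist2[k])       # step 3: ranks
        e_t = schedule(eps[0], eps[1], t, T)
        rho_t = schedule(rho[0], rho[1], t, T)
        for rank, k in enumerate(order):                       # step 4: (2.41)
            h = e_t * math.exp(-rank / rho_t)
            protos[k] = [m + h * (a - m) for a, m in zip(x, protos[k])]
        i0, i1 = order[0], order[1]
        age[(min(i0, i1), max(i0, i1))] = 0                    # step 5
        for edge in list(age):
            if i0 in edge:
                age[edge] += 1                                 # step 6
                if age[edge] > t_max:                          # step 7
                    del age[edge]
    return protos, sorted(age)                                 # step 8 is the loop

# Toy data: two well separated Gaussian clusters (illustrative only).
rng = random.Random(1)
data = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)] + \
       [[rng.gauss(3, 0.1), rng.gauss(3, 0.1)] for _ in range(50)]
protos, edges = train_trn(data, K=6, T=2000)
print(protos)
print(edges)
```

Note that the prototype positions are changed only in step 4; steps 5 to 7 read the winner and runner-up indices but never move a prototype.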



The updating of the prototypes in TRN is then uniquely done by the NG clustering

technique. The CHL algorithm does not modify their positions, but just finds the

topology preserving map according to those positions, by connecting them depending

on their distance.

2.4.6 Growing Neural Gas

The Growing Neural Gas (GNG) algorithm [28], is an unsupervised clustering al-

gorithm that can be considered a variation of the previous Topology Representing

Network. It uses the clustering properties of the Neural Gas algorithm, and the

variation is in the topology preservation of the mapping. The induced Delaunay

triangulation is in charge of the topology preservation, but the neighbourhood in-

formation is maintained by a variant of competitive Hebbian learning (CHL) [56];

that is, for each input signal xi an edge is inserted between the two closest nodes,

measured by Euclidean distance. GNG is an adaptive algorithm in the sense that

if the input distribution slowly changes over time, GNG is able to adapt, that is to

move the nodes so as to cover the new distribution. Starting with two nodes the

algorithm constructs a graph in which nodes are considered neighbours if they are

connected by an edge.

The growing version means that it is not necessary to decide on the number of

nodes to use a priori since nodes are added incrementally during execution. The

increment in the number of nodes stops when a user-defined performance criterion is

met or when a maximum network size has been reached.

The GNG algorithm assumes that each node k consists of the following:

• mk - a reference vector or node in input space.

• errork - a local accumulated error variable.

• A set of edges defining the topological neighbours of node k.

Each new unit is inserted near the unit which has accumulated most error locally

(see Figure 2.9). As in TRN, each edge has an age variable used to decide when to

remove old edges in order to keep the topology updated. The nodes are moved by

NG again, and CHL is responsible for generating the topology.

There is a similar algorithm presented in [18], where the above growth is applied

to K-Means creating the Growing K-Means algorithm, which they present as simpler

and faster than GNG.



Figure 2.9: An illustration of the GNG algorithm. (i) The state of the GNG algorithm after 500 iterations (2 nodes): one node is located in the left-most data cluster, the other node is oscillating between the top-most and bottom-most data clusters. (ii) After 1000 iterations (3 nodes), a third node has been inserted and the nodes now cover the three data clusters. (iii) After 50000 iterations, 50 nodes are spread out over the three data clusters matching the topology. [37]

2.4.7 Topology preserving Elastic net

The elastic net, introduced in [21], was first applied to the traveling salesman
problem (i.e. visiting a number of cities in the most efficient way), using an
optimization approach.

The energy function is:

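$$E(\{\mathbf{m}_k\}, Z) = -\alpha Z\sum_{i=1}^{N}\log\sum_{k=1}^{K}\exp\left(-\frac{\|\mathbf{x}_i - \mathbf{m}_k\|}{Z^2}\right) + \frac{\beta}{2}\sum_{k=1}^{K-1}\|\mathbf{m}_k - \mathbf{m}_{k+1}\|^2 \qquad (2.44)$$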

where xi are the cities, mk the neurons that represent the tour stops, α and β control

the influence of the two terms, and Z is a simulated-annealing term that decreases

to a pre-specified value.

For each Z find mk that approximate all tour points with the minimum length

mapping. Then decrease Z. Finally a solution close to the global minimum of the

traveling salesman combinatorial problem is obtained by iterative optimization.

Adapt mk according to gradient descent:



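$$\Delta\mathbf{m}_k = -Z\frac{\partial E}{\partial\mathbf{m}_k} = \alpha\sum_{i=1}^{N} w_{ik}(\mathbf{x}_i - \mathbf{m}_k) + \beta Z(\mathbf{m}_{k+1} + \mathbf{m}_{k-1} - 2\mathbf{m}_k) \qquad (2.45)$$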

Each neuron is subjected to two forces, one that attracts the neuron towards the

datapoint, and the elastic tension or force that pulls towards the mid-point between

neurons. The weights are calculated as

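$$w_{ik} = \frac{\exp\left(-\frac{\|\mathbf{x}_i - \mathbf{m}_k\|}{Z^2}\right)}{\sum_{l=1}^{K}\exp\left(-\frac{\|\mathbf{x}_i - \mathbf{m}_l\|}{Z^2}\right)} \qquad (2.46)$$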

Repeat until Z is small enough or mK represents a good enough approximation for

xi.
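One update of the tour points according to (2.45), with the weights of (2.46), might be sketched as below. The exponent is taken exactly as printed in (2.46). Treating the tour as an open chain, so that the two end points feel tension from a single neighbour, is an assumption that follows the K − 1 terms of the tension sum in (2.44); the function name and the numerical values are hypothetical.

```python
import math

def elastic_net_step(cities, tour, Z, alpha, beta):
    """Return the tour points after one application of update (2.45)."""
    K, dim = len(tour), len(tour[0])
    delta = [[0.0] * dim for _ in range(K)]
    for x in cities:
        # Weights w_ik of (2.46); exponent as printed there.
        expo = [-math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m))) / Z ** 2
                for m in tour]
        top = max(expo)                       # shift for numerical stability
        w = [math.exp(e - top) for e in expo]
        s = sum(w)
        for k in range(K):
            for j in range(dim):
                delta[k][j] += alpha * (w[k] / s) * (x[j] - tour[k][j])
    for k in range(K):
        for j in range(dim):
            tension = 0.0
            if k > 0:                         # pull towards the previous point
                tension += tour[k - 1][j] - tour[k][j]
            if k < K - 1:                     # pull towards the next point
                tension += tour[k + 1][j] - tour[k][j]
            delta[k][j] += beta * Z * tension
    return [[m + dm for m, dm in zip(tour[k], delta[k])] for k in range(K)]

cities = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
tour = [[0.4, 0.4], [0.6, 0.4], [0.6, 0.6], [0.4, 0.6]]
for Z in (0.5, 0.4, 0.3, 0.2):                # decreasing annealing parameter
    tour = elastic_net_step(cities, tour, Z, alpha=0.2, beta=1.0)
print(tour)
```

For interior points the tension term equals m_{k+1} + m_{k−1} − 2m_k, which vanishes exactly when the point sits at the mid-point of its two neighbours.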

Tereshko [82, 81] developed the topology preserving elastic net which combines

both lateral and synaptic interactions to obtain topologically ordered representa-

tions (receptive fields) of an external stimulus. The author affirms that existing

neural models that preserve the topology by utilizing lateral interactions, such as

the Kohonen map, and by utilizing synaptic interactions, such as cortical mapping

and elastic net, appear as limiting cases of this model.

2.5 Software tools

Most of the algorithms presented in this thesis can be found implemented in different

programming languages on the internet. In this section we discuss three toolboxes

that include many of those techniques. Other useful addresses are:

• Kmeans++ code is available online at www.stanford.edu/Darthur/Kmeanspptst.zip

• A Java implementation of Hard Competitive Learning, Neural Gas, TRN, and

GNG can be accessed at:

http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html,

where it is embedded as Java applet into a Web page, but also available to

download.

• The implementation of Probabilistic Principal Surfaces is available in the

IDEAL (formerly LANS) toolbox at http://www.mathworks.com/matlabcentral.



2.5.1 SOM Toolbox

The SOM Toolbox [62] is a function package for Matlab implementing the Self-

Organizing Map (SOM) algorithm and more. You can train SOM with different

network topologies and learning parameters, compute different error and quality

measures for the SOM, visualize data projections using U-matrices, component

planes, cluster colour coding and colour linking between the SOM and other vi-

sualization methods, and do correlation and cluster analysis with SOM. The SOM

Toolbox also features other data analysis methods related to vector quantization,

clustering, dimension reduction, and proximity preserving projections, e.g., data pre-

processing tools, K-Means, K-Nearest Neighbor classifier and LVQ (Learning Vector

Quantizer), agglomerative hierarchical clustering and dendrograms, principal com-

ponent analysis (PCA), Sammon’s projection, and Curvilinear Component Analysis

(CCA).

2.5.2 GTM Toolbox

This implementation of the GTM [6] runs under Matlab and includes all the neces-

sary machinery to create and experiment with GTM, including data visualisation;

it also includes a demo. It comes as a set of Matlab functions and scripts, together

with two short routines in C, which may be compiled into mex-files which can be

called directly from Matlab, provided you have a C-compiler supported by Matlab.

However, the toolbox can also be used as a pure Matlab implementation. In terms

of documentation, there is a User’s Guide in postscript format, which contains a

reference section for all the functions and scripts in the toolbox. The reference in-

formation is also available as Matlab help comments and as a set of html-files, which

can be viewed with a browser. Note that the documentation does not cover any of

the underlying theory of the GTM, for which you are referred to the papers on the

GTM.

2.5.3 Netlab Toolbox

The Netlab toolbox [7] has the advantage of having an accompanying text book

[61] published by Springer in their series Advances in Pattern Recognition. It is

widely used and many authors use Netlab as the basis for the programming of

new algorithms so that this toolbox is needed to use the new algorithms. Netlab is

designed to provide the central tools necessary for the simulation of theoretically well

founded neural network algorithms and related models for use in teaching, research



and applications development.

It consists of a toolbox of Matlab functions and scripts based on the approach

and techniques described in [12], but also including more recent developments in

the field. The functions come with Matlab on-line help, and further explanation

is available via HTML files. The software has been written by Ian Nabney and

Christopher Bishop.

The Netlab library includes software implementations of a wide range of data

analysis techniques, many of which are not yet available in standard neural network

simulation packages.

2.5.4 ICALAB

ICALAB toolboxes are presented with an easy-to-use interface that allows for easy

application of many different techniques, including preprocessing and postprocessing

tools. ICALAB for Signal Processing and ICALAB for Image Processing [25] are two

independent demo packages for MATLAB that implement a number of algorithms

for ICA employing higher order statistics, blind source separation employing second

order statistics and linear prediction, and blind signal extraction employing various

methods.

This package can also be used for multidimensional independent component

analysis and non independent blind source separation.

Preprocessing tools include principal component analysis, pre-whitening, high

pass filtering, low pass filtering, and subband filters.

Postprocessing tools include deflation and reconstruction of original raw data by

removing undesirable components, noise or artifacts.

The algorithms can also be applied to other techniques such as Blind Source Separation,
Factor Analysis and any other possible matrix factorization of the form X = HS + N.


Chapter 3

The Topographic Product of

Experts

3.1 Topographic Product of Experts

This is the first of the family of algorithms within the topology preserving maps

category presented in this thesis, that share a common property: all of them are

based on the Generative Topographic Mapping model. The general structure is

similar to the GTM, so that the underlying structure of the data can be represented

by K latent points, t1, t2, . . . , tK . To allow local and non-linear modeling, those

latent points are mapped through a set of M basis functions, f1(), f2(), . . . , fM().

This gives a matrix Φ where φkm = fm(tk). Thus each row of Φ is the response

of the basis functions to one latent point, or alternatively each column of Φ is the

response of one of the basis functions to the set of latent points. Typically these

functions are Gaussians centered in the latent space. The output of these functions

are then mapped by a set of weights, W , into data space. W is M × d and is the

sole parameter which is changed during training. We use $\mathbf{w}_m$ to represent the mth column of
W and $\Phi_k$ to represent the row vector of the mapping of the kth latent point. Thus
each latent point is mapped to a point in data space, $\mathbf{m}_k = (\Phi_k W)^T$.

3.1.1 Product of Experts

Hinton introduced the Product of Experts (PoE) in [35] with

$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K} p(\mathbf{x}_i \mid k) \qquad (3.1)$$


Chapter 3: ToPoE

where Θ is the set of current parameters in the model. The base model using a

Gaussian distribution is

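$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K}\left(\frac{\beta}{2\pi}\right)^{\frac{d}{2}}\exp\left(-\frac{\beta}{2}\|\mathbf{x}_i - \mathbf{m}_k\|^2\right) \qquad (3.2)$$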

where mk are the neurons in data space, β the inverse variance, and xi the

datapoints. The model was generalised into “products of Gaussian pancakes” in

[88]. The Gaussian case gives

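$$p(\mathbf{x}_i \mid \Theta) \propto \exp\left\{-\frac{1}{2}\sum_{k=1}^{K}(\mathbf{x}_i - \mathbf{m}_k)^T C_k^{-1}(\mathbf{x}_i - \mathbf{m}_k)\right\} \qquad (3.3)$$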

where Ck is the covariance matrix of the kth Gaussian expert. The covariance matrix

of the product can easily be shown to be related to the harmonic average of the

individual experts, i.e.

$$C_\Pi^{-1} = \sum_{k=1}^{K} C_k^{-1} \in \mathbb{R}^{d \times d} \qquad (3.4)$$

The mean of the product of experts can be shown to be

$$\mu_\Pi = C_\Pi\left(\sum_{k=1}^{K} C_k^{-1}\mathbf{m}_k\right) \in \mathbb{R}^{d} \qquad (3.5)$$

where a special case is $C_k = \frac{1}{\beta}I$.
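Equations (3.4) and (3.5) are easiest to see in one dimension, where the covariance matrices reduce to scalar variances. The following sketch, with a hypothetical function name and made-up numbers, combines Gaussian experts by adding their precisions.

```python
def product_of_gaussians_1d(means, variances):
    """Mean and variance of a product of 1-D Gaussian experts.

    Precisions (inverse variances) add, as in (3.4); the mean is the
    precision-weighted average of the expert means, as in (3.5).
    """
    precision = sum(1.0 / c for c in variances)
    c_prod = 1.0 / precision
    mu_prod = c_prod * sum(m / c for m, c in zip(means, variances))
    return mu_prod, c_prod

# Two unit-variance experts centred at 0 and 2.
print(product_of_gaussians_1d([0.0, 2.0], [1.0, 1.0]))

# Special case C_k = (1/beta) I with beta = 2 and K = 3 experts.
beta = 2.0
print(product_of_gaussians_1d([0.0, 3.0, 6.0], [1.0 / beta] * 3))
```

In the special case the product has variance 1/(Kβ) and its mean is the plain average of the expert means, so adding experts always sharpens the distribution.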

To fit this model to the data, the cost function is defined as the negative logarithm

of the probabilities of the data so that

$$J = \sum_{i=1}^{N}\sum_{k=1}^{K}\frac{\beta}{2}\|\mathbf{x}_i - \mathbf{m}_k\|^2 \qquad (3.6)$$

from which the learning rule for the weights is derived as,

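$$\Delta w_{mj} \propto -\frac{\partial J}{\partial w_{mj}} = \sum_{k=1}^{K}\beta\left(x_i^{(j)} - m_k^{(j)}\right)\frac{\partial m_k^{(j)}}{\partial w_{mj}} = \sum_{k=1}^{K}\beta\left(x_i^{(j)} - m_k^{(j)}\right)\phi_{km}$$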


3.1.2 The Topographic Product of Experts

Fyfe [30] introduced responsibilities into the Gaussians:

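$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K}\left(\frac{\beta}{2\pi}\right)^{\frac{d}{2}}\exp\left(-\frac{\beta}{2}\|\mathbf{x}_i - \mathbf{m}_k\|^2 r_{ik}\right) \qquad (3.7)$$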

where rik is the responsibility of the kth expert for the data point, xi. Thus all

the experts are acting in concert to create the data points but some will take more

responsibility than others. In contrast to the following algorithms presented in

Chapters 4 and 5, in ToPoE the responsibilities are calculated not only at the end,

as part of the visualisation step, but repeatedly on every iteration; this makes the

responsibilities more crucial in this model. The initial situation is a product of

experts situation, where all nodes have responsibilities for all datapoints, in contrast

to the mixture of experts where the responsibility regions are split between the

experts. The situation can progress in different ways however, always depending

on the modelling of the data, so that the final situation tends to be a mixture of

local products of experts. To prevent a situation where a data point has no expert

associated, if no expert has responsibility for a data point, they all are given equal

responsibility for that data point.

This model may be written as

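$$p(\mathbf{x}_i \mid \Theta) \propto \left(\frac{\beta}{2\pi}\right)^{\frac{d}{2}}\exp\left(-\frac{\beta}{2}\sum_{k=1}^{K}\left(\|\mathbf{x}_i - \mathbf{m}_k\|^2 r_{ik}\right)\right) \qquad (3.8)$$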

The objective is to maximise the likelihood of the data set X = {xi : i =

1, · · · , N} under this model. The ToPoE learning rule (3.10) is derived from the

minimisation of − log(p(xi|Θ)) with respect to a set of parameters which generate

the mk.

To change W in online learning, a data point is randomly selected, say xi. The

calculation of the current responsibility of the kth latent point for this data point is,

$$r_{ik} = \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K}\exp(-\gamma d_{il}^2)} \qquad (3.9)$$

where dik = ||xi −mk||, the Euclidean distance between the ith data point and the

projection of the kth latent point in data space (through the basis functions and then

multiplied by W ). γ is known as the width of the responsibilities. If no prototypes

are close to the data point (the denominator of (3.9) is zero), we set $r_{ik} = \frac{1}{K}, \forall k$.


To maximise (3.7) so that the data is most likely under this model, the − log()
of that probability is minimised. Define $m_k^{(d)} = \sum_{m=1}^{M}\phi_{km}w_{md}$, i.e. $m_k^{(d)}$ is the
projection of the kth latent point on the dth dimension in data space. Similarly let
$x_i^{(d)}$ be the dth coordinate of $\mathbf{x}_i$. These are used in the update rule

$$\Delta_i w_{md} = \sum_{k=1}^{K}\eta\,\phi_{km}\left(x_i^{(d)} - m_k^{(d)}\right) r_{ik} \qquad (3.10)$$

where ∆i signifies the change due to the presentation of the ith data point, xi, so

that the changes due to each latent point’s response to the data points are summed,

and η is the learning rate. Note that the Φ matrix is not changed during training
at all, and that β has been integrated in the responsibilities.

Since $-\log(p(\mathbf{x}_i \mid \Theta)) \propto \sum_{k=1}^{K}\|\mathbf{x}_i - \mathbf{m}_k\|^2 r_{ik}$, the maximisation of that probability
is equal to minimising the weighted mean square error.

The algorithm steps are then

1. Initialise the W weights randomly and spread the centres of the M basis functions uniformly in latent space.

2. Initialise the K latent points uniformly in latent space.

(a) count = 0

(b) Calculate the projection of the latent points to data space. This gives the K prototypes, $\mathbf{m}_k$.

(c) Select a random data point $\mathbf{x}_i$ and calculate $d_{ik} = \|\mathbf{x}_i - \mathbf{m}_k\|$.

(d) Calculate responsibilities that the kth latent point has for the ith data point with (3.9).

(e) Recalculate W using (3.10).

(f) If count < MAXCOUNT, count = count + 1 and return to (2b).

If we wish to use the mapping for visualisation, we must map data points into

latent space using the responsibilities; the new data point is placed at yi where

$$\mathbf{y}_i = \sum_{k=1}^{K} r_{ik}\mathbf{t}_k \qquad (3.11)$$
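The online algorithm and the projection (3.11) can be sketched in a few functions. This is a minimal illustration for a one-dimensional latent space, not the code used for the experiments in this thesis: the width of the basis functions, the range of the random initial weights, the toy data set and all function names are assumptions.

```python
import math
import random

def basis_matrix(latent, centres, width):
    """Phi with phi_km = f_m(t_k); Gaussian basis functions in a 1-D latent space."""
    return [[math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centres]
            for t in latent]

def project(Phi, W):
    """Prototypes m_k = (Phi_k W)^T for all K latent points."""
    d = len(W[0])
    return [[sum(p * W[m][j] for m, p in enumerate(row)) for j in range(d)]
            for row in Phi]

def responsibilities(x, protos, gamma):
    """Equation (3.9), with equal shares if the denominator underflows to zero."""
    w = [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, m))) for m in protos]
    total = sum(w)
    if total == 0.0:
        return [1.0 / len(protos)] * len(protos)
    return [v / total for v in w]

def train_topoe(data, K=20, M=5, gamma=20.0, eta=0.1, max_count=2000, seed=0):
    rng = random.Random(seed)
    d = len(data[0])
    centres = [m / (M - 1) for m in range(M)]          # step 1: basis centres
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(M)]
    latent = [k / (K - 1) for k in range(K)]           # step 2: latent points
    Phi = basis_matrix(latent, centres, width=1.0 / (M - 1))
    for _ in range(max_count):
        protos = project(Phi, W)                       # (b)
        x = rng.choice(data)                           # (c)
        r = responsibilities(x, protos, gamma)         # (d)
        for m in range(M):                             # (e): update rule (3.10)
            for j in range(d):
                W[m][j] += eta * sum(Phi[k][m] * (x[j] - protos[k][j]) * r[k]
                                     for k in range(K))
    return latent, Phi, W

def latent_position(x, latent, protos, gamma):
    """Equation (3.11): y_i = sum_k r_ik t_k."""
    r = responsibilities(x, protos, gamma)
    return sum(rk * tk for rk, tk in zip(r, latent))

def quantisation_error(data, protos):
    """Mean squared distance from each data point to its nearest prototype."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in protos)
               for x in data) / len(data)

# Toy data on a noisy curve (illustrative only).
rng = random.Random(1)
data = []
for i in range(60):
    x1 = 3.0 * i / 59.0
    data.append([x1, x1 + 1.25 * math.sin(x1) + rng.uniform(0.0, 0.3)])

latent, Phi, W = train_topoe(data)
protos = project(Phi, W)
print(quantisation_error(data, protos))
print(latent_position(data[0], latent, protos, gamma=20.0))
```

Only W changes inside the loop; Phi is computed once, which mirrors the remark above that the Φ matrix is not changed during training.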



3.1.3 Comparison with the GTM

The Generative Topographic Mapping (GTM) [8] is a mixture of experts model

which treats the data as having been generated by a set of latent points. These

K latent points are also mapped through a set of M basis functions and a set of

adjustable weights to the data space. The parameters of the combined mapping are

adjusted to make the data as likely as possible under this mapping. The GTM is

a probabilistic formulation so that if we define m = ΦW = Φ(t)W , where t is the

vector of latent points, the probability of the data is determined by the position of

the projections of the latent points in data space and so we must adjust this position

to increase the likelihood of the data. More formally, let

mk = Φ(tk)W (3.12)

be the projections of the latent points into the feature space. Then, if we assume

that each of the latent points has equal probability

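$$p(\mathbf{x}_i) = \sum_{k=1}^{K} P(k)\,p(\mathbf{x}_i \mid k) = \sum_{k=1}^{K}\frac{1}{K}\left(\frac{\beta}{2\pi}\right)^{\frac{d}{2}}\exp\left(-\frac{\beta}{2}\|\mathbf{x}_i - \mathbf{m}_k\|^2\right) \qquad (3.13)$$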

where d is the dimensionality of the data space. i.e. all the data is assumed to

be noisy versions of the mapping of the latent points. This equation should be

compared with (3.7) and (3.8).

In the GTM, the parameters W and β are updated using the EM algorithm

though the authors do state that they could use gradient ascent. Indeed, in the

ToPoE, the calculation of the responsibilities may be thought of as being a partial

E-step while the weight update rule is a partial M-step. The GTM has been described

as a “principled alternative to the SOM” however it may be criticised on two related

issues:

1. it is optimising the parameters with respect to each latent point independently.

Clearly the latent points interact.

2. using this criterion and optimising the parameters with respect to each latent

point individually does not necessarily give us a globally optimal mapping from

the latent space to the data space.

The ToPoE will overcome some of these shortcomings in that all data points are

acting together. Specifically if no latent point accepts responsibility for a data

point, the responsibility is shared equally amongst all the latent points.


The GTM, however, does have the advantage that it can optimise with respect

to β as well as W . However note that, in (3.7) and (3.8), the variance of each expert

is dependent on its distance from the current data point via the hyper-parameter,

γ. Thus we may define

$$(\beta_k)\big|_{\mathbf{x}=\mathbf{x}_i} = \beta r_{ik} = \beta\,\frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K}\exp(-\gamma d_{il}^2)} \qquad (3.14)$$

Therefore the responsibilities are adapting the width of each expert locally dependent

on both the expert’s current projection into data space and the data point for which

responsibility must be taken. Initially, $r_{ik} = \frac{1}{K}, \forall k, i$, and so we have the standard

product of experts. However during training, the responsibilities are redefined so

that individual latent points take more responsibility for specific data points. We

may view this as the model softening from a true product of experts to something

between that and a mixture of experts.

A model based on products of experts has some advantages and disadvantages.

The major disadvantage is that no efficient EM algorithm exists for optimising pa-

rameters. [35] suggests using Gibbs sampling but even with the method discussed

in that paper, the simulation times are excessive. Thus Fyfe [30] opted for gradient

descent as the parameter optimisation method.

The major advantage which a product of experts method has is that it is possible

to get very much sharper probability density functions with a product rather than

a sum of experts.

In the next sections we extend the ToPoE algorithm by including new kernels in

the responsibility calculation, analysing the local variance of the model, the projec-

tion of datapoints to the latent space and convergence properties, and investigating

the topology preservation through the magnification factors. We finally apply the

algorithm to several datasets in the simulation section.

3.2 Responsibility Estimation

Even though Gaussian kernels are the most often used, there are various other

possible kernels as shown in Table 3.1.

Kernel          K(u)
Uniform         $\frac{1}{2}\, I(|u| \le 1)$
Triangle        $(1 - |u|)\, I(|u| \le 1)$
Epanechnikov    $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$
Quartic         $\frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$
Triweight       $\frac{35}{32}(1 - u^2)^3\, I(|u| \le 1)$
Gaussian        $\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}u^2)$
Cosinus         $\frac{\pi}{4}\cos(\frac{\pi}{2}u)\, I(|u| \le 1)$
Tri-cube        $(1 - u^3)^3\, I(|u| \le 1)$

Table 3.1: Kernel functions for kernel estimation, where I is an indicator function.

If $\mathbf{y}_i$ is the point in latent space corresponding to $\mathbf{x}_i$, we have

$$\mathbf{y}_i = \sum_{k=1}^{K} r_{ik}\mathbf{t}_k \in L \qquad (3.15)$$

where $\mathbf{t}_k$ is the coordinate of the kth latent point and

$$r_{ik} = \frac{\exp(-\gamma\|\mathbf{x}_i - \mathbf{m}_k\|^2)}{\sum_{l=1}^{K}\exp(-\gamma\|\mathbf{x}_i - \mathbf{m}_l\|^2)} \qquad (3.16)$$

is determined in data space. (3.15) recalls the Nadaraya-Watson kernel estimator

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)} \qquad (3.17)$$

Table 3.1 shows alternative kernels that can be used in this situation. Two that

we use are the Epanechnikov quadratic kernel [34]

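$$D_\lambda(i,k) = \frac{d_{ik}^2}{\lambda} \quad\text{and}\quad C_\lambda(i,k) = \begin{cases}\frac{3}{4}\left(1 - D_\lambda(i,k)^2\right) & \text{if } |D_\lambda(i,k)| < 1 \\ 0 & \text{otherwise}\end{cases} \qquad (3.18)$$

and the Tri-cube function with

$$C_\lambda(i,k) = \begin{cases}\left(1 - D_\lambda(i,k)^3\right)^3 & \text{if } |D_\lambda(i,k)| < 1 \\ 0 & \text{otherwise}\end{cases} \qquad (3.19)$$

both of which have compact support with

$$r_{ik} = \frac{C_\lambda(i,k)}{\sum_{l=1}^{K} C_\lambda(i,l)}$$

or

$$\mathbf{y}_i = \frac{\sum_{k=1}^{K}\mathbf{t}_k C_\lambda(i,k)}{\sum_{k=1}^{K} C_\lambda(i,k)}$$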
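A small sketch of the compact-support responsibilities of (3.18) and (3.19) is given below. The function names and example values are hypothetical. The equal-share fallback for a data point that lies outside the support of every prototype is the rule stated in Section 3.1.2, carried over here as an assumption.

```python
def epanechnikov(D):
    """Equation (3.18): 3/4 (1 - D^2) inside the support, zero outside."""
    return 0.75 * (1.0 - D * D) if abs(D) < 1.0 else 0.0

def tricube(D):
    """Equation (3.19): (1 - D^3)^3 inside the support, zero outside."""
    return (1.0 - D ** 3) ** 3 if abs(D) < 1.0 else 0.0

def kernel_responsibilities(x, protos, lam, kernel):
    """r_ik = C(i, k) / sum_l C(i, l) with D(i, k) = d_ik^2 / lambda."""
    D = [sum((a - b) ** 2 for a, b in zip(x, m)) / lam for m in protos]
    c = [kernel(v) for v in D]
    total = sum(c)
    if total == 0.0:
        # No prototype within the support: share responsibility equally.
        return [1.0 / len(protos)] * len(protos)
    return [v / total for v in c]

protos = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
x = [0.2, 0.0]
print(kernel_responsibilities(x, protos, lam=2.0, kernel=epanechnikov))
print(kernel_responsibilities(x, protos, lam=2.0, kernel=tricube))
```

Unlike the Gaussian case, the distant third prototype receives a responsibility of exactly zero, so it is left untouched by the update rule for this data point.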

3.3 The Actual Variance

Let us write the ToPoE model as

$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K}\exp\left(-\beta\|\mathbf{x}_i - \mathbf{m}_k\|^2 r_{ik}\right) \qquad (3.20)$$

Inserting the Gaussian responsibilities gives us

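$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K}\exp\left(-\|\mathbf{x}_i - \mathbf{m}_k\|^2\,\frac{\beta\exp(-\gamma\|\mathbf{x}_i - \mathbf{m}_k\|^2)}{\sum_{l=1}^{K}\exp(-\gamma\|\mathbf{x}_i - \mathbf{m}_l\|^2)}\right) \qquad (3.21)$$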


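If we write

$$\alpha_i = \frac{1}{\beta}\sum_{l=1}^{K}\exp(-\gamma\|\mathbf{x}_i - \mathbf{m}_l\|^2) \qquad (3.22)$$

then $\alpha_i$ is dependent only on the position of the ith data point, and we may
write (3.21) as

$$p(\mathbf{x}_i \mid \Theta) \propto \prod_{k=1}^{K}\exp\left(-\frac{\|\mathbf{x}_i - \mathbf{m}_k\|^2}{\alpha_i\exp(\gamma\|\mathbf{x}_i - \mathbf{m}_k\|^2)}\right) \qquad (3.23)$$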

so that we may see that the local variance of the model around the ith data point

due to the kth Gaussian expert is

$$\sigma_{ik} = \alpha_i\exp(\gamma\|\mathbf{x}_i - \mathbf{m}_k\|^2) \qquad (3.24)$$

Note that this means that experts whose representation in data space, mk, is far

from the current data point are estimating a large variance whereas experts whose

representation is close to the data point estimate a much smaller variance locally

and so are able to put far more of their probability mass around the data point.

Therefore using (3.4) we see that the local variance from the whole model at the

ith data point is

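$$\sigma_i = \frac{1}{\sum_{k=1}^{K}\frac{1}{\sigma_{ik}}} = \frac{\alpha_i}{\sum_{k=1}^{K}\frac{1}{\exp(\gamma\|\mathbf{x}_i - \mathbf{m}_k\|^2)}} \qquad (3.25)$$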

Further we may predict where the model will put its estimate of the maximum

likelihood position of the ith data point i.e. its estimate of where a denoised estimate

lies as

$$\mu_i = \sigma_i\left(\sum_{k=1}^{K}\frac{1}{\sigma_{ik}}\mathbf{m}_k\right) \qquad (3.26)$$

Again we note that points far from the data point will have little effect on this

position since they are estimating large variance while points closer will contribute

greatly to this estimate since their estimate of the local variance is much smaller.

Examination of (3.26) shows that it leads to an identical projection of data points
as the previously used value of $\sum_{k=1}^{K} r_{ik}\mathbf{m}_k$, which was intuitively satisfying but now
has a more complete rationale.
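The identity can be checked in one line from (3.22), (3.24) and (3.25). Writing $d_{ik} = \|\mathbf{x}_i - \mathbf{m}_k\|$ as in (3.9), the ratio of the overall local variance to the variance of a single expert is exactly the Gaussian responsibility:

```latex
\frac{\sigma_i}{\sigma_{ik}}
  = \frac{\alpha_i}{\sum_{l=1}^{K}\exp(-\gamma d_{il}^2)}
    \cdot \frac{1}{\alpha_i \exp(\gamma d_{ik}^2)}
  = \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K}\exp(-\gamma d_{il}^2)}
  = r_{ik},
\qquad\text{so}\qquad
\mu_i = \sum_{k=1}^{K}\frac{\sigma_i}{\sigma_{ik}}\,\mathbf{m}_k
      = \sum_{k=1}^{K} r_{ik}\,\mathbf{m}_k .
```

Note that $\alpha_i$, and with it β, cancels in the ratio, so the estimated position depends only on γ and on the distances to the prototypes.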

3.3.1 Simulations

To illustrate the above, we generated 60 two dimensional data points, (x1, x2), from

the function x2 = x1 +1.25 sin(x1)+µ where µ is noise from a uniform distribution.

For the first and the last 20 data points we draw µ from the uniform distribution in

46

Chapter 3: ToPoE

0 0.5 1 1.5 2 2.5 3 3.50.5

1

1.5

2

2.5

3

3.5

4Data and projections of latent points

Figure 3.1: The data are shown by red ’+’ marks. The projections of the latentpoints with γ = 20 are shown as blue ’*’s.

[0,0.3] while for the central 20 data points we draw µ from the uniform distribution

in [0,2] (see Figure 3.1). We show in Figure 3.1 the result of a simulation in which we

have 20 latent points deemed to be equally spaced in a one dimensional latent space,

passed through 5 Gaussian basis functions and then mapped to the data space by the

linear mapping W which is the only parameter we adjust. We use 10000 iterations

of the learning rule (randomly sampling with replacement from the data set) with

γ = 20, η = 0.1. The final placement of the projections of the latent points is shown

by the asterisks in the figure and we clearly see that the one dimensional nature of

the data has been identified. Also, the prototypes are placed along this manifold

in the order in which they appear in the latent space showing that a topographic

projection has been created.
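The experiment just described can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original simulation code: the range of x1, the latent interval [-1, 1], the placement and width of the basis functions and the initial scale of W are all assumptions, and the update used is the online rule given later as (3.40).

```python
import math
import random

def make_data(rng, n=60):
    """Samples of x2 = x1 + 1.25*sin(x1) + mu, with more noise in the middle third."""
    data = []
    for i in range(n):
        x1 = 3.5 * i / (n - 1)  # assumed input range, not stated in the text
        high = 2.0 if n // 3 <= i < 2 * n // 3 else 0.3
        data.append([x1, x1 + 1.25 * math.sin(x1) + rng.uniform(0.0, high)])
    return data

def responsibilities(x, prototypes, gamma):
    """Gaussian responsibilities; shifting by the minimum distance avoids underflow."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    d2_min = min(d2)
    e = [math.exp(-gamma * (v - d2_min)) for v in d2]
    total = sum(e)
    return [v / total for v in e]

def project(phi, W):
    """m = Phi W: map every latent point to data space."""
    return [[sum(p * W[j][d] for j, p in enumerate(row)) for d in range(len(W[0]))]
            for row in phi]

def train_topoe(data, K=20, M=5, gamma=20.0, eta=0.1, iters=10000, seed=0):
    rng = random.Random(seed)
    D = len(data[0])
    t = [-1.0 + 2.0 * k / (K - 1) for k in range(K)]  # latent points (assumed range)
    c = [-1.0 + 2.0 * j / (M - 1) for j in range(M)]  # basis centres (assumed)
    width = c[1] - c[0]                                # basis width (assumed)
    phi = [[math.exp(-(tk - cj) ** 2 / (2 * width ** 2)) for cj in c] for tk in t]
    W = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(M)]
    for _ in range(iters):
        x = rng.choice(data)  # random sampling with replacement
        m = project(phi, W)
        r = responsibilities(x, m, gamma)
        for j in range(M):
            for d in range(D):
                # Delta w_jd = eta * sum_k phi_kj * (x_d - m_kd) * r_k
                W[j][d] += eta * sum(phi[k][j] * (x[d] - m[k][d]) * r[k]
                                     for k in range(K))
    return project(phi, W), W

if __name__ == "__main__":
    data = make_data(random.Random(1))
    prototypes, W = train_topoe(data)
    print(prototypes[0], prototypes[-1])
```

Only W is adapted; the latent points and basis functions stay fixed, which is what ties neighbouring prototypes together in data space.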

In Figure 3.2, we show with asterisks the positions which the model estimates that

the data has come from. The model identifies a continuous distribution following

the manifold previously found. The datapoints are closer in the areas with higher

density in the original dataset. Finally in Figure 3.3, we show the responsibilities

adopted by the model for the data points. We see that the latent points in the centre

are sharing responsibility for the data points far more widely than those at either

end.

We see from these figures that the final responsibilities of the latent points for

the data points can be very narrow: often one latent point assumes much higher

responsibility than all the rest and typically non-zero probability is only assigned

to 2 or 3 latent points in the low noise region and no more than 4 or 5 in the high

noise region.

Figure 3.2: Data and estimated positions from the ToPoE model. The data and the estimated positions of the data from the model with γ = 20.

Figure 3.3: The responsibilities of the 20 latent points for generating the 60 data points with γ = 20 (axes: latent points, data points, responsibilities).

Figure 3.4: Data and centres with γ = 2. The data are shown by red '+' marks. The projections of the latent points when the ToPoE model is used with γ = 2 are shown as blue '*'s.

Another way of stating this is to say that many of the latent experts

are simply saying “I don’t know” when confronted with a data point. The ToPoE

method enables latent points which are far from the data to simply state that, as

far as they are concerned, the data points could have a high probability. The actual

probability of the data point is calculated from the experts whose projections are

close to the data point. This is rather more like a mixture of experts than a product

of experts, however the final probability is calculated as a product and so while the

training does move the model closer to a mixture, it is still firmly in the product of

experts camp.

However, we may change the projections by changing the value of γ. For Figures

3.4, 3.5 and 3.6, we use γ = 2. We see that there is a tendency for the responsibilities

to be more widely shared and so the map is drawn towards the centre. Also there

is less of an ability to denoise the data. The estimates of the data points’ positions

are pulled to their true positions rather than to the underlying manifold. Since

the responsibilities are inversely proportional to the variances, we may equally well

state that Figure 3.6 illustrates the fact that noisy regions of the data manifold will

exhibit a greater variance than less noisy regions.
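The effect of γ on how widely responsibility is shared can be illustrated with a small numerical sketch. The prototype layout, the data point and the 0.01 threshold used to count "active" experts are assumptions chosen for illustration only.

```python
import math

def responsibilities(x, prototypes, gamma):
    """Gaussian responsibilities; shifting by the minimum distance avoids underflow."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    d2_min = min(d2)
    e = [math.exp(-gamma * (v - d2_min)) for v in d2]
    total = sum(e)
    return [v / total for v in e]

def active_experts(r, threshold=0.01):
    """Number of latent points whose responsibility exceeds the threshold."""
    return sum(1 for v in r if v > threshold)

# Twenty prototypes along a line and one data point near the middle (assumed values).
prototypes = [[0.2 * k, 0.2 * k] for k in range(20)]
x = [1.9, 2.1]

sharp = responsibilities(x, prototypes, gamma=20.0)
broad = responsibilities(x, prototypes, gamma=2.0)
print(active_experts(sharp), max(sharp))
print(active_experts(broad), max(broad))
```

With the larger γ only the few prototypes nearest the data point carry appreciable responsibility; with the smaller γ it is spread over many more, which is the mechanism that draws the map towards the centre.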

Figure 3.5: Data and estimated positions. The data in red '+' marks and the estimated positions of the data from the model with γ = 2 in blue '*'s.

Figure 3.6: The responsibilities of the 20 latent points for generating the 60 data points with γ = 2 (axes: latent points, data points, responsibilities).

3.4 Cost functions and Convergence

We note that since m = ΦW,

$$\frac{\partial m}{\partial t} = \Phi \frac{\partial W}{\partial t} \tag{3.27}$$

In particular, for a given latent point,

$$\frac{\partial m_k}{\partial t} = \sum_{m=1}^{M} \phi_{km} \frac{\partial w_m^T}{\partial t} \tag{3.28}$$

where wm represents the mth row of W i.e. the weights from the mth basis function.

Since φkm > 0, ∀m, k (see footnote 1),

$$\frac{\partial m_k}{\partial t} = 0 \iff \frac{\partial w_m^T}{\partial t} = 0, \ \forall m \tag{3.29}$$

We will consider a cost function

$$J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \|x_i - m_k\|^2 r_{ik} \tag{3.30}$$

and show that the ToPoE learning rule can be considered to be approximately

minimising this cost function.

3.4.1 A simplified model

Consider a simplified model in which at each presentation of the data, we find which

projection of the latent points is closest to the data point and update only the

weights associated with the error due to this latent point’s projection. Let k∗ be the

latent point whose projection is closest to the data point, xi; then

$$\Delta_i w_{md} = \eta \phi_{k^* m} (x_i^{(d)} - m_{k^*}^{(d)}) \tag{3.31}$$

Let $\Lambda_k = \{x_i : k = \arg\min \|x_i - m_k\|\}$. Then $J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \|x_i - m_k\|^2 r_{ik}$

becomes

$$J_1 = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \|x_i - m_k\|^2 I(i, k) \tag{3.32}$$

where I(i, k) is the indicator function which returns 1 if mk is the closest projection

to xi, and 0 otherwise. Thus

$$J_1 = \frac{1}{2} \sum_{k=1}^{K} \sum_{x_i \in \Lambda_k} \|x_i - m_k\|^2 \tag{3.33}$$

and $\frac{\partial J_1}{\partial m_k} = 0 \iff m_k = \bar{x}_k$ where $\bar{x}_k$ is the average value of xi taken over Λk. In

the limit, as i → ∞, the value of mk → E(xi : xi ∈ Λk). We illustrate this model

on artificial data in Figure 3.7 where we have shown the data points in consecutive

Λk with alternating symbols.

Footnote 1: In practice, some, but not all, φkm may be 0 because of the representation of floating point numbers in the computer, but this does not change the basis of the argument.

Figure 3.7: The cut down ToPoE in which the projections are adjusted due to the effect on a single latent point. Alternating symbols and colors show the areas affected by different latent points.
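The stationarity condition for J1 can be illustrated with a short sketch that ignores the constraint m = ΦW and moves each prototype freely to the mean of its region Λk; this simplification and the toy data are assumptions made for illustration.

```python
def sq_dist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def nearest(x, prototypes):
    """Index k* of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda k: sq_dist(x, prototypes[k]))

def j1(data, prototypes):
    """J1 = 1/2 * sum_i ||x_i - m_k*(i)||^2, as in (3.32) and (3.33)."""
    return 0.5 * sum(min(sq_dist(x, m) for m in prototypes) for x in data)

def region_means(data, prototypes):
    """Mean of the data in each region Lambda_k; empty regions keep their prototype."""
    K, D = len(prototypes), len(data[0])
    sums = [[0.0] * D for _ in range(K)]
    counts = [0] * K
    for x in data:
        k = nearest(x, prototypes)
        counts[k] += 1
        for d in range(D):
            sums[k][d] += x[d]
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(prototypes[k])
            for k in range(K)]

# Toy data with two groups (assumed values).
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
prototypes = [[0.5, 0.5], [1.5, 1.5]]
updated = region_means(data, prototypes)
print(j1(data, prototypes), j1(data, updated))
```

Moving every prototype to its region mean cannot increase J1, which is the familiar K-Means argument; in the actual model the prototypes can only approach these means as far as the mapping ΦW allows.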

3.4.2 The full model

Returning to the full model, we have the cost function

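$$J = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \|x_i - m_k\|^2 r_{ik} \tag{3.34}$$

so that

$$\frac{\partial J}{\partial m_k} = -\sum_{i=1}^{N} (x_i - m_k) r_{ik} + \frac{1}{2} \sum_{i=1}^{N} \|x_i - m_k\|^2 \frac{\partial r_{ik}}{\partial m_k} \tag{3.35}$$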

Now, since

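$$r_{ik} = \frac{\exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} \tag{3.36}$$

then

$$\frac{\partial r_{ik}}{\partial m_k} = \frac{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \exp(-\gamma \|x_i - m_k\|^2) \, 2\gamma \|x_i - m_k\|}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} - \frac{\exp(-\gamma \|x_i - m_k\|^2) \exp(-\gamma \|x_i - m_k\|^2) \, 2\gamma \|x_i - m_k\|}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} \tag{3.37}$$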

Now if rik is large, the first term in the numerator is approximately equal to the

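second term and so $\frac{\partial r_{ik}}{\partial m_k} \approx 0$. If rik is small, the second term in the numerator is

approximately 0 and

$$\frac{\partial r_{ik}}{\partial m_k} = \frac{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \exp(-\gamma \|x_i - m_k\|^2) \, 2\gamma \|x_i - m_k\|}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} = \frac{\exp(-\gamma \|x_i - m_k\|^2) \, 2\gamma \|x_i - m_k\|}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} \approx 0$$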

Thus the ToPoE learning rule can be derived as an approximation to the minimisa-

tion of the cost function (3.30). At convergence,

$$\sum_{i=1}^{N} (x_i - m_k) r_{ik} = 0 \tag{3.38}$$

and so

$$m_k = \frac{\sum_{i=1}^{N} x_i r_{ik}}{\sum_{i=1}^{N} r_{ik}} \tag{3.39}$$

a weighted average of the data, rather like a Parzen window based approximation

of the mean of the data.
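The relation between (3.38) and (3.39) can be made concrete with a short sketch: holding the responsibilities fixed, the weighted average in (3.39) is exactly the point at which the sum in (3.38) vanishes. The data, prototypes and γ below are illustrative assumptions, and the prototypes are again treated as free parameters rather than as ΦW.

```python
import math

def responsibilities(x, prototypes, gamma):
    """Gaussian responsibilities (3.36), computed with a shift for numerical safety."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    d2_min = min(d2)
    e = [math.exp(-gamma * (v - d2_min)) for v in d2]
    total = sum(e)
    return [v / total for v in e]

def weighted_mean_step(data, prototypes, gamma):
    """Apply (3.39) once, with responsibilities computed at the current prototypes."""
    R = [responsibilities(x, prototypes, gamma) for x in data]
    D = len(data[0])
    new = []
    for k in range(len(prototypes)):
        total = sum(r[k] for r in R)
        new.append([sum(r[k] * x[d] for r, x in zip(R, data)) / total
                    for d in range(D)])
    return new, R

def residual(data, prototypes, R):
    """Left-hand side of (3.38) for every prototype, for given responsibilities R."""
    D = len(data[0])
    return [[sum(r[k] * (x[d] - m[d]) for r, x in zip(R, data)) for d in range(D)]
            for k, m in enumerate(prototypes)]

# Toy data and starting prototypes (assumed values).
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
prototypes = [[0.5, 0.5], [1.5, 1.5]]
new_prototypes, R = weighted_mean_step(data, prototypes, gamma=2.0)
print(new_prototypes)
print(residual(data, new_prototypes, R))  # numerically zero
```

Because the responsibilities themselves change when the prototypes move, the step has to be repeated; the residual is zero only with respect to the responsibilities used to compute the average.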


3.4.3 Projections of the latent points

The learning rule

$$\Delta_i w_{md} = \sum_{k=1}^{K} \eta \phi_{km} (x_i^{(d)} - m_k^{(d)}) r_{ik} \tag{3.40}$$

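is defined in terms of the weights. But since m = ΦW, $\frac{\partial m}{\partial t} = \Phi \frac{\partial W}{\partial t}$, or

$\frac{\partial W}{\partial t} = (\Phi^T \Phi)^{-1} \Phi^T \frac{\partial m}{\partial t}$, we may investigate the convergence of the projections of the latent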

points. Consider the lth latent point.

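$$\sum_{m=1}^{M} \Delta_i \phi_{lm} w_{md} = \eta \sum_{m=1}^{M} \phi_{lm} \sum_{k=1}^{K} \phi_{km} (x_i^{(d)} - m_k^{(d)}) r_{ik}$$

i.e.

$$\Delta_i m_l = \eta \sum_{k=1}^{K} (x_i^{(d)} - m_k^{(d)}) r_{ik} \sum_{m=1}^{M} \phi_{lm} \phi_{km}$$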

There are several consequences of this:

1. The last term $\sum_{m=1}^{M} \phi_{lm} \phi_{km}$ is constant for any particular l and is the basis

of topology preservation in the feature space: by construction, rows with large

scalar product with each other correspond to points in latent space close to

each other.

2. It is however maximal for terms in the centre of the latent space (unless we

have wrap around in the latent space). Therefore points in the centre of the

latent space will converge first.

3. There is no objective function for this dynamics in this coordinate representa-

tion of latent space. This follows from

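$$\frac{\partial \Delta m_l}{\partial m_r} = -r_{ir} \sum_{m=1}^{M} \phi_{lm} \phi_{rm} + \sum_{k=1}^{K} (x_i^{(d)} - m_k^{(d)}) \frac{\partial r_{ik}}{\partial m_r} \sum_{m=1}^{M} \phi_{lm} \phi_{km} \neq \frac{\partial \Delta m_r}{\partial m_l} \tag{3.41}$$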

Equality is a necessary and sufficient condition for the existence of an objective

function.
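The step from the weight update (3.40) to the update of the projections can be verified numerically. The sketch below uses small made-up matrices Φ and W (assumptions, chosen only so that the arithmetic is easy to follow) and checks that applying (3.40) and re-projecting gives the same change in ml as the closed form involving the scalar products of the rows of Φ.

```python
import math

def responsibilities(x, prototypes, gamma):
    """Gaussian responsibilities, computed with a shift for numerical safety."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    d2_min = min(d2)
    e = [math.exp(-gamma * (v - d2_min)) for v in d2]
    total = sum(e)
    return [v / total for v in e]

def project(phi, W):
    """m = Phi W."""
    return [[sum(p * W[j][d] for j, p in enumerate(row)) for d in range(len(W[0]))]
            for row in phi]

eta, gamma = 0.1, 2.0
phi = [[1.0, 0.5, 0.1],   # K = 3 latent points (rows), M = 3 basis functions
       [0.6, 1.0, 0.6],
       [0.1, 0.5, 1.0]]
W = [[0.0, 0.2], [1.0, 0.9], [2.0, 2.1]]  # M x D weights
x = [1.2, 0.8]                            # one data point
K, M, D = 3, 3, 2

m = project(phi, W)
r = responsibilities(x, m, gamma)

# Weight update (3.40), followed by re-projection of the latent points.
dW = [[eta * sum(phi[k][j] * (x[d] - m[k][d]) * r[k] for k in range(K))
       for d in range(D)] for j in range(M)]
W_new = [[W[j][d] + dW[j][d] for d in range(D)] for j in range(M)]
m_new = project(phi, W_new)
via_weights = [[m_new[l][d] - m[l][d] for d in range(D)] for l in range(K)]

# Closed form: Delta m_l = eta * sum_k (x - m_k) r_k * sum_m phi_lm phi_km.
gram = [[sum(phi[l][j] * phi[k][j] for j in range(M)) for k in range(K)]
        for l in range(K)]
direct = [[eta * sum((x[d] - m[k][d]) * r[k] * gram[l][k] for k in range(K))
           for d in range(D)] for l in range(K)]
print(via_weights)
print(direct)
```

The matrix of scalar products (called gram here, a name introduced for this sketch) is symmetric and largest between neighbouring latent points, which is how a correction caused by one prototype is shared with its neighbours.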

3.5 Magnification Factors and Dimensionality Estimation

In [10] the magnification factors of the GTM are discussed, which they define as the

ratio of the area, dA′, of the space traced out by the mapping on the manifold in

data space to the corresponding area, dA, traced out in latent space. It is readily

shown that this is equal to the determinant of the Jacobian of the transformation

between the coordinates in the two spaces which in turn equals the square root of

the determinant of the metric tensor.

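$$\frac{dA'}{dA} = J = \sqrt{g} = \left| \delta_{kl} \frac{\partial e_m^k}{\partial e_t^i} \frac{\partial e_m^l}{\partial e_t^j} \right|^{\frac{1}{2}} \tag{3.42}$$

where $e_m^i$ is the ith basis vector of the manifold in data space and $e_t^i$ is the ith basis

vector of the latent space.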

This is used to illustrate the separation of clusters on the manifold: points which

are well separated in data space but adjacent on the projection will be identifiable

by having locally a large value of this ratio. Points close in both basis systems will

have a relatively small value of the ratio. In practice, the ratio is calculated as

$$\frac{dA'}{dA} = |\Xi^T W^T W \Xi|^{\frac{1}{2}} \tag{3.43}$$

where $\Xi_{kj} = \frac{\partial \phi_{kj}}{\partial t_k} \propto (c_j - t_k) \phi_{kj}$ where cj is the centre of the jth Gaussian mapping

the latent space to the feature space.
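For a one dimensional latent space the magnification factor reduces to the length of the tangent vector dm/dt, which makes it easy to check. The sketch below is illustrative only: it assumes Gaussian basis functions of a common width, uses the convention m(t) = φ(t)W of this chapter with W stored as an M×D matrix, and compares the analytic value with a finite-difference estimate. The centres, width and weights are made-up values.

```python
import math

def basis(t, centres, width):
    """Gaussian basis functions phi_j(t) = exp(-(t - c_j)^2 / (2 width^2))."""
    return [math.exp(-(t - c) ** 2 / (2 * width ** 2)) for c in centres]

def basis_derivative(t, centres, width):
    """d phi_j / dt = (c_j - t) / width^2 * phi_j, i.e. proportional to (c_j - t) phi_j."""
    return [(c - t) / width ** 2 * p
            for c, p in zip(centres, basis(t, centres, width))]

def mapping(t, centres, width, W):
    """m(t) = phi(t) W for a single latent position t."""
    p = basis(t, centres, width)
    return [sum(p[j] * W[j][d] for j in range(len(W))) for d in range(len(W[0]))]

def magnification(t, centres, width, W):
    """Length of dm/dt: the 1D analogue of the ratio dA'/dA."""
    xi = basis_derivative(t, centres, width)
    dm = [sum(xi[j] * W[j][d] for j in range(len(W))) for d in range(len(W[0]))]
    return math.sqrt(sum(v * v for v in dm))

def magnification_fd(t, centres, width, W, h=1e-5):
    """Central finite-difference estimate of the same quantity."""
    hi = mapping(t + h, centres, width, W)
    lo = mapping(t - h, centres, width, W)
    return math.sqrt(sum(((a - b) / (2 * h)) ** 2 for a, b in zip(hi, lo)))

centres = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed basis centres
width = 0.5                            # assumed basis width
W = [[0.0, 0.1], [0.8, 1.4], [1.7, 2.2], [2.6, 3.9], [3.5, 3.3]]  # made-up weights

analytic = magnification(0.3, centres, width, W)
numeric = magnification_fd(0.3, centres, width, W)
print(analytic, numeric)
```

Large values flag latent positions at which a small step in latent space corresponds to a large jump in data space, which is the signature of a gap between clusters.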

However the above is based on the assumption that the data actually lie on a

manifold, which they may not. Also the method will not identify folds in the manifold.

For these types of purposes, we must consider the mapping from data space to latent

space (see footnote 2), where the points will actually be visualised:

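$$x_i \to y_i = \sum_{k=1}^{K} r_{ik} t_k = \sum_{k=1}^{K} \frac{\exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} t_k \tag{3.44}$$

when using the Gaussian kernel function or

$$x_i \to y_i = \sum_{k=1}^{K} r_{ik} t_k = \frac{\sum_{k=1}^{K} C(i, k) t_k}{\sum_{k=1}^{K} C(i, k)} \tag{3.45}$$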

in general, with C(.,.) defined as in e.g. (3.18) or (3.19). In both (3.44) and

(3.45), we are using the latent points, tk, as basis vectors. This is liable to be an

overcomplete basis since typically K >> L, the dimensionality of the latent space.

Therefore the representation of each point in this basis is non-unique.
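A small sketch makes both the projection (3.44) and the non-uniqueness of the representation concrete. The three latent points, the prototypes and γ are illustrative assumptions; the last two lines show two different responsibility vectors that produce the same latent position.

```python
import math

def gaussian_responsibilities(x, prototypes, gamma):
    """Responsibilities r_ik as in (3.44), computed with a shift for numerical safety."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    d2_min = min(d2)
    e = [math.exp(-gamma * (v - d2_min)) for v in d2]
    total = sum(e)
    return [v / total for v in e]

def latent_projection(r, latent):
    """y_i = sum_k r_ik t_k."""
    return [sum(rk * t[d] for rk, t in zip(r, latent))
            for d in range(len(latent[0]))]

latent = [[-1.0], [0.0], [1.0]]                       # K = 3 latent points, L = 1
prototypes = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # their images m_k in data space
x = [1.0, 1.0]

r = gaussian_responsibilities(x, prototypes, gamma=5.0)
y = latent_projection(r, latent)

# Two different responsibility vectors with the same projection.
y_a = latent_projection([0.5, 0.0, 0.5], latent)
y_b = latent_projection([0.0, 1.0, 0.0], latent)
print(y, y_a, y_b)
```

Since K exceeds L, many responsibility vectors map to the same yi, so the latent position alone does not reveal how sharply a point is assigned.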

Footnote 2: For simplicity in the notation we consider a one dimensional latent space.

Assuming (3.44) for responsibilities, we find that

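$$\frac{\partial y_i}{\partial x_i} = \frac{2\gamma \sum_{k=1}^{K} \exp(-\gamma \|x_i - m_k\|^2) \left\{ \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) [x_i - m_k + x_i - m_l] \right\} . t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} \tag{3.46}$$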

which may be better written as

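$$\frac{\partial y_i}{\partial x_i} = \frac{4\gamma \sum_{k=1}^{K} \exp(-\gamma \|x_i - m_k\|^2) \left\{ \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \left[ x_i - \frac{m_k + m_l}{2} \right] \right\} . t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} \tag{3.47}$$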

Let us assume that one responsibility dominates the others i.e. that xi is much

closer to mk∗ than it is to any other projected latent point. Then

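$$\frac{\partial y_i}{\partial x_i} \approx \frac{4\gamma \{\exp(-\gamma \|x_i - m_{k^*}\|^2)\}^2 [x_i - m_{k^*}] . t_{k^*}}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} \tag{3.48}$$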

i.e. the closer xi is to the prototype, mk∗, the lower the rate of change of its position

on the projected manifold. This gives a fine tuning effect within the centres of

clusters. Now let xi be close to a small number of projected prototypes indexed by

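k ∈ T. Letting $\exp(-\gamma \|x_i - m_k\|^2) = d_k$, (3.47) becomes

$$\frac{\partial y_i}{\partial x_i} = \frac{4\gamma \sum_{k \in T} d_k \left\{ \sum_{l \in T} d_l \left[ x_i - \frac{m_k + m_l}{2} \right] \right\} . t_k}{\left( \sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2) \right)^2} \tag{3.49}$$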

In particular, if T = {k1, k2}, (3.49) becomes

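$$\frac{\partial y_i}{\partial x_i} \propto d_{k_1}^2 (x_i - m_{k_1}) t_{k_1} + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] (t_{k_1} + t_{k_2}) + d_{k_2}^2 (x_i - m_{k_2}) t_{k_2} \tag{3.50}$$

At convergence, $\frac{\partial y}{\partial x_i} = 0$ and so

$$\left\{ d_{k_1}^2 (x_i - m_{k_1}) + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] \right\} t_{k_1} = \left\{ d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] + d_{k_2}^2 (x_i - m_{k_2}) \right\} t_{k_2}$$

If L ≥ 2, this basis is not overcomplete and so each of these coordinates must

independently equal 0:

$$d_{k_1}^2 (x_i - m_{k_1}) + d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] = 0$$

$$d_{k_1} d_{k_2} \left[ x_i - \frac{m_{k_1} + m_{k_2}}{2} \right] + d_{k_2}^2 (x_i - m_{k_2}) = 0$$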

and so

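$$d_{k_1}^2 (x_i - m_{k_1}) = d_{k_2}^2 (x_i - m_{k_2}) \tag{3.51}$$

Then

$$\frac{(x_i - m_{k_1})}{(x_i - m_{k_2})} = \frac{d_{k_2}^2}{d_{k_1}^2} = \frac{(\exp(-\gamma (x_i - m_{k_2})^2))^2}{(\exp(-\gamma (x_i - m_{k_1})^2))^2} = \frac{(\exp(\gamma (x_i - m_{k_1})^2))^2}{(\exp(\gamma (x_i - m_{k_2})^2))^2} \tag{3.52}$$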

the effect that xi has on each prototype is inversely proportional to the square

of the exponential of the distance from xi to the prototype.

3.6 Simulations

In this section we apply ToPoE to artificial and real datasets. It should be noted

that, when results are shown for different kernels (Gaussian, Epanechnikov or Tri-

cube), this kernel is applied just in the visualisation step of the algorithm, in order

to compare it with the other algorithms introduced in this document, which use

the responsibilities only in the visualisation step. The responsibilities calculation of

the training step is always as presented above, with Gaussian kernels. Experiments

carried out with different kernels in the training phase did not give significant changes

in the outputs.

3.6.1 Experiment 1: 1D Artificial Data

Figure 3.8 shows the result of a simulation in which we have up to 20 latent points

deemed to be equally spaced in a one dimensional latent space, passed through 5

Gaussian basis functions and then mapped to the data space by the linear mapping

W . We generated a two dimensional dataset, (x1, x2), from the function x2 =

x1 + 1.25 sin(x1) + µ where µ is noise from a uniform distribution in [0,1].

ToPoE does not actually need a growing scheme, in which latent points are added

progressively during training, and this is one of the appeals of the model. The

algorithms presented in Chapters 4 and 5 do however require a growing version.

Figure 3.8: The ToPoE topology preserving mappings of the 1D data (in red '+') with 2, 4, 8 and 20 latent points (in blue connected '+' marks). All the latent mk prototypes stay in the manifold.

3.6.2 Experiment 2: 2D Artificial Data

Using an artificial data set with four clusters spaced out evenly on a line we see the

results for ToPoE. This is the first of the experiments in which we compare the use of

different kernels in the calculation of responsibilities for visualisation purposes.

The locations of the prototypes in data space are logically the same in all three cases.

We notice that the prototypes lie along a continuous curve which is not spread over all

the clusters but stays in the centre of the figure. The Gaussian kernel does not separate well

the four clusters (Figure 3.9); both Epanechnikov and Tri-cube do it perfectly. The

reason for this difference is that, in the Epanechnikov and Tri-cube kernels, only

the neurons that are below a certain distance from the datapoint are used for the

projection into latent space. The Gaussian kernel however considers all the neurons,

and if there are many latent points with a very small responsibility, the projection

can be moved towards them. The Epanechnikov and Tri-cube kernel only consider

the neurons with higher responsibility, projecting only towards their position in the

two dimensional grid.
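The difference between the kernels can be reproduced with a small sketch of the general projection (3.45). The kernel definitions below are the standard textbook forms of the Epanechnikov and Tri-cube kernels with bandwidth λ, which may differ in detail from (3.18) and (3.19); the one dimensional prototypes, the latent points, the data point and the values of γ and λ are all illustrative assumptions.

```python
import math

def gaussian(d, gamma):
    """Gaussian kernel: every prototype receives a non-zero weight."""
    return math.exp(-gamma * d * d)

def epanechnikov(d, lam):
    """Standard Epanechnikov kernel with bandwidth lam (zero beyond distance lam)."""
    u = d / lam
    return 0.75 * (1.0 - u * u) if u < 1.0 else 0.0

def tricube(d, lam):
    """Standard Tri-cube kernel with bandwidth lam (zero beyond distance lam)."""
    u = d / lam
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def project_to_latent(x, prototypes, latent, kernel):
    """y = sum_k C_k t_k / sum_k C_k, with C_k = kernel(||x - m_k||)."""
    dist = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m))) for m in prototypes]
    c = [kernel(d) for d in dist]
    total = sum(c)
    if total == 0.0:
        # No prototype inside the kernel support: fall back to the nearest one.
        return list(latent[dist.index(min(dist))])
    return [sum(ck * t[j] for ck, t in zip(c, latent)) / total
            for j in range(len(latent[0]))]

# Ten prototypes bunched in the centre of the data, and a data point outside them.
latent = [[-1.0 + 2.0 * k / 9] for k in range(10)]
prototypes = [[0.1 * k] for k in range(10)]
x = [-1.0]

y_gauss = project_to_latent(x, prototypes, latent, lambda d: gaussian(d, 1.0))
y_epan = project_to_latent(x, prototypes, latent, lambda d: epanechnikov(d, 1.15))
y_tri = project_to_latent(x, prototypes, latent, lambda d: tricube(d, 1.15))
print(y_gauss, y_epan, y_tri)
```

With the Gaussian kernel the many distant prototypes, each with a small weight, drag the projection towards the middle of the latent space, whereas the compact kernels keep it close to the latent position of the nearest prototypes.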

Figure 3.9: Projection of the 4 clusters data for ToPoE with 10*10 latent points and Gaussian kernel (top), Epanechnikov kernel (middle) and Tri-cube kernel (bottom). Left: data space in which blue ∗ shows the positions of the mk prototypes and datapoints are in red '+'. Right: Latent space with yi projections.

Figure 3.10: ToPoE projection of the animals dataset: M=5*5, K=10*10. Legend: ∗ Dove, Hen, Duck, Goose; o Owl, Hawk, Eagle; + Fox, Dog, Wolf; ∗ Cat; . Tiger, Lion; ¦ Horse, Zebra, Cow.

3.6.3 Experiment 3: The Animals data set

In this case we use ToPoE to map a higher dimensional dataset. This animal dataset

was used in [64] with GTM and SOM, so we compare our results with those of these

models. This is a 13D dataset with 6 classes of animals (see footnote 3). The animals are classified

in six groups according to the number of legs, size, swimming, running or flying

animals etc. Figure 3.10 shows the ToPoE projection with 5*5 basis functions,

10*10 latent points and 0.1 for their width. We have birds and 4-legged animals

separated as in the GTM and SOM examples, but the clusters are better defined.

3.6.4 Experiment 4: Bank Notes Data

The bank data set has 200 observations of six properties of banknotes. Observations

were made for sets of 100 forged and 100 genuine banknotes.

ToPoE is able to completely separate the forged notes, in the lower half, from the genuine

bank notes in the upper half (Figure 3.11).

Footnote 3: http://student.ifs.tuwien.ac.at/animals.tar.gz

Figure 3.11: ToPoE projection of the bank note data. Forged notes denoted by red asterisks, genuines by green circles.

3.6.5 Experiment 5: The Fundamental Clustering Problems Suite

To illustrate the different kernels in section (3.2), we use some of the datasets that

appear in The Fundamental Clustering Problems Suite (FCPS) [85]; these datasets

are all low-dimensional, but some algorithms like K-Means have difficulty in clus-

tering them, so they are suitable for illustrating different visualisation capabilities

of the different kernels. We use specifically the Hepta and the Target datasets; the

first one has clusters with different densities while the second one includes several

outliers.

We see in Figure 3.12 that the Gaussian kernel does not give a good impression

of the correct topology for the hepta dataset, so that it is not possible to identify

which is the central cluster (in red) from the projection. Using the other kernels

however this cluster appears in the middle, and the Epanechnikov gives a better

visualisation. In Figure 3.13 we see how the location of the prototypes in data space

is similar, and that it is only the different calculation of the responsibilities after the

training that gives a different projection. ToPoE has been applied to other datasets

with clusters with different densities as in the Hepta dataset, and we found that

ToPoE tends to separate the clusters with more density in a different projection

area when using the Gaussian kernel. The reason could be again the amount of

latent points with very small responsibility that are pulling the projection towards

them. This could be useful when the purpose is to isolate areas of high density in

the data. If the aim is to maintain the topology, we have to use the right kernel for

visualisation.

Figure 3.12: ToPoE projection of the hepta data (top left), with the Gaussian kernel (top right), Epanechnikov λ=18 (bottom left) and Tri-cube kernels λ=18 (bottom right).

Similarly with the Target dataset, (Figure 3.14), the Gaussian kernel does not

maintain the topology of the clusters, while the other kernels are able to do so,

especially Epanechnikov, which is the only one to give an intuitively satisfying projection.

Figure 3.15 shows again prototypes in data space and projections for two of the ker-

nels. The first thing we notice is that ToPoE has imposed a structure for the location

of the prototypes. They all remain within a region, keeping a common direction,

both for the Gaussian and Epanechnikov kernels; this reflects the tension between

the shape of the data and the shape of the map in latent space; the Epanechnikov

kernel however is able to project the right topology in latent space.

Similarly with the hepta dataset, the Gaussian projection for ToPoE separates

the tighter cluster from the other six that are of similar density between them. The

localisation of prototypes with ToPoE for the hepta dataset (Figure 3.13) is also

within a fixed structure. They all remain within a continuous shape (similar to a

fan for this 3D experiment).

Figure 3.13: ToPoE projection of the hepta data with the Gaussian kernel (left) and the Epanechnikov kernel (right) at the bottom; the corresponding top figures show the position of the mk prototypes in data space.

3.6.6 Experiment 6: The Algae data set

This is a set of 118 samples from a scientific study of various forms of algae some of

which have been manually identified. Each sample is recorded as an 18 dimensional

vector representing the magnitudes of various pigments. 72 samples have been iden-

tified as belonging to a specific class of algae which are labeled from 1 to 9. 46

samples have yet to be classified and these are labeled 0. The results of a ToPoE

training are depicted in Figure 3.16. Figure 3.17 zooms into the central and left areas

of the previous figure, in order to better visualise the separation of clusters in the

projection space. Finally, Figure 3.18 adds the projection of the unclassified

datapoints, according to the previous training. To classify those new points we would

need an additional technique like K-Nearest Neighbours in order to associate each

datapoint to a particular class.
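One way such a step could look is a K-Nearest Neighbours vote carried out in latent space. The sketch below is purely illustrative: the function name knn_label, the value k = 3 and the toy latent positions are assumptions, and no such classification was carried out for the results reported here.

```python
from collections import Counter

def knn_label(y, labelled, k=3):
    """Majority label among the k labelled projections closest to y.

    labelled is a list of (latent_position, class_label) pairs.
    """
    by_distance = sorted(
        labelled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(y, item[0])),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy latent projections of classified samples (assumed values).
labelled = [
    ([0.0, 0.0], 1), ([0.1, 0.0], 1), ([0.0, 0.1], 1),
    ([1.0, 1.0], 2), ([1.1, 1.0], 2), ([1.0, 0.9], 2),
]

print(knn_label([0.05, 0.05], labelled))
print(knn_label([0.9, 1.0], labelled))
```

Whether the vote is better taken in latent space or in the original 18 dimensional space is a design choice; latent space is used here only because it matches the visualisation.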

It is of interest to compare the GTM on the same data: we use a two dimensional

latent space with a 10×10 grid for comparison. The results are shown in Figures

3.19 and 3.20. The GTM makes a very confident classification: we see that the

responsibilities for data points are very confidently assigned in that individual classes

tend to be allocated to a single latent point.

Figure 3.14: Target data (top left), ToPoE projection of the data with the Gaussian kernel (top right), with the Epanechnikov kernel λ=18 (bottom left) and Tri-cube kernel λ=18 (bottom right).

This, however, works against the GTM

in that, even when zooming in to the map, one cannot sometimes disambiguate the

two different classes such as at the points (1,-1) and (1,1). This was not alleviated

by using regularisation in the GTM though we should point out that we have a very

powerful model for a rather small data set.

In fact, we can control the level of quantisation in ToPoE by changing the γ

parameter in (3.9). For example by lowering γ, we share the responsibilities more

equally and so the map contracts to the centre of the latent space to get results

such as shown in Figure 3.21; the different clusters can still be identified but rather

less easily. Alternately, by increasing γ, one tends to get the data clusters confined

to a single node, that which has sole responsibility for that cluster, as in the Self-

Organizing Map.

Figure 3.15: ToPoE projection of the target data with the Gaussian kernel (left) and the Epanechnikov kernel (right) at the bottom; the corresponding top figures show the position of the mk prototypes in data space.

Figure 3.16: Projection of the 9 classes by the ToPoE (legend: one marker per class, Class 1 to Class 9).

Figure 3.17: Left: zooming in on the central portion. Right: zooming in on the left side.

Figure 3.18: The projection of the whole data set by the ToPoE (legend: Class 1 to Class 9, plus Unclassified).

Figure 3.19: The projection of the 9 classes of algae data given by the GTM (legend: Class 1 to Class 9).

Figure 3.20: The projection of the algae data given by the GTM (legend: Class 1 to Class 9, plus Unclassified).

Figure 3.21: By lowering the γ parameter, the ToPoE map is contracted (legend: Class 1 to Class 9, plus Unclassified).

3.7 Conclusions

We have reviewed the Topographic Product of Experts introduced in [30]. This is

the first of the algorithms in this thesis with the underlying model of the GTM.

ToPoE replaces mixture of experts by product of experts, and the EM algorithm by

gradient descent. We have extended the algorithm investigating its local variance,

the projection to latent space and convergence properties. We also study the use of

the magnification factors as a tool for measuring topology preservation.

We have applied ToPoE to several artificial and real datasets, analysing the

projections and the location of prototypes in latent space. We have seen how the

projections are in general good with a Gaussian kernel, but sometimes need a differ-

ent kernel to better visualise the topology of the data. Real and higher dimensional

datasets were correctly mapped with this algorithm, giving clusters more separated

in the projections for some of them than with the GTM algorithm.

Chapter 4

The Harmonic Topographic Mapping

This is the second of the family of algorithms within the topology preserving maps

category presented in this thesis, that share the Generative Topographic Map-

ping model, but with the important variation that the projection to the lower-

dimensionality space is separate from the clustering step, so that they are not in-

cluded in the same learning rule. The underlying structure is the same as in ToPoE,

and thus the same as in the GTM. The clustering technique though is based on the

K-Harmonic Means.

The first attempt was to apply the K-Harmonic Means (KHM) algorithm devel-

oped by Zhang [91, 92, 95] to the Self-Organising Map (Section 4.1); the results were

not so good, which seemed to suggest using KHM with a different frame of centres.

We then focused our attention on the GTM algorithm, and the Topographic Product

of Experts (Section 3.1). We found its non-linear projection more suitable for our

purpose (allowing us to separate projection from clustering in two different steps),

creating the Harmonic topographic mapping (HaToM)(Section 4.2). We developed

two versions of this algorithm, that give better projections in different situations.

4.1 The Harmonic Self-Organising Map

As seen in the literature review (Chapter 2), the Harmonic Average of K points,

a1, ..., aK , is defined as

$$HA(\{a_i, i = 1, \cdots, K\}) = \frac{K}{\sum_{k=1}^{K} \frac{1}{a_k}} \tag{4.1}$$


Chapter 4: HaToM

Using this, Zhang et al developed the K-Harmonic Means algorithm whose re-

cursive formula to update the prototypes is

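$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{4.2}$$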

In an attempt to improve the SOM algorithm using the clustering properties of

K-Harmonic Means, we included this recursive formula in the update rule for the

prototypes of the SOM, creating the Harmonic SOM (HSOM),

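$$m_k = \frac{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2} x_i \Lambda_i(k, k^*)}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}} \tag{4.3}$$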

where Λi(k, k∗) denotes the value of the neighbourhood function when k∗ is the

winning neuron when the network is presented with the ith data point, xi.
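The updates (4.1) to (4.3) can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code used for the simulations: dik is taken to be the Euclidean distance between xi and mk, a small floor eps guards against zero distances, the neurons are placed on a one dimensional grid, and the neighbourhood function is assumed to be a Gaussian of the grid distance to the winner. As in (4.3), the neighbourhood factor multiplies the numerator only.

```python
import math

def harmonic_average(values):
    """HA = K / sum_k (1 / a_k), as in (4.1)."""
    return len(values) / sum(1.0 / a for a in values)

def sq_dist(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))

def khm_weights(data, prototypes, eps=1e-12):
    """q_ik = 1 / (d_ik^4 * (sum_l 1 / d_il^2)^2), the weight shared by (4.2) and (4.3)."""
    weights = []
    for x in data:
        d2 = [max(sq_dist(x, m), eps) for m in prototypes]
        s = sum(1.0 / v for v in d2)
        weights.append([1.0 / (v * v * s * s) for v in d2])
    return weights

def khm_update(data, prototypes):
    """One pass of the K-Harmonic Means update (4.2)."""
    q = khm_weights(data, prototypes)
    N, D = len(data), len(data[0])
    new = []
    for k in range(len(prototypes)):
        den = sum(q[i][k] for i in range(N))
        new.append([sum(q[i][k] * data[i][d] for i in range(N)) / den
                    for d in range(D)])
    return new

def hsom_update(data, prototypes, grid, width):
    """One pass of the Harmonic SOM update (4.3) with an assumed Gaussian neighbourhood."""
    q = khm_weights(data, prototypes)
    N, D, K = len(data), len(data[0]), len(prototypes)
    winners = [min(range(K), key=lambda k: sq_dist(x, prototypes[k])) for x in data]
    new = []
    for k in range(K):
        den = sum(q[i][k] for i in range(N))
        nb = [math.exp(-(grid[k] - grid[w]) ** 2 / (2 * width ** 2)) for w in winners]
        new.append([sum(q[i][k] * nb[i] * data[i][d] for i in range(N)) / den
                    for d in range(D)])
    return new

# Toy data and prototypes (assumed values).
data = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9]]
prototypes = [[0.2, 0.2], [0.8, 0.8]]
print(harmonic_average([1.0, 2.0, 4.0]))
print(khm_update(data, prototypes))
print(hsom_update(data, prototypes, grid=[0.0, 1.0], width=1.0))
```

With a very wide neighbourhood the factor Λi(k, k∗) tends to 1 for every neuron and the HSOM update coincides with the plain K-Harmonic Means update.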

4.1.1 HSOM Simulations

Uniform distribution artificial data

We first illustrate the Harmonic SOM on the standard data set which is drawn

from a uniform distribution on [0,1]×[0,1]. In Figure 4.1, we show the results of

a simulation in which (4.3) was performed 200 times for each data point. In fact,

there was very little change after the first 20 iterations and so the map found is very

stable. The prototypes were initialised randomly near the mean of the dataset. We

see that nearby prototypes are in fact positioned close to one another in data space

but, because the dimensionality of the map is not the same as that of the data, some

data points which are close to one another are not quantised to prototypes which

are adjacent on the map.

2D artificial data

Now we investigate the ability of the Harmonic SOM to enable users to visualise

manifolds of lower dimensionality embedded in a higher dimensionality dataset,

projecting into a two dimensional space. To illustrate this, we generated 1000 two

dimensional data points, (x1, x2), from the function x2 = x1 +1.25 sin(x1)+µ where

µ is noise from a uniform distribution in [0,1]. Thus this two dimensional data exists

close to a one dimensional manifold.

Figure 4.1: The prototypes of a Harmonic SOM trained on data from a uniform distribution on [0,1]×[0,1].

With randomly initialised prototypes, we get

the results shown in Figure 4.2: the prototypes approximately find the manifold but

there is a twist in the mapping. This would inhibit a user from identifying the shape

of the manifold on which the data lies. Extensive simulations have shown that such

twists are very difficult to remove.

We therefore repeat this experiment but initialise the prototypes to lie on the

first principal component of the data and so the prototypes initially lie on a straight

line in data space spanning the direction of greatest spread. Now (Figure 4.3 left)

the trained map lies on the manifold in a way that enables a user to identify the

manifold merely from the quantisation of the data to the prototypes.
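The initialisation along the first principal component can be sketched without any numerical library by using power iteration. The function names, the number of iterations and the choice to spread the prototypes evenly between the smallest and largest projection of the data are assumptions made for this illustration.

```python
import math

def first_principal_component(data, iters=100):
    """Mean and leading eigenvector of the covariance, found by power iteration."""
    n, D = len(data), len(data[0])
    mean = [sum(x[d] for x in data) / n for d in range(D)]
    centred = [[x[d] - mean[d] for d in range(D)] for x in data]
    v = [1.0] * D  # starting direction (assumed not orthogonal to the component)
    for _ in range(iters):
        # Covariance times v, up to the factor 1/n: sum_i (x_i . v) x_i
        proj = [sum(a * b for a, b in zip(x, v)) for x in centred]
        w = [sum(p * x[d] for p, x in zip(proj, centred)) for d in range(D)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    return mean, v

def init_on_first_pc(data, K):
    """Place K prototypes evenly on the first principal axis, spanning the data."""
    mean, v = first_principal_component(data)
    scores = [sum((x[d] - mean[d]) * v[d] for d in range(len(v))) for x in data]
    lo, hi = min(scores), max(scores)
    return [[mean[d] + (lo + (hi - lo) * k / (K - 1)) * v[d] for d in range(len(v))]
            for k in range(K)]

# Noise-free points on the line x2 = 2 * x1 (assumed toy data).
data = [[0.5 * i, 1.0 * i] for i in range(9)]
prototypes = init_on_first_pc(data, K=5)
print(prototypes)
```

Because the prototypes start in the order in which they appear on the map, the twist seen with random initialisation has no opportunity to form at the outset.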

However there is one continuing failing in the mapping which is that the mapping

does not stick to the centre of the manifold but moves from side to side across the

manifold: the mapping is responding too finely to the actual positions of the data

set i.e. to the noise in the data set. This indeed was one of the positive aspects of

the original application of harmonic means to K-Means but is less helpful when we

wish to use the map as a data visualisation tool.

To solve this in this particular case we had to reduce the number of iterations to

two (Figure 4.3 right), which may or may not be sufficient for a real data visualisation

problem, but certainly suggests a lack of stability in the mapping.

Figure 4.2: The map found by the HSOM is centered in the data but has a twist (data in red '+', prototypes in blue '*').

Figure 4.3: The Harmonic SOM (data in red '+', prototypes in blue '*'), follows the manifold more smoothly with 2 iterations (right), than with 10 (left).

4.2 Harmonic Topographic Map

In the light of the good results obtained with ToPoE, and wishing to apply K-

Harmonic Means within a different projection structure, we applied this clustering

technique to the same underlying structure as ToPoE and GTM. We thus remove the

probabilistic underpinnings in the Generative Topographic Mapping. The Harmonic

Topographic Map (HaToM) [59, 60, 67, 68, 69, 70, 71, 72] is then the sum of the GTM

projection plus the K-Harmonic Means, this time separated into two different steps.

This separation allows for the creation of two different variations of the algorithm,

as shown below: Data-driven HaToM (D-HaToM) and Model-driven HaToM (M-

HaToM). The HaToM has the same structure as the GTM, but the similarity ends

there because the objective function is not the GTM one, nor is it optimised with

the Expectation-Maximization (EM) algorithm. Instead, the HaToM uses the well-proven clustering abilities of the K-Means algorithm, improved by using harmonic means to make it insensitive to initialisation [95].

The basic batch algorithm often exhibited twists, such as are well-known in

the SOM [48], so we developed a growing method that prevents the mapping from

developing these twists. The growing also provides a number of advantages discussed

below.

One of the main attractions of the HaToM compared with ToPoE is that the

algorithm does not require responsibilities. These are only used when we are using

HaToM to visualise data, i.e. when we are working in latent space.

4.2.1 Data-driven HaToM

In [67] we opted to begin with a small value of K (for one dimensional latent spaces,

K=2; for two dimensional latent spaces and a square grid, K=2*2) and grow the

mapping. However we do not randomise W each time we augment K. The current

value of W is approximately correct and so we need only continue training from this

current value. Also we use a pseudo-inverse method for the calculation of W from

the positions of the prototypes.

The algorithm in [67] was

1. Initialise K to 2. Initialise the W weights randomly and spread the centres of

the M basis functions uniformly in latent space.

2. Initialise the K latent points uniformly in latent space.


3. Calculate the projection of the latent points to data space. This gives the K prototypes, $m_k$.

(a) count=0.

(b) Randomly select a datapoint $x_i$; calculate $d_{ik} = \|x_i - m_k\|$ for the K prototypes.

(c) Recalculate prototypes using (4.2).

(d) If count<MAXCOUNT, count = count + 1 and return to 3b.

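4. Recalculate W using

$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$

where $\Xi$ is the matrix containing the K prototypes, $I$ is the identity matrix and $\delta$ is a small constant, necessary because initially K < M and so the matrix $\Phi^T \Phi$ is singular.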

5. If K < Kmax, K = K + 1 and return to 2.
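The weight recalculation in step 4 can be sketched in plain Python as follows. This is only an illustrative sketch, not the thesis implementation: the helper names (`transpose`, `matmul`, `solve`, `recalc_weights`), the default `delta` and the use of Gauss-Jordan elimination instead of an explicit inverse are all assumptions made for the example. `phi` is taken to be the K x M matrix of basis function activations and `xi` the K x D matrix of prototypes.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def recalc_weights(phi, xi, delta=1e-3):
    """Return W (M x D) from phi (K x M) and the prototypes xi (K x D).

    delta is a hypothetical default; it is only added when K < M,
    where phi^T phi would otherwise be singular.
    """
    k, m = len(phi), len(phi[0])
    phi_t = transpose(phi)
    gram = matmul(phi_t, phi)  # M x M matrix phi^T phi
    if k < m:
        for j in range(m):
            gram[j][j] += delta
    return solve(gram, matmul(phi_t, xi))
```

Solving the linear system rather than forming the inverse gives the same W in exact arithmetic and is usually the more stable choice numerically.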

If we wish to use the mapping for visualisation, we must map data points into

latent space. To do this, we define the responsibility that the kth latent point has

for the ith data point as

$$r_{ik} = \frac{\exp(-\gamma \|x_i - m_k\|^2)}{\sum_{l=1}^{K} \exp(-\gamma \|x_i - m_l\|^2)} = \frac{\exp(-\gamma d_{ik}^2)}{\sum_{l=1}^{K} \exp(-\gamma d_{il}^2)} \qquad (4.4)$$

and the new data point is placed at yi in latent space where

$$y_i = \sum_{k=1}^{K} r_{ik} t_k \qquad (4.5)$$
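A minimal sketch of equations (4.4) and (4.5) in plain Python is given below. The function names and the default `gamma` are assumptions made for the example; `latent_points` holds the fixed latent positions $t_k$. Subtracting the smallest squared distance before exponentiating leaves the ratio in (4.4) unchanged and only protects against underflow.

```python
import math


def responsibilities(x, prototypes, gamma=1.0):
    """Responsibilities r_ik of every prototype for one data point, eq. (4.4)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in prototypes]
    shift = min(d2)  # cancels in the ratio, avoids exp() underflow
    w = [math.exp(-gamma * (v - shift)) for v in d2]
    total = sum(w)
    return [v / total for v in w]


def project_to_latent(x, prototypes, latent_points, gamma=1.0):
    """Latent position y_i as the responsibility-weighted mean of t_k, eq. (4.5)."""
    r = responsibilities(x, prototypes, gamma)
    dim = len(latent_points[0])
    return [sum(r_k * t[j] for r_k, t in zip(r, latent_points)) for j in range(dim)]
```

A larger `gamma` corresponds to a narrower responsibility function, so each data point is pulled more strongly towards the latent point of its nearest prototype.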


4.2.2 Model-driven HaToM

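The steps for the M-HaToM model are:

1. Initialise K to 2. Initialise the W weights randomly and spread the centres of the M basis functions uniformly in latent space.

2. Initialise the K latent points uniformly in latent space. Set count=0.

3. Calculate the projection of the latent points to data space. This gives the K prototypes, $m_k = \phi_k^T W$.

4. Randomly select a datapoint $x_i$; calculate $d_{ik} = \|x_i - m_k\|$ for the K prototypes.

5. Recalculate prototypes using (4.2).

6. Recalculate W using

$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$

with the same notation as before.

7. If count<MAXCOUNT, count = count + 1 and return to 3.

8. If K < Kmax, K = K + 1 and return to 2.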


The projection method is the same as above. The difference with the first version

is that the projection is more constrained by the non-linear model, which is con-

stantly imposed inside the inner loop. This ensures a smoother manifold as shown

in Section 4.3.1.

4.2.3 Generalised Harmonic Topographic Map

The generalised version of both D-HaToM and M-HaToM, the Generalised Harmonic

Topographic Map or G-HaToM [65, 66] uses the generalised version of KHM. The

only change from the HaToM is in the recalculation of the prototypes, which in this

case is:

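$$m_k = \frac{\sum_{i=1}^{N} \dfrac{1}{d_{ik}^{p+2} \left( \sum_{l=1}^{K} \dfrac{1}{d_{il}^{p}} \right)^{2}} \, x_i}{\sum_{i=1}^{N} \dfrac{1}{d_{ik}^{p+2} \left( \sum_{l=1}^{K} \dfrac{1}{d_{il}^{p}} \right)^{2}}} \qquad (4.6)$$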

so that p determines the power of the L2 distance used in the algorithm. This version has not only the soft membership that allows for a continuous transition from areas of high density of data, which was already present in the ungeneralised versions, but also boosting properties due to the dynamic weighting function, since the effect of any particular data point on the re-calculation of a prototype is $O(\|x_i - m_k\|^{2p-p-2})$, which for p > 2 has greatest effect for larger distances. Again, the generalised M-HaToM imposes the model to a greater extent, while the generalised D-HaToM leaves more freedom for the prototypes to move.

Figure 4.4: The D-HaToM topology preserving mappings of the 1D data with 2, 4, 8 and 20 latent points (data in red '+', prototypes in blue '*').
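The generalised prototype update (4.6) can be sketched as a single batch step in plain Python. This is a sketch under stated assumptions: the function name `khm_update`, the batch formulation and the small floor `eps` on the distances (to avoid division by zero when a data point coincides with a prototype) are not taken from the thesis.

```python
import math


def khm_update(data, prototypes, p=2.0, eps=1e-12):
    """One batch update of all prototypes following eq. (4.6).

    data: list of N points, prototypes: list of K points (same dimension).
    eps is a hypothetical floor on the distances d_ik.
    """
    dim = len(data[0])
    # d[i][k] = Euclidean distance between data point i and prototype k
    d = [[max(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m))), eps)
          for m in prototypes] for x in data]
    # (sum_l 1 / d_il^p)^2 for every data point i
    inv_sum_sq = [sum(dk ** -p for dk in row) ** 2 for row in d]
    updated = []
    for k in range(len(prototypes)):
        weights = [1.0 / (d[i][k] ** (p + 2) * inv_sum_sq[i])
                   for i in range(len(data))]
        total = sum(weights)
        updated.append([sum(w * x[j] for w, x in zip(weights, data)) / total
                        for j in range(dim)])
    return updated
```

With p=2 this reduces to the ungeneralised K-Harmonic Means form of the update; each new prototype is a weighted mean of the data, so it always stays inside the convex hull of the data set.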

4.3 HaToM Simulations

4.3.1 Experiment 1: 1D Artificial Data

Figure 4.4 shows the result of D-HaToM applied to the 1D case. We see that, for

a sufficiently small number of latent points, the one dimensional nature of the data

set is revealed but when the number of latent points exceeds 15, the manifold found

begins to wander across the true manifold.

Figure 4.5 shows how the M-HaToM algorithm solves this problem, i.e. we can increase the number of latent points as far as we want without losing the manifold shape.

Figure 4.5: The M-HaToM topology preserving mappings of the 1D data with 2, 4, 8 and 20 latent points (data in red '+', prototypes in blue '*'). All the latent points stay in the centre of the manifold.

The reason for the creation of the smooth manifold compared to the D-HaToM algorithm is twofold:

1. The δI is a regularising term which ensures that the mapping does not wander about the data space but sticks closely to the manifold.

2. However, even when we remove this term (for K ≥ M), the regularisation continues since we are compressing the re-construction of Ξ into M dimensions:

$$\Xi = \Phi^T W \quad \text{where} \quad W = (\Phi \Phi^T)^{-1} \Phi \, \Xi$$

Therefore

$$\Xi = \Phi^T (\Phi \Phi^T)^{-1} \Phi \, \Xi$$
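The compression in the second point can be made explicit with a short worked derivation. It assumes, as the last expression above implies, that Φ is here the M x K matrix of basis activations and that $\Phi \Phi^T$ is invertible (which requires K ≥ M); the symbol P is introduced only for this illustration.

```latex
P = \Phi^{T}(\Phi\Phi^{T})^{-1}\Phi, \qquad
P^{2} = \Phi^{T}(\Phi\Phi^{T})^{-1}
        \underbrace{\Phi\Phi^{T}(\Phi\Phi^{T})^{-1}}_{I_{M}}\Phi = P, \qquad
P^{T} = P,
\\[4pt]
\operatorname{rank}(P) = \operatorname{tr}(P)
  = \operatorname{tr}\!\big((\Phi\Phi^{T})^{-1}\Phi\Phi^{T}\big) = M .
```

P is therefore an orthogonal projector of rank M: the K re-constructed prototypes are confined to an M-dimensional subspace spanned by the basis functions, which is what keeps the mapping smooth even without the δI term.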

4.3.2 Experiment 2: 2D Artificial Data

Using an artificial data set with four clusters spaced out evenly we see how the D-

HaToM keeps all the mk prototypes inside the clusters (Figure 4.6 Top left) but it

maps the clusters too close together in latent space, impeding the users’ ability to

differentiate between them (Figure 4.6 Top right); whereas the M-HaToM has a few

mk prototypes outside the clusters but gives a good clustering for visualisation in

latent space (Figure 4.6 Bottom). Note that Figure 4.6 (left) shows the projections

of the latent points in data space while Figure 4.6 (right) shows the positions which

the data points assume in latent space.

To be able to separate the clusters in latent space for the D-HaToM, the user

has to tune the parameters. In Figure 4.7 (top) we see the D-HaToM projection of

the same data changing the width of the responsibility function (i.e. making it less

wide). Now the clusters are well separated. The same parameters in the M-HaToM

give also a tighter clustering (Figure 4.7 bottom).

With the M-HaToM, the W changes with each iteration which re-imposes the

model structure on the mapping in every iteration, while with the D-HaToM, the

W is only changed when the number of latent points is changed which leaves the

data in charge within the inner loop; so in the first case the model forces the mk

prototypes to have a smooth change (so as not to leave a big space between clusters)

while in the second case the data is in charge and keeps the mk prototypes only

where the data is.
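The difference in loop structure described above can be sketched as a control-flow skeleton. It is only a sketch: `project`, `cluster_step` and `refit_weights` are hypothetical callbacks standing in for the projection to data space, one K-Harmonic Means update and the recalculation of W, and the function returns the number of W refits purely so that the two variants can be compared.

```python
def train(k_schedule, max_count, project, cluster_step, refit_weights, model_driven):
    """Skeleton of the growing loop for D-HaToM (model_driven=False)
    and M-HaToM (model_driven=True). Returns how often W was refitted."""
    refits = 0
    for k in k_schedule:                    # outer loop: grow the map
        prototypes = project(k)             # latent points -> data space
        for _ in range(max_count):          # inner loop: clustering
            prototypes = cluster_step(prototypes)
            if model_driven:
                # M-HaToM: re-impose the model after every clustering step
                refit_weights(prototypes)
                refits += 1
                prototypes = project(k)
        if not model_driven:
            # D-HaToM: W is only updated once per map size
            refit_weights(prototypes)
            refits += 1
    return refits
```

With three map sizes and ten inner iterations, the data-driven variant refits W three times and the model-driven variant thirty times, which is the source of both the extra smoothness and the extra cost of M-HaToM.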

Figure 4.6: Projection of the 4 clusters data with 10*10 latent points. Left: data space in which '.' shows the positions of the mk prototypes, and the datapoints are colored according to clusters. Right: latent space with projections using the same clusters' coloring. Top: D-HaToM. Bottom: M-HaToM.

Figure 4.7: Projection of the 4 clusters data with 10*10 latent points and narrower responsibilities. Left: data space in which '.' shows the positions of the mk prototypes, and the datapoints are colored according to clusters. Right: latent space with projections using the same clusters' coloring. Top: D-HaToM. Bottom: M-HaToM.

Figure 4.8: M-HaToM projection of the animals dataset: M=5, K=10. Legend: o Owl, Hawk, Eagle; + Fox, Dog, Wolf; ∗ Cat; . Tiger, Lion; ¦ Horse, Zebra, Cow.

4.3.3 Experiment 3: The Animals data set

Figure 4.8 shows the M-HaToM projection with 5*5 basis functions, 10*10 latent

points and 0.1 for the width of the Gaussians. We have birds and 4-legged animals

separated as in the GTM and SOM examples, but the clusters are better defined.

Figure 4.9 shows the clustering with the same parameters as above for the nor-

malised data, with D-HaToM. M-HaToM produces a better clustering with and

without normalising, and seems in this case to be a more robust algorithm, im-

proving also upon the projection done by ToPoE in Section 3.6.3, by separating the

different classes better in the projection figure.

4.3.4 Experiment 4: The Fundamental Clustering Problems Suite

To illustrate the use of the three kernels explained in Section 3.2, we again use the datasets from the Fundamental Clustering Problems Suite (FCPS) [85]. We also calculate the magnification factors used for the GTM [10], which are the same for all the algorithms introduced in this thesis because of their common underlying structure.

We see in Figure 4.10 that the Gaussian kernel in both D- and M-HaToM does not give a good impression of the correct topology for the hepta dataset, so that it is not possible to identify which is the central cluster (in red) from the projection. Figure 4.11 shows the improvement achieved with other kernels for M-HaToM. Similarly with the target dataset (Figure 4.12), the Gaussian kernel does not properly separate the outliers from the rest of the data for M-HaToM (D-HaToM projects correctly this time), while the other kernels are able to do so (Figure 4.13).

Figure 4.9: D-HaToM projection of the normalised animals dataset: M=5, K=10. Legend: o Owl, Hawk, Eagle; + Fox, Dog, Wolf; ∗ Cat; . Tiger, Lion; ¦ Horse, Zebra, Cow.

Figure 4.10: D-HaToM (left) and M-HaToM (right) projection of the hepta data into latent space with Gaussian kernels (bottom figures); the corresponding top figures show the position of the mk prototypes in data space.

Figure 4.11: M-HaToM projection of the hepta data (top left) with the Gaussian (top right), Epanechnikov λ=2 (bottom left) and Tri-cube λ=2 (bottom right) kernels.

Figure 4.12: D-HaToM (left) and M-HaToM (right) projection of the target data into latent space with Gaussian kernels (bottom figures); the corresponding top figures show the position of the mk prototypes in data space.

Figure 4.13: M-HaToM projection of the target data (top left) with the Gaussian (top right), Epanechnikov λ=2 (bottom left) and Tri-cube λ=2 (bottom right) kernels.

Figure 4.14: D-HaToM projection with 30*30 latent points of the target data with Gaussian (left), and Epanechnikov λ=4 (right). The magnification factors for each node are depicted by green circles, with the size proportional to its value.

The author considers that the projections for D- and M-HaToM with the Gaussian kernel are better than with ToPoE, but the outliers were only properly separated with D-HaToM. From Figure 4.12 we also see that the reason for this is the location of the mk prototypes in data space: while for M-HaToM they all stay within the main data, D-HaToM allocates some of the prototypes to the outlying regions.

The HaToM allocation of prototypes (Figure 4.11) shows the influence of K-Harmonic Means, which maintains the prototypes in data space where the clusters are. The M-HaToM algorithm, due to the more frequent imposition of the non-linear projection, always finds a smoother manifold, and the prototypes within the data clusters are always connected by prototypes in between the clusters; the low-density areas with outliers are not well covered by this version, though. D-HaToM allocates most of the prototypes within the main clusters, but also four of them to the four outlier areas, which helps to correctly project the topology of this dataset.

As with ToPoE, the Epanechnikov and Tri-cube kernels help to properly visualise

the data with M-HaToM. We can see the similarity between ToPoE and the model

version of HaToM that gives greater weight to the underlying model.

With respect to the magnification factors for each node for the D-HaToM (Figure

4.14) and M-HaToM (Figure 4.15), we see how the former is more sensitive to out-

liers, while the latter does not give a proper visualisation with the Gaussian kernel,

the areas with outliers being magnified to excess.

Figure 4.15: M-HaToM projection with 30*30 latent points of the target data with Gaussian (left), and Epanechnikov λ=4 (right). The magnification factors for each node are depicted by green circles, with the size proportional to its value.

The D-HaToM version, while being more sensitive to the visualisation parameters

as seen for the two-dimensional experiment, and creating a less smooth manifold

(i.e. the prototypes wander more around the data, though always inside the data

clusters), locates the centres always near the data so that the use of a specific

kernel has less influence on the visualisation, as the responsibilities will be high for

near datapoints and small for far away data. Also, the magnification factors show

equal magnification for data points and outliers, indicating a consistent relationship

between distances in data space and distances in latent space.

In the case of the M-HaToM version, the Gaussian kernel for the responsibilities

shows a bigger magnification area where the outliers are, which indicates that the

algorithm, trying to fix all data points into a smooth manifold, and being more

influenced by the model than the previous D-HaToM, has allocated prototypes in

between clusters, but not in the outliers area. The Epanechnikov kernel however,

seems to overcome this problem and gives a proper separation between data points

and outliers, due to its local properties, and shows again equal magnification factors

for all nodes around the data.

This shows that the different versions of the algorithm suit different kinds of data, e.g. D-HaToM is more suitable for finding outliers, while M-HaToM finds smoother manifolds.

4.3.5 Experiment 5: The Algae data set

In this section we apply D- and M-HaToM to the algae dataset. In the author's judgement, the D-HaToM algorithm with 6*6 basis functions gives a better visualisation of the clusters of the algae data. Figure 4.16 shows a clustering with a nearly complete separation of the different classes. With M-HaToM more basis functions are required (see Figure 4.17), and again we see a tightening of the projection due to model reinforcement.

Figure 4.16: The D-HaToM projection of the 9 labelled algae classes and 1 unlabelled class (0) on a harmonic mapping with a 2 dimensional set of 10*10 latent points (M=6).

Additional tightening is possible, as seen above, by reducing the width of the responsibilities (Figures 4.18 and 4.19). The magnification factors depicted in the same figures are smaller in the data areas (indicating tighter clusters in high-dimensional space). In Figure 4.18, however, we see that the Magnification Factors (MF) are higher for the areas of classes 9 and 6 (red diamonds and pink down triangles), reflecting the fact that the differences between the distances in the two spaces are higher for these two clusters.

It is worth noting that this mapping was achieved with 20 iterations of the algorithms, while for ToPoE we need at least 100,000 iterations. The time ToPoE saves by not growing the map is offset by its longer convergence time.

4.4 G-HaToM Simulations

Using the pth power of the L2 distances we are better able to separate into clusters high dimensional and also more complex data, such as the crabs or the oil data (see below). Also, the boosting-like weighting allows the algorithm to achieve a faster clustering on the algae data, as we will see, compared with both previous versions of HaToM. We call the models G-D-HaToM for the generalised D-HaToM and G-M-HaToM for the generalised M-HaToM.

Figure 4.17: The M-HaToM projection of the 9 labelled algae classes and 1 unlabelled class (0) on a harmonic mapping with a 2 dimensional set of 10*10 latent points (M=12).

Figure 4.18: D-HaToM projection of the algae data with narrower Gaussian (left). The Magnification factors for each node are depicted on the right side (M=6).

Figure 4.19: M-HaToM projection of the algae data with narrower Gaussian (left). The Magnification factors for each node are depicted on the right side (M=12).

4.4.1 Experiment 1: Crabs Data

This is a 5 dimensional dataset (footnote 1) on the morphology of rock crabs of genus Leptograpsus, with 50 specimens of each sex of each of two colour forms, blue and orange. This data is used by Svensen [80] to show the projection of the four clusters into latent space with the Generative Topographic Map (GTM). We illustrate the results of the G-D-HaToM algorithm in Figure 4.20, using the L2 distance to the third power, 20*20 latent points and 20 iterations, over non-normalised data (unlike the GTM, which needed normalising first). The projection keeps the two female clusters together in the lower part of the figure, while the male clusters are at the top; only the two sexes of the blue form lie closer together.

4.4.2 Experiment 2: Bank Notes Data

G-D-HaToM is able to completely separate forged from genuine bank notes (see

Figure 4.21), without normalising first as was needed for the non-generalised version.

Footnote 1: http://www.stats.ox.ac.uk/pub/PRNN/crabs.dat

Figure 4.20: G-D-HaToM projection of the two species of crabs with equal proportion of both sexes: p=3. Legend: ∗ Male blue form; o Female blue form; + Male orange form; . Female orange form.

Figure 4.21: G-D-HaToM projection of the bank note data with 25 basis functions and 4*4 latent points; p=2. Forged notes denoted by red asterisks, genuines by green circles.

Figure 4.22: G-D-HaToM projection of the oil data: p=5.

4.4.3 Experiment 3: Oil Data

The oil flow dataset (footnote 2) consists of 1000 points classified into three flow configurations. This is synthetic data modelling non-intrusive measurements on a pipe-line transporting a mixture of oil, water and gas. The flow in the pipe takes one out of three possible configurations: horizontally stratified, nested annular or homogeneous mixture flow. The data lives in a 12-dimensional measurement space, but for each configuration there are only two degrees of freedom: the fraction of water and the fraction of oil. (The fraction of gas is redundant, since the three fractions must sum to one.) Hence, the data lives on a number of 'sheets' which locally are approximately 2-dimensional. Its high dimensionality makes this data set well suited to showing the capability of an algorithm to visualise complex data sets. This data is used to check the hierarchical GTM in [83].

In this case again the pth power of the L2 distance separated the clusters better than D-HaToM. Figure 4.22 shows the projection onto a 2 dimensional map with 60 by 60 latent points, 40 iterations and 20*20 basis functions; the L2 distance was raised to the power of 5. Augmenting the number of centre points to 40*40 (Figure 4.23) we get a better separation of the three kinds of mixtures.

The advantage in comparison with the hierarchical GTM is the simplicity and

the corresponding lower computational cost.

Footnote 2: http://www.ncrg.aston.ac.uk/GTM/

Figure 4.23: G-D-HaToM projection of the oil data with more basis functions, p=5.

4.4.4 Experiment 4: Algae Data

HaToM gives a very good clustering of the algae data as we showed above (Figures

4.16 and 4.17). To illustrate the improvement with the G-HaToM (both versions),

we reduce the number of latent points to the minimum: G-D-HaToM is able to

cluster this data with only 4*4 latent points and p=3, as shown in Figure 4.24,

while the G-M-HaToM needs at least 10*10 latent points, p=6 with M=6, or p=3

with M=26 (see Figures 4.25 and 4.26). A clear difference with the generalisation

of both versions of HaToM is the increment in separation of clusters (reflected by

the increment in the MF), even intra-cluster in some cases. An example is Class

3 (blue stars), that seems to have two subclasses, always differentiated in these

projections. Further analyses within the clusters could be carried out with G-D-

and G-M-HaToM.

Figure 4.24: G-D-HaToM projection of the algae data: p=3 and 4*4 latent points, and Magnification Factors below.

Figure 4.25: G-M-HaToM projection of the algae data: p=6, M=6 and 10*10 latent points, and Magnification Factors below.

Figure 4.26: G-M-HaToM projection of the algae data: p=3, M=26 and 10*10 latent points, and Magnification Factors below.

4.5 Conclusion

This Chapter presents the second of the algorithms sharing a common structure with GTM, the Harmonic Topographic Mapping (HaToM). The main property of this algorithm is the use of the K-Harmonic Means clustering method that overcomes some of the drawbacks of K-Means. Another relevant characteristic is the separation of the clustering and projection steps, which allows the projection either to be included in the inner loop with the clustering steps or left outside it. Using that characteristic we develop two versions of HaToM, depending on the necessity to reinforce the model (M-HaToM) or the clustering (D-HaToM). Both of them can be further extended using the generalised version of K-Harmonic Means, which gives additional properties to the clustering.

We have also applied the algorithms in Chapters 3 and 4 to other tasks such as

Blind Source Separation (BSS) and forecasting [71]. These however are somewhat

separate from the main uses of topology preserving maps such as feature extraction,

clustering and visualisation, and so we do not report details here.


Chapter 5

The Topographic Neural Gas & Algorithms comparison

This chapter introduces the last of the new algorithms presented in this thesis.

The reason for a new clustering technique within the same underlying model as

ToPoE and HaToM was the search for a faster method. Both HaToM and ToPoE

have proved to be good alternatives to the Self-Organizing Map and the Generative

Topographic Mapping, with artificial and real data, but they both require longer

training time than ToNeGas. The new algorithm makes use of the Neural Gas

technique, and it is thus called the Topographic Neural Gas.

This chapter also discusses a comparison between the three algorithms, and also against the SOM. We comment on the differences in theory, in the experiments

already presented, and also new experiments that study the topology preservation

and quantization of the algorithms, for which we employ some of the functions in

the somtoolbox [62].

5.1 The Topographic Neural Gas

The Topographic Neural Gas (ToNeGas) [73] unifies the underlying structure of the

GTM for topology preservation, with the technique of Neural Gas which clusters

the prototypes in data space.

The Topographic Neural Gas gains advantages from the Neural Gas clustering as well as from the GTM-like structure. Three main advantages of the NG model are [74]:

1. faster convergence to low distortion errors,


2. lower distortion error than that resulting from K-Means clustering, maximum-

entropy clustering and Kohonen’s self-organizing map algorithm [47],

3. obeying a stochastic gradient descent on an explicit energy surface.

From the non-linear projection from latent space to data space, the algorithm obtains

topology preservation as well as visualisation on a low dimensional grid.

As we have seen in Section 2.3.1, in K-Means a set of example vectors is clustered

into a few prototypes iteratively such that a distortion measure is continuously

minimized. Every prototype has its mean vector. In a K-Means iteration, every

example vector is assigned to the prototype with the closest mean vector. After that

every mean vector is replaced by the average of all vectors that have been assigned

to it. The neural-gas algorithm is a generalization of the K-Means algorithm. The difference is that every example vector is assigned not to a single prototype but to several. It is assigned to the closest prototype with a

high weight and to other prototypes with smaller weights. After an iteration, the

mean of a prototype is replaced by the weighted average of all assigned vectors. In

this way, the neural gas algorithm is smoother, and every prototype gets to see all

data (some with a very low weight). The neural-gas algorithm uses a temperature

value t which defines what weight will be given to the closest prototype and to the

second closest prototype, etc. The initially higher values of the temperature enable bigger changes in the values (and uphill moves), so that local minima can be escaped, reaching other minima or regions of high data density. Progressively the

temperature diminishes and the changes allowed are much smaller (the probability

of uphill moves being smaller as well), keeping the values around the minimum

selected. The weights decay exponentially with the increasing distance-rank of the

prototypes, such that the nth-closest prototype gets a weight of exp(−(n − 1)/t).

In ToNeGas we usually use a very low temperature and decrease the temperature

every iteration such that in the end it is virtually zero, and the neural-gas algorithm

resembles the K-Means algorithm.
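The rank-based weighting described above can be illustrated with a few lines of plain Python. The function name `rank_weights` is an assumption made for this sketch; ranks are counted from zero, so the nth-closest prototype receives exp(−(n − 1)/t) as in the text.

```python
import math


def rank_weights(distances, temperature):
    """Neural-gas weight of each prototype for one data point.

    distances[k] is the distance from the data point to prototype k;
    the closest prototype has rank 0 and therefore weight 1.
    """
    order = sorted(range(len(distances)), key=lambda k: distances[k])
    weights = [0.0] * len(distances)
    for rank, k in enumerate(order):
        weights[k] = math.exp(-rank / temperature)
    return weights
```

At a high temperature many prototypes receive a noticeable weight, while for a temperature close to zero only the closest prototype is effectively updated, which is the K-Means-like limit mentioned above.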

The algorithm has been implemented based on the Neural Gas algorithm code

included in the SOM Toolbox for Matlab [62]. We develop the same growing model

as in HaToM, starting with 2*2 latent points in a square grid, and using the previous

W value when incrementing the number of latent points. The steps of the algorithm

are as follows:



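1. Initialise K to 2*2. Initialise the W weights randomly and spread the centres of the M basis functions uniformly in latent space.

2. Initialise the K latent points uniformly in latent space. Set t=0.

3. Calculate the projection of the latent points to data space. This gives the K prototypes, $m_k = \Phi(t_k)^T W$.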

4. Randomly select a datapoint.

5. Calculate the distances between the datapoint selected and all the prototypes.

6. Calculate the rank $r_k(d)$ of each prototype depending on the distances, and the neighborhood function $h_\rho(r_k(d)) = e^{-r_k(d)/\rho(t)}$ with

$$\rho(t) = \rho(0) \, [\rho(T)/\rho(0)]^{t/T}. \qquad (5.1)$$

7. Recalculate prototypes using the learning rule $m_k = m_k + \varepsilon(t) \, h_\rho[r_k(d)] \, (x - m_k)$ with

$$\varepsilon(t) = \varepsilon(0) \, [\varepsilon(T)/\varepsilon(0)]^{t/T}. \qquad (5.2)$$

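8. If t<MAXCOUNT, t=t+1 and return to 4.

9. Recalculate W using

$$W = \begin{cases} (\Phi^T \Phi + \delta I)^{-1} \Phi^T \Xi & \text{if } K < M \\ (\Phi^T \Phi)^{-1} \Phi^T \Xi & \text{if } K \geq M \end{cases}$$

10. If K < Kmax, increase K and return to 2.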


11. For every data point, $x_i$, calculate the Euclidean distance between the ith data point and the kth prototype as $d_{ik} = \|x_i - m_k\|$.

12. Calculate the responsibilities that the kth latent point has for the ith data point and the projections of each datapoint in latent space

$$r_{ik} = \frac{C_\lambda(i,k)}{\sum_{l=1}^{K} C_\lambda(i,l)} \quad \text{and} \quad y_i = \sum_{k=1}^{K} r_{ik} t_k \qquad (5.3)$$

where $t_k$ is the position of the kth latent point in latent space, and $C_\lambda(i,k)$ the Epanechnikov kernel

$$D_\lambda(i,k) = \frac{d_{ik}^2}{\lambda} \quad \text{and} \quad C_\lambda(i,k) = \begin{cases} \frac{3}{4}\left(1 - D_\lambda(i,k)^2\right) & \text{if } |D_\lambda(i,k)| < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$
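Steps 4 to 8 of the list above, i.e. one Neural Gas update with the annealing schedules (5.1) and (5.2), can be sketched as follows. This is an illustrative sketch, not the SOM Toolbox code: the function names and the default start and end values for ρ and ε are hypothetical, and the closest prototype is given rank 0.

```python
import math


def anneal(start, end, t, t_max):
    """Exponential schedule start * (end / start) ** (t / t_max), eqs. (5.1), (5.2)."""
    return start * (end / start) ** (t / t_max)


def neural_gas_step(x, prototypes, t, t_max, rho=(10.0, 0.01), eps=(0.5, 0.005)):
    """Update all prototypes for one data point x at iteration t.

    rho and eps are (initial, final) pairs; their default values are
    assumptions chosen only for this example.
    """
    rho_t = anneal(rho[0], rho[1], t, t_max)
    eps_t = anneal(eps[0], eps[1], t, t_max)
    d = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m))) for m in prototypes]
    order = sorted(range(len(prototypes)), key=lambda k: d[k])
    updated = [list(m) for m in prototypes]
    for rank, k in enumerate(order):
        h = math.exp(-rank / rho_t)  # neighbourhood function h_rho(r_k)
        updated[k] = [mj + eps_t * h * (xj - mj)
                      for mj, xj in zip(prototypes[k], x)]
    return updated
```

Because the update only needs the distance ranks for a single data point, each step costs one sort over the K prototypes, independently of the projection from latent space.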

The visualisation is provided by the projection of each datapoint to latent space

yi, using the responsibilities of all the prototypes for each data point rik, and the

fixed prototypes in latent space tk. The responsibilities include the Epanechnikov

kernel that proved to be better also for HaToM [72]. One of the advantages of this

algorithm is that the Neural Gas part is independent of the non-linear projection,

thus the clustering efficiency is not limited by the topology preservation restriction.
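The projection step (Eqs. 5.3 and 5.4) can be illustrated with a short plain-Python sketch. The function name and argument layout are hypothetical, and the handling of a datapoint that falls outside the support of every kernel (all responsibilities set to zero) is an assumption, since the text does not discuss that case.

```python
def epanechnikov_projection(data, prototypes, latent, lam):
    """Return (responsibilities, projections) following Eqs. (5.3)-(5.4)."""
    resp, proj = [], []
    for x in data:
        c = []
        for m in prototypes:
            d2 = sum((xj - mj) ** 2 for xj, mj in zip(x, m))   # d_ik^2
            big_d = d2 / lam                                   # D_lambda(i, k)
            c.append(0.75 * (1.0 - big_d ** 2) if abs(big_d) < 1.0 else 0.0)
        total = sum(c)
        # Assumption: a point outside every kernel keeps zero responsibilities.
        r = [ck / total if total > 0 else 0.0 for ck in c]
        resp.append(r)
        # y_i = sum_k r_ik t_k, computed per latent coordinate
        proj.append([sum(rk * tk[j] for rk, tk in zip(r, latent))
                     for j in range(len(latent[0]))])
    return resp, proj
```

A small λ gives a narrow kernel, so each datapoint is projected close to a single latent point; a larger λ spreads the responsibility over several latent points and places the projection between them.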

As with HaToM, the non-linear mapping could be included in the inner loop, which would create the Model-driven version of ToNeGas, but the property sought here is the reduction in convergence time.

5.1.1 Simulations

5.1.2 Experiment 1: The Fundamental Clustering Problems Suite

For both the target and hepta datasets the Topographic Neural Gas separates the clusters well, projecting the right topology into the latent space (Figure 5.1 and Figure 5.2). The prototypes (bottom left of the Figures) are mainly located within the clusters.

The Harmonic Topographic Mapping also proved to be good at separating these datasets (see Section 4.3.4; footnote 1). To illustrate how much the clustering speed of NG improves ToNeGas over HaToM, we evaluate the time of convergence for both algorithms on four datasets in Table 5.1. The difference in time is noticeable, especially when the number of datapoints is large.

Another possible criterion for comparison is the reduction in the Mean Quantisation Error (MQE) while growing the map. In this experiment we calculate the MQE every time we add new latent points to the map, that is after finishing each run of the clustering technique (K-Harmonic Means for HaToM and Neural Gas for ToNeGas). We can see in Figure 5.3 that both techniques reduce the MQE, but the change is much faster and to a lower final value for ToNeGas.

1 Note: We are comparing ToNeGas vs HaToM with the Epanechnikov kernel, because this is the one used in ToNeGas.

Figure 5.1: Original data (top), □ prototypes and red dots for datapoints in data space (bottom left), and ToNeGas projection of the hepta data (bottom right).

Table 5.1: Convergence time (seconds) for HaToM and ToNeGas.

Dataset        Four clusters   Algae   Hepta   Target
No. samples    800             118     212     770
Dim            2               19      3       2
HaToM          174.47          7.07    17.19   155.19
ToNeGas        20.21           6.10    7.24    19.23

Figure 5.2: Original data (top), □ prototypes and red dots for datapoints in data space (bottom left), and ToNeGas projection of the target data (bottom right).

Figure 5.3: Mean quantisation error over time for the Harmonic Topographic Mapping (top) and the Topographic Neural Gas (bottom) with the algae data. The horizontal axes show the moment that new latent points were added to the mapping.

Figure 5.4: ToNeGas projection of the 9 labelled algae classes and 1 unlabelled class (0).

In conclusion, the clustering speed of Neural Gas gives ToNeGas an important improvement over the previously developed algorithm, the Harmonic Topographic Mapping, and ToNeGas has also proved to reduce the mean quantisation error much more than HaToM.

5.1.3 Experiment 2: The Algae data set

ToNeGas is able to cluster this data correctly (Figure 5.4). Here we used wider responsibilities (controlled by the λ value) to spread the clusters but, as with HaToM, the projection depicts tighter clusters with narrower responsibilities. The important difference with this algorithm is that it is the only one so far to project Classes 5 and 7 (blue squares and red triangles respectively) into separate clusters.

5.2 Topology preservation

The rest of this chapter is dedicated to the comparison between SOM, ToPoE,

HaToM and ToNeGas. In this section we quantify how the topology is actually

preserved in these algorithms with two datasets, the algae dataset presented above,

and an extreme dataset in which we have only 40 samples of 3036 dimensionality.

We calculate measures of quantization and topology, and apply different techniques to study the results. For several of the results below we have utilised the SOM Toolbox [62] with default values.

Figure 5.5: Responsibilities for ToPoE, HaToM and ToNeGas for the algae data.

5.2.1 Experiment 1: Algae dataset

We apply a 10*10 mapping for all the algorithms: SOM, ToPoE, HaToM (Data

driven version) and ToNeGas in order to compare them. We already saw the pro-

jections of the algorithms using the responsibilities. In the next section we see the

responsibilities and measures of clustering and topology preservation.

Responsibilities

The responsibilities of each latent point for each datapoint are shown in Figure

5.5. We can see how the responsibilities are spread over several neurons for all

datapoints in HaToM and ToNeGas. The responsibilities for ToPoE are really small

(notice the axis of the figure), which shows the high distances from the prototypes

to the datapoints. This is also reflected below in the U-Matrix distances range and

in the value of the Mean Quantisation Error. Only a few neurons have meaningful

responsibilities in this case, as is also depicted in the corresponding hit histogram.



Figure 5.6: Hit histogram and U-matrix for SOM with the algae data.

U-matrix, Hit Histograms and Distance Matrix

The U-Matrix assigns to each cell the average distance to all of its neighbors. This

enables the identification of regions of similarity using different colors for different

ranges of distances.
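As a rough illustration of the quantity behind the U-Matrix, the sketch below computes, for each cell of a rectangular grid, the average data-space distance to its neighboring cells. It is a simplified stand-in and not the SOM Toolbox routine used for the figures; the function name, the row-major ordering of the prototypes and the restriction to the four lattice neighbors are assumptions.

```python
import math

def average_neighbour_distance(prototypes, rows, cols):
    """prototypes: list of rows*cols vectors in row-major grid order."""
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = prototypes[r * cols + c]
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            dists = [math.dist(here, prototypes[rr * cols + cc])
                     for rr, cc in neighbours
                     if 0 <= rr < rows and 0 <= cc < cols]
            # A 1x1 grid has no neighbours; report 0.0 in that case.
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u
```

Large values mark cells whose prototypes lie far from those of their grid neighbors, i.e. likely borders between clusters.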

The hit histograms are formed by taking a data set, finding the Best Matching

Unit (BMU) of each data sample from the map, and increasing a counter in a map

unit each time it is the BMU. The hit histogram shows the distribution of the data

set on the map. Here, the hit histogram for the whole data set is calculated and

visualized with the U-matrix (Figures 5.6, 5.7, 5.8, 5.9).
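The construction of a hit histogram described above can be sketched in a few lines of plain Python. The function name and the row-major layout of the prototypes on the grid are assumptions made for the illustration.

```python
import math

def hit_histogram(data, prototypes, rows, cols):
    """Count, for each map unit, how many samples have it as their BMU."""
    hits = [[0] * cols for _ in range(rows)]
    for x in data:
        bmu = min(range(len(prototypes)),
                  key=lambda k: math.dist(x, prototypes[k]))
        hits[bmu // cols][bmu % cols] += 1   # row-major unit index
    return hits
```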

All the hitting matrices and U-Matrices show good separations of the classes in

different areas of the grid, but while ToPoE uses only a few neurons to do so, HaToM

shows clusters spread all over the grid. Both SOM and ToNeGas show intermediate

cases.

Surface plot of distance matrix (Figure 5.10): both color and z-coordinate in-

dicate average distance to neighboring map units. This is closely related to the

U-matrix.



Figure 5.7: Hit histogram and U-matrix for ToPoE with the algae data.

Figure 5.8: Hit histogram and U-matrix for HaToM with the algae data.



Figure 5.9: Hit histogram and U-matrix for ToNeGas with the algae data.

Figure 5.10: Distance matrix for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.

Figure 5.11: Quantization errors of the 118 data points for SOM, ToPoE, HaToM, and ToNeGas with the algae data.

The quality of the map

Any topology preserving map requires a few parameters (such as size and topology

of the map or the learning parameters) to be chosen a priori, and this influences the

quality of the mapping. Typically two evaluation criteria are used: resolution and

topology preservation. If the dimension of the data set is higher than the dimension

of the map grid, these usually become contradictory goals.

We first show the quantization error for each datapoint, i.e. the distance to its Best Matching Unit (BMU), in Figure 5.11.

The mean quantization error qe is the average distance between each data vector

and its BMU. It measures the resolution of the mapping.

$q_e = \frac{1}{N} \sum_{i=1}^{N} \| x_i - m_{BMU(i)} \|$. (5.5)
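The mean quantization error defined above can be computed directly from the data and the prototypes. The following is a minimal sketch with a hypothetical function name; it assumes Euclidean distance, as used for the BMU search.

```python
import math

def mean_quantization_error(data, prototypes):
    """Average distance from each data vector to its nearest prototype (BMU)."""
    return sum(min(math.dist(x, m) for m in prototypes) for x in data) / len(data)
```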

The distortion measure which measures the deviation between the data and the

quantizers is defined as:

$E = \sum_{i=1}^{N} \sum_{k=1}^{K} h(BMU(i), k) \, \| x_i - m_k \|^2$. (5.6)

We first calculate the total distortion for each unit, and then average over the total number of neurons (Table 5.2).

Table 5.2: Quantization error and topology preservation error with Topology-preserving Mappings for the algae data.

Algorithm                                        SOM      ToPoE    HaToM    ToNeGas
Mean Quantization Error                          0.0526   2.8445   0.0147   0.0162
Average total distortion for each unit (e+003)   0.0862   0.4530   0.2443   0.2071
Topology preservation error                      0.0593   0.6915   0.5000   0.4237

Another important measure of the quality of the mapping is the topology preser-

vation te. In this case we calculate the topographic error, i.e. the proportion of all

data vectors for which first and second BMUs are not adjacent units.

$t_e = \frac{1}{N} \sum_{i=1}^{N} u(x_i)$ (5.7)

where u(xi) is equal to 1 if the first and second BMUs are not adjacent and 0 otherwise.
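A minimal sketch of the topographic error for a rectangular lattice is given below. It assumes that the prototypes are stored in row-major grid order and that "adjacent" means the four horizontal and vertical neighbors; the function name is illustrative only.

```python
import math

def topographic_error(data, prototypes, cols):
    """Proportion of samples whose first and second BMUs are not adjacent."""
    errors = 0
    for x in data:
        order = sorted(range(len(prototypes)),
                       key=lambda k: math.dist(x, prototypes[k]))
        (r1, c1), (r2, c2) = divmod(order[0], cols), divmod(order[1], cols)
        if abs(r1 - r2) + abs(c1 - c2) != 1:   # not a 4-neighbour on the grid
            errors += 1
    return errors / len(data)
```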

ToPoE gives the highest value for all the clustering errors, which means that its neurons are further away from the data. The topology error is also highest for that algorithm. Both HaToM and ToNeGas have a lower mean quantisation error than SOM. SOM, however, gives the lowest topology preservation error.

PCA projections

To project into a two dimensional space, we used the responsibilities rik, but a principal component analysis can also be used, projecting both the clusters and the data onto the same first two eigenvectors of the data. Figure 5.12 shows the PCA projection of the data with the same class symbols as in the previous experiments with the algae data.

We can see in Figure 5.13 that the reason for the worse clustering in ToPoE is the location of the neurons, which in some cases are projected outside the data area. Figure 5.14 depicts the map formed with each algorithm.

Figure 5.12: The PCA projection for the algae data.

5.2.2 Experiment 2: Gene dataset

We use a dataset containing results of a high-throughput experimental technology application in molecular biology (microarray data; see footnote 2). The dataset contains only 40 observations of high-dimensional data (3036 dimensions), covering three types of bladder cancer: T1, T2+ and Ta. The data is examined in the gene space (40 rows of 3036 variables). Having so few observations for such a high-dimensional dataset makes this a demanding test for our algorithms. The dataset has been preprocessed to have zero mean; also, some values were missing in the original dataset and these have been filtered out.

Projections in latent space

The three algorithms give a good visualisation of the three types of cancer in the projection (see Figures 5.15, 5.16, and 5.17), but ToPoE needs to run for 100,000 iterations while the others need only 20 passes. The Data-driven version of HaToM (Figure 5.16) gives a good projection, but ToNeGas (Figure 5.17) separates the clusters faster. In all cases, one datapoint of the Ta class (blue ’*’) is separated from the others and seems to be an outlier.

2 http://www.math.le.ac.uk/PEOPLE/ag153/homepage/PrincManLeicAug2006.htm

Figure 5.13: PCA projection of the datapoints (in blue) and the prototypes (in red) for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.

Figure 5.14: Colored PCA projection of the prototypes for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the algae data.

Figure 5.15: The ToPoE projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.

Figure 5.16: The HaToM projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.

Figure 5.17: The ToNeGas projection for the gene data. T1 in red triangles, T2+ in green stars and Ta in blue ’*’.

Responsibilities

The responsibilities of each latent point for each datapoint are shown in Figure 5.18. The responsibilities are much smaller for ToPoE, meaning that the prototypes are further away from the datapoints and/or that more prototypes (localised in the same region of the map) are responsible for each datapoint. In both HaToM and ToNeGas the responsibilities are less widely spread between the prototypes and of higher value, which reflects the mixture of experts model applied and the closer proximity to the datapoints.

U-matrix, Hit Histograms and Distance Matrix

The hit histogram for the whole data set is calculated and visualized with the U-matrix (Figures 5.19, 5.20, 5.21, and 5.22). Again we can see a good separation of classes in the hitting matrix, with the corresponding distances in data space shown in the U-Matrix.

The surface plot of the distance matrix is shown in Figure 5.23.

The quality of the map

The quantization errors for the gene dataset are shown in Figure 5.24. ToNeGas clearly reduces the values further for most of the datapoints.

Figure 5.18: Responsibilities for ToPoE (top left), HaToM (top right), and ToNeGas (bottom) with the gene data.

Figure 5.19: Hit histogram and U-matrix for SOM with the gene data.



Figure 5.20: Hit histogram and U-matrix for ToPoE with the gene data.

Figure 5.21: Hit histogram and U-matrix for HaToM with the gene data.



Figure 5.22: Hit histogram and U-matrix for ToNeGas with the gene data.

Figure 5.23: Distance matrix for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.

Figure 5.24: Quantization errors of the 40 data points for SOM, ToPoE, HaToM, and ToNeGas with the gene data.

In this case, to analyse the influence of each clustering technique on its own, we also calculate the quantization error of K-Means, KHM, and Neural Gas with this dataset and 12 prototypes (see Table 5.3). We can clearly see the improvement achieved by NG. KHM gives a larger value, which indicates the need to apply the generalised version with high-dimensional data. Zhang suggested using $p = \sqrt{d}$, where d is the dimensionality of the dataset [91, 92]. In this experiment, however, it was enough to use p = 3 for HaToM.

Table 5.3: Quantization error with clustering techniques for the gene data.

Algorithm                 K-Means   KHM       Neural Gas
Mean Quantization Error   22.8778   31.7438   16.0784

The te does not consider diagonal neighbors; thus the hexagonal case always gives lower values of te, due to its six neighbors for each unit in comparison to the four in the rectangular mapping. We are using a rectangular lattice for all algorithms, so in order to see the right topology error considering all neighbors we need a different topographic error, such as the Alfa error [2]. The formula for the Alfa error is as follows:

$\mathrm{Alfa} = \frac{1}{N} \sum_{i=1}^{N} \alpha(x_i)$ (5.8)

where α(xi) is equal to 1 if the first and second BMUs are neither adjacent nor diagonal neighbors, and 0 otherwise. This error gives lower values than te for all the algorithms (see Table 5.4).

Table 5.4: Quantization error and topology preservation error with Topology-preserving Mappings for the gene data.

TPM                                           SOM      ToPoE    HaToM    ToNeGas
Mean Quantization Error                       11.881   22.410   22.308   8.843
Mean total distortion for each unit (e+003)   0.597    0.892    1.357    0.851
Topology preservation error                   0.025    0.250    0.750    0.200
Alfa error                                    0        0.025    0.675    0.1
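The Alfa error of Eq. (5.8) can be sketched for a rectangular lattice as follows. As an assumption of this sketch, the prototypes are stored in row-major grid order and two units count as neighbors when their grid positions differ by at most one step in each direction (so diagonals are included); the function name is hypothetical.

```python
import math

def alfa_error(data, prototypes, cols):
    """Proportion of samples whose two best units are neither adjacent nor diagonal."""
    errors = 0
    for x in data:
        order = sorted(range(len(prototypes)),
                       key=lambda k: math.dist(x, prototypes[k]))
        (r1, c1), (r2, c2) = divmod(order[0], cols), divmod(order[1], cols)
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:   # outside the 8-neighbourhood
            errors += 1
    return errors / len(data)
```

Because every pair of units that is adjacent in the four-neighbor sense is also a neighbor here, this measure can never exceed the topographic error computed on the same map.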

For this data set the lowest mean quantisation error is obtained by ToNeGas (Table 5.4); the mapping is thus more flexible and adjusts better to the data, as also seen in the distance matrix (Figure 5.23). The topology preservation in this case is, however, similar for ToPoE and ToNeGas. The Alfa measure reduces the values of the topology error for all the algorithms.

PCA projections

With this dataset, all algorithms have neurons projected within the data area (Figure

5.25). The two dimensional grid is similar for both ToNeGas and ToPoE (Figure

5.26).

5.3 Experiment Comparisons

This section summarises the experiments carried out with the same datasets for all the algorithms in the previous chapters, and the latter two in this chapter. We have seen that when the model needs to be imposed and a smooth manifold found, ToPoE and M-HaToM are the best choices. This was shown with the one-dimensional data and the animals dataset. On the other hand, D-HaToM and ToNeGas are the best options for clustering. The generalised version of D-HaToM with a suitable value of p should be adequate for a high-dimensional dataset, but ToNeGas would do a good job as well and in a shorter time (Table 5.1). G-HaToM is able to construct a topology preserving mapping with fewer nodes and can also be useful when further investigation of subclasses is required (Algae data).

Figure 5.25: PCA projection of the datapoints (in blue) and the centers (in red) for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.

Figure 5.26: Colored PCA projection of the prototypes for SOM (top left), ToPoE (top right), HaToM (bottom left), and ToNeGas (bottom right) with the gene data.



ToPoE was not always able to locate the prototypes in data space near the data (e.g. four clusters data). The more restrictive structure of prototypes in this modelling positions them in a continuous shape, while D-HaToM and ToNeGas locate them near the data and not in between clusters (four clusters and FCPS data). We have to take into account, however, that most of the experiments were carried out with clustering data, which is thus more suitable for D-HaToM and ToNeGas. The projections were nevertheless satisfactory, with a proper separation of clusters (Algae data), thanks to the visualisation system used; this system is based on responsibilities that take into account the distances between the datapoints and the prototypes. A visualisation process similar to that of the SOM, where the datapoints are just projected to the nearest node, would have failed in this task.

In some cases ToPoE and HaToM required the use of local kernels, which give

weights only to the nearest prototypes in the visualisation process (Hepta and Target

datasets). Those kernels were also a good trick for M-HaToM to project outliers in

the target data.

We have also studied the possibility of tightening the projections

• applying Model-driven algorithms like ToPoE or M-HaToM (Algae data),

• reducing the width of the responsibilities (four clusters with D-HaToM).

An investigation of the magnification factors proved to be a useful tool to in-

terpret the projections found by the algorithm, and to estimate the real relative

distances in data space.

Finally we analysed in more detail clustering and topology preservation, finding

adequate values of MQE and topology errors, especially with ToNeGas.

5.4 Comparison of the algorithms

In this family of algorithms, as in the basic GTM, the algorithm assumes hyper-spherical Gaussian clusters, whose flexibility and accuracy in approximating the data can be traded off by adjusting the parameters [64]: the flexibility should not be too high, to prevent over-fitting, and a high accuracy increases the computational cost. For the SOM there is no clear separation between the parameters that control both properties. For the GTM and our algorithms, these are the number of latent points, the width of the Gaussian functions, and the width of the responsibilities. The number of latent points, as in the GTM, controls the accuracy of the clustering; the width of the Gaussian functions controls the topology preservation of the clusters; the rik are not calculated until the end of the algorithm (in all except ToPoE, which is more similar to GTM), to get the projection into latent space, but their width controls the ability of the algorithm to separate the clusters, especially for the D-HaToM. The number of basis functions M, in all of them, controls the flexibility in the shape of the manifold.

As a visualisation technique these algorithms have one advantage over the standard SOM: the projections of the data onto the grid need not be solely to the grid nodes. If we project each data point to that node which has highest responsibility for the data point, we get a similar quantisation to that of the SOM. However if we project each data point xi onto $\sum_k r_{ik} t_k$, we get a mapping onto the manifold at intermediate points.

5.4.1 Growing and Pruning

Another advantage of this family of topology preserving mappings is that we can

easily grow a net: we train a net with a small number of latent points and then

increase the number of latent points. Thus we have to recalculate the Φ matrix but

need not change the W matrix of weights which can simply continue to learn from

its current values.
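The growing step can be illustrated with a small sketch in which the Φ matrix is recomputed for a larger latent grid while W is reused unchanged. The Gaussian form of the basis functions, their width, the [-1, 1] latent range, the random stand-in for a trained W and all function names are assumptions made for the illustration.

```python
import math
import random

def square_grid(k):
    """k*k latent points on a regular grid in [-1, 1]^2."""
    axis = [-1.0 + 2.0 * i / (k - 1) for i in range(k)]
    return [(a, b) for a in axis for b in axis]

def basis_matrix(latent, centres, sigma):
    """Phi[k][j]: Gaussian basis function j evaluated at latent point k."""
    return [[math.exp(-math.dist(t, c) ** 2 / (2.0 * sigma ** 2))
             for c in centres] for t in latent]

def prototypes(phi, w):
    """m_k = Phi(t_k)^T W for every latent point."""
    return [[sum(p * wj[d] for p, wj in zip(row, w))
             for d in range(len(w[0]))] for row in phi]

rng = random.Random(0)
centres = square_grid(2)                       # basis functions stay fixed
w = [[rng.gauss(0.0, 1.0) for _ in range(3)]   # stands in for a trained W
     for _ in centres]

m_small = prototypes(basis_matrix(square_grid(2), centres, 1.0), w)  # 2*2 map
m_large = prototypes(basis_matrix(square_grid(3), centres, 1.0), w)  # 3*3 map, same W
```

In this sketch the four corner latent points are shared by both grids, so their prototypes are identical before and after growing; the new latent points receive prototypes interpolated by the same W, from which training can continue.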

Equally we may question the completed map to investigate whether any latent

point is being mapped to a part of the data space which has no data nearby. If a

latent point does not have the greatest responsibility for any data point, it can be

deleted from the map. This technique is illustrated in Figure 5.27. In each diagram

the red ’+’s show the positions of the data points: the data consists of 4 distinct

clusters. The trained map is shown on the left: the projections of the 20 latent

points cover the data set, but some are placed in positions in which there is

no data for which they need to take responsibility. Such points are excluded and the

map continues to learn to get the situation in the right diagram: only 10 latent

points remain. It must be emphasised that we do not alter the positions of either

the latent points (in latent space) or the basis functions when we continue training.

These remain at their original situations.
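The pruning criterion can be sketched as follows: a latent point is kept only if it has the greatest responsibility for at least one data point. The function name and the list-of-lists layout of the responsibilities (one row per data point) are assumptions of this sketch.

```python
def prune_latent_points(resp, latent):
    """Keep only latent points that win the responsibility for some datapoint."""
    winners = sorted({max(range(len(r)), key=r.__getitem__) for r in resp})
    return winners, [latent[k] for k in winners]
```

The surviving latent points keep their original positions in latent space, in line with the remark above; only the set of points used in further training changes.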

5.4.2 One-to-one comparisons

The following comparisons between algorithms are summarised in Table 5.5.

Figure 5.27: In both diagrams the data set is shown by the red ’+’s. Left: the projections of the 20 latent points are shown with ’*’s. Right: after pruning, the 10 remaining latent points may continue training.

Table 5.5: Properties of each of the algorithms

Columns: Clustering | Insens. Initial. | Preserves Topology | Cost function | No Need No. clusters | Separates clust. & proj.

Kmeans    √ √
KHM       √ √ √
NG        √ √ √
SOM       √ √ √
GTM       √ √ √ √
ToPoE     √ √ √ √
HaToM     √ √ √ √ √ √
ToNeGas   √ √ √ √ √ √


ToPoE vs SOM

ToPoE uses the responsibilities within the learning rule to update the prototypes

based on the distances between prototypes and datapoints. The topology is pre-

served through the non-linear projection of those prototypes. SOM on the other

hand keeps the preservation using the distances between latent points in the output

layer in the learning rule.

HaToM vs K-Harmonic Means

The main difference between K-Harmonic Means (KHM) and HaToM is the projec-

tion of the clusters into a non linear mapping which represents the main properties

of the clusters, such as distances, that makes it easier to visualise them [17]. Thus,

HaToM has all the advantages of the KHM, and also the topology preservation and

visualisation of the GTM.

Another important property of HaToM method is that it is not necessary to

determine the number of clusters a priori like in K-Means and KHM; a growing

stopping criterion can be applied (and has been shown to work [71]), such as a small

change in Mean Quantisation Error.

HaToM vs SOM

KHM is less prone to being trapped in a local minimum, due to the continuous movement of weights. The visualisation in HaToM can be done by projecting directly to the winning neurons as in SOM, or by using responsibilities as in the GTM. SOM has a single learning rule for clustering and topology preservation, while HaToM is composed of an inner loop for the clustering, plus an outer non-linear projection applied every time a new latent point is added. The latter can be included in the inner loop if a model-driven version is desired.

HaToM vs GTM

HaToM shares the structure of the GTM, with a latent space projected non-linearly

through a feature space into data space. But the HaToM has a cost function based

on the K-Harmonic means which is optimised with gradient descent. This provides

the algorithm with a strong clustering supported with a good visualisation given by

the latent space. GTM requires a careful initialisation to self-organise [46]. HaToM

overcomes this problem using Harmonic Means. The separation of clustering and

topology preservation is again unique for HaToM.



HaToM vs ToPoE

HaToM was initially inspired by ToPoE but with a clustering focused purpose. Both

HaToM and ToPoE have a similar structure to GTM: a non-linear projection from

latent space to data space. ToPoE starts with a product of experts that adapts

progressively to the data during training, becoming more similar to the mixture of

experts of GTM as the responsibilities sharpen and some latent points lose respon-

sibility for some data points. The gradient descent algorithm optimised the process,

in contrast to EM for GTM. HaToM on the other hand gets the insensitivity to

initialisation provided by the K-Harmonic Means, that clusters the prototypes in

data space.

The main difference in visualisation is that the responsibilities in HaToM are only

calculated once the clustering is done, just to project the data to latent space, while

for ToPoE this calculation is done on every iteration. In the HaToM algorithm no

responsibility is fixed, so that it is the data that says which experts are responsible

and which not; both HaToM and ToPoE permit the data to control the responsi-

bilities of each expert, but while ToPoE gives equal responsibilities to all when a

datapoint has no expert, the HaToM due to the K-Harmonic clustering, moves the

mk prototypes more freely to the data (the soft assignment of membership allows

for a continuous transition of prototypes between areas of high density of data), and

thus there is not a data point for which no prototype takes responsibility.

ToPoE, like the GTM, has a fixed structure in which the mk prototypes have

limited movement in the manifold, always responding to the position of the nearest

prototypes. In HaToM however, the mk prototypes are much more flexible and can

move towards the clusters in data space, while still keeping a smooth manifold by

making the W and mk follow the latent space model, especially in the M-HaToM.

The big advantage of ToPoE, though, is that it is not necessary to grow the network: using the same number of prototypes that HaToM finally reaches gives similar results to those of the growing HaToM. However, it has to be noted that for the high dimensional cases presented in this thesis, ToPoE required 100,000 iterations, while HaToM and ToNeGas showed good results with just 20.

As we have seen, the main difference between the two HaToM models is the updating of W and the mk prototypes. In D-HaToM this is only done when we grow the latent space with another point K, so that the data conveys the mk prototypes to the clusters. In M-HaToM this updating is done in every iteration, so that the mk prototypes are forced to follow the model more strictly and therefore give a smoother manifold; this forces the mk prototypes to be outside the clusters when the data is not continuous (see Section 4.3.2), but still gives a perfect clustering in the latent space. The Model-driven version is thus closer to ToPoE.

HaToM vs ToNeGas

As we saw in Section 5.1.2, ToNeGas reduces the mean quantization error faster and to a lower value. This algorithm gave some of the best results in the experiments we tried. The clustering properties are similar, though: both HaToM and ToNeGas are insensitive to the initialisation of the prototypes. The former uses the continuous movement of prototypes between areas of high density of data, while the latter makes use of the temperature value to allow for changes between “valleys” at the beginning, looking for the global minimum, and reduces the temperature progressively to keep the values around the deepest valley.

The boosting-like properties of the generalised version of KHM could indicate that HaToM would be better for difficult data, but so far, and probably thanks to the greater and faster reduction of the Mean Quantisation Error, ToNeGas has proved to be at least as good as HaToM with high dimensional and complex data.

ToNeGas vs K-Means

ToNeGas gets its clustering properties from Neural Gas, so comparing ToNeGas

and K-Means clustering is like comparing NG and K-Means. The Neural-Gas algo-

rithm is quite similar to the K-Means clustering method, but with K-Means there

is no neighborhood involved. Instead only the winner is updated. K-Means is then

a winner takes all method, or hard competitive learning, whereas the Neural-Gas

method belongs to the soft competitive learning or winner takes most methods. We

have seen that K-Means is susceptible to the initial position of the weights, while

“the location of the Neural-Gas nodes after a few iterations is independent from the

initialization, due to the very fast gas-like movement of the nodes in the beginning”

[75].
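The difference between hard and soft competitive learning can be made concrete by comparing a single update step of each kind. The sketch below uses an online (one datapoint at a time) winner-takes-all update as a stand-in for K-Means, which is an assumption of the illustration, and the rank-based Neural Gas update; both function names are hypothetical.

```python
import math

def winner_takes_all_step(protos, x, eps):
    """Hard competitive learning: only the closest prototype moves."""
    dists = [math.dist(x, m) for m in protos]
    win = dists.index(min(dists))
    return [[mj + eps * (xj - mj) for mj, xj in zip(m, x)] if k == win
            else list(m) for k, m in enumerate(protos)]

def neural_gas_step(protos, x, eps, rho):
    """Soft competitive learning: every prototype moves, weighted by its rank."""
    dists = [math.dist(x, m) for m in protos]
    order = sorted(range(len(protos)), key=dists.__getitem__)
    rank = {k: r for r, k in enumerate(order)}
    return [[mj + eps * math.exp(-rank[k] / rho) * (xj - mj)
             for mj, xj in zip(m, x)] for k, m in enumerate(protos)]
```

For a very small ρ the rank weights of all but the winner vanish, and the Neural Gas step coincides with the winner-takes-all step.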

The Neural-Gas algorithm reaches faster convergence with similar clustering re-

sults and minimizes a global cost function. The reason for the faster convergence in

NG is that the oscillations of the cost functions are small.

The other ToNeGas properties, topology preservation and dimensionality reduc-

tion, are not included in K-Means.


ToNeGas vs SOM

It is not possible to define a cost function for the Kohonen network, whereas there is a well-defined cost function for the Neural-Gas algorithm [57]. The NG learning procedure resembles simulated annealing [75], a well-known global optimization technique. The temperature-like decay variable helps the algorithm to escape from local minima at the beginning, when the temperature is high; as it decreases, the weights are updated gradually towards the global minimum.
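
As a rough illustration of these two ingredients, the sketch below evaluates a finite-sample version of the rank-weighted Neural Gas cost and an exponentially decaying temperature. Both forms are assumptions made for the example (the cost follows the usual formulation associated with [57], and the schedule is one common choice); the function and parameter names are hypothetical.

```python
import math

def ng_cost(prototypes, data, lam):
    """Empirical Neural Gas cost: rank-weighted squared distances,
    normalised by C(lam) = sum_k exp(-k / lam)."""
    norm = sum(math.exp(-k / lam) for k in range(len(prototypes)))
    total = 0.0
    for x in data:
        # Sorted squared distances: position in the list is the rank.
        d = sorted(sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes)
        total += sum(math.exp(-rank / lam) * dk for rank, dk in enumerate(d))
    return total / (2.0 * norm)

def temperature(t, t_max, lam_start, lam_end):
    """Exponential decay from lam_start (t = 0) to lam_end (t = t_max)."""
    return lam_start * (lam_end / lam_start) ** (t / t_max)

data = [[0.0], [1.0]]
prototypes = [[0.0], [2.0]]
for t in (0, 50, 100):
    lam = temperature(t, 100, lam_start=10.0, lam_end=0.01)
    print(t, lam, ng_cost(prototypes, data, lam))
```

At a high temperature all prototypes contribute to the cost for every datapoint, which smooths the error surface; as the temperature approaches zero only the winner contributes and the cost tends to half the summed quantisation error.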

If two nodes are close in a SOM map, they are more likely to be similar. In NG this kind of neighborhood information is not available. However, when NG is combined with a projection method such as the GTM structure, the resulting ToNeGas algorithm has visualization power at least equal to that of the Kohonen maps.

The main difference between NG and SOM is the neighborhood concept. In the Kohonen algorithm, the weights of the nodes are adapted according to the distance of the nodes to the BMU in latent space. In NG the distances are calculated in data space, and they are used simply to weight each neuron in the update of the prototypes for a particular datapoint, without considering topology preservation.
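
The two neighborhood concepts can be contrasted with a small sketch. It assumes a Gaussian neighborhood of width `sigma` on the latent grid for SOM and the weighting exp(-rank / lam) for NG; the names and the toy configuration are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def som_weights(grid, prototypes, x, sigma):
    """SOM: weights depend on the latent-grid distance of each node to the BMU."""
    d_data = [sq_dist(w, x) for w in prototypes]
    bmu = d_data.index(min(d_data))
    return [math.exp(-sq_dist(g, grid[bmu]) / (2 * sigma ** 2)) for g in grid]

def ng_weights(prototypes, x, lam):
    """NG: weights depend only on the rank of each prototype in data space."""
    d_data = [sq_dist(w, x) for w in prototypes]
    order = sorted(range(len(prototypes)), key=lambda k: d_data[k])
    weights = [0.0] * len(prototypes)
    for rank, k in enumerate(order):
        weights[k] = math.exp(-rank / lam)
    return weights

# Three nodes on a 1-D latent grid whose prototypes are "folded" in data space:
# node 1 is a grid neighbour of node 0 but lies far away from it in data space.
grid = [[0.0], [1.0], [2.0]]
prototypes = [[0.0], [10.0], [1.0]]
x = [0.1]
print(som_weights(grid, prototypes, x, sigma=1.0))  # node 1 weighted more than node 2
print(ng_weights(prototypes, x, lam=1.0))           # node 2 weighted more than node 1
```

In the SOM case the grid neighbour is dragged towards the datapoint, which is what enforces the topology; in the NG case the update ignores the grid entirely, so any topology preservation has to come from a separate step.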

The separation of the topology preservation step from the clustering allows for

a faster clustering, while the topology preservation is still kept at reasonable levels.

ToNeGas vs GTM and ToPoE

ToNeGas, ToPoE and GTM share the same underlying structure, but ToNeGas is faster due to the gas-like clustering. As in HaToM, the continuous transition of prototypes between clusters, due to the soft competition, gives the prototypes the freedom to find the clusters faster than in ToPoE.

5.5 Benefits from separating clustering from projection

As seen in this thesis, the two newly developed algorithms (HaToM and ToNeGas) have in common the separation of the clustering and projection steps. This has a number of benefits, detailed below:

• the clustering and topology preservation efficiency do not interfere with each

other, as occurs when both are in the same learning rule.

• using the same underlying structure, we can use several clustering techniques

that are better in different situations.

• we have greater control of the parameters, since we can identify which parameter influences each property of the algorithm.

• we can give more or less weight to the clustering or to the non-linear projection by

1. inserting the projection in the inner loop or not.

2. giving more or fewer iterations to the clustering step.

The non-linear projection is also responsible for the topology preservation, as long as the mapping function m(x; w) is smooth and continuous [13]. Separating clustering from topology preservation lets us choose whether to give more weight to the reduction of the Mean Quantisation Error or to the reduction of the Topology Error.

• depending on the above, we can make the algorithm more or less sensitive to outliers, according to the objectives.
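
The two-step scheme behind these benefits can be sketched as an outer projection loop around an inner clustering loop. This is a minimal sketch and not the implementation used in the thesis. It assumes that batch K-Means stands in for the clustering step (KHM or Neural Gas would be plugged in at that point) and that the weights are refitted to the prototypes by ridge-regularised least squares; all function and parameter names are hypothetical.

```python
import math
import random

def gaussian_basis(latent, centres, width):
    """Phi[k][m]: response of basis function m at 1-D latent point k."""
    return [[math.exp(-(t - c) ** 2 / (2 * width ** 2)) for c in centres] for t in latent]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def fit_weights(Phi, prototypes, ridge=1e-3):
    """Least-squares W such that Phi W approximates the prototypes."""
    PhiT = [list(col) for col in zip(*Phi)]
    A = matmul(PhiT, Phi)
    for i in range(len(A)):
        A[i][i] += ridge
    return solve(A, matmul(PhiT, prototypes))

def cluster(prototypes, data, iterations):
    """Inner loop: plain batch K-Means, with no neighborhood function."""
    for _ in range(iterations):
        groups = [[] for _ in prototypes]
        for x in data:
            k = min(range(len(prototypes)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(prototypes[j], x)))
            groups[k].append(x)
        prototypes = [[sum(c) / len(g) for c in zip(*g)] if g else p
                      for g, p in zip(groups, prototypes)]
    return prototypes

def train(data, latent, centres, width, outer=10, inner=3):
    Phi = gaussian_basis(latent, centres, width)
    rng = random.Random(0)
    W = [[rng.uniform(-0.1, 0.1) for _ in data[0]] for _ in centres]
    for _ in range(outer):
        prototypes = matmul(Phi, W)                    # projection: latent -> data space
        prototypes = cluster(prototypes, data, inner)  # clustering step only
        W = fit_weights(Phi, prototypes)               # refit the smooth mapping
    return matmul(Phi, W)

data = [[i / 9.0, i / 9.0] for i in range(10)]
print(train(data, latent=[-1.0, -0.5, 0.0, 0.5, 1.0], centres=[-1.0, 0.0, 1.0], width=1.0))
```

In this sketch `inner` controls how much weight the clustering receives relative to the projection, and the clustering routine can be swapped without touching the projection code.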

HaToM and ToNeGas have been developed in a growing version. This has ad-

vantages [29] such as

• the possibility to use error methods to locate new nodes.

• the possibility to use a performance criterion to stop growing the map, thus having fewer parameters to tune.

• the possibility to prune the prototypes that have no significant responsibility, as seen above.

Growing does not prevent continuous training: W is not re-randomized each time we increment K. Instead, its previous value is used in the next non-linear projection m = φW, in which φ now includes the new number of latent points.
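
A small sketch of why the weights can be carried over when the map grows: the weight matrix has one row per basis function, so its shape does not depend on the number of latent points K, and only the basis-function matrix gains rows. The example assumes a one-dimensional latent space, Gaussian basis functions with fixed centres, and a map that grows by re-gridding the latent points; these choices and all names are illustrative only.

```python
import math

def grid(k):
    """k equally spaced latent points on [-1, 1]."""
    return [-1.0 + 2.0 * i / (k - 1) for i in range(k)]

def phi(latent, centres, width=1.0):
    """K x M matrix of basis-function responses at the latent points."""
    return [[math.exp(-(t - c) ** 2 / (2 * width ** 2)) for c in centres] for t in latent]

def project(Phi, W):
    """m = Phi W: one prototype in data space per latent point."""
    return [[sum(p * w for p, w in zip(row, col)) for col in zip(*W)] for row in Phi]

centres = [-1.0, 0.0, 1.0]
W = [[0.0, 1.0], [2.0, 0.5], [4.0, 1.0]]     # M x D, as if already trained with K = 3

m_small = project(phi(grid(3), centres), W)  # 3 prototypes
m_large = project(phi(grid(5), centres), W)  # K grows to 5, W is reused unchanged

print(len(m_small), len(m_large))
print(m_small[1] == m_large[2])  # the latent point at 0 keeps the same prototype
```

The new prototypes lie on the same smooth mapping as the old ones, so training can continue from the current state rather than restarting.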

ToPoE does not separate clustering from topology preservation, but shares a common non-linear projection with HaToM and ToNeGas. Having information about the data is always a plus: the more we know about the data, the better we can choose the algorithm needed. Having a whole family of related algorithms with different properties therefore allows a wider selection. ToPoE and M-HaToM should be selected when the model needs to be reinforced and for continuous distributions. D-HaToM and ToNeGas are better suited to discrete data and to cases where convergence time is a constraint.


5.6 Conclusions

The first part of this chapter introduced the last of the algorithms sharing the GTM structure, the Topographic Neural Gas, which makes use of the Neural Gas technique. In ToNeGas, as in HaToM, the projection to a lower-dimensional space and the clustering technique are not included in the same learning rule, which gives more control over the parameter selection and the accuracy of the results. This algorithm is faster than the earlier HaToM.

In the second part we compared the three algorithms, ToPoE, HaToM and ToNeGas, with each other and with standard topology preserving maps such as SOM. First, we evaluated the clustering and topology preservation using two different datasets. Then we summarised all the experimental results. Finally, we made a one-to-one comparison of the different models. We also enumerated the benefits of separating clustering and projection into two steps, and of having a family of related algorithms.

Chapter 6

Summary and Future work

6.1 Summary

We have presented a family of Topology preserving mappings based on the Genera-

tive Topographic Mapping. ToPoE, HaToM and ToNeGas share a common projec-

tion technique from latent space to data space, that allows for the preservation of

topology, giving a visualisation of the data space represented in two or three dimen-

sions. The visualisation model is also common for all the algorithms, and is based

on a continuous projection to the output space.

ToPoE includes a Product of Experts model, with responsibilities embedded in the learning process. The stronger imposition of the model in this algorithm makes it more suitable for manifold-searching cases.

HaToM and ToNeGas are also based on the same underlying model as GTM and ToPoE, and thus they also keep the topology in the output space; the main new property of these algorithms is the separation of the clustering and projection steps. This provides a number of advances, including the development of two versions that reinforce either the model or the data. Both K-Harmonic Means and Neural Gas overcome the local minima curse by allowing hill-climbing changes in the gradient descent optimisation.

6.2 Major contributions of this thesis

We have further investigated ToPoE’s theory, including the change of local variance,

the magnification factors, and the projection model.

We introduced two new algorithms that belong to the same family of models

with additional properties and more control of the parameter tuning thanks to the

separation of clustering and topology preservation steps.

We have compared the algorithms on different artificial and real datasets, identifying the strengths and weaknesses of each and the applications for which each is more suitable, whether clustering or modelling of the distributions. We have seen how the use of responsibilities, as in the GTM algorithm, allows us to project in a continuous way, instead of having discrete projections exclusively onto the positions of the latent points in the predefined output space. We also noticed that too many small responsibilities from latent points far away from the data can sometimes project some of the datapoints to the wrong side of the latent space. This can be overcome by pruning, or by using kernels such as the Tri-cube or Epanechnikov, which consider only points within a certain distance of the datapoint. The effect is particularly marked in ToPoE, which tends to have few high responsibilities and a great quantity of latent points with very small responsibility for all datapoints; the projections are nevertheless adequate due to the characteristics of the visualisation process. In HaToM and ToNeGas the corresponding clustering technique locates the prototypes near the data faster, so their responsibilities are high for some of the datapoints and small for others.
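
The effect of many small responsibilities, and the way a compact-support kernel removes it, can be illustrated with a toy projection. The sketch assumes that responsibilities are normalised kernel values of the distance between the datapoint and each prototype, and that a datapoint is projected to the responsibility-weighted mean of the latent points. This is a simplification of the model used in the thesis, and the names and the toy configuration are hypothetical.

```python
import math

def gaussian(u):
    return math.exp(-0.5 * u * u)

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def tricube(u):
    return (1.0 - abs(u) ** 3) ** 3 if abs(u) < 1.0 else 0.0

def latent_projection(x, prototypes, latent, kernel, width):
    """Responsibility-weighted mean of the latent points for datapoint x."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m))) for m in prototypes]
    raw = [kernel(d / width) for d in dists]
    total = sum(raw)
    if total == 0.0:  # no prototype within range: fall back to the nearest one
        raw = [1.0 if d == min(dists) else 0.0 for d in dists]
        total = sum(raw)
    return [sum(r * t[j] for r, t in zip(raw, latent)) / total
            for j in range(len(latent[0]))]

# One prototype sits next to the datapoint; four others are far away but
# together still carry a noticeable share of the Gaussian responsibility.
latent = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
prototypes = [[0.0], [3.0], [3.2], [3.4], [3.6]]
x = [0.1]
for kernel in (gaussian, epanechnikov, tricube):
    print(kernel.__name__, latent_projection(x, prototypes, latent, kernel, width=2.0))
```

With the Gaussian kernel the distant prototypes pull the projection away from the latent point of the nearest prototype, whereas the Epanechnikov and Tri-cube kernels give them zero weight.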

We have investigated their clustering and their preservation of the topology nu-

merically and in comparison with SOM, one of the most used topology preserving

maps.

6.3 Future Work

The algorithms presented in this thesis could be extended in several ways:

• In Neural Gas the ranks of the distances from the neurons to the datapoints are used for the updating of the prototypes. The temperature value gives more or less weight to the influence of the prototypes based on their rank order. In a similar way, ToPoE makes use of the responsibilities to update within the learning rule. We have so far used a fixed value for the width but, as in NG, a decreasing parameter proportional to the rank order could be used, so that higher values in the first stages allow for hill-climbing, while the reduced values towards the end of the training secure convergence to the closest minimum.

• We could also use the local variance in ToPoE to see how good the mapping is. Between clusters the variance should be more spread, while within clusters it should be sharper.

• We have used the underlying structure of the GTM as the projection technique to preserve the topology. The mapping, however, is usually two-dimensional, so the topology is only truly preserved when the manifold of the data is also two-dimensional. In the literature review we reviewed an alternative used in the Topology Representing Networks, Competitive Hebbian learning. This technique finds the true topology of the manifold (once Neural Gas has positioned the prototypes within the clusters), without assuming any predefined dimensionality or topology of the manifold. It could be used in combination with other clustering techniques, such as K-Means and K-Harmonic Means, to form new Topology Preserving Mappings.

• We could also grow the mapping in the same way as GNG, with the location

of the nodes depending on the error of the map, instead of having a fixed

structure.

• As proposed for GTM [80], Bayesian techniques and/or Cross-validation could

be applied for parameter selection in any of these new models.

• Chang [15, 16] uses a spherical manifold with PPS, which reduces the overlapping of clusters. This and other arrangements of the latent points in two- or three-dimensional shapes could be applied to the TPM algorithms.

• We could introduce a “conscience” term so that frequent winners get a “bad

conscience” for winning so often.

• Chang [15, 16] developed the Probabilistic Principal Surfaces, which are closely

related to the GTM. Thus, an interesting study would be the relationship

between our algorithms and Principal Curves/Surfaces.

• The author has a background in Marine sciences, particularly in Fisheries

acoustics. One of the future tasks will be the application of TPM and other

data mining techniques to marine (oceanographical and fisheries) data.

• Application to other real data and to other tasks such as forecasting or regression clustering (see [94]).

Bibliography

[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of

distance metrics in high dimensional space. Lecture Notes in Computer Science,

1973:420–434, 2001.

[2] E. Arsuaga-Uriarte and F. Díaz-Martín. Topology preservation in SOM. Transactions On Engineering, Computing And Technology, 15:1305–5313, 2006.

[3] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding.

In Bay Area Theory Symposium, BATS 06, 2006.

[4] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, 2002.

[5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest

neighbor” meaningful? Lecture Notes in Computer Science, 1540:217–235,

1999.

[6] Neural Computing Research Group. Aston University. Birmingham. Gtm tool-

box. http://www.ncrg.aston.ac.uk/gtm/.

[7] Neural Computing Research Group. Aston University. Birmingham. Netlab

toolbox. http://www.ncrg.aston.ac.uk/netlab/index.php.

[8] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, 1998.

[9] C. M. Bishop, M. Svensen, and C. K. I. Williams. Magnification Factors for

the GTM Algorithm. In Proceedings of the IEE 5th International Conference

on Artificial Neural Networks, Cambridge, U.K., 64-69P, 1997.

[10] C. M. Bishop, M. Svensen, and C. K. I. Williams. Magnification factors for the

gtm algorithm. Technical Report NCRG/97/006, Neural Computing Research

Group, Aston University, 1997.

[11] C. M. Bishop, M. Svensen, and C. K. I. Williams. Developments of the gener-

ative topographic mapping. Neurocomputing, 21(1):203–224, 1998.

[12] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford:Clarendon

Press, 1995.

[13] C.M. Bishop. Latent variable models. Learning in Graphical Models, MIT

Press, pages 371–403, 1999.

[14] M. A. Carreira-Perpinan. A review of dimension reduction techniques. Technical

Report CS-96-09, Dept. of Computer Science. University of Sheffield, 1997.

[15] K. Chang. Nonlinear Dimensionality Reduction Using Probabilistic Principal

Surfaces. PhD thesis, The University of Texas at Austin, Department of Elec-

trical & Computer Engineering, May 2000.

[16] K. Chang and J. Ghosh. A unified model for probabilistic principal surfaces.

IEEE Trans. Pattern Anal. Mach. Intell., 23(1):22–41, 2001.

[17] S. Dasgupta. Experiments with random projection. In UAI ’00: Proceedings

of the 16th Conference on Uncertainty in Artificial Intelligence, pages 143–151,

San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[18] M. Daszykowski, B. Walczak, and D. L. Massart. On the optimal partitioning

of data with k-means, growing k-means, neural gas, and growing neural gas. J.

Chem. Inf. Comput. Sci, 42(6):1378 – 1389, 2002.

[19] P. Delicado. Another look at principal curves and surfaces. Journal of Multi-

variate Analysis, 77:84–116, 2001.

[20] K. Doherty, R. Adams, and N. Davey. Unsupervised learning with normalised

data and non-euclidean norms. Applied Soft Computing, 7(1):203–210, 2007.

[21] R. Durbin and D. Willshaw. An analogue approach to the traveling salesman

problem using an elastic net method. Nature, pages 689–691, 1987.

[22] E. Erwin, K. Obermayer, and K. Schulten. Self-organising maps:ordering, con-

vergence properties and energy functions. Biological Cybernetics, 67:47–55,

1992.

[23] E. Erwin, K. Obermayer, and K. Schulten. Self-organising maps:stationary

states, metastability and convergence rate. Biological Cybernetics, 67:35–45,

1992.

[24] A. Flexer. Limitations of self-organizing maps for vector quantization and mul-

tidimensional scaling. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors,

Advances in Neural Information Processing Systems 9. Proceedings of the 1996

Conference, pages 445–51. MIT Press, London, UK, 1997.

[25] Laboratory for Advanced Brain Signal Processing. Riken. Saitama. Japan.

Icalab toolbox. http://www.bsp.brain.riken.go.jp/icalab/.

[26] D. Francois, V. Wertz, and M. Verleysen. Non-euclidean metrics for similarity

search in noisy datasets. In Proc. of the European Symposium on Artificial

Neural Networks, ESANN 2005, pages 339–344, Bruges, Belgium, 2005.

[27] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a sta-

tistical view of boosting. Annals of Statistics, 28:337–374, 2000.

[28] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S.

Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing

Systems 7, pages 625–632. MIT Press, Cambridge MA, 1995.

[29] B. Fritzke. Growing self-organizing networks – why? In ESANN’96: European

Symposium on Artificial Neural Networks, pages 61–72, 1996.

[30] C. Fyfe. Topographic product of experts. In International Conference on Ar-

tificial Neural Networks, ICANN2005, pages 397–402, 2005.

[31] C. García-Osorio. Data Mining And Visualization. PhD thesis, School of Computing, University of Paisley, 2005.

[32] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find

better clusterings. In CIKM ’02: Proceedings of the eleventh international

conference on Information and knowledge management, pages 600–607, New

York, NY, USA, 2002. ACM Press.

[33] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical

Association, 84(406):502–516, June 1989.

[34] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.

Springer, 2001.

[35] G. E. Hinton. Training products of experts by minimizing contrastive diver-

gence. Technical Report GCNU TR 2000-004, Gatsby Computational Neuro-

science Unit, University College, London, 2000.

[36] J. Hollmen. Process Modeling Using the Self-Organizing Map. PhD thesis,

Helsinki University of Technology, 1996.

[37] J. Holmstrom. Growing neural gas: Experiments with gng, gng with utility

and supervised gng. Masters thesis in computer science, Uppsala University.

Department of Information Technology, 2002.

[38] A. Hyvarinen. Survey on independent component analysis. Neural Computing

Surveys, 2:94–128, 1999.

[39] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis.

Wiley, 2001.

[40] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures

of local experts. Neural Computation, 3:79–87, 1991.

[41] N. Jardine and R. Sibson. The construction of hierarchic and non-hierarchic

classifications. The Computer Journal, 11:177–184, 1968.

[42] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254,

1967.

[43] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the em

algorithm. Neural Computation, 6:181–214, 1994.

[44] S. Kaski, J. Kangas, and T. Kohonen. Bibliography of self-organizing map

(som) papers: 1981-1997. Neural Computing Surveys, 1:102–350, 1998.

[45] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger. Learning and design of princi-

pal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence,

22(3):281–297, 2000.

[46] K. Kiviluoto and E. Oja. S-map: A network with a simple self-organization

algorithm for generative topographic mappings. In NIPS, 1997.

[47] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 1984.

[48] T. Kohonen. Self-Organising Maps. Springer, 1995.

[49] B. Krose. Projection and clustering. ASCI Advanced Pattern Recognition

Course. Computer Science Department, University of Amsterdam, May 2002.

[50] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.

[51] K. Van Laerhoven. Combining the self-organizing map and k-means clustering

for on-line classification of sensor data. In ICANN, pages 464–469, 2001.

[52] Q. Li, N. Mitianoudis, and T. Stathaki. Spatial kernel k-harmonic means clus-

tering for multi-spectral image thresholding. IEE Vision, Image and Signal

Processing Journal.

[53] Z.-P. Lo and B. Bavarian. On the rate of convergence in topology preserving

neural networks. Biological Cybernetics, 65:11:55–63, 1991.

[54] S. P. Luttrell. Code vector density in topographic mappings: Scalar case. IEEE

Transactions on Neural Networks, 2(4):427–436, July 1991.

[55] K. V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic

Press, 1979.

[56] T.M. Martinetz. Competitive hebbian learning rule forms perfectly topology

preserving maps. In Stan Gielen and Bert Kappen, editors, ICANN93, pages

427–434. Springer Verlag, 1993.

[57] T.M. Martinetz, S.G. Berkovich, and K.J. Schulten. ’neural-gas’ network for

vector quantization and its application to time-series prediction. IEEE Trans-

actions on Neural Networks., 4(4):558–569, 1993.

[58] T.M. Martinetz and K.J. Schulten. Topology representing networks. Neural

Networks., 7:507–522, 1994.

[59] S. McGlinchey, M. Pena, and C. Fyfe. Comparison of quantization errors in the

model- and data-driven harmonic topographic mappings. WSEAS Transactions

On Computers, 5(7):1562–1570, July 2006.

[60] S. McGlinchey, M. Pena, and C. Fyfe. Quantization errors in the harmonic

topographic mapping. In The 9th WSEAS International Conference on applied

mathematics, MATH 06, pages 105–110, May 2006.

[61] I. T. Nabney. NETLAB: Algorithms for pattern recognition. Springer-Verlag

New York, Inc., New York, NY, USA, 2002.

[62] Neural Networks Research Centre. Helsinki University of Technology. Som tool-

box. www.cis.hut.fi/projects/somtoolbox.

[63] M. Oja, S. Kaski, and T. Kohonen. Bibliography of self-organizing map (som)

papers: 1998-2001 addendum. Neural Computing Surveys, 3:1–156, 2003.

[64] E. Pampalk. Limitations of the som and the gtm.

url:citeseer.ist.psu.edu/670342.html. 2001.

[65] M. Pena and C. Fyfe. Developments of the generalised harmonic topographic

mapping. WSEAS Transactions On Computers, 4(11):1548–1555, November

2005.

[66] M. Pena and C. Fyfe. Faster clustering of complex data with the generalised

harmonic topographic mapping (g-hatom). In 5th WSEAS International Con-

ference on Applied Informatics And Communications, WSEAS AIC ’05, pages

270–275, 2005.

[67] M. Pena and C. Fyfe. The harmonic topographic map. In The Irish conference

on Artificial Intelligence and Cognitive Science, AICS05, pages 245–254, 2005.

[68] M. Pena and C. Fyfe. The harmonic topographic map. Tech-

nical Report 35, School of Computing, University of Paisley,

http://cis.paisley.ac.uk/research/reports/, 2005.

[69] M. Pena and C. Fyfe. Model- and data-driven harmonic topographic maps.

WSEAS Transactions On Computers, 4(9):1033–1044, September 2005.

[70] M. Pena and C. Fyfe. Tight clusters and smooth manifolds with the harmonic

topographic map. In 5th WSEAS International Conference on Simulation, Mod-

eling And Optimization, WSEAS SMO ’05, pages 508–513, 2005.

[71] M. Pena and C. Fyfe. Forecasting with topology preserving maps: Harmonic topographic map and topographic product of experts application. In First International Conference on Multidisciplinary Information Sciences and Technologies, InSciT2006, pages 42–46, October 2006.

[72] M. Pena and C. Fyfe. Outlier identification with the harmonic topographic

mapping. In European Symposium on Artificial Neural Networks, ESANN’06,

pages 289–295, April 2006.

[73] M. Pena and C. Fyfe. The topographic neural gas. In 7th International Con-

ference on Intelligent Data Engineering and Automated Learning, IDEAL06.,

pages 241–249, September 2006.

[74] A.K. Qin and P.N. Suganthan. Enhanced neural gas network for prototype-

based clustering. 38(8):1275–1288, August 2005.

[75] F. Questier, Q. Guo, B. Walczak, D.L. Massart, C. Boucon, and S. de Jong. The

neural-gas network for classifying analytical data. Chemometrics and Intelligent

Laboratory Systems, 61/1-2:105–121, 2002.

[76] H. Ritter and T. Kohonen. Self-organising semantic maps. Biological Cybernet-

ics, 61:241–254, 1989.

[77] R. N. Shepard, A. K. Romney, and S. B. Nerlove. Multidimensional Scaling:

Theory and Applications in the Behavioral Sciences., volume 1. Seminar Press,

Inc., 1972.

[78] A. Staiano, R. Tagliaferri, and L. De Vinco. High-d data visualization methods

via probabilistic principal surfaces for data mining applications. In Proceedings

Trim, 2004.

[79] J. V. Stone. Independent Component Analysis. A tutorial introduction. A

bradford book. The MIT Press, 2004.

[80] M. Svensen. GTM: The Generative Topographic Mapping. PhD thesis, Aston

University, Birmingham, UK, 1998.

[81] V. Tereshko. Topology-preserving elastic nets. J. Mira and A. Prieto (Eds.):

Connectionist Models of Neurons, Learning Processes and Artificial Intelligence,

LNCS. Springer-Verlag:Berlin, 2084:551–557, 2001.

[82] V. Tereshko. Deriving cortical maps and elastic nets from topology-preserving

maps. J. Cabestany, A. Prieto, and D.F. Sandoval (Eds.): IWANN’2005,

LNCS. Springer-Verlag: Berlin/Heidelberg, 3512:326–332, 2005.

[83] P. Tino and I. Nabney. Hierarchical GTM: constructing localized non-linear

projection manifolds in a principled way. (IEEE) Transactions on Pattern

Analysis and Machine Intelligence., 24(5):639–656, 2001.

[84] M.E. Tipping. Topographic Mappings and Feed-Forward Neural Networks. PhD

thesis, The University of Aston in Birmingham, 1996.

[85] A. Ultsch. Clustering with som: U*c. In Proc. Workshop on Self-Organizing

Maps, pages 75–82, 2005.

[86] J.J. Verbeek, N. Vlassis, and B. Krose. Locally linear generative topographic

mapping. In Proc. of 12th Belgian-Dutch conf. on Machine Learning, 2002.

[87] J. Vesanto, J. Himberg, E. Alhoniemi, and

J. Parhankangas. Som toolbox for matlab 5.

http://www.cis.hut.fi/projects/somtoolbox/package/papers/techrep.pdf.

[88] C.K.I. Williams and F. V. Agakov. Products of gaussians and probabilistic

minor components analysis. Technical Report EDI-INF-RR-0043, University of

Edinburgh, 2001.

[89] H. Yin. Visom-a novel method for multivariate data projection and structure vi-

sualization. IEEE TRANSACTIONS ON NEURAL NETWORKS, 13(1):237–

243, 2002.

[90] H. Yin. Nonlinear multidimensional data projection and visualisation. Lecture

Notes in Computing Sciences, 2690:377–388, 2003.

[91] B. Zhang. Generalized k-harmonic means – boosting in unsupervised learning.

Technical Report HPL-2000-137, HP Laboratories, Palo Alto, October 2000.

[92] B. Zhang. Generalized k-harmonic means– dynamic weighting of data in unsu-

pervised learning. First SIAM international Conference on Data Mining, 2001.

[93] B. Zhang. Comparison of the performance of center-based clustering algorithms.

In Advances in Knowledge Discovery and Data Mining: 7th Pacific-Asia Con-

ference, PAKDD 2003, pages 63–74, 2003.

[94] B. Zhang. Regression clustering. In Third IEEE International Conference on

Data Mining (ICDM’03), page 451. IEEE Computer Society, 2003.

[95] B. Zhang, M. Hsu, and U. Dayal. K-harmonic means - a data clustering algo-

rithm. Technical Report HPL-1999-124, HP Laboratories, Palo Alto, October

1999.
