UNSUPERVISED LEARNING: AN INFORMATION THEORETIC FRAMEWORK
By
SUDHIR MADHAV RAO
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Sudhir Madhav Rao
To Parents, Teachers and Friends
ACKNOWLEDGMENTS
I would like to take this opportunity to thank my advisor Dr. José C. Príncipe for his
constant, unwavering support and guidance throughout my stay at CNEL. He has been
a great mentor pulling me out of many local minima (in the language of CNEL!). I still
wonder how he works non-stop from morning to evening without lunch, and I am sure this
feeling is shared among many of my colleagues. In short, he has been an inspiration and
continues to be so.
I would like to express my gratitude to all my committee members, Dr. Murali Rao,
Dr. John G. Harris and Dr. Clint Slatton, for readily agreeing to be part of my committee.
They have helped immensely in improving this dissertation with their inquisitive nature
and helpful feedback. I would like to especially thank Dr. Murali Rao, my math mentor,
for keeping a vigil on all my notations and bringing sophistication to my engineering mind!
Special mention is also needed for Julie, the research coordinator at CNEL, for constantly
monitoring the pressure level at the lab and making us smile even if it is for a short while.
My past as well as present colleagues at CNEL need due acknowledgement. Without
them, I would have been shooting questions to walls and mirrors! I have grown to
appreciate them intellectually, and I thank them for being there for me when I needed
them most. It has been a pleasure to play ball on and off the basketball court!
Finally, it would be foolish on my part not to acknowledge my parents and sister for
their constant support and never-ending love. Thanks go out to one and all who have
helped me to complete this journey successfully. It has been an adventure and a fun ride!
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 THEORETICAL FOUNDATION . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1 Information Theoretic Learning . . . . . . . . . . . . . . . . . . . . . . . 26
    2.1.1 Renyi’s Quadratic Entropy . . . . . . . . . . . . . . . . . . . . . 27
    2.1.2 Cauchy-Schwartz Divergence . . . . . . . . . . . . . . . . . . . . 29
    2.1.3 Renyi’s Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 A NEW UNSUPERVISED LEARNING PRINCIPLE . . . . . . . . . . . . . . . 36
3.1 Principle of Relevant Entropy . . . . . . . . . . . . . . . . . . . . . . . 36
    3.1.1 Case 1: β = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
    3.1.2 Case 2: β = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
    3.1.3 Case 3: β → ∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 APPLICATION I: CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Review of Mean Shift Algorithms . . . . . . . . . . . . . . . . . . . . . 49
4.3 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
    4.4.1 GMS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
    4.4.2 GBMS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Mode Finding Ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6 Clustering Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
    4.7.1 GMS vs GBMS Algorithm . . . . . . . . . . . . . . . . . . . . . 62
    4.7.2 GMS vs Spectral Clustering Algorithm . . . . . . . . . . . . . . 63
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 APPLICATION II: DATA COMPRESSION . . . . . . . . . . . . . . . . . . . . 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Review of Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Toy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Face Boundary Compression . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 APPLICATION III: MANIFOLD LEARNING . . . . . . . . . . . . . . . . . . . 88
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 A New Definition of Principal Curves . . . . . . . . . . . . . . . . . . . 90
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
    6.3.1 Spiral Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
    6.3.2 Chain of Rings Dataset . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7 SUMMARY AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . 96
7.1 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Future Direction of Research . . . . . . . . . . . . . . . . . . . . . . . . 101
APPENDIX
A NEURAL SPIKE COMPRESSION . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
    A.2.1 Weighted LBG (WLBG) Algorithm . . . . . . . . . . . . . . . . 108
    A.2.2 Review of SOM and SOM-DL Algorithms . . . . . . . . . . . . . 110
A.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
    A.3.1 Quantization Results . . . . . . . . . . . . . . . . . . . . . . . . 111
    A.3.2 Compression Ratio . . . . . . . . . . . . . . . . . . . . . . . . . 114
    A.3.3 Real-time Implementation . . . . . . . . . . . . . . . . . . . . . 115
A.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
B SPIKE SORTING USING CAUCHY-SCHWARTZ PDF DIVERGENCE . . . . 119
B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
    B.2.1 Electrophysiological Recordings . . . . . . . . . . . . . . . . . . 121
    B.2.2 Neuronal Spike Detection . . . . . . . . . . . . . . . . . . . . . . 121
B.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
    B.3.1 Non-parametric Clustering using CS Divergence . . . . . . . . . 122
    B.3.2 Online Classification of Test Samples . . . . . . . . . . . . . . . 126
B.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
    B.4.1 Clustering of PCA Components . . . . . . . . . . . . . . . . . . 126
    B.4.2 Online Labeling of Neural Spikes . . . . . . . . . . . . . . . . . . 127
B.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
C A SUMMARY OF KERNEL SIZE SELECTION METHODS . . . . . . . . . . 130
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
LIST OF TABLES
Table page
5-1 J(X) and MSQE on the face dataset for the three algorithms averaged over 50 Monte Carlo trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A-1 SNR of spike region and of the whole test data obtained using WLBG, SOM-DL and SOM algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A-2 Probability values obtained for the code vectors and used in Figure A-7. . . . . 115
LIST OF FIGURES
Figure page
1-1 The process of learning showing the feedback between a learner and its associated environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1-2 A new framework for unsupervised learning. . . . . . . . . . . . . . . . . . . . . 25
2-1 Information force within a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 29
2-2 Cross information force between two datasets. . . . . . . . . . . . . . . . . . . . 35
3-1 A simple example dataset. A) Crescent shaped data. B) pdf plot for σ2 = 0.05. 45
3-2 An illustration of the structures revealed by the Principle of Relevant Entropy for the crescent shaped dataset for σ2 = 0.05. As the value of β increases we pass through the modes, principal curves and in the extreme case of β → ∞ we get back the data itself as the solution. . . . . . . . . . . . . . . . . . . . . . . 46
4-1 Ring of 16 Gaussian dataset with different a priori probabilities. The numbering of the clusters is in anticlockwise direction starting with center (1, 0). A) R16Ga dataset. B) a priori probabilities. C) probability density function estimated using σ2 = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4-2 Modes of R16Ga dataset found using GMS and GBMS algorithms. A) Good mode finding ability of GMS algorithm. B) Poor mode finding ability of GBMS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-3 Cost function of GMS and GBMS algorithms. . . . . . . . . . . . . . . . . . . . 58
4-4 Renyi’s “cross” entropy H(X; S) computed for GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases since the assumption of two distinct phases of convergence does not hold true in general. . . . . . . . 59
4-5 A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σg contour plots. C) probability density estimated using σ2 = 0.01. . . . . . . . . 59
4-6 Segmentation results of RGC dataset. A) GMS algorithm. B) GBMS algorithm. 61
4-7 Averaged norm distance moved by particles in each iteration. . . . . . . . . . . 61
4-8 Baseball image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-9 Baseball image segmentation using GMS and GBMS algorithms. The left column shows results from GMS for various numbers of segments and the kernel size at which each was achieved. The right column similarly shows the results from the GBMS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-10 Performance comparison of GMS and spectral clustering algorithms on RGC dataset. A) Clustering solution obtained using GMS algorithm. B) Segmentation obtained using spectral clustering algorithm. . . . . . . . . . . . . . . . . . . . 66
4-11 Bald Eagle image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-12 Multiscale analysis of bald eagle image using GMS algorithm. . . . . . . . . . . 69
4-13 Performance of spectral clustering on bald eagle image. . . . . . . . . . . . . . . 70
4-14 Two popular images from Berkeley image database. A) Flight image. B) Tiger image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4-15 Multiscale analysis of Berkeley images using GMS algorithm. . . . . . . . . . . . 71
4-16 Performance of spectral clustering on Berkeley images. . . . . . . . . . . . . . . 72
4-17 Comparison of GMS and spectral clustering (SC) algorithms with a priori selection of parameters. A new post processing stage has been added to the GMS algorithm where close clusters were recursively merged until the required number of segments was achieved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5-1 Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution. . . . . . . . . . . . . . . . . . . . . . . 79
5-2 Performance of ITVQ-FP and ITVQ-gradient methods on half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results. . . 80
5-3 Effect of annealing the kernel size on ITVQ fixed point method. . . . . . . . . . 82
5-4 64 point quantization results of ITVQ-FP, ITVQ-gradient method and LBG algorithm on face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5-5 Two popular images selected for the image compression application. A) Baboon image. B) Peppers image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5-6 Portions of Baboon and Pepper images used for image compression. . . . . . . . 85
5-7 Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A,B) 99.75% compression for both the algorithms, σ2 = 36 for ITVQ-FP algorithm. C,D) 90% compression, σ2 = 16. E,F) 95% compression, σ2 = 16. G,H) 85% compression, σ2 = 16. . . . . . . . 86
6-1 The principal curve for a mixture of 5 Gaussians using the numerical integration method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6-2 The evolution of principal curves starting with X initialized to the original dataset S. The parameters were set to β = 2 and σ2 = 2. . . . . . . . . . . . . 93
6-3 The principal curve passing through the modes (shown with black squares). . . . 94
6-4 Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) Two components of J(X), namely H(X) and H(X; S). . . . . . . . . . . . . . . . . . . . . . . . 94
6-5 Denoising ability of principal curves. A) Noisy 3D “chain of rings” dataset. B) Result after denoising. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7-1 A novel idea of unfolding the structure of the data in the 2-dimensional space of β and σ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7-2 The idea of information bottleneck method. . . . . . . . . . . . . . . . . . . . . 103
A-1 Synthetic neural data. A) Spikes from two different neurons. B) An instance of the neural data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A-2 2-D embedding of training data which consists of a total of 100 spikes with a certain ratio of spikes from the two different neurons. . . . . . . . . . . . . . . 107
A-3 The outline of weighted LBG algorithm. . . . . . . . . . . . . . . . . . . . . . . 109
A-4 16 point quantization of the training data using WLBG algorithm. . . . . . . . . 112
A-5 Two-dimensional embedding of the training data and code vectors in a 5 × 5 lattice obtained using SOM-DL algorithm. . . . . . . . . . . . . . . . . . . . . 112
A-6 Performance of WLBG algorithm in reconstructing spike regions in the test data. 113
A-7 Firing probability of WLBG codebook on the test data. . . . . . . . . . . . . . . 114
A-8 A block diagram of Pico DSP system. . . . . . . . . . . . . . . . . . . . . . . . . 116
B-1 Example of extracellular potentials from two neurons. . . . . . . . . . . . . . . . 122
B-2 Distribution of the two spike waveforms along the first and the second principal components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B-3 Comparison of clustering based on CS divergence with K-means for spike sorting. 128
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
UNSUPERVISED LEARNING: AN INFORMATION THEORETIC FRAMEWORK
By
Sudhir Madhav Rao
August 2008
Chair: José C. Príncipe
Major: Electrical and Computer Engineering
The goal of this research is to develop a simple and unified framework for unsupervised
learning. As a branch of machine learning, this constitutes the most difficult scenario
where a machine (or learner) is presented only with the data without any desired output
or reward for the action it takes. All that the machine can do, then, is to somehow capture
all the underlying patterns and structures in the data at different scales. In doing so, the
machine has extracted maximal information from the data in an unbiased manner, which
it can then use as a feedback to learn and make decisions.
Until now, the different facets of unsupervised learning, namely clustering, principal
curves and vector quantization, have been studied separately. This is understandable
from the complexity of the field and the fact that no unified theory exists. Recent
attempts to develop such a theory have been mired with complications. One of the
issues is the imposition of models and structures on the data rather than extracting them
directly through self organization of the data samples. Another reason is the lack of
non-parametric estimators for information theoretic measures. Gaussian approximations
generally follow, which fail to capture the higher order features in the data that are really
of interest.
In this thesis, we present a novel framework for unsupervised learning called the
Principle of Relevant Entropy. By using concepts from information theoretic learning,
we formulate this problem as a weighted combination of two terms: an information
preservation term and a redundancy reduction term. This information theoretic
formulation is a function of only two parameters. The user defined weighting parameter
controls the task (and hence the type of structure) to be achieved whereas the inherent
hidden scale parameter controls the resolution of our analysis. By including such a user
defined parameter, we allow the learning machine to influence the level of abstraction
to be extracted from the data. The result is the unraveling of “goal oriented structures”
unique to the given dataset.
Using information theoretic measures based on Renyi’s definition, we estimate these
quantities directly from the data. One can derive a fixed point update scheme to optimize
this cost function, thus avoiding Gaussian approximation altogether. This leads to
interaction of data samples giving us a new self organizing paradigm. An added benefit
is the lack of unnecessary parameters (like step size) common in other optimization
approaches. The strength of this new formulation can be judged from the fact that the
existing mean shift algorithms appear as special cases of this cost function. By going
beyond the second order statistics and dealing directly with the probability density
function, we effectively extract maximal information to reveal the underlying structure
in the data. By bringing clustering, principal curves and vector quantization under one
umbrella, this powerful principle truly discovers the underlying mechanism of unsupervised
learning.
CHAPTER 1
INTRODUCTION
1.1 Background
The functioning of the human brain has baffled scientists for a long time now.
The desire to unravel the mystery has led us to imitate this underlying mechanism in
machines so as to make them more intelligent and adaptable to the environment. Machine
Learning, the field which particularly deals with these questions, can be defined as the
science of the artificial [Langley, 1996]. It is a confluence of numerous fields ranging from
cognitive science and neurobiology to engineering, statistics and optimization theory to
name a few. Its goal involves modeling the mechanism underlying human learning and to
constantly corroborate it through applications to real world problems using empirical and
mathematical approaches. This branch of science can thus be considered as the field of
research devoted to formal study of learning systems [Ghahramani, 2003].
Human mind has been the subject of interest to many great philosophers. More than
2000 years back, Greek philosophers like Plato and Aristotle discussed the concept
of universal [Watanabe, 1985]. The universal simply means a concept or a general
notion. They argued that the way to find universal is to find “forms” or “patterns”
from exemplars. During the 16th century Francis Bacon furthered inductive reasoning
as central to understanding intelligence [Forsyth, 1989]. But the scientific disposition to
these philosophical ideas was not laid until 1748 when Hume [1999] discovered the rule
of induction. Meanwhile, deductive and formal reasoning gained increasing popularity,
especially among logicians. Working with symbols and heuristic rule based algorithms, the
first knowledge based systems were built. This led to the birth of Artificial Intelligence
(AI) in the mid-1950s, which can be considered primarily as a science of deduction. The
idea behind these domain specific expert systems was to explore the role and organization of
knowledge in memory. Borrowing heavily from cognitive science literature, research on
knowledge representation, natural language and expert systems dominated this era.
It was increasingly realized that such heuristic rule based symbolic methods led to
highly complex systems which could not be easily generalized to other similar applications.
Further, the external world does not contain symbols but signals, which need to be processed
and interpreted before any deductive reasoning could be applied. The role of pattern
recognition, signal processing and more importantly inductive inference to understand the
underlying learning principles started gaining attention by the middle of 1960’s. Learning
was regarded as central to intelligent behavior and was thus seen as a way to explain the
general mechanism behind one’s cognitive ability and the resulting actions. Researchers in
the area of pattern recognition began to emphasize algorithmic, numerical methods that
contrasted sharply with heuristic, symbolic methods associated with AI paradigm.
The gap between these two fields continued to increase due to differences in the
fundamental notion of intelligence. It was not until the late 1970s that a renewed interest
emerged which spawned the field of Machine Learning. This was necessitated by the need
to do away with the overwhelming number of domain specific expert systems and an urge to
understand better the underlying principles of learning which govern these systems. Many
new methods were proposed with emphasis in the area of planning, diagnosis and control.
Work on real world problems provided strong clues to the success of these methods. The
incremental nature of this field has helped it to grow tremendously over the past 30 years,
with a greater tendency to build on previous systems. This field today occupies a central position
in our quest to answer many unsolved mysteries of man and machine alike. Borrowing
ideas both from AI and pattern recognition, this field of science has sought to unravel the
central role of learning in intelligence.
Learning can be broadly defined as improvement in performance in some environment
through the acquisition of knowledge resulting from experience in that environment [Langley,
1996]. Thus, learning never occurs in isolation. A learner always finds himself attached to
an environment actively gathering data for knowledge acquisition. With a goal in mind,
the learner finds patterns in data relevant to the task at hand and takes appropriate
Environment
Actions Data
Learner
Figure 1-1. The process of learning showing the feedback between a learner and itsassociated environment.
actions resulting in more data from the environment. This feedback enriches the learner’s
knowledge thus improving his performance. Figure 1-1 depicts this relationship.
There are many aspects of the environment that affect learning, but the most important
is the degree of supervision involved in performing the task.
Broadly, three cases emerge. In the first scenario called supervised learning, the machine
is provided with a sequence of data samples along with the corresponding desired outputs
and the goal is to learn the mapping so as to produce correct output for a given new test
sample. Another scenario is called reinforcement learning where the machine does not
have any desired output values to learn the mapping directly, but gets scalar rewards
(or punishments) for every action it takes. The goal of the machine then, is to learn and
perform actions so as to maximize its rewards (or in turn reduce its punishments). The
final and the most difficult scenario is called unsupervised learning where the machine
neither has outputs nor rewards and hence is left only with the data.
This thesis is about unsupervised learning. At first, it may seem unclear as to
what the machine can possibly do just with the data at hand. Nevertheless, there
exists a structure to the input data and by finding relevant patterns one can learn to
make decisions and predict future outputs. Such learning characteristics are found
throughout biological systems. According to Barlow [1989],
statistical redundancy contains information about the patterns and regularities
of sensory stimuli. Structure and information exist in external stimulus and it
is the task of perceptual system to discover this underlying structure.
In order to do so, we need to extract as much information as possible from the data. It
is here that unsupervised learning relies heavily on statistics and information theory to
dynamically extract information and capture the underlying patterns and regularities in
the data.
1.2 Motivation
Traditionally, unsupervised learning has been viewed as consisting of three different
facets. Clustering, the first facet, is a statistical data analysis technique which deals with
the classification of data into similar objects or clusters. One of the most popular methods
is the K-means technique where Euclidean distance between samples is used to quantify
similarity. Other techniques like hierarchical clustering methods recursively join (or split)
clusters based on proximity measures to form larger (or smaller) clusters [Duda et al.,
2000]. Some other examples include Gaussian mixture models (GMM) [MacKay, 2003]
and recent kernel based spectral clustering methods [Shi and Malik, 2000; Ng et al., 2002]
which have become quite popular in vision community. Clustering is used extensively in
the areas of data mining, image segmentation and bioinformatics.
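The K-means technique cited above alternates two steps: assign each sample to its nearest centroid under Euclidean distance, then move each centroid to the mean of its assigned samples. A minimal sketch in Python/NumPy (the dataset, cluster count and seed below are illustrative choices, not from this thesis):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's K-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid under Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

The empty-cluster guard (keeping the old centroid) is one common convention; restarting the centroid at a random sample is another.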
The second aspect is the notion of principal curves. This can be seen as a non-linear
counterpart of principal component analysis (PCA) where the goal is to find a lower
dimensional embedding appropriate to the intrinsic dimensionality of the data. This
topic was first addressed by Hastie and Stuetzle [1989], who defined principal curves broadly as
“self-consistent” smooth curves which pass through the “middle” of a d-dimensional
probability distribution or data cloud. Using regularization to avoid overfitting, Hastie’s
algorithm attempts to minimize the average squared distance between the data points and the
curve. Other ideas have been proposed including definitions based on mixture model
by Tibshirani [1992] and the recent popular work of Kegl [1999]. Due to their ability
to find intrinsic dimensionality of the data manifold, principal curves have immense
applications in dimensional reduction, manifold learning and denoising.
Vector quantization, the third component of unsupervised learning, is a compression
technique which deals with efficient representation of the data with few code vectors
by minimizing a distortion measure [Gersho and Gray, 1991]. A classic example is the
Linde-Buzo-Gray (LBG) [Linde et al., 1980] technique which minimizes the mean square
error between the code vectors and the data. Another popular and widely used technique
is Kohonen’s self organizing maps (SOM) [Kohonen, 1982; Haykin, 1999] which not only
represents the data efficiently but also preserves the topology by a process of competition
and cooperation between neurons. Speech compression, video coding and communications
are some of the fields where vector quantization is used extensively.
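The LBG technique mentioned above grows its codebook by repeatedly splitting every code vector into a perturbed pair and then refining with Lloyd iterations at each size. A hedged sketch (the perturbation and iteration counts are arbitrary illustrative values, and the codebook size is assumed to be a power of two, as in the classic splitting scheme):

```python
import numpy as np

def lbg(X, n_codes, eps=0.01, iters=50):
    """Linde-Buzo-Gray: grow the codebook by splitting, refine by Lloyd steps."""
    codebook = X.mean(axis=0, keepdims=True)       # start from one code vector
    while len(codebook) < n_codes:
        # Split each code vector into a slightly perturbed pair (size doubles).
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(iters):                     # Lloyd refinement
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for j in range(len(codebook)):
                if np.any(labels == j):
                    codebook[j] = X[labels == j].mean(axis=0)
    return codebook
```

Each refinement pass minimizes the same mean square distortion that the text attributes to LBG; only the splitting schedule distinguishes it from plain K-means.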
Until now, a common framework for unsupervised learning has been missing. The
three different aspects namely clustering, principal curves and vector quantization have
been studied separately. This is quite understandable from the breadth of the field and
the unique niche of applications these facets serve. Nevertheless, one could see them
under the same purview using ideas from statistics and information theory. For example,
it is well known that probability density function (pdf) subsumes all the information
contained in the data. Clusters of data samples would correspond to high density regions
of the data highlighted by the peaks in the distribution. The underlying manifold is
naturally revealed by the ridge lines of the pdf estimate connecting these clusters, while
vector quantization can be seen as preserving directly the pdf information of the data.
Concepts from information theory would help us quantify these pdf features through scalar
descriptors like entropy and divergence. By doing so, we not only gain the advantage of
going beyond the second order statistics and capturing all the information in the data, but
also reveal these and other hidden structures unique to the given dataset. In summary,
such a unifying theory would bring these discordant yet highly overlapping notions
together under the same purview. Altogether, it would advance the learning theory by
bringing new mathematical insights and interrelations between different facets of learning.
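As a concrete instance of such a scalar descriptor, Renyi's quadratic entropy H2(X) = −log ∫ p²(x) dx admits a direct Parzen-window estimate from samples: with Gaussian kernels the integral collapses to a double sum of pairwise kernel evaluations, the "information potential" of information theoretic learning. A sketch (the kernel size sigma is a free user choice, as the thesis emphasizes elsewhere):

```python
import numpy as np

def renyi_quadratic_entropy(X, sigma):
    """H2(X) = -log mean_ij G(x_i - x_j), where G is a Gaussian kernel of
    variance 2*sigma^2 (the convolution of two Parzen kernels of variance sigma^2)."""
    N, d = X.shape
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=2)                 # pairwise squared distances
    s2 = 2 * sigma ** 2                          # variance of the convolved kernel
    G = np.exp(-sq / (2 * s2)) / ((2 * np.pi * s2) ** (d / 2))
    return -np.log(G.mean())                     # -log of the information potential
```

A tightly concentrated sample set yields a larger information potential and hence a lower entropy than a widely spread one, matching the intuition that entropy measures spread.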
1.3 Previous Work
The role of statistics and information theory has been widely accepted now in
unraveling the basic principles behind unsupervised learning. Earliest approaches were
based on building generative models to effectively describe the observed data. The
parameters of these generative models are adjusted to optimize the likelihood of the
data with constraints on model architecture. Bayesian inference models and maximum
likelihood competitive learning are examples of this line of thought [Bishop, 1995;
MacKay, 2003]. Many of these models employ the objective of minimum description
length. Minimizing the description length of the model forces the network to learn
economic representations that capture the underlying regularities in the data. The
downside of this approach is the imposition of external models on the data which
may artificially constrain data description. It has been observed that when the model
assumption is wrong, these methods perform poorly leading to wrong inferences.
Modern approaches have proposed computational architectures inspired by the
principle of self organization of biological systems. The most notable contribution in this
regard was made by Linsker [1988a,b] with the introduction of “Infomax” principle.
According to this principle, in order to find relevant information one needs to maximize
the information rate R of a network constrained by its architecture. This rate R is defined
as the mutual information between noiseless input X and the output Y , that is
R = I(X; Y) = H(Y) − H(Y|X).
Interestingly, in a multilayer feedforward network one needs to only ensure that information
rate R is maximized between each subsequent layer. This would in turn ensure the
maximization of the overall information transfer. Since H(Y |X) quantifies the output
noise which is independent of the network weights, maximization of R is equivalent to
maximization of the output entropy H(Y). In particular, for the Gaussian scenario and a simple linear network, the cost function reduces to maximization of the output variance, giving a Hebbian rule as a stochastic online update. Thus, Principal Component Analysis (PCA) and its Hebbian algorithms [Oja, 1982, 1989] can be seen in terms of information maximization leading to minimum reconstruction error.
A different way to explain the functioning of sensory information processing is the
principle of minimum redundancy proposed by Barlow [1989]. Since the external stimuli
entering our perceptual system are highly redundant, it would be desirable to extract
independent features which would facilitate subsequent learning. The learning of an event
E then requires the knowledge of only the M conditional probabilities of E given each
feature yi, instead of the conditional probabilities of E given all possible combination of
N sensory stimuli. In short, an N-dimensional problem is reduced to M one-dimensional
problems. A natural consequence of this idea is the Independent Component Analysis
(ICA) [Jutten and Herault, 1991; Comon, 1994; Hyvarinen and Oja, 2000]. Unfortunately,
the Shannon entropy definition makes these information theoretic quantities difficult to compute. Under the Gaussian assumption, one could instead reduce the correlation among the outputs, but this would only remove second order dependencies and not address higher order ones.
Pioneering work in the area of unsupervised learning has been done by Becker
[1991]. By framing the problem as building global objective functions within an optimization framework, Becker [1998] has shown that many current methods can be seen as energy-minimizing functions, thus elucidating the similarity among different aspects of learning
theory. In an attempt to go beyond preprocessing stages of sensory systems, Becker
and Hinton [1992] proposed the Imax principle for unsupervised learning which dictates
that signals of interest should have high mutual information across different sensory
input channels. By discovering properties of input that are coherent across space and
time through maximization of mutual information between different channels, one could
extract relevant information to build more abstract representations of the data. Using a multilayer neural network with the assumption that the signal is Gaussian and the noise is additive, Becker was able to discover higher order image features such as
stereo disparity in random dot stereograms [Becker and Hinton, 1992].
These information theoretic coding principles have significantly advanced our
understanding of unsupervised learning. The process of information preservation
and redundancy reduction strategies have provided important clues about the initial
preprocessing stages required to extract relevant features for learning. There exist, though, two major bottlenecks to fully realizing the potential of these theories. The first
constraint is the way the objective function is framed. Most of the previous approaches
either preserve maximal information or aim at redundancy reduction strategy. Some
authors have even tried to use one to achieve the other, thus concentrating on the special
case of equivalence. For example, Bell and Sejnowski [1995] used “Infomax” to achieve
ICA using a nonlinear network in the special case when the nonlinearity matches the cumulative distribution function (cdf) of the independent components. In doing so, a lot of rich
intermediary structures are lost.
The second constraint is the information theoretic nature of the cost function which
makes it difficult to derive self organizing rules. All these models share the general
feature of modeling the brain as a communication channel and applying concepts from
information theory. Since Shannon's definitions of information theoretic measures are hard to compute in closed form, a Gaussian approximation generally follows, giving rules which fail to really capture the higher order features that are of interest. A recent exception is the
information bottleneck (IB) method proposed by Tishby et al. [1999]. Instead of trying to
encode all the information in a random variable S, the authors only quantify the relevant
or meaningful information by introducing another variable Y which is dependent on S, so
that the mutual information I(S; Y ) > 0. Thus, the information that S provides about Y
is passed through a “bottleneck” formed by compact code book X. Using ideas from lossy
source compression, in particular the rate distortion theory, one can get self consistent
equations which are recursively iterated to minimize a variational objective function.
Nevertheless, the method assumes the knowledge of joint distribution p(s, y) which may
not be available in many applications.
1.4 Contribution of this Thesis
Learning is unequivocally connected to finding patterns in data. As Forsyth [1989]
puts it, learning is 99% pattern recognition and 1% reasoning. If we therefore wish to
develop a unified framework for unsupervised learning, we need to address this central
theme of capturing the underlying structure in the data. To put it simply, complex patterns can be described hierarchically (and quite often recursively) in terms of simpler
structures [Pavlidis, 1977]. This is a fundamentally new and promising approach for it
tries to understand the developmental process that generated the observed data itself.
Put succinctly, structures are regularities in the data with interdependencies between
subsystems [Watanabe, 1985]. This immediately tells us that white noise is not a structure
since knowledge of the partial system does not give us any idea of the whole system.
In short, structures have rigidity with interdependencies and are not amorphous. Since
entropy of the overall system with interdependent subsystems is smaller than the sum
of the entropies of partial subsystems, structures can be seen as entropy minimizing
entities. By doing so, they reduce redundancy in the representation of the data. Thus,
formulation of structures can be cast according to the principle of minimum entropy. But
mere minimization of entropy would not allow these structures to capture any relevant
information of the data, leading to trivial solutions. This formulation therefore needs to be modified as minimization of entropy with the constraint of quantifying a certain level of information about the data. By framing this problem as a weighted combination of a
redundancy reduction term and an information preservation term, we propose a novel
unsupervised learning principle called the Principle of Relevant Entropy. This is one of
the major contributions of this thesis.
A learning machine is always associated with a goal [Lakshmivarahan, 1981].
In unsupervised scenarios where the machine only has the data, goals are especially
crucial since they help differentiate relevant from irrelevant information in the data. By
seeking “goal oriented structures” then, the machine can effectively quantify the relevant
information in the data as compact representations needed as a first step for successful
and effective learning. This is another contribution of our thesis. Notice that goals can
exist at many levels of learning. In this thesis, we address the lower level vision tasks
like clustering (for differentiating objects), principal curves (for denoising) and vector
quantization (for compression) which help in decoding patterns from the data relevant to
the goal at hand. By doing so, one can extract an internally derived teaching signal, thus converting an unsupervised problem into a supervised framework.
A key aspect needed for the success of this methodology is to compute the information
theoretic measures of the global cost function directly from the data. Though the Shannon definitions are popularly used, few of them have non-parametric estimators that are easy to compute, which leads to simplifications based on the Gaussian assumption. In this thesis, we do
away with the Shannon definition of information theoretic measures and use instead ideas
from Information Theoretic Learning (ITL) based on Renyi’s entropy [Prıncipe et al.,
2000; Erdogmus, 2002]. This is the third contribution of this thesis. Using information
theoretic principles based on Renyi’s definition [Renyi, 1961, 1976], we estimate different
information theoretic measures directly from the data non-parametrically. These scalar
descriptors go beyond the second order statistics to reveal higher order features in the
data.
Interestingly, one can derive a fixed point update rule for self-organization of the data samples to optimize the cost function, thus avoiding Gaussian approximations altogether.
Borrowing an analogy from physics, in this adaptive context, one can view the data
samples as “particles” and the self organizing rules as local forces. With the “goal”
embedded in the cost, these local forces define the rules of interaction between the data
samples. Self organization of these particles should then reveal the structure in the data
relevant to the goal. Such self organizing, fixed point update rules are the final but an important contribution of this thesis.
To summarize, this thesis proposes a number of key contributions which further the theory of unsupervised learning:
• To formulate structure in terms of the minimum entropy principle with a constraint to quantify a certain level of information about the data.
• To see unsupervised learning as extracting "goal oriented structures" from the data. By doing so, we convert an unsupervised problem into a supervised framework.
• To estimate different information theoretic measures of the formulation directly from data using ITL theory.
• To derive a simple and fast fixed point update rule so as to reveal the underlying structure through self-organization of the data.
This new understanding of unsupervised learning can be summarized with a block
diagram as shown in Figure 1-2. The important idea behind this schematic is the goal
oriented nature of unsupervised learning. In this framework, the neuro-mapper can be
seen as working on physics based principles. Using an analogy with biological systems,
data from the external environment entering our sensory inputs are processed to extract
important features for learning. Based on the goal of the learner, a self organizing rule
is initiated in the neuro-mapper to find appropriate structures in the extracted features.
Though only a small portion of the loop, critical reasoning is an important step which helps one judge how useful these patterns are in achieving the goal. This results in a dual correction: first,
to take appropriate actions in the environment so as to actively generate useful data and
second, to change the functioning of the neuro-mapper itself. This cyclical loop is essential
for effective learning and to adapt in a changing environment. An optional correction may
be used to update the goal itself, which would change the direction of learning.

[Figure: block diagram in which Data from the Environment enters the Learner's Neuro-mapper, where a self organizing principle extracts Structures or patterns; Critical Reasoning weighs these against the Goal and issues Actions back to the Environment.]

Figure 1-2. A new framework for unsupervised learning.

What is interesting in this approach is that the learner's block is converted into a classic supervised
learning framework. This block represents an open system constantly interacting with the
environment. Together with the environment, the overall scheme is a closed system where the principle of conservation is applicable [see Watanabe, 1985, chap. 6].
A natural outcome of this thesis is a simple yet elegant framework bringing different
facets of unsupervised learning namely clustering, principal curves and vector quantization
under one umbrella, thus providing a new outlook on this exciting field of science. In the end, we would like to present a quote from Ghahramani [2003], which best summarizes this field and was a constant source of motivation throughout this work:
In a sense, unsupervised learning can be thought of as finding patterns in the
data above and beyond what would be considered pure unstructured noise.
CHAPTER 2
THEORETICAL FOUNDATION
2.1 Information Theoretic Learning
Let X \in R^d be a random variable in a d-dimensional space. Consider N independent
and identically distributed (iid) samples {x1, x2, . . . , xN} of this random variable. Given
this sample set of the population, a non-parametric estimator of the probability density
function (pdf) of X can be obtained using the Parzen window technique [Parzen, 1962] as
shown below.
p_X(x) = \frac{1}{N h^d} \sum_{j=1}^{N} K\left(\frac{x - x_j}{h}\right), \qquad (2–1)
where K is a kernel satisfying the following conditions:
• \sup_{y \in R^d} |K(y)| < \infty
• \int_{R^d} |K(y)|\, dy < \infty
• \lim_{\|y\| \to \infty} \|y\|^d K(y) = 0
• \int_{R^d} K(y)\, dy = 1
The kernel size h is actually a function of the number of samples N, represented as h(N). Analysis of the estimator p_X(x) together with the four properties above gives the following results:
• If \lim_{N \to \infty} h(N) = 0, then p_X(x) is an asymptotically unbiased estimator.
• If in addition h(N) satisfies \lim_{N \to \infty} N h^d(N) = \infty, then the estimator p_X(x) is also a consistent estimate of the true pdf f(x).
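As a concrete one-dimensional illustration of 2–1 with a Gaussian kernel, a minimal sketch in Python follows (the function name and the sample values are ours, not from the thesis):

```python
import numpy as np

def parzen_pdf(x, samples, h):
    """Parzen window estimate of a 1-D pdf at the point x (equation 2-1).

    Each sample contributes a Gaussian bump of width h; the estimate is
    the average of the N bumps, so it integrates to one by construction.
    """
    samples = np.asarray(samples, dtype=float)
    z = (x - samples) / h
    bumps = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    return bumps.mean()
```

For instance, with the three samples [-1, 0, 1] and h = 1, the estimate at x = 0 is about 0.294; shrinking h with N while keeping N h^d large gives the asymptotic behavior described above.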
We particularly consider the Gaussian kernel given by G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2} for our analysis. The advantage of this kernel selection is two-fold. First, it is a smooth, continuous and infinitely differentiable kernel, as required for our purposes (shown later). Second, the Gaussian kernel has the special property that the convolution of two Gaussians gives another Gaussian with variance equal to the sum of the original variances. That is,

\int G_\sigma(x - x_i)\, G_\sigma(x - x_j)\, dx = G_{\sqrt{2}\sigma}(x_i - x_j).
As would be seen later, this property is important in developing non-parametric estimators
for Renyi’s information theoretic measures.
2.1.1 Renyi’s Quadratic Entropy
Renyi [1961, 1976] proposed a family of entropy measures characterized by order-α.
Renyi’s order-α entropy is defined as
H_\alpha(X) = \frac{1}{1 - \alpha} \log\left(\int f^\alpha(x)\, dx\right). \qquad (2–2)

For the special case of \alpha \to 1, Renyi's \alpha-entropy tends to the Shannon definition, that is,

\lim_{\alpha \to 1} H_\alpha(X) = -\int f(x) \log f(x)\, dx = H_S(X).
Therefore, Hα(X) can be seen as a more general definition of entropy measure. One can
also show that this measure is a monotone decreasing function of α. This can be proved
by taking the first derivative and showing that this is always negative through Jensen’s
inequality. To summarize,

H_{0<\alpha<1}(X) > H_S(X) > H_{\alpha>1}(X).
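This ordering is easy to verify numerically for a discrete pmf. The sketch below (our own illustration, with a hypothetical three-point distribution) evaluates the Renyi family at α = 0.5, the Shannon limit, and α = 2:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi's order-alpha entropy (in nats) of a discrete pmf p, alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    """Shannon entropy (in nats), the alpha -> 1 limit of the Renyi family."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = [0.5, 0.3, 0.2]              # hypothetical pmf, ours
h_half = renyi_entropy(p, 0.5)   # ~1.0637
h_shan = shannon_entropy(p)      # ~1.0297
h_two = renyi_entropy(p, 2.0)    # ~0.9676
```

The three values respect H_{0<α<1}(X) > H_S(X) > H_{α>1}(X), as claimed.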
Now, consider the special case of \alpha = 2. Renyi's quadratic entropy¹ is defined as

H(X) = -\log\left(\int f^2(x)\, dx\right). \qquad (2–3)

Using p_X(x) = \frac{1}{N} \sum_{j=1}^{N} G_{\sigma_X}(x - x_j) as the Parzen estimate of f(x), along with the property of the Gaussian kernel, in 2–3, we can obtain a non-parametric estimator² of Renyi's
¹ We drop the \alpha = 2 subscript for convenience.
² Note that \hat{H}(X) is an estimator of H(X), and so a different notation is used.
quadratic entropy given by

\hat{H}(X) = -\log\left(\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \int G_{\sigma_X}(x - x_i)\, G_{\sigma_X}(x - x_j)\, dx\right)
       = -\log\left(\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j)\right), \qquad (2–4)

where \sigma^2 = 2\sigma_X^2. Notice the argument of the Gaussian kernel, which considers all possible
pairs of samples. The idea of regarding the samples as information particles was first
introduced by Prıncipe and collaborators [Prıncipe et al., 2000; Erdogmus, 2002] upon
realizing that these samples interact with each other through laws that resemble the
potential field and their associated forces in physics.
Since the log is a monotonic function, any optimization based on H(X) can be translated into optimization of the argument of the log, which we denote by V(X) and call the information potential of the samples. We can consider this quantity as an average of the contributions from each particle x_i, as shown below. Note that V(x_i) is the potential field over the space of the samples, with an interaction law given by the kernel shape.

V(X) = \frac{1}{N} \sum_{i=1}^{N} V(x_i), \qquad V(x_i) = \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j).
The derivative of V (xi) gives us the net information force acting on xi due to all other
samples xj . This force is attractive pulling the particle xi in the direction of xj due to
the vector term (xj − xi) in F (xi|xj). The vector summation of all these individual forces
would then give us a net force F (xi).
\frac{\partial}{\partial x_i} V(x_i) = \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j)\, \frac{(x_j - x_i)}{\sigma^2} \;\Rightarrow\; F(x_i) = \frac{1}{N} \sum_{j=1}^{N} F(x_i | x_j). \qquad (2–5)
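Both the information potential V(X) and the forces F(x_i) of 2–5 can be computed directly from a sample matrix. A minimal sketch follows (function names are ours; the kernel is normalized for d dimensions, and X is assumed to be an N x d NumPy array):

```python
import numpy as np

def gaussian_kernel(diffs, sigma):
    """Isotropic d-dimensional Gaussian G_sigma evaluated on difference vectors."""
    d = diffs.shape[-1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * sigma ** 2)) / norm

def information_potential(X, sigma):
    """V(X): average of G_sigma(x_i - x_j) over all N^2 sample pairs."""
    diffs = X[:, None, :] - X[None, :, :]
    return gaussian_kernel(diffs, sigma).mean()

def information_forces(X, sigma):
    """Net attractive force F(x_i) on each sample, as in equation 2-5."""
    diffs = X[None, :, :] - X[:, None, :]      # entry [i, j] holds x_j - x_i
    K = gaussian_kernel(diffs, sigma)          # symmetric kernel weights
    N = X.shape[0]
    return (K[:, :, None] * diffs).sum(axis=1) / (N * sigma ** 2)
```

Because F(x_i|x_j) = -F(x_j|x_i), the forces cancel pairwise and sum to zero over the whole dataset, and the entropy estimator of 2–4 is simply -log V(X).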
This force field is neatly demonstrated in Figure 2-1. Being inherently attractive, the intra-force field on each sample always points inwards (towards the other samples), as shown in this plot.
Figure 2-1. Information force within a dataset.
2.1.2 Cauchy-Schwartz Divergence
In information theory, the concept of relative entropy is a measure that quantifies the difference between two probability distributions f and g corresponding to random variables X and Y. In the case of the Shannon definition, the relative entropy, or Kullback-Leibler (KL) divergence, is uniquely defined as
D_{KL}(f\|g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx. \qquad (2–6)

It can be seen as a measure of the uncertainty incurred when one approximates the "true" distribution f(x) with another distribution g(x). Thus, D_{KL}(f\|g) can also be rewritten as

D_{KL}(f\|g) = -\int f(x) \log g(x)\, dx + \int f(x) \log f(x)\, dx = H_S(X; Y) - H_S(X). \qquad (2–7)
H_S(X; Y)³ is called the cross entropy and can be seen as the expected gain of information E_{f(x)}[-\log g(x)] when observing g(x) with respect to the "true" distribution f(x).
On the contrary, in the case of Renyi’s definition, there exists more than one way of
representing the divergence measure. Two such definitions were given by Renyi himself
[Renyi, 1976]. The more popular one is written as
D_{RKL-1}(f\|g) = \frac{1}{\alpha - 1} \log \int \frac{f^\alpha(x)}{g^{\alpha-1}(x)}\, dx. \qquad (2–8)

For the special case of \alpha \to 1,

\lim_{\alpha \to 1} D_{RKL-1}(f\|g) = D_{KL}(f\|g).
One of the drawbacks of this definition is that, while evaluating it, if g(x) is zero at some values of x where f(x) is nonzero, it could yield a value of infinity. Recently, another
definition for Renyi’s KL divergence was proposed by Lutwak et al. [2005]. The relative
entropy can be written as follows,
D_{RKL-2}(f\|g) = \log \frac{\left(\int g^{\alpha-1} f\right)^{\frac{1}{1-\alpha}} \left(\int g^{\alpha}\right)^{\frac{1}{\alpha}}}{\left(\int f^{\alpha}\right)^{\frac{1}{\alpha(1-\alpha)}}}. \qquad (2–9)
Note that the denominator in the argument of the log contains an integral evaluation. So
f(x) could be zero at some points but overall the integral is well defined, thus avoiding
numerical issues of the previous definition. Again, for α → 1, this gives DKL(f ||g). In
particular, for α = 2, one can rewrite 2–9 as
D_{cs}(f\|g) = -\log \frac{\int f(x)\, g(x)\, dx}{\sqrt{\left(\int f^2(x)\, dx\right)\left(\int g^2(x)\, dx\right)}}. \qquad (2–10)
We call this the Cauchy-Schwartz divergence since the same could also be derived using
the Cauchy-Schwartz inequality [Rudin, 1976]. Note that the argument is always ∈ (0, 1]
³ We use a semicolon instead of a comma to differentiate this quantity from the joint entropy H_S(X, Y) = -\int\!\!\int f(x, y) \log f(x, y)\, dx\, dy.
and therefore, Dcs(f ||g) ≥ 0. Further, Dcs(f ||g) is symmetric unlike other divergence
measures. It does not satisfy the triangle inequality, though, and hence is not a distance metric. To summarize, the following properties hold:
• Dcs(f ||g) ≥ 0 for all f and g.
• Dcs(f ||g) = 0 iff f = g.
• Dcs(f ||g) is symmetric i.e. Dcs(f ||g) = Dcs(g||f).
2.1.3 Renyi’s Cross Entropy
Similar to the relation between KL divergence and Shannon entropy in 2–7, one can
derive a connection between Cauchy-Schwartz divergence and Renyi’s quadratic entropy.
Rearranging 2–10, we get

D_{cs}(f\|g) = -\log \int f(x)\, g(x)\, dx + \frac{1}{2} \log \int f^2(x)\, dx + \frac{1}{2} \log \int g^2(x)\, dx
       = H(X; Y) - \frac{1}{2} H(X) - \frac{1}{2} H(Y). \qquad (2–11)
The term H(X; Y) can be considered as Renyi's cross entropy. The argument of the log can also be seen as E_{f(x)}[g(x)]. One can also derive this following the actual derivation
of Renyi’s entropy [Renyi, 1976]. We prove this for the discrete case in the following
theorem.
Theorem 1. Consider a set E with "true" probabilities of its elements given by \{p_1, p_2, \ldots, p_N\} (where \sum p_i = 1). After an experiment, we observe these elements with probabilities \{q_1, q_2, \ldots, q_N\}. The gain in information can then be expressed as

H_\alpha(p, q) = \frac{1}{1 - \alpha} \log \sum_k p_k\, q_k^{\alpha-1}.

In particular, for \alpha = 2, we get

H(p, q) = -\log \sum_k p_k\, q_k.
Proof. The information gained by observing an element with probability q_k is I_k = \log \frac{1}{q_k}. Let us denote the information of all the elements by the set

\mathcal{I} = \{I_1, I_2, \ldots, I_N\} = \{-\log q_1, -\log q_2, \ldots, -\log q_N\}.

Since the "true" distribution of the samples is \{p_1, p_2, \ldots, p_N\}, the average gain in information can be easily computed to give the cross entropy under the Shannon definition:

H_S(p, q) = -\sum_k p_k \log q_k.
One could also consider the Kolmogorov-Nagumo function associated with the mean. According to this general theory of means, a mean of the real numbers x_1, x_2, \ldots, x_N with weights p_1, p_2, \ldots, p_N is given by

\varphi^{-1}\left(\sum_{k=1}^{N} p_k\, \varphi(x_k)\right),

where \varphi(x) is an arbitrary continuous and strictly monotone function on the real numbers. Using this, the average mean information can be represented as

\bar{I} = \varphi^{-1}\left(\sum_{k=1}^{N} p_k\, \varphi(I_k)\right).
In the case of information, it is not sufficient that the function \varphi(x) be continuous and strictly monotone; one also needs \varphi(x) to satisfy the postulate of additivity. A trivial example of a \varphi(x) which satisfies this requirement is a linear function, which was used by Shannon. Another function which satisfies this constraint is the exponential function. In particular, Renyi used the following function:

\varphi(x) = 2^{(1-\alpha)x}.
Using this exponential function we get

H_\alpha(p, q) = \varphi^{-1}\left(\sum_{k=1}^{N} p_k\, \varphi(I_k)\right) = \varphi^{-1}\left(\sum_{k=1}^{N} p_k\, 2^{(1-\alpha) I_k}\right) = \varphi^{-1}\left(\sum_{k=1}^{N} p_k\, q_k^{\alpha-1}\right) = \frac{1}{1 - \alpha} \log\left(\sum_{k=1}^{N} p_k\, q_k^{\alpha-1}\right).

In particular, for \alpha = 2, we get H(p, q) = -\log \sum_k p_k\, q_k.
Notice that the argument of Renyi's cross entropy is an inner product between two pdfs and hence is amenable to deriving a non-parametric estimator. Let X, Y \in R^d be two random variables with pdfs f and g respectively. Suppose we have iid samples \{x_1, x_2, \ldots, x_N\} and \{y_1, y_2, \ldots, y_M\} from these two random variables. Using the Parzen estimates p_X(x) = \frac{1}{N} \sum_{i=1}^{N} G_{\sigma_X}(x - x_i) and p_Y(y) = \frac{1}{M} \sum_{i=1}^{M} G_{\sigma_Y}(y - y_i) corresponding to f and g respectively, we have
\hat{H}(X; Y) = -\log\left(\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \int G_{\sigma_X}(u - x_i)\, G_{\sigma_Y}(u - y_j)\, du\right)
       = -\log\left(\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} G_\sigma(x_i - y_j)\right), \qquad (2–12)

where \sigma^2 = \sigma_X^2 + \sigma_Y^2. The argument of the Gaussian kernel now shows interaction of the samples from one dataset with the samples of the other dataset. One can consider the overall argument of the log as the cross information potential V(X; Y), which can be represented as an average over the individual potentials V(x_i; Y):

V(X; Y) = \frac{1}{N} \sum_{i=1}^{N} V(x_i; Y), \qquad V(x_i; Y) = \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x_i - y_j).
The derivative of V(x_i; Y) gives us the cross information force acting on each sample x_i due to all samples \{y_j\}_{j=1}^{M} of the set Y.

\frac{\partial}{\partial x_i} V(x_i; Y) = \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x_i - y_j)\, \frac{(y_j - x_i)}{\sigma^2} \;\Rightarrow\; F(x_i; Y) = \frac{1}{M} \sum_{j=1}^{M} F(x_i | y_j). \qquad (2–13)

Similarly, one can derive the cross information force acting on sample y_j due to all samples \{x_i\}_{i=1}^{N} of the set X by expressing V(X; Y) as an average over V(y_j; X) and differentiating with respect to y_j, as shown below for completeness.

\frac{\partial}{\partial y_j} V(y_j; X) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(y_j - x_i)\, \frac{(x_i - y_j)}{\sigma^2} \;\Rightarrow\; F(y_j; X) = \frac{1}{N} \sum_{i=1}^{N} F(y_j | x_i). \qquad (2–14)
Note that F(x_i|y_j) and F(y_j|x_i) have the same magnitude but opposite directions, giving us a symmetric graph model. This concept is well depicted in Figure 2-2, where one can see the inter-force field operating between samples of the two datasets.
2.2 Summary
In the previous sections, we have derived non-parametric estimators for both Renyi’s
entropy and cross entropy terms. This gives us a means to estimate these quantities
directly from the data. From 2–11, one can then get a non-parametric estimator for
Cauchy-Schwartz divergence measure as well. Thus,

\hat{D}_{cs}(p_X \| p_Y) = \hat{H}(X; Y) - \frac{1}{2} \hat{H}(X) - \frac{1}{2} \hat{H}(Y). \qquad (2–15)
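The estimator 2–15 can be assembled from three plug-in information potentials. The sketch below (function names are ours) assumes both sample sets share the same kernel size sigma, so a single sigma serves the two marginal potentials and the cross potential:

```python
import numpy as np

def potential(A, B, sigma):
    """Average of G_sigma(a_i - b_j) over all pairs: the information
    potential V(A) when B is A, and the cross potential V(A; B) otherwise."""
    d = A.shape[1]
    diffs = A[:, None, :] - B[None, :, :]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum(diffs ** 2, axis=2) / (2.0 * sigma ** 2)).mean() / norm

def cs_divergence(X, Y, sigma):
    """Plug-in Cauchy-Schwartz divergence of equation 2-15:
    D_cs = H(X;Y) - H(X)/2 - H(Y)/2, with each H = -log V."""
    return (-np.log(potential(X, Y, sigma))
            + 0.5 * np.log(potential(X, X, sigma))
            + 0.5 * np.log(potential(Y, Y, sigma)))
```

Numerically, the estimate reproduces the properties listed in Section 2.1.2: it is nonnegative, zero for identical sample sets, and symmetric in its two arguments.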
Another aspect which requires particular attention is the concept of information forces. Figures 2-1 and 2-2 urge one to ask what would happen to the samples if they were allowed to move freely under the influence of these forces. This question was the genesis of this thesis. Initial efforts at moving the samples using
Figure 2-2. Cross information force between two datasets.
gradient methods showed promising results, but were mired in problems of step-size selection and stability. We soon realized that for the success of this concept, two things were essential:
• To devise a global cost function which would combine these two forces appropriately to achieve a "goal".
• To come up with a fast fixed-point update rule, thus avoiding step-size issues.
The subsequent chapters elaborate on how we developed a novel framework for unsupervised learning taking these two aspects into consideration.
These ideas lie at the heart of information theoretic learning (ITL) [Prıncipe et al.,
2000]. By playing directly with the pdf of the data, ITL effectively goes beyond the
second order statistics. The result is a cost function that directly manipulates information
giving rise to powerful techniques with numerous applications in the area of adaptive
systems [Erdogmus, 2002] and machine learning [Jenssen, 2005; Rao et al., 2006a].
CHAPTER 3
A NEW UNSUPERVISED LEARNING PRINCIPLE
3.1 Principle of Relevant Entropy
A common theme among all learning principles is the information feedback loop
necessary for knowledge acquisition. There is more than one way of summarizing a given
dataset, but information is only relevant when it is useful in attaining the “goal”. This
aspect (as to what information to extract and learn) is quite clear in some scenarios while
in some others it is not. In supervised learning scenario, for example, one is not only given
the data but also a priori information. This could be in the form of class labels in the case
of classification problems or desired signal to distinguish from noise in the case of filtering
paradigms. Using this training set, one then needs to design systems which could adapt
and learn the mapping so as to generalize it to new test samples. From optimization point
of view therefore, this problem is well posed.
On the other hand, in unsupervised learning context the learner is only presented
with the data. What is then a best strategy to learn? The answer lies in discovering
structures in the data. As [Watanabe, 1985] puts it, these are rigid entities with
interdependent subsystems that capture the regularities in the data. Since these
regularities or patterns are subsets of the given data, their entropy is smaller than the
entire sample set. The first step therefore in discovering these patterns is to minimize
the entropy of the data. This in turn would reduce redundancy. But mere minimization
of entropy would give us a trivial solution of data collapsing to a single point which is
not useful. To discover structure, one needs to model them as compact representations
of the data. This connection between structures and their corresponding data could be
established by introducing a distortion measure. Depending on the extent of the distortion
then, one is bound to find a range of structures which can be seen as an information
extraction process. Since minimizing the entropy with this constraint adds relevance to the
resulting structure, we call this the Principle of Relevant Entropy.
To be specific, consider a dataset S \in R^d with sample population \{s_1, s_2, \ldots, s_M\}. This original dataset is static and kept constant throughout the process. Let us now create a new dataset X = \{x_1, x_2, \ldots, x_N\} such that X = S initially. Starting with this
initialization, we would like to evolve the samples of X so as to capture relevant structures
of S. We formulate this problem as minimizing a weighted combination of the Renyi’s
entropy of X namely H(X), and the Cauchy Schwartz divergence measure between the
probability densities (pdfs) of X and S as shown below.
J(X) = \min_X\; H(X) + \beta\, D_{cs}(p_X \| p_S), \qquad (3–1)
where β ∈ [0,∞). The first quantity H(X) can be seen as a redundancy reduction term
and the second quantity Dcs(pX ||pS), as an information preservation term. At the first
iteration, Dcs(pX ||pS) = 0 due to the initialization X = S. This would reduce the entropy
of X irrespective of the value of β selected, thus making Dcs(pX ||pS) positive. This is a
key factor in preventing X from diverging. From the second iteration onwards, depending
on the β value chosen, one would iteratively move the samples of X to capture different
regularities in S. Interestingly, there exists a hierarchical trend among these regularities
resulting from the continuous nature of the distortion measure which varies between
[0,∞). From this point of view, one can see these regularities as quantifying different
levels of structure in the data, thus giving us a composite framework.
To understand this better, we will investigate the behavior of the cost function for some special values of \beta. Before we do this, let us first simplify the cost function. We can substitute the expansion of the D_{cs}(p_X \| p_S) term, as given in 2–15, into 3–1. To avoid working
with fractional values, we use a more convenient definition of Cauchy-Schwartz divergence
given by
D_{cs}(p_X \| p_S) = -\log \frac{\left(\int p_X(u)\, p_S(u)\, du\right)^2}{\left(\int p_X^2(u)\, du\right)\left(\int p_S^2(u)\, du\right)} = 2 H(X; S) - H(X) - H(S).
This satisfies all the properties of divergence as listed in Chapter 2, following the Cauchy-Schwartz inequality. We could also see this change as absorbing a scaling constant
(a value of 1/2) in the weighting factor β in 3–1. In any case, this is not going to alter any
results, but would only change the values of β at which we attain those results. Having
done this, the cost function now looks like
J(X) = \min_X\; H(X) + \beta\, [2 H(X; S) - H(X) - H(S)] \equiv \min_X\; (1 - \beta) H(X) + 2\beta\, H(X; S). \qquad (3–2)
Note that we have dropped the extra term H(S) since this is constant with respect to X.
With this simplified form of the cost function, we will now investigate the effect of the
weighting parameter β.
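Equation 3–2 can be evaluated directly from the two sample sets using the plug-in potentials of Chapter 2. A minimal sketch follows (function names are ours; a single kernel size sigma is assumed for both sets):

```python
import numpy as np

def _potential(A, B, sigma):
    """Average pairwise Gaussian kernel between the rows of A and B."""
    d = A.shape[1]
    diffs = A[:, None, :] - B[None, :, :]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum(diffs ** 2, axis=2) / (2.0 * sigma ** 2)).mean() / norm

def relevant_entropy_cost(X, S, beta, sigma):
    """J(X) = (1 - beta) H(X) + 2 beta H(X; S), as in equation 3-2."""
    h_x = -np.log(_potential(X, X, sigma))   # Renyi quadratic entropy of X
    h_xs = -np.log(_potential(X, S, sigma))  # Renyi cross entropy of X and S
    return (1.0 - beta) * h_x + 2.0 * beta * h_xs
```

At the initialization X = S we have H(X; S) = H(X), so J = (1 + beta) H(X); for beta = 0 the cost reduces to the entropy of X alone, as the next subsection discusses.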
3.1.1 Case 1: β = 0
For this special case, the cost function in equation 3–2 reduces to

J(X) = \min_X H(X).
Looking back at equation 3–1, we see that the information preservation term has been
eliminated and the cost function directly minimizes the entropy of the dataset iteratively.
As pointed out earlier, this simple case leads to a trivial solution with all the samples
converging to a single point.
Since the log is a monotonic function, minimization of entropy is equivalent to maximization of the argument of the log, which is the information potential V(X) of the dataset:

J(X) \equiv \max_X V(X) = \max_X \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j).
Differentiating this quantity with respect to x_{k=\{1,2,\ldots,N\}} \in X and equating it to zero, we get the following fixed point update rule:

\frac{\partial}{\partial x_k} V(X) = 0
\frac{2}{N^2} \sum_{j=1}^{N} G_\sigma(x_k - x_j)\, \frac{x_j - x_k}{\sigma^2} = 0
\left\{\sum_{j=1}^{N} G_\sigma(x_k - x_j)\right\} x_k = \sum_{j=1}^{N} G_\sigma(x_k - x_j)\, x_j
x_k = \frac{\sum_{j=1}^{N} G_\sigma(x_k - x_j)\, x_j}{\sum_{j=1}^{N} G_\sigma(x_k - x_j)}.

Since we run this fixed point rule continuously for each x_k \in X, we explicitly write the iteration number \tau as

x_k^{(\tau+1)} = \frac{\sum_{j=1}^{N} G_\sigma(x_k^{(\tau)} - x_j^{(\tau)})\, x_j^{(\tau)}}{\sum_{j=1}^{N} G_\sigma(x_k^{(\tau)} - x_j^{(\tau)})}. \qquad (3–3)
Because it iteratively minimizes Renyi's entropy of the dataset, this update rule results in the collapse of the entire dataset to a single point, as proved in the following theorem.
Theorem 2. The update rule 3–3 converges with all the data samples merging to a single
point.
Proof. Let us denote the sample set after τ iterations by X(τ+1) = {x(τ+1)1 , x
(τ+1)2 , . . . , x
(τ+1)N }.
Starting with the initialization X0 = S, the dataset X then evolves in the following
sequence.
X(0) → X(1) → . . . → X(τ+1) . . . (3–4)
We could rewrite equation 3–3 as
x(τ+1)k =
N∑j=1
w(τ)j (k)x
(τ)j ,
N∑j=1
w(τ)j (k) = 1.
39
Therefore, each sample $x_k^{(\tau+1)}$ is a convex combination of the samples of $X^{(\tau)}$ and hence
belongs to its convex hull¹, denoted by $\mathrm{Hull}(X^{(\tau)})$. Further, due to the nature of the
Gaussian kernel, each weight $w_j^{(\tau)}(k)$ lies strictly in $(0, 1)$. Therefore, each sample $x_k^{(\tau+1)}$
lies strictly inside the convex hull of $X^{(\tau)}$. Generalizing this to all the samples of $X^{(\tau+1)}$,
we have
\[ X^{(\tau+1)} \subset \mathrm{Hull}(X^{(\tau)}). \]
From the convexity of the hull, this also implies that
\[ \mathrm{Hull}(X^{(\tau+1)}) \subset \mathrm{Hull}(X^{(\tau)}). \]
With the sequence in 3–4, then, one can conclude the following:
\[ \mathrm{Hull}(X^{(0)} = S) \supset \mathrm{Hull}(X^{(1)}) \supset \mathrm{Hull}(X^{(2)}) \supset \cdots \supset \mathrm{Hull}(X^{(\tau+1)}) \supset \cdots \]
This collapsing sequence therefore converges with all the data samples merging to a single
point.
3.1.2 Case 2: β = 1
For β = 1, the cost function directly minimizes Renyi's cross entropy between X and
the original dataset S. Again, using the monotonicity of the log, the cost function
can be rewritten as
\[ J(X) = \min_X H(X; S) \equiv \max_X V(X; S) = \max_X \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} G_\sigma(x_i - s_j). \]
Similar to the derivation in the previous section, one could obtain the following fixed point
update rule by differentiating J(X) with respect to $x_k \in X$, $k = 1, 2, \ldots, N$, and equating it to
zero.

¹ The convex hull of a set X is the smallest convex set containing X.
\[ x_k^{(\tau+1)} = \frac{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right) s_j}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}. \tag{3–5} \]
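For illustration, the update 3–5 can be sketched as follows; the function name `gms_step` and the NumPy vectorization are our own (hypothetical) choices, not part of the original text:

```python
import numpy as np

def gms_step(X, S, sigma):
    """One fixed-point iteration of equation 3-5: each sample of X moves to a
    Gaussian-weighted mean of the *fixed* reference set S, so the iteration
    seeks the modes of the Parzen estimate of S."""
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # |x_k - s_j|^2
    W = np.exp(-d2 / (2 * sigma ** 2))                   # G_sigma(x_k - s_j)
    return (W @ S) / W.sum(axis=1, keepdims=True)
```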
Notice the lack of any superscript on $s_j \in S$, since it is kept constant throughout the
process of adaptation of X. Unlike 3–3 for β = 0, in this case the samples of $X^{(\tau)}$
are constantly compared to this fixed dataset. This results in the samples of X converging
to the modes (peaks) of the pdf $p_S(u)$, thus revealing an important structure of S. We prove
this in the following theorem.
Theorem 3. Modes of the pdf estimate $p_S(x) = \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x - s_j)$ are the fixed points of the
update rule 3–5.
Proof. The proof is quite simple. We could rearrange the update rule 3–5 as
\[ \left\{ \sum_{j=1}^{M} G_\sigma(x_k - s_j) \right\} x_k = \sum_{j=1}^{M} G_\sigma(x_k - s_j)\, s_j \]
\[ \sum_{j=1}^{M} G_\sigma(x_k - s_j) \{ s_j - x_k \} = 0 \]
\[ \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x_k - s_j) \left\{ \frac{s_j - x_k}{\sigma^2} \right\} = 0 \]
\[ \frac{\partial}{\partial x} p_S(x) \Big|_{x = x_k} = 0. \]
Since modes are stationary solutions of $\frac{\partial}{\partial x} p_S(x) = 0$, they are the fixed points of 3–5.
We can also rewrite equation 3–5 as
\[ x_k^{(\tau+1)} = \sum_{j=1}^{M} w_j^{(\tau)}(k)\, s_j, \qquad \sum_{j=1}^{M} w_j^{(\tau)}(k) = 1. \]
Here $w_j^{(\tau)}(k)$ are the weights evaluated with respect to the original data S, which is kept
constant throughout the process. Since each $w_j^{(\tau)}(k)$ lies strictly in $(0, 1)$, each iteration
sample $x_k^{(\tau+1)}$ is mapped strictly inside the convex hull of S (and not of $X^{(\tau)}$ as in the case of
β = 0). Taking all the data samples into consideration, then,
\[ X^{(\tau+1)} \subset \mathrm{Hull}(S). \]
This gives rise to a stable sequence where the data samples converge to their respective
modes. Therefore, for β = 1, one can identify high concentration regions in the original
dataset S, which could be used for clustering applications.
3.1.3 Case 3: β → ∞
Before we analyze this asymptotic case, we will derive a general update rule for β ≠ 0.
We rewrite equation 3–2 as
\[
\begin{aligned}
J(X) &= \min_X\; (1-\beta) H(X) + 2\beta H(X; S) \\
&= \min_X\; -(1-\beta) \log V(X) - 2\beta \log V(X; S) \\
&= \min_X\; -(1-\beta) \log \left\{ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j) \right\} - 2\beta \log \left\{ \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} G_\sigma(x_i - s_j) \right\}.
\end{aligned}
\]
Taking the derivative of J(X) with respect to $x_k \in X$, $k = 1, 2, \ldots, N$, and equating to zero, we
have²
\[
\frac{(1-\beta)}{V(X)} \frac{2}{N^2} \sum_{j=1}^{N} G_\sigma(x_k - x_j) \left\{ \frac{x_j - x_k}{\sigma^2} \right\}
+ \frac{\beta}{V(X; S)} \frac{2}{NM} \sum_{j=1}^{M} G_\sigma(x_k - s_j) \left\{ \frac{s_j - x_k}{\sigma^2} \right\} = 0
\]
\[
\frac{(1-\beta)}{V(X)} F(x_k) + \frac{\beta}{V(X; S)} F(x_k; S) = 0, \tag{3–6}
\]
where $F(x_k)$ and $F(x_k; S)$ are given by equations 2–5 and 2–13. Equation 3–6 shows the
net force acting on sample $x_k \in X$. This force is a weighted combination of two forces: the
information force within the dataset X and the cross information force between X and S.
² The factor 2 in the first term is due to the double summation and the symmetry of the Gaussian kernel.
Depending on the value of β, these two forces combine in different proportions giving rise
to different rules of interaction between data samples.
Rearranging 3–6 we have
\[
\left\{ \frac{\beta}{V(X; S)} \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x_k - s_j) \right\} x_k
= \frac{(1-\beta)}{V(X)} \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_k - x_j)\, x_j
+ \frac{\beta}{V(X; S)} \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x_k - s_j)\, s_j
- \left\{ \frac{(1-\beta)}{V(X)} \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_k - x_j) \right\} x_k.
\]
This gives us the following fixed point update rule,
\[
x_k^{(\tau+1)} = c\, \frac{(1-\beta)}{\beta} \frac{\sum_{j=1}^{N} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right) x_j^{(\tau)}}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}
+ \frac{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right) s_j}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}
- c\, \frac{(1-\beta)}{\beta} \frac{\sum_{j=1}^{N} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right)}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}\, x_k^{(\tau)}, \tag{3–7}
\]
where $c = \frac{V(X; S)}{V(X)} \frac{M}{N}$. Notice that there are three different ways of rearranging 3–6 to get a
fixed point update rule, but we found only the fixed point iteration 3–7 to be stable and
consistent.
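As a sketch of the general rule 3–7 (our own NumPy transcription; the name `relevant_entropy_step` is hypothetical, and V(X) and V(X; S) are estimated from the same Gaussian kernels as in the text):

```python
import numpy as np

def relevant_entropy_step(X, S, sigma, beta):
    """One iteration of the general fixed-point rule 3-7 (valid for beta != 0)."""
    dxx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dxs = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    Kxx = np.exp(-dxx / (2 * sigma ** 2))            # G_sigma(x_k - x_j)
    Kxs = np.exp(-dxs / (2 * sigma ** 2))            # G_sigma(x_k - s_j)
    N, M = len(X), len(S)
    c = (Kxs.sum() / (N * M)) / (Kxx.sum() / N ** 2) * (M / N)  # V(X;S)/V(X) * M/N
    denom = Kxs.sum(axis=1, keepdims=True)           # sum_j G_sigma(x_k - s_j)
    a = c * (1 - beta) / beta
    return (a * (Kxx @ X) / denom + (Kxs @ S) / denom
            - a * Kxx.sum(axis=1, keepdims=True) / denom * X)
```

For β = 1 the first and last terms vanish and the rule reduces to 3–5, while for large β it approaches 3–9.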
Having derived the general form, we are now ready to discuss the asymptotic case
of β → ∞. It turns out that for this special case, one ends up directly minimizing the
Cauchy-Schwartz divergence measure Dcs(pX ||pS). This is not obvious by looking at
equation 3–1. We prove this in Theorem 4. Since we start with the initialization X = S,
a consequence of this extreme case is that we get back the data itself as the solution. This
corresponds to the highest level of structure with all the information preserved at the cost
of no redundancy reduction.
Theorem 4. When β →∞, the cost function 3–1 directly minimizes the Cauchy-Schwartz
pdf divergence Dcs(pX ||pS).
Proof. We prove this theorem by showing that the fixed point equation derived by directly
optimizing $D_{cs}(p_X \| p_S)$ is the same as 3–7 when β → ∞. Consider the cost function
\[
F(X) = \min_X D_{cs}(p_X \| p_S) \equiv \min_X 2 H(X; S) - H(X) = \min_X -2 \log V(X; S) + \log V(X). \tag{3–8}
\]
Differentiating with respect to $x_k \in X$, $k = 1, 2, \ldots, N$, and equating to zero, we have
\[ \frac{1}{V(X)} F(x_k) - \frac{1}{V(X; S)} F(x_k; S) = 0. \]
Rearranging the above equation just as we did to derive equation 3–7, we get
\[
x_k^{(\tau+1)} = -c\, \frac{\sum_{j=1}^{N} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right) x_j^{(\tau)}}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}
+ \frac{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right) s_j}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}
+ c\, \frac{\sum_{j=1}^{N} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right)}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}\, x_k^{(\tau)}, \tag{3–9}
\]
where again $c = \frac{V(X; S)}{V(X)} \frac{M}{N}$. Taking the limit β → ∞ in 3–7 gives 3–9.
Corollary 4.1. With X initialized to S, the data satisfies the fixed point update rule 3–9.
Proof. When $X^{(0)} = S$, where $X^{(0)}$ corresponds to the initialization at τ = 0, we have
V(X) = V(X; S) and N = M. Substituting this in 3–9, we see that
\[ x_k^{(\tau+1)} = x_k^{(\tau)}, \]
and hence the update rule 3–9 converges at the very first iteration, giving the data itself as
the solution.
3.2 Summary
The analysis so far of the Principle of Relevant Entropy at particular values of β
revealed structures which at one end balance information preservation and at the other
end minimize redundancy. Further, a continuous array of such structures exists for
Figure 3-1. A simple example dataset. A) Crescent shaped data. B) pdf plot for σ² = 0.05.
different proportions of these two terms. For example, when 1 < β < 3 one could get
principal curves passing through the data. These can be considered as non-linear versions
of the principal components, embedding the data in a lower dimensional space. What
is even more interesting is that these curves represent ridges passing through the modes
(peaks) of the pdf pS(u) and hence subsume the information of the modes revealed for
β = 1. There is, therefore, a hierarchical structure to this framework. At one extreme,
with β = 0, we have a single point which is highly redundant but preserves no information
about the data, while at the other extreme, when β → ∞, we have the data itself as the
solution, preserving all the information with no redundancy reduction. In between
these two extremes, one encounters different phases, like modes and principal curves, which
reveal interesting patterns unique to the given dataset. To demonstrate this idea, we
use the simple example of a crescent-shaped dataset, shown in Figure 3-1. Applying our
cost function for different values of β reveals various structures relevant to the dataset, as
shown in Figure 3-2.
From an unsupervised learning point of view, there are very interesting aspects unique to
this principle. Notice that the cost in equation 3–1 is a function of just two parameters:
the weighting parameter β and the inherently hidden scale parameter σ. The
Figure 3-2. An illustration of the structures revealed by the Principle of Relevant Entropy for the crescent shaped dataset for σ² = 0.05. A) β = 0, single point. B) β = 1, modes. C) β = 2, principal curve. D) β = 2.8. E) β = 3.5. F) β = 5.5. G) β = 13.1. H) β = 20. I) β → ∞, the data. As the value of β increases we pass through the modes and principal curves, and in the extreme case of β → ∞ we get back the data itself as the solution.
parameter β controls the “task” to be achieved (like finding the modes or the principal
curves of the data) whereas σ controls the resolution of our analysis. By including an
internal user defined parameter β, we allow the learning machine to influence the level
of abstraction to be extracted from the data. To the best of our knowledge, this is the
first attempt to build a framework with the inclusion of goals at the data analysis level.
A natural consequence of this is a rich and hierarchical structure unfolding the intricacies
between all levels of information in the data. In Chapters 4 to 6, we show different
applications of our principle like clustering, manifold learning and compression which now
fall under a common purview.
Before we end this chapter, we would like to highlight two important details. The
different structures shown in Figure 3-2 were obtained for kernel size of σ2 = 0.05. For
a different kernel size, we get a new set of structures relevant to that particular scale.
This rich representation of the data is a natural consequence of framing our cost function
with information theoretic principles. The other aspect corresponds to the redundancy
achieved. Throughout this discussion we have kept N = M. To reduce the number of points
needed to represent the data, we can follow with a post-processing stage where
merged points are replaced with a single point. This would finally give us N ≤ M. For
example, to capture the modes of the data we can replace M points which have merged
to their respective modes with a single point at each mode. The special case of equality
occurs when β → ∞. In this case, we sacrifice redundancy and store all the samples to
preserve maximal information about the data.
CHAPTER 4
APPLICATION I: CLUSTERING
4.1 Introduction
The first application we would like to discuss is the popular facet of unsupervised
learning called Clustering. In this learning scenario, we are presented with a set of data
samples and the task is to dynamically extract groups (or patterns) in the set such that
samples within a group are more similar to each other than samples of different groups.
One therefore needs to address the question of what similarity means before tackling this
issue. The traditional view of similarity is to associate it with some distance metric [Duda
et al., 2000; Watanabe, 1985]. These two quantities generally share an inverse relationship.
So, if we denote an arbitrary similarity measure by S and the corresponding distance
metric by D, then
S ×D = constant.
In essence, the smaller the distance between two points the larger is their similarity and
vice versa. Examples of distance metrics include the Euclidean distance, the Mahalanobis
distance, and others.
This approach is popular since a distance metric is well defined and quantifiable. By
defining a metric for a feature space, one indirectly defines an appropriate similarity
measure, thus bypassing the ill-posed problem of similarity itself. Nevertheless, one needs
to pick a distance metric, which can be seen as part of the a priori information. For example,
the K-means clustering technique uses a Euclidean distance metric. This leads to the a priori
assumption that the clusters have spherical Gaussian distributions. If this is really
true, then the clustering solution obtained is good. But most of the time such a priori
information is not available and is used as a convenience to ease the difficulty of the
problem. The imposition of such external models on the data leads to biased information
extraction, thus hampering the learning process. For example, most neural spike data
we have worked with do not satisfy this assumption at all, but much medical work
uses the K-means technique for spike sorting, leading to poor clustering solutions and wrong
inferences [Rao et al., 2006b].
Recent trends in clustering have tried to make use of information from higher order
moments or to deal directly with the probability density function (pdf) of the data. One
such technique, which has lately become quite popular in the image processing community,
is the family of mean shift algorithms. The idea is very simple. Since clusters correspond to
high density regions in the dataset, a good pdf estimate would show these clusters
demarcated by peaks (or modes) of the pdf. We could, for example, use the Parzen window
technique to non-parametrically estimate the pdf, thus avoiding any a priori model
assumption. Using a mode finding procedure, we could then move the points in the
direction of the normalized density gradient. By correlating the points with the modes to
which they converge, one can in essence achieve a very good clustering solution. The fixed
point non-parametric iterative scheme makes this method particularly attractive, with only
a single parameter (the kernel size) to estimate. An added benefit is automatic inference of the
number of clusters K, which in many other algorithms, like K-means and spectral clustering,
needs to be specified a priori.
We show in this chapter that these mean shift algorithms are essentially special cases
of our novel principle, corresponding to β = 1 and β = 0. The independent derivation of
these algorithms gives a new perspective from an optimization point of view, highlighting
their differences and, in general, their connection to a broader framework. Before we delve
into this, we briefly review the existing literature on mean shift algorithms.
4.2 Review of Mean Shift Algorithms
Let us consider a sample set $S = \{s_1, s_2, \ldots, s_M\} \subset \mathbb{R}^d$. The non-parametric pdf
estimate of this dataset using the Parzen window technique is given by
\[ p_S(x) = \frac{1}{M} \sum_{j=1}^{M} G_\sigma(x - s_j), \]
where $G_\sigma(t) = e^{-\|t\|^2 / 2\sigma^2}$ is a Gaussian kernel with bandwidth σ > 0. In order to find the
modes of the pdf, we solve the stationary point equation $\nabla_x p_S(x) = 0$, for which the modes
are the fixed points. This gives us the iterative scheme shown below:
\[ x_k^{(\tau+1)} = T\!\left(x_k^{(\tau)}\right) = \frac{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right) s_j}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - s_j\right)}. \tag{4–1} \]
The expression T(x) is the sample mean of all the samples $s_j$ weighted by the kernel
centered at x. Thus, the term T(x) − x was coined the "mean shift" in the landmark paper
by Fukunaga and Hostetler [1975]. In this process, two different datasets are maintained,
namely X and S. The dataset X is initialized to S as $X^{(0)} = S$. At every
iteration, a new dataset $X^{(\tau+1)}$ is produced by comparing the present dataset $X^{(\tau)}$ with
S. Throughout this process S is fixed and kept constant. Due to the use of the Gaussian
kernel, this algorithm is called Gaussian Mean Shift (GMS).
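The Parzen estimate above can be sketched as follows (our own transcription; note that the kernel is left unnormalized, as in the text, so $p_S$ is proportional to a density rather than integrating to one):

```python
import numpy as np

def parzen_pdf(x, S, sigma):
    """Parzen window estimate p_S(x) = (1/M) * sum_j G_sigma(x - s_j),
    using the unnormalized Gaussian kernel G_sigma(t) = exp(-|t|^2 / 2 sigma^2)."""
    d2 = ((x - S) ** 2).sum(axis=1)          # |x - s_j|^2 for all j
    return np.exp(-d2 / (2 * sigma ** 2)).mean()
```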
Unfortunately, the original paper by Fukunaga and Hostetler used a modified version
of 4–1, shown below:
\[ x_k^{(\tau+1)} = T\!\left(x_k^{(\tau)}\right) = \frac{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right) x_j^{(\tau)}}{\sum_{j=1}^{M} G_\sigma\!\left(x_k^{(\tau)} - x_j^{(\tau)}\right)}. \tag{4–2} \]
In this iteration scheme, one compares the samples of the dataset $X^{(\tau)}$ with $X^{(\tau)}$ itself.
Starting with the initialization $X^{(0)} = S$ and using 4–2, we successively "blur" the
dataset S to produce datasets $X^{(1)}, X^{(2)}, \ldots, X^{(\tau+1)}$. As each new dataset is produced, we
forget the previous one, giving rise to a blurring process which is inherently unstable.
It was Cheng [1995] who first pointed this out and named the fixed point update 4–2
Gaussian Blurring Mean Shift (GBMS).
Recent advancements in mean shift have made these algorithms quite popular in the
image processing literature. In particular, the mean shift vector of GMS has been shown
to always point in the direction of the normalized density gradient [Cheng, 1995]. Since points
lying in low density regions have a small value of $p_S(x)$, the normalized gradient at these
points has a large value. This helps the samples to quickly move from low density regions
toward the modes. On the other hand, due to the relatively high value of $p_S(x)$ near a
mode, the steps are highly refined around this region. This adaptive step size
gives GMS a significant advantage over traditional gradient based algorithms, where step
size selection is a well known problem.
A rigorous proof of stability and convergence of GMS was given in [Comaniciu and
Meer, 2002], where the authors proved that the sequence generated for each sample $x_k$
by 4–1 is a Cauchy sequence that converges due to the monotonically increasing sequence of
the pdf values estimated at these points. Further, the trajectory is always smooth in the sense
that the angle between consecutive mean shift vectors always lies in $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$. M. A.
Carreira-Perpinan [2007] also showed that GMS is an Expectation-Maximization (EM)
algorithm and thus has a linear convergence rate.
Due to these interesting and useful properties, GMS has been successfully applied
in low level vision tasks like image segmentation and discontinuity preserving smoothing
[Comaniciu and Meer, 2002] as well as in high level vision tasks like appearance based
clustering [Ramanan and Forsyth, 2003] and real-time tracking of non-rigid objects
[Comaniciu et al., 2000]. Carreira-Perpinan [2000] used mean shift for mode finding in
mixtures of Gaussian distributions. The connection to the Nadaraya-Watson estimator from
kernel regression and to the robust M-estimators of location has been thoroughly explored by
Comaniciu and Meer [2002]. With just a single parameter to control the scale of analysis,
this simple non-parametric iterative procedure has become particularly attractive and
suitable for a wide range of applications.
Compared to GMS, the understanding of the GBMS algorithm has remained relatively poor
since the concept first appeared in [Fukunaga and Hostetler, 1975]. Apart from the
preliminary work done in [Cheng, 1995], the only other notable contribution of which we
are aware was made recently by Carreira-Perpinan. In his paper [Carreira-Perpinan,
2006], the author showed that GBMS has a cubic convergence rate and, to overcome its
instability, developed a new stopping criterion. By removing the redundancy among points
which have already merged, an accelerated GBMS was developed which was 2×–4× faster¹.
4.3 Contribution of this Thesis
In spite of these achievements of mean shift techniques, there are still some
fundamental questions unanswered. It is not clear, for example, what changes are incurred
when going from GBMS to GMS and vice versa. Further, the implications and instability
of GBMS are least understood. The justification to date has been ad hoc and heuristic.
Though Cheng [1995] provided a comprehensive comparison, the analysis is quite complex.
Fashing and Tomasi [2005] showed that mean shift is a quadratic bound maximization, but the
analysis is indirect and the scope limited. On top of this, the derivation of GBMS by
replacing the original samples of S with $X^{(\tau)}$ itself is quite heuristic.
The right step to tackle this issue is to pose the question: "What do mean shift
algorithms optimize?" Answering it would elucidate the cost function and hence the
optimization surface itself. Rigorous theoretical analysis is then possible to account
for their inherent properties. This, we claim, is our major contribution in this area. Notice
that equation 4–1 is the same as 3–5. The mode finding ability of our principle for β = 1
actually corresponds to the GMS algorithm. Thus, GMS optimizes Renyi's cross entropy
between the datasets X and S. From Theorem 3, it is then clear that the iteration procedure
4–1 leads to a stable algorithm. The data samples move in the direction of the normalized
gradient, converging to their respective modes.
On the other hand, equation 4–2 exactly corresponds to 3–3. This iteration scheme
is the result of directly minimizing Renyi's quadratic entropy of the data X. Since this
happens for β = 0, GBMS is a special case of our general principle. From an optimization
point of view, then, GBMS with initialization $X^{(0)} = S$ leads to rapid collapse of the data
to a single point, as proved in Theorem 2. GBMS therefore does not truly possess the mode
finding ability to track the modes of the original dataset S. For the purpose of
finding high concentration regions in S and dynamically finding clusters, this algorithm
is inherently unstable and should not be used.

¹ Note that this can also be done for the GMS algorithm.
Some authors have suggested GBMS as a fast alternative to the GMS algorithm if it
can be stopped appropriately. The inherent assumption here is that the clusters are well
separated in the given data. This would lead to two phases in the GBMS optimization. In the
first phase, the points would rapidly collapse to the modes of their respective clusters, and
in the second phase, the modes would slowly move towards each other to ultimately yield a single
point. By stopping the algorithm after the first phase, one could ideally get the clustering
solution. Further, since GBMS has an approximately 3× faster convergence rate than GMS,
this approach can be seen as a fast alternative to GMS. Unfortunately, this assumption
of well demarcated clusters does not hold in practical applications, especially in areas like
image segmentation. Further, since modes are not the fixed points of GBMS, any stopping
criterion would be at most heuristic. We demonstrate this rigorously for many problems in
the following sections of this chapter.
By clarifying these differences through an optimization route, we bring a new
perspective to these algorithms [Rao et al., 2008]. It is important to note that our
derivation is completely independent of the earlier efforts. Further, it is only in our
case that the patterns extracted from the data can be seen as structures of a broader
unsupervised learning framework, with connections to other facets like vector
quantization and manifold learning. By elucidating their respective strengths and
weaknesses from an information theoretic point of view, we greatly simplify the
understanding of these algorithms.
4.4 Stopping Criteria
Before we start comparing the performance of the mean shift algorithms, we need to
devise a stopping criterion for them. Because these techniques are iterative procedures, they
would continue running unnecessarily even after convergence. To avoid this redundancy,
we look into a simple and practical way of stopping the fixed point iteration.
4.4.1 GMS Algorithm
Stopping the GMS algorithm to find the modes is very simple. Since samples move in
the direction of the normalized gradient toward the modes, which are fixed points of 4–1, the
average distance moved by the samples becomes smaller over subsequent iterations. By setting
a tol level on this quantity to a low value, we can find the modes as well as stop GMS from
running unnecessarily. This is summarized in 4–3:
\[ \text{Stop when } \frac{1}{N} \sum_{i=1}^{N} d^{(\tau)}(x_i) < tol, \quad \text{where } d^{(\tau)}(x_i) = \left\| x_i^{(\tau)} - x_i^{(\tau-1)} \right\|. \tag{4–3} \]
Another version of the stopping criterion, which we found more useful in image segmentation
applications, is to stop when the maximum distance moved among all the particles, rather
than the average distance, is less than some tol level; that is,
\[ \text{Stop when } \max_i \left\| x_i^{(\tau)} - x_i^{(\tau-1)} \right\| < tol. \tag{4–4} \]
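Both criteria can be sketched in one helper (our own, hypothetical function; the `use_max` switch selects 4–4 instead of 4–3):

```python
import numpy as np

def gms_converged(X_new, X_old, tol=1e-6, use_max=False):
    """Stopping criteria 4-3 (average shift distance) and 4-4 (maximum
    shift distance) for the fixed-point iterations."""
    d = np.linalg.norm(X_new - X_old, axis=1)        # d(x_i) for every sample
    return (d.max() if use_max else d.mean()) < tol
```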
4.4.2 GBMS Algorithm
As stated earlier, modes are not fixed points of the GBMS update equation,
and hence GBMS cannot be used to find them. But assume that the modes are far apart
compared to the kernel size. In such cases, there generally seem to be two distinct phases of
convergence. In the first phase, the points quickly collapse to their respective modes while
the modes move very slowly towards each other. In the second phase, the modes start
merging and ultimately yield a single point. If the algorithm can be stopped after the first
phase, then it could be used in applications like clustering, where the exact position of the
modes is not important, although any such stopping criterion would be at most heuristic.
Of course, the stopping criterion 4–3 cannot be used unless we carefully hand-pick the tol
level, since the average distance moved by the particles never settles down until all of them
have merged.
The assumption of two distinct phases was effectively used to formulate a stopping
criterion by Carreira-Perpinan [2006]. In phase 2, $d^{(\tau)} = \{d^{(\tau)}(x_i)\}_{i=1}^{N}$ takes on at most
K different values (for K modes). Binning $d^{(\tau)}$ using a large number of bins gives a
histogram with K or fewer non-empty bins. Since the entropy does not depend on the exact
location of the bins, its value stops changing and can be used to stop the algorithm, as
shown in 4–5:
\[ \left| H_s\!\left(d^{(\tau+1)}\right) - H_s\!\left(d^{(\tau)}\right) \right| < 10^{-8}, \tag{4–5} \]
where $H_s(d) = -\sum_{i=1}^{B} f_i \log f_i$ is the Shannon entropy, $f_i$ is the relative frequency of
bin i, and the bins span the interval $[0, \max(d)]$. The number of bins was selected as
B = 0.9N.
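A sketch of this histogram-entropy quantity (our own transcription; the bin count and range follow the description above):

```python
import numpy as np

def shift_entropy(d):
    """Shannon entropy of the histogram of per-sample shift distances d,
    binned over [0, max(d)] with B = 0.9N bins, as in criterion 4-5."""
    B = max(1, int(0.9 * len(d)))
    counts, _ = np.histogram(d, bins=B, range=(0.0, d.max() + 1e-12))
    f = counts[counts > 0] / counts.sum()    # relative frequencies, non-empty bins
    return -(f * np.log(f)).sum()
```

GBMS would then be stopped once the change in this entropy between iterations falls below 10⁻⁸.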
It is clear that there is no guarantee that we would find all the modes using this
rule. Further, the assumption used in developing this criterion does not hold true in many
practical scenarios as will be shown in our experiments.
4.5 Mode Finding Ability
Here, we study the mode finding ability of the two algorithms. We use a systematic
approach, generating a mixture-of-Gaussians dataset with known modes. We select the
kernel size σ such that the modes of the estimated pdf (using the Parzen
window technique) are as close as possible to the original modes. We then use GMS and
GBMS to iteratively track these modes and compare their performance.
The dataset in Figure 4-1 consists of a mixture of 16 Gaussians with centers spread
uniformly around a circle of unit radius. Each Gaussian density has a spherical covariance
of $\sigma_g^2 I = 0.01 I$. To create a more realistic scenario, different a priori probabilities were
selected, as shown in Figure 4-1B. Using this mixture model, 1500 iid data points
were generated. We selected the scale of analysis σ² = 0.01 such that the estimated modes
are very close to the modes of the Gaussian mixture. Note that since the dataset is a
Figure 4-1. Ring of 16 Gaussians dataset with different a priori probabilities. The numbering of the clusters is in the anticlockwise direction starting with the center (1, 0). A) R16Ga dataset. B) a priori probabilities. C) probability density function estimated using σ² = 0.01.
mixture of 16 Gaussians, each with variance $\sigma_g^2 = 0.01$, spread across the unit circle,
the overall variance of the data is much larger than 0.01. Thus, by using a kernel size of
σ² = 0.01 for the Parzen estimate of the pdf, we ensure that the Parzen kernel size is smaller
than the actual scale of the data. Figure 4-1C shows a 3-D view of this estimated
pdf. Note the unequal peaks due to the different proportions of points in each cluster.
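The experimental setup can be sketched as follows; the Dirichlet draw for the unequal priors is our own illustrative choice (the dissertation's exact a priori probabilities are those shown in Figure 4-1B):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, var_g = 16, 1500, 0.01
angles = 2 * np.pi * np.arange(K) / K               # anticlockwise from (1, 0)
centers = np.stack([np.cos(angles), np.sin(angles)], axis=1)
priors = rng.dirichlet(5.0 * np.ones(K))            # unequal a priori probabilities
labels = rng.choice(K, size=M, p=priors)
S = centers[labels] + rng.normal(scale=np.sqrt(var_g), size=(M, 2))
```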
Figure 4-2 shows the mode finding ability of the two algorithms. To compare with
the ground truth, we also plot the 2σ_g contour lines and the actual centers of the Gaussian
mixture. With the tol level in 4–3 set to 10⁻⁶, the GMS algorithm stops at the 46th iteration,
giving almost perfect results. On the other hand, using stopping criterion 4–5, GBMS stops
at the 20th iteration, having already missed 4 modes (shown with arrows). We would also
like to point out that this is the best result achievable by the GBMS algorithm even if we
had used stopping criterion 4–3 and selectively hand-picked the best tol value.
Figure 4-3 shows the cost functions which these algorithms minimize over 70 iterations.
Notice how the cost function H(X) of GBMS continuously drops as the modes merge. This
goes on until H(X) becomes zero, when all the samples have merged to a single point. For
GMS, on the other hand, H(X; S) decreases and settles down smoothly at its fixed points,
which are the modes of $p_S(x)$. Thus, a more intuitive stopping criterion for GMS, which
originates directly from its cost function, is to stop when
Figure 4-2. Modes of the R16Ga dataset found using the GMS and GBMS algorithms. A) Good mode finding ability of the GMS algorithm. B) Poor mode finding ability of the GBMS algorithm.
the absolute difference between subsequent values of H(X; S) becomes smaller than some
tol level, as summarized in 4–6:
\[ \left| H\!\left(X^{(\tau+1)}; S\right) - H\!\left(X^{(\tau)}; S\right) \right| < 10^{-10}. \tag{4–6} \]
These are some of the unforeseen advantages of knowing exactly what we are optimizing.
Another interesting result emerges with this new understanding. Notice that even
though GBMS does not minimize Renyi's cross entropy H(X; S) directly, we can always
measure this quantity between $X^{(\tau)}$ and the original dataset S at every iteration τ. If
the assumption of two distinct and well separated phases in GBMS holds true, then the
samples will quickly collapse to the actual modes of the pdf before they start slowly
moving toward each other. Since we start with the initialization X = S, H(X; S) will reach
a local minimum at this point before it starts increasing again due to the merging of the
GBMS modes (which moves them away from the actual modes of the pdf). By stopping
GBMS at this minimum, we could devise an effective stopping criterion giving the same
result as GMS with fewer iterations.
Figure 4-3. Cost functions of the GMS and GBMS algorithms: H(X) for GBMS and H(X; S) for GMS.
Unfortunately, we found that this works only when the modes (or clusters) are
very well separated compared to the kernel size (so that the assumption holds true).
For example, Figure 4-4 shows H(X; S) computed for GBMS on the R16Ga dataset. The
minimum is reached at the 7th iteration. Therefore, this stopping criterion would have
prematurely stopped the GBMS algorithm, giving very poor results. It is clear that GBMS
is not a good mode finding algorithm.
These results shed new light on our understanding of these two algorithms. Mode
finding can be used as a means to cluster data into different groups. We will see next
the performance of these algorithms in clustering, where their respective properties greatly
affect the outcome of the applications.
4.6 Clustering Example
We generated 10 Gaussian clusters with centers spread uniformly in the unit square. The
Gaussian clusters have random spherical covariance matrices, with 50 iid samples each.
Figure 4-5 shows the dataset with the true labeling as well as the 2σ_g contour plots.
Although different kernel sizes should be used for density estimation of the different
clusters, for simplicity and to express our idea clearly, we use a common Parzen kernel
Figure 4-4. Renyi's "cross" entropy H(X; S) computed for the GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases, since the assumption of two distinct phases of convergence does not hold true in general.
Figure 4-5. A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σ_g contour plots. C) probability density estimated using σ² = 0.01.
size for pdf estimation. Using Silverman’s rule of thumb [Silverman, 1986], we estimated
the kernel size as σ2 = 0.01. A cross check with the pdf plot validated the efficacy of this
estimate. The plot is shown in Figure 4-5C. Note that all the clusters are well identified
for this particular kernel size. By correlating the points with their respective modes we
wish to segment this dataset into meaningful clusters.
With the tol level set at 10−6, the GMS algorithm converges at the 41st iteration.
The segmentation result is shown in Figure 4-6A. Clearly GMS performs very well in
clustering the dataset into meaningful clusters. There are a total of 20 misclassifications
(out of 500 points), which arise mostly from the cluster with the largest spherical
covariance matrix. Notice that this cluster is under-represented with just 50 points.
Further, due to the overlap of the 2σg contour of this cluster with the neighboring
cluster, as shown in Figure 4-5B, these misclassifications are bound to occur. Another
interesting mistake occurs at the top right corner, where 4 points belonging to one cluster
are misclassified as part of another highly concentrated cluster. These points lie in
the narrow valley bordering the two clusters, and unfortunately their gradient directions
point toward the incorrect mode. It should be appreciated that even for this complex
dataset with varying shapes of Gaussian clusters, GMS with the simplest solution of a
single kernel size gives such a good result.
On the other hand, using stopping criterion 4–5, GBMS stops at the 18th iteration with
the output shown in Figure 4-6B. Notice the poor segmentation result as a consequence
of multiple modes merging. It should be kept in mind that by choosing the kernel size
σ² = 0.01, we have selected the similarity measure for clustering and are looking for
spherical Gaussians with variance around this value. In this regard, the result of GBMS is
incoherent. The segmentation result obtained with GMS, in contrast, is much more
homogeneous and consistent with our similarity measure. Further, it is only in the case of
the GMS algorithm that the modes estimated from the pdf directly translate into clusters.
For the GBMS algorithm, it is not clear how the modes in Figure 4-5C correlate
with the clustering solution obtained in Figure 4-6B.
Figure 4-7 shows the average change in sample position for both algorithms.
Notice the peaks in the GBMS curve corresponding to modes merging. This is a classic
example where the assumption of two distinct phases in GBMS becomes fuzzy. By the 5th
iteration, two of the modes have already merged, and by the 18th iteration a total of 5 modes
are lost, giving rise to a poor segmentation result. In the case of GMS, on the other hand, the
Figure 4-6. Segmentation results on the RGC dataset. A) GMS algorithm. B) GBMS algorithm.
averaged norm distance steadily decreases, and by selecting a sufficiently low tol level, we
are always assured of a good segmentation result.
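The two updates and the averaged-norm-distance stopping rule can be sketched as follows. This is a minimal NumPy sketch, not the dissertation's own code; the function names and the simultaneous-update detail in the blurring step are our assumptions:

```python
import numpy as np

def _mean_shift_step(X, S, sigma2):
    # Move every point of X to the Gaussian-weighted mean of the samples S.
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma2))
    return (W / W.sum(axis=1, keepdims=True)) @ S

def gms(S, sigma2, tol=1e-6, max_iter=500):
    """GMS: points climb the *fixed* Parzen pdf of the original data S."""
    X = S.copy()
    for _ in range(max_iter):
        X_new = _mean_shift_step(X, S, sigma2)
        # stopping rule: averaged norm distance moved per iteration < tol
        if np.linalg.norm(X_new - X, axis=1).mean() < tol:
            return X_new
        X = X_new
    return X

def gbms_step(X, sigma2):
    """One GBMS (blurring) step: the dataset itself is replaced, so the
    estimated density changes at every iteration."""
    return _mean_shift_step(X, X, sigma2)
```

Points returned by `gms` can then be grouped by the modes they converge to in order to obtain cluster labels, which is the mode-to-cluster correspondence used in this section.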
Figure 4-7. Averaged norm distance moved by particles in each iteration.
4.7 Image Segmentation
In this section, we extend our claim to real applications such as image segmentation,
where these techniques have been widely used. In the first part of this section, we compare
the performance of the GMS and GBMS algorithms on the classic baseball game
image [Shi and Malik, 2000]. In the second part, we compare the GMS algorithm with the
current state of the art in the segmentation literature, spectral clustering, on a wide range of
popular images.
4.7.1 GMS vs GBMS Algorithm
We highlight the differences between GMS and GBMS by applying them to a real
dataset. For this purpose, we use the famous baseball game image from the normalized cuts
paper by Shi and Malik [2000], shown in Figure 4-8. For computational reasons, the image
has been reduced to 110 × 73 pixels. This gray level image is transformed to a 3-dimensional
feature space consisting of two spatial features, namely the x and y coordinates of the
pixels, and a range feature, which is the intensity value at that location. Thus, the
dataset consists of 8030 points in the feature space. In order to use an isotropic kernel, we
scale the intensity values such that they fall in the same range as the spatial features, as
done in [M. A. Carreira-Perpinan, 2007]. All the values reported are in pixel units. Since
this is an image segmentation application, we use stopping criterion 4–4 for GMS. Further,
we set the tol level to 10−3 for both algorithms throughout this experiment.
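The construction of this feature space can be sketched as follows. The exact intensity scaling used in the experiment is not specified, so rescaling to the spatial range is an assumption here:

```python
import numpy as np

def image_to_features(img):
    """Map a gray image (H x W array) to 3-D points (x, y, scaled intensity).
    Intensity is rescaled to the spatial range so that a single isotropic
    kernel is reasonable; the exact scale factor is an assumption."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    inten = img.astype(float)
    span = inten.max() - inten.min()
    if span > 0:
        inten = (inten - inten.min()) / span * max(H, W)  # match spatial range
    return np.column_stack([xs.ravel(), ys.ravel(), inten.ravel()])

# toy 3 x 4 "image": yields 12 feature points in R^3
F = image_to_features(np.arange(12, dtype=float).reshape(3, 4))
```

For the 110 × 73 baseball image this produces the 8030 feature points mentioned above.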
Figure 4-8. Baseball image.
We performed an elaborate multiscale analysis experiment where the kernel size
σ was varied from a small value to a large value in steps of 0.5. We selected the best
segmentation result of each algorithm for a given number of segments. The
results are shown in Figure 4-9. The top row shows the segmentation results for 8 clusters.
Since the clusters are well separated at the respective kernel sizes, both GMS and GBMS
give very similar results. The interesting development occurs when we try to achieve
fewer than 8 segments. Note that, for this image, the best number of segments is 5 to 6,
as seen in the image itself. Many researchers have tried to achieve this using various
methods [M. A. Carreira-Perpinan, 2007; Shi and Malik, 2000].
Figures 4-9C and 4-9D show the GMS and GBMS results for 6 segments. Note the
poor performance of GBMS. Instead of grouping similar objects into one segment, GBMS
splits and merges them into two different clusters. The disc segment in the image was split
in two, with one part merging with the player and the other with the bottom background.
This is counterintuitive given that two of the coordinates of the feature space
are spatial coordinates of the image. On the other hand, GMS clearly gives a very good
segmentation result, with each segment corresponding to an object in the image. Further,
a nice, consistent hierarchical structure is seen in GMS. As we reduce the number of
clusters, GMS merges clusters of the same intensity that are close to each other before
merging similar-intensity clusters that are far apart. This is what we would expect for
this feature space. The result is a beautiful pattern in the image space where whole
objects that are similar are merged together in an intuitive manner. This phenomenon is
again observed as we move from 6 segments to 4, where GMS puts all the gray objects in
one cluster, thus putting three full objects of similar intensity in one group.
Thus, starting from results for 8 segments that were very similar to each other,
GMS and GBMS tread very different paths for lower numbers of segments. GMS neatly
clusters objects in the image into different segments and hence comes very close to a human
segmentation result. The different paths followed by the two algorithms result in
completely different 2-level image segmentations, as shown in Figure 4-9.
4.7.2 GMS vs Spectral Clustering Algorithm
The aim here is to compare the performance of the GMS algorithm with spectral
clustering, which is considered the state of the art in image segmentation, and in turn
Figure 4-9. Baseball image segmentation using the GMS and GBMS algorithms. The left column shows GMS results for various numbers of segments and the kernel size at which each was achieved: A) segments=8, σ = 11; C) segments=6, σ = 13; E) segments=4, σ = 18; G) segments=2, σ = 28.5. The right column shows the corresponding GBMS results: B) segments=8, σ = 10; D) segments=6, σ = 11.5; F) segments=4, σ = 13; H) segments=2, σ = 18.
validate the efficacy of the mean shift technique. There exist many variations of spectral
clustering; the one we found most convenient and robust is the method
proposed by Ng et al. [2002]. For completeness, we briefly summarize this method below.
Given a set of points S = {s1, s2, . . . , sM} in R^d that we want to segment into K
clusters:
1. Form an affinity matrix A ∈ R^{M×M} defined by Aij = exp(−||si − sj||²/2σ²) if i ≠ j
and Aii = 0.
2. Define D to be the diagonal matrix whose (i, i) element is the sum of A's i-th row,
and construct the matrix L = D^{−1/2} A D^{−1/2}.
3. Find e1, e2, . . . , eK, the K largest eigenvectors of L (chosen to be orthogonal to each
other in the case of repeated eigenvalues), and form the matrix E = [e1, e2, . . . , eK] ∈ R^{M×K} by stacking the eigenvectors in columns.
4. Form the matrix Y from E by normalizing each of E's rows to have unit length (i.e.,
Yij = Eij/(Σj Eij²)^{1/2}).
5. Treating each row of Y as a point in R^K, cluster the rows into K clusters via the
K-means algorithm.
6. Finally, assign the original point si to cluster j if and only if row i of the matrix Y
was assigned to cluster j.
The idea here is that if the clusters are well defined, then the affinity matrix has a clear
block structure. The dominant eigenvectors then correspond to the individual blocks,
with 1's in the rows belonging to the block and zeros elsewhere. The
Y matrix then projects the clusters in orthogonal directions, which can
easily be clustered using a simple algorithm like K-means. To eliminate the effects of
K-means initialization, we run K-means with 10 different initializations and select
the result with the minimum distortion (the mean square error in the case
of K-means). Note that the number of clusters K must be given a priori in this
algorithm.
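The six steps above can be sketched directly. This is a minimal NumPy sketch; the K-means step is a bare-bones re-implementation rather than a library call, and the variable names are ours:

```python
import numpy as np

def spectral_clustering(S, K, sigma2, n_init=10, seed=0):
    """Ng-Jordan-Weiss spectral clustering, following steps 1-6 above."""
    M = len(S)
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2.0 * sigma2))          # step 1: affinity matrix
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))           # step 2: D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)               # eigenvalues come in ascending order
    E = vecs[:, -K:]                          # step 3: K largest eigenvectors
    Y = E / np.linalg.norm(E, axis=1, keepdims=True)  # step 4: unit-length rows
    rng = np.random.default_rng(seed)
    best_lab, best_cost = None, np.inf
    for _ in range(n_init):                   # step 5: K-means, keep lowest distortion
        C = Y[rng.choice(M, K, replace=False)]
        for _ in range(100):
            lab = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
            C = np.array([Y[lab == k].mean(0) if (lab == k).any() else C[k]
                          for k in range(K)])
        cost = ((Y - C[lab]) ** 2).sum()
        if cost < best_cost:
            best_lab, best_cost = lab, cost
    return best_lab                           # step 6: label of each s_i
```

Running K-means from several initializations and keeping the minimum-distortion result mirrors the 10-restart procedure described above.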
We start with the performance of spectral clustering on the RGC dataset. The
number of clusters K was set to 10 and the kernel size σ² = 0.01 was set using Silverman's
rule of thumb. The result is shown in Figure 4-10. Although both methods perform very
well, spectral clustering gives a better result with just 5 misclassifications. In the case of
GMS, the number of misclassifications is 20, mainly arising in low valley areas of the pdf
estimate, which are generally tricky regions. It seems that by projecting the clusters onto
their dominant eigenvectors, spectral clustering is able to enhance the separation between
different clusters, leading to a better result. Nevertheless, it should not be forgotten that
the number of clusters was automatically detected by the GMS algorithm, unlike in spectral
clustering where it had to be specified a priori.
Figure 4-10. Performance comparison of the GMS and spectral clustering algorithms on the RGC dataset. A) Clustering solution obtained using the GMS algorithm. B) Segmentation obtained using the spectral clustering algorithm.
Selecting the number of segments in images a priori is a difficult task. Instead, we
first gain an understanding of the problem through a multiscale analysis using the GMS
algorithm. Since GMS naturally reveals the number of clusters at a particular scale, this
analysis helps us ascertain the kernel size and the number of segments K to
compare with spectral clustering. In our experience, a good segmentation result is stable
over a broad range of kernel sizes.
Figure 4-11. Bald Eagle image.
Figure 4-11 shows a full resolution image of a bald eagle. We downsampled this image
to 48 × 72 using bicubic interpolation. This still preserves the segments
very well, with the benefit of reducing the computation time. This step is important since
both algorithms have O(N²) complexity. In particular, the affinity matrix construction
in spectral clustering becomes difficult to manage for more than 5000 data points due to
memory issues. In order to get meaningful segments in images close to human accuracy,
perceived color differences should correspond to Euclidean distances between pixels in the
feature space. One feature space that best approximates perceptually uniform color
spaces is the L*u*v space [Connolly, 1996; Wyszecki and Stiles, 1982]. Further, we added
the x and y coordinates to this feature space to take into account the spatial correlation.
Thus, we have 3456 data points spread across a 5-dimensional feature space.
The results of the multiscale analysis are shown in Figure 4-12. Before
plotting these segments, we performed a post-processing operation where segments with
fewer than 10 data points were merged into the closest cluster. This is needed to eliminate
spurious clusters arising from isolated points or outliers. In the case of the bald eagle
image, only the segmentation result for kernel size σ² = 9 had such spurious clusters.
This is clearly a result of the low kernel size selection. As we increase the kernel size, the
number of segments is drastically reduced, reaching 7 segments for both σ² = 25
and σ² = 36. Note the sharp and clear segments obtained. For example, the eagle itself
with its beak, head and torso is very well represented. One can also appreciate the nice
hierarchical structure in this analysis, where previously disjoint but close clusters merge to
form larger clusters in a very systematic way. For σ² = 49, we obtain 5 segments, which
is the broadest segmentation of this image. Beyond this we start losing important
segments, as shown for σ² = 64 where the beak of the eagle is lost.
Since both the 5- and 7-segment results look very appealing in the previous analysis, we
show the comparison with spectral clustering for both. Figure 4-13
shows these results for σ² = 25 and σ² = 49, respectively. Spectral clustering performs
extremely well, with clear, well defined segments. Two things need special mention. First,
the segment boundaries are much sharper in spectral clustering than in the GMS results.
This is understandable since in GMS each data point is moved toward its mode, and
in boundary regions there may be a clear ambiguity as to which peak to climb. This
localized phenomenon leads to a pixelization effect at the boundary. Second, it is surprising
how well the spectral clustering technique could depict the beak region. It should be
remembered that the image used is a low resolution image, as shown in Figure 4-12A. A
close observation shows that the region of intersection between the beak and the face of the
eagle has a color gradation. This is clearly depicted in the GMS segmentation result for
σ² = 16, as shown in Figure 4-12C. Probably, by using the dominant eigenvectors one can
concentrate on the main component and reject other information. This could also explain
the clear, crisp boundaries produced in Figure 4-13.
As a next step in this comparison, we selected the two popular images shown in
Figure 4-14 from the Berkeley image database [Martin et al., 2001], which has been
widely used as a standard benchmark. The most important aspect of this database is that
human segmentation results are available for all the images, which helps one compare
the performance of one's algorithm. Once again we performed a multiscale analysis of
these two images. Selecting the number of clusters is especially difficult in the case of the
tiger image, as can also be seen in the human segmentation results.
Figure 4-12. Multiscale analysis of the bald eagle image using the GMS algorithm. A) Low resolution 48 × 72 image. B) σ² = 9, segments=20. C) σ² = 16, segments=14. D) σ² = 25, segments=7. E) σ² = 36, segments=7. F) σ² = 49, segments=5. G) σ² = 64, segments=4.
Figure 4-13. Performance of spectral clustering on the bald eagle image. A) σ² = 25, segments=7. B) σ² = 49, segments=5.
Figure 4-14. Two popular images from the Berkeley image database. A) Flight image. B) Tiger image.
Our analysis, depicted in Figure 4-15, shows good performance of the GMS algorithm.
Both images are well segmented, especially the flight and tiger regions. Based
on this analysis, we selected 2 segments in the case of the flight image and 8
segments in the case of the tiger image. Using the same kernel size as in the GMS algorithm,
we initialized the spectral clustering method. Figure 4-16 shows the performance of this
algorithm. Again, spectral clustering gives a very good segmentation, especially for the flight
image. In the case of the GMS algorithm, a portion of the flight region is lost due to the large
kernel size and the oversmoothing effect on the pdf estimate. The results in the case of the
tiger image are pretty much the same, though spectral clustering seems slightly better.
The case of the flight image is especially interesting. Ideally, one would want to use
a smaller kernel size if the flight region is of interest. This is seen in the segmentation results
for GMS with σ² = 16 in Figure 4-15C. But such a small kernel size would lead to many
Figure 4-15. Multiscale analysis of the Berkeley images using the GMS algorithm. A) 48 × 72 flight image. B) 48 × 72 tiger image. C) σ² = 16, segments=15. D) σ² = 16, segments=14. E) σ² = 49, segments=4. F) σ² = 25, segments=10. G) σ² = 100, segments=2. H) σ² = 36, segments=8.
Figure 4-16. Performance of spectral clustering on the Berkeley images. A) σ² = 100, segments=2. B) σ² = 36, segments=8.
clusters. Since it is very clear from the beginning that we need 2 clusters for this image,
one can add a post-processing stage where close clusters are recursively merged until only
two clusters are left. Such a result is shown in Figure 4-17A for σ² = 16. Note the
improved and clear segmentation of the flight region. To be fair, we provide a comparison
with spectral clustering for the same parameters in Figure 4-17C. Clearly, both perform
very well, with a marked improvement in the performance of the GMS algorithm. A similar
comparison is shown for the tiger image with a priori selection of σ² = 16 and number of
segments K = 8.
For more complicated images like natural scenery, we need to go beyond a single
kernel size for all data points. One technique is to use a different kernel for each sample,
estimated using the K nearest neighbor (KNN) method. Interested readers are referred
to Appendix C. Other adaptive kernel size techniques have been proposed in the
literature, as in [Comaniciu, 2003], which have greatly improved the
performance of the GMS algorithm, giving results similar to spectral clustering.
4.8 Summary
Clustering is a statistical data analysis technique with importance in pattern
recognition, data mining and image analysis. In this chapter we have introduced the mean
shift algorithms, which have become increasingly popular in the image processing and vision
community for segmentation and tracking. With a new independent derivation
using information theoretic concepts, we have clearly elucidated the differences between
the two variations, namely the GMS and GBMS algorithms. Since both methods appear
as special cases of our novel unsupervised learning principle, we were not only able to show
the intrinsic connection between these techniques, but also the general relationship they
share with other unsupervised learning facets like principal curves and vector quantization.
With this new understanding, a number of interesting results follow. We have shown
that GBMS directly minimizes Renyi's quadratic entropy and hence is an unstable
mode finding algorithm. Since the modes are not fixed points of this cost function, any
stopping criterion is at most heuristic. On the other hand, its stable counterpart
GMS minimizes Renyi's "cross" entropy, giving the modes as stationary solutions.
Thus, a new stopping criterion is to stop when the change in the cost function is
small. We have corroborated this theoretical analysis with extensive experiments
showing the superior performance of GMS over the GBMS algorithm. Finally, we have also
compared the performance of GMS with the state of the art spectral clustering technique,
highlighting the differences between the two methods and their inherent advantages.
CHAPTER 5
APPLICATION II: DATA COMPRESSION
5.1 Introduction
A second important aspect of our unsupervised learning principle is vector quantization,
with applications in the compression of data and images. This field can be regarded as
the art of compressing the information in data into a few code vectors for efficient storage
and transmission [Gersho and Gray, 1991]. Inherent to every vector quantization
technique, then, is an associated distortion measure to quantify the degree of
compression. For example, in the simple Linde-Buzo-Gray (LBG) algorithm
[Linde et al., 1980] one recursively minimizes the sum of Euclidean distances between
the data points and their corresponding winner code vectors. This can be summarized as
shown below.
D_LBG(X, C) = (1/2) Σ_{i=1}^{N} ||x_i − c_{i*}||² ,   (5–1)

where {c_1, c_2, . . . , c_K} ∈ C are the K code vectors with K << N and i* = argmin_k ||x_i − c_k||.
This mean square distortion measure preserves the second order statistics of the data and
can be implemented using a simple learning rule where each code vector is updated as the
mean of all data points for which it is the winner neuron. After reaching a stable state, if the
K code vectors have a distortion above the target, each code vector is split in two and the
result is used as initialization for the next level of coding, until the required distortion is met.
In our context, an information theoretic criterion that can be seen as a distortion
measure is the Cauchy-Schwarz pdf divergence Dcs(pX ||pS). We have shown in Theorem 4
that this corresponds to the extreme case β → ∞ in our learning principle. Here, X can
be considered the code book and S the original dataset whose information needs to
be preserved. If X is initialized to S, then Corollary 4.1 shows that the fixed point update
rule gives us back the original data S as the solution. This is obvious, since the best
vector quantizer of a dataset without any constraint on the number of code vectors is the
data itself. Of course, this ideal scenario is not interesting if the goal is to compress the
data. In any vector quantization technique this is achieved by initializing X with far fewer
points than S. Then Dcs(pX ||pS) > 0, and by minimizing this quantity the algorithm returns
a codebook whose code vectors best model the data. Note that Dcs(pX ||pS) is a
measure between the pdfs of X and S, and thus by minimizing this quantity we ensure
that the pdf of the codebook X is as close as possible to that of S. By directly preserving the
pdf information, this technique goes beyond second order statistics and captures maximum
information about the data. In light of this, we call this technique Information Theoretic
Vector Quantization (ITVQ) [Rao et al., 2007a].
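As a reference point, the cost J(X) = Dcs(pX ||pS) can be evaluated from Gaussian information potentials. This is a sketch using the standard form Dcs = log V(X) + log V(S) − 2 log V(X; S); the doubling of the kernel variance under convolution of two Parzen kernels follows common information-theoretic-learning practice and is our assumption here, not a detail stated in the text:

```python
import numpy as np

def _ip(A, B, sigma2):
    # Cross information potential: average Gaussian kernel between two sets.
    # Convolving two Parzen kernels of variance sigma2 doubles the variance.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (4.0 * sigma2)).mean()

def cs_divergence(X, S, sigma2):
    """Cauchy-Schwarz divergence D_cs(p_X || p_S) between Parzen estimates,
    written in terms of the potentials V(X), V(S) and V(X; S)."""
    return (np.log(_ip(X, X, sigma2)) + np.log(_ip(S, S, sigma2))
            - 2.0 * np.log(_ip(X, S, sigma2)))
```

The divergence is zero when the codebook reproduces the data exactly (the X = S case of Corollary 4.1) and positive otherwise, which is what makes it usable as a distortion measure.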
5.2 Review of Previous Work
Previous work on ITVQ by Lehn-Schiøler et al. [2005] uses a gradient method to
achieve the same goal. The gradient update is given by

x_k^(τ+1) = x_k^(τ) − η ∂J(X)/∂x_k ,   (5–2)

where η is the learning rate parameter and J(X) = Dcs(pX ||pS). The derivative
∂J(X)/∂x_k then equals the expression below, as was shown in Theorem 4.
∂J(X)/∂x_k = (2/V(X; S)) F(x_k; S) − (2/V(X)) F(x_k).
Unfortunately, as is true of any gradient method, the algorithm almost always gets stuck
in local minima. A common method used to overcome this problem is the simulated
annealing technique [Kirkpatrick et al., 1983; Haykin, 1999], where the kernel size σ is
slowly annealed from a large value to a small value. A large kernel size has the effect of
oversmoothing the surface of the cost function, eliminating or suppressing most
local minima and quickly accelerating the samples into the vicinity of the global solution.
Reducing the kernel size then allows the samples to reduce the bias in the solution
and effectively capture the true global minimum. Equation 5–3 shows how the kernel size
is annealed from a constant κ1 times the initial kernel size σo with an annealing rate ζ. Here,
σn denotes the kernel size used at the nth iteration.

σ_n = κ1 σ_o / (1 + ζ κ1 n)   (5–3)
The problem with the gradient method is that, to actually benefit from kernel
annealing, the step size also needs to be annealed, albeit with a different set of parameters.
Using a step size proportional to the kernel size helps speed up the algorithm. As
shown in 5–4, the step size is annealed from a constant κ2 times the initial value ηo with
an annealing rate γ, giving the step size ηn at the nth iteration:

η_n = κ2 η_o / (1 + γ κ2 n)   (5–4)
As can be imagined, selecting these parameters can be a daunting task. Not only do
we need to tune the parameters finely, we also need to take into consideration the
interdependencies between them. For example, for a given annealing rate ζ of
the kernel size there is a narrow range of values for the annealing rate γ of the step size
that makes the algorithm work. Selecting outside this range either makes the algorithm
unstable or stops it prematurely due to slow convergence.
By doing away with the gradient method and using fixed point update rule 3–9, we
effectively solve this problem and at the same time speed up the algorithm tremendously.
One of the easiest initializations for X is to randomly select N points from S with
N << M (M is the number of samples in the original dataset S). Unfortunately, without
kernel annealing the problem of local minima persists (though not as severely as with the
gradient method), with different initializations giving slightly different solutions. Kernel
annealing, on the other hand, benefits the algorithm in two ways. It gives the samples
enough freedom initially to quickly capture the important aspects of the pdf,
speeding up the algorithm. Secondly, it makes the solution almost independent of the
initialization, giving identical results every time. We will show this in our experiments,
where we initialize X not from the dataset S but from random points anywhere in the
vicinity of this dataset.
In this chapter, we present the quantization results obtained on an artificial
dataset as well as some real images. The artificial dataset shown here was used in
[Lehn-Schiøler et al., 2005]. A real application of compressing the edges of a face image
is shown next. To distinguish between the two variations of the ITVQ method, we
call our method ITVQ fixed point (ITVQ-FP) and the gradient method ITVQ-gradient.
A quantified comparison between ITVQ-FP, ITVQ-gradient and the standard
Linde-Buzo-Gray (LBG) algorithm is provided. To compare with LBG, we also report the
mean square quantization error (MSQE) shown in 5–5. Finally, we present some image
compression results showing the performance of the ITVQ-FP and LBG techniques.
ε = (1/N) Σ_{i=1}^{N} ||x_i − c_{i*}||² ,   c_{i*} = argmin_{c_j ∈ C} ||x_i − c_j||² .   (5–5)
5.3 Toy Problem
This artificial dataset consists of two semicircles of unit radius interlocking with each
other, perturbed by Gaussian noise of standard deviation 0.1. N = 16 random
points were selected from a square in the middle of the data, with −0.5 < X < 1.5 and
−1 < Y < 1, as shown in Figure 5-1. In the gradient method the initial kernel was set as
shown in 5–6, with the diagonal entries being the variance along each feature component.
The kernel was annealed with parameters κ1 = 1 and ζ = 0.05. At no point was the
kernel allowed to go below σo/√N. Note that these are the same parameters used
in [Lehn-Schiøler et al., 2005].
σ_o = diag(0.75, 0.51)   (5–6)
The most difficult part of the gradient method is selecting the step size and its annealing
parameters to best suit the kernel annealing rates. The step size should be sufficiently
Figure 5-1. Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution.
small to ensure smooth convergence. Further, the step size annealing rate γ should be
selected such that the step size at each iteration is well below the maximum allowed for the
current kernel size. After many trial-and-error runs, the following step size
parameters were selected: ηo = 1, κ2 = 1 and γ = 0.15. Further, the step size was never
allowed to go below 0.05.
In the case of the fixed point method, for a fair comparison with its gradient counterpart,
we select the same kernel initialization and associated annealing parameters. There is
no step size or step size annealing in this case. To quantify the statistical variation,
50 Monte Carlo simulations were performed, ensuring that the initialization was the
same for both methods in every simulation. With these parameters, both methods gave
a good result almost every time. Nevertheless, the fixed point method was found to be
more consistent in its results. Figure 5-2A shows the plot of the cost function
J(X) = Dcs(pX ||pS). Clearly, the fixed point algorithm is almost 10× faster than the
gradient method and gives a better solution in terms of minimizing the cost function.
Figure 5-2B shows one of the best results for the gradient and fixed point methods. A
careful look shows some small differences which may explain the smaller J(X)
Figure 5-2. Performance of the ITVQ-FP and ITVQ-gradient methods on the half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results.
obtained for the fixed point over the gradient method. Further, the fixed point method
has a lower MSQE, 0.0201 compared to 0.0215 for the gradient method.
5.4 Face Boundary Compression
Here, we present 64 point quantization results on the edges of a face image. Apart
from being an integral part of image compression, this technique also finds application
in face modeling and recognition. We would like to preserve the facial details as much as
possible, especially the eyes and ears, which are the more complex features.
We repeat the procedure of finding the best parameters for the gradient method.
After an exhaustive search we end up with the following parameters: κ1 = 1, ζ = 0.1,
ηo = 1, κ2 = 1, γ = 0.15. σo was set to a diagonal matrix with the variance along each
feature component as the diagonal entries. The algorithm was initialized with 64 random
points inside a small square grid centered on the mouth region. As before, the kernel
parameters for the fixed point method were set exactly as in the gradient method for a
fair comparison.
As discussed earlier, annealing the kernel size in ITVQ-FP not only gives the samples
enough freedom to quickly move and capture the important aspects of the data, thus
speeding up the algorithm, but also makes the final solution more robust to random
initialization. This idea is best illustrated in Figure 5-3, where the initialization is done
with random points in the middle of the face and around the mouth region. Notice
how the code vectors initially spread beyond the face region with the large kernel size and
immediately capture the broad aspects of the data. As the kernel size is decreased, the
samples model the fine details and give a very good solution.
Figure 5-4 shows the results obtained using ITVQ fixed point, ITVQ gradient
method and LBG. Two important conclusions can be made. Firstly, among the ITVQ
algorithms, the fixed point represents the facial features better than the gradient method.
For example, the ears are very well modeled in the fixed point method. Perhaps, the
most important advantage of the fixed point over the gradient method is the speed of
convergence. We found that ITVQ fixed point was more than 5× faster than its gradient
counterpart. In image applications, where the number of data points (pixels) is generally
large, this translates into a huge saving of computational time.
Secondly, both the ITVQ algorithms outperformed LBG in terms of facial feature
representation. The LBG uses many code vectors to model the shoulder region and few
code vectors to model the eyes or ears. This is due to the fact that LBG just uses second
order statistics whereas ITVQ, due to its intrinsic formulation, uses all higher order
statistics to better extract the information from the data and allocate the code vectors
to suit the structural properties. This also explains why the ITVQ methods perform
very close to LBG from the MSQE point of view, as shown in Table 5-1. On the other hand,
notice the poor performance of LBG in terms of minimizing the cost function J(X).
Obviously, the LBG and ITVQ fixed point methods give lowest values for their respective
cost functions.
Table 5-1. J(X) and MSQE on the face dataset for the three algorithms averaged over 50 Monte Carlo trials.

Method         J(X)     MSQE
ITVQ-FP        0.0253   3.355 × 10^-4
ITVQ-gradient  0.0291   3.492 × 10^-4
LBG            0.1834   3.05 × 10^-4
Figure 5-3. Effect of annealing the kernel size on ITVQ fixed point method (panels A-F show successive stages).
Figure 5-4. 64-point quantization results of ITVQ-FP, the ITVQ gradient method and the LBG algorithm on the face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm.
Figure 5-5. Two popular images selected for the image compression application. A) Baboon image. B) Peppers image.
5.5 Image Compression
One of the most popular applications of vector quantization is the area of image
compression for efficient storage, retrieval and transmission. We demonstrate here
the performance of ITVQ-FP on some popular images and compare it with the LBG
technique. Figure 5-5 shows the two images used in this section. The first is the popular
baboon image and the other is the peppers image available in Matlab. Due to the huge
size of these images, the computational complexity would be exorbitant. Therefore,
we selected only some portions of these images to work with as shown in Figure 5-6.
These portions were limited to at most 5000 pixel points for convenience and ease of
implementation.
We used the L*u*v feature space for this application and initialized the codebook as
random points selected from the dataset itself, depending on the compression level. Instead
of using a simulated annealing technique, we performed 10 to 20 Monte Carlo simulations
for both methods and selected the best result in terms of their respective
cost functions. We found that our algorithm was able to find good results every time with
these Monte Carlo runs, and the results were similar to those generated using the simulated
annealing technique. Different levels of compression were evaluated, starting from 80%
to as far as 99.75% compression in some images. We hand picked the maximum level of
Figure 5-6. Portions of the Baboon and Peppers images used for image compression.
compression for both ITVQ-FP and LBG after comparing the reconstructed image to
the original image. Since ITVQ-FP has an extra parameter of kernel size, we ran the
algorithm for different kernel sizes in the range of σ2 = 16 to σ2 = 64 and selected the best
result.
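The selection procedure above, keeping the best of several random restarts, can be sketched as follows; `run_once` is a hypothetical stand-in for a single ITVQ-FP (or LBG) run that returns a codebook and its cost, not a function from the dissertation:

```python
import random

def best_of_restarts(run_once, n_runs=10, seed=0):
    """Run the quantizer n_runs times from random initializations and
    keep the codebook with the lowest cost."""
    rng = random.Random(seed)
    best_codebook, best_cost = None, float("inf")
    for _ in range(n_runs):
        codebook, cost = run_once(rng)
        if cost < best_cost:
            best_codebook, best_cost = codebook, cost
    return best_codebook, best_cost
```

Each method would be scored against its own cost function (J(X) for ITVQ-FP, MSQE for LBG), matching the comparison protocol described in the text.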
Figure 5-7 shows the comparison between ITVQ-FP and LBG techniques. In the
case of images, it is hard to quantify the difference as such, unless this compression is
used as a preprocessing stage in some bigger application. Nevertheless, one can see some
small differences between these methods, such as the eye region of the baboon in Figures 5-7A
and 5-7B. Overall, both methods are fast, perform reliably and produce a very high level of
compression suitable for many practical applications.
5.6 Summary
Vector quantization, the art of preserving the information of the data in a few code
vectors, is an important prerequisite in many applications. However, for this to be
Figure 5-7. Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A,B) 99.75% compression for both algorithms, σ2 = 36 for the ITVQ-FP algorithm. C,D) 90% compression, σ2 = 16. E,F) 95% compression, σ2 = 16. G,H) 85% compression, σ2 = 16.
practical, not only should the algorithm be fast but at the same time preserve as much
information as possible.
In this chapter, we have shown how information theoretic vector quantization arises
naturally as a special case of our general unsupervised learning principle. This algorithm
has a dual advantage. First, by using information theoretic concepts, we use all the higher
order statistics available in the data and hence preserve the maximum possible information.
Second, by using a fixed point update rule, we substantially speed up the algorithm
compared to the gradient method. At the same time, we circumvent many parameters used
in the gradient method, whose tuning is too heuristic to be applicable in any real-life
scenario.
CHAPTER 6
APPLICATION III: MANIFOLD LEARNING
6.1 Introduction
In this chapter, we describe a final aspect of our learning theory called principal
curves with application prospects in the area of manifold learning and denoising.
The notion of principal curves was first introduced by Hastie and Stuetzle [1989] as a
nonlinear extension of principal component analysis (PCA). The authors describe them
as “self-consistent” smooth curves which pass through the “middle” of a d-dimensional
probability distribution or data cloud. The Hastie-Stuetzle (HS) algorithm attempts to
minimize the average squared distance between the data points and the curve. In essence,
the curve is built by finding the local mean along directions orthogonal to the curve. This
definition also uses two different smoothing techniques to avoid overfitting. An alternative
definition of principal curves based on the mixture model was given by Tibshirani [1992].
An EM algorithm was used to carry out the estimation iteratively.
A major drawback of Hastie’s definition is that it is not amenable to theoretical
analysis. This has prevented researchers from formally proving the convergence of
the HS algorithm. The only known fact is that principal curves are fixed points of the
algorithm. Recently, Kegl [1999] developed a regularized version of Hastie’s definition.
By constraining the total length of the curve to be fitted to the data, one could prove the
existence of a principal curve for every dataset with finite second order moment. With
this theoretical model, Kegl derived a practical algorithm to iteratively estimate the
principal curve of the data. The Polygonal Line Algorithm produces piecewise linear
approximations to the principal curve using gradient based method. This has been applied
to hand-written character skeletonization to find the medial axis of the character [Kegl
and Krzyzak, 2002]. In general, one could find a principal graph of a given dataset using
this technique. Similarly, other improvements were proposed by Sandilya and Kulkarni
[2002] to regularize the principal curve by imposing bounds on the turns of the curve.
The common theme among all these methods is to initialize a line to the dataset
and then, recursively update its parametric form to minimize a given distortion measure.
This parametric approach generates poor results as the complexity of the data manifold
increases. Part of the reason is the initialization, which has no a priori information
built in about the shape of the data. Further, with the gradient method used
as an optimization procedure, this technique quite often gets stuck in local solutions. Could
we instead extract the principal curve directly from the data? Since this information is
contained in the data, with data itself as the initialization, we are bound to capture the
shape of the manifold much better through self organization of the samples.
In light of this background, consider once again our cost function which is reproduced
here for convenience.
J(X) = min_X [ H(X) + β D_CS(p_X || p_S) ].
We have seen that for β = 1 we get the modes of the data, which effectively capture the
high concentration regions. What happens if we increase the value of β? From the cost
function, we see that as we increase this value, we increase the weight of the similarity
term Dcs(pX ||pS). By putting more emphasis on the similarity measure, the algorithm
would have to give something more (information) than just the modes. What we actually
see is that the algorithm returns a curve passing through the modes of the data. The
reason for modes to be part of the curve becomes clear when we see the cost function in
the simplified form as shown below.
J(X) = min_X [ (1 − β) H(X) + 2β H(X; S) ].
The second term, Renyi's cross entropy, is the cost function of the GMS algorithm
and is minimized at the modes of the data. Therefore, to minimize J(X), it is essential
that the modes be part of the solution. On the other hand, since (1 − β) < 0, we are
effectively maximizing H(X).
To satisfy this and minimize J(X) overall, the data spreads along a curve connecting
the modes of the pdf. This satisfies our intuition since a curve through the modes of the
data has more information than just discrete points representing modes. As we continue
increasing the value of β, in the extreme case for β → ∞, we get back the data itself as
the solution capturing all possible information.
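The cost in its simplified form can be evaluated directly from samples using the standard ITL Parzen-window estimators of Renyi's quadratic entropy and cross entropy. The sketch below drops the kernel normalization constants (they only shift H by an additive constant, which does not affect the minimizer) and glosses over the exact kernel-size conventions of the dissertation's estimator:

```python
import numpy as np

def renyi_quadratic_entropy(X, sigma):
    """H2(X) = -log( (1/N^2) sum_ij G_sigma(x_i - x_j) ), up to a constant."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return -np.log(np.exp(-d2 / (2 * sigma ** 2)).mean())

def renyi_cross_entropy(X, S, sigma):
    """H(X; S) = -log( (1/NM) sum_ij G_sigma(x_i - s_j) ), up to a constant."""
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return -np.log(np.exp(-d2 / (2 * sigma ** 2)).mean())

def cost_J(X, S, beta, sigma):
    """J(X) = (1 - beta) H(X) + 2 beta H(X; S)."""
    return (1 - beta) * renyi_quadratic_entropy(X, sigma) \
        + 2 * beta * renyi_cross_entropy(X, S, sigma)
```

With X = S the cross entropy coincides with the entropy, so J collapses to (1 + β) H(X); as β grows, the weight on the cross entropy term dominates, matching the behavior described above.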
It is interesting to note that the principal curve proposed here passes through the
modes of the data, unlike other definitions, which pass through the mean of the projections
of all the points orthogonal to the curve. This would then dictate a new maximum
likelihood definition of principal curves. Since the idea of principal curves is to find the
intrinsic lower dimensional manifold from which the data originated, we are convinced
that mode (or peak) along every orthogonal direction is much more informative than the
mean. By capturing the local dense regions, one can imagine this principal curve as a ridge
passing through the pdf contours. Such a definition was recently proposed by Erdogmus
and Ozertem [2007]. We briefly summarize the definition and the results obtained here for
our discussion.
6.2 A New Definition of Principal Curves
Let x ∈ Rn be a random vector and p(x) be its pdf. Let g(x) denote the transpose of
the gradient of this pdf and U(x) its local Hessian.
Definition 1. A point x is an element of the d-dimensional principal set, denoted by ρd,
iff g(x) is orthogonal (null inner product) to at least (n − d) eigenvectors of U(x) and
p(x) is a strict local maximum in the subspace spanned by these (n − d) eigenvectors.
Using this definition, Erdogmus proved the following important results.
• Modes of the data correspond to ρ0 i.e. 0-dimensional principal set.
• ρd ⊂ ρd+1
This immediately highlights the hierarchical structure in the data, with the modes, given by
ρ0, being part of the 1-dimensional curve ρ1. By going through all the dense regions of
the data (the modes), ρ1 can be considered as a new definition of principal curve which
effectively captures the underlying 1-dimensional structure in the data.
Figure 6-1. The principal curve for a mixture of 5 Gaussians using the numerical integration method.
In order to implement this, Erdogmus first found the modes of the data using
Gaussian Mean Shift (GMS) algorithm. Starting from each mode and shooting a
trajectory in the direction of the eigenvector corresponding to the largest eigenvalue 1 ,
one can then effectively trace out the curve. A numerical integration technique like
Runge-Kutta order 4 (RK4) could be utilized to get the next point on the curve starting
at the current point and moving in the direction of the eigenvector of the local Hessian
evaluated at that point. Figure 6-1 shows an example to clarify this idea.
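The RK4 tracing step described above can be sketched as follows. Here `direction(x)` is a placeholder for the unit eigenvector field along which the curve is traced, and `leading_eigvec` illustrates how that direction could be obtained from a user-supplied local Hessian, with sign alignment to the previous step so the trajectory does not reverse:

```python
import numpy as np

def rk4_step(x, direction, h):
    """One Runge-Kutta order 4 step following a unit direction field."""
    k1 = direction(x)
    k2 = direction(x + 0.5 * h * k1)
    k3 = direction(x + 0.5 * h * k2)
    k4 = direction(x + h * k3)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def leading_eigvec(hessian, x, prev=None):
    """Eigenvector of the local Hessian with the largest eigenvalue,
    sign-aligned with the previous tracing direction when given."""
    w, V = np.linalg.eigh(hessian(x))   # eigenvalues in ascending order
    v = V[:, np.argmax(w)]
    if prev is not None and np.dot(v, prev) < 0:
        v = -v
    return v
```

Starting from a mode (where the Hessian is negative semidefinite), repeated calls to `rk4_step` with `direction(x) = leading_eigvec(hessian, x, prev)` would trace out the curve point by point.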
The example shown here is a simple case with well defined Hessian values. In many
practical problems though, this approach suffers from ill conditioning and numerical issues.
Further, the notion that a 1-dimensional structure exists for every dataset is not true.
For example, for a T-shaped 2-dimensional dataset there does not exist a 1-dimensional
manifold. What is possible is only a 2-dimensional denoised representation, which can
be seen as a principal graph of the data. In the following section, we experimentally
show how our algorithm satisfies the same definition extracting principal curve (or
1 Note that the Hessian at the modes is negative semidefinite.
graphs) directly from the data but much more elegantly. Principal curves have immense
applications in denoising and as a tool for manifold learning. We will illustrate with some
examples how this can be achieved.
6.3 Results
Two examples are presented here. One is a spiral dataset which is a classic example
used in the literature of principal curves. The other is a “chain of rings” dataset showing
the denoising ability of principal curves.
6.3.1 Spiral Dataset
We will start with the example of spiral data which is considered a very difficult
problem in principal curves literature [Kegl, 1999]. The data consists of 1000 samples
which are perturbed by Gaussian noise with variance equal to 0.25. We would like to point
out here that this is a much noisier dataset than the one actually used in [Kegl,
1999]. We use β = 2 for our experiment here, although we found that any value between
1 < β < 3 can be used. Figure 6-2 shows the different stages of the principal curve as it
evolves starting with initialization X = S where S is the spiral data.
Note how quickly the samples tend to the curve. By the 10th iteration the structure
of the curve is clearly revealed. After this, the changes in the curve are minimal with
the samples only moving along the curve (and hence always preserving it). What is even
more exciting is that this curve exactly passes through the modes of the data for the
same scale σ as shown in Figure 6-3. Thus our method gives a principal curve which
satisfies definition 1 naturally. We also depict the development of the cost function J(X)
and its two components H(X) and H(X; S) as a function of the number of iterations in
Figure 6-4. Notice the quick decrease in the cost function due to the rapid collapse of the
data to the curve. Further decrease is associated with small changes in the movement of
samples along the curve and by stopping at this juncture we can get a very good result as
shown in Figure 6-2F.
Figure 6-2. The evolution of principal curves starting with X initialized to the original dataset S. A) Starting initialization X = S. B) Iteration 2. C) Iteration 5. D) Iteration 10. E) Iteration 20. F) Iteration 30. The parameters were set to β = 2 and σ2 = 2.
Figure 6-3. The principal curve passing through the modes (shown with black squares).
Figure 6-4. Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) Two components of J(X), namely H(X) and H(X; S).
Figure 6-5. Denoising ability of principal curves. A) Noisy 3D “chain of rings” dataset. B) Result after denoising.
6.3.2 Chain of Rings Dataset
Figure 6-5A shows a 3-dimensional “chain of rings” dataset corrupted with Gaussian
noise of unit variance. The parameters of the algorithm were set to β = 1.5 and σ2 = 1.
With less than 20 iterations, the algorithm was able to successfully remove the noise and
extract the underlying “principal manifold”, as can be seen in Figure 6-5B.
6.4 Summary
In this chapter, we have made an attempt to understand the principle of Relevant
Entropy as a tool for denoising and manifold learning. As the β value is increased beyond
one, the modes start giving way to curves which characterize the data in a lower dimensional
space. We have shown through experiments that these results actually satisfy the new
definition of principal curves. By passing through the modes of the data, these curves not
only take into account the high density regions of the pdf, but also reveal the underlying
lower dimensional structure of the data. We see a strong biological imitation and hope
that this is one step in the right direction to learn how the human brain assimilates data
efficiently.
CHAPTER 7
SUMMARY AND FUTURE WORK
7.1 Discussion and Conclusions
In this thesis, we have presented a novel framework for unsupervised learning.
The principle of Relevant Entropy is the first to address the fundamental idea behind
unsupervised learning, that of finding the underlying structure in the data. There are
many attributes and strengths which come along with this formulation. This is the first
theory which self organizes the data samples to unravel hidden structures in the given
dataset instead of imposing external structures. For example, there exists an entire field
called structural pattern recognition with numerous applications in the area of natural
language and syntactic grammar [Fu, 1974], optical character recognition [Amin, 2002],
time-series data analysis [Olszewski, 2001], image processing [Caelli and Bischof, 1997],
face and gesture recognition [Bartlett, 2001] and many more. Obviously, the type of
“structures” or “primitives” depends on the data at hand. This has been one of the
bottlenecks of this field. All the applications described so far are domain dependent and
use extensive knowledge of the problem at hand to construct the appropriate structures.
Creating a knowledge base is a difficult task and in some applications, especially online
documents or biological data, it is a costly affair. Further, in some applications like scene
analysis, quantifying “structures” manually may be an ill posed problem due to their
variety and multitude of effective combinations possible to generate different pictures. A
self organizing technique to dynamically find appropriate structures from the data itself
would not only make knowledge bases redundant, but also lead to online and fast learning
systems. For a thorough treatment of this topic, we would direct the readers to Watanabe
[1985, chap. 10].
The notion of “goal” is a crucial aspect of unsupervised learning, so much so that one
could call this field Goal Oriented Learning (GOL!!). This approach is needed since
the learner only has the data available without any additional a priori information like
desired response or an external critic. Speaking in a broad sense, we humans do this in
our day-to-day life. There are many ways to assimilate information from the incoming sensory
inputs, but we only learn what serves our purpose (or goal). Referring back to Figure 1-2,
we see that our formulation is driven by the goal the machine seeks. This is modeled
through a parameter β which is under the control of the system itself. By allowing the
machine to influence the level of abstraction to be extracted from the data and in turn
capturing “goal oriented structures”, the Principle of Relevant Entropy addresses this
fundamental concept behind unsupervised learning.
Appropriate goals are needed for appropriate levels of learning. For example, in
a game of chess, the ultimate goal is to checkmate your opponent and win the game.
There are also intermediary goals, like making the best move to capture the opponent's queen.
These correspond to higher level goals. A lower level vision task is to make such a move
happen. This involves processing the incoming vision signals and going through the
process of clustering (segregation of different objects), principal curves (denoising) and
vector quantization (compact representation) to assimilate the data. It is these lower
level goals that we have addressed in this thesis. Further, beyond this sensory processing
and cognition stage, one needs to convert this knowledge into appropriate actions and
motor actuators. This involves the process of reasoning and inference (especially inductive
inference, though deductive reasoning may be necessary to compare with past experiences)
which would generate appropriate control signals to drive actions. These actions in
the environment (like making your respective moves in the game of chess) would in
turn generate new data which needs to be further analyzed. Though our schematic in
Figure 1-2 shows this complete idea, we certainly do not intend to solve this entire gamut
of machine learning and we encourage readers to think on these lines to further this
theory. An interesting aspect of this schematic is that reinforcement learning can be seen
as a special case of our framework where goal and the critical reasoning blocks are external
to the learner and controlled by an independent critic.
Another aspect we would like to highlight is the nature of the cost function and its
complexity. Notice that the entire framework is a function of just two parameters; the
weighting parameter β which controls the task (and hence the type of structure) to be
achieved and the inherent hidden scale parameter σ which controls the resolution of our
analysis. Thus, the parameters are minimal and have a clear purpose. We have deferred the
discussion about kernel size until now. Most current approaches try to select a particular
kernel size to analyze the data. There exists extensive literature for such kernel size
selection techniques. Some of the examples include Silverman’s rule of thumb [Silverman,
1986], maximum-likelihood (ML) technique and K nearest neighborhood (KNN) method.
We have summarized these techniques in Appendix C for easy reference. Some of these
techniques make extra assumptions to find a single suitable kernel size appropriate to the
given dataset. For example, Silverman's rule assumes that the data is Gaussian in
nature. A more robust method would be to assign a different kernel size to each sample.
The KNN technique can yield such a result as discussed in Appendix C.
A different way to see the kernel size issue is its role as controlling the scale of
analysis. Since structures exist at different scales, it would be prudent to analyze the data
with a multiscale analysis before we make an inference. This is certainly true in image
and vision applications as well as in many biological data. From this angle, we see kernel
size as a strength of our formulation, encompassing multiscale analysis needed for effective
learning. This gives rise to an interesting perspective where one can see our framework
as analyzing the data in a 2-D plane. The two axes of this plane correspond to the two
continuous parameters β and σ. By extracting structures for different combinations of
(β, σ), we unfold the data in this two dimensional space to truly capture the very essence
of unsupervised learning. This idea is depicted in Figure 7-1. One can also see this
analysis as a “2D spectrum” of the data revealing all the underlying patterns, giving rise
to a strong biological connection.
Figure 7-1. A novel idea of unfolding the structure of the data in the 2-dimensional space of β and σ.
The computational complexity of our algorithm is O(N2) in general where N is
the number of data samples. This compares favorably with other methods particularly
clustering techniques. For example, the latest method of spectral clustering [Shi and
Malik, 2000] requires one to find the N × N Gram matrix (which requires huge memory,
especially for images) and then find the K dominant eigenvectors. This would entail
a theoretical cost of O(N3) which can be brought down to O(KN2) by using the
symmetry of the matrix and recursively finding the top eigenvectors. A way to alleviate
the computational burden of our algorithm is to use the Fast Gauss transform (FGT)
technique [Greengard and Strain, 1991; Yang et al., 2005]. By approximating the Gaussian
with a Hermite series expansion, we could reduce the computational complexity to O(LpN)
where L and p are parameters of the expansion (these are very small compared to N).
Another idea would be to employ resampling techniques to work with only M points
(with M << N) randomly selected from the data and allow only these points to converge.
For example, in GMS algorithm for clustering application, all we need to do is to ensure
that there is at least one point from this resampled set corresponding to each mode.
Convergence of these points would then be sufficient to identify all the modes. A practical
way to do this is to employ the bootstrapping technique from statistics and perform Monte
Carlo simulations with resampling done at each experiment. This would bring down the
cost to O(MN), M << N, for each Monte Carlo run.
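A sketch of this resampling idea for Gaussian mean shift is given below: only M seed points drawn from the N data points are iterated, so each iteration costs O(MN) instead of O(N^2). The choices of M, the iteration cap and the tolerance are illustrative, not values from the dissertation:

```python
import numpy as np

def gms_modes(S, sigma, M=50, iters=100, seed=0, tol=1e-6):
    """Gaussian mean shift run on M seed points resampled from the data S."""
    rng = np.random.default_rng(seed)
    X = S[rng.choice(len(S), size=M, replace=True)].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * sigma ** 2))           # M x N kernel weights
        X_new = (W @ S) / W.sum(axis=1, keepdims=True)  # mean shift update
        done = np.abs(X_new - X).max() < tol
        X = X_new
        if done:
            break
    return X
```

As noted in the text, the run succeeds as long as at least one resampled seed converges to each mode, which bootstrapped repetitions make increasingly likely.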
Information Theoretic Learning (ITL) developed by Principe and collaborators
at CNEL is the first theory which is able to break the barrier and address all the four
important issues of the unsupervised learning framework as listed at the end of Chapter 1.
Crucial to this success, are the non-parametric estimators for information theoretic
measures like entropy and divergence which can be computed directly from the data.
Further, the ability to derive a simple and fast fixed point iterative scheme creates a self
organizing paradigm, thus avoiding problems like step size selection which are part of
the gradient methods. This is an important as well as a crucial bottleneck which ITL
solves elegantly. Due to the lack of such an iterative scheme, many methods end up making
Gaussian approximations and missing important higher order features in the data.
This thesis deals in general with spatial data under the assumption of iid samples. This
occurs in diverse fields including pattern recognition, image analysis and machine learning
applications. There are other problems where the data has time structure. Examples
include financial data, speech and video signals to name a few. We can also extract similar
features from these time series (as done in speaker recognition) or do a local embedding
to create a feature space and still apply our method. We show two examples of this
approach in Appendices A and B 1 . There are also plenty of sophisticated methods like
Hidden Markov Models (HMMs), State Space Models (SSMs) etc. that are appropriate
for such time series data [Ghahramani, 2003]. A novel concept developed at CNEL called
Correntropy [Santamaria et al., 2006] also extracts time structures in the data using
1 Although the unsupervised learning techniques used in these sections are different from our algorithm.
kernel methods. Nevertheless, a good future direction to pursue is to bring in time in our
formulation. Of course, this is a whole new area and we encourage readers to take this up
as a Ph.D topic!!
7.2 Future Direction of Research
The field of machine learning, and in particular unsupervised learning, is at a
crossroad with some exciting developments taking place both in terms of theoretical
as well as practical applications. The advancements in brain science as well as the
engineering needs of the 21st century are only going to fuel such activity. For example, the
explosion of data due to the rapid growth of the internet and mass communication is one
such area which has garnered particular attention in information sciences and computing.
One needs sophisticated machine learning techniques to sort these data out and arrange them
so as to efficiently retrieve and transmit information. Due to advancements in data
collection and storage techniques, other scientific areas like bioinformatics, astronomy,
neuro-computing and image analysis are unable to cope with the exponential growth of
data. This has led to a shift from traditional hypothesis driven to data driven science2 .
By working directly with the data to unravel hidden structures, we believe our novel
framework is one such step in the right direction.
There are some particularly interesting areas in which we would like to pursue
our future work. One such area is the theoretical analysis of the cost function. The
convergence of the fixed point update rule for some special cases has been shown in this
thesis. But for the general case, this seems especially difficult. Special attention is required
in understanding the working of the principle beyond β = 1. In particular, rigorous
mathematical analysis is needed to quantify the concept of principal curves. We believe
we have something more than just principal curves here, but this would need a totally
2 http://cnls.lanl.gov/
different approach and substantial mathematical analysis. Work in this area would lead to
new applications while enhancing our understanding of this principle.
Another topic which is of interest is the connection to information bottleneck (IB)
method [Tishby et al., 1999]. There are two particular reasons we are interested in
pursuing this research. As pointed out earlier in Chapter 1, this technique tries to address
meaningful or relevant information in a given data S through another variable Y which
is dependent on S. Inherent to this setup is the assumption that the joint density p(s, y)
exists so that I(S; Y ) > 0. This idea is depicted clearly in Figure 7-2. In order to capture
this relevant information Y in a minimally redundant codebook X, one needs to minimize
the mutual information I(X; S) constrained on maximizing the mutual information
I(X; Y ). This is framed as the minimization of a variational problem, as shown below:
L(p(x|s)) = I(X; S) − β I(X; Y ),   (7–1)
where β is a Lagrange multiplier. Most methods at this stage make a Gaussian assumption
due to the difficulty of deriving an iterative scheme using the Shannon definition. But the
authors, using ideas from rate distortion theory, and in particular by generalizing the
Blahut-Arimoto algorithm [Blahut, 1972], were able to derive a convergent re-estimation
scheme. These iterative self consistent equations work directly with the probability
density, thus providing an alternative solution to capture relevant higher order information
and avoid Gaussian approximations altogether.
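For discrete variables, one round of the self-consistent IB equations of Tishby et al. [1999] can be sketched as below. The smoothing floors that guard against log(0) are an implementation choice, not part of the original derivation:

```python
import numpy as np

def ib_update(p_x_given_s, p_s, p_y_given_s, beta):
    """One round of the self-consistent IB equations for discrete S, X, Y.
    p_x_given_s: encoder, shape (|S|, |X|); p_s: shape (|S|,);
    p_y_given_s: relevance distribution, shape (|S|, |Y|)."""
    # p(x) and p(s|x) from the current encoder
    p_sx = p_s[:, None] * p_x_given_s            # joint p(s, x)
    p_x = p_sx.sum(axis=0)
    p_s_given_x = p_sx / np.maximum(p_x, 1e-300)
    # decoder p(y|x) = sum_s p(y|s) p(s|x)
    p_y_given_x = p_s_given_x.T @ p_y_given_s    # shape (|X|, |Y|)
    # KL( p(y|s) || p(y|x) ) for every (s, x) pair
    log_ratio = np.log(np.maximum(p_y_given_s[:, None, :], 1e-300)) \
              - np.log(np.maximum(p_y_given_x[None, :, :], 1e-300))
    kl = (p_y_given_s[:, None, :] * log_ratio).sum(axis=2)  # (|S|, |X|)
    # re-estimated encoder p(x|s) ∝ p(x) exp(-beta * KL), normalized over x
    new = p_x[None, :] * np.exp(-beta * kl)
    return new / new.sum(axis=1, keepdims=True)
```

Iterating this update to a fixed point yields the convergent re-estimation scheme mentioned above, with β sweeping the encoder from maximally compressed to maximally informative about Y.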
Instead of assuming an external relevant variable, the Principle of Relevant Entropy
extracts such relevant information directly from the data for different combinations of
(β, σ). Thus, we go one step further, by not just extracting one structure, but a range of
structures relevant for different tasks and at different resolutions. By doing so, we gather
maximal information to actively learn directly from the data. Additionally, working with
just two variables namely X and S, we simplify the problem with no extra assumption of
the existence of the joint density function p(s, y).
Figure 7-2. The idea of the information bottleneck method: S → X via the encoder p(x|s) and X → Y via p(y|x), with joint density p(s, y) and I(S; Y) > 0.
It should be pointed out that our method is particularly designed to address
unsupervised learning. The IB method has been applied to supervised learning paradigms
due to the availability of the extra variable Y, though some applications of the IB method to
unsupervised learning, like clustering and ICA, have also been pursued. In particular,
changing the value of the parameter β leads from a coarse to a very fine analysis and has
interesting parallels with the parameters in our cost function. A similar form of cost
function, based on an energy minimizing principle for unsupervised learning, has been
proposed by Ranzato et al. [2007], and it would be important to study further the
connection to these methods.
Finally, we would like to pursue research in the area of inference, data fusion and
active learning [Cohn et al., 1996]. Going back to our schematic in Figure 1-2, a constant
feedback loop exists between the environment and the learner. Thus, a learner is not a
passive recipient of the data, but can actively sense and gather it to advance its learning.
To do so, one needs to address the issue of inference and reasoning beyond the structural
analysis of the incoming data. Lessons from reinforcement learning could also be used to
study actions resulting in both new data and rewards for the learner. This is a broad area
of research and has vast implications in many important fields.
To summarize, machine learning is an exciting field and we encourage readers as well
as critics to pursue research in this important area of science.
APPENDIX A
NEURAL SPIKE COMPRESSION
The work presented in this section specifically addresses the application of unsupervised
learning techniques to time-series data. We have devised a new compression algorithm
highly tailored to the needs of neural spike compression [Rao et al., 2007b].
Due to its simplicity, we were able to successfully implement this algorithm on a low-power
DSP processor for real-time BMI applications [Goh et al., 2008]. What follows is a
summary of this research.
A.1 Introduction
Brain Machine Interfaces (BMI) aim at establishing a direct communication pathway
between human or animal brain and external devices (prosthetics, computers) [Nicolelis,
2003]. The ultimate goal is to provide paralyzed or motor-impaired patients a mode
of communication through the translation of thought into direct computer control. In
this emerging technology, a tiny chip containing hundreds of electrodes is chronically
implanted in the motor, premotor and parietal cortices and connected through wires to an
external signal processor, where the recordings are processed to generate control signals [Nicolelis
et al., 1999; Wessberg et al., 2000].
Recently, the idea of wireless neuronal data transmission protocols has gained
considerable attention [Wise et al., 2004]. Not only would this enable increased mobility
and reduce the risk of infection in clinical settings, but it would also free behavioral
paradigms from cumbersome wiring. Although this idea looks simple, a major bottleneck in implementing
it is the severe bandwidth and power constraints imposed on these bio-chips. On
the other hand, to extract as much information as possible, we would like to transmit all
the electrophysiological signals for which the bandwidth requirement can be daunting.
For example, to transmit the entire raw digitized potentials from 32 channels sampled at
20 kHz with 16 bits of resolution we need a huge bandwidth of 10 Mbps.
Many solutions have been proposed to solve this problem. One is to perform
spike detection on site, and then transmit only the spike signal or the time at which the
spike occurred [Bossetti et al., 2004]. An alternative is to use spike detection and
sorting techniques so that binning (counting the number of spikes in a given time interval) can
be done immediately [Wise et al., 2004]. The disadvantage of these methods lies in the
weakness of current automated spike detection methods without human interaction, as
well as the lost opportunity for any post-processing, since the original waveform is discarded.
It is in this regard that we propose compressing the raw neuron potentials using well
established vector quantization techniques as an alternative and viable solution.
Before we delve into neural spike compression, it is important to first understand the
data at hand. We present here a synthetic neural dataset which was designed to emulate as
accurately as possible the scenario encountered with actual recordings of neuronal activity.
The waveform contains spikes from two neurons differing in both peak amplitude and
width, as shown in Figure A-1A. Both neurons fired according to homogeneous Poisson
processes with firing rates of 10 spikes/s (continuous line) and 20 spikes/s (dotted line).
Further, to introduce some variability in the recorded template, each time a neuron fired
the template was scaled by a Gaussian distributed random number with mean 1 and
standard deviation 0.01. Finally, the waveform was contaminated with zero-mean white
noise of standard deviation 0.05. An instance of the neural data is shown in Figure A-1B.
Notice the sparseness of the spikes compared to the noise in the data. This is a peculiar
characteristic of neural signals in general.
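The generation procedure just described can be sketched as follows (a hypothetical helper; the argument names and defaults are illustrative, not taken from the dissertation):

```python
import numpy as np

def synth_neural_data(templates, rates, fs=20000, dur=5.0,
                      amp_sd=0.01, noise_sd=0.05, seed=0):
    """Superimpose Poisson-timed, amplitude-jittered spike templates on white noise."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    x = rng.normal(0.0, noise_sd, n)          # zero-mean white background noise
    for tpl, rate in zip(templates, rates):
        t = 0.0
        while True:
            t += rng.exponential(1.0 / rate)  # homogeneous Poisson inter-spike times
            i = int(t * fs)
            if i + len(tpl) > n:
                break
            # scale the template by N(1, amp_sd) to model amplitude variability
            x[i:i + len(tpl)] += np.asarray(tpl) * rng.normal(1.0, amp_sd)
    return x
```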
We show here the two dimensional non-overlapping embedding of the training data
in Figure A-2. Note the dense cluster near the origin corresponding to the noise which
constitutes more than 90% of the data. Further, since the spikes correspond to large
magnitudes, the farther a data point is from the origin, the more likely it
is to belong to the spike region. It is these sparse points that we need to preserve as
accurately as possible while compressing the neural information.
Figure A-1. Synthetic neural data. A) Spikes from two different neurons. B) An instance of the neural data.
Figure A-2. 2-D embedding of the training data, which consists of a total of 100 spikes with a certain ratio of spikes from the two different neurons.
Blindly applying traditional vector quantization techniques like Linde-Buzo-Gray
(LBG) and Self Organizing Maps (SOM) to this data is bound to give a very low SNR
for the spike region (which is of interest here), since these algorithms would devote more
code vectors to modeling the denser regions of the data than the sparse regions. This
would correspond to modeling the noise regions of the data well at the cost of the spike
regions. To correct this, a new algorithm called self organizing map with dynamic learning
(SOM-DL) was introduced [Cho et al., 2007]. Though there is a slight improvement in
the SNR of the spike region, the changes made were heuristic. Further, these algorithms
are computationally intensive, with a large number of parameters to tune, and can only be
executed offline.
In this chapter, we introduce a novel technique called weighted LBG (WLBG)
algorithm which effectively solves this problem. Using a novel weighting factor, we give
more weight to the sparse region corresponding to the spikes in the neural data, leading to
a 15 dB increase in the SNR of the spike region while achieving a compression ratio
of 150 : 1. The simplicity and speed of the algorithm make it feasible to implement
this in real-time, opening new doors of opportunity in online spike compression for BMI
applications.
A.2 Theory
A.2.1 Weighted LBG (WLBG) Algorithm
The weighted LBG is essentially a recursive implementation of the weighted K-means
algorithm. The cost function optimized is the weighted L2 distortion measure between the
data points and the codebook, shown below in (A–1):

D(C) = \sum_{i=1}^{N} w_i \| x_i - c_{i^*} \|^2, (A–1)

where c_{i^*} is the nearest code vector to data point x_i, as given in (A–2):

c_{i^*} = \arg\min_{c_j \in C} \| x_i - c_j \|^2. (A–2)
Step 1. Specify the maximum distortion Dmax and the maximum number of levels Lmax.
Step 2. Initialize L = 0 and the codebook C as a random point in R^d.
Step 3. Set M = 2^L. Start the optimization loop:
    a. For each x_i ∈ X, calculate the nearest code vector c_{i*} ∈ C using (A–2).
    b. Update the code vectors in the codebook using (A–3), where the sum is taken
       over all data points for which c_j is the nearest code vector:

       c_j = \frac{\sum_{k: c_{k^*} = c_j} w_k x_k}{\sum_{k: c_{k^*} = c_j} w_k} (A–3)

    c. Measure the new distortion D(C) as shown in (A–1). If D(C) ≤ Dmax,
       go to Step 5; otherwise continue.
    d. Go back to (a) unless the change in the distortion measure is less than δ.
Step 4. If L = Lmax, go to Step 5. Otherwise set L = L + 1, split each point c_j ∈ C into
    two points c_j + ε and c_j − ε, and go back to Step 3.
Step 5. Stop the algorithm. Return C, the optimized codebook.

Figure A-3. The outline of the weighted LBG algorithm.
Consider a dataset X = {x1, x2, . . . , xN} ⊂ R^d. Let C = {c1, c2, . . . , cM} denote the
codebook to be found. The outline of the algorithm is shown in Figure A-3. Both δ and ε
are set to very small values; typical values are δ = 0.001 and ε = 0.0001 · [1, −1, −1, . . .],
where the vector of random +1 and −1 entries is d-dimensional. This recursive splitting of the
codebook has two advantages over the direct K-means method.
• Firstly, there is no need to specify the exact number of code vectors. In most real
applications, the maximum distortion level Dmax is known. The LBG algorithm
starts with one code vector and recursively splits it until D(C) ≤ Dmax.
• Secondly, the recursive splitting effectively avoids the formation of empty clusters,
which is very common in K-means.
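The steps of Figure A-3 can be sketched in NumPy as follows. This is illustrative only: for simplicity we initialize the single code vector at the weighted mean rather than at a random point, and draw the split signs once per level.

```python
import numpy as np

def wlbg(X, w, d_max=1e-3, l_max=6, delta=1e-3, eps=1e-4, seed=0):
    """Weighted LBG: recursively split the codebook and run weighted Lloyd updates."""
    rng = np.random.default_rng(seed)
    # Step 2 (simplified): start from the weighted mean instead of a random point
    C = (w[:, None] * X).sum(axis=0, keepdims=True) / w.sum()
    for level in range(l_max + 1):
        prev = np.inf
        while True:
            # Step 3a: nearest code vector for every point (eq. A-2)
            idx = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            # Step 3b: weighted centroid update (eq. A-3)
            for j in range(len(C)):
                members = idx == j
                if members.any():
                    C[j] = (w[members, None] * X[members]).sum(0) / w[members].sum()
            # Step 3c: weighted distortion (eq. A-1)
            D = (w * ((X - C[idx]) ** 2).sum(-1)).sum()
            if D <= d_max or prev - D < delta:   # Step 3d: convergence check
                break
            prev = D
        if D <= d_max or level == l_max:
            return C
        # Step 4: split each code vector into c + eps and c - eps
        pert = eps * rng.choice([-1.0, 1.0], size=C.shape)
        C = np.vstack([C + pert, C - pert])
    return C
```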
Novel weighting factor. Since the spikes correspond to large
magnitudes, the farther a data point is from the origin, the more likely it is to belong to
the spike region. Further, the information at the tip of the spike should be modeled well, since
the amplitude of the spike is an important feature in spike sorting. Thus, to reconstruct spike
information as accurately as possible, we need to give more weight to the points far
from the origin. Thus, we select the weighting for our algorithm as shown below.
w_i = \begin{cases} \|x_i\|^2 & \text{if } \|x_i\|^2 \ge \tau \\ \tau & \text{if } \|x_i\|^2 < \tau, \end{cases} (A–4)
where τ is a small constant that prevents the weighting from going to zero. Though an
arbitrary choice of τ would do, we can make an intelligent selection. Note that we can
estimate the standard deviation σ of the noise from the data, which corresponds to the
dense Gaussian cluster at the origin in Figure A-2. Since 2σ covers the 95 percent confidence
interval of the Gaussian noise, we can set τ = (2σ)², giving the same weight to
all points belonging to the Gaussian noise. For a higher dimensional embedding like L = 5
or L = 10, one can use τ = (√L σ)², which ensures a tighter cluster boundary. In our
experiment σ = 0.05, and so we set τ = 0.01.
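In code, (A–4) is simply a floor on the squared norm (the helper name is ours):

```python
import numpy as np

def spike_weights(X, tau=0.01):
    """Weighting factor of (A-4): squared norm of each point, floored at tau."""
    return np.maximum((X ** 2).sum(axis=1), tau)
```

With σ = 0.05 as above, τ = (2σ)² = 0.01, so every point inside the noise ball receives the same weight.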
A.2.2 Review of SOM and SOM-DL Algorithms
The self organizing map (SOM) is based on competitive learning. The goal is
to learn the nonlinear mapping between the data in the input space and a one- or two-
dimensional fully connected lattice of neurons in an adaptive and topologically ordered
fashion [Haykin, 1999]. Each processing element (PE) in the lattice of M PEs has a
corresponding synaptic weight vector with the same dimensionality as the
input space. At every iteration, the synaptic weight closest to each input vector x_k is
found, as shown in (A–5):

i^* = \arg\min_{1 \le i \le M} \| x_k - w_i \|. (A–5)
Having found the winner PE for each xk, a topological neighborhood is determined
around the winner neuron. The weight vector of each PE is then updated as
wi,k+1 = wi,k + ηkΛi,k(xk − wi,k), (A–6)
where ηk ∈ [0, 1] is the learning rate. The topological neighborhood is typically defined
as Λi,k = exp(−‖ri−ri∗‖2
2σ2k
)where ‖ri − ri∗‖ represents the euclidean distance in the
110
output lattice between ith PE and the winner PE. Notice that both learning rate (ηk)
and the neighborhood width (σk) are time dependent and are normally annealed for best
performance.
When applying the SOM to neural data, it was found that most of the PEs were used to
model the noise rather than the spikes in the data. This is typical of any neural recording,
which generally has a sparse number of spikes. In order to alleviate this problem, and to
move PEs from the low-amplitude region of the state space to the region corresponding to the
spikes, the following update rule was proposed:
wi,k+1 = wi,k + µΛi,ksign(xk − wi,k)(xk − wi,k)2. (A–7)
This was called self organizing map with dynamic learning (SOM-DL) [Cho et al., 2007].
By accelerating the movements of the PEs toward the spikes, the SOM-DL represents the
spikes better. But for good performance, careful tuning of the parameters is important.
For example, it was experimentally verified that µ between 0.05 and 0.5 balances between
fast convergence and small quantization error for the spikes. Further, it is well known that
SOM based algorithms are computationally very intensive.
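One SOM-DL update can be sketched as follows (the function name and lattice representation are our own; W holds the PE weight vectors and R their coordinates in the output lattice):

```python
import numpy as np

def som_dl_step(W, R, x, mu=0.1, sigma=1.0):
    """Apply the dynamic-learning rule (A-7) to all PEs for one input x."""
    i_star = np.argmin(((x - W) ** 2).sum(axis=1))            # winner PE (eq. A-5)
    lam = np.exp(-((R - R[i_star]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    diff = x - W
    # sign(diff) * diff**2 accelerates movement toward distant (spike) inputs
    W = W + mu * lam[:, None] * np.sign(diff) * diff ** 2
    return W, i_star
```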
A.3 Results
In this section, we present the results obtained by WLBG on the neural spike data
using the novel weighting factor developed in the previous section and compare it with
results obtained from SOM-DL and SOM.
A.3.1 Quantization Results
Figure A-4 shows the 16-point quantization obtained using WLBG on the training data.
As can be seen, more code vectors are used to model the points far away from the origin,
even though they are sparse. This helps to code the spike information in greater detail
and hence minimize reconstruction errors. On the other hand, SOM-DL wastes many
code vectors in modeling the noise cluster, as shown in Figure A-5. Further, not only does
SOM-DL have a large number of parameters that need to be fine-tuned for optimal
Figure A-4. 16-point quantization of the training data using the WLBG algorithm.
Figure A-5. Two-dimensional embedding of the training data and code vectors in a 5 × 5 lattice obtained using the SOM-DL algorithm.
performance, but it also takes an immense amount of time to train the network, making it
suitable only for offline training.
We test this on a separate test dataset generated to emulate a real neural spike signal. A
small region is highlighted in Figure A-6, which shows the comparison between the original
and the reconstructed signal. Clearly, the weighted LBG does a very good job of preserving
Figure A-6. Performance of the WLBG algorithm in reconstructing spike regions in the test data.
spike features. Also notice the suppression of noise in the non-spike region. This denoising
ability is one of the strengths of this algorithm and is attributed to the novel weighting
factor we selected.
We report the SNR obtained using WLBG, SOM-DL and SOM in Table A-1.
As can be seen, there is a huge increase of 15 dB in the SNR of the spike regions of the
test data compared to SOM-DL, which only marginally improves the SNR over SOM.
Obviously, by concentrating more on the spike region, our performance on the non-spike
region suffers, but the decrease is negligible compared to SOM-DL. It should be noted
that good reconstruction of the spike region is of utmost importance, and hence the only
measure which should be considered is the SNR in the spike region. Further, the result
reported here for WLBG is for 16 code vectors, far fewer than the 25 code vectors (5 × 5
lattice) used by the SOM-DL and SOM algorithms.
Table A-1. SNR of the spike region and of the whole test data obtained using the WLBG, SOM-DL and SOM algorithms.

SNR of              WLBG        SOM-DL      SOM
Spike region        31.12 dB    16.8 dB     14.6 dB
Whole test data     8.12 dB     8.6 dB      9.77 dB
A.3.2 Compression Ratio
We quantify here the theoretical compression ratio achievable by using the codebook
generated by WLBG. In order to do so, we use a test data consisting of 5 seconds of spike
data sampled at 20kHz and digitized to 16 bits of resolution. We use this to measure the
firing rate of the code vectors. Figure A-7 shows the probability of firing of the WLBG
codebook. Code vector 16 models the noisy part of the signal and hence fires most of the
time. It should be noted that in general, neural data has very sparse number of actual
spikes. The probability values for the code vectors is given in Table A-2.
Figure A-7. Firing probability of the WLBG codebook on the test data.
The entropy of this distribution is

H(C) = -\sum_{i=1}^{16} p_i \log(p_i) = 0.2141.

From information theory, we know that this is a lower bound on the average number of bits
needed to represent the codebook. Thus, with good coding like arithmetic codes, we can
get very close to this optimal value. Since we are using a 2-D non-overlapping embedding
of the signal sampled at 20 kHz, the number of bits needed to transmit the data is
(20k/2) × 0.2141 ≈ 2.141 kbps. If the data had been transmitted without any compression,
then the number of bits needed would be 20k × 16 = 320 kbps. Thus we achieve a compression
ratio of 150 and at the same time maintain a 32 dB SNR on the spike region. Further,
on real datasets, where a 10-D embedding is generally used, the compression ratio would
increase to 750, with only 428 bps needed to transmit the data. This is a significant
achievement and would help alleviate the bandwidth problem faced in transmitting data in
BMI experiments.

Table A-2. Probability values obtained for the code vectors and used in Figure A-7.

Code vector    Probability
1              0.0007
2              0.0017
3              0.0007
4              0.0014
5              0.0003
6              0.0007
7              0.0010
8              0.0193
9              0.0003
10             0.0018
11             0.0007
12             0.0045
13             0.0002
14             0.0015
15             0.0006
16             0.9647
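The rate calculation above can be reproduced from the firing probabilities of Table A-2 (a sketch; we follow the same logarithm convention as the quoted value of 0.2141):

```python
import numpy as np

def codebook_rate(p, fs=20000, L=2):
    """Entropy of the code-vector firing distribution and the resulting bit rate:
    one codebook index is sent per non-overlapping window of L samples."""
    p = np.asarray(p, dtype=float)
    H = -(p * np.log(p)).sum()
    return H, (fs / L) * H
```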
A.3.3 Real-time Implementation
This simple inner-product algorithm has recently been successfully implemented
on a low-power digital signal processor (DSP) with the aim of wirelessly transmitting raw
neuronal recordings. The hardware platform is a Pico DSP [Cieslewski et al., 2006] that
couples with the University of Florida’s existing technology, the Pico Remote (a battery
powered neural data processor, and a wireless transmitter) [Cheney et al., 2007]. This
board consists of a DSP from Texas Instruments (TMS320F28335), a low power Altera
Max II CPLD, and a Nordic Semiconductor’s ultra-low-power transceiver (nRF24L01).
The Pico DSP can provide up to 150 MIPS and can operate for nearly 4 hours in low
Figure A-8. A block diagram of the Pico DSP system.
power modes. This system is designed to have a maximum of 8 Pico Remotes, each with a
maximum of 8 channels (20 kHz, 12 bits of resolution) resulting in a total of 64 channels of
neural data. Figure A-8 shows the block diagram of this system.
The signal is sampled in real time using the A/D converters in the Pico Remotes. All
these daughterboards are connected to the CPLD, which buffers the data and controls
the flow to the DSP. The neural data captured by the DSP are then encoded using the
code vectors generated by the WLBG algorithm. When the wireless buffer is full of
compressed data, it is transmitted wirelessly using the DSP's onboard Serial Peripheral
Interface (SPI), which is connected to the wireless transceiver. A brief summary of the
results follows. This is ongoing work, and interested readers should refer
to [Goh et al., 2008].
Two different neural signals were used to test this architecture. The first is generated
by a Bionic Neural Simulator and has an ideal signal to noise ratio (SNR) of 55 dB.
The second signal is a real neuronal recording from a microwire electrode chronically
implanted in layer V of a rat motor cortex. This signal is more heterogeneous, has a lower
SNR (23 dB), and is representative of an average chronic neural recording. With the
simulated signal (with high SNR) and 32 code vectors, we were able to achieve a very
high compression ratio of 184 : 1 with a 15-D embedding. On the other hand, due to
the severe noise present in the in-vivo signal, a compression ratio of only 10 : 1 was possible
using a 2-D embedding and 64 code vectors. Nevertheless, the reconstructed signal modeled
the spike information extremely well. This real-time implementation is unique in terms of
the tradeoff between power and bandwidth requirements for BMI experiments and surpasses
the previous technology both in terms of efficiency and performance.
A.4 Summary
We have proposed the weighted LBG as an excellent compression algorithm for neural
spike data, with emphasis on good reconstruction of the spikes. The following are the
salient features of this algorithm compared to SOM-DL.
• A 15 dB increase in the SNR of spike regions in the data.
• A smaller and more efficient codebook, achieving a compression ratio of 150 : 1.
• A manyfold increase in speed: WLBG takes less than 1 second on a machine with a
Pentium IV and 512 MB RAM, versus SOM-DL, which takes more than 15 minutes.
• There are no parameters to tune in WLBG, unlike the SOM algorithms, which have a
step size, a neighborhood kernel size and their annealing parameters, all of which need
to be properly tuned. The only variables to be defined in WLBG are the weights w_i,
which have a clear interpretation for the application at hand.
• A real-time implementation of this algorithm is possible because its computations are
based on inner products, unlike the SOM algorithms. This has been demonstrated using
the Pico DSP system for a single channel, showing promising results.
There are a number of ways to build upon this work. We plan to construct an efficient
k-d tree search algorithm for the codebook, taking into account the weighting factor
and the probability of the code vectors; this would further speed up the algorithm. We
would also like to use advanced encoding techniques like entropy coding to achieve a bit
rate as close as possible to the theoretical value. Finally, we are pursuing a real-time
implementation of this algorithm to wirelessly transmit all 64 channels of the neural
data, which would lead to significant progress in BMI research.
APPENDIX B
SPIKE SORTING USING CAUCHY-SCHWARTZ PDF DIVERGENCE
The research summarized here proposes a new method of clustering neural spike
waveforms for spike sorting [Rao et al., 2006b]. After detecting the spikes using a
threshold detector, we use principal component analysis (PCA) to get the first few PCA
components of the data. Clustering on these PCA components is achieved by maximizing
the Cauchy Schwartz PDF divergence measure which uses the Parzen window method
to non-parametrically estimate the pdf of the clusters. We provide a comparison with
other clustering techniques used in spike sorting, such as K-means and the Gaussian mixture
model, showing the superiority of our method in terms of classification results and
computational complexity.
B.1 Introduction
The ability to identify and discriminate the activity of single neurons from neural
ensemble recordings is a vital and integral part of basic neuroscience research. By tracking
the modulation of the fundamental constituents of the nervous system, neurophysiologists
have begun to formulate the basic constructs of how systems of neurons interact and
communicate [Fetz, 1992]. Recently, this knowledge of neurophysiology has been applied
to Brain Machine Interface (BMI) experiments where multielectrode arrays are used
to monitor the electrical activity of hundreds of neurons from the motor, premotor,
and parietal cortices [Wessberg et al., 2000]. In multielectrode BMI experiments,
experimenters are faced with the labor intensive task of analyzing each of the extracellular
recordings for the signatures of electrical activity related to the neurons surrounding the
electrode tip. Separating these different neural sources, a task called "spike sorting,"
helps the neurophysiologist study and infer the role played by each individual neuron
with respect to the experimental task at hand.
Spike sorting is based upon the property that every neuron has its own characteristic
“spike” shape which is dependent on its intrinsic electrochemical dynamics as well as
the position of the electrode with respect to the neuron. The key is to separate these
spikes from the background noise and use features in each of the shapes to discriminate
different neurons. The analysis of this non-stationary signal is made more difficult by
the fluctuating effects of the electrode-tissue interface which is affected by movement,
glial encapsulation, and ionization [Sanchez et al., 2005]. To overcome these challenges,
many signal processing and machine learning techniques have been successfully applied
which are well summarized in [Lewicki, 1998]. Modern methods use multiple electrodes
and sophisticated techniques like Independent Component Analysis (ICA) to address the
issue of overlapping and similarity between spikes [Brown et al., 2004]. Nevertheless, in
many cases, good single-unit activity can be obtained with a simple hardware threshold
detector [Wheeler, 1999]. After detection, the classification is done using either template
matching or clustering of the principal components (PCA) of the waveforms [Lewicki,
1998; Wood et al., 2004]. The advantage with the PCA of the spike waveforms is that
it dynamically exploits differences in the variance of the waveshapes to discriminate and
cluster neurons.
A common clustering algorithm which is used extensively on the PCA of the
waveforms is the ubiquitous K-means [MacKay, 2003]. For spike sorting, the K-means
algorithm always clusters neurons, but there is no guarantee that it converges to
the optimum solution, leading to incorrect sorts. The result depends on the original
cluster centers (the random initialization problem), as well as on the fact that K-means
assumes hyperspherical or hyperellipsoidal clusters. Researchers have employed other
clustering techniques to overcome this problem. Lewicki [1994] used Bayesian clustering
to successfully classify neural signals. The advantage of the Bayesian framework is
that it is possible to quantify the certainty of the classification. Recently, Hillel et al.
[2004] extended this technique by automating the processes of detection and classification.
Other clustering techniques like the Gaussian Mixture Model (GMM) [Duda et al., 2000]
and Support Vector Machines (SVM) [Vogelstein et al., 2004] have also been applied
successfully. These techniques trade simplicity for accuracy and suffer
from high computational cost; consequently, they are not very suitable for online
classification of neural signals in low-power devices. Further, model order selection is a
difficult task in both GMM and Bayesian clustering. For these reasons, K-means is still
used extensively for its simplicity and ease of use.
In this section, we propose the Cauchy Schwartz PDF divergence measure for
clustering of neural spike data. We show that this method not only yields superior
results to K-means but also is computationally less expensive with O(N) complexity for
classifying online test samples.
B.2 Data Description
B.2.1 Electrophysiological Recordings
Extracellular cortical neuronal activity was collected from behaving animals using 50 µm
tungsten microelectrodes. A neural recording system sampling at 24,414.1 Hz was used to
digitize the extracellular analog voltages with 16 bits of resolution. To emphasize the
frequencies contained in action potentials, the raw waveforms were bandpass filtered
between 300Hz and 6kHz. Representative microelectrode recordings are shown in
Figure B-1. Here, the action potentials from two neurons can be identified (with asterisks)
by the differences in amplitude and width of the waveshapes.
B.2.2 Neuronal Spike Detection
The voltage threshold was set by the experimenter using Spike 2 (CED, UK) through
visual inspection of the digitized time series. For the example given in Figure B-1, a
threshold of 25 µV is sufficient to detect each of the two waveshapes. A set of unique
waveshapes was constructed from the thresholded waveforms based upon the width, which
was measured from 0.6 ms to the left to 1.5 ms to the right of the threshold crossing. Using
electrophysiological parameters (amplitude and width) of the spike, artifact signals (e.g.,
electrical noise, movement artifact) were removed. The peak-to-peak amplitude, waveform
shape, and interspike interval (ISI) were then evaluated to ensure that the detected spikes
had a characteristic and distinct waveform shape when compared with other waveforms
in the same channel.

Figure B-1. Example of extracellular potentials from two neurons.

Next, the first ten principal components (PC) were computed from
all of the detected spike waveforms. Of all the PCs, only the first two had eigenvalues
greater than 1, and these captured the majority of the variance1 . This feature space was
found to be sufficient to discriminate between the two neurons. The first two PCs are
plotted in Figure B-2 where two overlapping clusters of points correspond to each of the
detected waveshapes. The challenge here is to cluster each of the neurons using automated
techniques.
B.3 Theory
We provide a brief summary of the non-parametric clustering method based on the
Cauchy-Schwartz (CS) PDF divergence measure. For a detailed treatment of this topic,
readers are advised to refer to the Ph.D. dissertation of Jenssen [2005].
B.3.1 Non-parametric Clustering using CS Divergence
Let p(u) and q(u) denote the probability density functions of two random variables
X and Y respectively. Since these functions are square integrable and are also strictly
1 The first three PCs accounted for more than 95% of the variance.
Figure B-2. Distribution of the two spike waveforms along the first and second principal components.
non-negative, by the Cauchy-Schwartz inequality the following condition holds:

\left( \int p(u) q(u) \, du \right)^2 \le \int p^2(u) \, du \int q^2(u) \, du.

Using this inequality, a divergence measure can then be expressed as

D_{cs}(p \| q) = -\log J_{cs}(p, q), \quad
J_{cs}(p, q) = \frac{\int p(u) q(u) \, du}{\sqrt{\int p^2(u) \, du \int q^2(u) \, du}}. (B–1)
Note that Jcs(p, q) ∈ (0, 1] and hence the divergence measure Dcs(p||q) ≥ 0 with equality iff
the two pdfs are the same.
Given the sample population {x1, x2, . . . , xN1} and {y1, y2, . . . , yN2} of the random
variable X and Y , one can estimate this divergence measure directly from the data using
ideas from information theoretic learning [Prıncipe et al., 2000]. The readers are referred
to Chapter 2 for a detailed derivation of the following non-parametric estimator,
J_{cs} = \frac{\frac{1}{N_1 N_2} \sum_{i,j=1}^{N_1, N_2} G_{ij,\sigma^2 I}}{\sqrt{\frac{1}{N_1^2} \sum_{i,i'=1}^{N_1} G_{ii',\sigma^2 I} \; \frac{1}{N_2^2} \sum_{j,j'=1}^{N_2} G_{jj',\sigma^2 I}}}.
Here, G_{ij,σ²I} = G_{σ²I}(x_i − y_j), with G being a Gaussian kernel. Further, G_{ii′,σ²I} =
G_{σ²I}(x_i − x_{i′}) and G_{jj′,σ²I} = G_{σ²I}(y_j − y_{j′}). Careful observation of J_cs shows that it
is the ratio of the between-cluster information potential to the within-cluster information
potentials. Since entropy and information potential are inversely related, minimizing J_cs
maximizes the between-cluster entropy and at the same time minimizes the within-cluster
entropies of the two clusters. This in turn maximizes D_cs(p||q), thus providing a
clustering solution with maximum divergence between the two pdfs.
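The estimator above can be sketched directly in NumPy (the kernel normalization constants cancel in the ratio, so they are omitted; the function names are ours):

```python
import numpy as np

def cs_divergence(X, Y, sigma):
    """Cauchy-Schwartz PDF divergence from samples via Gaussian (Parzen) kernels."""
    def ip(A, B):
        # mean pairwise kernel value: cross-information potential of the two sets
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    j_cs = ip(X, Y) / np.sqrt(ip(X, X) * ip(Y, Y))
    return -np.log(j_cs)   # D_cs = -log J_cs, zero iff the two sample pdfs coincide
```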
To recursively find the best labeling of each data point so as to minimize J_cs, we first
have to express the above quantity as

J_{cs} = \frac{\frac{1}{2} \sum_{i,j=1}^{N} (1 - m_i^T m_j) G_{ij,\sigma^2 I}}{\sqrt{\sum_{i,j=1}^{N} (1 - m_{i1} m_{j1}) G_{ij,\sigma^2 I} \; \sum_{i,j=1}^{N} (1 - m_{i2} m_{j2}) G_{ij,\sigma^2 I}}}.
The quantity m_i is the fuzzy membership vector of sample x_i, i = 1, 2, . . . , N, with
N = N_1 + N_2, and m_{ik}, k = 1, 2, are the elements of m_i. If x_i (or y_j) truly belongs to
cluster C_1 (C_2), then the crisp membership vector is given by m_i = [1, 0] ([0, 1]). Note that
J_cs is a function of these membership vectors and hence can be written as J_cs(m_1, m_2, . . . , m_N).
With m_{ik}, k = 1, 2, initialized to any value in the interval [0, 1], the optimization
problem can be formulated as

\min_{m_1, m_2, \ldots, m_N} J_{cs}(m_1, m_2, \ldots, m_N) \quad \text{subject to} \quad m_j^T \mathbf{1} - 1 = 0, \; j = 1, 2, \ldots, N.

Note that the constraint ensures that the elements of each vector m_i sum to unity.
In order to derive an iterative fixed-point update rule, we make the change of
variable m_{ik} = v_{ik}^2, k = 1, 2. The optimization problem now becomes

\min_{v_1, v_2, \ldots, v_N} J_{cs}(v_1, v_2, \ldots, v_N) \quad \text{subject to} \quad v_j^T v_j - 1 = 0, \; j = 1, 2, \ldots, N.
Using the method of Lagrange multipliers, we have

L = J_{cs}(v_1, v_2, \ldots, v_N) + \sum_{j=1}^{N} \lambda_j (v_j^T v_j - 1),

where λ_j, j = 1, 2, . . . , N, are the Lagrange multipliers. Differentiating with respect to v_j
and λ_j for j = 1, 2, . . . , N, we get the following iterative scheme:

v_j = -\frac{1}{2\lambda_j} \frac{\partial J_{cs}}{\partial v_j}, \qquad
\lambda_j = \frac{1}{2} \sqrt{ \left(\frac{\partial J_{cs}}{\partial v_j}\right)^T \frac{\partial J_{cs}}{\partial v_j} }. (B–2)
(B–2)
After convergence of the algorithm, crisp membership vectors are obtained by setting the maximum element of each $\mathbf{m}_i$, $i = 1, 2, \ldots, N$, to one and the rest to zero.
This technique implements a constrained gradient descent search with a built-in variable step size for each coordinate direction. The generalization to more than two clusters can be found in [Jenssen, 2005]. Jenssen also demonstrated the superior performance of this algorithm compared to GMM and fuzzy K-means (FKM) on many non-convex clustering problems. For $d$-dimensional data, the kernel size is calculated using Silverman's rule of thumb, given by equation B–3.
$$\sigma_{opt} = \sigma_X\left\{4N^{-1}(2d+1)^{-1}\right\}^{\frac{1}{d+4}}, \quad \text{(B–3)}$$
where $\sigma_X^2 = d^{-1}\sum_i \Sigma_{X,ii}$ and $\Sigma_{X,ii}$ are the diagonal elements of the sample covariance matrix. To avoid local minima, we anneal the kernel size from $2\sigma_{opt}$ to $0.5\sigma_{opt}$ over a period of 100 iterations. Calculating the gradients requires $O(N^2)$ computations. To reduce the complexity, we stochastically sample the membership space by randomly selecting $M$ membership vectors, where $M \ll N$. Thus, the complexity of the algorithm drops to $O(NM)$ per iteration.
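The fixed-point scheme above can be sketched numerically. The following NumPy illustration is a simplified stand-in for the actual implementation: it uses a small synthetic two-blob dataset, a finite-difference gradient in place of the analytic one, a fixed kernel size instead of annealing, and updates all $N$ membership vectors each pass rather than a random subset of $M$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 2-D PCA features: two well-separated blobs.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
N = len(X)

sigma = 0.5  # fixed kernel size (the text anneals it, starting from Silverman's rule)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-D2 / (2 * sigma ** 2))  # Gram matrix G_{ij, sigma^2 I}

def J_cs(V):
    """CS clustering cost written in terms of m_i = v_i**2, as in the text."""
    M = V ** 2
    num = 0.5 * ((1.0 - M @ M.T) * G).sum()
    d1 = ((1.0 - np.outer(M[:, 0], M[:, 0])) * G).sum()
    d2 = ((1.0 - np.outer(M[:, 1], M[:, 1])) * G).sum()
    return num / np.sqrt(d1 * d2)

# Fuzzy memberships initialised in [0, 1]; normalising each row of V
# enforces the constraint v_j^T v_j = 1, i.e. the m_jk sum to one.
V = rng.uniform(0.2, 0.8, (N, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)

eps = 1e-5
for _ in range(30):
    base = J_cs(V)
    grad = np.zeros_like(V)
    for j in range(N):          # forward-difference gradient w.r.t. each v_j
        for k in range(2):
            Vp = V.copy()
            Vp[j, k] += eps
            grad[j, k] = (J_cs(Vp) - base) / eps
    # Fixed-point update (B-2): lambda_j = ||dJ/dv_j||/2, v_j = -dJ/dv_j / (2 lambda_j)
    lam = 0.5 * np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
    V = -grad / (2.0 * lam)

labels = np.argmax(V ** 2, axis=1)  # crisp memberships: max element -> 1, rest -> 0
```

Note how the update automatically renormalizes each $\mathbf{v}_j$ to unit length, which is exactly the role of the Lagrange multiplier $\lambda_j$.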
B.3.2 Online Classification of Test Samples
For online spike sorting, the goal is to assign each test point to the cluster whose within-cluster entropy has changed the least. This rule automatically ensures the maximization of the between-cluster entropy, and hence of the CS pdf divergence measure, for every new test point. The kernel size $\sigma_{opt}$ can be estimated from the training samples that have already been classified: the kernel size $\sigma_k$, $k = 1, 2$, of each individual cluster is calculated from B–3 using the corresponding training samples, and the two values are averaged to obtain an estimate of $\sigma_{opt}$.
Since information potential and entropy are inversely related, we assign the test point to the cluster which incurs the maximum change in information potential. The change in information potential of clusters $C_1$ and $C_2$ due to a new test point $x_t$ is given by
$$\Delta V_{C_k} = \sum_{j=1}^{N_k} G_{\sigma_{opt}^2 I}(x_t - x_j), \quad k = 1, 2. \quad \text{(B–4)}$$
Thus, the classification rule can be summed up as
$$\text{If } \Delta V_{C_1} > \Delta V_{C_2} \;\Rightarrow\; \text{classify as a cluster 1 sample},$$
$$\text{If } \Delta V_{C_2} > \Delta V_{C_1} \;\Rightarrow\; \text{classify as a cluster 2 sample}. \quad \text{(B–5)}$$
Note that this computation takes just O(N) calculations, and hence is much more efficient
than calculating the change in the entire CS divergence measure.
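The rule in B–4 and B–5 fits in a few lines of NumPy. The clusters and the value of $\sigma_{opt}$ below are hypothetical placeholders, not values from the experiment:

```python
import numpy as np

sigma_opt = 0.5  # kernel size estimated from the training clusters (assumed here)

def classify(x_t, C1, C2, sigma=sigma_opt):
    """Assign test point x_t to the cluster whose information potential
    changes the most (equations B-4 and B-5); O(N) per test point."""
    def delta_V(cluster):
        d2 = ((cluster - x_t) ** 2).sum(axis=1)          # squared distances
        return np.exp(-d2 / (2 * sigma ** 2)).sum()      # Gaussian kernel sums
    return 1 if delta_V(C1) > delta_V(C2) else 2

# Usage with two hypothetical, well-separated training clusters:
rng = np.random.default_rng(1)
C1 = rng.normal(0.0, 0.3, (50, 2))
C2 = rng.normal(3.0, 0.3, (50, 2))
print(classify(np.array([0.1, -0.2]), C1, C2))  # prints 1 (nearest to C1)
```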
B.4 Results
B.4.1 Clustering of PCA Components
The dataset consists of 2734 points, of which the first 500 were used in the training phase. The algorithm was fully automated and involved kernel annealing. Only one fourth of the data ($M = 0.25N$) was used for calculating the membership update, thus speeding up the algorithm significantly. Our algorithm took 6–9 seconds (Dell P4, 1.8 GHz, 512 MB RAM) to converge, giving good clustering results for almost every new initialization, as shown in Figure B-3A. As seen in Figure B-3B, K-means clearly fails to separate the two clusters, giving poor classification results. The main reason for this poor performance is the inherent assumption of hyperellipsoidal clusters in K-means, which is a wrong hypothesis for this application scenario.
B.4.2 Online Labeling of Neural Spikes
Using the classification rule presented in B–5, the remaining 2234 points were classified as belonging to one of the two clusters by comparing only against the training samples, as shown in Figure B-3C. For comparison, we have also plotted the training samples along with the test samples. The algorithm took 2–3 seconds to classify all 2234 test points. We also tried another method of comparison in which a new test point is compared not only to the training samples but also to the previously classified test points. This linearly increases the computational complexity, but generally gives better classification results. Nevertheless, in our case the two methods gave identical results, which may be due to the fact that the training samples defined the clusters sufficiently well.
B.5 Summary
With the advent of multielectrode data acquisition techniques, fast and efficient sorting of neural spike data is of utmost importance for monitoring the activity of ensembles of single neurons. The state of the art in neuronal waveform PCA analysis is faced with a clustering problem due to the electrochemical dynamics in the tissue. We have proposed a clustering technique that addresses waveform PCA distributions that are non-Gaussian and non-convex, arising from neural sources both near and far from the electrode. Clustering based on the Cauchy-Schwartz pdf divergence helps address these issues encountered in multielectrode BMI experiments and classifies each new incoming neural spike with $O(N)$ complexity, which is suitable for implementation in low-power portable hardware. With no parameters to be selected, this method additionally provides neurophysiologists with an easy and powerful tool for spike sorting. Future research
[Figure B-3. Comparison of clustering based on CS divergence with K-means for spike sorting. A) Clustering of training data using CS divergence. B) Clustering of training data using K-means. C) Online classification of test points (test and train samples for both clusters). Axes: PCA 1 (horizontal) vs. PCA 2 (vertical).]
involves extending this technique simultaneously to hundreds of electrodes and an extensive comparison with other methods such as ICA and Bayesian clustering.
APPENDIX C
A SUMMARY OF KERNEL SIZE SELECTION METHODS
We provide here a brief summary of some kernel size selection methods used in this thesis. There exists an extensive literature in this field, and we certainly do not intend to summarize all of it in this chapter. For a thorough treatment of this topic, the reader is directed to the following references [Katkovnik and Shmulevich, 2002; Wand and Jones, 1995; Fukunaga, 1990; Silverman, 1986].
One of the earliest approaches to finding an optimal kernel size is to minimize the mean square error (MSE) between the estimated and the true pdf. The MSE between the estimator $\hat{f}(x)$ and the true density $f(x)$ at a particular point $x$ is given by
$$\mathrm{MSE}\{\hat{f}(x)\} = E\{[\hat{f}(x) - f(x)]^2\},$$
where the expectation is taken over the random sample used to construct the estimator; the kernel size $\sigma$ is then chosen to minimize this quantity. This can be rewritten in terms of bias and variance as
$$\mathrm{MSE}\{\hat{f}(x)\} = [E\{\hat{f}(x)\} - f(x)]^2 + \mathrm{Var}\{\hat{f}(x)\}.$$
Instead of evaluating at a particular point $x$, we can integrate this quantity over the entire space to find an optimal $\sigma$. Since the bias and variance terms behave differently with respect to $\sigma$, one has to make the best compromise between the two, depicting the classic bias-variance dilemma:
$$\mathrm{MISE}\{\hat{f}\} = \int [E\{\hat{f}(x)\} - f(x)]^2\, dx + \int \mathrm{Var}\{\hat{f}(x)\}\, dx.$$
A bottleneck of this approach is the inherent assumption that the true density $f(x)$ is known. For the particular case of $N$ samples from a multivariate $d$-dimensional normal density, with some mild asymptotic assumptions, one can derive Silverman's rule of thumb [Silverman, 1986], shown in equation C–1:
$$\sigma^* = \sigma_X\left\{4N^{-1}(2d+1)^{-1}\right\}^{\frac{1}{d+4}}, \quad \text{(C–1)}$$
where $\sigma_X^2 = d^{-1}\sum_i \Sigma_{X,ii}$ and $\Sigma_{X,ii}$ are the diagonal elements of the sample covariance matrix. Due to the inherent normality assumption, this method generally yields a large kernel size, resulting in an oversmoothed density estimate.
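Silverman's rule is straightforward to implement. A minimal NumPy sketch (the function name and test data are illustrative, not from the thesis):

```python
import numpy as np

def silverman_kernel_size(X):
    """Silverman's rule of thumb (C-1) for an N x d data matrix X."""
    N, d = X.shape
    # sigma_X^2 = mean of the diagonal of the sample covariance matrix
    sigma_x = np.sqrt(np.mean(np.diag(np.cov(X, rowvar=False))))
    return sigma_x * (4.0 / (N * (2 * d + 1))) ** (1.0 / (d + 4))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
sigma_star = silverman_kernel_size(X)  # roughly 0.34 for standard normal data
```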
Another criterion of optimality is the maximum likelihood solution. Consider a random sample $\{x_1, \ldots, x_N\}$. The general Parzen pdf estimate can be formulated as
$$p(x) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma^2 C_i}(x - x_i),$$
where $C_i = \frac{1}{K}\sum_{j=1}^{K}(x_i^j - x_i)(x_i^j - x_i)^T$, with $\{x_i^j\}_{j=1,\ldots,K}$ being the $K$ nearest neighbors of $x_i$. To reduce the complexity, we can also set $C_i = \sigma_i^2 I$, where $\sigma_i$ is the mean or median of the distances of $x_i$ to its $K$ neighbors. $K$ is generally selected as $\sqrt{N}$ (or, in general, $N^{\frac{1}{p}}$ with $p > 1$), where $N$ is the number of data points, to ensure asymptotic unbiasedness of the estimate. The global scale $\sigma$ is then optimized to maximize the leave-one-out cross-validation likelihood:
$$\sigma^* = \arg\max_{\sigma} \sum_{j=1}^{N} \log \frac{1}{N} \sum_{i=1,\, i \neq j}^{N} G_{\sigma^2 C_i}(x_j - x_i). \quad \text{(C–2)}$$
The maximization is carried out as a brute-force line search. A good upper limit is the kernel size from Silverman's rule, since it tends to be an overestimate. With a log-spaced line search, the $\sigma$ that yields the maximum value in equation C–2 is selected as a good approximation to the true optimum.
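The line search of C–2 can be sketched as follows, using the simplified choice $C_i = \sigma_i^2 I$ mentioned above; the grid limits and data are illustrative assumptions:

```python
import numpy as np

def loo_log_likelihood(X, sigma, s):
    """Leave-one-out log likelihood (C-2) with C_i = s_i^2 * I, so the kernel
    centred on sample i has covariance (sigma * s_i)^2 * I."""
    N, d = X.shape
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    var = (sigma * s) ** 2                               # per-sample variances
    K = np.exp(-D2 / (2 * var[None, :])) / (2 * np.pi * var[None, :]) ** (d / 2)
    np.fill_diagonal(K, 0.0)                             # enforce i != j
    return np.sum(np.log(K.sum(axis=1) / N + 1e-300))    # floor avoids log(0)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
N = len(X)

# Local scales s_i: mean distance to the K = sqrt(N) nearest neighbours.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
s = np.sort(D, axis=1)[:, 1:int(np.sqrt(N)) + 1].mean(axis=1)

# Brute-force, log-spaced line search for the global scale sigma.
grid = np.logspace(-1.5, 0.5, 40)
scores = [loo_log_likelihood(X, g, s) for g in grid]
sigma_star = grid[int(np.argmax(scores))]
```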
One could also use the $K$ nearest neighbor (KNN) technique to obtain a local kernel size estimate for each sample $x_i$. This local estimate $\sigma_i$ is computed as the mean distance to the $K$ nearest neighbors:
$$\sigma_i = \frac{1}{K}\sum_{j=1}^{K} \left\| x_i - x_i^j \right\|,$$
where $\{x_i^j\}_{j=1,\ldots,K}$ are the $K$ nearest neighbors of $x_i$. $K$ is generally set to $\sqrt{N}$, where $N$ is the number of data points, to ensure asymptotic unbiasedness of the estimate. An advantage of this multi-kernel technique is that one obtains an individual kernel size
appropriate for each sample $x_i$. Thus, the global pdf estimate is well tailored, taking into consideration all the local effects. Further, this technique is quite robust to outliers, since for an outlier $\sigma_i$ is quite large, smoothing out its effect on the pdf estimate. From this multi-kernel rule, we can also select a global kernel size as the median of all the local $\sigma_i$, that is,
$$\sigma^* = \mathrm{median}_i\, \sigma_i. \quad \text{(C–3)}$$
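A minimal sketch of this multi-kernel rule and the median-based global size of C–3 (function name and data are illustrative):

```python
import numpy as np

def knn_kernel_sizes(X):
    """Local kernel sizes: sigma_i = mean distance to the K = sqrt(N) nearest
    neighbours of x_i; the global size is their median (equation C-3)."""
    N = len(X)
    K = int(np.sqrt(N))
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    local = np.sort(D, axis=1)[:, 1:K + 1].mean(axis=1)  # column 0 is self (0.0)
    return local, np.median(local)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
local_sigmas, global_sigma = knn_kernel_sizes(X)
```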
REFERENCES
A. Amin. Structural description to recognising arabic characters using decision tree learning techniques. In T. Caelli, A. Amin, R. P. W. Duin, M. S. Kamel, and D. de Ridder, editors, SSPR/SPR, volume 2396 of Lecture Notes in Computer Science, pages 152–158. Springer, 2002. ISBN 3-540-44011-9.

H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.

M. S. Bartlett. Face Image Analysis by Unsupervised Learning. Springer, 2001.

S. Becker. Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 2:17–33, 1991.

S. Becker. Unsupervised learning with global objective functions. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 997–1000. MIT Press, Cambridge, MA, USA, 1998.

S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.

A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

R. E. Blahut. Computation of channel capacity and rate distortion function. IEEE Trans. on Information Theory, IT-18:460–473, 1972.

C. A. Bossetti, J. M. Carmena, M. A. L. Nicolelis, and P. D. Wolf. Transmission latencies in a telemetry-linked brain-machine interface. IEEE Transactions on Biomedical Engineering, 51(6):919–924, June 2004.

E. M. Brown, R. E. Kass, and P. P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456–461, May 2004.

T. Caelli and W. F. Bischof. Machine Learning and Image Interpretation. Springer, 1997.

M. A. Carreira-Perpinan. Mode-finding for mixtures of gaussian distributions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(11):1318–1323, November 2000.

M. A. Carreira-Perpinan. Fast nonparametric clustering with gaussian blurring mean-shift. In W. W. Cohen and A. Moore, editors, ICML, pages 153–160. ACM, 2006. ISBN 1-59593-383-2.

D. Cheney, A. Goh, J. Xu, K. Gugel, J. G. Harris, J. C. Sanchez, and J. C. Príncipe. Wireless, in vivo neural recording using a custom integrated bioamplifier and the pico system. In International IEEE/EMBS Conference on Neural Engineering, pages 19–22, May 2007.
Y. Cheng. Mean shift, mode seeking and clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.

J. Cho, A. R. C. Paiva, S. Kim, J. C. Sanchez, and J. C. Príncipe. Self-organizing maps with dynamic learning for signal reconstruction. Neural Networks, 20(11):274–284, March 2007.

G. Cieslewski, D. Cheney, K. Gugel, J. Sanchez, and J. C. Príncipe. Neural signal sampling via the low power wireless pico system. In Proceedings of IEEE EMBS, pages 5904–5907, August 2006.

D. A. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, March 1996.

D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(2):281–288, 2003.

D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.

D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 142–149, June 2000.

P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.

C. Connolly. The relationship between colour metrics and the appearance of three-dimensional coloured objects. Color Research and Applications, 21:331–337, 1996.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd edition. Wiley-Interscience, 2000.

D. Erdogmus. Information Theoretic Learning: Renyi's Entropy and its Applications to Adaptive System Training. PhD thesis, University of Florida, 2002.

D. Erdogmus and U. Ozertem. Self-consistent locally defined principal surfaces. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 15–20, April 2007.

M. Fashing and C. Tomasi. Mean shift is a bound optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):471–474, March 2005.

E. E. Fetz. Are movement parameters recognizably coded in the activity of single neurons. Behavioral and Brain Sciences, 15(4):679–690, March 1992.

R. Forsyth. Machine Learning: Principles and Techniques. Chapman and Hall, 1989.

K. S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1974.
K. Fukunaga. Statistical Pattern Recognition. Academic Press, 1990.

K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function with applications in pattern recognition. IEEE Trans. on Information Theory, 21(1):32–40, January 1975.

A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, 1991.

Z. Ghahramani. Unsupervised learning. In O. Bousquet, U. von Luxburg, and G. Ratsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 72–112. Springer, 2003. ISBN 3-540-23122-6.

A. Goh, S. Craciun, S. Rao, D. Cheney, K. Gugel, J. C. Sanchez, and J. C. Príncipe. Wireless transmission of neural recordings using a portable real-time discrimination/compression algorithm. In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society, August 2008.

L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific Computing, 24:79–94, 1991.

T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.

S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd edition. Prentice Hall, New Jersey, 1999.

A. B. Hillel, A. Spiro, and E. Stark. Spike sorting: Bayesian clustering of non-stationary data. In Proceedings of NIPS, December 2004.

D. Hume. An Enquiry Concerning Human Understanding. Oxford University Press, 1999.

A. Hyvarinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

R. Jenssen. An Information Theoretic Approach to Machine Learning. PhD thesis, University of Tromso, 2005.

C. Jutten and J. Herault. Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1–10, 1991.

V. Katkovnik and I. Shmulevich. Kernel density estimation with adaptive varying window size. Pattern Recognition Letters, 23:1641–1648, 2002.

B. Kegl. Principal curves: learning, design, and applications. PhD thesis, Concordia University, 1999.

B. Kegl and A. Krzyzak. Piecewise linear skeletonization using principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):59–74, 2002.
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

S. Lakshmivarahan. Learning Algorithms: Theory and Applications. Springer-Verlag, Berlin-Heidelberg-New York, 1981.

P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.

T. Lehn-Schiøler, A. Hegde, D. Erdogmus, and J. C. Príncipe. Vector-quantization using information theoretic concepts. Natural Computing, 4:39–51, Jan. 2005.

M. S. Lewicki. Bayesian modeling and classification of neural signals. Neural Computation, 6:1005–1030, May 1994.

M. S. Lewicki. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4):R53–R78, 1998.

Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. on Communications, 28(1):84–95, Jan. 1980.

R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105–117, March 1988a.

R. Linsker. Towards an organizing principle for a layered perceptual network. In D. Z. Anderson, editor, Neural Information Processing Systems - Natural and Synthetic. American Institute of Physics, New York, 1988b.

E. Lutwak, D. Yang, and G. Zhang. Cramer-rao and moment-entropy inequalities for renyi entropy and generalized fisher information. IEEE Transactions on Information Theory, 51(2):473–478, February 2005.

M. A. Carreira-Perpinan. Gaussian mean shift is an em algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776, May 2007.

D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.

A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proceedings of NIPS, 2002.

M. A. L. Nicolelis. Brain machine interfaces to restore motor function and probe neural circuits. Nature Reviews Neuroscience, 4:417–422, 2003.
M. A. L. Nicolelis, C. R. Stambaugh, A. Bristen, and M. Laubach. Methods for simultaneous multisite neural ensemble recordings in behaving primates. In Miguel A. L. Nicolelis, editor, Methods for Neural Ensemble Recordings, chapter 7, pages 157–177. CRC Press, 1999.

E. Oja. A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15:267–273, 1982.

E. Oja. Neural networks, principal components and subspaces. International Journal of Neural Systems, 1(1):61–68, 1989.

R. T. Olszewski. Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data. PhD thesis, Carnegie Mellon University, 2001.

E. Parzen. On the estimation of probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Berlin-Heidelberg-New York, 1977.

J. C. Príncipe, D. Xu, and J. Fisher. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering, pages 265–319. John Wiley, 2000.

D. Ramanan and D. A. Forsyth. Finding and tracking people from the bottom up. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, pages 467–474, June 2003.

M. Ranzato, Y. Boureau, S. Chopra, and Y. LeCun. A unified energy-based framework for unsupervised learning. In Proc. of the 11-th Int'l Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.

S. Rao, S. Han, and J. C. Príncipe. Information theoretic vector quantization with fixed point updates. In Proceedings of the International Joint Conference on Neural Networks, pages 1020–1024, August 2007a.

S. Rao, A. M. Martins, W. Liu, and J. C. Príncipe. Information theoretic mean shift algorithm. In Proceedings of IEEE Conf. on Machine Learning for Signal Processing, pages 155–160, Sept. 2006a.

S. Rao, A. M. Martins, and J. C. Príncipe. Mean shift: An information theoretic perspective. Submitted to Pattern Recognition Letters, 2008.

S. Rao, A. R. C. Paiva, and J. C. Príncipe. A novel weighted LBG algorithm for neural spike compression. In Proceedings of the International Joint Conference on Neural Networks, pages 1883–1887, August 2007b.

S. Rao, J. C. Sanchez, and J. C. Príncipe. Spike sorting using non parametric clustering via cauchy schwartz pdf divergence. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 881–884, May 2006b.
A. Renyi. On measure of entropy and information. In Proceedings 4th Berkeley Symp. Math. Stat. and Prob., volume 1, pages 547–561, 1961.

A. Renyi. Some fundamental questions of information theory. In Selected Papers of Alfred Renyi, pages 526–552. Akademia Kiado, 1976.

W. Rudin. Principles of Mathematical Analysis. McGraw-Hill Publishing Co., 1976.

J. Sanchez, J. Pukala, J. C. Príncipe, F. Bova, and M. Okun. Linear predictive analysis for targeting the basal ganglia in deep brain stimulation surgeries. In Proceedings of the 2nd Int. IEEE Workshop on Neural Engineering, Washington, 2005.

S. Sandilya and S. R. Kulkarni. Principal curves with bounded turn. IEEE Transactions on Information Theory, 48(10):2789–2793, 2002.

I. Santamaria, P. P. Pokharel, and J. C. Principe. Generalized correlation function: Definition, properties and application to blind equalization. IEEE Transactions on Signal Processing, 54(6):2187–2197, 2006.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183–190, 1992.

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

R. J. Vogelstein, K. Murari, P. H. Thakur, G. Cauwenberghs, S. Chakrabartty, and C. Diehl. Spike sorting with support vector machines. In Proceedings of IEEE EMBS, pages 546–549, September 2004.

M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.

S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, 1985.

J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin, J. Kim, S. J. Biggs, M. A. Srinivasan, and M. A. L. Nicolelis. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810):361–365, November 2000.

B. C. Wheeler. Automatic discrimination of single units. In Miguel A. L. Nicolelis, editor, Methods for Neural Ensemble Recordings, chapter 4, pages 61–78. CRC Press, 1999.

K. D. Wise, D. J. Anderson, J. F. Hetke, D. R. Kipke, and K. Najafi. Wireless implantable microsystems: high-density electronic interfaces to the nervous system. Proceedings of the IEEE, 92(1):76–97, January 2004.
F. Wood, M. Fellows, J. P. Donoghue, and M. J. Black. Automatic spike sorting for neural decoding. In Proceedings of IEEE EMBS, pages 4009–4012, September 2004.

G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.

C. Yang, R. Duraiswami, and L. Davis. Efficient mean-shift tracking via a new similarity measure. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 176–183, June 2005.
BIOGRAPHICAL SKETCH
Sudhir Madhav Rao was born in Hyderabad, India, in September of 1980. He received his B.E. from the Department of Electronics and Communications Engineering, Osmania University, India, in 2002. He obtained his M.S. in electrical and computer engineering in 2004 from the University of Florida, with emphasis in the area of signal processing. In the fall of 2004, he joined the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida and started working towards his Ph.D. under the supervision of Dr. Jose C. Príncipe. He received his Ph.D. in the summer of 2008. His primary interests include signal processing, machine learning and adaptive systems. He is a member of Eta Kappa Nu, Tau Beta Pi and IEEE.