
UNSUPERVISED LEARNING: AN INFORMATION THEORETIC FRAMEWORK

By

SUDHIR MADHAV RAO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008


© 2008 Sudhir Madhav Rao


To Parents, Teachers and Friends


ACKNOWLEDGMENTS

I would like to take this opportunity to thank my advisor Dr. José C. Príncipe for his

constant, unwavering support and guidance throughout my stay at CNEL. He has been

a great mentor pulling me out of many local minima (in the language of CNEL!). I still

wonder how he works non-stop from morning to evening without lunch, and I am sure this

feeling is shared among many of my colleagues. In short, he has been an inspiration and

continues to be so.

I would like to express my gratitude to all my committee members, Dr. Murali Rao,

Dr. John G. Harris, and Dr. Clint Slatton, for readily agreeing to be part of my committee.

They have helped immensely in improving this dissertation with their inquisitive nature

and helpful feedback. I would like to especially thank Dr. Murali Rao, my math mentor,

for keeping a vigil on all my notations and bringing sophistication to my engineering mind!

Special mention is also needed for Julie, the research coordinator at CNEL, for constantly

monitoring the pressure level at the lab and making us smile even if it is for a short while.

My past as well as present colleagues at CNEL need due acknowledgement. Without

them, I would have been shooting questions to walls and mirrors! I have grown to

appreciate them intellectually, and I thank them for being there for me when I needed

them most. It has been a pleasure to play ball on and off the basketball court!

Finally, it would be foolish on my part not to acknowledge my parents and sister for

their constant support and never-ending love. Thanks go out to one and all who have

helped me to complete this journey successfully. It has been an adventure and a fun ride!


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 THEORETICAL FOUNDATION . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.1 Information Theoretic Learning . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.1 Renyi's Quadratic Entropy . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Cauchy-Schwartz Divergence . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3 Renyi's Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 A NEW UNSUPERVISED LEARNING PRINCIPLE . . . . . . . . . . . . . . . 36

3.1 Principle of Relevant Entropy . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.1 Case 1: β = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.2 Case 2: β = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.3 Case 3: β → ∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4 APPLICATION I: CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Review of Mean Shift Algorithms . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1 GMS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 GBMS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Mode Finding Ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6 Clustering Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7.1 GMS vs GBMS Algorithm . . . . . . . . . . . . . . . . . . . . . . . 62
4.7.2 GMS vs Spectral Clustering Algorithm . . . . . . . . . . . . . . . . 63


4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5 APPLICATION II: DATA COMPRESSION . . . . . . . . . . . . . . . . . . . . 75

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Review of Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Toy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Face Boundary Compression . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Image Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 APPLICATION III: MANIFOLD LEARNING . . . . . . . . . . . . . . . . . . . 88

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 A New Definition of Principal Curves . . . . . . . . . . . . . . . . . . . . 90
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.1 Spiral Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.2 Chain of Rings Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7 SUMMARY AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . 96

7.1 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Future Direction of Research . . . . . . . . . . . . . . . . . . . . . . . . . 101

APPENDIX

A NEURAL SPIKE COMPRESSION . . . . . . . . . . . . . . . . . . . . . . . . . 105

A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.2.1 Weighted LBG (WLBG) Algorithm . . . . . . . . . . . . . . . . . . 108
A.2.2 Review of SOM and SOM-DL Algorithms . . . . . . . . . . . . . . 110
A.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.3.1 Quantization Results . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.3.2 Compression Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.3.3 Real-time Implementation . . . . . . . . . . . . . . . . . . . . . . . 115

A.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

B SPIKE SORTING USING CAUCHY-SCHWARTZ PDF DIVERGENCE . . . . 119

B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.2.1 Electrophysiological Recordings . . . . . . . . . . . . . . . . . . . . 121
B.2.2 Neuronal Spike Detection . . . . . . . . . . . . . . . . . . . . . . . 121
B.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.3.1 Non-parametric Clustering using CS Divergence . . . . . . . . . . . 122
B.3.2 Online Classification of Test Samples . . . . . . . . . . . . . . . . . 126


B.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.4.1 Clustering of PCA Components . . . . . . . . . . . . . . . . . . . . 126
B.4.2 Online Labeling of Neural Spikes . . . . . . . . . . . . . . . . . . . 127

B.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

C A SUMMARY OF KERNEL SIZE SELECTION METHODS . . . . . . . . . . 130

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140


LIST OF TABLES

Table page

5-1 J(X) and MSQE on the face dataset for the three algorithms averaged over 50 Monte Carlo trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

A-1 SNR of spike region and of the whole test data obtained using WLBG, SOM-DL and SOM algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

A-2 Probability values obtained for the code vectors and used in Figure A-7. . . . . 115


LIST OF FIGURES

Figure page

1-1 The process of learning showing the feedback between a learner and its associated environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1-2 A new framework for unsupervised learning. . . . . . . . . . . . . . . . . . . . . 25

2-1 Information force within a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 29

2-2 Cross information force between two datasets. . . . . . . . . . . . . . . . . . . . 35

3-1 A simple example dataset. A) Crescent shaped data. B) pdf plot for σ2 = 0.05. 45

3-2 An illustration of the structures revealed by the Principle of Relevant Entropy for the crescent shaped dataset for σ2 = 0.05. As the value of β increases we pass through the modes, principal curves and, in the extreme case of β → ∞, we get back the data itself as the solution. . . . . . . . . . . . . . . . . . . . . 46

4-1 Ring of 16 Gaussian dataset with different a priori probabilities. The numbering of the clusters is in anticlockwise direction starting with center (1, 0). A) R16Ga dataset. B) a priori probabilities. C) probability density function estimated using σ2 = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4-2 Modes of R16Ga dataset found using GMS and GBMS algorithms. A) Good mode finding ability of GMS algorithm. B) Poor mode finding ability of GBMS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4-3 Cost function of GMS and GBMS algorithms. . . . . . . . . . . . . . . . . . . . 58

4-4 Renyi's “cross” entropy H(X; S) computed for GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases since the assumption of two distinct phases of convergence does not hold true in general. . . . . . . 59

4-5 A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σg contour plots. C) probability density estimated using σ2 = 0.01. . . . . . . . . 59

4-6 Segmentation results of RGC dataset. A) GMS algorithm. B) GBMS algorithm. 61

4-7 Averaged norm distance moved by particles in each iteration. . . . . . . . . . . 61

4-8 Baseball image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4-9 Baseball image segmentation using GMS and GBMS algorithms. The left column shows results from GMS for various different number of segments and the kernel size at which it was achieved. The right column similarly shows the results from GBMS algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


4-10 Performance comparison of GMS and spectral clustering algorithms on RGC dataset. A) Clustering solution obtained using GMS algorithm. B) Segmentation obtained using spectral clustering algorithm. . . . . . . . . . . . . . . . . . . . 66

4-11 Bald Eagle image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4-12 Multiscale analysis of bald eagle image using GMS algorithm. . . . . . . . . . . 69

4-13 Performance of spectral clustering on bald eagle image. . . . . . . . . . . . . . . 70

4-14 Two popular images from Berkeley image database. A) Flight image. B) Tiger image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4-15 Multiscale analysis of Berkeley images using GMS algorithm. . . . . . . . . . . . 71

4-16 Performance of spectral clustering on Berkeley images. . . . . . . . . . . . . . . 72

4-17 Comparison of GMS and spectral clustering (SC) algorithms with a priori selection of parameters. A new post processing stage has been added to GMS algorithm where close clusters were recursively merged until the required number of segments was achieved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5-1 Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution. . . . . . . . . . . . . . . . . . . . . . . 79

5-2 Performance of ITVQ-FP and ITVQ-gradient methods on half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results. . . 80

5-3 Effect of annealing the kernel size on ITVQ fixed point method. . . . . . . . . . 82

5-4 64 points quantization results of ITVQ-FP, ITVQ-gradient method and LBG algorithm on face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5-5 Two popular images selected for image compression application. A) Baboon image. B) Peppers image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5-6 Portions of Baboon and Pepper images used for image compression. . . . . . . . 85

5-7 Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A,B) 99.75% compression for both the algorithms, σ2 = 36 for ITVQ-FP algorithm. C,D) 90% compression, σ2 = 16. E,F) 95% compression, σ2 = 16. G,H) 85% compression, σ2 = 16. . . . . . . . 86

6-1 The principal curve for a mixture of 5 Gaussians using numerical integration method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6-2 The evolution of principal curves starting with X initialized to the original dataset S. The parameters were set to β = 2 and σ2 = 2. . . . . . . . . . . . . 93


6-3 The principal curve passing through the modes (shown with black squares). . . . 94

6-4 Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) Two components of J(X) namely H(X) and H(X; S). . . . . . . . . . . . . . . . . . . . . . . . 94

6-5 Denoising ability of principal curves. A) Noisy 3D “chain of rings” dataset. B) Result after denoising. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7-1 A novel idea of unfolding the structure of the data in the 2-dimensional space of β and σ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7-2 The idea of information bottleneck method. . . . . . . . . . . . . . . . . . . . . 103

A-1 Synthetic neural data. A) Spikes from two different neurons. B) An instance of the neural data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

A-2 2-D embedding of training data which consists of total 100 spikes with a certain ratio of spikes from the two different neurons. . . . . . . . . . . . . . . . . . . 107

A-3 The outline of weighted LBG algorithm. . . . . . . . . . . . . . . . . . . . . . . 109

A-4 16 point quantization of the training data using WLBG algorithm. . . . . . . . . 112

A-5 Two-dimensional embedding of the training data and code vectors in a 5 × 5 lattice obtained using SOM-DL algorithm. . . . . . . . . . . . . . . . . . . . . 112

A-6 Performance of WLBG algorithm in reconstructing spike regions in the test data. 113

A-7 Firing probability of WLBG codebook on the test data. . . . . . . . . . . . . . . 114

A-8 A block diagram of Pico DSP system. . . . . . . . . . . . . . . . . . . . . . . . . 116

B-1 Example of extracellular potentials from two neurons. . . . . . . . . . . . . . . . 122

B-2 Distribution of the two spike waveforms along the first and the second principal components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

B-3 Comparison of clustering based on CS divergence with K-means for spike sorting. 128


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

UNSUPERVISED LEARNING: AN INFORMATION THEORETIC FRAMEWORK

By

Sudhir Madhav Rao

August 2008

Chair: José C. Príncipe
Major: Electrical and Computer Engineering

The goal of this research is to develop a simple and unified framework for unsupervised

learning. As a branch of machine learning, this constitutes the most difficult scenario

where a machine (or learner) is presented only with the data without any desired output

or reward for the action it takes. All that the machine can do, then, is to somehow capture

all the underlying patterns and structures in the data at different scales. In doing so, the

machine has extracted maximal information from the data in an unbiased manner, which

it can then use as a feedback to learn and make decisions.

Until now, the different facets of unsupervised learning, namely clustering, principal

curves and vector quantization have been studied separately. This is understandable

from the complexity of the field and the fact that no unified theory exists. Recent

attempts to develop such a theory have been mired in complications. One of the

issues is the imposition of models and structures on the data rather than extracting them

directly through self organization of the data samples. Another reason is the lack of

non-parametric estimators for information theoretic measures. Gaussian approximations

generally follow, which fail to capture the higher order features in the data that are really

of interest.

In this thesis, we present a novel framework for unsupervised learning called the

Principle of Relevant Entropy. By using concepts from information theoretic learning,

we formulate this problem as a weighted combination of two terms - an information


preservation term and a redundancy reduction term. This information theoretic

formulation is a function of only two parameters. The user defined weighting parameter

controls the task (and hence the type of structure) to be achieved whereas the inherent

hidden scale parameter controls the resolution of our analysis. By including such a user

defined parameter, we allow the learning machine to influence the level of abstraction

to be extracted from the data. The result is the unraveling of “goal oriented structures”

unique to the given dataset.

Using information theoretic measures based on Renyi’s definition, we estimate these

quantities directly from the data. One can derive a fixed point update scheme to optimize

this cost function, thus avoiding Gaussian approximation altogether. This leads to

an interaction of data samples, giving us a new self organizing paradigm. An added benefit

is the lack of unnecessary parameters (like step size) common in other optimization

approaches. The strength of this new formulation can be judged from the fact that the

existing mean shift algorithms appear as special cases of this cost function. By going

beyond the second order statistics and dealing directly with the probability density

function, we effectively extract maximal information to reveal the underlying structure

in the data. By bringing clustering, principal curves and vector quantization under one

umbrella, this powerful principle truly discovers the underlying mechanism of unsupervised

learning.


CHAPTER 1
INTRODUCTION

1.1 Background

The functioning of the human brain has baffled scientists for a long time now.

The desire to unravel the mystery has led us to imitate this underlying mechanism in

machines so as to make them more intelligent and adaptable to the environment. Machine

Learning, the field which particularly deals with these questions, can be defined as the

science of the artificial [Langley, 1996]. It is a confluence of numerous fields ranging from

cognitive science and neurobiology to engineering, statistics and optimization theory to

name a few. Its goal involves modeling the mechanism underlying human learning and

constantly corroborating it through applications to real world problems using empirical and

mathematical approaches. This branch of science can thus be considered as the field of

research devoted to the formal study of learning systems [Ghahramani, 2003].

The human mind has been a subject of interest to many great philosophers. More than

2000 years ago, Greek philosophers like Plato and Aristotle discussed the concept

of the universal [Watanabe, 1985]. A universal simply means a concept or a general

notion. They argued that the way to find a universal is to find “forms” or “patterns”

from exemplars. During the 16th century Francis Bacon furthered inductive reasoning

as central to understanding intelligence [Forsyth, 1989]. But the scientific disposition to

these philosophical ideas was not laid until 1748 when Hume [1999] discovered the rule

of induction. Meanwhile, deductive and formal reasoning gained increasing popularity,

especially among logicians. Working with symbols and heuristic rule based algorithms, the

first knowledge based systems were built. This led to the birth of Artificial Intelligence

(AI) in the mid-1950s, which can be considered primarily as a science of deduction. The

idea behind these domain specific expert systems was to explore the role and organization of

knowledge in memory. Borrowing heavily from the cognitive science literature, research on

knowledge representation, natural language and expert systems dominated this era.


It was increasingly realized that such heuristic rule based symbolic methods led to

highly complex systems which could not be easily generalized to other similar applications.

Further, the external world does not contain symbols but signals, which need to be processed

and interpreted before any deductive reasoning could be applied. The role of pattern

recognition, signal processing and more importantly inductive inference to understand the

underlying learning principles started gaining attention by the mid-1960s. Learning

was regarded as central to intelligent behavior and was thus seen as a way to explain the

general mechanism behind one’s cognitive ability and the resulting actions. Researchers in

the area of pattern recognition began to emphasize algorithmic, numerical methods that

contrasted sharply with the heuristic, symbolic methods associated with the AI paradigm.

The gap between these two fields continued to increase due to differences in the

fundamental notion of intelligence. It was not until the late 1970s that a renewed interest

emerged which spawned the field of Machine Learning. This was necessitated by the need

to do away with the overwhelming number of domain specific expert systems and an urge to

understand better the underlying principles of learning which govern these systems. Many

new methods were proposed with emphasis in the area of planning, diagnosis and control.

Work on real world problems provided strong clues to the success of these methods. The

incremental nature of this field helped it to grow tremendously over the past 30 years, with

a greater tendency to build on previous systems. This field today occupies a central position

in our quest to answer many unsolved mysteries of man and machine alike. Borrowing

ideas both from AI and pattern recognition, this field of science has sought to unravel the

central role of learning in intelligence.

Learning can be broadly defined as improvement in performance in some environment

through the acquisition of knowledge resulting from experience in that environment [Langley,

1996]. Thus, learning never occurs in isolation. A learner always finds himself attached to

an environment actively gathering data for knowledge acquisition. With a goal in mind,

the learner finds patterns in data relevant to the task at hand and takes appropriate


[Block diagram: Environment and Learner connected by Data and Actions.]

Figure 1-1. The process of learning showing the feedback between a learner and its associated environment.

actions resulting in more data from the environment. This feedback enriches the learner’s

knowledge thus improving his performance. Figure 1-1 depicts this relationship.

There are many aspects of the environment that affect learning, but the most important

aspect that influences learning is the degree of supervision involved in performing the task.

Broadly, three cases emerge. In the first scenario called supervised learning, the machine

is provided with a sequence of data samples along with the corresponding desired outputs

and the goal is to learn the mapping so as to produce correct output for a given new test

sample. Another scenario is called reinforcement learning where the machine does not

have any desired output values to learn the mapping directly, but gets scalar rewards

(or punishments) for every action it takes. The goal of the machine then, is to learn and

perform actions so as to maximize its rewards (or in turn reduce its punishments). The

final and the most difficult scenario is called unsupervised learning where the machine

neither has outputs nor rewards and hence is left only with the data.

This thesis is about unsupervised learning. At first, it may seem unclear as to

what the machine can possibly do just with the data at hand. Nevertheless, there


exists a structure to the input data and by finding relevant patterns one can learn to

make decisions and predict the future outputs. Such learning characteristics are found

throughout biological systems. According to Barlow [1989],

statistical redundancy contains information about the patterns and regularities

of sensory stimuli. Structure and information exist in external stimulus and it

is the task of perceptual system to discover this underlying structure.

In order to do so, we need to extract as much information as possible from the data. It

is here that unsupervised learning relies heavily on statistics and information theory to

dynamically extract information and capture the underlying patterns and regularities in

the data.

1.2 Motivation

Traditionally, unsupervised learning has been viewed as consisting of three different

facets. Clustering, the first facet, is a statistical data analysis technique which deals with

the classification of data into similar objects or clusters. One of the most popular methods

is the K-means technique where Euclidean distance between samples is used to quantify

similarity. Other techniques like hierarchical clustering methods recursively join (or split)

clusters based on proximity measures to form larger (or smaller) clusters [Duda et al.,

2000]. Some other examples include Gaussian mixture models (GMM) [MacKay, 2003]

and recent kernel based spectral clustering methods [Shi and Malik, 2000; Ng et al., 2002]

which have become quite popular in the vision community. Clustering is used extensively in

the areas of data mining, image segmentation and bioinformatics.
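
As a point of reference for the K-means technique mentioned above, the following is a minimal sketch of the standard iteration (assign each sample to its nearest center by Euclidean distance, then recompute the means); the synthetic data, number of clusters and initialization are illustrative assumptions and not part of this chapter.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd) iteration: assign by Euclidean distance, then update means."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center (Euclidean distance quantifies similarity).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the samples assigned to it.
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Illustrative usage on synthetic 2-D data drawn from three Gaussian clusters.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in [(-2, 0), (0, 2), (2, 0)]])
centers, labels = kmeans(data, k=3)
```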

The second aspect is the notion of principal curves. This can be seen as a non-linear

counterpart of principal component analysis (PCA) where the goal is to find a lower

dimensional embedding appropriate to the intrinsic dimensionality of the data. This

topic was first addressed by Hastie and Stuetzle [1989], who defined principal curves broadly as

“self-consistent” smooth curves which pass through the “middle” of a d-dimensional

probability distribution or data cloud. Using regularization to avoid overfitting, Hastie’s


algorithm attempts to minimize the average squared distance between the data points and the

curve. Other ideas have been proposed, including definitions based on mixture models

by Tibshirani [1992] and the recent popular work of Kegl [1999]. Due to their ability

to find the intrinsic dimensionality of the data manifold, principal curves have immense

applications in dimensionality reduction, manifold learning and denoising.
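
For reference, the self-consistency condition of Hastie and Stuetzle [1989] paraphrased above can be stated as follows, where λ_f(x) denotes the projection index of a point x onto the curve f:

f(\lambda) = E\left[ X \mid \lambda_f(X) = \lambda \right],

that is, every point of a principal curve is the conditional mean of the observations that project onto it.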

Vector quantization, the third component of unsupervised learning, is a compression

technique which deals with efficient representation of the data with few code vectors

by minimizing a distortion measure [Gersho and Gray, 1991]. A classic example is the

Linde-Buzo-Gray (LBG) [Linde et al., 1980] technique which minimizes the mean square

error between the code vectors and the data. Another popular and widely used technique

is Kohonen’s self organizing maps (SOM) [Kohonen, 1982; Haykin, 1999] which not only

represents the data efficiently but also preserves the topology by a process of competition

and cooperation between neurons. Speech compression, video coding and communications

are some of the fields where vector quantization is used extensively.

Until now, a common framework for unsupervised learning has been missing. The

three different aspects, namely clustering, principal curves and vector quantization, have

been studied separately. This is quite understandable from the breadth of the field and

the unique niche of applications these facets serve. Nevertheless, one could see them

under the same purview using ideas from statistics and information theory. For example,

it is well known that the probability density function (pdf) subsumes all the information

contained in the data. Clusters of data samples would correspond to high density regions

of the data highlighted by the peaks in the distribution. The underlying manifold is

naturally revealed by the ridge lines of the pdf estimate connecting these clusters, while

vector quantization can be seen as preserving directly the pdf information of the data.

Concepts from information theory would help us quantify these pdf features through scalar

descriptors like entropy and divergence. By doing so, we not only gain the advantage of

going beyond the second order statistics and capturing all the information in the data, but


we also reveal these and other hidden structures unique to the given dataset. In summary,

such a unifying theory would bring these discordant yet highly overlapping notions

together under the same purview. Altogether, it would advance the learning theory by

bringing new mathematical insights and interrelations between different facets of learning.

1.3 Previous Work

The role of statistics and information theory has been widely accepted now in

unraveling the basic principles behind unsupervised learning. The earliest approaches were

based on building generative models to effectively describe the observed data. The

parameters of these generative models are adjusted to optimize the likelihood of the

data with constraints on model architecture. Bayesian inference models and maximum

likelihood competitive learning are examples of this line of thought [Bishop, 1995;

MacKay, 2003]. Many of these models employ the objective of minimum description

length. Minimizing the description length of the model forces the network to learn

economic representations that capture the underlying regularities in the data. The

downside of this approach is the imposition of external models on the data which

may artificially constrain data description. It has been observed that when the model

assumption is wrong, these methods perform poorly leading to wrong inferences.

Modern approaches have proposed computational architectures inspired by the

principle of self organization of biological systems. The most notable contribution in this

regard was made by Linsker [1988a,b] with the introduction of the “Infomax” principle.

According to this principle, in order to find relevant information one needs to maximize

the information rate R of a network constrained by its architecture. This rate R is defined

as the mutual information between noiseless input X and the output Y , that is

R = I(X; Y ) = H(Y )−H(Y |X).

Interestingly, in a multilayer feedforward network one needs to only ensure that information

rate R is maximized between each subsequent layer. This would in turn ensure the


maximization of the overall information transfer. Since H(Y |X) quantifies the output

noise which is independent of the network weights, maximization of R is equivalent to

maximization of the output entropy H(Y ). In particular, for the Gaussian scenario and a

simple linear network, the cost function reduces to maximization of the output variance, giving

us a Hebbian rule as a stochastic online update. Thus, Principal Component Analysis

(PCA) and its Hebbian algorithms [Oja, 1982, 1989] could be seen in terms of information

maximization leading to minimum reconstruction error.
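
To make this special case concrete, the toy sketch below (our own illustration; the data, step size and single pass over the samples are assumptions) implements the Hebbian update commonly attributed to Oja, whose weight vector converges toward the direction of maximum output variance, i.e., the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy zero-mean 2-D data with one dominant direction of variance (illustrative).
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)
eta = 1e-3                          # small step size for stability
for x in X:
    y = w @ x                       # output of a single linear unit
    w += eta * y * (x - y * w)      # Oja's rule: Hebbian term y*x with built-in normalization

w /= np.linalg.norm(w)
print(w)   # aligns (up to sign) with the dominant axis [1, 0]
```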

A different way to explain the functioning of sensory information processing is the

principle of minimum redundancy proposed by Barlow [1989]. Since the external stimuli

entering our perceptual system are highly redundant, it would be desirable to extract

independent features which would facilitate subsequent learning. The learning of an event

E then requires the knowledge of only the M conditional probabilities of E given each

feature yi, instead of the conditional probabilities of E given all possible combinations of

N sensory stimuli. In short, an N-dimensional problem is reduced to M one-dimensional

problems. A natural consequence of this idea is the Independent Component Analysis

(ICA) [Jutten and Herault, 1991; Comon, 1994; Hyvarinen and Oja, 2000]. Unfortunately,

the Shannon entropy definition makes these information theoretic quantities difficult to

compute. Under the Gaussian assumption, one could instead reduce the correlation among

the outputs, but this would only remove second order correlations and not address higher

order dependencies.

Pioneering work in the area of unsupervised learning has been done by Becker

[1991]. By framing the problem as building global objective functions under optimization

framework, Becker [1998] has shown that many current methods can be seen as energy

minimizing functions, thus elucidating the similarity among different aspects of learning

theory. In an attempt to go beyond preprocessing stages of sensory systems, Becker

and Hinton [1992] proposed the Imax principle for unsupervised learning which dictates

that signals of interest should have high mutual information across different sensory


input channels. By discovering properties of input that are coherent across space and

time through maximization of mutual information between different channels, one could

extract relevant information to build more abstract representations of the data. Using

a multilayer neural network with the assumption that the signal is Gaussian and the noise

is additively distributed, Becker was able to discover higher order image features such as

stereo disparity in random dot stereograms [Becker and Hinton, 1992].

These information theoretic coding principles have significantly advanced our

understanding of unsupervised learning. The information preservation

and redundancy reduction strategies have provided important clues about the initial

preprocessing stages required to extract relevant features for learning. There exist, though,

two major bottlenecks in fully realizing the potential of these theories. The first

constraint is the way the objective function is framed. Most of the previous approaches

either preserve maximal information or aim at a redundancy reduction strategy. Some

authors have even tried to use one to achieve the other, thus concentrating on the special

case of equivalence. For example, Bell and Sejnowski [1995] used “Infomax” to achieve

ICA using a nonlinear network under the special case when the nonlinearity matches the

cumulative distribution function (cdf) of the independent components. In doing so, a lot of rich

intermediate structures are lost.

The second constraint is the information theoretic nature of the cost function which

makes it difficult to derive self organizing rules. All these models share the general

feature of modeling the brain as a communication channel and applying concepts from

information theory. Since the Shannon definitions of information theoretic measures are hard to

compute in closed form, Gaussian approximations generally follow, giving rules which fail

to really capture the higher order features that are of interest. A recent exception is the

information bottleneck (IB) method proposed by Tishby et al. [1999]. Instead of trying to

encode all the information in a random variable S, the authors only quantify the relevant

or meaningful information by introducing another variable Y which is dependent on S, so


that the mutual information I(S; Y ) > 0. Thus, the information that S provides about Y

is passed through a “bottleneck” formed by a compact codebook X. Using ideas from lossy

source compression, in particular the rate distortion theory, one can get self consistent

equations which are recursively iterated to minimize a variational objective function.

Nevertheless, the method assumes knowledge of the joint distribution p(s, y), which may

not be available in many applications.
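
For reference, the variational problem behind the information bottleneck, in the notation used above (S the source, Y the relevance variable and X the compact codebook), is the standard Lagrangian of Tishby et al. [1999], minimized over the stochastic mapping p(x|s):

\mathcal{L}[p(x|s)] = I(S;X) - \beta\, I(X;Y),

where the multiplier β trades the compactness of the codebook against the relevant information it retains about Y.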

1.4 Contribution of this Thesis

Learning is unequivocally connected to finding patterns in data. As Forsyth [1989]

puts it, learning is 99% pattern recognition and 1% reasoning. If we therefore wish to

develop a unified framework for unsupervised learning, we need to address this central

theme of capturing the underlying structure in the data. To put it simply, complex patterns

can be described hierarchically (and quite often recursively) in terms of simpler

structures [Pavlidis, 1977]. This is a fundamentally new and promising approach for it

tries to understand the developmental process that generated the observed data itself.

Put succinctly, structures are regularities in the data with interdependencies between

subsystems [Watanabe, 1985]. This immediately tells us that white noise is not a structure

since knowledge of the partial system does not give us any idea of the whole system.

In short, structures have rigidity with interdependencies and are not amorphous. Since

the entropy of the overall system with interdependent subsystems is smaller than the sum

of the entropies of partial subsystems, structures can be seen as entropy minimizing

entities. By doing so, they reduce redundancy in the representation of the data. Thus,

the formulation of structures can be cast according to the principle of minimum entropy. But

mere minimization of entropy would not allow these structures to capture any relevant

information about the data, leading to trivial solutions. This formulation therefore needs to

be modified as minimization of entropy with a constraint to quantify a certain level of

information about the data. By framing this problem as a weighted combination of a

redundancy reduction term and an information preservation term, we propose a novel


unsupervised learning principle called the Principle of Relevant Entropy. This is one of

the major contributions of this thesis.

A learning machine is always associated with a goal [Lakshmivarahan, 1981].

In unsupervised scenarios where the machine only has the data, goals are especially

crucial since they help differentiate relevant from irrelevant information in the data. By

seeking “goal oriented structures” then, the machine can effectively quantify the relevant

information in the data as compact representations needed as a first step for successful

and effective learning. This is another contribution of our thesis. Notice that goals can

exist at many levels of learning. In this thesis, we address the lower level vision tasks

like clustering (for differentiating objects), principal curves (for denoising) and vector

quantization (for compression) which help in decoding patterns from the data relevant to

the goal at hand. By doing so, one can extract an internally derived teaching signal, thus

converting an unsupervised problem into a supervised framework.

A key aspect needed for the success of this methodology is to compute the information

theoretic measures of the global cost function directly from the data. Though the Shannon

definition is popularly used, few of its measures have non-parametric estimators which are easy

to compute, leading to simplifications using the Gaussian assumption. In this thesis, we do

away with the Shannon definition of information theoretic measures and use instead ideas

from Information Theoretic Learning (ITL) based on Renyi's entropy [Príncipe et al.,

2000; Erdogmus, 2002]. This is the third contribution of this thesis. Using information

theoretic principles based on Renyi’s definition [Renyi, 1961, 1976], we estimate different

information theoretic measures directly from the data non-parametrically. These scalar

descriptors go beyond the second order statistics to reveal higher order features in the

data.

Interestingly, one can derive a fixed point update rule for self-organization of the data

samples to optimize the cost function, thus avoiding Gaussian approximations altogether.

Borrowing an analogy from physics, in this adaptive context, one can view the data


samples as “particles” and the self organizing rules as local forces. With the “goal”

embedded in the cost, these local forces define the rules of interaction between the data

samples. Self organization of these particles should then reveal the structure in the data

relevant to the goal. Such self organizing, fixed point update rules are the final but

important contribution of this thesis.

To summarize, this thesis proposes a number of key contributions which further the

theory of unsupervised learning:

• To formulate structure in terms of the minimum entropy principle with a constraint to quantify a certain level of information about the data.

• Seeing unsupervised learning as extracting “goal oriented structures” from the data. By doing so, we convert an unsupervised problem into a supervised framework.

• To estimate different information theoretic measures of the formulation directly from data using ITL theory.

• To derive a simple and fast fixed point update rule so as to reveal the underlying structure through self-organization of the data.

This new understanding of unsupervised learning can be summarized with a block

diagram as shown in Figure 1-2. The important idea behind this schematic is the goal

oriented nature of unsupervised learning. In this framework, the neuro-mapper can be

seen as working on physics based principles. Using an analogy with biological systems,

data from the external environment entering our sensory inputs are processed to extract

important features for learning. Based on the goal of the learner, a self organizing rule

is initiated in the neuro-mapper to find appropriate structures in the extracted features.

Though a small portion, an important step is critical reasoning, which helps one to judge

how useful these patterns are in achieving the goal. This results in a dual correction; first

to take appropriate actions in the environment so as to actively generate useful data and

second, to change the functioning of the neuro-mapper itself. This cyclical loop is essential

for effective learning and to adapt in a changing environment. An optional correction may

be used to update the goal itself which would change the direction of learning. What is


[Block diagram with blocks: Environment, Learner, Neuro-mapper, Self organizing principle, Structures or patterns, Critical Reasoning, Goal, Data, Actions.]

Figure 1-2. A new framework for unsupervised learning.

interesting in this approach is that the learner’s block is converted into a classic supervised

learning framework. This block represents an open system constantly interacting with the

environment. Together with the environment, the overall scheme is a closed system where

the principles of conservation are applicable [see Watanabe, 1985, chap. 6].

A natural outcome of this thesis is a simple yet elegant framework bringing different

facets of unsupervised learning namely clustering, principal curves and vector quantization

under one umbrella, thus providing a new outlook to this exciting field of science. In the

end, we would like to present a quote from Ghahramani [2003], which best summarizes

this field and is a constant source of motivation throughout this work:

In a sense, unsupervised learning can be thought of as finding patterns in the

data above and beyond what would be considered pure unstructured noise.


CHAPTER 2
THEORETICAL FOUNDATION

2.1 Information Theoretic Learning

Let X ∈ Rd be a random variable in a d-dimensional space. Consider N independent

and identically distributed (iid) samples {x1, x2, . . . , xN} of this random variable. Given

this sample set of the population, a non-parametric estimator of the probability density

function (pdf) of X can be obtained using the Parzen window technique [Parzen, 1962] as

shown below.

p_X(x) = \frac{1}{N h^d} \sum_{j=1}^{N} K\left(\frac{x - x_j}{h}\right), \qquad (2–1)

where K is a kernel satisfying the following conditions:

• \sup_{y \in \mathbb{R}^d} |K(y)| < \infty

• \int_{\mathbb{R}^d} |K(y)| \, dy < \infty

• \lim_{\|y\| \to \infty} \|y\|^d K(y) = 0

• \int_{\mathbb{R}^d} K(y) \, dy = 1

The kernel size h is actually a function of the number of samples N, represented as h(N).

Analysis of the estimator pX(x) along with the four properties above gives the following

results,

• If \lim_{N \to \infty} h(N) = 0, then p_X(x) is an asymptotically unbiased estimator.

• If, in addition, h(N) satisfies \lim_{N \to \infty} N h(N) = \infty, then the estimator p_X(x) is also a consistent estimate of the true pdf f(x).
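
A minimal numerical sketch of the estimator in 2–1 with a Gaussian kernel is given below; the dataset, kernel size and evaluation grid are arbitrary choices made only for illustration.

```python
import numpy as np

def parzen_pdf(x_eval, samples, h):
    """Parzen window estimate (eq. 2-1) at the points x_eval, using a Gaussian kernel K."""
    x_eval = np.atleast_2d(x_eval)          # (M, d) evaluation points
    samples = np.atleast_2d(samples)        # (N, d) iid samples of X
    N, d = samples.shape
    u = (x_eval[:, None, :] - samples[None, :, :]) / h               # (M, N, d) scaled differences
    K = np.exp(-0.5 * np.sum(u**2, axis=2)) / (2 * np.pi)**(d / 2)   # Gaussian kernel values
    return K.sum(axis=1) / (N * h**d)

# Illustrative use: 1-D samples from a bimodal mixture, kernel size h = 0.3.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 0.8, 300)])[:, None]
grid = np.linspace(-4, 4, 9)[:, None]
print(parzen_pdf(grid, data, h=0.3))        # density estimate on the grid
```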

We particularly consider the Gaussian kernel, given by G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2}, for our analysis. The advantage of this kernel selection is two-fold. First, it is a smooth, continuous and infinitely differentiable kernel, as required for our purposes (shown later). Second, the Gaussian kernel has the special property that the convolution of two Gaussians gives


another Gaussian with variance equal to the sum of the original variances. That is,

\int G_\sigma(x - x_i) \, G_\sigma(x - x_j) \, dx = G_{\sqrt{2}\sigma}(x_i - x_j).

As would be seen later, this property is important in developing non-parametric estimators

for Renyi’s information theoretic measures.
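
This property is easy to verify numerically in one dimension; the short sketch below (with arbitrary illustrative values of σ, x_i and x_j) compares a brute-force evaluation of the integral with the closed form G_{√2 σ}(x_i - x_j).

```python
import numpy as np

def G(x, sigma):
    """One-dimensional Gaussian kernel G_sigma(x)."""
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

sigma, xi, xj = 0.7, 0.3, -1.1                  # illustrative values
x = np.linspace(-20, 20, 200001)                # fine grid for the numerical integral
dx = x[1] - x[0]
lhs = np.sum(G(x - xi, sigma) * G(x - xj, sigma)) * dx   # brute-force integral
rhs = G(xi - xj, np.sqrt(2) * sigma)                     # closed form from the property
print(lhs, rhs)                                 # the two numbers agree to high precision
```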

2.1.1 Renyi’s Quadratic Entropy

Renyi [1961, 1976] proposed a family of entropy measures characterized by order-α.

Renyi’s order-α entropy is defined as

H_\alpha(X) = \frac{1}{1 - \alpha} \log\left(\int f^\alpha(x) \, dx\right). \qquad (2–2)

For the special case of α → 1, Renyi’s α-entropy tends to Shannon definition, that is,

\lim_{\alpha \to 1} H_\alpha(X) = -\int f(x) \log f(x) \, dx = H_S(X).

Therefore, Hα(X) can be seen as a more general definition of entropy measure. One can

also show that this measure is a monotone decreasing function of α. This can be proved

by taking the first derivative and showing that this is always negative through Jensen’s

inequality. To summarize,

H0<α<1(X) > HS(X) > Hα>1(X).

Now, consider the special case of α = 2. Renyi's quadratic entropy¹ is defined as

H(X) = -\log\left(\int f^2(x) \, dx\right). \qquad (2–3)

Using p_X(x) = \frac{1}{N} \sum_{j=1}^{N} G_{\sigma_X}(x - x_j) as the Parzen estimate of f(x), along with the property of the Gaussian kernel, in 2–3, we can obtain a non-parametric estimator² of Renyi's

¹ We drop the α = 2 subscript for convenience.

² Note that \hat{H}(X) is an estimator of H(X) and so a different notation is used.


quadratic entropy given by

\hat{H}(X) = -\log\left(\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \int G_{\sigma_X}(x - x_i) \, G_{\sigma_X}(x - x_j) \, dx\right)
           = -\log\left(\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma}(x_i - x_j)\right), \qquad (2–4)

where σ² = 2σ_X². Notice the argument of the Gaussian kernel, which considers all possible

pairs of samples. The idea of regarding the samples as information particles was first

introduced by Príncipe and collaborators [Príncipe et al., 2000; Erdogmus, 2002] upon

realizing that these samples interact with each other through laws that resemble the

potential field and their associated forces in physics.
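
A direct implementation of the pairwise estimator in 2–4 takes only a few lines; the sketch below is our own illustration (data and kernel size are arbitrary), with the kernel variance set to σ² = 2σ_X² as in the text and the Gaussian kernel written in its d-dimensional product form.

```python
import numpy as np

def renyi_quadratic_entropy(X, sigma_x):
    """Estimator of Renyi's quadratic entropy (eq. 2-4)."""
    X = np.atleast_2d(X)                        # (N, d) samples
    N, d = X.shape
    sigma2 = 2.0 * sigma_x**2                   # kernel variance after the convolution property
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)          # pairwise squared distances
    G = np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2)**(d / 2)   # pairwise Gaussian kernel values
    V = G.sum() / N**2                          # argument of the log in 2-4 (defined below as V(X))
    return -np.log(V)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                   # illustrative 2-D dataset
print(renyi_quadratic_entropy(X, sigma_x=0.5))
```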

Since the log is a monotonic function, any optimization based on H(X) can be

translated into optimization of the argument of the log, which we denote by V (X), and

call it the information potential of the samples. We can consider this quantity as an

average of the contributions from each particle xi as shown below. Note that V (xi) is the

potential field over the space of the samples, with an interaction law given by the kernel

shape.

V(X) = \frac{1}{N} \sum_{i=1}^{N} V(x_i), \qquad V(x_i) = \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j).

The derivative of V (xi) gives us the net information force acting on xi due to all other

samples xj . This force is attractive pulling the particle xi in the direction of xj due to

the vector term (xj − xi) in F (xi|xj). The vector summation of all these individual forces

would then give us a net force F (xi).

\frac{\partial}{\partial x_i} V(x_i) = \frac{1}{N} \sum_{j=1}^{N} G_\sigma(x_i - x_j) \, \frac{(x_j - x_i)}{\sigma^2}
\;\Rightarrow\; F(x_i) = \frac{1}{N} \sum_{j=1}^{N} F(x_i | x_j). \qquad (2–5)
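
The net force in 2–5 can be computed for all samples at once; a short sketch under the same illustrative assumptions follows.

```python
import numpy as np

def information_forces(X, sigma):
    """Net information force F(x_i) on every sample (eq. 2-5); attractive, toward the other samples."""
    X = np.atleast_2d(X)                            # (N, d)
    N, d = X.shape
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j] = x_j - x_i
    sq = np.sum(diff**2, axis=2)
    G = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2)**(d / 2)
    # F(x_i) = (1/N) * sum_j G_sigma(x_i - x_j) * (x_j - x_i) / sigma^2
    return (G[:, :, None] * diff).sum(axis=1) / (N * sigma**2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))                # illustrative samples
print(information_forces(X, sigma=0.5)[:3])         # each force vector points inward
```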


This force field is neatly demonstrated in Figure 2-1. Being inherently attractive,

the direction of the inter-force field on each sample always points inwards (towards other

samples) as shown in this plot.


Figure 2-1. Information force within a dataset.

2.1.2 Cauchy-Schwartz Divergence

In information theory, the concept of relative entropy is a measure to quantify

the difference between two probability distributions f and g corresponding to random

variables X and Y . In the case of Shannon definition, the relative entropy or Kullback-Leibler

(KL) divergence is uniquely defined as

D_{KL}(f\|g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx. \qquad (2–6)

It could be seen as a measure of uncertainty when one approximates the “true” distribution

f(x) with another distribution g(x). Thus, DKL(f ||g) can also be rewritten as

D_{KL}(f\|g) = -\int f(x) \log g(x) \, dx + \int f(x) \log f(x) \, dx = H_S(X;Y) - H_S(X). \qquad (2–7)


HS(X; Y)³ is called the cross entropy and can be seen as the expected gain of information Ef(x)[− log g(x)] when observing g(x) with respect to the “true” distribution f(x).

On the contrary, in the case of Renyi’s definition, there exists more than one way of

representing the divergence measure. Two such definitions were given by Renyi himself

[Renyi, 1976]. The more popular one is written as

D_{RKL-1}(f\|g) = \frac{1}{\alpha - 1} \log \int \frac{f^\alpha(x)}{g^{\alpha-1}(x)} \, dx. \qquad (2–8)

For the special case of α → 1,

\lim_{\alpha \to 1} D_{RKL-1}(f\|g) = D_{KL}(f\|g).

One of the drawbacks of this definition is that, when evaluating it, if g(x) is zero at some values of x where f(x) is nonzero, the expression becomes infinite. Recently, another definition of Renyi's KL divergence was proposed by Lutwak et al. [2005]. The relative entropy can be written as follows,

\[
D_{RKL-2}(f\|g) = \log\frac{\left(\int g^{\alpha-1} f\right)^{\frac{1}{1-\alpha}}\left(\int g^{\alpha}\right)^{\frac{1}{\alpha}}}{\left(\int f^{\alpha}\right)^{\frac{1}{\alpha(1-\alpha)}}}. \qquad \text{(2–9)}
\]

Note that the denominator in the argument of the log contains an integral evaluation. So

f(x) could be zero at some points but overall the integral is well defined, thus avoiding

numerical issues of the previous definition. Again, for α → 1, this gives DKL(f ||g). In

particular, for α = 2, one can rewrite 2–9 as

\[
D_{cs}(f\|g) = -\log\frac{\int f(x)\,g(x)\,dx}{\sqrt{\left(\int f^2(x)\,dx\right)\left(\int g^2(x)\,dx\right)}}. \qquad \text{(2–10)}
\]

We call this the Cauchy-Schwartz divergence since the same could also be derived using

the Cauchy-Schwartz inequality [Rudin, 1976]. Note that the argument is always ∈ (0, 1]

3 We use a semicolon instead of a comma to differentiate this quantity from the joint entropy H_S(X, Y) = −∫∫ f(x, y) log f(x, y) dx dy.


and therefore, D_cs(f‖g) ≥ 0. Further, D_cs(f‖g) is symmetric, unlike many other divergence measures. It does not satisfy the triangle inequality, though, and hence is not a distance metric. To summarize, the following properties hold:

• Dcs(f ||g) ≥ 0 for all f and g.

• Dcs(f ||g) = 0 iff f = g.

• Dcs(f ||g) is symmetric i.e. Dcs(f ||g) = Dcs(g||f).

2.1.3 Renyi’s Cross Entropy

Similar to the relation between KL divergence and Shannon entropy in 2–7, one can

derive a connection between Cauchy-Schwartz divergence and Renyi’s quadratic entropy.

Rearranging 2–10, we get

\[
\begin{aligned}
D_{cs}(f\|g) &= -\log\int f(x)g(x)\,dx + \frac{1}{2}\log\int f^2(x)\,dx + \frac{1}{2}\log\int g^2(x)\,dx \\
&= H(X;Y) - \frac{1}{2}H(X) - \frac{1}{2}H(Y).
\end{aligned} \qquad \text{(2–11)}
\]

The term H(X;Y) can be considered as Renyi's cross entropy. The argument of the log can also be seen as E_{f(x)}[g(x)]. One can also derive this quantity following the original derivation of Renyi's entropy [Renyi, 1976]. We prove this for the discrete case in the following theorem.

Theorem 1. Consider a set E with “true” probabilities of its elements given by {p₁, p₂, ..., p_N} (where Σᵢ pᵢ = 1). After an experiment, we observe these elements with probabilities {q₁, q₂, ..., q_N}. The gain in information can then be expressed as
\[
H_{\alpha}(p, q) = \frac{1}{1-\alpha}\log\sum_{k} p_k\, q_k^{\alpha-1}.
\]
In particular, for α = 2, we get
\[
H(p, q) = -\log\sum_{k} p_k\, q_k.
\]


Proof. The information gained by observing an element with probability q_k is I_k = log(1/q_k). Let us denote the information of all elements by the set
\[
\Im = \{I_1, I_2, \ldots, I_N\} = \{-\log q_1, -\log q_2, \ldots, -\log q_N\}.
\]
Since the “true” distribution of the samples is {p₁, p₂, ..., p_N}, the average gain in information can be easily computed to give the cross entropy under the Shannon definition,
\[
H_S(p, q) = -\sum_{k} p_k \log q_k.
\]

One could also consider the Kolmogorov-Nagumo function associated with the mean.

According to this general theory of means, a mean of the real numbers x1, x2, . . . , xN with

weights p1, p2, . . . , pN is given by

\[
\varphi^{-1}\!\left(\sum_{k=1}^{N} p_k\, \varphi(x_k)\right),
\]
where φ(x) is an arbitrary continuous and strictly monotone function on the real numbers. Using this, the mean information can be represented as
\[
I = \varphi^{-1}\!\left(\sum_{k=1}^{N} p_k\, \varphi(I_k)\right).
\]

In the case of information, it is not sufficient that the function φ(x) be continuous and strictly monotone; φ(x) must also satisfy the postulate of additivity. A trivial example of φ(x) which satisfies this requirement is the linear function, which was used by Shannon. Another function which satisfies this constraint is the exponential function. In particular, Renyi used the following function,
\[
\varphi(x) = 2^{(1-\alpha)x}.
\]


Using this exponential function we get
\[
\begin{aligned}
H_{\alpha}(p, q) &= \varphi^{-1}\!\left(\sum_{k=1}^{N} p_k\, \varphi(I_k)\right)
= \varphi^{-1}\!\left(\sum_{k=1}^{N} p_k\, 2^{(1-\alpha)I_k}\right) \\
&= \varphi^{-1}\!\left(\sum_{k=1}^{N} p_k\, q_k^{\alpha-1}\right)
= \frac{1}{1-\alpha}\log\left(\sum_{k=1}^{N} p_k\, q_k^{\alpha-1}\right).
\end{aligned}
\]
In particular, for α = 2, we get H(p, q) = −log Σ_k p_k q_k.
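A small numerical check of this α = 2 special case is sketched below; the probabilities are hypothetical, and the natural logarithm is used instead of base 2, which only rescales the value.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # hypothetical "true" probabilities
q = np.array([0.4, 0.4, 0.2])   # hypothetical observed probabilities

H2_cross = -np.log(np.sum(p * q))   # Renyi cross entropy of order alpha = 2
H2_self = -np.log(np.sum(p * p))    # reduces to Renyi's quadratic entropy when q = p
print(H2_cross, H2_self)
```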

Notice that the argument of Renyi’s cross entropy is an inner product between two

pdfs and hence is amenable for deriving a non-parametric estimator. Let X, Y ∈ Rd

be two random variables with pdfs f and g respectively. Suppose we have iid samples

{x1, x2, . . . , xN} and {y1, y2, . . . , yM} from these two random variables. Using the Parzen

estimates $p_X(x) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma_X}(x - x_i)$ and $p_Y(y) = \frac{1}{M}\sum_{i=1}^{M} G_{\sigma_Y}(y - y_i)$ corresponding to f and g respectively, we have

\[
\begin{aligned}
H(X;Y) &= -\log\!\left(\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\int G_{\sigma_X}(u - x_i)\, G_{\sigma_Y}(u - y_j)\, du\right) \\
&= -\log\!\left(\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma}(x_i - y_j)\right),
\end{aligned} \qquad \text{(2–12)}
\]

where σ² = σ_X² + σ_Y². The argument of the Gaussian kernel now shows the interaction of the samples from one dataset with the samples of the other dataset. One can consider the overall argument of the log as the cross information potential V(X;Y), which can be represented as an average over individual potentials V(x_i;Y).

\[
V(X;Y) = \frac{1}{N}\sum_{i=1}^{N} V(x_i; Y), \qquad
V(x_i; Y) = \frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x_i - y_j).
\]


The derivative of V(x_i;Y) gives us the cross information force acting on each sample x_i due to all samples {y_j}, j = 1, ..., M, of the set Y.

\[
\frac{\partial}{\partial x_i} V(x_i; Y) = \frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x_i - y_j)\,\frac{(y_j - x_i)}{\sigma^2}
\;\Rightarrow\; F(x_i; Y) = \frac{1}{M}\sum_{j=1}^{M} F(x_i|y_j). \qquad \text{(2–13)}
\]

Similarly, one can derive the cross information force acting on sample y_j due to all samples {x_i}, i = 1, ..., N, of the set X by expressing V(X;Y) as an average over V(y_j;X) and differentiating with respect to y_j, as shown below for completeness.
\[
\frac{\partial}{\partial y_j} V(y_j; X) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma}(y_j - x_i)\,\frac{(x_i - y_j)}{\sigma^2}
\;\Rightarrow\; F(y_j; X) = \frac{1}{N}\sum_{i=1}^{N} F(y_j|x_i). \qquad \text{(2–14)}
\]

Note that F(x_i|y_j) and F(y_j|x_i) have the same magnitude but opposite directions, giving us a symmetric graph model. This concept is well depicted in Figure 2-2, where one can see the cross force field operating between samples of the two datasets.

2.2 Summary

In the previous sections, we have derived non-parametric estimators for both Renyi’s

entropy and cross entropy terms. This gives us a means to estimate these quantities

directly from the data. From 2–11, one can then get a non-parametric estimator for

Cauchy-Schwartz divergence measure as well. Thus,

\[
D_{cs}(p_X\|p_Y) = H(X;Y) - \frac{1}{2}H(X) - \frac{1}{2}H(Y). \qquad \text{(2–15)}
\]
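As an illustration, the divergence in 2–15 can be estimated directly from two sample sets as sketched below. For simplicity, the sketch uses a single kernel size σ for all three potentials rather than the σ_X, σ_Y combinations discussed above; the function names are ours.

```python
import numpy as np

def cross_information_potential(X, Y, sigma):
    """V(X;Y) = (1/NM) sum_i sum_j G_sigma(x_i - y_j), cf. 2-12."""
    diff = X[:, None, :] - Y[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * sigma**2))
    return K.mean()

def cauchy_schwartz_divergence(X, Y, sigma):
    """D_cs(p_X || p_Y) = H(X;Y) - H(X)/2 - H(Y)/2 estimated from samples, cf. 2-15."""
    Vxy = cross_information_potential(X, Y, sigma)
    Vxx = cross_information_potential(X, X, sigma)   # information potential V(X)
    Vyy = cross_information_potential(Y, Y, sigma)   # information potential V(Y)
    return -np.log(Vxy) + 0.5 * np.log(Vxx) + 0.5 * np.log(Vyy)
```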

Another aspect which requires particular attention is the concept of information forces. Figures 2-1 and 2-2 urge one to ask what would happen to the samples if they were allowed to move freely under the influence of these forces. This question is at the origin of this thesis. Initial efforts to move the samples using a


Figure 2-2. Cross information force between two datasets.

gradient method showed promising results, but were mired in problems of step-size selection and stability. We soon realized that two things were essential for the success of this concept:

• To devise a global cost function which would combine these two forces appropriately to achieve a “goal”.

• To come up with a fast fixed-point update rule, thus avoiding step-size issues.

The subsequent chapters would elaborate on how we developed a novel framework for

unsupervised learning taking these two aspects into consideration.

These ideas lie at the heart of information theoretic learning (ITL) [Príncipe et al., 2000]. By playing directly with the pdf of the data, ITL effectively goes beyond second-order statistics. The result is a cost function that directly manipulates information, giving rise to powerful techniques with numerous applications in the area of adaptive systems [Erdogmus, 2002] and machine learning [Jenssen, 2005; Rao et al., 2006a].


CHAPTER 3
A NEW UNSUPERVISED LEARNING PRINCIPLE

3.1 Principle of Relevant Entropy

A common theme among all learning principles is the information feedback loop

necessary for knowledge acquisition. There is more than one way of summarizing a given

dataset, but information is only relevant when it is useful in attaining the “goal”. This

aspect (what information to extract and learn) is quite clear in some scenarios and not in others. In the supervised learning scenario, for example, one is given not only the data but also a priori information. This could be in the form of class labels in classification problems, or a desired signal to distinguish from noise in filtering paradigms. Using this training set, one then needs to design systems that can adapt and learn the mapping so as to generalize to new test samples. From an optimization point of view, therefore, this problem is well posed.

On the other hand, in the unsupervised learning context the learner is only presented with the data. What, then, is the best strategy for learning? The answer lies in discovering structures in the data. As [Watanabe, 1985] puts it, these are rigid entities with interdependent subsystems that capture the regularities in the data. Since these regularities or patterns are subsets of the given data, their entropy is smaller than that of the entire sample set. The first step in discovering these patterns is therefore to minimize the entropy of the data, which in turn reduces redundancy. But mere minimization of entropy gives us the trivial solution of the data collapsing to a single point, which is not useful. To discover structure, one needs to model structures as compact representations of the data. The connection between the structures and the data they represent can be established by introducing a distortion measure. Depending on the extent of the distortion, one then finds a range of structures, which can be seen as an information extraction process. Since minimizing the entropy with this constraint adds relevance to the resulting structure, we call this the Principle of Relevant Entropy.


To be specific, consider a dataset S ∈ R^d with sample population {s₁, s₂, ..., s_M}. This original dataset is static and kept constant throughout the process. Let us now

create a new dataset X = {x1, x2, . . . , xN} such that X = S initially. Starting with this

initialization, we would like to evolve the samples of X so as to capture relevant structures

of S. We formulate this problem as minimizing a weighted combination of the Renyi's entropy of X, namely H(X), and the Cauchy-Schwartz divergence measure between the probability densities (pdfs) of X and S, as shown below.

\[
J(X) = \min_X\; H(X) + \beta\, D_{cs}(p_X\|p_S), \qquad \text{(3–1)}
\]

where β ∈ [0,∞). The first quantity H(X) can be seen as a redundancy reduction term

and the second quantity Dcs(pX ||pS), as an information preservation term. At the first

iteration, Dcs(pX ||pS) = 0 due to the initialization X = S. This would reduce the entropy

of X irrespective of the value of β selected, thus making Dcs(pX ||pS) positive. This is a

key factor in preventing X from diverging. From the second iteration onwards, depending

on the β value chosen, one would iteratively move the samples of X to capture different

regularities in S. Interestingly, there exists a hierarchical trend among these regularities

resulting from the continuous nature of the distortion measure which varies between

[0,∞). From this point of view, one can see these regularities as quantifying different

levels of structure in the data, thus giving us a composite framework.

To understand this better, we investigate the behavior of the cost function for some special values of β. Before we do this, let us first simplify the cost function. We can substitute the expansion of the D_cs(p_X‖p_S) term given in 2–15 into 3–1. To avoid working with fractional values, we use a more convenient definition of the Cauchy-Schwartz divergence given by

\[
D_{cs}(p_X\|p_S) = -\log\frac{\left(\int p_X(u)\, p_S(u)\, du\right)^2}{\left(\int p_X^2(u)\, du\right)\left(\int p_S^2(u)\, du\right)}
= 2\,H(X;S) - H(X) - H(S).
\]


This satisfies all the properties of divergence listed in Chapter 2, following the Cauchy-Schwartz inequality. We could also see this change as absorbing a scaling constant

(a value of 1/2) in the weighting factor β in 3–1. In any case, this is not going to alter any

results, but would only change the values of β at which we attain those results. Having

done this, the cost function now looks like

\[
\begin{aligned}
J(X) &= \min_X\; H(X) + \beta\,\big[\,2H(X;S) - H(X) - H(S)\,\big] \\
&\equiv \min_X\; (1-\beta)\,H(X) + 2\beta\, H(X;S).
\end{aligned} \qquad \text{(3–2)}
\]

Note that we have dropped the extra term H(S) since this is constant with respect to X.

With this simplified form of the cost function, we will now investigate the effect of the

weighting parameter β.

3.1.1 Case 1: β = 0

For this special case, the cost function in equation 3–2 reduces to

\[
J(X) = \min_X\; H(X).
\]

Looking back at equation 3–1, we see that the information preservation term has been

eliminated and the cost function directly minimizes the entropy of the dataset iteratively.

As pointed out earlier, this simple case leads to a trivial solution with all the samples

converging to a single point.

Since the log is a monotonic function, minimization of the entropy is equivalent to maximization of the argument of the log, which is the information potential V(X) of the dataset.

\[
J(X) \equiv \max_X\; V(X) = \max_X\; \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma}(x_i - x_j).
\]


Differentiating this quantity with respect to x_k ∈ X, k = 1, 2, ..., N, and setting the result equal to zero, we get the following fixed point update rule.

\[
\begin{aligned}
\frac{\partial}{\partial x_k} V(X) &= 0 \\
\frac{2}{N^2}\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\left\{\frac{x_j - x_k}{\sigma^2}\right\} &= 0 \\
\left\{\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\right\} x_k &= \sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\, x_j \\
x_k &= \frac{\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\, x_j}{\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)}.
\end{aligned}
\]

Since we apply this fixed point rule repeatedly for each x_k ∈ X, we write the iteration number τ explicitly as

\[
x_k^{(\tau+1)} = \frac{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)\, x_j^{(\tau)}}{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)}. \qquad \text{(3–3)}
\]
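A minimal sketch of this β = 0 update is shown below; iterating it from any starting configuration keeps contracting the samples, which is the collapsing behavior proved next. The data in the usage comment is illustrative only.

```python
import numpy as np

def entropy_minimization_step(X, sigma):
    """One pass of the beta = 0 fixed-point rule 3-3: each x_k becomes the
    kernel-weighted mean of the current samples."""
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * sigma**2))   # G_sigma(x_k - x_j)
    return (K @ X) / K.sum(axis=1, keepdims=True)

# Illustrative usage: the samples keep contracting toward a single point.
# X = np.random.randn(100, 2)
# for _ in range(50):
#     X = entropy_minimization_step(X, sigma=0.5)
```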

Because it iteratively minimizes Renyi's entropy of the dataset, this update rule results in the collapse of the entire dataset to a single point, as proved in the following theorem.

Theorem 2. The update rule 3–3 converges with all the data samples merging to a single

point.

Proof. Let us denote the sample set at iteration τ+1 by X^{(τ+1)} = {x_1^{(τ+1)}, x_2^{(τ+1)}, ..., x_N^{(τ+1)}}. Starting with the initialization X^{(0)} = S, the dataset X then evolves in the following sequence:
\[
X^{(0)} \rightarrow X^{(1)} \rightarrow \ldots \rightarrow X^{(\tau+1)} \rightarrow \ldots \qquad \text{(3–4)}
\]

We could rewrite equation 3–3 as
\[
x_k^{(\tau+1)} = \sum_{j=1}^{N} w_j^{(\tau)}(k)\, x_j^{(\tau)}, \qquad \sum_{j=1}^{N} w_j^{(\tau)}(k) = 1.
\]


Therefore, each sample x_k^{(τ+1)} is a convex combination of the samples of X^{(τ)} and hence belongs to its convex hull¹, denoted by Hull(X^{(τ)}). Further, due to the nature of the Gaussian kernel, each weight w_j^{(τ)}(k) is strictly in (0, 1). Therefore, each sample x_k^{(τ+1)} lies strictly inside the convex hull of X^{(τ)}. Generalizing this to all the samples of X^{(τ+1)}, we have
\[
X^{(\tau+1)} \subset \mathrm{Hull}(X^{(\tau)}).
\]
From the convexity of the hull, this also implies that
\[
\mathrm{Hull}(X^{(\tau+1)}) \subset \mathrm{Hull}(X^{(\tau)}).
\]
With the sequence in 3–4, one could then conclude the following:
\[
\mathrm{Hull}(X^{(0)} = S) \supset \mathrm{Hull}(X^{(1)}) \supset \mathrm{Hull}(X^{(2)}) \supset \ldots \supset \mathrm{Hull}(X^{(\tau+1)}) \supset \ldots
\]
This collapsing sequence therefore converges with all the data samples merging to a single point.

3.1.2 Case 2: β = 1

For β = 1, the cost function directly minimizes Renyi's cross entropy between X and the original dataset S. Again, using the monotonicity of the log, the cost function can be rewritten as

\[
J(X) = \min_X\; H(X;S) \equiv \max_X\; V(X;S) = \max_X\; \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma}(x_i - s_j).
\]

Similar to the derivation in the previous section, one could obtain the following fixed point update rule by differentiating J(X) with respect to x_k ∈ X, k = 1, 2, ..., N, and setting it equal to

1 The convex hull of a set X is the smallest convex set containing X.


zero.

\[
x_k^{(\tau+1)} = \frac{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)\, s_j}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}. \qquad \text{(3–5)}
\]

Notice the lack of any superscript on s_j ∈ S, since it is kept constant throughout the process of adaptation of X. Unlike 3–3 for β = 0, here the samples of X^{(τ)} are constantly compared to this fixed dataset. This results in the samples of X converging to the modes (peaks) of the pdf p_S(u), thus revealing an important structure of S. We prove this in the following theorem.

Theorem 3. Modes of the pdf estimate p_S(x) = (1/M) Σ_{j=1}^{M} G_σ(x − s_j) are the fixed points of the update rule 3–5.

Proof. The proof is quite simple. We could rearrange the update rule 3–5 as
\[
\begin{aligned}
\left\{\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\right\} x_k &= \sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\, s_j \\
\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\,\{s_j - x_k\} &= 0 \\
\frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\left\{\frac{s_j - x_k}{\sigma^2}\right\} &= 0 \\
\left.\frac{\partial}{\partial x}\, p_S(x)\right|_{x = x_k} &= 0.
\end{aligned}
\]
Since modes are stationary solutions of ∂p_S(x)/∂x = 0, they are the fixed points of 3–5.

We can also rewrite equation 3–5 as
\[
x_k^{(\tau+1)} = \sum_{j=1}^{M} w_j^{(\tau)}(k)\, s_j, \qquad \sum_{j=1}^{M} w_j^{(\tau)}(k) = 1.
\]

Here the w_j^{(τ)}(k) are the weights evaluated with respect to the original data S, which is kept constant throughout the process. Since each w_j^{(τ)}(k) is strictly in (0, 1), each sample x_k^{(τ+1)} is mapped strictly inside the convex hull of S (and not of X^{(τ)} as in the case of β = 0). Taking all the data samples into consideration then,
\[
X^{(\tau+1)} \subset \mathrm{Hull}(S).
\]
This gives rise to a stable sequence where the data samples converge to their respective modes. Therefore, for β = 1, one can identify high concentration regions in the original dataset S, which could be used for clustering applications.

3.1.3 Case 3: β → ∞

Before we analyze this asymptotic case, we will derive a general update rule for β ≠ 0.

We rewrite equation 3–2 as

\[
\begin{aligned}
J(X) &= \min_X\; (1-\beta)\,H(X) + 2\beta\, H(X;S) \\
&= \min_X\; -(1-\beta)\log V(X) - 2\beta \log V(X;S) \\
&= \min_X\; -(1-\beta)\log\left\{\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma}(x_i - x_j)\right\}
- 2\beta\log\left\{\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} G_{\sigma}(x_i - s_j)\right\}.
\end{aligned}
\]

Taking the derivative of J(X) with respect to x_k ∈ X, k = 1, 2, ..., N, and setting it equal to zero, we have²

\[
\begin{aligned}
\frac{(1-\beta)}{V(X)}\,\frac{2}{N^2}\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\left\{\frac{x_j - x_k}{\sigma^2}\right\}
+ \frac{\beta}{V(X;S)}\,\frac{2}{NM}\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\left\{\frac{s_j - x_k}{\sigma^2}\right\} &= 0 \\
\frac{(1-\beta)}{V(X)}\, F(x_k) + \frac{\beta}{V(X;S)}\, F(x_k; S) &= 0,
\end{aligned} \qquad \text{(3–6)}
\]

where F(x_k) and F(x_k;S) are given by equations 2–5 and 2–13. Equation 3–6 shows the net force acting on sample x_k ∈ X. This force is a weighted combination of two forces: the information force within the dataset X and the cross information force between X and S.

2 The factor 2 in the first term is due to the double summation and the symmetry of the Gaussian kernel.


Depending on the value of β, these two forces combine in different proportions giving rise

to different rules of interaction between data samples.

Rearranging 3–6 we have
\[
\left\{\frac{\beta}{V(X;S)}\,\frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\right\} x_k
= \frac{(1-\beta)}{V(X)}\,\frac{1}{N}\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\, x_j
+ \frac{\beta}{V(X;S)}\,\frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x_k - s_j)\, s_j
- \left\{\frac{(1-\beta)}{V(X)}\,\frac{1}{N}\sum_{j=1}^{N} G_{\sigma}(x_k - x_j)\right\} x_k.
\]

This gives us the following fixed point update rule,

\[
x_k^{(\tau+1)} = c\,\frac{(1-\beta)}{\beta}\,
\frac{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)\, x_j^{(\tau)}}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}
+ \frac{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)\, s_j}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}
- c\,\frac{(1-\beta)}{\beta}\,
\frac{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}\, x_k^{(\tau)}, \qquad \text{(3–7)}
\]

where c = (V(X;S)/V(X)) (M/N). Notice that there are three different ways of rearranging 3–6 to get a fixed point update rule, but we found only the fixed point iteration 3–7 to be stable and consistent.
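A sketch of one pass of the general rule 3–7 is given below, under our reading of the update; the variable names are ours. For β = 1 the first and third terms vanish and the rule reduces to 3–5, and for very large β it approaches 3–9.

```python
import numpy as np

def relevant_entropy_step(X, S, sigma, beta):
    """One pass of the fixed-point rule 3-7 for beta != 0 (a sketch)."""
    N, M = X.shape[0], S.shape[0]
    Kxx = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1) / (2 * sigma**2))
    Kxs = np.exp(-np.sum((X[:, None, :] - S[None, :, :])**2, axis=-1) / (2 * sigma**2))
    V_x, V_xs = Kxx.sum() / N**2, Kxs.sum() / (N * M)   # V(X) and V(X;S)
    c = (V_xs / V_x) * (M / N)
    denom = Kxs.sum(axis=1, keepdims=True)               # sum_j G(x_k - s_j)
    term1 = c * (1 - beta) / beta * (Kxx @ X) / denom
    term2 = (Kxs @ S) / denom
    term3 = c * (1 - beta) / beta * (Kxx.sum(axis=1, keepdims=True) / denom) * X
    return term1 + term2 - term3
```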

Having derived the general form, we are now ready to discuss the asymptotic case

of β → ∞. It turns out that for this special case, one ends up directly minimizing the

Cauchy-Schwartz divergence measure Dcs(pX ||pS). This is not obvious by looking at

equation 3–1. We prove this in Theorem 4. Since we start with the initialization X = S,

a consequence of this extreme case is that we get back the data itself as the solution. This

corresponds to the highest level of structure with all the information preserved at the cost

of no redundancy reduction.

Theorem 4. When β →∞, the cost function 3–1 directly minimizes the Cauchy-Schwartz

pdf divergence Dcs(pX ||pS).


Proof. We prove this theorem by showing that the fixed point equation derived by directly optimizing D_cs(p_X‖p_S) is the same as 3–7 when β → ∞. Consider the cost function
\[
F(X) = \min_X\; D_{cs}(p_X\|p_S) \equiv \min_X\; 2H(X;S) - H(X) = \min_X\; -2\log V(X;S) + \log V(X). \qquad \text{(3–8)}
\]

Differentiating with respect to x_k ∈ X, k = 1, 2, ..., N, and setting it equal to zero, we have
\[
\frac{1}{V(X)}\, F(x_k) - \frac{1}{V(X;S)}\, F(x_k; S) = 0.
\]

Rearranging the above equation just as we did to derive equation 3–7, we get

\[
x_k^{(\tau+1)} = -\,c\,
\frac{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)\, x_j^{(\tau)}}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}
+ \frac{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)\, s_j}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}
+ c\,
\frac{\sum_{j=1}^{N} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}\, x_k^{(\tau)}, \qquad \text{(3–9)}
\]

where again c = (V(X;S)/V(X)) (M/N). Taking the limit β → ∞ in 3–7 gives 3–9.

Corollary 4.1. With X initialized to S, the data satisfies the fixed point update rule 3–9.

Proof. When X^{(0)} = S, where X^{(0)} corresponds to the initialization at τ = 0, we have V(X) = V(X;S) and N = M. Substituting this in 3–9 we see that
\[
x_k^{(\tau+1)} = x_k^{(\tau)},
\]
and hence the update rule 3–9 converges at the very first iteration, giving the data itself as the solution.

3.2 Summary

The analysis so far of the Principle of Relevant Entropy at particular values of β

revealed structures which at one end balance information preservation and at the other

end minimize redundancy. Further, a continuous array of such structures exists for


Figure 3-1. A simple example dataset. A) Crescent shaped data. B) pdf plot for σ² = 0.05.

different proportions of these two terms. For example, when 1 < β < 3 one could get

principal curves passing through the data. These can be considered as non-linear versions

of the principal components, embedding the data in a lower dimensional space. What

is even more interesting is that these curves represent ridges passing through the modes

(peaks) of the pdf pS(u) and hence subsume the information of the modes revealed for

β = 1. There exists therefore, a hierarchical structure to this framework. At one extreme

with β = 0, we have a single point which is highly redundant but preserves no information

of the data while at the other extreme when β → ∞, we have the data itself as the

solution preserving all the information we have with no redundancy reduction. In between

these two extremes, one encounters different phases like modes and principal curves which

reveal interesting patterns unique to the given dataset. To demonstrate this idea, we use the simple example of a crescent-shaped dataset, as shown in Figure 3-1. Applying our

cost function for different values of β reveals various structures relevant to the dataset as

shown in Figure 3-2.

From an unsupervised learning point of view, there are very interesting aspects unique to this principle. Notice that the cost in equation 3–1 is a function of just two parameters: the weighting parameter β and the inherently hidden scale parameter σ. The


Figure 3-2. An illustration of the structures revealed by the Principle of Relevant Entropy for the crescent shaped dataset for σ² = 0.05. Panels: A) β = 0, single point; B) β = 1, modes; C) β = 2, principal curve; D) β = 2.8; E) β = 3.5; F) β = 5.5; G) β = 13.1; H) β = 20; I) β → ∞, the data. As the value of β increases we pass through the modes and principal curves, and in the extreme case of β → ∞ we get back the data itself as the solution.

parameter β controls the “task” to be achieved (like finding the modes or the principal

curves of the data) whereas σ controls the resolution of our analysis. By including an

internal user defined parameter β, we allow the learning machine to influence the level

of abstraction to be extracted from the data. To the best of our knowledge, this is the

first attempt to build a framework with the inclusion of goals at the data analysis level.

A natural consequence of this is a rich and hierarchical structure unfolding the intricacies

between all levels of information in the data. In Chapters 4 to 6, we show different

applications of our principle like clustering, manifold learning and compression which now

fall under a common purview.


Before we end this chapter, we would like to highlight two important details. The

different structures shown in Figure 3-2 were obtained for a kernel size of σ² = 0.05. For

a different kernel size, we get a new set of structures relevant to that particular scale.

This rich representation of the data is a natural consequence of framing our cost function

with information theoretic principles. The other aspect corresponds to the redundancy

achieved. Throughout this discussion we have kept N = M. To achieve a reduction in the number of points used to represent the data, we can follow with a post-processing stage where

merged points are replaced with a single point. This would finally give us N ≤ M . For

example, to capture the modes of the data we can replace M points which have merged

to their respective modes with a single point at each mode. The special case of equality

occurs when β → ∞. In this case, we sacrifice redundancy and store all the samples to

preserve maximal information about the data.


CHAPTER 4
APPLICATION I: CLUSTERING

4.1 Introduction

The first application we would like to discuss is the popular facet of unsupervised

learning called Clustering. In this learning scenario, we are presented with a set of data

samples and the task is to dynamically extract groups (or patterns) in the set such that

samples within a group are more similar to each other than samples of different groups.

One therefore needs to address the question of what similarity means before tackling this

issue. The traditional view of similarity is to associate it with some distance metric [Duda

et al., 2000; Watanabe, 1985]. These two quantities generally share an inverse relationship.

So, if we denote an arbitrary similarity measure by S and the corresponding distance

metric by D, then

S ×D = constant.

In essence, the smaller the distance between two points the larger is their similarity and

vice versa. Examples of distance metric include Euclidean distance, Mahalanobis distance

and others.

This approach is popular since distance metric is well defined and quantifiable. By

defining a metric for a feature space, one indirectly defines an appropriate similarity

measure thus bypassing the ill posed problem of similarity itself. Nevertheless, one needs

to pick a distance metric which can be seen as part of a priori information. For example,

K-means clustering technique uses a Euclidean distance metric. This leads to a priori

assumption that the clusters have a spherical Gaussian distribution. If this is really

true, then the clustering solution obtained is good. But most of the time such a priori

information is not available and is used as a convenience to ease the difficulty of the

problem. The imposition of such external models on the data leads to biased information

extraction, thus hampering the learning process. For example, most neural spike data

we have worked with do not satisfy this assumption at all, but a lot of medical work


uses the K-means technique for spike sorting, leading to poor clustering solutions and wrong inferences [Rao et al., 2006b].

Recent trends in clustering have tried to make use of information from higher order

moments or to deal directly with the probability density function (pdf) of the data. One

such family of techniques, which has become quite popular lately in the image processing community, is the mean shift algorithms. The idea is very simple. Since clusters correspond to

high density regions in the dataset, a good pdf estimation would result in these clusters

demarcated by peaks (or modes) of the pdf. We could, for example, use Parzen window

technique to non-parametrically estimate the pdf thus avoiding any a priori model

assumption. Using any mode finding procedure, we could then move these points in the

direction of normalized density gradient. By correlating the points with the modes to

which they converge one can in essence achieve a very good clustering solution. The fixed

point non-parametric iterative scheme makes this method particularly attractive with only

a single parameter (kernel size) to estimate. An added benefit is automatic inference of the

number of clusters K which in many other algorithms like K-means and spectral clustering

needs to be specified a priori.

We show in this chapter that these mean-shift algorithms are essentially special cases of our novel principle, corresponding to β = 1 and β = 0. The independent derivation of these algorithms gives a new perspective from an optimization point of view, highlighting their differences and, in general, their connection to a broader framework. Before we delve into this, we briefly review the existing literature on mean shift algorithms.

4.2 Review of Mean Shift Algorithms

Let us consider a sample set S = {s₁, s₂, ..., s_M} ∈ R^d. The non-parametric pdf estimate of this dataset using the Parzen window technique is given by
\[
p_S(x) = \frac{1}{M}\sum_{j=1}^{M} G_{\sigma}(x - s_j),
\]


where G_σ(t) = e^{−t²/2σ²} is a Gaussian kernel with bandwidth σ > 0. In order to find the modes of the pdf, we solve the stationary point equation ∇_x p_S(x) = 0, for which the modes are the fixed points. This gives us an iterative scheme as shown below.

\[
x_k^{(\tau+1)} = T\big(x_k^{(\tau)}\big) = \frac{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)\, s_j}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - s_j\big)}. \qquad \text{(4–1)}
\]

The expression T(x) is the sample mean of all the samples s_j weighted by the kernel centered at x. Thus, the term T(x) − x was coined “mean shift” in the landmark paper by Fukunaga and Hostetler [1975]. In this process, two different datasets are maintained, namely X and S. The dataset X is initialized to S as X^{(0)} = S. At every iteration, a new dataset X^{(τ+1)} is produced by comparing the present dataset X^{(τ)} with S. Throughout this process S is fixed and kept constant. Due to the use of the Gaussian kernel, this algorithm was called Gaussian Mean Shift (GMS).

Unfortunately, the original paper by Fukunaga and Hostetler used a modified version

of 4–1 which is shown below.

\[
x_k^{(\tau+1)} = T\big(x_k^{(\tau)}\big) = \frac{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)\, x_j^{(\tau)}}{\sum_{j=1}^{M} G_{\sigma}\big(x_k^{(\tau)} - x_j^{(\tau)}\big)}. \qquad \text{(4–2)}
\]

In this iteration scheme, one compares the samples of the dataset X^{(τ)} with X^{(τ)} itself. Starting with the initialization X^{(0)} = S and using 4–2, we successively “blur” the dataset S to produce datasets X^{(1)}, X^{(2)}, ..., X^{(τ+1)}. As new datasets are produced, we forget the previous one, giving rise to a blurring process which is inherently unstable. It was Cheng [1995] who first pointed this out and named the fixed point update 4–2 Gaussian Blurring Mean Shift (GBMS).
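The two updates differ only in which set the samples are compared against, which the following sketch makes explicit (the names and structure are ours, not from the original papers).

```python
import numpy as np

def gms_step(X, S, sigma):
    """Gaussian Mean Shift, 4-1: X is compared against the fixed original set S."""
    K = np.exp(-np.sum((X[:, None, :] - S[None, :, :])**2, axis=-1) / (2 * sigma**2))
    return (K @ S) / K.sum(axis=1, keepdims=True)

def gbms_step(X, sigma):
    """Gaussian Blurring Mean Shift, 4-2: X is compared against itself and 'blurred'."""
    K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1) / (2 * sigma**2))
    return (K @ X) / K.sum(axis=1, keepdims=True)

# Typical usage (sketch): start from X = S and iterate until a stopping criterion fires.
# X = S.copy()
# for _ in range(max_iters):
#     X = gms_step(X, S, sigma)      # or: X = gbms_step(X, sigma)
```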

Recent advancements in mean shift have made these algorithms quite popular in

image processing literature. In particular, the mean shift vector of GMS has been shown

to always point in the direction of the normalized density gradient [Cheng, 1995]. Since points lying in low density regions have a small value of p_S(x), the normalized gradient at these points has a large value. This helps the samples to quickly move from low density regions


toward the modes. On the other hand, due to the relatively high value of p_S(x) near a mode, the steps are highly refined around this region. This adaptive nature of the step size gives GMS a significant advantage over traditional gradient-based algorithms, where step-size selection is a well-known problem.

A rigorous proof of stability and convergence of GMS was given in [Comaniciu and

Meer, 2002] where the authors proved that the sequence generated for each sample xk

by 4–1 is a Cauchy sequence that converges due to the monotonic increasing sequence of

the pdfs estimated at these points. Further, the trajectory is always smooth in the sense that the angle between consecutive mean shift vectors always lies in (−π/2, π/2). M. A. Carreira-Perpinan [2007] also showed that GMS is an Expectation-Maximization (EM)

algorithm and thus has a linear convergence rate.

Due to these interesting and useful properties, GMS has been successfully applied

in low level vision tasks like image segmentation and discontinuity preserving smoothing

[Comaniciu and Meer, 2002] as well as in high level vision tasks like appearance based

clustering [Ramanan and Forsyth, 2003] and real-time tracking of non rigid objects

[Comaniciu et al., 2000]. Carreira-Perpinan [2000] used mean shift for mode finding in mixtures of Gaussian distributions. The connection to the Nadaraya-Watson estimator from kernel regression and to the robust M-estimators of location has been thoroughly explored by

Comaniciu and Meer [2002]. With just a single parameter to control the scale of analysis,

this simple non-parametric iterative procedure has become particularly attractive and

suitable for wide range of applications.

Compared to GMS, the understanding of GBMS algorithm remains relatively poor

since this concept first appeared in [Fukunaga and Hostetler, 1975]. Apart from the

preliminary work done in [Cheng, 1995], the only other notable contribution which we

are aware of was recently made by Carreira-Perpinan. In his paper [Carreira-Perpinan,

2006], the author showed that GBMS has a cubic convergence rate and to overcome its

instability, developed a new stopping criterion. By removing the redundancy among points


which have already merged, an accelerated GBMS was developed which was 2×–4× faster1.

4.3 Contribution of this Thesis

In spite of these achievements of mean shift techniques, there are still some

fundamental questions unanswered. It is not clear, for example, what changes are incurred

when going from GBMS to GMS and vice versa. Further, the implications and instability of GBMS are poorly understood. The justification to date has been ad hoc and heuristic.

Though Cheng [1995] provided a comprehensive comparison, the analysis is quite complex.

Fashing and Tomasi [2005] showed mean shift as quadratic bound maximization but the

analysis is indirect and the scope limited. On top of this, the derivation of GBMS by

replacing the original samples of S with X(τ) itself is quite heuristic.

A right step to tackle this issue is to pose the question: “What do mean shift algorithms optimize?” For this would elucidate the cost function and hence the

optimization surface itself. Rigorous theoretical analysis is then possible to account

for their inherent properties. This we claim is our major contribution in this area. Notice

that equation 4–1 is the same as 3–5. The mode finding ability of our principle for β = 1

actually corresponds to GMS algorithm. Thus, GMS optimizes Renyi’s cross entropy

between datasets X and S. From Theorem 3, it is then clear that the iteration procedure

4–1 leads to a stable algorithm. The data samples move in the direction of normalized

gradient converging to their respective modes.

On the other hand, equation 4–2 exactly corresponds to 3–3. This iteration scheme

is the result of directly minimizing Renyi’s quadratic entropy of the data X. Since this

happens for β = 0, GBMS is a special case of our general principle. From optimization

point of view then, GBMS with initialization X(0) = S leads to rapid collapse of the data

to a single point, as proved in Theorem 2. GBMS therefore does not truly possess the mode-

1 Note that this can also be done for the GMS algorithm.


finding ability to track the modes in the original dataset S. For the purpose of finding high concentration regions in S and dynamically discovering clusters, this algorithm is therefore unstable and should not be used.

Some authors have suggested GBMS as a fast alternative to GMS algorithm if it

can be stopped appropriately. The inherent assumption here is that the clusters are well

separated in the given data. This would lead to two phases in GBMS optimization. In the

first phase, the points would rapidly collapse to the modes of their respective clusters, and

in the second phase, the modes slowly move towards each other to ultimately yield a single

point. By stopping the algorithm after the first phase, one could ideally get the clustering

solution. Further, since GBMS has approximately 3× faster convergence rate than GMS,

this approach can be seen as fast alternative to GMS. Unfortunately, this assumption

of well demarcated clusters is not true in practical applications, especially in areas like image segmentation. Further, since modes are not the fixed points of GBMS, any stopping criterion would be at best heuristic. We demonstrate this rigorously for many problems in

the following sections in this chapter.

By clarifying these differences through an optimization route, we bring a new perspective to these algorithms [Rao et al., 2008].

derivation is completely independent of the earlier efforts. Further, it is only in our

case that the patterns extracted from the data can be seen as structures of a broader

unsupervised learning framework with their connections to other facets like vector

quantization and manifold learning. By elucidating their respective strengths and weaknesses from an information theoretic point of view, we greatly simplify the understanding of these algorithms.

4.4 Stopping Criteria

Before we start comparing the performance of the mean shift algorithms, we need to devise stopping criteria for them. Because these techniques are iterative procedures, they

53

Page 54: c 2008 Sudhir Madhav Rao - University of Floridaufdcimages.uflib.ufl.edu/UF/E0/02/25/31/00001/rao_s.pdf · Sudhir Madhav Rao August 2008 Chair: Jos e C. Pr ‡ncipe Major: Electrical

would continue running unnecessarily even after convergence. To avoid this redundancy, we look into simple and practical ways of stopping the fixed point iterations.

4.4.1 GMS Algorithm

Stopping the GMS algorithm to find the modes is very simple. Since samples move in

the direction of normalized gradient toward the modes which are fixed points of 4–1, the

average distance moved by samples becomes smaller over subsequent iterations. By setting

a tol level on this quantity to a low value we can get the modes as well as stop GMS from

running unnecessarily. This is summarized in 4–3.

\[
\text{Stop when}\quad \frac{1}{N}\sum_{i=1}^{N} d^{(\tau)}(x_i) < tol, \quad\text{where}\quad
d^{(\tau)}(x_i) = \big\|x_i^{(\tau)} - x_i^{(\tau-1)}\big\|. \qquad \text{(4–3)}
\]

Another version of this stopping criterion, which we found more useful in image segmentation applications, is to stop when the maximum distance moved among all the particles, rather than the average distance, is less than some tol level, that is,

\[
\text{Stop when}\quad \max_i \big\|x_i^{(\tau)} - x_i^{(\tau-1)}\big\| < tol. \qquad \text{(4–4)}
\]
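Both checks amount to thresholding a per-sample displacement, as in the following sketch (the helper name and flag are our own).

```python
import numpy as np

def should_stop(X_new, X_old, tol, use_max=False):
    """Stopping checks 4-3 (average displacement) and 4-4 (maximum displacement)."""
    d = np.linalg.norm(X_new - X_old, axis=1)   # d(x_i) = ||x_i^(tau) - x_i^(tau-1)||
    return (d.max() if use_max else d.mean()) < tol
```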

4.4.2 GBMS Algorithm

As stated earlier, modes are not the solutions of the GBMS fixed point update equation and hence GBMS cannot be used to find them. But assume that the modes are far apart compared to the kernel size. In such cases, there generally seem to be two distinct phases of

convergence. In the first phase, the points quickly collapse to their respective modes while

the modes move very slowly towards each other. In the second phase, the modes start

merging and ultimately yield a single point. If the algorithm can be stopped after the first

phase, then it could be used in applications like clustering where the exact position of

modes is not important, although any such stopping criterion would at most be heuristic.

Of course the stopping criterion 4–3 cannot be used unless we carefully hand-pick the tol


level since the average distance moved by the particles never settles down until all of them

have merged.

The assumption of two distinct phases was effectively used to formulate a stopping criterion by Carreira-Perpinan [2006]. In phase 2, d^{(τ)} = {d^{(τ)}(x_i)}, i = 1, ..., N, takes on at most K different values (for K modes). Binning d^{(τ)} using a large number of bins gives us a histogram which has K or fewer non-empty bins. Since entropy does not depend on the exact location of the bins, its value does not change and can be used to stop the algorithm as

shown in 4–5.
\[
\big|H_s(d^{(\tau+1)}) - H_s(d^{(\tau)})\big| < 10^{-8}, \qquad \text{(4–5)}
\]
where H_s(d) = −Σ_{i=1}^{B} f_i log f_i is the Shannon entropy, f_i is the relative frequency of bin i, and the bins span the interval [0, max(d)]. The number of bins B was selected as B = 0.9N.
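Our reading of this criterion is sketched below; the binning details (histogram range per iteration, handling of an all-zero displacement vector) are assumptions made for illustration.

```python
import numpy as np

def histogram_entropy_stop(d_new, d_old, eps=1e-8):
    """Stopping rule 4-5 for GBMS: compare Shannon entropies of the displacement
    histograms at two consecutive iterations."""
    B = max(1, int(0.9 * len(d_new)))                    # number of bins, B = 0.9 N
    def shannon_entropy(d):
        top = d.max() if d.max() > 0 else 1.0
        hist, _ = np.histogram(d, bins=B, range=(0.0, top))
        f = hist[hist > 0] / hist.sum()                  # relative frequencies of non-empty bins
        return -np.sum(f * np.log(f))
    return abs(shannon_entropy(d_new) - shannon_entropy(d_old)) < eps
```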

It is clear that there is no guarantee that we would find all the modes using this

rule. Further, the assumption used in developing this criterion does not hold true in many

practical scenarios as will be shown in our experiments.

4.5 Mode Finding Ability

Here, we study the mode finding ability of the two algorithms. We use a systematic

approach, by generating a mixture of Gaussian dataset with known modes. We select the

kernel size (σ) such that the modes corresponding to the estimated pdf (using Parzen

window technique) is as close as possible to the original modes. We then use GMS and

GBMS to iteratively track these modes and compare their performance.

The dataset in Figure 4-1 consists of a mixture of 16 Gaussians with centers spread

uniformly around a circle of unit radius. Each Gaussian density has a spherical covariance

of σ2gI = 0.01× I. To include a more realistic scenario, different a priori probabilities were

selected which is shown in Figure 4-1B. Using this mixture model, 1500 iid data points

were generated. We selected the scale of analysis σ2 = 0.01 such that the estimated modes

are very close to the modes of the Gaussian mixture. Note that since the dataset is a


Figure 4-1. Ring of 16 Gaussians dataset with different a priori probabilities. The numbering of the clusters is in anticlockwise direction starting with center (1, 0). A) R16Ga dataset. B) A priori probabilities. C) Probability density function estimated using σ² = 0.01.

mixture of 16 Gaussians, each with variance σ_g² = 0.01, spread across the unit circle, the overall variance of the data is much larger than 0.01. Thus, by using a kernel size of σ² = 0.01 for the Parzen estimation of the pdf, we ensure that the Parzen kernel size is smaller than the overall spread of the data. Figure 4-1C shows the 3-D view of this estimated pdf. Note the unequal peaks due to the different proportions of points in each cluster.

Figure 4-2 shows the mode finding ability of the two algorithms. To compare with the ground truth, we also plot the 2σ_g contour lines and the actual centers of the Gaussian mixture. With the tol level in 4–3 set to 10⁻⁶, the GMS algorithm stops at the 46th iteration, giving almost perfect results. On the other hand, using stopping criterion 4–5, GBMS stops at the 20th iteration, having already missed 4 modes (shown with arrows). We would also like to point out that this is the best result achievable by the GBMS algorithm even if we had used stopping criterion 4–3 and selectively hand-picked the best tol value.

Figure 4-3 shows the cost functions which these algorithms minimize for a duration

of 70 iterations. Notice how the cost function H(X) of GBMS continuously drops as the

modes merge. This would go on until H(X) becomes zero when all the samples would

have merged to a single point. For GMS, on the other hand, H(X; S) decreases and settles

down smoothly to its fixed points which are the modes of pS(x). Thus, a more intuitive

stopping criterion for GMS which originates directly from its cost function is to stop when


Figure 4-2. Modes of the R16Ga dataset found using the GMS and GBMS algorithms. A) Good mode finding ability of the GMS algorithm. B) Poor mode finding ability of the GBMS algorithm.

the absolute difference between subsequent values of H(X; S) becomes smaller than some

tol level as summarized in 4–6. These are some of the unforeseen advantages when we

know exactly what we are optimizing.

\[
\big|H(X^{(\tau+1)}; S) - H(X^{(\tau)}; S)\big| < 10^{-10}. \qquad \text{(4–6)}
\]
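Since H(X;S) = −log V(X;S), this check can be computed directly from the cross information potential, as in the brief sketch below (a single kernel size σ is assumed).

```python
import numpy as np

def cross_entropy_stop(X_new, X_old, S, sigma, tol=1e-10):
    """Stopping rule 4-6 for GMS: stop when H(X;S) barely changes between iterations."""
    def H_cross(X):
        K = np.exp(-np.sum((X[:, None, :] - S[None, :, :])**2, axis=-1) / (2 * sigma**2))
        return -np.log(K.mean())            # H(X;S) = -log V(X;S)
    return abs(H_cross(X_new) - H_cross(X_old)) < tol
```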

Another interesting result pops up with this new understanding. Notice that even

though GBMS does not minimize Renyi’s cross entropy H(X; S) directly, we can always

measure this quantity between X(τ) at every iteration τ and the original dataset S. If

the assumption of two distinct and well separated phases in GBMS holds true, then the

samples will quickly collapse to the actual modes of the pdf before they start slowly

moving toward each other. Since we start with initialization X = S, H(X; S) will reach its

local minimum at this point before it again starts increasing due to the merging of GBMS

modes (and hence moving them away from the actual modes of the pdf). By stopping

GBMS at this minimum, we could devise an effective stopping criterion giving the same result as GMS in fewer iterations.


Figure 4-3. Cost function of the GMS and GBMS algorithms over the iterations (H(X) for GBMS, H(X;S) for GMS).

Unfortunately, we found that this works only when the modes (or clusters) are very well separated compared to the kernel size (so that the assumption holds true). For example, Figure 4-4 shows H(X;S) computed for GBMS on the R16Ga dataset. The minimum is reached at the 7th iteration. Therefore, this stopping criterion would have prematurely stopped the GBMS algorithm, giving very poor results. It is clear that GBMS is not a good mode finding algorithm.

These results shed new light on our understanding of these two algorithms. Mode finding can be used as a means to cluster data into different groups. We will see next the performance of these algorithms in clustering, where their respective properties greatly affect the outcome of the applications.

4.6 Clustering Example

We generated 10 Gaussian clusters with centers spread uniformly in the unit square. The Gaussian clusters have random spherical covariance matrices, with 50 iid samples each. Figure 4-5 shows the dataset with the true labeling as well as the 2σ_g contour plots.

Although different kernel sizes should be used for density estimation of the different clusters, for simplicity and to express our idea clearly, we use a common Parzen kernel


Figure 4-4. Renyi's “cross” entropy H(X;S) computed for the GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases since the assumption of two distinct phases of convergence does not hold true in general.

Figure 4-5. A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σ_g contour plots. C) Probability density estimated using σ² = 0.01.

size for pdf estimation. Using Silverman’s rule of thumb [Silverman, 1986], we estimated

the kernel size as σ2 = 0.01. A cross check with the pdf plot validated the efficacy of this

estimate. The plot is shown in Figure 4-5C. Note that all the clusters are well identified

for this particular kernel size. By correlating the points with their respective modes we

wish to segment this dataset into meaningful clusters.

With the tol level set at 10⁻⁶, the GMS algorithm converges at the 41st iteration. The segmentation result is shown in Figure 4-6A. Clearly GMS performs very well in


clustering the dataset into meaningful clusters. There are a total of 20 misclassifications (out of 500 points), which arise mostly due to the cluster with the largest spherical covariance matrix. Notice that this cluster is under-represented with just 50 points.

Further, due to the overlap of the 2σ_g contour of this cluster with the neighboring cluster, as shown in Figure 4-5B, these misclassifications are bound to occur. Another interesting mistake occurs at the top right corner, where 4 points belonging to a cluster are

misclassified and put as part of another highly concentrated cluster. These points lie in

the narrow valley bordering the two clusters and unfortunately their gradient directions

point toward the incorrect mode. But it should be appreciated that even for this complex

dataset with varying shapes of Gaussian clusters, GMS with the simplest solution of a single kernel size gives such a good result.

On the other hand, using stopping criterion 4–5, GBMS stops at the 18th iteration with the output shown in Figure 4-6B.

of multiple modes merging. It should be kept in mind that by defining the kernel size

σ2 = 0.01, we have selected the similarity measure for clustering and are looking for

spherical Gaussians with variance around this value. In this regard, the result of GBMS is

incoherent. On the other hand, the segmentation result obtained for GMS is much more

homogeneous and consistent with our similarity measure. Further, it is only in the case of

GMS algorithm that the modes estimated from the pdf directly translate into clusters. On

the contrary, for GBMS algorithm, it is not clear how the modes in Figure 4-5C correlate

with the clustering solution obtained in Figure 4-6B.

Figure 4-7 shows the average change in sample position for both algorithms. Notice the peaks in the GBMS curve corresponding to modes merging. This is a classic example where the assumption of two distinct phases in GBMS becomes fuzzy. By the 5th iteration, two of the modes have already merged, and by the 18th iteration a total of 5 modes are lost, giving rise to the poor segmentation result. In the case of GMS, on the other hand, the


Figure 4-6. Segmentation results of the RGC dataset. A) GMS algorithm. B) GBMS algorithm.

averaged norm distance steadily decreases and by selecting a tol level sufficiently low, we

are always assured of a good segmentation result.

Figure 4-7. Averaged norm distance moved by particles in each iteration (GMS vs. GBMS).

4.7 Image Segmentation

In this section, we extend our claims to real applications like image segmentation, where these techniques have been widely used. In the first part of this section, we compare the performance of the GMS and GBMS algorithms on the classic baseball game


image [Shi and Malik, 2000]. In the second part, we compare the GMS algorithm with spectral clustering, the current state of the art in the segmentation literature, on a wide range of popular images.

4.7.1 GMS vs GBMS Algorithm

We highlight the differences between GMS and GBMS by applying them to a real dataset. For this purpose, we use the famous baseball game image of the normalized cuts

paper by Shi and Malik [2000] shown in Figure 4-8. For computational reasons, the image

has been reduced to 110 × 73 pixels. This gray level image is transformed into a 3-dimensional feature space consisting of two spatial features, namely the x and y coordinates of the pixels, and the range feature, which is the intensity value at that location. Thus, the dataset consists of 8030 points in the feature space. In order to use an isotropic kernel, we scale the intensity values such that they fall in the same range as the spatial features, as

done in [M. A. Carreira-Perpinan, 2007]. All the values reported are in pixel units. Since

this is an image segmentation application, we use stopping criterion 4–4 for GMS. Further,

we set the tol level equal to 10−3 for both the algorithms throughout this experiment.
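As an illustration, this feature construction might look like the following minimal sketch (a hypothetical helper of our own, not the code used for the experiments; the scaling rule shown is one reasonable choice):

```python
import numpy as np

def image_to_features(img):              # img: 2-D array of gray values (rows x cols)
    rows, cols = img.shape
    y, x = np.mgrid[0:rows, 0:cols]      # spatial coordinates in pixel units
    intensity = img.astype(float)
    # rescale intensity so it spans roughly the same range as the larger spatial axis
    intensity *= max(rows, cols) / max(intensity.max(), 1e-12)
    return np.column_stack([x.ravel(), y.ravel(), intensity.ravel()])
```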

Figure 4-8. Baseball image.

We performed an elaborate multiscale analysis experiment where the kernel size σ was changed from a small value to a large value in steps of 0.5. We selected the best

segmentation result for both the algorithms for a particular number of segments. The

results are shown in Figure 4-9. The top row shows the segmentation result for 8 clusters.


Since the clusters are well separated for the respective kernel sizes, both GMS and GBMS

give very similar results. The interesting development occurs when we try to achieve fewer than 8 segments. Note that, for this image, the best number of segments is 5 to 6, as seen in the image itself. Many researchers have tried to do this using various

methods [M. A. Carreira-Perpinan, 2007; Shi and Malik, 2000].

Figures 4-9C and 4-9D show the GMS and GBMS results for 6 segments. Note the poor performance of GBMS. Instead of grouping similar objects into one segment, GBMS splits them and merges the pieces into two different clusters. The disc segment in the image was split into two, with one part merging with the player and the other with the bottom background. This is counterintuitive given that two of the coordinates of the feature space are spatial coordinates of the image. On the other hand, GMS clearly gives a very good segmentation result, with each segment corresponding to an object in the image. Further, a consistent hierarchical structure is seen in GMS. As we reduce the number of clusters, GMS first merges clusters of similar intensity that are close to each other, before merging similar-intensity clusters that are far apart. This is what we would expect for

this feature space. This results in a beautiful pattern in the image space where whole

objects which are similar are merged together in an intuitive manner. This phenomenon is

again observed as we move from 6 segments to 4 where GMS puts all the gray objects in

one cluster, thus putting together three full objects of similar intensity in one group.

Thus, starting from results for 8 segments that were very similar to each other, GMS and GBMS tread very different paths for lower numbers of segments. GMS neatly clusters objects in the image into different segments and hence comes very close to the human segmentation result. The different paths followed by the two algorithms result in completely different 2-level image segmentations, as shown in Figure 4-9.

4.7.2 GMS vs Spectral Clustering Algorithm

The aim here is to compare the performance of GMS algorithm with spectral

clustering which is considered the state of the art in image segmentation, and in turn


Figure 4-9. Baseball image segmentation using GMS and GBMS algorithms. The left column shows GMS results for various numbers of segments and the kernel size at which each was achieved; the right column shows the corresponding GBMS results. A) GMS: segments=8, σ = 11. B) GBMS: segments=8, σ = 10. C) GMS: segments=6, σ = 13. D) GBMS: segments=6, σ = 11.5. E) GMS: segments=4, σ = 18. F) GBMS: segments=4, σ = 13. G) GMS: segments=2, σ = 28.5. H) GBMS: segments=2, σ = 18.


validate the efficacy of the mean shift technique. There exist many variations of spectral clustering, but the one we found very convenient and robust is the method proposed by Ng et al. [2002]. For completeness, we briefly summarize this method below.

Given a set of points S = {s1, s2, . . . , sM} in R^d that we want to segment into K clusters:

1. Form an affinity matrix A ∈ R^{M×M} defined by A_ij = exp(−||si − sj||^2 / 2σ^2) if i ≠ j, and A_ii = 0.

2. Define D to be the diagonal matrix whose (i, i)-th element is the sum of A's i-th row, and construct the matrix L = D^{−1/2} A D^{−1/2}.

3. Find e1, e2, . . . , eK, the K largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix E = [e1, e2, . . . , eK] ∈ R^{M×K} by stacking the eigenvectors in columns.

4. Form the matrix Y from E by normalizing each of E's rows to have unit length, i.e. Y_ij = E_ij / (Σ_j E_ij^2)^{1/2}.

5. Treating each row of Y as a point in R^K, cluster the rows into K clusters via the K-means algorithm.

6. Finally, assign the original point si to cluster j if and only if row i of the matrix Y was assigned to cluster j.

The idea here is that if the clusters are well defined, then the affinity matrix has a clear

block structure. The dominant eigenvectors would then correspond to each block matrix

with 1’s in the rows corresponding to the block matrix and zeros in other regions. The

Y matrix would then result in projecting the clusters in orthogonal directions which can

be easily clustered using a simple algorithm like K-means. To eliminate the effects of

K-means initialization, we run K-means for 10 different initializations and then select

the result with the minimum distortion (which is the mean square error in the case

of K-means). Note that one needs to specify the number of clusters K a priori in this algorithm.
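For concreteness, these six steps could be transcribed roughly as follows (a minimal NumPy/scikit-learn sketch under our own naming; it is not the implementation used for the experiments reported here):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, K, sigma):
    # 1. Affinity matrix with zero diagonal
    d2 = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # 2. Normalized affinity L = D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # 3. K largest eigenvectors of the symmetric matrix L
    eigvals, eigvecs = np.linalg.eigh(L)
    E = eigvecs[:, -K:]
    # 4. Normalize the rows of E to unit length
    Y = E / np.linalg.norm(E, axis=1, keepdims=True)
    # 5-6. K-means on the rows of Y, best of 10 initializations
    return KMeans(n_clusters=K, n_init=10).fit_predict(Y)
```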


We start with the performance of spectral clustering on the RGC dataset. The

number of clusters K was set to 10 and the kernel size σ2 = 0.01 was set using Silverman’s

rule of thumb. The result is shown in Figure 4-10. Although both methods perform very

well, spectral clustering gives a better result with just 5 misclassifications. In the case of GMS, the number of misclassifications is 20, mainly arising in the low valley areas of the pdf estimate, which are generally tricky regions. It seems that by projecting the clusters onto

their dominant eigenvectors, spectral clustering is able to enhance the separation between

different clusters, leading to a better result. Nevertheless, it should not be forgotten that the number of clusters was automatically detected by the GMS algorithm, unlike in spectral clustering, where it had to be specified a priori.

Figure 4-10. Performance comparison of GMS and spectral clustering algorithms on the RGC dataset. A) Clustering solution obtained using the GMS algorithm. B) Segmentation obtained using the spectral clustering algorithm.

Selecting the number of segments in images a priori is a difficult task. Instead, we first gain an understanding of the problem through a multiscale analysis using the GMS algorithm. Since GMS naturally reveals the number of clusters at a particular scale, this analysis helps us ascertain the kernel size and the number of segments K to use when comparing with spectral clustering. In our experience, a good segmentation result is stable over a broad range of kernel sizes.


Figure 4-11. Bald Eagle image.

Figure 4-11 shows a full resolution image of a bald eagle. We downsampled this image to a 48 × 72 size using bicubic interpolation. This still preserves the segments very well while reducing the computation time. This step is important since both algorithms have O(N^2) complexity. In particular, the affinity matrix construction in spectral clustering becomes difficult to manage for more than 5000 data points due to memory issues. In order to get meaningful segments in images close to human accuracy, the perceived color differences should correspond to Euclidean distances between pixels in the feature space. One feature space that closely approximates perceptually uniform color spaces is the L*u*v space [Connolly, 1996; Wyszecki and Stiles, 1982]. Further, we added

the x and y coordinates to this feature space to take into account the spatial correlation.

Thus, we have 3456 data points spread across a 5-dimensional feature space.
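A rough sketch of this 5-dimensional feature construction is given below (assuming scikit-image for the L*u*v conversion; the helper name and conventions are ours, not the thesis implementation, and any relative scaling between color and spatial features is left out):

```python
import numpy as np
from skimage.color import rgb2luv        # assumed available; any RGB-to-L*u*v routine works

def color_image_to_features(img_rgb):    # img_rgb: (rows, cols, 3) array
    rows, cols, _ = img_rgb.shape
    luv = rgb2luv(img_rgb).reshape(-1, 3) # approximately perceptually uniform color features
    y, x = np.mgrid[0:rows, 0:cols]
    return np.column_stack([x.ravel(), y.ravel(), luv[:, 0], luv[:, 1], luv[:, 2]])
```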

The results obtained from the multiscale analysis are shown in Figure 4-12. Before plotting these segments, we applied a post-processing operation where segments with fewer than 10 data points were merged into the closest cluster. This is needed to eliminate spurious clusters arising from isolated points or outliers. In the case of the bald eagle image, we found only the segmentation result for kernel size σ2 = 9 to have such spurious clusters. This is clearly a result of selecting a low kernel size. As we increase the kernel size, the number of segments is drastically reduced, reaching 7 segments for both σ2 = 25

and σ2 = 36. Note the sharp and clear segments obtained. For example, the eagle itself


with its beak, head and torso is very well represented. One can also appreciate the nice hierarchical structure in this analysis, where previously disjoint but close clusters merge to form larger clusters in a very systematic way. For σ2 = 49, we obtain 5 segments, which is the broadest segmentation of this image. Beyond this we start losing important segments, as shown for σ2 = 64, where the beak of the eagle is lost.

Since both the 5- and 7-segment results look very appealing in the previous analysis, we show the comparison with spectral clustering for both. Figure 4-13 shows these results for σ2 = 25 and σ2 = 49, respectively. Spectral clustering performs extremely well, with clear, well-defined segments. Two things need special mention. First,

the segment boundaries are very sharp in spectral clustering compared to GMS results.

This is understandable since in GMS each data point is moved towards its modes and

in boundary regions there may be a clear ambiguity as to which peak to climb. This

localized phenomenon leads to a pixelization effect at the boundary. Second, it is surprising how the spectral clustering technique could depict the beak region so well. It should be remembered that the image used is a low resolution image, as shown in Figure 4-12A. A close observation shows that the region of intersection between the beak and the face of the

eagle has a color gradation. This is clearly depicted in the GMS segmentation result for

σ2 = 16 as shown in Figure 4-12C. Probably, by using the dominant eigenvectors one can

concentrate on the main component and reject other information. This could also explain

the clear crisp boundaries produced in Figure 4-13.

As a next step in this comparison, we selected two popular images shown in

Figure 4-14 from the Berkeley image database [Martin et al., 2001] which has been

widely used as a standard benchmark. The most important aspect of this database is that human segmentation results are available for all the images, which helps one compare the performance of one's algorithm. Once again we performed a multiscale analysis of these two images. Selecting the number of clusters is especially difficult in the case of the tiger image, as can also be seen in the human segmentation results.


Figure 4-12. Multiscale analysis of the bald eagle image using the GMS algorithm. A) Low resolution 48x72 image. B) σ2 = 9, segments=20. C) σ2 = 16, segments=14. D) σ2 = 25, segments=7. E) σ2 = 36, segments=7. F) σ2 = 49, segments=5. G) σ2 = 64, segments=4.


Figure 4-13. Performance of spectral clustering on the bald eagle image. A) σ2 = 25, segments=7. B) σ2 = 49, segments=5.

Figure 4-14. Two popular images from the Berkeley image database. A) Flight image. B) Tiger image.

Our analysis, depicted in Figure 4-15, shows good performance of the GMS algorithm. Both images are well segmented, especially the flight and the tiger regions. Based on this analysis, we selected 2 segments in the case of the flight image and 8 segments in the case of the tiger image. Using the same kernel size as in the GMS algorithm, we initialized the spectral clustering method. Figure 4-16 shows the performance of this algorithm. Again, spectral clustering gives a very good segmentation, especially of the flight image. In the case of the GMS algorithm, a portion of the flight region is lost due to the large kernel size and the oversmoothing effect on the pdf estimate. The results in the case of the tiger image are pretty much the same, though spectral clustering seems slightly better.

The case of the flight image is especially interesting. Ideally, one would want to use

a smaller kernel size if the flight region is of interest. This is seen in segmentation results

for GMS with σ2 = 16 in Figure 4-15C. But such a small kernel size would lead to many


Figure 4-15. Multiscale analysis of the Berkeley images using the GMS algorithm. A) 48x72 flight image. B) 48x72 tiger image. C) σ2 = 16, segments=15. D) σ2 = 16, segments=14. E) σ2 = 49, segments=4. F) σ2 = 25, segments=10. G) σ2 = 100, segments=2. H) σ2 = 36, segments=8.


Figure 4-16. Performance of spectral clustering on the Berkeley images. A) σ2 = 100, segments=2. B) σ2 = 36, segments=8.

clusters. Since it is very clear from the beginning that we need 2 clusters for this image, one can apply a post-processing stage where the closest clusters are recursively merged until only two clusters are left, as sketched below. Such a result is shown in Figure 4-17A for σ2 = 16. Note the improved and clear segmentation of the flight region. To be fair, we provide a comparison with spectral clustering for the same parameters in Figure 4-17C. Clearly, both perform very well, with a marked improvement in the performance of the GMS algorithm. A similar

comparison is shown for the tiger image with a priori selection of σ2 = 16 and number of

segments K = 8.
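The recursive merging just mentioned can be sketched as follows (our own illustrative helper, merging by centroid distance in the feature space; the thesis does not specify this exact criterion):

```python
import numpy as np

def merge_until(features, labels, n_target):
    labels = labels.copy()
    while len(np.unique(labels)) > n_target:
        ids = np.unique(labels)
        centroids = np.array([features[labels == i].mean(axis=0) for i in ids])
        # pairwise centroid distances; ignore the diagonal
        d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(np.argmin(d), d.shape)
        labels[labels == ids[b]] = ids[a]    # merge cluster b into cluster a
    return labels
```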

For more complicated images like natural scenery, we need to go beyond a single kernel size for all data points. One technique is to use a different kernel for each sample, estimated using the K nearest neighbor (KNN) method. Interested readers are advised to refer to Appendix C. Other adaptive kernel size techniques have been proposed in the literature, as in [Comaniciu, 2003], which have greatly improved the performance of the GMS algorithm, giving results similar to spectral clustering.

4.8 Summary

Clustering is a statistical data analysis technique with importance in pattern

recognition, data mining and image analysis. In this chapter we have introduced the mean shift algorithms, which have become increasingly popular in the image processing and vision

community to perform segmentation and tracking. With a new independent derivation

using information theoretic concepts, we have clearly elucidated the differences between


Figure 4-17. Comparison of GMS and spectral clustering (SC) algorithms with a priori selection of parameters. A new post-processing stage has been added to the GMS algorithm, where close clusters were recursively merged until the required number of segments was achieved. A) GMS: σ2 = 16, segments=2. B) GMS: σ2 = 16, segments=8. C) SC: σ2 = 16, segments=2. D) SC: σ2 = 16, segments=8.

the two variations namely GMS and GBMS algorithms. Since both these methods appear

as special cases of our novel unsupervised learning principle, we were not only able to show

the intrinsic connection between these techniques, but also the general relationship they

share with other unsupervised learning facets like principal curves and vector quantization.

With this new understanding a number of interesting results follow. We have shown

that GBMS directly minimizes Renyi’s quadratic entropy and hence is an unstable

mode finding algorithm. Since modes are not the fixed points of this cost function, any

stopping criterion would at most be heuristic. On the other hand, its stable counterpart, GMS, minimizes Renyi's "cross" entropy, giving the modes as stationary solutions. Thus, a new stopping criterion is to stop when the change in the cost function is small. We have corroborated this theoretical analysis with extensive experiments showing the superior performance of GMS over the GBMS algorithm. Finally, we have also compared the


performance of GMS with the state-of-the-art spectral clustering technique, highlighting the differences between the two methods and their inherent advantages.


CHAPTER 5
APPLICATION II: DATA COMPRESSION

5.1 Introduction

A second important aspect of our unsupervised learning principle is vector quantization, with applications in the compression of data and images. This field can be regarded as the art of compressing the information in data into a few code vectors for efficient storage and transmission [Gersho and Gray, 1991]. Inherent to every vector quantization technique, then, there should be an associated distortion measure to quantify the degree of compression. For example, in the simplest technique, the Linde-Buzo-Gray (LBG) algorithm [Linde et al., 1980], one tries to recursively minimize the sum of squared Euclidean distances between the data points and their corresponding winner code vectors. This can be summarized as shown below.

D_{LBG}(X, C) = \frac{1}{2} \sum_{i=1}^{N} \left\| x_i - c_{i^*} \right\|^2, \qquad (5-1)

where \{c_1, c_2, \ldots, c_K\} \in C are K code vectors with K \ll N and i^* = \arg\min_k \| x_i - c_k \|. This mean square distortion measure preserves the second order statistics of the data and

can be implemented using a simple learning rule where each code vector is updated as the mean of all data points for which it is the winner neuron. After reaching a stable state, if the K code vectors still give a distortion above the required level, these code vectors are split into two and used as the initialization for the next level of coding, until the required distortion is met.
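As a rough sketch (ours, not a reference implementation), the LBG loop just described could look like this:

```python
import numpy as np

def lbg(data, target_distortion, eps=1e-2, n_lloyd=50):
    codebook = data.mean(axis=0, keepdims=True)           # start from one code vector
    while True:
        for _ in range(n_lloyd):                          # Lloyd iterations
            d = np.linalg.norm(data[:, None] - codebook[None], axis=-1)
            winners = d.argmin(axis=1)
            for k in range(len(codebook)):                # mean of the points each vector wins
                if np.any(winners == k):
                    codebook[k] = data[winners == k].mean(axis=0)
        d = np.linalg.norm(data[:, None] - codebook[None], axis=-1)
        if np.mean(d.min(axis=1) ** 2) <= target_distortion:
            return codebook
        if len(codebook) >= len(data):                    # cannot split any further
            return codebook
        # split each code vector into two slightly perturbed copies
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
```

The iteration count and the split perturbation are arbitrary illustrative choices.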

In our context, an information theoretic criterion which can be seen as a distortion measure is the Cauchy-Schwarz pdf divergence Dcs(pX ||pS). We have shown in Theorem 4

that this corresponds to the extreme case of β →∞ in our learning principle. Here, X can

be considered as the code book and S as the original dataset whose information needs to

be preserved. If X is initialized to S, then Corollary 4.1 shows that the fixed point update

rule would give us back the original data S as the solution. This is obvious since the best

vector quantizer of a dataset without any constraint on the number of code vectors is the

data itself. Of course, this ideal scenario is not interesting if the goal is to compress the


data. In any vector quantization technique this is achieved by initializing X with far fewer points than S. Then Dcs(pX ||pS) > 0, and by minimizing this quantity the algorithm returns a codebook whose code vectors best model the data. Note that Dcs(pX ||pS) is a measure between the pdfs of X and S, and thus, by minimizing this quantity we ensure that the pdf of the codebook X is as close as possible to that of S. By directly preserving the pdf information, this technique goes beyond second order statistics and captures maximum information about the data. In light of this, we call this technique Information Theoretic

Vector Quantization (ITVQ) [Rao et al., 2007a].

5.2 Review of Previous Work

Previous work on ITVQ by Lehn-Schiøler et al. [2005] uses a gradient method to achieve the same goal. The gradient update is given by

x_k^{(\tau+1)} = x_k^{(\tau)} - \eta \, \frac{\partial}{\partial x_k} J(X), \qquad (5-2)

where \eta is the learning rate parameter and J(X) = D_{CS}(p_X \| p_S). Thus, the derivative \frac{\partial}{\partial x_k} J(X) equals the expression below, as was shown in Theorem 4:

\frac{\partial}{\partial x_k} J(X) = \frac{2}{V(X; S)} F(x_k; S) - \frac{2}{V(X)} F(x_k).

Unfortunately, as is true in any gradient method, the algorithm almost always gets stuck

in local minima. A common method used to overcome this problem is the simulated

annealing technique [Kirkpatrick et al., 1983; Haykin, 1999] where the kernel size σ is

annealed slowly from a large value to a small value. A large kernel size has the effect of oversmoothing the surface of the cost function, thus eliminating or suppressing most local minima and quickly accelerating the samples toward the vicinity of the global solution.

Reducing the kernel size would then allow the samples to reduce the bias in the solution

and effectively capture the true global minimum. Equation 5–3 shows how the kernel size

is annealed from a constant κ1 times the initial kernel σo with an annealing rate ζ. Here,


σn denotes the kernel size used at the nth iteration.

\sigma_n = \frac{\kappa_1 \sigma_o}{1 + \zeta \kappa_1 n} \qquad (5-3)

The problem with the gradient method is that to actually utilize the benefit of kernel

annealing, the step size also needs to be annealed, albeit with a different set of parameters.

This helps speed up the algorithm by using a step size proportional to the kernel size. As shown in Equation 5–4, the step size is annealed from a constant κ2 times the initial value ηo with an annealing rate γ, giving the step size at the n-th iteration, ηn, as shown below.

\eta_n = \frac{\kappa_2 \eta_o}{1 + \gamma \kappa_2 n} \qquad (5-4)
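For reference, Equations 5–3 and 5–4 translate directly into the following small helpers (parameter names follow the text; the default values shown are only the ones quoted later for the toy problem and are otherwise arbitrary):

```python
def kernel_size(n, sigma0, kappa1=1.0, zeta=0.05):
    # Equation 5-3: annealed kernel size at iteration n
    return kappa1 * sigma0 / (1.0 + zeta * kappa1 * n)

def step_size(n, eta0=1.0, kappa2=1.0, gamma=0.15):
    # Equation 5-4: annealed step size at iteration n
    return kappa2 * eta0 / (1.0 + gamma * kappa2 * n)
```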

As can be imagined, selecting these parameters can be a daunting task. Not only do

we need to tune these parameters finely, but also we need to take into consideration the

interdependencies between the parameters. For example, for a given annealing rate ζ of

the kernel size there is a narrow range of values for the annealing rate γ of the step size

which makes the algorithm work. Selecting values outside this range either makes the algorithm unstable or stops it prematurely due to a slow convergence process.

By doing away with the gradient method and using the fixed point update rule 3–9, we effectively solve this problem and at the same time speed up the algorithm tremendously. One of the easiest initializations for X is to randomly select N points from S with N << M (M is the number of samples in the original dataset S). Unfortunately, without kernel annealing, the problem of local minima persists (though it is not as severe as for the gradient method), with different initializations giving slightly different solutions. On the other hand, kernel annealing benefits the algorithm in two ways. First, it gives the samples enough freedom initially to quickly capture the important aspects of the pdf and speeds up the algorithm. Second, it makes the solution almost independent of the initialization, giving an identical result every time. We show this in our experiments


where we initialize X not from the dataset S but with random points anywhere in the vicinity

of this dataset.

In this chapter, we will present the quantization results obtained on an artificial

dataset as well as some real images. The artificial dataset shown here was used in

[Lehn-Schiøler et al., 2005]. A real application of compressing the edges of a face image

will be shown next. To distinguish between the two variations of the ITVQ method, we call our method ITVQ fixed point (ITVQ-FP) and the gradient method ITVQ-gradient. A comparison between ITVQ-FP, ITVQ-gradient and the standard Linde-Buzo-Gray (LBG) algorithm is provided and quantified. To compare with LBG, we also report the mean square quantization error (MSQE) defined in Equation 5–5. Finally, we present some image

compression results showing the performance of ITVQ-FP and LBG techniques.

\varepsilon = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - c_{i^*} \right\|^2, \qquad c_{i^*} = \arg\min_{c_j \in C} \| x_i - c_j \|^2. \qquad (5-5)
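In code, this error measure amounts to the following few lines (a direct NumPy transcription of Equation 5–5, using our own function name):

```python
import numpy as np

def msqe(data, codebook):
    # distance from every data point to every code vector; keep the winner
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)
    return np.mean(d.min(axis=1) ** 2)
```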

5.3 Toy Problem

This artificial dataset consists of two semicircles of unit radius interlocking with each other and perturbed by Gaussian noise of standard deviation 0.1. N = 16 random points were selected from a square in the middle of the data with −0.5 < X < 1.5 and −1 < Y < 1, as shown in Figure 5-1. In the gradient method the initial kernel was set as shown in Equation 5–6, with the diagonal entries being the variance along each feature component.

The kernel was annealed with parameters κ1 = 1 and ζ = 0.05. At no point was the kernel allowed to go below σo/√N. Note that these are the same parameters used in [Lehn-Schiøler et al., 2005].

\sigma_o = \begin{pmatrix} 0.75 & 0 \\ 0 & 0.51 \end{pmatrix} \qquad (5-6)
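A possible way to generate data of this kind is sketched below; the exact semicircle placement, sample count and random seed used in the thesis are not documented here, so these are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                      # points per semicircle (assumed)
t = rng.uniform(0.0, np.pi, n)
upper = np.column_stack([np.cos(t), np.sin(t)])               # first semicircle
lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])   # interlocking semicircle
S = np.vstack([upper, lower]) + 0.1 * rng.standard_normal((2 * n, 2))

# 16 random initial code vectors from the square -0.5 < x < 1.5, -1 < y < 1
X0 = np.column_stack([rng.uniform(-0.5, 1.5, 16), rng.uniform(-1.0, 1.0, 16)])
```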

The most difficult part of the gradient method is to select the step size and its annealing parameters to best suit the kernel annealing rates. The step size should be sufficiently small to ensure smooth convergence. Further, the step size annealing rate γ should be selected such that the step size at each iteration is well below the maximum allowed step size for the kernel size of the present iteration. After many trial and error runs, the following step size parameters were selected: ηo = 1, κ2 = 1 and γ = 0.15. Further, the step size was never allowed to go below 0.05.

Figure 5-1. Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution.

In the case of the fixed point method, for a fair comparison with its gradient counterpart, we select the same kernel initialization and associated annealing parameters. There is no step size or step size annealing in this case. To quantify the statistical variation, 50 Monte Carlo simulations were performed. It was ensured that the initialization was the same for both methods in every simulation. With these selected parameters, it was observed that both methods gave a good result almost every time. Nevertheless, the fixed point was found to be more consistent in its results. Figure 5-2A shows the plot of the cost function J(X) = Dcs(pX ||pS). Clearly, the fixed point algorithm is almost 10× faster than the gradient method and gives a better solution in terms of minimizing the cost function. Figure 5-2B shows one of the best results for the gradient and fixed point methods. A careful look shows some small differences which may explain the smaller J(X)


Figure 5-2. Performance of ITVQ-FP and ITVQ-gradient methods on the half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results.

obtained for the fixed point over the gradient method. Further, the fixed point has a lower MSQE of 0.0201, compared to 0.0215 for the gradient method.

5.4 Face Boundary Compression

Here, we present 64 point quantization results of the edges of a face image. Apart

from being an integral part of image compression, this technique also finds application

in face modeling and recognition. We would like to preserve the facial details as much as

possible, especially the eyes and ears which are more complex features.

We repeat the procedure of finding the best parameters for the gradient method.

After an exhaustive search we end up with the following parameters: κ1 = 1, ζ = 0.1,

ηo = 1, κ2 = 1, γ = 0.15. σo was set to a diagonal matrix with variance along each feature

component as the diagonal entries. The algorithm was initialized with 64 random points

inside a small square grid centered on the mouth region. As before, for the fixed point method the kernel parameters were set exactly as for the gradient method to allow a fair comparison.

As discussed earlier, annealing the kernel size in ITVQ-FP not only gives the samples

enough freedom to quickly move and capture the important aspects of the data, thus

speeding up the algorithm, but also makes the final solution more robust to random

initialization. This idea is best illustrated in Figure 5-3 where the initialization is done


with random points in the middle of the face and around the mouth region. Notice

how the code vectors spread beyond the face region initially with large kernel size and

immediately capture the broad aspects of the data. As the kernel size is decreased, the

samples model the fine details and give a very good solution.

Figure 5-4 shows the results obtained using ITVQ fixed point, ITVQ gradient

method and LBG. Two important conclusions can be made. First, among the ITVQ algorithms, the fixed point represents the facial features better than the gradient method. For example, the ears are very well modeled in the fixed point method. Perhaps the most important advantage of the fixed point over the gradient method is the speed of

convergence. We found that ITVQ fixed point was more than 5× faster than its gradient

counterpart. In image applications, where the number of data points (pixels) is generally

large, this translates into a huge saving of computational time.

Second, both ITVQ algorithms outperformed LBG in terms of facial feature representation. LBG uses many code vectors to model the shoulder region and few code vectors to model the eyes or ears. This is due to the fact that LBG uses only second order statistics, whereas ITVQ, due to its intrinsic formulation, uses all the higher order statistics to better extract the information from the data and allocate the code vectors to suit its structural properties. This also explains why the ITVQ methods perform very close to LBG from the MSQE point of view, as shown in Table 5-1. On the other hand, notice the poor performance of LBG in terms of minimizing the cost function J(X). Obviously, the LBG and ITVQ fixed point methods give the lowest values for their respective cost functions.

Table 5-1. J(X) and MSQE on the face dataset for the three algorithms, averaged over 50 Monte Carlo trials.

Method           J(X)      MSQE
ITVQ-FP          0.0253    3.355 × 10^-4
ITVQ-gradient    0.0291    3.492 × 10^-4
LBG              0.1834    3.05 × 10^-4



Figure 5-3. Effect of annealing the kernel size on ITVQ fixed point method.


Figure 5-4. 64 point quantization results of the ITVQ-FP, ITVQ-gradient and LBG algorithms on the face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm.


Figure 5-5. Two popular images selected for the image compression application. A) Baboon image. B) Peppers image.

5.5 Image Compression

One of the most popular applications of vector quantization is the area of image

compression for efficient storage, retrieval and transmission. We demonstrate here

the performance of ITVQ-FP on some popular images and compare those with LBG

technique. Figure 5-5 shows two images used in this section. The first is the popular

baboon image and the other is the peppers image available in Matlab. Due to the huge

size of these images, the computational complexity would be exorbitant. Therefore,

we selected only some portions of these images to work with as shown in Figure 5-6.

These portions were limited to at most 5000 pixel points for convenience and ease of

implementation.

We used the L*u*v feature space for this application and initialized the codebook as

random points selected from the dataset itself, depending on the compression level. Instead of using the simulated annealing technique, we performed 10 to 20 Monte Carlo simulations for both methods and selected the best result in terms of lowering their respective cost functions. We found that our algorithm was able to find good results every time with these Monte Carlo runs, and the results were similar to those generated using the simulated annealing technique. Different levels of compression were evaluated, starting from 80% and going as far as 99.75% compression in some images. We hand-picked the maximum level of


Figure 5-6. Portions of the Baboon and Peppers images used for image compression.

compression for both ITVQ-FP and LBG after comparing the reconstructed image to

the original image. Since ITVQ-FP has an extra parameter of kernel size, we ran the

algorithm for different kernel sizes in the range of σ2 = 16 to σ2 = 64 and selected the best

result.

Figure 5-7 shows the comparison between ITVQ-FP and LBG techniques. In the

case of images, it is hard to quantify the difference as such, unless this compression is

used as a preprocessing stage in some bigger application. Nevertheless, one can see some

small differences between these methods like the eye part of the baboon in Figures 5-7A

and 5-7B. Overall, both methods are fast, perform reliably and produce a very high level of compression suitable for many practical applications.

5.6 Summary

Vector quantization, the art of preserving the information of the data in a few code

vectors, is an important prerequisite in many applications. However, for this to be


Figure 5-7. Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A,B) 99.75% compression for both algorithms, σ2 = 36 for the ITVQ-FP algorithm. C,D) 90% compression, σ2 = 16. E,F) 95% compression, σ2 = 16. G,H) 85% compression, σ2 = 16.


practical, not only should the algorithm be fast but at the same time preserve as much

information as possible.

In this chapter, we have shown how information theoretic vector quantization arises

naturally as a special case of our general unsupervised learning principle. This algorithm

has a dual advantage. First, by using information theoretic concepts, we use all the higher

order statistics available in the data and hence preserve maximum possible information.

Second, by using the fixed point update rule, we substantially speed up the algorithm compared to the gradient method. At the same time, we circumvent the many parameters used in the gradient method, whose tuning is too heuristic to be applicable in any real-life scenario.


CHAPTER 6
APPLICATION III: MANIFOLD LEARNING

6.1 Introduction

In this chapter, we describe a final aspect of our learning theory called principal

curves with application prospects in the area of manifold learning and denoising.

The notion of principal curves was first introduced by Hastie and Stuetzle [1989] as a

nonlinear extension of principal component analysis (PCA). The authors describe them

as “self-consistent” smooth curves which pass through the “middle” of a d-dimensional

probability distribution or data cloud. The Hastie-Stuetzle (HS) algorithm attempts to minimize the average squared distance between the data points and the curve. In essence, the curve is built by finding the local mean along directions orthogonal to the curve. This

definition also uses two different smoothing techniques to avoid overfitting. An alternative

definition of principal curves based on the mixture model was given by Tibshirani [1992].

An EM algorithm was used to carry out the estimation iteratively.

A major drawback of Hastie’s definition is that it is not amenable to theoretical

analysis. This has prevented researchers from formally proving the convergence of the HS algorithm. The only known fact is that principal curves are fixed points of the algorithm. Recently, Kegl [1999] developed a regularized version of Hastie's definition. By constraining the total length of the curve to be fit to the data, one can prove the existence of a principal curve for every dataset with a finite second order moment. With

this theoretical model, Kegl derived a practical algorithm to iteratively estimate the

principal curve of the data. The Polygonal Line Algorithm produces piecewise linear approximations to the principal curve using a gradient based method. This has been applied

to hand-written character skeletonization to find the medial axis of the character [Kegl

and Krzyzak, 2002]. In general, one could find a principal graph of a given dataset using

this technique. Similarly, other improvements were proposed by Sandilya and Kulkarni

[2002] to regularize the principal curve by imposing bounds on the turns of the curve.


The common theme among all these methods is to initialize a line to the dataset and then recursively update its parametric form to minimize a given distortion measure. This parametric approach generates poor results as the complexity of the data manifold increases. Part of the reason is attributed to the initialization, which has no a priori information built in about the shape of the data. Further, with the gradient method used as the optimization procedure, this technique quite often gets stuck in local solutions. Could we instead extract the principal curve directly from the data? Since this information is contained in the data, with the data itself as the initialization, we are bound to capture the shape of the manifold much better through self organization of the samples.

In light of this background, consider once again our cost function which is reproduced

here for convenience.

J(X) = \min_X \; H(X) + \beta \, D_{CS}(p_X \| p_S).

We have seen that for β = 1 we get the modes of the data which effectively captures the

high concentration regions. What happens if we increase the value of β? From the cost

function, we see that as we increase this value, we increase the weight of the similarity

term Dcs(pX ||pS). By putting more emphasis on the similarity measure, the algorithm

would have to give something more (information) than just the modes. What we actually

see is that the algorithm returns a curve passing through the modes of the data. The reason for the modes to be part of the curve becomes clear when we write the cost function in the simplified form shown below.

J(X) = \min_X \; (1 - \beta)\, H(X) + 2\beta\, H(X; S).

The second term, which is Renyi's cross entropy, is the cost function of the GMS algorithm and is minimized at the modes of the data. Therefore, to minimize J(X), it is essential that the modes be part of the solution. On the other hand, since (1 − β) < 0, we are effectively maximizing H(X). To satisfy this and minimize J(X) overall, the data spreads along a curve connecting


the modes of the pdf. This satisfies our intuition since a curve through the modes of the

data has more information than just discrete points representing modes. As we continue

increasing the value of β, in the extreme case for β → ∞, we get back the data itself as

the solution capturing all possible information.
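To make the roles of the two terms concrete, the cost can be evaluated with the usual kernel estimators of Renyi's quadratic entropy and cross entropy, roughly as in the sketch below (our own transcription, not the thesis code; the Gaussian normalization constants, which only shift the entropies by additive constants, are omitted):

```python
import numpy as np

def info_potential(A, B, sigma):
    # mean of Gaussian kernels of size sigma*sqrt(2) over all pairs (A_i, B_j)
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.mean(np.exp(-d2 / (4.0 * sigma ** 2)))

def cost_J(X, S, sigma, beta):
    H_X = -np.log(info_potential(X, X, sigma))    # Renyi quadratic entropy of X
    H_XS = -np.log(info_potential(X, S, sigma))   # cross entropy between X and S
    return (1.0 - beta) * H_X + 2.0 * beta * H_XS
```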

It is interesting to note that the principal curve proposed here passes through the modes of the data, unlike other definitions, which pass through the mean of the projections of all the points orthogonal to the curve. This would then dictate a new maximum likelihood definition of principal curves. Since the idea of principal curves is to find the intrinsic lower dimensional manifold from which the data originated, we are convinced that the mode (or peak) along every orthogonal direction is much more informative than the mean. By capturing the local dense regions, one can imagine this principal curve as a ridge passing through the pdf contours. Such a definition was recently proposed by Erdogmus

and Ozertem [2007]. We briefly summarize the definition and the results obtained here for

our discussion.

6.2 A New Definition of Principal Curves

Let x ∈ Rn be a random vector and p(x) be its pdf. Let g(x) denote the transpose of

the gradient of this pdf and U(x) its local Hessian.

Definition 1. A point x is an element of the d-dimensional principal set, denoted by ρd, iff g(x) is orthogonal (null inner product) to at least (n − d) eigenvectors of U(x) and p(x) is a strict local maximum in the subspace spanned by these (n − d) eigenvectors.

Using this definition, Erdogmus proved the following important results.

• Modes of the data correspond to ρ0, i.e., the 0-dimensional principal set.
• ρd ⊂ ρd+1.

This immediately highlights the hierarchical structure in the data, with the modes, given by ρ0, being part of the 1-dimensional curve ρ1. By going through all the dense regions of

the data (the modes), ρ1 can be considered as a new definition of principal curve which

effectively captures the underlying 1-dimensional structure in the data.


Figure 6-1. The principal curve for a mixture of 5 Gaussians using a numerical integration method.

In order to implement this, Erdogmus first found the modes of the data using the Gaussian Mean Shift (GMS) algorithm. Starting from each mode and shooting a trajectory in the direction of the eigenvector corresponding to the largest eigenvalue¹,

one can then effectively trace out the curve. A numerical integration technique like

Runge-Kutta order 4 (RK4) could be utilized to get the next point on the curve starting

at the current point and moving in the direction of the eigenvector of the local Hessian

evaluated at that point. Figure 6-1 shows an example to clarify this idea.
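A schematic of this trajectory-tracing procedure is given below; it is a hedged sketch of our own (a numerical Hessian of a Gaussian kernel density estimate plus an RK4 stepper with sign alignment), not the exact implementation of Erdogmus and Ozertem [2007]:

```python
import numpy as np

def kde(x, data, sigma):
    return np.mean(np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2)))

def top_eigvec(x, data, sigma, h=1e-3):
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n) * h
    for i in range(n):                      # central-difference Hessian of the kde
        for j in range(n):
            H[i, j] = (kde(x + I[i] + I[j], data, sigma) - kde(x + I[i] - I[j], data, sigma)
                       - kde(x - I[i] + I[j], data, sigma) + kde(x - I[i] - I[j], data, sigma)) / (4 * h * h)
    w, V = np.linalg.eigh(H)
    return V[:, -1]                         # eigenvector of the largest eigenvalue

def trace_curve(mode, data, sigma, n_steps=100, dt=0.05):
    def f(x, ref):                          # direction field with a consistent sign
        v = top_eigvec(x, data, sigma)
        return v if np.dot(v, ref) >= 0 else -v
    x = np.asarray(mode, float)
    ref = top_eigvec(x, data, sigma)
    path = [x.copy()]
    for _ in range(n_steps):                # classic RK4 step along the direction field
        k1 = f(x, ref); k2 = f(x + 0.5 * dt * k1, k1)
        k3 = f(x + 0.5 * dt * k2, k2); k4 = f(x + dt * k3, k3)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        ref = k1
        path.append(x.copy())
    return np.array(path)
```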

The example shown here is a simple case with well defined Hessian values. In many practical problems, though, this approach suffers from ill conditioning and numerical issues. Further, the notion that a 1-dimensional structure exists for every dataset is not true. For example, for a T-shaped 2-dimensional dataset there does not exist a 1-dimensional manifold. All that is possible is a 2-dimensional denoised representation, which can be seen as a principal graph of the data. In the following section, we experimentally

show how our algorithm satisfies the same definition, extracting principal curves (or

¹ Note that the Hessian at the modes is negative semidefinite.


graphs) directly from the data but much more elegantly. Principal curves have immense

applications in denoising and as a tool for manifold learning. We will illustrate with some

examples how this can be achieved.

6.3 Results

Two examples are presented here. One is a spiral dataset which is a classic example

used in the literature of principal curves. The other is a “chain of rings” dataset showing

the denoising ability of principal curves.

6.3.1 Spiral Dataset

We will start with the example of the spiral data, which is considered a very difficult problem in the principal curves literature [Kegl, 1999]. The data consist of 1000 samples perturbed by Gaussian noise with variance equal to 0.25. We would like to point out that this is a much noisier dataset than the one actually used in [Kegl, 1999]. We use β = 2 for our experiment here, although we found that any value in the range 1 < β < 3 can be used. Figure 6-2 shows the different stages of the principal curve as it

evolves starting with initialization X = S where S is the spiral data.

Note how quickly the samples tend to the curve. By the 10th iteration the structure

of the curve is clearly revealed. After this, the changes in the curve are minimal with

the samples only moving along the curve (and hence always preserving it). What is even

more exciting is that this curve passes exactly through the modes of the data at the same scale σ, as shown in Figure 6-3. Thus our method naturally gives a principal curve which satisfies Definition 1. We also depict the development of the cost function J(X) and its two components H(X) and H(X; S) as a function of the number of iterations in Figure 6-4. Notice the quick decrease in the cost function due to the rapid collapse of the data to the curve. Further decrease is associated with small movements of the samples along the curve, and by stopping at this juncture we can get a very good result, as shown in Figure 6-2F.


Figure 6-2. The evolution of principal curves starting with X initialized to the original dataset S. The parameters were set to β = 2 and σ2 = 2. A) Starting initialization X = S. B) Iteration 2. C) Iteration 5. D) Iteration 10. E) Iteration 20. F) Iteration 30.



Figure 6-3. The principal curve passing through the modes (shown with black squares).

Figure 6-4. Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) Two components of J(X), namely H(X) and H(X; S).


Figure 6-5. Denoising ability of principal curves. A) Noisy 3D "chain of rings" dataset. B) Result after denoising.

6.3.2 Chain of Rings Dataset

Figure 6-5A shows a 3-dimensional “chain of rings” dataset corrupted with Gaussian

noise of unit variance. The parameters of the algorithm were set to β = 1.5 and σ2 = 1.

With fewer than 20 iterations, the algorithm was able to successfully remove the noise and extract the underlying "principal manifold," as can be seen in Figure 6-5B.

6.4 Summary

In this chapter, we have made an attempt to understand the principle of Relevant

Entropy as a tool for denoising and manifold learning. As the β value is increased beyond one, the modes start giving way to curves which characterize the data in a lower dimensional

space. We have shown through experiments that these results actually satisfy the new

definition of principal curves. By passing through the modes of the data, these curves not

only take into account the high density regions of the pdf, but also reveal the underlying

lower dimensional structure of the data. We see a strong parallel with biology and hope

that this is one step in the right direction to learn how the human brain assimilates data

efficiently.


CHAPTER 7
SUMMARY AND FUTURE WORK

7.1 Discussion and Conclusions

In this thesis, we have presented a novel framework for unsupervised learning.

The principle of Relevant Entropy is the first to address the fundamental idea behind

unsupervised learning, that of finding the underlying structure in the data. There are

many attributes and strengths which come along with this formulation. This is the first

theory which self-organizes the data samples to unravel hidden structures in the given dataset instead of imposing external structures. For example, there exists an entire field

called structural pattern recognition with numerous applications in the area of natural

language and syntactic grammar [Fu, 1974], optical character recognition [Amin, 2002],

time-series data analysis [Olszewski, 2001], image processing [Caelli and Bischof, 1997],

face and gesture recognition [Bartlett, 2001] and many more. Obviously, the type of

“structures” or “primitives” depends on the data at hand. This has been one of the

bottlenecks of this field. All the applications described so far are domain dependent and

use extensive knowledge of the problem at hand to construct the appropriate structures.

Creating a knowledge base is a difficult task, and in some applications, especially online documents or biological data, it is a costly affair. Further, in some applications like scene analysis, quantifying "structures" manually may be an ill-posed problem due to their variety and the multitude of combinations that can generate different pictures. A

self organizing technique to dynamically find appropriate structures from the data itself

would not only make knowledge bases redundant, but also lead to online and fast learning

systems. For a thorough treatment of this topic, we would direct the readers to Watanabe

[1985, chap. 10].

The notion of a "goal" is such a crucial aspect of unsupervised learning that one could call this field Goal Oriented Learning (GOL!). This approach is needed since

the learner only has the data available without any additional a priori information like


desired response or an external critic. Speaking in a broad sense, we humans do this in

our day-to-day life. There are many ways to assimilate information from the incoming sensory

inputs, but we only learn what serves our purpose (or goal). Referring back to Figure 1-2,

we see that our formulation is driven by the goal the machine seeks. This is modeled

through a parameter β which is under the control of the system itself. By allowing the

machine to influence the level of abstraction to be extracted from the data and in turn

capturing “goal oriented structures”, the Principle of Relevant Entropy addresses this

fundamental concept behind unsupervised learning.

Appropriate goals are needed for appropriate levels of learning. For example, in

a game of chess, the ultimate goal is to checkmate your opponent and win the game.

There are also intermediary goals like making the best move to remove the queen etc.

These correspond to higher level goals. A lower level vision task is to make such a move

happen. This involves processing the incoming vision signals and going through the

process of clustering (segregation of different objects), principal curves (denoising) and

vector quantization (compact representation) to assimilate the data. It is these lower

level goals that we have addressed in this thesis. Further, beyond this sensory processing

and cognition stage, one needs to convert this knowledge into appropriate actions and

motor actuators. This involves the process of reasoning and inference (especially inductive

inference, though deductive reasoning may be necessary to compare with past experiences)

which would generate appropriate control signals to drive actions. These actions in

the environment (like making your respective moves in the game of chess) would in

turn generate new data which needs to be further analyzed. Though our schematic in

Figure 1-2 shows this complete idea, we certainly do not intend to solve this entire gamut of machine learning problems, and we encourage readers to think along these lines to further this theory. An interesting aspect of this schematic is that reinforcement learning can be seen

as a special case of our framework where goal and the critical reasoning blocks are external

to the learner and controlled by an independent critic.

97

Page 98: c 2008 Sudhir Madhav Rao - University of Floridaufdcimages.uflib.ufl.edu/UF/E0/02/25/31/00001/rao_s.pdf · Sudhir Madhav Rao August 2008 Chair: Jos e C. Pr ‡ncipe Major: Electrical

Another aspect we would like to highlight is the nature of the cost function and its complexity. Notice that the entire framework is a function of just two parameters: the weighting parameter β, which controls the task (and hence the type of structure) to be achieved, and the inherent hidden scale parameter σ, which controls the resolution of our analysis. Thus, the parameters are minimal and have a clear purpose. We have deferred the discussion of kernel size until now. Most current approaches try to select a particular

kernel size to analyze the data. There exists extensive literature for such kernel size

selection techniques. Some examples include Silverman’s rule of thumb [Silverman, 1986], the maximum-likelihood (ML) technique and the K nearest neighbor (KNN) method. We have summarized these techniques in Appendix C for easy reference. Some of these techniques make extra assumptions to find a single suitable kernel size appropriate to the given dataset. For example, Silverman’s rule assumes that the data is Gaussian in nature. A more robust method would be to assign a different kernel size to each sample.

The KNN technique can yield such a result as discussed in Appendix C.

A different way to view the kernel size is as a parameter controlling the scale of the analysis. Since structures exist at different scales, it would be prudent to analyze the data at multiple scales before making an inference. This is certainly true in image and vision applications as well as for many biological datasets. From this angle, we see kernel

size as a strength of our formulation, encompassing multiscale analysis needed for effective

learning. This gives rise to an interesting perspective where one can see our framework

as analyzing the data in a 2-D plane. The two axes of this plane correspond to the two

continuous parameters β and σ. By extracting structures for different combinations of

(β, σ), we unfold the data in this two dimensional space to truly capture the very essence

of unsupervised learning. This idea is depicted in Figure 7-1. One can also see this

analysis as a “2D spectrum” of the data revealing all the underlying patterns, giving rise

to a strong biological connection.

98

Page 99: c 2008 Sudhir Madhav Rao - University of Floridaufdcimages.uflib.ufl.edu/UF/E0/02/25/31/00001/rao_s.pdf · Sudhir Madhav Rao August 2008 Chair: Jos e C. Pr ‡ncipe Major: Electrical

Figure 7-1. A novel idea of unfolding the structure of the data in the 2-dimensional space of β and σ.

The computational complexity of our algorithm is O(N²) in general, where N is the number of data samples. This compares favorably with other methods, particularly clustering techniques. For example, the recent method of spectral clustering [Shi and Malik, 2000] requires one to form the N × N Gram matrix (which requires huge memory, especially for images) and then find the K dominant eigenvectors. This would entail a theoretical cost of O(N³), which can be brought down to O(KN²) by using the symmetry of the matrix and recursively finding the top eigenvectors. A way to alleviate

the computational burden of our algorithm is to use the Fast Gauss transform (FGT) technique [Greengard and Strain, 1991; Yang et al., 2005]. By approximating the Gaussian with a Hermite series expansion, we could reduce the computational complexity to O(LpN), where L and p are parameters of the expansion (both very small compared to N). Another idea would be to employ resampling techniques: work with only M points (with M ≪ N) randomly selected from the data and allow only these points to converge.

For example, in GMS algorithm for clustering application, all we need to do is to ensure

that there is at least one point from this resampled set corresponding to each mode.

99

Page 100: c 2008 Sudhir Madhav Rao - University of Floridaufdcimages.uflib.ufl.edu/UF/E0/02/25/31/00001/rao_s.pdf · Sudhir Madhav Rao August 2008 Chair: Jos e C. Pr ‡ncipe Major: Electrical

Convergence of these points would then be sufficient to identify all the modes. A practical way to do this is to employ the bootstrapping technique from statistics and perform Monte Carlo simulations with resampling done at each experiment. This would bring down the cost to O(MN), M ≪ N, for each Monte Carlo run.
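To make this concrete, the following minimal Python/NumPy sketch (the function and variable names are our own, and the kernel size sigma and subset size M are assumptions the user must supply) iterates only M randomly resampled points against the full dataset using the Gaussian mean-shift style fixed-point update on which GMS is built, so each iteration costs O(MN) instead of O(N²).

import numpy as np

def resampled_mode_search(X, sigma, M=200, n_iter=50, seed=None):
    # Iterate only M resampled points against all N samples (O(MN) per iteration).
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Y = X[rng.choice(N, size=min(M, N), replace=False)].copy()  # points allowed to move
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # M x N squared distances
        K = np.exp(-d2 / (2.0 * sigma ** 2))                     # Gaussian affinities
        Y = (K @ X) / K.sum(axis=1, keepdims=True)               # fixed-point update
    return Y   # converged points; distinct locations indicate the modes

Repeating this over several Monte Carlo runs with a fresh resample each time, and merging points that converge to the same location, would recover the modes at a fraction of the full cost, as argued above.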

Information Theoretic Learning (ITL), developed by Príncipe and collaborators at CNEL, is the first theory able to break this barrier and address all four important issues of the unsupervised learning framework listed at the end of Chapter 1. Crucial to this success are the non-parametric estimators for information theoretic measures like entropy and divergence, which can be computed directly from the data. Further, the ability to derive simple and fast fixed-point iterative schemes creates a self organizing paradigm, avoiding problems like step size selection that are part of gradient methods. This is an important and crucial bottleneck which ITL solves elegantly. Lacking such an iterative scheme, many methods end up making Gaussian approximations and missing important higher order features in the data.

This thesis deals in general with spatial data under the assumption of iid samples. Such data occur in diverse fields including pattern recognition, image analysis and machine learning applications. There are other problems where the data has time structure; examples include financial data, speech and video signals, to name a few. We can also extract similar features from these time series (as done in speaker recognition) or perform a local embedding to create a feature space and still apply our method. We show two examples of this

approach in Appendix A and B 1 . There are also plenty of sophisticated methods like

Hidden Markov Models (HMMs), State Space Models (SSMs) etc. that are appropriate

for such time series data [Ghahramani, 2003]. A novel concept developed at CNEL called

Correntropy [Santamaria et al., 2006] also extracts time structures in the data using

1 Although the unsupervised learning techniques used in these sections are different from our algorithm.


kernel methods. Nevertheless, a good future direction to pursue is to bring in time in our

formulation. Of course, this is a whole new area and we encourage readers to take this up

as a Ph.D topic!!

7.2 Future Direction of Research

The field of machine learning, and in particular unsupervised learning, is at a

crossroad with some exciting developments taking place both in terms of theoretical

as well as practical applications. The advancements in brain science as well as the

engineering needs of the 21st century is only going to fuel such activity. For example, the

explosion of data due to the rapid growth of internet and mass communication is one

such area which has garnered particular attention in information sciences and computing.

One needs sophisticated machine learning techniques to sort these out and arrange them

so as to efficiently retrieve and transmit information. Due to advancements in data collection and storage techniques, other scientific areas like bioinformatics, astronomy, neuro-computing and image analysis are unable to cope with the exponential growth of data. This has led to a shift from traditional hypothesis-driven to data-driven science.²

By working directly with the data to unravel hidden structures, we believe our novel

framework is one such step in the right direction.

There are some particularly interesting areas in which we would like to pursue

our future work. One such area is the theoretical analysis of the cost function. The convergence of the fixed point update rule has been shown in this thesis for some special cases, but the general case seems especially difficult. Special attention is required in understanding the working of the principle beyond β = 1. In particular, rigorous

mathematical analysis is needed to quantify the concept of principal curves. We believe

we have something more than just principal curves here, but this would need a totally

2 http://cnls.lanl.gov/


different approach and substantial mathematical analysis. Work in this area would lead to

new applications while enhancing our understanding of this principle.

Another topic which is of interest is the connection to information bottleneck (IB)

method [Tishby et al., 1999]. There are two particular reasons we are interested in

pursuing this research. As pointed out earlier in Chapter 1, this technique tries to address meaningful or relevant information in given data S through another variable Y which is dependent on S. Inherent to this setup is the assumption that the joint density p(s, y) exists so that I(S;Y) > 0. This idea is depicted in Figure 7-2. In order to capture

this relevant information Y in a minimally redundant codebook X, one needs to minimize

the mutual information I(X;S) subject to maximizing the mutual information I(X;Y). This is framed as the minimization of the variational problem

L(p(x|s)) = I(X;S) − β I(X;Y), (7–1)

where β is a Lagrange multiplier. Most methods at this stage make a Gaussian assumption due to the difficulty of deriving an iterative scheme using the Shannon definition. But the authors, using ideas from rate distortion theory and in particular by generalizing the Blahut-Arimoto algorithm [Blahut, 1972], were able to derive a convergent re-estimation scheme. These iterative self-consistent equations work directly with the probability density, thus providing an alternative solution to capture relevant higher order information and avoid Gaussian approximations altogether.
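For reference only (this is our recollection of the scheme as stated in Tishby et al. [1999], not a derivation made in this thesis), the resulting self-consistent equations alternate between

p_t(x|s) = [p_t(x) / Z_t(s, β)] exp(−β D_KL[p(y|s) ‖ p_t(y|x)]),
p_{t+1}(x) = Σ_s p(s) p_t(x|s),
p_{t+1}(y|x) = Σ_s p(y|s) p_t(s|x),

where Z_t(s, β) is a normalization factor; iterating these updates converges to a locally optimal solution of (7–1).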

Instead of assuming an external relevant variable, the Principle of Relevant Entropy

extracts such relevant information directly from the data for different combinations of

(β, σ). Thus, we go one step further, by not just extracting one structure, but a range of

structures relevant for different tasks and at different resolutions. By doing so, we gather

maximal information to actively learn directly from the data. Additionally, working with

just two variables namely X and S, we simplify the problem with no extra assumption of

the existence of the joint density function p(s, y).


[Figure content: the variables S, X and Y, linked by the mappings p(x|s) and p(y|x), with the joint density p(s, y) and I(S;Y) > 0.]

Figure 7-2. The idea of information bottleneck method.

It should be pointed out that our method is specifically designed for unsupervised learning. The IB method has mostly been applied to supervised paradigms because of the availability of the extra variable Y, though some applications of IB to unsupervised problems like clustering and ICA have also been pursued. In particular, changing the value of the parameter β moves the analysis from coarse to very fine and has interesting implications for the parameters of our cost function. A similar cost function based on an energy-minimization principle for unsupervised learning has been proposed by Ranzato et al. [2007], and it would be important to study the connection to these methods further.

Finally, we would like to pursue research in the area of inference, data fusion and

active learning [Cohn et al., 1996]. Going back to our schematic in Figure 1-2, a constant

feedback loop exists between the environment and the learner. Thus, a learner is not a passive recipient of the data, but can actively sense and gather it to advance its learning. To do so, one needs to address the issue of inference and reasoning beyond the structural analysis of the incoming data. Lessons from reinforcement learning could also be used to


study actions resulting in both new data and rewards for the learner. This is a broad area

of research and has vast implications in many important fields.

To summarize, machine learning is an exciting field and we encourage readers as well

as critics to pursue research in this important area of science.


APPENDIX A
NEURAL SPIKE COMPRESSION

The work presented in this section specifically addresses the application of unsupervised learning techniques to time-series data. We have devised a new compression algorithm highly tailored to the needs of neural spike compression [Rao et al., 2007b]. Due to its simplicity, we were able to successfully implement this algorithm on a low power DSP processor for real-time BMI applications [Goh et al., 2008]. What follows is a

summary of this research.

A.1 Introduction

Brain Machine Interfaces (BMI) aim at establishing a direct communication pathway between the human or animal brain and external devices (prosthetics, computers) [Nicolelis, 2003]. The ultimate goal is to provide paralyzed or motor-impaired patients a mode of communication through the translation of thought into direct computer control. In this emerging technology, a tiny chip containing hundreds of electrodes is chronically implanted in the motor, premotor and parietal cortices and connected through wires to an external signal processor, where the signals are processed to generate control signals [Nicolelis et al., 1999; Wessberg et al., 2000].

Recently, the idea of wireless neuronal data transmission protocols has gained considerable attention [Wise et al., 2004]. Not only would this enable increased mobility and reduce the risk of infection in clinical settings, but it would also free experiments from cumbersome wired behavioral paradigms. Although this idea looks simple, a major bottleneck in implementing

it is the high constraints on the bandwidth and power imposed on these bio-chips. On

the other hand, to extract as much information as possible, we would like to transmit all

the electrophysiological signals for which the bandwidth requirement can be daunting.

For example, to transmit the entire raw digitized potentials from 32 channels sampled at

20 kHz with 16 bits of resolution we need a huge bandwidth of 10 Mbps.


Many solutions have been proposed to solve this problem. One solution is to perform spike detection on site and then transmit only the spike signal or the time at which the spike occurred [Bossetti et al., 2004]. An alternative is to use spike detection and sorting techniques so that binning (counting the number of spikes in a given time interval) can be done immediately [Wise et al., 2004]. The disadvantage of these methods lies in the weakness of current automated spike detection methods without human interaction, as well as the missed opportunity for any post processing since the original waveform is lost. It is in this regard that we propose compressing the raw neural potentials using well established vector quantization techniques as an alternative and viable solution.

Before we delve into neural spike compression, it is important to first understand the data at hand. We present here a synthetic neural dataset which was designed to emulate as accurately as possible the scenario encountered with actual recordings of neuronal activity. The waveform contains spikes from two neurons differing in both peak amplitude and width, as shown in Figure A-1A. Both neurons fired according to a homogeneous Poisson process with firing rates of 10 spikes/s (continuous line) and 20 spikes/s (dotted line). Further, to introduce some variability in the recorded template, each time a neuron fired, the template was scaled by a Gaussian distributed random number with mean 1 and standard deviation 0.01. Finally, the waveform was contaminated with zero mean white noise of standard deviation 0.05. An instance of the neural data is shown in Figure A-1B. Notice the sparseness of the spikes compared to the noise in the data. This is a peculiar characteristic of neural signals in general.

We show the two dimensional non-overlapping embedding of the training data in Figure A-2. Note the dense cluster near the origin corresponding to the noise, which constitutes more than 90% of the data. Further, since the spikes correspond to large amplitudes, the farther a data point is from the origin the more likely it is to belong to the spike region. It is these sparse points which we need to preserve as accurately as possible while compressing the neural information.


Figure A-1. Synthetic neural data. A) Spikes from two different neurons. B) An instance of the neural data.

Figure A-2. 2-D embedding of the training data, which consists of a total of 100 spikes with a certain ratio of spikes from the two different neurons.


Blindly applying traditional vector quantization techniques like Linde-Buzo-Gray (LBG) and Self Organizing Maps (SOM) to this data is bound to give a very low SNR for the spike region (which is of interest here), since these algorithms devote more code vectors to the denser regions of the data than to the sparse regions. This corresponds to modeling the noise regions of the data well at the cost of the spike regions. To correct this, a new algorithm called the self organizing map with dynamic learning (SOM-DL) was introduced [Cho et al., 2007]. Though it gives a slight improvement in the SNR of the spike region, the changes made were heuristic. Further, these algorithms are computationally intensive, have a large number of parameters to tune, and can only be executed offline.

In this appendix, we introduce a novel technique called the weighted LBG (WLBG) algorithm which effectively solves this problem. Using a novel weighting factor, we give more weight to the sparse regions corresponding to the spikes in the neural data, leading to a 15 dB increase in the SNR of the spike region while achieving a compression ratio of 150 : 1. The simplicity and the speed of the algorithm make it feasible to implement in real time, opening new doors of opportunity in online spike compression for BMI applications.

A.2 Theory

A.2.1 Weighted LBG (WLBG) Algorithm

The weighted LBG is a recursive implementation of the weighted K-means algorithm. The cost function optimized is the weighted L2 distortion measure between the data points and the codebook, as shown in (A–1):

D(C) = Σ_{i=1}^{N} w_i ‖x_i − c_{i*}‖², (A–1)

where c_{i*} is the nearest code vector to data point x_i, as given in (A–2):

c_{i*} = arg min_{c_j ∈ C} ‖x_i − c_j‖². (A–2)


Step 1: Specify the maximum distortion D_max and the maximum number of levels L_max.
Step 2: Initialize L = 0 and the codebook C as a random point in R^d.
Step 3: Set M = 2^L. Start the optimization loop:
  a. For each x_i ∈ X, find the nearest code vector c_{i*} ∈ C using (A–2).
  b. Update the code vectors in the codebook using (A–3), where the sum is taken over all data points for which c_j is the nearest code vector:
     c_j = Σ_{k: c_{k*} = c_j} w_k x_k / Σ_{k: c_{k*} = c_j} w_k. (A–3)
  c. Measure the new distortion D(C) as shown in (A–1). If D(C) ≤ D_max, go to Step 5; otherwise continue.
  d. Go back to (a) unless the change in the distortion measure is less than δ.
Step 4: If L = L_max, go to Step 5. Otherwise set L = L + 1, split each point c_j ∈ C into two points c_j + ε and c_j − ε, and go back to Step 3.
Step 5: Stop the algorithm. Return C, the optimized codebook.

Figure A-3. The outline of the weighted LBG algorithm.

Consider a dataset X = {x_1, x_2, . . . , x_N} ∈ R^d. Let C = {c_1, c_2, . . . , c_M} denote the codebook to be found. The outline of the algorithm is shown in Figure A-3. Both δ and ε are set to very small values; typical values are δ = 0.001 and ε = 0.0001·[1, −1, −1, . . .], where the d-dimensional vector has entries chosen randomly as 1 or −1. This recursive splitting of the codebook has two advantages over the direct K-means method.

• Firstly, there is no need to specify the exact number of code vectors. In most real applications, the maximum distortion level D_max is known. The LBG algorithm starts with one code vector and recursively splits it so that D(C) ≤ D_max.

• Secondly, the recursive splitting effectively avoids the formation of empty clusters, which is very common in K-means.

Novel weighting factor. Since the spikes correspond to large amplitudes, the farther a data point is from the origin the more likely it is to belong to the spike region. Further, information at the tip of the spike should be modeled well, since the amplitude of the spike is an important feature in spike sorting. Thus, to reconstruct spike information as accurately as possible, we need to give more weight to the points far


from the origin. Thus, we select the weighting for our algorithm as shown below:

w_i = ‖x_i‖²  if ‖x_i‖² ≥ τ,
w_i = τ       if ‖x_i‖² ≤ τ, (A–4)

where τ is a small constant that prevents the weighting from going to zero. Though an arbitrary choice of τ would do, we can make an intelligent selection. Note that we can estimate the standard deviation σ of the noise from the data, which corresponds to the dense Gaussian cluster at the origin in Figure A-2. Since 2σ corresponds to the 95 percent confidence interval of the Gaussian noise, we can set τ = (2σ)², giving the same weight to all points belonging to the Gaussian noise. For a higher dimensional embedding like L = 5 or L = 10, one can use τ = (√L σ)², which ensures a tighter cluster boundary. In our experiment σ = 0.05 and so we set τ = 0.01.
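The following Python/NumPy sketch puts the pieces of Figure A-3 together using the weighting factor of (A–4); it is an illustrative implementation of ours, not the DSP code, and the parameter choices (tau, D_max, L_max, delta, eps) follow the values quoted in the text.

import numpy as np

def wlbg(X, tau, D_max, L_max=4, delta=1e-3, eps=1e-4, seed=None):
    # Weighted LBG: recursive splitting of a weighted K-means codebook (Figure A-3).
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    w = np.maximum((X ** 2).sum(axis=1), tau)          # weighting factor (A-4)
    C = X[rng.integers(N)][None, :].copy()             # Step 2: one random code vector
    for level in range(L_max + 1):                     # levels L = 0 .. L_max
        prev_D = np.inf
        while True:
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            nearest = d2.argmin(axis=1)                # Step 3a: nearest code vector (A-2)
            for j in range(len(C)):                    # Step 3b: weighted centroids (A-3)
                mask = nearest == j
                if mask.any():
                    C[j] = (w[mask, None] * X[mask]).sum(axis=0) / w[mask].sum()
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            D = (w * d2.min(axis=1)).sum()             # Step 3c: weighted distortion (A-1)
            if D <= D_max:
                return C                               # Step 5: distortion target met
            if prev_D - D < delta:                     # Step 3d: stop when change < delta
                break
            prev_D = D
        if level == L_max:
            break
        perturb = eps * rng.choice([-1.0, 1.0], size=C.shape)
        C = np.vstack([C + perturb, C - perturb])      # Step 4: split every code vector
    return C

With the 2-D embedding of the training data, τ = 0.01 and L_max = 4, this recursion stops at 16 code vectors, matching the codebook size used in the experiments below.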

A.2.2 Review of SOM and SOM-DL Algorithms

The self organizing map (SOM) is based on competitive learning. The goal is to learn the nonlinear mapping between the data in the input space and a one- or two-dimensional fully connected lattice of neurons in an adaptive and topologically ordered fashion [Haykin, 1999]. Each processing element (PE) in the lattice of M PEs has a corresponding synaptic weight vector with the same dimensionality as the input space. At every iteration, the synaptic weight closest to each input vector x_k is found as shown in (A–5):

i* = arg min_{1≤i≤M} ‖x_k − w_i‖. (A–5)

Having found the winner PE for each xk, a topological neighborhood is determined

around the winner neuron. The weight vector of each PE is then updated as

w_{i,k+1} = w_{i,k} + η_k Λ_{i,k}(x_k − w_{i,k}), (A–6)

where η_k ∈ [0, 1] is the learning rate. The topological neighborhood is typically defined as Λ_{i,k} = exp(−‖r_i − r_{i*}‖² / (2σ_k²)), where ‖r_i − r_{i*}‖ represents the Euclidean distance in the


output lattice between the ith PE and the winner PE. Notice that both the learning rate (η_k) and the neighborhood width (σ_k) are time dependent and are normally annealed for best performance.

When applying the SOM to neural data, it was found that most of the PEs were used to model the noise rather than the spikes in the data. This is typical of any neural recording, which generally has a sparse number of spikes. In order to alleviate this problem and to move PEs from the low amplitude region of the state space to the regions corresponding to the spikes, the following update rule was proposed:

w_{i,k+1} = w_{i,k} + µ Λ_{i,k} sign(x_k − w_{i,k})(x_k − w_{i,k})². (A–7)

This was called the self organizing map with dynamic learning (SOM-DL) [Cho et al., 2007]. By accelerating the movement of the PEs toward the spikes, the SOM-DL represents the spikes better. But for good performance, careful tuning of the parameters is important. For example, it was experimentally verified that a µ between 0.05 and 0.5 balances fast convergence against small quantization error for the spikes. Further, it is well known that SOM-based algorithms are computationally very intensive.
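To make the contrast between (A–6) and (A–7) concrete, here is a minimal Python/NumPy sketch of a single input presentation; the function name, the lattice coordinate array R and all parameter values are our own illustrative assumptions.

import numpy as np

def som_updates(x, W, R, eta, mu, sigma_k):
    # x: input vector (d,), W: PE weights (M, d), R: lattice coordinates of the M PEs.
    winner = np.argmin(((x - W) ** 2).sum(axis=1))                   # winner PE (A-5)
    lam = np.exp(-((R - R[winner]) ** 2).sum(axis=1) / (2 * sigma_k ** 2))
    diff = x - W
    W_som = W + eta * lam[:, None] * diff                            # SOM update (A-6)
    W_somdl = W + mu * lam[:, None] * np.sign(diff) * diff ** 2      # SOM-DL update (A-7)
    return W_som, W_somdl

The squared-difference term is the dynamic-learning modification that, per the description above, is intended to move the PEs toward the sparse spike region of the state space.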

A.3 Results

In this section, we present the results obtained by WLBG on the neural spike data

using the novel weighting factor developed in the previous section and compare it with

results obtained from SOM-DL and SOM.

A.3.1 Quantization Results

Figure A-4 shows the 16 point quantization obtained using WLBG on the training data. As can be seen, more code vectors are used to model the points far away from the origin even though they are sparse. This helps to code the spike information in greater detail and hence minimize reconstruction errors. On the other hand, SOM-DL wastes a lot of points in modeling the noise cluster, as shown in Figure A-5. Further, not only does SOM-DL have a large number of parameters which need to be fine tuned for optimal


Figure A-4. 16 point quantization of the training data using the WLBG algorithm.

Figure A-5. Two-dimensional embedding of the training data and code vectors in a 5 × 5 lattice obtained using the SOM-DL algorithm.

performance, but it also takes an immense amount of time to train the network, making it suitable only for offline training.

We test this on a separate test dataset generated to emulate a real neural spike signal. A small region is highlighted in Figure A-6, which shows the comparison between the original and the reconstructed signal. Clearly, the weighted LBG does a very good job of preserving


Figure A-6. Performance of the WLBG algorithm in reconstructing spike regions in the test data.

spike features. Also notice the suppression of noise in the non-spike region. This denoising ability is one of the strengths of this algorithm and is attributed to the novel weighting factor we selected.

We report the SNR obtained using WLBG, SOM-DL and SOM in Table A-1. As can be seen, there is a huge increase of 15 dB in the SNR of the spike regions of the test data compared to SOM-DL, which only marginally improves the SNR over SOM. Obviously, by concentrating more on the spike region, our performance on the non-spike region suffers, but the decrease is negligible compared to SOM-DL. It should be noted that good reconstruction of the spike region is of utmost importance, and hence the only measure which should be considered is the SNR in the spike region. Further, the result reported here for WLBG is for 16 code vectors, which is far fewer than the 25 code vectors (5 × 5 lattice) of the SOM-DL and SOM algorithms.

Table A-1. SNR of the spike region and of the whole test data obtained using the WLBG, SOM-DL and SOM algorithms.

SNR of             WLBG       SOM-DL     SOM
Spike region       31.12 dB   16.8 dB    14.6 dB
Whole test data    8.12 dB    8.6 dB     9.77 dB


A.3.2 Compression Ratio

We quantify here the theoretical compression ratio achievable using the codebook generated by WLBG. In order to do so, we use a test dataset consisting of 5 seconds of spike data sampled at 20 kHz and digitized to 16 bits of resolution. We use this to measure the firing rate of the code vectors. Figure A-7 shows the firing probability of the WLBG codebook. Code vector 16 models the noisy part of the signal and hence fires most of the time. It should be noted that, in general, neural data has a very sparse number of actual spikes. The probability values for the code vectors are given in Table A-2.

Figure A-7. Firing probability of the WLBG codebook on the test data.

The entropy of this distribution is

H(C) = −Σ_{i=1}^{16} p_i log(p_i) = 0.2141.

From information theory, we know that this is a lower bound on the average number of bits needed to represent the codebook. Thus, with good coding like arithmetic codes we can come very close to this optimal value. Since we are using a 2-D non-overlapping embedding of the signal sampled at 20 kHz, the rate needed to transmit the data is (20k/2) × 0.2141 = 2.141 kbps. If the data had been transmitted without any compression


Table A-2. Probability values obtained for the code vectors and used in Figure A-7.

Code Vector    Probability
1              0.0007
2              0.0017
3              0.0007
4              0.0014
5              0.0003
6              0.0007
7              0.0010
8              0.0193
9              0.0003
10             0.0018
11             0.0007
12             0.0045
13             0.0002
14             0.0015
15             0.0006
16             0.9647

then the rate needed would be 20k × 16 = 320 kbps. Thus we achieve a compression ratio of 150 and at the same time maintain a 32 dB SNR in the spike region. Further, on real datasets, where a 10-D embedding is generally used, the compression ratio would increase to 750 with only 428 bps needed to transmit the data. This is a significant achievement and would help alleviate the bandwidth problem faced in transmitting data in BMI experiments.
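The back-of-the-envelope numbers above can be reproduced with a few lines of Python; the probabilities are those of Table A-2, and the 0.2141 figure corresponds to a natural-logarithm entropy, which we follow here.

import numpy as np

p = np.array([0.0007, 0.0017, 0.0007, 0.0014, 0.0003, 0.0007, 0.0010, 0.0193,
              0.0003, 0.0018, 0.0007, 0.0045, 0.0002, 0.0015, 0.0006, 0.9647])

H = -(p * np.log(p)).sum()            # codebook entropy, approx. 0.214
fs, bits, embed = 20_000, 16, 2       # sampling rate, resolution, embedding dimension

raw_rate = fs * bits                  # 320 kbps without compression
coded_rate = (fs / embed) * H         # approx. 2.14 kbps with near-optimal coding
print(raw_rate / coded_rate)          # compression ratio, approx. 150:1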

A.3.3 Real-time Implementation

This simple inner product algorithm has recently been successfully implemented on a low power digital signal processor (DSP) with the aim of wirelessly transmitting raw neuronal recordings. The hardware platform is a Pico DSP [Cieslewski et al., 2006] that couples with the University of Florida’s existing technology, the Pico Remote (a battery powered neural data processor and wireless transmitter) [Cheney et al., 2007]. This board consists of a DSP from Texas Instruments (TMS320F28335), a low power Altera Max II CPLD, and a Nordic Semiconductor ultra-low-power transceiver (nRF24L01).

The Pico DSP can provide up to 150 MIPS and can operate for nearly 4 hours in low


Figure A-8. A block diagram of the Pico DSP system (Pico Remote, CPLD with FIFO buffers and a state machine, DSP, SPI and wireless module).

power modes. This system is designed to have a maximum of 8 Pico Remotes, each with a

maximum of 8 channels (20 kHz, 12 bits of resolution) resulting in a total of 64 channels of

neural data. Figure A-8 shows the block diagram of this system.

The signal is sampled in real time using the A/D converters in the Pico Remotes. All these daughterboards are connected to the CPLD, which buffers the data and controls the flow to the DSP. The neural data captured by the DSP are then encoded using the code vectors generated by the WLBG algorithm. When the wireless buffer is full of compressed data, it is transmitted wirelessly using the DSP’s onboard Serial Peripheral Interface (SPI), which is connected to the wireless transceiver. A brief summary of the results follows. This is ongoing work, and readers who are interested should refer to [Goh et al., 2008].

Two different neural signals were used to test this architecture. The first is generated

by a Bionic Neural Simulator and has an ideal signal to noise ratio (SNR) of 55 dB.

The second signal is a real neuronal recording from a microwire electrode chronically


implanted in layer V of a rat motor cortex. This signal is more heterogeneous, has a lower

SNR (23 dB), and is representative of an average chronic neural recording. With the

simulated signal (with high SNR) and 32 code vectors, we were able to achieve a very

high compression ratio of 184 : 1 with a 15-D embedding. On the other hand, due to the severe noise present in the in-vivo signal, a compression ratio of only 10 : 1 was possible using a 2-D embedding and 64 code vectors. Nevertheless, the reconstructed signal modeled the spike information extremely well. This real-time implementation is unique in terms of the tradeoff between power and bandwidth requirements for BMI experiments and surpasses the previous technology both in efficiency and performance.

A.4 Summary

We have proposed the weighted LBG as an excellent compression algorithm for neural spike data, with emphasis on good reconstruction of the spikes. The following are the salient features of this algorithm compared to SOM-DL.

• A 15 dB increase in the SNR of the spike regions in the data.

• A smaller and more efficient codebook, achieving a compression ratio of 150 : 1.

• A many-fold increase in speed: WLBG takes less than 1 second on a machine with a Pentium IV and 512 MB RAM, versus more than 15 minutes for SOM-DL.

• There are no parameters to tune in WLBG, unlike the SOM algorithms, which have a step size, a neighborhood kernel size and their annealing schedules, all of which need to be properly tuned. The only variables to be defined in WLBG are the weights w_i, which have a clear interpretation for the application at hand.

• A real-time implementation of this algorithm is possible because its computations are based on inner products, unlike the SOM algorithms. This has been demonstrated using the Pico DSP system for a single channel, with promising results.

There are a number of ways to build upon this work. We plan to construct an efficient k-d tree search algorithm for the codebook, taking into account the weighting factor and the probability of the code vectors. This would further speed up the algorithm. We would also like to use advanced encoding techniques like entropy coding and achieve a bit


rate as close as possible to the theoretical value. Finally, we are pursuing a real-time

implementation of this algorithm to wirelessly transmit all the 64 channels of the neural

data which would lead to significant progress in BMI research.


APPENDIX B
SPIKE SORTING USING CAUCHY-SCHWARTZ PDF DIVERGENCE

The research summarized here proposes a new method of clustering neural spike

waveforms for spike sorting [Rao et al., 2006b]. After detecting the spikes using a

threshold detector, we use principal component analysis (PCA) to get the first few PCA

components of the data. Clustering on these PCA components is achieved by maximizing the Cauchy-Schwartz PDF divergence measure, which uses the Parzen window method to non-parametrically estimate the pdf of the clusters. We provide a comparison with other clustering techniques used in spike sorting, like K-means and the Gaussian mixture model, showing the superiority of our method in terms of classification results and computational complexity.

B.1 Introduction

The ability to identify and discriminate the activity of single neurons from neural

ensemble recordings is a vital and integral part of basic neuroscience research. By tracking

the modulation of the fundamental constituents of the nervous system, neurophysiologists

have begun to formulate the basic constructs of how systems of neurons interact and

communicate [Fetz, 1992]. Recently, this knowledge of neurophysiology has been applied

to Brain Machine Interface (BMI) experiments where multielectrode arrays are used

to monitor the electrical activity of hundreds of neurons from the motor, premotor,

and parietal cortices [Wessberg et al., 2000]. In multielectrode BMI experiments,

experimenters are faced with the labor intensive task of analyzing each of the extracellular

recordings for the signatures of electrical activity related to the neurons surrounding the

electrode tip. Separating these different neural sources, a process called “spike sorting”, helps the neurophysiologist study and infer the role played by each individual neuron with respect to the experimental task at hand.

Spike sorting is based upon the property that every neuron has its own characteristic

“spike” shape which is dependent on its intrinsic electrochemical dynamics as well as


the position of the electrode with respect to the neuron. The key is to separate these

spikes from the background noise and use features in each of the shapes to discriminate

different neurons. The analysis of this non-stationary signal is made more difficult by

the fluctuating effects of the electrode-tissue interface which is affected by movement,

glial encapsulation, and ionization [Sanchez et al., 2005]. To overcome these challenges,

many signal processing and machine learning techniques have been successfully applied

which are well summarized in [Lewicki, 1998]. Modern methods use multiple electrodes

and sophisticated techniques like Independent Component Analysis (ICA) to address the

issue of overlapping and similarity between spikes [Brown et al., 2004]. Nevertheless, in

many cases, good single-unit activity can be obtained with a simple hardware threshold

detector [Wheeler, 1999]. After detection, the classification is done using either template

matching or clustering of the principal components (PCA) of the waveforms [Lewicki,

1998; Wood et al., 2004]. The advantage with the PCA of the spike waveforms is that

it dynamically exploits differences in the variance of the waveshapes to discriminate and

cluster neurons.

A common clustering algorithm which is used extensively on the PCA of the

waveforms is the ubiquitous K-means [MacKay, 2003]. For spike sorting, the K-means

algorithm always clusters neurons, but there is no guarantee that it converges to

the optimum solution leading to incorrect sorts. The result depends on the original

cluster centers (the random initialization problem) as well as the fact that K-means

assumes hyper spherical or hyper ellipsoidal clusters. Researchers have employed other

clustering techniques to overcome this problem. Lewicki [1994] used Bayesian clustering

to successfully classify neural signals. The advantage of using Bayesian framework is

that it is possible to quantify the certainty of the classification. Recently, Hillel et al.

[2004] extended this technique by automating the process of detecting and classification.

Other clustering techniques like Gaussian Mixture Model (GMM) [Duda et al., 2000]

and Support Vector Machines (SVM) [Vogelstein et al., 2004] have also been applied


successfully. The tradeoff of simplicity for accuracy results in techniques suffering from high computational cost, and consequently they are not very suitable for online classification of neural signals in low-power devices. Further, model order selection is a difficult task in both GMM and Bayesian clustering. For these reasons, K-means is still used extensively for its simplicity and ease of use.

In this section, we propose the Cauchy-Schwartz PDF divergence measure for clustering neural spike data. We show that this method not only yields superior results to K-means but is also computationally less expensive, with O(N) complexity for classifying online test samples.

B.2 Data Description

B.2.1 Electrophysiological Recordings

Extracellular cortical neuronal activity was collected using 50 µm tungsten microelectrodes from behaving animals. A neural recording system sampling at 24,414.1 Hz was used to digitize the extracellular analog voltages with 16 bits of resolution. To emphasize the frequencies contained in action potentials, the raw waveforms were bandpass filtered between 300 Hz and 6 kHz. Representative microelectrode recordings are shown in

Figure B-1. Here, the action potentials from two neurons can be identified (with asterisks)

by the differences in amplitude and width of the waveshapes.

B.2.2 Neuronal Spike Detection

The voltage threshold was set by the experimenter using Spike 2 (CED, UK) through

visual inspection of the digitized time series. For the example given in Figure B-1, a

threshold of 25µV is sufficient to detect each of the two waveshapes. A set of unique

waveshapes were constructed from the thresholded waveforms based upon the width which

was measured −0.6ms to the left and 1.5ms to the right of the threshold crossing. Using

electrophysiological parameters (amplitude and width) of the spike, artifact signals (e.g.,

electrical noise, movement artifact) were removed. The peak-to-peak amplitude, waveform

shape, and interspike interval (ISI) were then evaluated to ensure that the detected spikes


Figure B-1. Example of extracellular potentials from two neurons.

had a characteristic and distinct waveform shape when compared with other waveforms

in the same channel. Next, the first ten principal components (PC) were computed from

all of the detected spike waveforms. Of all the PCs, only the first two eigenvalues were greater than 1, and these captured the majority of the variance.¹ This feature space was

found to be sufficient to discriminate between the two neurons. The first two PCs are

plotted in Figure B-2 where two overlapping clusters of points correspond to each of the

detected waveshapes. The challenge here is to cluster each of the neurons using automated

techniques.

B.3 Theory

We provide a brief summary of the non-parametric clustering method based on the Cauchy-Schwartz (CS) pdf divergence measure. For a detailed treatment of this topic, readers are advised to refer to the Ph.D. dissertation of Jenssen [2005].

B.3.1 Non-parametric Clustering using CS Divergence

Let p(u) and q(u) denote the probability density functions of two random variables

X and Y respectively. Since these functions are square integrable and are also strictly

1 The first three PCs accounted for more than 95% of the variance.


Figure B-2. Distribution of the two spike waveforms along the first and the second principal components.

non-negative, by the Cauchy-Schwartz inequality the following condition holds:

(∫ p(u) q(u) du)² ≤ ∫ p²(u) du ∫ q²(u) du.

Using this inequality, a divergence measure can then be expressed as

D_cs(p‖q) = −log J_cs(p, q),   J_cs(p, q) = ∫ p(u) q(u) du / √( ∫ p²(u) du ∫ q²(u) du ). (B–1)

Note that Jcs(p, q) ∈ (0, 1] and hence the divergence measure Dcs(p||q) ≥ 0 with equality iff

the two pdfs are the same.

Given the sample populations {x_1, x_2, . . . , x_{N1}} and {y_1, y_2, . . . , y_{N2}} of the random variables X and Y, one can estimate this divergence measure directly from the data using ideas from information theoretic learning [Príncipe et al., 2000]. The readers are referred to Chapter 2 for a detailed derivation of the following non-parametric estimator:

J_cs = [ (1/(N_1 N_2)) Σ_{i,j=1}^{N_1,N_2} G_{ij,σ²I} ] / √[ (1/N_1²) Σ_{i,i'=1}^{N_1,N_1} G_{ii',σ²I} · (1/N_2²) Σ_{j,j'=1}^{N_2,N_2} G_{jj',σ²I} ].


Here, G_{ij,σ²I} = G_{σ²I}(x_i − y_j) with G being a Gaussian kernel. Further, G_{ii',σ²I} = G_{σ²I}(x_i − x_{i'}) and G_{jj',σ²I} = G_{σ²I}(y_j − y_{j'}). Careful observation of J_cs shows that it is a ratio of the between-cluster information potential and the within-cluster information

potentials. Since entropy and information potential are inversely related, minimizing Jcs

maximizes between-cluster entropy and at the same time minimizes the within-cluster

entropies of the two clusters. This in turn would maximize Dcs(p||q), thus providing a

clustering solution with maximum divergence between the two pdfs.
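A minimal Python/NumPy sketch of this plug-in estimator for two labeled sample sets is given below; it assumes a single Gaussian kernel size σ for all terms, as in the expression above, and the function names are our own.

import numpy as np

def _gauss_matrix(A, B, sigma):
    # Pairwise Gaussian kernel evaluations G_{sigma^2 I}(a_i - b_j).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    d = A.shape[1]
    return np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (d / 2.0)

def cs_divergence(X, Y, sigma):
    # D_cs = -log J_cs, with J_cs the ratio of the between-cluster information
    # potential to the geometric mean of the within-cluster potentials.
    cross = _gauss_matrix(X, Y, sigma).mean()
    vx = _gauss_matrix(X, X, sigma).mean()
    vy = _gauss_matrix(Y, Y, sigma).mean()
    return -np.log(cross / np.sqrt(vx * vy))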

To recursively find the best labeling of each data point so as to minimize J_cs, we first have to express the above quantity as

J_cs = [ (1/2) Σ_{i,j=1}^{N,N} (1 − m_i^T m_j) G_{ij,σ²I} ] / √[ Σ_{i,j=1}^{N,N} (1 − m_{i1} m_{j1}) G_{ij,σ²I} · Σ_{i,j=1}^{N,N} (1 − m_{i2} m_{j2}) G_{ij,σ²I} ].

The quantity m_i is the fuzzy membership vector of sample x_i, i = 1, 2, . . . , N, with N = N_1 + N_2, and m_{ik}, k = 1, 2, are the elements of m_i. If x_i (or y_j) truly belongs to cluster C_1 (C_2), then the crisp membership vector is given by m_i = [1, 0] ([0, 1]). Note that J_cs is a function of these membership vectors and hence can be written as J_cs(m_1, m_2, . . . , m_N).

With m_{ik}, k = 1, 2, initialized to any value in the interval [0, 1], the optimization problem can be formulated as

min_{m_1, m_2, ..., m_N} J_cs(m_1, m_2, . . . , m_N)   subject to m_j^T 1 − 1 = 0, j = 1, 2, . . . , N.

Note that the constraint ensures that the elements of each vector m_i sum to unity. In order to derive an iterative fixed point update rule, we make the change of variable m_{ik} = v_{ik}², k = 1, 2. The optimization problem now becomes

min_{v_1, v_2, ..., v_N} J_cs(v_1, v_2, . . . , v_N)   subject to v_j^T v_j − 1 = 0, j = 1, 2, . . . , N.


Using the method of Lagrange multipliers, we have

L = J_cs(v_1, v_2, . . . , v_N) + Σ_{j=1}^{N} λ_j (v_j^T v_j − 1),

where λ_j, j = 1, 2, . . . , N, are the Lagrange multipliers. Differentiating with respect to v_j and λ_j for j = 1, 2, . . . , N, we get the following iterative scheme:

v_j = −(1/(2λ_j)) ∂J_cs/∂v_j,   λ_j = (1/2) √[ (∂J_cs/∂v_j)^T (∂J_cs/∂v_j) ]. (B–2)

After the convergence of the algorithm, one can obtain crisp membership vectors by setting the maximum element of each m_i, i = 1, 2, . . . , N, to one, and the rest to zero.

This technique implements a constrained gradient descent search with a built-in variable step size for each coordinate direction. The generalization to more than two clusters can be found in [Jenssen, 2005]. Jenssen also demonstrated the superior performance of this algorithm compared to GMM and Fuzzy K-means (FKM) for many non-convex clustering problems. For d-dimensional data, the kernel size is calculated using Silverman’s rule of thumb, given by equation (B–3):

σ_opt = σ_X {4N⁻¹(2d + 1)⁻¹}^{1/(d+4)}, (B–3)

where σ_X² = d⁻¹ Σ_i Σ_{X,ii} and Σ_{X,ii} are the diagonal elements of the sample covariance

matrix. To avoid local minima, we anneal the kernel size from 2σ_opt to 0.5σ_opt over a period of 100 iterations. Calculating the gradients requires O(N²) computations. To reduce the complexity, we stochastically sample the membership space by randomly selecting M membership vectors, where M ≪ N. Thus, the complexity of the algorithm drops to O(NM) per iteration.


B.3.2 Online Classification of Test Samples

For online spike sorting, the goal is to assign the test point to the cluster whose within-cluster entropy changes the least. This rule automatically ensures the maximization of the between-cluster entropy and hence the maximization of the CS PDF divergence measure for every new test point. The kernel size σ_opt can be estimated from the training samples which have already been classified. The kernel size σ_k, k = 1, 2, of each individual cluster was calculated from (B–3) using the corresponding training samples and then averaged to obtain an estimate of σ_opt.

Since information potential and entropy are inversely related, we assign the test point to the cluster which incurs the maximum change in information potential. The change in information potential of the clusters C_1 and C_2 due to a new test point x_t is given by

ΔV_{c_k} = Σ_{j=1}^{N_k} G_{σ_opt² I}(x_t − x_j),   k = 1, 2. (B–4)

Thus the classification rule can be summed up as

If ΔV_{c_1} > ΔV_{c_2}, classify as a Cluster 1 sample;
If ΔV_{c_2} > ΔV_{c_1}, classify as a Cluster 2 sample. (B–5)

Note that this computation takes just O(N) calculations, and hence is much more efficient

than calculating the change in the entire CS divergence measure.
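The O(N) test-time rule of (B–4) and (B–5) amounts to two kernel sums; a minimal sketch (our own naming) is:

import numpy as np

def classify_spike(x_t, C1, C2, sigma_opt):
    # Assign x_t to the cluster whose information potential changes the most (B-4, B-5).
    def delta_V(cluster):
        d2 = ((cluster - x_t) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * sigma_opt ** 2)).sum()
    return 1 if delta_V(C1) > delta_V(C2) else 2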

B.4 Results

B.4.1 Clustering of PCA Components

The dataset consists of 2734 points, of which the first 500 points were used in the training phase. The algorithm was fully automated and involved kernel annealing. Only one fourth of the data (M = 0.25N) was used for calculating the membership update, thus speeding up the algorithm significantly. Our algorithm took 6 to 9 seconds (Dell P4, 1.8 GHz, 512 MB RAM) to converge, giving a good clustering result for almost every new initialization, as shown in Figure B-3A. As seen in Figure B-3B, K-means clearly fails to


separate the two clusters, giving poor classification results. The main reason for this poor performance is the inherent assumption of hyper-ellipsoidal clusters in K-means, which is a wrong hypothesis for this application scenario.

B.4.2 Online Labeling of Neural Spikes

Using the classification rule presented in (B–5), the remaining 2234 points were classified as belonging to one of the two clusters by simply comparing them with the training samples, as shown in Figure B-3C. For comparison, we have also plotted the training samples along with the test samples. The algorithm took 2 to 3 seconds to classify all 2234 test points. We also tried another method of comparison in which a new test point is compared not only to the training samples, but also to the previous test points which have already been classified. This linearly increases the computational complexity, but generally gives better classification results. Nevertheless, in our case, the two methods gave identical results, which may be due to the fact that the training samples defined the clusters sufficiently well.

B.5 Summary

With the advent of multielectrode data acquisition techniques, fast and efficient

sorting of neural spike data is of utmost importance for monitoring the activity of

ensembles of single neurons. The state-of-the-art in neuronal waveform PCA analysis

is faced with a clustering problem due to the electrochemical dynamics in the tissue. We

have proposed a clustering technique that addresses waveform PCA distributions that are non-Gaussian and non-convex, arising from neural sources both near and far from the electrode. Clustering based on the Cauchy-Schwartz PDF divergence helps address these issues encountered in multielectrode BMI experiments and classifies new incoming neural spikes with O(N) complexity, which is suitable for implementation in low-power

portable hardware. With no parameters to be selected, this method additionally provides

neurophysiologists with an easy and powerful tool for spike sorting. Future research


Figure B-3. Comparison of clustering based on CS divergence with K-means for spike sorting. A) Clustering of the training data using CS divergence. B) Clustering of the training data using K-means. C) Online classification of the test points.


involves extending this technique simultaneously to hundreds of electrodes and extensive

comparison with other methods like ICA and Bayesian clustering.


APPENDIX C
A SUMMARY OF KERNEL SIZE SELECTION METHODS

We provide here a brief summary of some kernel size selection methods which we have used in this thesis. There is an extensive literature in this field, and we certainly do not intend to summarize everything in this appendix. For a thorough treatment of this topic, the readers are directed to the following references: [Katkovnik and Shmulevich, 2002; Wand and Jones, 1995; Fukunaga, 1990; Silverman, 1986].

One of the earliest approaches taken to find an optimal kernel size was to minimize the mean square error (MSE) between the estimated and the original pdf. The MSE between the estimator f̂(x) and the true density f(x) at a particular point x is given by

MSE{f̂(x)} = E{[f̂(x) − f(x)]²},

where the expectation depends on the kernel size σ. This can be rewritten in terms of the bias and variance quantities as

MSE{f̂(x)} = [E{f̂(x)} − f(x)]² + Var{f̂(x)}.

Instead of just evaluating at a particular point x, we could integrate this quantity over the

entire space to find an optimal σ. Since the bias and the variance term behave differently

with respect to σ, one has to make a best compromise between these two terms, depicting

the classic bias-variance dilemma.

MISE{f̂(x)} = ∫ [E{f̂(x)} − f(x)]² dx + ∫ Var{f̂(x)} dx.

A bottleneck of this approach is the inherent assumption of knowledge of the true density f(x). For the particular case of N samples from a multivariate d-dimensional normal density, with some mild asymptotic assumptions, one can derive Silverman’s rule of thumb [Silverman, 1986] as shown in equation (C–1):

σ* = σ_X {4N⁻¹(2d + 1)⁻¹}^{1/(d+4)}, (C–1)


where \sigma_X^2 = d^{-1} \sum_i \Sigma_{X,ii} and the \Sigma_{X,ii} are the diagonal elements of the sample covariance matrix. Due to the inherent assumption of a normal density, this method generally yields a large kernel size, resulting in an oversmoothed density estimate.
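As a small illustration, the following sketch computes equation C–1 numerically (written in Python with NumPy, which is not otherwise used in this dissertation; the function name silverman_kernel_size is our own):

    import numpy as np

    def silverman_kernel_size(X):
        # X is an (N, d) array of samples. sigma_X^2 is the average of the
        # diagonal entries of the sample covariance matrix, as in the text.
        N, d = X.shape
        cov = np.atleast_2d(np.cov(X, rowvar=False))
        sigma_X = np.sqrt(np.trace(cov) / d)
        # Equation C-1: sigma* = sigma_X * {4 N^{-1} (2d+1)^{-1}}^{1/(d+4)}
        return sigma_X * (4.0 / (N * (2 * d + 1))) ** (1.0 / (d + 4))

    # Example: 500 samples of a 2-D standard normal give sigma* of roughly 0.34.
    X = np.random.randn(500, 2)
    print(silverman_kernel_size(X))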

Another criterion of optimality is the maximum likelihood solution. Consider a random sample \{x_1, \ldots, x_N\}. The general Parzen pdf estimate can be formulated as

p(x) = \frac{1}{N} \sum_{i=1}^{N} G_{\sigma^2 C_i}(x - x_i),

where C_i = \frac{1}{K} \sum_{j=1}^{K} (x_i^j - x_i)(x_i^j - x_i)^T, with \{x_i^j\}_{j=1,\ldots,K} being the K nearest neighbors of x_i. To reduce the complexity, we can also set C_i = \sigma_i^2 I, where \sigma_i is the mean or median of the distances of x_i to its K neighbors. K is generally selected as \sqrt{N} (or, in general, N^{1/p} with p > 1), where N is the number of data points, to ensure asymptotic unbiasedness of the estimate. The global scale \sigma is then optimized to maximize the leave-one-out cross-validation likelihood:

\sigma^* = \arg\max_\sigma \sum_{j=1}^{N} \log \frac{1}{N} \sum_{i=1, i \neq j}^{N} G_{\sigma^2 C_i}(x_j - x_i).    (C–2)

The maximization is carried out as a brute-force line search. A good upper limit is the kernel size given by Silverman's rule, since it yields an overestimate. With a log-spaced line search, the \sigma that yields the maximum value in equation C–2 is selected as a good approximation to the true optimum.
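A minimal sketch of this line search is given below. For simplicity it assumes the isotropic case C_i = I, so that the global \sigma in equation C–2 is directly comparable to the Silverman estimate used as the upper limit; the two-decade span of the search grid and the function names are our own choices rather than part of the original procedure, and silverman_kernel_size is reused from the previous sketch.

    import numpy as np
    from scipy.spatial.distance import cdist

    def loo_log_likelihood(X, sigma):
        # Leave-one-out log-likelihood of equation C-2 with C_i = I, i.e. an
        # isotropic Gaussian kernel of size sigma.
        N, d = X.shape
        D2 = cdist(X, X, 'sqeuclidean')              # D2[j, i] = ||x_j - x_i||^2
        K = np.exp(-D2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
        np.fill_diagonal(K, 0.0)                     # drop the i = j term
        return np.sum(np.log(K.sum(axis=1) / N))

    def ml_kernel_size(X, n_grid=50):
        # Brute-force, log-spaced line search capped above by Silverman's rule,
        # which tends to overestimate the kernel size.
        upper = silverman_kernel_size(X)             # from the previous sketch
        grid = np.logspace(np.log10(upper) - 2.0, np.log10(upper), n_grid)
        scores = [loo_log_likelihood(X, s) for s in grid]
        return grid[int(np.argmax(scores))]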

One could also use the K-nearest-neighbor (KNN) technique for each sample x_i to obtain a local estimate of the kernel size. This local estimate \sigma_i is computed as the mean distance to the K nearest neighbors, as shown below:

\sigma_i = \frac{1}{K} \sum_{j=1}^{K} \|x_i - x_i^j\|,

where \{x_i^j\}_{j=1,\ldots,K} are the K nearest neighbors of x_i. K is generally set to \sqrt{N}, where N is the number of data points, to ensure asymptotic unbiasedness of the estimate. An advantage of this multi-kernel technique is that one obtains an individual kernel size


appropriate for each sample x_i. Thus, the global pdf estimate is well tailored, taking into consideration all the local effects. Further, this technique is quite robust to outliers, since \sigma_i is quite large for an outlier, which smooths out its effect in the pdf estimate. From this multi-kernel rule, we can also select a global kernel size as the median of these local \sigma_i, that is,

\sigma^* = \mathrm{median}_i\, \sigma_i.    (C–3)
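The sketch below computes the local kernel sizes and the median rule of equation C–3 (again in Python; the function name and the use of SciPy's pairwise distances are our own choices):

    import numpy as np
    from scipy.spatial.distance import cdist

    def knn_kernel_sizes(X, K=None):
        # Local size sigma_i = mean distance of x_i to its K nearest neighbors;
        # the global size is the median of the sigma_i (equation C-3).
        N = X.shape[0]
        if K is None:
            K = max(1, int(np.sqrt(N)))              # K = sqrt(N), as in the text
        D = np.sort(cdist(X, X), axis=1)             # row i: sorted distances from x_i
        local_sigmas = D[:, 1:K + 1].mean(axis=1)    # column 0 is the zero self-distance
        return local_sigmas, np.median(local_sigmas)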


REFERENCES

A. Amin. Structural description to recognising Arabic characters using decision tree learning techniques. In T. Caelli, A. Amin, R. P. W. Duin, M. S. Kamel, and D. de Ridder, editors, SSPR/SPR, volume 2396 of Lecture Notes in Computer Science, pages 152–158. Springer, 2002. ISBN 3-540-44011-9.

H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.

M. S. Bartlett. Face Image Analysis by Unsupervised Learning. Springer, 2001.

S. Becker. Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 2:17–33, 1991.

S. Becker. Unsupervised learning with global objective functions. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 997–1000. MIT Press, Cambridge, MA, USA, 1998.

S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.

A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

R. E. Blahut. Computation of channel capacity and rate distortion function. IEEE Trans. on Information Theory, IT-18:460–473, 1972.

C. A. Bossetti, J. M. Carmena, M. A. L. Nicolelis, and P. D. Wolf. Transmission latencies in a telemetry-linked brain-machine interface. IEEE Transactions on Biomedical Engineering, 51(6):919–924, June 2004.

E. M. Brown, R. E. Kass, and P. P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456–461, May 2004.

T. Caelli and W. F. Bischof. Machine Learning and Image Interpretation. Springer, 1997.

M. A. Carreira-Perpinan. Mode-finding for mixtures of Gaussian distributions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(11):1318–1323, November 2000.

M. A. Carreira-Perpinan. Fast nonparametric clustering with Gaussian blurring mean-shift. In W. W. Cohen and A. Moore, editors, ICML, pages 153–160. ACM, 2006. ISBN 1-59593-383-2.

D. Cheney, A. Goh, J. Xu, K. Gugel, J. G. Harris, J. C. Sanchez, and J. C. Prıncipe. Wireless, in vivo neural recording using a custom integrated bioamplifier and the pico system. In International IEEE/EMBS Conference on Neural Engineering, pages 19–22, May 2007.


Y. Cheng. Mean shift, mode seeking and clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.

J. Cho, A. R. C. Paiva, S. Kim, J. C. Sanchez, and J. C. Prıncipe. Self-organizing maps with dynamic learning for signal reconstruction. Neural Networks, 20(11):274–284, March 2007.

G. Cieslewski, D. Cheney, K. Gugel, J. Sanchez, and J. C. Prıncipe. Neural signal sampling via the low power wireless pico system. In Proceedings of IEEE EMBS, pages 5904–5907, August 2006.

D. A. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, March 1996.

D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(2):281–288, 2003.

D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.

D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 142–149, June 2000.

P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.

C. Connolly. The relationship between colour metrics and the appearance of three-dimensional coloured objects. Color Research and Applications, 21:331–337, 1996.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd edition. Wiley-Interscience, 2000.

D. Erdogmus. Information Theoretic Learning: Renyi's Entropy and its Applications to Adaptive System Training. PhD thesis, University of Florida, 2002.

D. Erdogmus and U. Ozertem. Self-consistent locally defined principal surfaces. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 15–20, April 2007.

M. Fashing and C. Tomasi. Mean shift is a bound optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):471–474, March 2005.

E. E. Fetz. Are movement parameters recognizably coded in the activity of single neurons. Behavioral and Brain Sciences, 15(4):679–690, March 1992.

R. Forsyth. Machine Learning: Principles and Techniques. Chapman and Hall, 1989.

K. S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1974.


K. Fukunaga. Statistical Pattern Recognition. Academic Press, 1990.

K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function with applications in pattern recognition. IEEE Trans. on Information Theory, 21(1):32–40, January 1975.

A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, 1991.

Z. Ghahramani. Unsupervised learning. In O. Bousquet, U. von Luxburg, and G. Ratsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 72–112. Springer, 2003. ISBN 3-540-23122-6.

A. Goh, S. Craciun, S. Rao, D. Cheney, K. Gugel, J. C. Sanchez, and J. C. Prıncipe. Wireless transmission of neural recordings using a portable real-time discrimination/compression algorithm. In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society, August 2008.

L. Greengard and J. Strain. The fast Gauss transform. SIAM Journal on Scientific Computing, 24:79–94, 1991.

T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.

S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd edition. New Jersey, Prentice Hall, 1999.

A. B. Hillel, A. Spiro, and E. Stark. Spike sorting: Bayesian clustering of non-stationary data. In Proceedings of NIPS, December 2004.

D. Hume. An Enquiry Concerning Human Understanding. Oxford University Press, 1999.

A. Hyvarinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

R. Jenssen. An Information Theoretic Approach to Machine Learning. PhD thesis, University of Tromso, 2005.

C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1–10, 1991.

V. Katkovnik and I. Shmulevich. Kernel density estimation with adaptive varying window size. Pattern Recognition Letters, 23:1641–1648, 2002.

B. Kegl. Principal curves: learning, design, and applications. PhD thesis, Concordia University, 1999.

B. Kegl and A. Krzyzak. Piecewise linear skeletonization using principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):59–74, 2002.


S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

S. Lakshmivarahan. Learning Algorithms: Theory and Applications. Springer-Verlag, Berlin-Heidelberg-New York, 1981.

P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.

T. Lehn-Schiøler, A. Hegde, D. Erdogmus, and J. C. Prıncipe. Vector-quantization using information theoretic concepts. Natural Computing, 4:39–51, Jan. 2005.

M. S. Lewicki. Bayesian modeling and classification of neural signals. Neural Computation, 6:1005–1030, May 1994.

M. S. Lewicki. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4):R53–R78, 1998.

Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. on Communications, 28(1):84–95, Jan. 1980.

R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105–117, March 1988a.

R. Linsker. Towards an organizing principle for a layered perceptual network. In D. Z. Anderson, editor, Neural Information Processing Systems - Natural and Synthetic. American Institute of Physics, New York, 1988b.

E. Lutwak, D. Yang, and G. Zhang. Cramer-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information. IEEE Transactions on Information Theory, 51(2):473–478, February 2005.

M. A. Carreira-Perpinan. Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776, May 2007.

D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.

A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proceedings of NIPS, 2002.

M. A. L. Nicolelis. Brain machine interfaces to restore motor function and probe neural circuits. Nature Reviews Neuroscience, 4:417–422, 2003.


M. A. L. Nicolelis, C. R. Stambaugh, A. Bristen, and M. Laubach. Methods for simultaneous multisite neural ensemble recordings in behaving primates. In Miguel A. L. Nicolelis, editor, Methods for Neural Ensemble Recordings, chapter 7, pages 157–177. CRC Press, 1999.

E. Oja. A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15:267–273, 1982.

E. Oja. Neural networks, principal components and subspaces. International Journal of Neural Systems, 1(1):61–68, 1989.

R. T. Olszewski. Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data. PhD thesis, Carnegie Mellon University, 2001.

E. Parzen. On the estimation of probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Berlin-Heidelberg-New York, 1977.

J. C. Prıncipe, D. Xu, and J. Fisher. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering, pages 265–319. John Wiley, 2000.

D. Ramanan and D. A. Forsyth. Finding and tracking people from the bottom up. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, pages 467–474, June 2003.

M. Ranzato, Y. Boureau, S. Chopra, and Y. LeCun. A unified energy-based framework for unsupervised learning. In Proc. of the 11th Int'l Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.

S. Rao, S. Han, and J. C. Prıncipe. Information theoretic vector quantization with fixed point updates. In Proceedings of the International Joint Conference on Neural Networks, pages 1020–1024, August 2007a.

S. Rao, A. M. Martins, W. Liu, and J. C. Prıncipe. Information theoretic mean shift algorithm. In Proceedings of IEEE Conf. on Machine Learning for Signal Processing, pages 155–160, Sept. 2006a.

S. Rao, A. M. Martins, and J. C. Prıncipe. Mean shift: An information theoretic perspective. Submitted to Pattern Recognition Letters, 2008.

S. Rao, A. R. C. Paiva, and J. C. Prıncipe. A novel weighted LBG algorithm for neural spike compression. In Proceedings of the International Joint Conference on Neural Networks, pages 1883–1887, August 2007b.

S. Rao, J. C. Sanchez, and J. C. Prıncipe. Spike sorting using non-parametric clustering via Cauchy-Schwartz pdf divergence. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 881–884, May 2006b.


A. Renyi. On measure of entropy and information. In Proceedings 4th Berkeley Symp. Math. Stat. and Prob., volume 1, pages 547–561, 1961.

A. Renyi. Some fundamental questions of information theory. In Selected Papers of Alfred Renyi, pages 526–552. Akademia Kiado, 1976.

W. Rudin. Principles of Mathematical Analysis. McGraw-Hill Publishing Co., 1976.

J. Sanchez, J. Pukala, J. C. Prıncipe, F. Bova, and M. Okun. Linear predictive analysis for targeting the basal ganglia in deep brain stimulation surgeries. In Proceedings of the 2nd Int. IEEE Workshop on Neural Engineering, Washington, 2005.

S. Sandilya and S. R. Kulkarni. Principal curves with bounded turn. IEEE Transactions on Information Theory, 48(10):2789–2793, 2002.

I. Santamaria, P. P. Pokharel, and J. C. Principe. Generalized correlation function: Definition, properties and application to blind equalization. IEEE Transactions on Signal Processing, 54(6):2187–2197, 2006.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183–190, 1992.

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

R. J. Vogelstein, K. Murari, P. H. Thakur, G. Cauwenberghs, S. Chakrabartty, and C. Diehl. Spike sorting with support vector machines. In Proceedings of IEEE EMBS, pages 546–549, September 2004.

M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.

S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, 1985.

J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin, J. Kim, S. J. Biggs, M. A. Srinivasan, and M. A. L. Nicolelis. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810):361–365, November 2000.

B. C. Wheeler. Automatic discrimination of single units. In Miguel A. L. Nicolelis, editor, Methods for Neural Ensemble Recordings, chapter 4, pages 61–78. CRC Press, 1999.

K. D. Wise, D. J. Anderson, J. F. Hetke, D. R. Kipke, and K. Najafi. Wireless implantable microsystems: high-density electronic interfaces to the nervous system. Proceedings of the IEEE, 92(1):76–97, January 2004.


F. Wood, M. Fellows, J. P. Donoghue, and M. J. Black. Automatic spike sorting for neural decoding. In Proceedings of IEEE EMBS, pages 4009–4012, September 2004.

G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.

C. Yang, R. Duraiswami, and L. Davis. Efficient mean-shift tracking via a new similarity measure. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 176–183, June 2005.


BIOGRAPHICAL SKETCH

Sudhir Madhav Rao was born in Hyderabad, India in September of 1980. He received

his B.E. from the Department of Electronics and Communications Engineering, Osmania

University, India, in 2002. He obtained his M.S. in electrical and computer engineering in

2004 from the University of Florida with emphasis in the area of signal processing. In

the Fall of 2004, he joined the Computational NeuroEngineering Laboratory (CNEL) at

the University of Florida and started working towards his Ph.D. under the supervision of

Dr. Jose C. Prıncipe. He received his Ph.D. in the Summer of 2008. His primary interests

include signal processing, machine learning and adaptive systems. He is a member of Eta

Kappa Nu, Tau Beta Pi and IEEE.
