Neighborhood Component Analysis 20071108


Neighbourhood Component Analysis

T.S. Yo

References

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov (2005). "Neighbourhood Components Analysis". In Advances in Neural Information Processing Systems 17 (NIPS 2004). MIT Press.

Outline

● Introduction
● Learn the distance metric from data
● The size of k
● Procedure of NCA
● Experiments
● Discussions

Introduction (1/2)

● KNN
  – Simple and effective
  – Nonlinear decision surface
  – Non-parametric
  – Quality improves with more data
  – Only one parameter, k → easy to tune

Introduction (2/2)

● Drawbacks of KNN
  – Computationally expensive: the whole training set must be searched at test time
  – How to define the "distance" properly?

● Learn the distance metric from data, and force it to be low rank.

Learn the Distance from Data (1/5)

● What is a good distance metric?
  – The one that minimizes (optimizes) the cost!
● Then, what is the cost?
  – The expected test error
  – Best estimated with the leave-one-out (LOO) cross-validation error on the training data

Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137-1143. Morgan Kaufmann, San Mateo.
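As a concrete illustration of this cost (not part of the original slides), the LOO error of a plain KNN classifier can be estimated as in the sketch below; scikit-learn, the wine dataset and k = 3 are my own illustrative choices.

    # Sketch: estimating the LOO cross-validation error of plain KNN.
    from sklearn.datasets import load_wine
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=3)

    # One fit per sample: each point is classified by the remaining n-1 points.
    scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
    print("LOO error rate:", 1.0 - scores.mean())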

Learn the Distance from Data (2/5)

● Modeling the LOO error:
  – Let pij be the probability that point xj is selected as point xi's neighbour.
  – The probability that xi is correctly classified when it is used as the reference point is pi (reconstructed below).

● To maximize pi for all xi means to minimize LOO error.
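A reconstruction of the expression referred to above, following the standard NCA formulation of Goldberger et al. (the original equation image is not in the transcript). With C_i = { j : y_j = y_i } the set of points sharing xi's class label,

    p_i = \sum_{j \in C_i} p_{ij}, \qquad p_{ii} = 0 .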

[Figure "Softmax Function": plot of exp(-x) against x, with axis values from 0 to 1.]

Learn the Distance from Data (3/5)

● Then, how do we define pij?
  – As the softmax of the distance dij (reconstructed below)
  – Relatively smoother than using dij directly
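The softmax definition the slide refers to, reconstructed from the standard NCA formulation (the equation image is missing from the transcript):

    p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}, \qquad p_{ii} = 0 .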

Learn the Distance from Data (4/5)

● How do we define dij?
● Restrict the distance measure to a Mahalanobis (quadratic) distance.
● That is to say, we project the original feature vectors x into another vector space with a q × d transformation matrix A.
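Reconstructed from the standard NCA formulation (the equation images are missing from the transcript): the learned metric is the Mahalanobis distance induced by Q = AᵀA,

    d_{ij} = (x_i - x_j)^\top A^\top A \,(x_i - x_j) = \| A x_i - A x_j \|^2 .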

Learn the Distance from Data (5/5)

● Substitute dij into pij (reconstructed below):
● Now we have the objective function f(A):
● Maximize f(A) w.r.t. A → minimize the overall LOO error
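A reconstruction of the two expressions this slide refers to, following Goldberger et al.:

    p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^2)}, \qquad p_{ii} = 0,

    f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i .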

The Size of k

● For the probability distribution pij:
● The perplexity of the neighbour distribution can be used as an estimate of the number of neighbours to consider, k (a reconstruction follows below).
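A plausible reconstruction of this estimate (the formula itself is missing from the transcript): the perplexity of the neighbour distribution for point xi is

    k_i \approx \exp\!\Big( -\sum_{j \neq i} p_{ij} \log p_{ij} \Big),

and averaging k_i over the training points gives a data-driven suggestion for k.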

Procedure of NCA (1/2)

● Use the objective function and its gradient to learn the transformation matrix A and the neighbourhood size k from the training data Dtrain (with or without dimension reduction).

● Project the test data Dtest into the transformed space.

● Perform traditional KNN (with k, using ADtrain as the reference set) on the transformed test data ADtest.
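A minimal sketch of these three steps in Python, using scikit-learn's NeighborhoodComponentsAnalysis rather than the authors' own optimizer; the wine dataset, n_components = 2, the 70%/30% split and k = 3 are illustrative assumptions (the perplexity-based choice of k is not included).

    # Sketch of the NCA + KNN procedure described above.
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # 1. Learn the transformation A from D_train (rank 2 here, i.e. with dimension reduction).
    nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
    nca.fit(X_train, y_train)

    # 2. Project training and test data into the transformed space: A D_train, A D_test.
    X_train_t = nca.transform(X_train)
    X_test_t = nca.transform(X_test)

    # 3. Traditional KNN in the transformed space.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train_t, y_train)
    print("Test error rate:", 1.0 - knn.score(X_test_t, y_test))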

Procedure of NCA (2/2)

● Functions used for optimization
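The slide's content is not in the transcript; below is a sketch (not the authors' code) of the two functions such an optimization needs, the objective f(A) and its gradient, written directly from the Goldberger et al. formulas with NumPy and SciPy. The variable names and the choice of L-BFGS are my own assumptions.

    # Sketch: NCA objective f(A) and its gradient. X has shape (n, d), A has shape (q, d).
    import numpy as np
    from scipy.optimize import minimize

    def nca_objective_and_grad(A_flat, X, y, q):
        n, d = X.shape
        A = A_flat.reshape(q, d)
        AX = X @ A.T                                  # projected points, shape (n, q)

        # Pairwise squared distances in the transformed space, with d_ii forced to +inf
        # so that p_ii = 0 after the softmax.
        diff = AX[:, None, :] - AX[None, :, :]        # (n, n, q)
        dist2 = np.sum(diff ** 2, axis=-1)            # (n, n)
        np.fill_diagonal(dist2, np.inf)
        dist2 = dist2 - dist2.min(axis=1, keepdims=True)   # stabilise the softmax

        P = np.exp(-dist2)
        P /= P.sum(axis=1, keepdims=True)             # p_ij

        same_class = (y[:, None] == y[None, :])
        p_i = np.sum(P * same_class, axis=1)          # p_i = sum_{j in C_i} p_ij
        f = p_i.sum()                                 # objective f(A)

        # Gradient: dF/dA = 2 A sum_i ( p_i sum_k p_ik x_ik x_ik^T - sum_{j in C_i} p_ij x_ij x_ij^T ),
        # with x_ik = x_i - x_k taken in the original space.
        Xdiff = X[:, None, :] - X[None, :, :]         # (n, n, d)
        W = P * p_i[:, None] - P * same_class         # weight of each outer product x_ik x_ik^T
        M = np.einsum('ik,ika,ikb->ab', W, Xdiff, Xdiff)
        grad = 2.0 * A @ M

        # SciPy minimizes, so return the negated objective and gradient.
        return -f, -grad.ravel()

    # Usage sketch (X, y assumed given): learn a rank-q matrix A with L-BFGS.
    # q = 2
    # A0 = 0.1 * np.random.randn(q, X.shape[1])
    # res = minimize(nca_objective_and_grad, A0.ravel(), args=(X, y, q),
    #                jac=True, method='L-BFGS-B')
    # A = res.x.reshape(q, X.shape[1])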

Experiments – Datasets (1/2)

● 4 from UCI ML Repository, 2 self-made

Experiments – Datasets (2/2)

n2d is a mixture of two bivariate normal distributions with different means and covariance matrices. ring consists of two concentric rings in two dimensions plus 8 dimensions of uniform random noise.
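Since the slides do not give the exact parameters of these two self-made datasets, the sketch below only illustrates how data of this shape could be generated; every mean, covariance, radius and sample size is an assumption.

    # Sketch: generating datasets of the kind described above (parameters are illustrative).
    import numpy as np

    rng = np.random.default_rng(0)
    n_per_class = 200

    # n2d: mixture of two bivariate normals with different means and covariance matrices.
    mean0, cov0 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
    mean1, cov1 = np.array([2.0, 2.0]), np.array([[0.5, -0.2], [-0.2, 1.5]])
    X_n2d = np.vstack([rng.multivariate_normal(mean0, cov0, n_per_class),
                       rng.multivariate_normal(mean1, cov1, n_per_class)])
    y_n2d = np.repeat([0, 1], n_per_class)

    # ring: two concentric 2-D rings plus 8 dimensions of uniform random noise.
    def ring_points(radius, n):
        theta = rng.uniform(0.0, 2.0 * np.pi, n)
        r = radius + rng.normal(scale=0.1, size=n)
        return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

    X_ring = np.hstack([np.vstack([ring_points(1.0, n_per_class), ring_points(2.0, n_per_class)]),
                        rng.uniform(-1.0, 1.0, size=(2 * n_per_class, 8))])
    y_ring = np.repeat([0, 1], n_per_class)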

Experiments – Results (1/4)

Error rates of KNN and NCA with the same K.

The results show that NCA generally improves the performance of KNN.

Experiments – Results (2/4)

Experiments – Results (3/4)

● Compare with other classifiers

Experiments – Results (4/4)

● Rank 2 dimension reduction

Discussions (1/8)

● Rank 2 transformation for wine

Discussions (2/8)

● Rank 1 transformation for n2d

Discussions (3/8)

● Results of Goldberger et al.

(40 realizations of 30%/70% splits)

Discussions (4/8)

● Results of Goldberger et al.

(rank 2 transformation)

Discussions (5/8)

● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.

● NCA also outperforms traditional dimension reduction methods for several datasets.

Discussions (6/8)

● Compared to other classification methods (e.g. LDA and QDA), NCA usually does not give the best accuracy.

● Some odd results on dimension reduction suggest that further investigation of the optimization algorithm is necessary.

Discussions (7/8)

● Optimizing over a matrix
● "Can we optimize these functions?" (Michael L. Overton)
  – Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski).
  – Locally, yes.
    ● But not by standard methods for nonconvex, smooth optimization
    ● Steepest descent, BFGS, or nonlinear conjugate gradient will typically jam because of nonsmoothness

Discussions (8/8)

● Other methods also learn a distance metric from data:
  – Discriminant Common Vectors (DCV)
    ● Similar to NCA, DCV focuses on optimizing the distance metric for certain objective functions
  – Laplacianfaces (LAP)
    ● Puts more emphasis on dimension reduction

J. Liu and S. Chen, "Discriminant Common Vectors Versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem". Image and Vision Computing.

Questions?

Thank you!

Derive the Objective Function (1/5)

● From the assumptions, we have:

Derive the Objective Function (2/5)

Derive the Objective Function (3/5)

Derive the Objective Function (4/5)

Derive the Objective Function (5/5)
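The derivation itself is missing from the transcript; the end result, as given in Goldberger et al. and reconstructed here, is the gradient used in the optimization. Writing x_{ij} = x_i - x_j,

    \frac{\partial f}{\partial A}
      = 2A \sum_i \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^\top
        \;-\; \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^\top \Big),

obtained by differentiating f(A) = \sum_i \sum_{j \in C_i} p_{ij} with the softmax definition of p_{ij} and collecting terms.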