Distance metric learning, with application to clustering with side-information


Page 1: Distance metric learning, with application to clustering with side-information

Distance metric learning, with application to clustering with side-information

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell

University of California, Berkeley
{epxing,ang,jordan,russell}@cs.berkeley.edu

Neural Information Processing Systems 2002

Page 2: Distance metric learning, with application to clustering with side-information


Abstract

Many algorithms rely critically on being given a good metric over the input space.
– Learning the metric provides a more systematic way for users to indicate what they consider "similar".
– If a clustering algorithm fails to find a clustering that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found.

In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships.

Page 3: Distance metric learning, with application to clustering with side-information


Introduction

Many learning and data-mining algorithms assume they are given a good metric that reflects reasonably well the important relationships between the data.
– K-means, nearest neighbors, SVMs, and so on.
– The problem is particularly acute in unsupervised settings such as clustering.

One important family of algorithms that learn metrics are the unsupervised ones that take an input dataset and find an embedding of it in some space, such as:
– Multidimensional Scaling (MDS)
– Locally Linear Embedding (LLE)

• These have the "no right answer" problem.

• Similar to Principal Components Analysis (PCA).

Page 4: Distance metric learning, with application to clustering with side-information


Introduction

Algorithms for clustering with similarity information are told that certain pairs are "similar" or "dissimilar"; they search for a clustering that puts the similar pairs into the same cluster and the dissimilar pairs into different clusters.

This gives a way of using similarity side-information to find clusters that reflect a user’s notion of meaningful clusters.

Page 5: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

Suppose we have some set of points $\{x_i\}_{i=1}^{m} \subseteq \mathbb{R}^n$, and are given information that certain pairs of them are "similar":

$S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$

How can we learn a distance metric $d(x, y)$ between points $x$ and $y$ that respects this; specifically, so that "similar" points end up close to each other?

Consider learning a distance metric of the form

$d(x, y) = d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^\top A (x - y)}$

Page 6: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

– For $d_A$ to be a metric (satisfying non-negativity and the triangle inequality), we require that $A$ be positive semi-definite: $A \succeq 0$.
– Setting $A = I$ gives Euclidean distance.
– If we restrict $A$ to be diagonal, this corresponds to learning a metric in which the different axes are given different "weights".
– More generally, $A$ parameterizes a family of Mahalanobis distances over $\mathbb{R}^n$.
– Learning such a distance metric is also equivalent to finding a rescaling of the data that replaces each point $x$ with $A^{1/2}x$ and applying the standard Euclidean metric to the rescaled data.
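To make the rescaling equivalence concrete, here is a minimal numpy sketch (the matrix A and the points x, y are random stand-ins, not values from the paper): computing $\|x - y\|_A$ directly agrees with the Euclidean distance after mapping each point through $A^{1/2}$.

```python
import numpy as np

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return np.sqrt(d @ A @ d)

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M                      # a random positive semi-definite A
x, y = rng.standard_normal(3), rng.standard_normal(3)

# A^{1/2} via the eigendecomposition of the symmetric matrix A.
evals, evecs = np.linalg.eigh(A)
A_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

print(mahalanobis(x, y, A))                     # direct Mahalanobis distance
print(np.linalg.norm(A_half @ x - A_half @ y))  # identical after rescaling
```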

Page 7: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

A simple way of defining a criterion for the desired metric is to demand that pairs of points $(x_i, x_j)$ in $S$ have, say, small squared distance between them:

$\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$

This is trivially solved with $A = 0$, which is not useful, so we add the constraint:

$\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$

– This ensures that $A$ does not collapse the dataset into a single point.

– Here, $D$ can be a set of pairs of points known to be "dissimilar" or, if no such information is given, all pairs not in $S$.

Page 8: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

This gives the optimization problem:

$\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, \quad A \succeq 0$

• The optimization problem is convex, which enables us to derive efficient, local-minima-free algorithms to solve it.
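Since the problem is convex (a linear objective in the entries of A, a concave-greater-than constraint, and a PSD cone constraint), it can be handed directly to a disciplined convex programming tool. A minimal sketch assuming cvxpy, with hypothetical inputs S_pairs and D_pairs holding the difference vectors $x_i - x_j$:

```python
import cvxpy as cp
import numpy as np

def learn_full_metric(S_pairs, D_pairs, n):
    """Solve: min_A sum_S ||xi-xj||_A^2  s.t.  sum_D ||xi-xj||_A >= 1, A PSD.

    S_pairs, D_pairs: lists of difference vectors d = xi - xj (length-n arrays).
    """
    A = cp.Variable((n, n), PSD=True)
    # d^T A d = trace(A d d^T), which is linear in the entries of A.
    sq = lambda d: cp.sum(cp.multiply(A, np.outer(d, d)))
    objective = cp.Minimize(sum(sq(d) for d in S_pairs))
    # sqrt of an affine expression is concave, so ">= 1" is a valid
    # convex constraint in disciplined convex programming.
    constraints = [sum(cp.sqrt(sq(d)) for d in D_pairs) >= 1]
    cp.Problem(objective, constraints).solve()
    return A.value
```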

Page 9: Distance metric learning, with application to clustering with side-information


Note.

Page 10: Distance metric learning, with application to clustering with side-information


Note.

Newton-Raphson Method:
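As a reminder of the generic form (not the paper's specific derivation): to minimize a function $g$, Newton-Raphson repeatedly takes Hessian-corrected gradient steps,

```latex
\theta_{t+1} \;=\; \theta_t \;-\; H^{-1} \nabla g(\theta_t),
\qquad H \;=\; \nabla^2 g(\theta_t)
```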

Page 11: Distance metric learning, with application to clustering with side-information


Note.

Page 12: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

The case of diagonal $A$:
– In the case that we want to learn a diagonal $A = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{nn})$, we can derive an efficient algorithm using Newton-Raphson to minimize

$g(A) = g(A_{11}, \ldots, A_{nn}) = \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \;-\; \log\Big( \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \Big)$

– Newton updates $A := A - \alpha H^{-1} \nabla g$, where $\alpha$ is a step-size parameter optimized via line-search to give the largest downhill step subject to $A_{ii} \ge 0$.
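A minimal numpy sketch of this diagonal case (all names hypothetical; for brevity it uses a finite-difference Hessian and a simple halving line-search, where the paper uses the exact Hessian):

```python
import numpy as np

def g(a, S2, D2):
    """g(a) = sum_S ||xi-xj||_A^2 - log(sum_D ||xi-xj||_A) for A = diag(a).

    S2, D2: arrays of squared difference vectors (xi - xj)**2, one row per pair.
    """
    return np.sum(S2 @ a) - np.log(np.sum(np.sqrt(D2 @ a)))

def grad_g(a, S2, D2):
    dists = np.sqrt(D2 @ a)          # ||xi - xj||_A over the pairs in D
    return S2.sum(axis=0) - (D2 / (2 * dists[:, None])).sum(axis=0) / dists.sum()

def newton_diag(S2, D2, iters=100, eps=1e-5):
    n = S2.shape[1]
    a = np.ones(n)                   # start from the identity metric
    I = np.eye(n)
    for _ in range(iters):
        grad = grad_g(a, S2, D2)
        # Finite-difference Hessian: one extra gradient call per coordinate.
        H = np.column_stack([(grad_g(a + eps * I[k], S2, D2) - grad) / eps
                             for k in range(n)])
        step = np.linalg.solve(H + 1e-8 * I, grad)
        alpha, improved = 1.0, False  # halving line-search, keeping a_ii >= 0
        while alpha > 1e-10:
            a_new = a - alpha * step
            if np.all(a_new >= 0) and g(a_new, S2, D2) < g(a, S2, D2):
                a, improved = a_new, True
                break
            alpha *= 0.5
        if not improved:
            break                    # no downhill step found: stop
    return np.diag(a)
```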

Page 13: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

The case of full $A$:
– Newton's method often becomes prohibitively expensive, requiring $O(n^6)$ time to invert the Hessian over the $n^2$ parameters of $A$.
– The paper instead uses gradient ascent combined with iterative projections onto the constraint sets.

Page 14: Distance metric learning, with application to clustering with side-information


Learning Distance Metrics

Page 15: Distance metric learning, with application to clustering with side-information


Experiments and Examples

In the experiments with synthetic data, S was a randomly sampled 1% of all pairs of similar points.

We can use the fact discussed earlier that learning $\|\cdot\|_A$ is equivalent to finding a rescaling of the data $x \mapsto A^{1/2}x$.
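Concretely, this is how a learned metric plugs into any off-the-shelf clustering routine; a sketch assuming scikit-learn's KMeans and a matrix A produced by one of the methods above (the function name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_metric(X, A, k):
    """K-means under the learned metric: rescale x -> A^{1/2} x,
    then cluster the rescaled points with ordinary Euclidean K-means."""
    evals, evecs = np.linalg.eigh(A)
    A_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    return KMeans(n_clusters=k, n_init=10).fit_predict(X @ A_half)
```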

Page 16: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Examples of learned distance metrics

Page 17: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Examples of learned distance metrics

Page 18: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Application to clustering

Clustering with side-information:
– Given $S$, we are told that each pair $(x_i, x_j) \in S$ means $x_i$ and $x_j$ belong to the same cluster.
– Let $\hat{c}_i$ $(1 \le i \le m)$ be the cluster to which point $x_i$ is assigned by an automatic clustering algorithm, and let $c_i$ be some "correct" or desired clustering of the data.
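To score a clustering $\hat{c}$ against the desired one $c$, the paper compares them pairwise: accuracy is the fraction of point pairs on which the two clusterings agree about same-cluster membership.

```latex
\text{accuracy}
  \;=\; \sum_{i > j}
    \frac{\mathbf{1}\{\mathbf{1}\{c_i = c_j\} = \mathbf{1}\{\hat{c}_i = \hat{c}_j\}\}}
         {0.5\, m (m - 1)}
```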

Page 19: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Application to clustering

Page 20: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Application to clustering

Page 21: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Application to clustering

Page 22: Distance metric learning, with application to clustering with side-information


Experiments and Examples - Application to clustering

Page 23: Distance metric learning, with application to clustering with side-information


Conclusions

We have presented an algorithm that, given examples of similar pairs of points in $\mathbb{R}^n$, learns a distance metric that respects these relationships.

The method is based on posing metric learning as a convex optimization problem, which allowed us to derive efficient, local-optima-free algorithms.

We also showed examples of diagonal and full metrics learned from simple artificial examples, and demonstrated on artificial and on UCI datasets how our methods can be used to improve clustering performance.