
Similarity-based Learning via Data Driven Embeddings*

Purushottam Kar (1)    Prateek Jain (2)

(1) Indian Institute of Technology Kanpur
(2) Microsoft Research India, Bengaluru

November 3, 2011

* To appear in the proceedings of NIPS 2011

Outline

1 An Introduction to Learning

2 A Brief History of Learning with Similarities

3 Learning with Suitable Similarities
  - Learning with a Suitable Similarity Function
  - Learning with a Suitable Distance Function

4 Data-sensitive Notions of Suitability
  - Learning with Data-sensitive Notions of Suitability
  - Learning the Best Notion of Suitability
  - Results

5 References


Learning

Digit Classification†

†MNIST database: http://yann.lecun.com/exdb/mnist/


Learning

Spam mail detection

Dear Junta,

The Hall-8 mess will be closed for the occasion of Diwali at lunch & dinner time. The breakfast will be served along with Lunch packets tomorrow (26th October, 2011).

Please collect your Lunch Packet. The mess would resume its normal working from 27th October.

A legitimate mail

Hello, I am resending my previous mail to you, I hope you do get it this time around and understand its content fully. I am contacting you briefly based on the Investment of Forty Five Million Dollars (US$ 45,000,000:00) in your country, as I presently have a client who is interested in investing in your country. Sincerely Yours, J. Costa

Most likely a spam mail

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMINAR SERIES
Departmental Colloquium

Title: Similarity-based Learning via Data Driven Embeddings

Speaker: Purushottam Kar

Affiliation: Ph.D. Scholar, CSE Dept., IIT Kanpur

To each his own ...


Learning

More formally ...

We are working over a domain X and wish to learn a target classifier over the domain, ℓ : X → {−1, +1}.

We are given training points S = {x_1, x_2, ..., x_n} sampled from some distribution D over X, along with their true labels {ℓ(x_1), ..., ℓ(x_n)}.

Our goal is to output a classifier ℓ̂ : X → {−1, +1} that mostly gives out the true labels:

$$\Pr_{x \sim D}\left[\hat{\ell}(x) \neq \ell(x)\right] < \epsilon$$
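To make the objective concrete, the following minimal sketch estimates Pr_{x∼D}[ℓ̂(x) ≠ ℓ(x)] by Monte Carlo sampling; the Gaussian D, the true labeling, and the candidate classifier are all illustrative assumptions, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# D: taken to be a standard Gaussian over X = R^2 (illustrative assumption)
X = rng.normal(size=(100_000, 2))

ell = np.sign(X[:, 0] + X[:, 1])   # a hypothetical true labeling of the domain
ell_hat = np.sign(X[:, 0])         # a candidate classifier to be evaluated

# Monte Carlo estimate of Pr_{x ~ D}[ell_hat(x) != ell(x)]
print("estimated error:", np.mean(ell_hat != ell))
```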

Learning

Representing the data

Most learning algorithms (Perceptron, MRF, DBN, SVM, ...) like working with numeric data, i.e. X ⊂ R^d.

How do we make heterogeneous data (images, sound, web data) numeric?

SOLUTION 1: Force a numeric representation by embedding all data in some Euclidean space R^d,

$$\Phi : X \to \mathbb{R}^d$$

- Easy to do for images: (n × n) pixels map to R^{3n²} for RGB images
- Easier said than done for text, emails, web data (e.g. BoW for text)

SOLUTION 2: Work with some distance/similarity function over the data X (see the sketch below for SOLUTION 1).
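As a minimal illustration of SOLUTION 1 (the random "image" and the toy vocabulary below are assumptions made purely for the sketch), an RGB image embeds by flattening, while text can be embedded via bag-of-words:

```python
import numpy as np

# SOLUTION 1 for images: an (n x n) RGB image embeds trivially into R^{3n^2}
n = 28
image = np.random.default_rng(0).integers(0, 256, size=(n, n, 3))
phi_image = image.reshape(-1).astype(float)   # Phi(image), a vector in R^{3n^2}
assert phi_image.shape == (3 * n * n,)

# For text this is harder; bag-of-words (BoW) over a fixed vocabulary is one workaround
vocabulary = ["investment", "million", "dollars", "lunch", "mess"]  # toy vocabulary

def bow(text):
    words = text.lower().split()
    return np.array([words.count(v) for v in vocabulary], dtype=float)

print(bow("Investment of Forty Five Million Dollars"))  # -> [1. 1. 1. 0. 0.]
```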



Learning with Similarities

Classical algorithms that learn with similarities

Let K be a similarity measure (or w.l.o.g. a distance measure).

Nearest neighbor classification:

$$\hat{\ell}(x) = \ell(\mathrm{NN}(x)), \qquad \mathrm{NN}(x) = \arg\max_{x' \in S} K(x, x')$$

Perceptron algorithm (X ⊂ R^d): ℓ̂(x) = sgn(⟨w, x⟩) for some w ∈ R^d, which in dual form reads

$$\hat{\ell}(x) = \mathrm{sgn}\Big(\sum_{x' \in S} \alpha(x')\, K(x, x')\, \ell(x')\Big), \qquad K(x, x') = \langle x, x' \rangle, \qquad w = \sum_{x' \in S} \alpha(x')\, \ell(x')\, x'$$

SVM allows the use of arbitrary positive semi-definite (PSD) kernels:

$$\hat{\ell}(x) = \mathrm{sgn}\Big(\sum_{x' \in S} \alpha_{\mathrm{SVM}}(x')\, K(x, x')\, \ell(x')\Big)$$
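The sketch below shows both decision rules on a toy problem (the linear similarity, the Gaussian data, and the labeling are assumptions of the sketch): nearest-neighbour prediction under K, and the dual-form perceptron in which α(x′) counts the mistakes made on x′.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 2))          # training points
y = np.sign(S[:, 0] + S[:, 1])        # their labels ell(x') (toy labeling)

def K(x, xp):
    return float(x @ xp)              # a linear similarity, purely for illustration

def nn_classify(x):
    # Nearest neighbour under K: the label of the most similar training point
    sims = [K(x, xp) for xp in S]
    return y[int(np.argmax(sims))]

def dual_perceptron(epochs=10):
    # Dual-form perceptron: alpha(x') counts the mistakes made on x'
    alpha = np.zeros(len(S))
    for _ in range(epochs):
        for i in range(len(S)):
            score = sum(alpha[j] * K(S[i], S[j]) * y[j] for j in range(len(S)))
            if y[i] * score <= 0:     # mistake (or undecided): update
                alpha[i] += 1
    return alpha

alpha = dual_perceptron()
x = np.array([1.0, -0.2])
score = sum(alpha[j] * K(x, S[j]) * y[j] for j in range(len(S)))
print("NN:", nn_classify(x), "  perceptron:", np.sign(score))
```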


Learning with Similarities

A lot of work has gone into incorporating various similarity and distance measures into such frameworks [Pekalska and Duin, 2001, Weinberger and Saul, 2009].

A fair amount went into algorithms that, unlike SVMs, do not require PSD kernels [Goldfarb, 1984].

There is also some very nice work involving isometric embeddings into (pseudo-)Hilbert / Banach spaces [Gottlieb et al., 2010, von Luxburg and Bousquet, 2004, Haasdonk, 2005].

However, none of it addressed the suitability of the similarity/distance measure to the learning task.


Learning with Similarities

Suitable Similarities

A suitable similarity should intuitively give better classifier performance.

It is well known that the choice of kernel has a significant impact on SVM classifier performance.

In general, several domains have preferred notions of similarity (e.g. earth mover's distance for images).

Can formal notions of suitability lead to guaranteed performance?

- For SVMs, suitability is formalized in terms of the margin offered by the PSD kernel in its RKHS.
- Having a large margin does lead to generalization bounds [Shawe-Taylor et al., 1998, Balcan et al., 2006].

Can we do the same for non-PSD similarities?



Learning with Suitable Similarities: Learning with a Suitable Similarity Function

What is a good similarity function?

Intuitively, a good similarity function should at least respect the labeling of the domain.

It should not assign small similarity to points with the same label and large similarity to distinctly labeled points.

Definition ([Balcan and Blum, 2006])
A similarity K : X × X → R is said to be (ε, γ)-good for a classification problem if, for some weighing function w : X → [−1, 1], at least a (1 − ε) probability mass of examples x ∼ D satisfies

$$\mathbb{E}_{\substack{x' \sim D,\ \ell(x') = \ell(x) \\ x'' \sim D,\ \ell(x'') \neq \ell(x)}} \left[ w(x')\, K(x, x') - w(x'')\, K(x, x'') \right] \geq \gamma$$

In other words, according to the similarity function, most points are, on average, more similar to points of the same label.
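The definition can be checked empirically on a sample: for each x, estimate the average weighted similarity gap between same-label and opposite-label points, and count how many points achieve gap at least γ. The sketch below does this under assumptions of its own (Gaussian data, w ≡ 1, a linear K):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0])                   # toy labeling (assumption)

K = lambda a, b: float(a @ b)          # candidate similarity (assumption)
w = lambda x: 1.0                      # a trivial weighing function, w(x') = 1

def gap(i):
    # Sample estimate of E[w(x')K(x,x')] - E[w(x'')K(x,x'')] for x = X[i],
    # with x' ranging over same-label points and x'' over opposite-label points
    same = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if y[j] == y[i] and j != i]
    diff = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if y[j] != y[i]]
    return np.mean(same) - np.mean(diff)

gamma = 0.5
frac = np.mean([gap(i) >= gamma for i in range(len(X))])
print(f"fraction with gap >= {gamma}: {frac:.2f}  (so epsilon is about {1 - frac:.2f})")
```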


Learning with Suitable Similarities: Learning with a Suitable Similarity Function

Learning with a good similarity function

Theorem ([Balcan and Blum, 2006])
Given an (ε, γ)-good similarity function, for any δ > 0, given n = (16/γ²) lg(2/δ) labeled points (x_i)_{i=1}^n, the classifier ℓ̂ defined below has error at margin γ/2 no more than ε + δ with probability greater than 1 − δ:

$$\hat{\ell}(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n} w(x_i)\, \ell(x_i)\, K(x, x_i)\Big)$$

Notice that the classifier is very similar in form to the SVM and Perceptron classifiers.

Consequently, one can use those algorithms to learn this classifier as well.
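A minimal sketch of the resulting classifier (the toy data, w ≡ 1, and the linear K are assumptions): prediction is a weighted similarity vote over the n labeled landmark points.

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.normal(size=(100, 2))          # the n labeled points x_1, ..., x_n
yL = np.sign(L[:, 0])                  # their labels ell(x_i) (toy labeling)
wL = np.ones(len(L))                   # the weights w(x_i); here simply w = 1

def K(x, xs):
    return xs @ x                      # linear similarity of x to each row of xs

def ell_hat(x):
    # sgn( sum_i w(x_i) ell(x_i) K(x, x_i) )
    return np.sign(np.sum(wL * yL * K(x, L)))

test = rng.normal(size=(1000, 2))
acc = np.mean([ell_hat(x) == np.sign(x[0]) for x in test])
print(f"test accuracy: {acc:.3f}")
```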



Learning with Suitable Similarities: Learning with a Suitable Distance Function

What is a good distance function?

Definition ([Wang et al., 2007])
A distance function d : X × X → R is said to be (ε, γ, B)-good for a classification problem if there exist two class-conditional probability distributions D⁺ and D⁻ with D⁺(x)/D(x) < √B and D⁻(x)/D(x) < √B for all x ∈ X, such that at least a (1 − ε) probability mass of examples x ∼ D satisfies

$$\Pr_{\substack{x' \sim D^{+} \\ x'' \sim D^{-}}} \left[ \ell(x) \left( \ell(x')\, d(x, x') - \ell(x'')\, d(x, x'') \right) < 0 \right] \geq \frac{1}{2} + \gamma$$

The definition expects the distance function to set dissimilarly labeled points farther off than similarly labeled points.

Yet again this yields a classifier with guaranteed generalization properties.


Learning with Suitable Similarities: Learning with a Suitable Distance Function

Learning with a good distance function

Theorem ([Wang et al., 2007])
Given an (ε, γ, B)-good distance function, for any δ > 0, given n = (4B²/γ²) lg(1/δ) pairs of positively and negatively labeled points (x_i⁺, x_i⁻)_{i=1}^n, the classifier ℓ̂ defined below has error at margin γ/B no more than ε + δ with probability greater than 1 − δ:

$$\hat{\ell}(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \beta_i\, \mathrm{sgn}\big(d(x, x_i^{+}) - d(x, x_i^{-})\big)\Big), \qquad \sum_{i=1}^{n} \beta_i = 1, \quad \beta_i \geq 0$$

This naturally lends itself to a boosting-like implementation.

Each of the pairs yields a stump sgn(d(x, x_i⁺) − d(x, x_i⁻)).
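A minimal sketch of the resulting ensemble, under assumptions of its own: Euclidean d, uniform weights β_i = 1/n (a boosting procedure would tune these), toy Gaussian data, and each stump oriented so that it votes +1 when x lies nearer the positive landmark of its pair.

```python
import numpy as np

rng = np.random.default_rng(2)

def d(x, xs):
    return np.linalg.norm(xs - x, axis=-1)   # Euclidean distance (illustration)

# Toy problem: the label is the sign of the first coordinate (assumption)
pos = rng.normal(loc=(+1, 0), size=(30, 2))  # positively labeled landmarks x_i^+
neg = rng.normal(loc=(-1, 0), size=(30, 2))  # negatively labeled landmarks x_i^-
beta = np.full(len(pos), 1.0 / len(pos))     # uniform beta_i; boosting would tune these

def ell_hat(x):
    # Each pair (x_i^+, x_i^-) yields a stump; stumps are combined with weights beta_i
    stumps = np.sign(d(x, neg) - d(x, pos))  # +1 when x is nearer the positive landmark
    return np.sign(np.sum(beta * stumps))

test = rng.normal(size=(1000, 2))
acc = np.mean([ell_hat(x) == np.sign(x[0]) for x in test])
print(f"test accuracy: {acc:.3f}")
```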



Data-sensitive Notions of Suitability

A unified notion of what is a good similarity/distance

Disparate as the last two models may seem, they are, in fact, quite related to each other.

Motivated by this observation, we propose a notion of goodness that is data-sensitive.

This notion allows us to tune the goodness criterion itself, allowing for better classifiers.

The resulting model subsumes the previous two models.

Consequently, the model does not require separate treatment of similarity and distance functions either.


Data-sensitive Notions of Suitability

What is a good similarity/distance function?

Definition (K. and Jain, 2011)
A similarity function K : X × X → R is said to be (ε, γ, B)-good for a classification problem if, for some antisymmetric transfer function f : R → [−C_f, C_f] and some weighing function w : X × X → [−B, B], at least a (1 − ε) probability mass of examples x ∼ D satisfies

$$\mathbb{E}_{\substack{x' \sim D,\ \ell(x') = \ell(x) \\ x'' \sim D,\ \ell(x'') \neq \ell(x)}} \left[ w(x', x'')\, f\big(K(x, x') - K(x, x'')\big) \right] \geq 2 C_f \gamma$$

With appropriate settings of the weighing function and the transfer function, the previous two models can be recovered.
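To see the unification at work, the sketch below estimates the left-hand side of the definition on a sample for two choices of transfer function: f(t) = t, which gives a Balcan-Blum-style average-margin criterion, and f(t) = sgn(t), which gives a Wang-style probabilistic criterion. The data, the labeling, w ≡ 1, and the linear K are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = np.sign(X[:, 0])                   # toy labeling (assumption)
K = lambda a, b: float(a @ b)          # candidate similarity (assumption)
w = lambda a, b: 1.0                   # trivial weighing function w(x', x'')

def goodness(x, lab, f):
    # Sample estimate of E[ w(x', x'') f(K(x, x') - K(x, x'')) ] over pairs
    # with ell(x') = ell(x) and ell(x'') != ell(x)
    same = [xp for xp, yp in zip(X, y) if yp == lab]
    diff = [xp for xp, yp in zip(X, y) if yp != lab]
    return np.mean([w(a, b) * f(K(x, a) - K(x, b)) for a in same for b in diff])

f_linear = lambda t: t                 # recovers a Balcan-Blum-style criterion
f_sign = lambda t: np.sign(t)          # recovers a Wang-style criterion

print("f(t) = t     :", goodness(X[0], y[0], f_linear))
print("f(t) = sgn(t):", goodness(X[0], y[0], f_sign))
```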


Data-sensitive Notions of Suitability: Learning with Data-sensitive Notions of Suitability

Learning with data-sensitive notions of suitability

The learning algorithm is not as simple as before, since the guarantees we give hold only if a good transfer function is chosen.

Let us first see how, given a (good) transfer function, we can learn a (good) classifier.

We will later plug in the routines to learn the transfer function as well.


Page 76: Similarity-based Learning via Data Driven EmbeddingsSimilarity-based Learning via Data Driven Embeddings Purushottam Kar1 Prateek Jain2 1Indian Institute of Technology Kanpur 2Microsoft

Data-sensitive Notions of Suitability Learning with Data-sensitive Notions of Suitability

Learning with data-sensitive notions of suitability

Algorithm 1 LEARN-DISSIM
Require: A similarity function $K$, landmark pairs $L = (x_i^+, x_i^-)_{i=1}^n$, a good transfer function $f$.
Ensure: A classifier $\hat{\ell} : X \to \{-1, +1\}$
1: Define $\Phi_L : X \to \mathbb{R}^n$ as $\Phi_L : x \mapsto \big( f(K(x, x_i^+) - K(x, x_i^-)) \big)_{i=1}^n$
2: Get a labelled training set $T = \{t_j\}_{j=1}^{n'} \subset X$ sampled from $\mathcal{D}$.
3: Let $T' \leftarrow \{\Phi_L(t_j)\}_{j=1}^{n'} \subset \mathbb{R}^n$ be the data set embedded into $\mathbb{R}^n$.
4: Learn a linear hyperplane over $\mathbb{R}^n$ using $T'$: $\ell_{\mathrm{lin}} \leftarrow$ LEARN-LINEAR$(T')$.
5: Let $\hat{\ell} : X \to \{-1, +1\}$ be defined as $\hat{\ell} : x \mapsto \ell_{\mathrm{lin}}(\Phi_L(x))$.
6: return $\hat{\ell}$

LEARN-LINEAR may be taken to be any linear hyperplane learning algorithm, such as the Perceptron or an SVM.

The above procedure essentially creates a data-driven, problem-specific embedding of the domain $X$ into a Euclidean space; a runnable sketch follows.
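A minimal Python sketch of LEARN-DISSIM, assuming a vectorized similarity K(A, B) that returns an |A| x |B| matrix; the RBF similarity, tanh transfer, and logistic-regression LEARN-LINEAR below are illustrative choices, not prescribed by the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(X, pos_lm, neg_lm, K, f):
    # Phi_L : x -> ( f(K(x, x_i^+) - K(x, x_i^-)) )_{i=1..n}   (step 1)
    return f(K(X, pos_lm) - K(X, neg_lm))

def learn_dissim(K, pos_lm, neg_lm, f, X_train, y_train):
    T_emb = embed(X_train, pos_lm, neg_lm, K, f)        # step 3
    lin = LogisticRegression().fit(T_emb, y_train)      # LEARN-LINEAR (step 4)
    return lambda X: lin.predict(embed(X, pos_lm, neg_lm, K, f))  # step 5

# Illustrative instantiation: RBF similarity and a tanh transfer function.
K = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
f = np.tanh  # an antisymmetric transfer function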


The results given earlier guarantee small classification error at large margin.

This is not amenable to efficient algorithms, however, since hyperplane classification error is NP-hard to minimize [Garey and Johnson, 1979; Arora et al., 1997].

We therefore provide our guarantees in terms of smooth Lipschitz losses, such as the hinge loss and the log loss, that can be minimized efficiently over large datasets (two examples are sketched below).
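For concreteness, here are the two surrogate losses named above written as simple Python callables; both are 1-Lipschitz in their scalar argument.

import numpy as np

hinge = lambda t: np.maximum(0.0, 1.0 - t)   # hinge loss
logloss = lambda t: np.log1p(np.exp(-t))     # log loss (natural logarithm)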


Working with surrogate loss functions

Definition (K. and Jain, 2011)
A similarity function is said to be $(\epsilon, B)$-good with respect to a loss function $L : \mathbb{R} \to \mathbb{R}^+$ if for some transfer function $f : \mathbb{R} \to \mathbb{R}$ and some weighing function $w : X \times X \to [-B, B]$, $\mathbb{E}_{x \sim \mathcal{D}}[L(G(x))] \leq \epsilon$, where

$$G(x) = \mathop{\mathbb{E}}_{\substack{x' \sim \mathcal{D},\ \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\ \ell(x'') \neq \ell(x)}} \left[ w(x', x'')\, f\big(K(x, x') - K(x, x'')\big) \right]$$

Theorem (K. and Jain, 2011)
If $K$ is an $(\epsilon, B)$-good similarity function with respect to a $C_L$-Lipschitz loss function $L$, then for any $\epsilon_1 > 0$, with probability at least $1 - \delta$ over the choice of $d = (16 B^2 C_L^2 / \epsilon_1^2) \ln(4B / \delta\epsilon_1)$ landmark pairs, the expected loss of the classifier $\hat{\ell}(x)$ returned by LEARN-DISSIM with respect to $L$ satisfies $\mathbb{E}_x\big[L(\hat{\ell}(x))\big] \leq \epsilon + \epsilon_1$.
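As a toy illustration of the sample-complexity bound, the number of landmark pairs $d$ can be computed directly from the theorem's constants; the concrete values below are placeholders, not figures from the paper.

import math

def landmark_pairs(B, C_L, eps1, delta):
    # d = (16 B^2 C_L^2 / eps1^2) * ln(4 B / (delta * eps1))
    return math.ceil((16 * B**2 * C_L**2 / eps1**2)
                     * math.log(4 * B / (delta * eps1)))

print(landmark_pairs(B=1.0, C_L=1.0, eps1=0.1, delta=0.05))  # 10696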


Learning the Best Notion of Suitability


Learning the transfer function

We give uniform convergence guarantees that enable standard ERM-based routines to recover the best transfer function from any compact class of antisymmetric functions.

This yields a nested learning problem, with the ERM-based transfer function learning algorithm calling the classifier learning algorithm as a subroutine.

For any transfer function $f$ and an arbitrary set of landmarks $L$, let $\mathcal{L}(f) = \mathbb{E}_{x \sim \mathcal{D}}[L(G(x))]$, and let $\mathcal{L}(f, L)$ denote the generalization loss of the best classifier that uses the embedding $\Phi_L$ defined by the landmarks $L$.

The earlier result shows that, for a fixed $f$ and a large enough random $L$, $\mathcal{L}(f, L) \leq \mathcal{L}(f) + \epsilon_1$.


Theorem (K. and Jain, 2011)
Let $\mathcal{F}$ be a compact class of transfer functions with respect to the infinity norm, and let $\epsilon_1, \delta > 0$. Let $\mathcal{N}(\mathcal{F}, r)$ be the size of the smallest $\epsilon$-net over $\mathcal{F}$ with respect to the infinity norm at scale $r = \frac{\epsilon_1}{4 C_L B}$. Taking $n = \frac{64 B^2 C_L^2}{\epsilon_1^2} \ln\left(\frac{16 B \cdot \mathcal{N}(\mathcal{F}, r)}{\delta \epsilon_1}\right)$ random landmark pairs, we have, with probability greater than $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big[\, |\mathcal{L}(f, L) - \mathcal{L}(f)| \,\big] \leq \epsilon_1$$

Algorithm 2 FTUNE
Require: A family of transfer functions $\mathcal{F}$, a similarity function $K$, a loss function $L$.
Ensure: An optimal transfer function $f^* \in \mathcal{F}$.
1: Select $d$ landmark pairs $L$.
2: for all $f \in \mathcal{F}$ do
3:   $w_f \leftarrow$ LEARN-DISSIM$(K, L, f)$, $\mathcal{L}_f \leftarrow \mathcal{L}(f, L)$
4: end for
5: $f^* \leftarrow \arg\min_{f \in \mathcal{F}} \mathcal{L}_f$
6: return $f^*$
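A minimal sketch of FTUNE over a finite candidate family, reusing the learn_dissim sketch given earlier; scoring each transfer function by held-out 0-1 loss is an illustrative stand-in for $\mathcal{L}(f, L)$.

import numpy as np

def ftune(K, pos_lm, neg_lm, F, X_tr, y_tr, X_val, y_val):
    best_f, best_loss = None, np.inf
    for f in F:                                    # ERM over the family F
        clf = learn_dissim(K, pos_lm, neg_lm, f, X_tr, y_tr)
        loss = np.mean(clf(X_val) != y_val)        # proxy for L(f, L)
        if loss < best_loss:
            best_f, best_loss = f, loss
    return best_f

# Illustrative family: tanh transfer functions at several slopes.
F = [lambda t, c=c: np.tanh(c * t) for c in (0.5, 1.0, 2.0, 4.0)]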


Intelligent choice of landmark points

If the landmarks are clumped together, then all points get a similar embedding and linear separation becomes impossible.

We therefore promote diversity among the landmarks as a heuristic on small datasets (a sketch follows the pseudocode below).

On large datasets, FTUNE itself is able to recover the best transfer function, as it does not over-fit.

Algorithm 3 DSELECT
Require: A training set $T$.
Ensure: A set of $n$ landmark pairs.
1: $S \leftarrow$ RANDOM-ELEMENT$(T)$, $L \leftarrow \emptyset$
2: for $j = 2$ to $n$ do
3:   $z \leftarrow \arg\min_{x \in T} \sum_{x' \in S} K(x, x')$
4:   $S \leftarrow S \cup \{z\}$, $T \leftarrow T \setminus \{z\}$
5: end for
6: for $j = 1$ to $n$ do
7:   Sample $z_1, z_2$ from $S$ with replacement s.t. $\ell(z_1) = 1$, $\ell(z_2) = -1$
8:   $L \leftarrow L \cup \{(z_1, z_2)\}$
9: end for
10: return $L$
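A minimal sketch of DSELECT, again assuming a vectorized similarity K as before; the greedy step picks the point with the smallest total similarity to the landmarks chosen so far, and the pairing step samples positive/negative landmarks with replacement as in the pseudocode.

import numpy as np

def dselect(X, y, K, n, seed=0):
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))
    S = [remaining.pop(int(rng.integers(len(remaining))))]  # step 1
    for _ in range(n - 1):                                  # steps 2-5
        totals = K(X[remaining], X[S]).sum(axis=1)          # sum_{x' in S} K(x, x')
        S.append(remaining.pop(int(np.argmin(totals))))
    S = np.asarray(S)
    pos, neg = S[y[S] == 1], S[y[S] == -1]  # assumes both labels were selected
    return [(int(rng.choice(pos)), int(rng.choice(neg)))    # steps 6-9
            for _ in range(n)]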


Results

[Figure: Accuracy vs. number of landmarks (50 to 300) on the AmazonBinary, Amazon47, Mirex07 and FaceRec datasets, comparing FTUNE+D, FTUNE, BBS+D, BBS and DBOOST.]

[Figure: Accuracy vs. number of landmarks (50 to 300) on the Isolet, Letters, Pen-digits and Opt-digits datasets, comparing FTUNE (Single), FTUNE (Multiple), BBS and DBOOST.]


Discussion

BBS performs reasonably well for small landmarking sizes, while DBOOST performs well for large landmarking sizes.

In contrast, our method consistently outperforms the existing methods in both scenarios.

Since FTUNE selects its output by way of validation, it is susceptible to over-fitting on small datasets.

In these cases, DSELECT (intuitively) removes redundancies in the landmark points, thus allowing FTUNE to recover the best transfer function.


Thanks

Preprint available at http://www.cse.iitk.ac.in/users/purushot/


References


Arora, S., Babai, L., Stern, J., and Sweedyk, Z. (1997). The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations. Journal of Computer and System Sciences, 54(2):317–331.

Balcan, M.-F. and Blum, A. (2006). On a Theory of Learning with Similarity Functions. In International Conference on Machine Learning, pages 73–80.

Balcan, M.-F., Blum, A., and Vempala, S. (2006). Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings. Machine Learning, 65(1):79–94.

Garey, M. R. and Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.

Goldfarb, L. (1984). A Unified Approach to Pattern Recognition. Pattern Recognition, 17(5):575–582.


Gottlieb, L.-A., Kontorovich, A. L., and Krauthgamer, R. (2010). Efficient Classification for Metric Data. In Annual Conference on Computational Learning Theory.

Haasdonk, B. (2005). Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492.

Kar, P. and Jain, P. (2011). Similarity-based Learning via Data Driven Embeddings. In 25th Annual Conference on Neural Information Processing Systems. (to appear).

Pekalska, E. and Duin, R. P. W. (2001). On Combining Dissimilarity Representations. In Multiple Classifier Systems, pages 359–368.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural Risk Minimization over Data-Dependent Hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940.


von Luxburg, U. and Bousquet, O. (2004). Distance-Based Classification with Lipschitz Functions. Journal of Machine Learning Research, 5:669–695.

Wang, L., Yang, C., and Feng, J. (2007). On Learning with Dissimilarity Functions. In International Conference on Machine Learning, pages 991–998.

Weinberger, K. Q. and Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.
