
Kernel Methods: the Emergence of a Well-founded Machine Learning
John Shawe-Taylor

Centre for Computational Statistics and Machine Learning

University College London


Overview

Celebration of 10 years of kernel methods: what has been achieved and what can we learn from the experience?

Some historical perspectives: Theory or not? Applicable or not?

Some emphases:
role of theory – need for a plurality of approaches
importance of scalability

Surprising fact: there are now 13,444 citations of Support Vectors reported by Google Scholar – the vast majority from papers not about machine learning!

Ratio of NN to SV citations in Google Scholar – 1987-1996: 220; 1997-2006: 11


Caveats

Personal perspective with inevitable bias
One very small slice through what is now a very big field
Focus on theory with emphasis on frequentist analysis
There is no pro-forma for scientific research
But the role of theory is worth discussing – needed to give a firm foundation for proposed approaches?


Motivation behind kernel methods
Linear learning typically has nice properties:
unique optimal solutions
fast learning algorithms
better statistical analysis
But one big problem: insufficient capacity


Historical perspective

Minsky and Papert highlighted the weakness in their book Perceptrons
Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
Solved the problem of capacity and led to a very impressive extension of the applicability of learning
But ran into training problems of speed and multiple local minima


Kernel methods approach

The kernel methods approach is to stick with linear functions but work in a high dimensional feature space:
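The displayed mapping was not preserved in the transcript; written out in standard notation (a reconstruction), the data are embedded by a feature map

\phi : x \in X \mapsto \phi(x) \in F,

and linear functions are then learned in the feature space F.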

The expectation is that the feature space has a much higher dimension than the input space.


Example

Consider the mapping

If we consider a linear equation in this feature space:

We actually have an ellipse – i.e. a non-linear shape in the input space.
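The formulas on this slide were not preserved; a standard version of the example (an assumption on my part, but consistent with the 'quadratic example' referred to later) is the feature map

\phi : (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2),

for which a linear equation in feature space,

w_1 x_1^2 + w_2 x_2^2 + \sqrt{2}\, w_3\, x_1 x_2 = c,

is a conic – for suitable weights an ellipse – in the original input space.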


Capacity of feature spaces

The capacity is proportional to the dimension – for example:

2-dim:


Form of the functions

So kernel methods use linear functions in a feature space:

For regression this could be the function itself; for classification we require thresholding.
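The displayed functions were not preserved; a standard reconstruction is

f(x) = \langle w, \phi(x) \rangle + b, \qquad h(x) = \operatorname{sign}\big( f(x) \big),

with f itself used for regression and its sign for classification.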


Problems of high dimensions

Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means it is unlikely to generalise well

Computational costs involved in dealing with large vectors


Overview

Two theoretical approaches converged on very similar algorithms:
the frequentist approach led to the Support Vector Machine
the Bayesian approach led to Bayesian inference using Gaussian Processes
First we briefly discuss the Bayesian approach before mentioning some of the frequentist results


Bayesian approach

The Bayesian approach relies on a probabilistic analysis by positing:
a noise model
a prior distribution over the function class
Inference involves updating the prior distribution with the likelihood of the data
Possible outputs:
MAP function
Bayesian posterior average


Bayesian approach
Avoids overfitting by:
controlling the prior distribution
averaging over the posterior
For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a 'kernel' method where:
the kernel is the covariance of the prior GP
the noise model translates into the addition of a ridge to the kernel matrix
MAP and averaging give the same solution
Link with the infinite hidden node limit of single hidden layer neural networks – see the seminal paper: Williams, Computation with infinite neural networks (1997)
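To make the ridge connection concrete (a standard Gaussian process regression identity rather than the slide's own equation): with GP prior covariance \kappa and Gaussian noise of variance \sigma^2, the MAP / posterior-mean regressor is

f(x) = \mathbf{k}(x)^\top \big( K + \sigma^2 I \big)^{-1} \mathbf{y}, \qquad K_{ij} = \kappa(x_i, x_j), \ \ \mathbf{k}(x)_i = \kappa(x_i, x),

so the noise variance appears exactly as a ridge added to the kernel matrix.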

Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function!


Bayesian approach

Subject to assumptions about the noise model and prior distribution:
can get error bars on the output
compute evidence for the model and use for model selection
Approach developed for different noise models, eg classification
Typically requires approximate inference


Frequentist approach

Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data

Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived

Main focus is on generalisation error analysis


Capacity problem
What do we mean by generalisation?


Generalisation of a learner
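The slide's content was not preserved; a standard way to state the quantity examined in the following slides (my reconstruction) is the expected error of the learned function f_S on a fresh example from the data-generating distribution D,

\mathrm{err}(f_S) = \mathbb{P}_{(x, y) \sim \mathcal{D}}\big[ f_S(x) \neq y \big],

which the experiments below estimate using the error on a held-out test set.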


Example of Generalisation

We consider the Breast Cancer dataset from the UCI repository
Use the simple Parzen window classifier: the weight vector is w = \mu_+ - \mu_-,
where \mu_+ (\mu_-) is the average of the positive (negative) training examples.

Threshold is set so hyperplane bisects the line joining these two points.
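As a concrete sketch (my own Python/NumPy rendering of the classifier just described, not code from the talk), the weight vector is the difference of the class means and the threshold is chosen so the hyperplane bisects the segment joining them:

    import numpy as np

    def parzen_window_classifier(X_pos, X_neg):
        # w = (mean of positive examples) - (mean of negative examples)
        mu_plus = X_pos.mean(axis=0)
        mu_minus = X_neg.mean(axis=0)
        w = mu_plus - mu_minus
        # threshold so the hyperplane passes through the midpoint of the two means:
        # <w, (mu_plus + mu_minus)/2> = (||mu_plus||^2 - ||mu_minus||^2) / 2
        b = 0.5 * (mu_plus @ mu_plus - mu_minus @ mu_minus)
        return w, b

    def predict(w, b, X):
        # +1 for the positive class, -1 for the negative class
        return np.sign(X @ w - b)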


Example of Generalisation

By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, using the test set error as a proxy for the true generalisation
We plot the histogram and the average of the distribution for various sizes of training set:
648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.


Example of Generalisation
Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.


Error distribution: full dataset


Error distribution: dataset size: 342


Error distribution: dataset size: 273


Error distribution: dataset size: 205


Error distribution: dataset size: 137


Error distribution: dataset size: 68


Error distribution: dataset size: 34


Error distribution: dataset size: 27


Error distribution: dataset size: 20


Error distribution: dataset size: 14


Error distribution: dataset size: 7


Observations

Things can get bad if the number of training examples is small compared to the dimension (in this case the input dimension is 9)

Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong

Key ingredient of learning – keep flexibility high while still ensuring good generalisation


Controlling generalisation
The critical method of controlling generalisation for classification is to force a large margin on the training data:
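A reconstruction of the maximal-margin problem the slide refers to (the displayed equation was not preserved): for separable data,

\min_{w, b}\ \tfrac{1}{2}\, \|w\|^2 \quad \text{subject to} \quad y_i \big( \langle w, \phi(x_i) \rangle + b \big) \ge 1, \quad i = 1, \dots, m,

which maximises the geometric margin 1/\|w\| on the training set.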


Intuitive and rigorous explanations
Makes classification robust to uncertainties in inputs
Can randomly project into lower dimensional spaces and still have separation – so effectively low dimensional
Rigorous statistical analysis shows effective dimension
This is not structural risk minimisation over VC classes since the hierarchy depends on the data: data-dependent structural risk minimisation
see S-T, Bartlett, Williamson & Anthony (1996 and 1998)

Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were a wake-up call that classical PAC analysis was not capturing critical factors in real-world applications!


Learning framework

Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment – the luckiness framework captures this idea

Now consider using an SVM on the same data and compare the distribution of generalisations

SVM distribution in red


Error distribution: dataset size: 205


Error distribution: dataset size: 137


Error distribution: dataset size: 68


Error distribution: dataset size: 34


Error distribution: dataset size: 27


Error distribution: dataset size: 20


Error distribution: dataset size: 14


Error distribution: dataset size: 7


Handling training errors

So far only considered case where data can be separated

For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin

These amounts are often referred to as slack variables from optimisation theory


Support Vector Machines

SVM optimisation
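The optimisation shown here was not preserved; the standard 1-norm soft-margin SVM (a reconstruction) is

\min_{w, b, \xi}\ \tfrac{1}{2}\, \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i \big( \langle w, \phi(x_i) \rangle + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0,

where the slack variables \xi_i measure by how much each point fails to meet the margin.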

Analysis of this case is given using the augmented space trick in S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms.


Complexity problem

Let’s apply the quadratic example to a 20x30 image of 600 pixels – this gives approximately 180,000 dimensions!

Would be computationally infeasible to work in this space


Dual representation

Suppose weight vector is a linear combination of the training examples:

can evaluate inner product with new example
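The two displayed expressions were not preserved; in standard notation (a reconstruction):

w = \sum_{i=1}^{m} \alpha_i\, \phi(x_i), \qquad \langle w, \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i\, \langle \phi(x_i), \phi(x) \rangle,

so only inner products between feature vectors are ever required.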


Learning the dual variables

The αi are known as dual variables
Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem
Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly


Dual form of SVM

The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives:
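The dual problem displayed on the slide was not preserved; its standard form for the 1-norm soft margin (a reconstruction) is

\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, \kappa(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0.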

Note that threshold must be determined from border examples


Using kernels

Critical observation is that again only inner products are used

Suppose that we now have a shortcut method of computing:

Then we do not need to explicitly compute the feature vectors either in training or testing


Kernel example

As an example consider the mapping

Here we have a shortcut:
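A standard reconstruction of the lost formulas: for the quadratic feature map used earlier, the shortcut is

\kappa(x, z) = \langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2,

i.e. the feature-space inner product is obtained by squaring the input-space inner product.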


Efficiency

Hence, in the pixel example, rather than work with 180,000-dimensional vectors, we compute a 600-dimensional inner product and then square the result!

Can even work in infinite dimensional spaces, eg using the Gaussian kernel:
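The kernel itself did not survive the transcript; the standard Gaussian (RBF) kernel is

\kappa(x, z) = \exp\!\left( - \frac{\| x - z \|^2}{2 \sigma^2} \right),

whose feature space is infinite dimensional.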


Using Gaussian kernel for Breast: 273


Data size 342

Surprising fact: kernel methods are invariant to rotations of the coordinate system – so any special information encoded in the choice of coordinates will be lost! Surprisingly, one of the most successful applications of SVMs has been for text classification, in which one would expect the encoding to be very informative!


Constraints on the kernel

There is a restriction on the function:

This restriction holding for any training set is enough to guarantee that the function is a kernel
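The displayed condition was not preserved; the standard statement (a symmetric, finitely positive semi-definite function) is that for every finite set x_1, \dots, x_m and all coefficients \alpha_i,

\sum_{i,j=1}^{m} \alpha_i \alpha_j\, \kappa(x_i, x_j) \ge 0,

i.e. every kernel matrix is symmetric positive semi-definite.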

Surprising fact: the property that guarantees the existence of a feature space is precisely the property required to ensure convexity of the resulting optimisation problem for SVMs, etc!


What have we achieved?

Replaced the problem of neural network architecture by kernel definition
Arguably more natural to define, but the restriction is a bit unnatural
Not a silver bullet, as fit with the data is key
Can be applied to non-vectorial (or high dimensional) data
Gained more flexible regularisation/generalisation control
Gained a convex optimisation problem, i.e. NO local minima!


Historical note

The first use of kernel methods in machine learning was in combination with the perceptron algorithm: Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964)
Apparently failed because of generalisation issues – the margin was not optimised, but this can be incorporated with one extra parameter!

Surprising fact: good covering number bounds on the generalisation of SVMs use a bound on the updates of the perceptron algorithm to bound the covering numbers and hence the generalisation – but the same bound can be applied directly to the perceptron without margin using a sparsity argument, so the bound is tighter for the classical perceptron!


PAC-Bayes: General perspectives
The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena
Several theories of machine learning: notably Bayesian and frequentist
Different assumptions and hence different range of applicability and range of results
Bayesian able to make more detailed probabilistic predictions
Frequentist makes only the i.i.d. assumption


Evidence and generalisation
Evidence is a measure used in Bayesian analysis to inform model selection
Evidence is related to the posterior volume of weight space consistent with the sample
Link between evidence and generalisation hypothesised by MacKay
First such result was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space – included a dependence on the dimension of the space but applies to non-linear function classes
Used the Luckiness framework where luckiness is measured by the volume of the ball


PAC-Bayes Theorem

The PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions
First version proved by McAllester in 1999
Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
Application to SVMs by Langford and S-T, also in 2002
Gives the tightest bounds on generalisation for SVMs – see the Langford tutorial in JMLR
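One common statement of the theorem, in the Seeger/Langford form (the slide's own statement was not preserved): with probability at least 1 - \delta over an i.i.d. sample of size m, for all posteriors Q,

\mathrm{KL}\big( \hat{e}_Q \,\|\, e_Q \big) \le \frac{ \mathrm{KL}(Q \,\|\, P) + \ln \frac{m+1}{\delta} }{ m },

where \hat{e}_Q and e_Q are the empirical and true Gibbs risks and P is the prior.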


Examples of bound evaluation

Problem     PAC-Bayes     PAC-Bayes prior     Test Error
Wdbc        0.334±.005    0.306±.018          0.073±.021
Image       0.254±.003    0.184±.010          0.074±.014 (0.056±.01)
Waveform    0.198±.002    0.151±.005          0.089±.005
Ringnorm    0.212±.002    0.109±.024          0.026±.005

Surprising fact: optimising the bound does not typically improve the test error – despite significant improvements in the bound itself! Training an SVM on half the data to learn a prior and then using the rest to learn relative to this prior further improves the bound with almost no effect on the test error!


Principal Components Analysis (PCA)
Classical method is principal component analysis – looks for directions of maximum variance, given by the eigenvectors of the covariance matrix


Dual representation of PCA

Eigenvectors of kernel matrix give dual representation:

Means we can perform PCA projection in a kernel defined feature space: kernel PCA


Kernel PCA

Need to take care of normalisation to obtain:

where λ is the corresponding eigenvalue.
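A minimal NumPy sketch of kernel PCA as described on the last two slides (my own illustrative code, assuming the kernel matrix has already been computed; dual vectors are scaled by 1/sqrt(eigenvalue) so the feature-space directions have unit norm):

    import numpy as np

    def kernel_pca(K, n_components):
        # centre the kernel matrix in feature space
        m = K.shape[0]
        J = np.eye(m) - np.ones((m, m)) / m
        Kc = J @ K @ J
        # eigen-decomposition of the (symmetric) centred kernel matrix
        eigvals, eigvecs = np.linalg.eigh(Kc)
        order = np.argsort(eigvals)[::-1][:n_components]   # leading components
        lam, alpha = eigvals[order], eigvecs[:, order]
        # normalise dual vectors so each direction v = sum_i alpha[i] phi(x_i) has unit norm
        alpha = alpha / np.sqrt(lam)                        # assumes the leading eigenvalues are positive
        # projections of the training points onto the components
        projections = Kc @ alpha
        return projections, alpha, lam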


Generalisation of k-PCA
How reliable are estimates obtained from a sample when working in such high dimensional spaces?
Using Rademacher complexity bounds we can show that if a low dimensional projection captures most of the variance in the training set, it will do well on test examples as well, even in high dimensional spaces
see S-T, Williams, Cristianini and Kandola (2004) On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA
The bound also gives a new bound on the difference between the process and empirical eigenvalues

Surprising fact: no frequentist bounds apply for learning in the representation given by k-PCA – the data is used twice, once to find the subspace and then to learn the classifier/regressor! Intuitively, this shouldn’t be a problem since the labels are not used by k-PCA!


Latent Semantic Indexing

Developed in information retrieval to overcome the problem that the bag-of-words representation misses semantic relations between words

Use Singular Value Decomposition of Term-doc matrix:


Lower dimensional representation If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense:

In this case we can write a ‘semantic matrix’ as
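A reconstruction of the lost formulas in standard SVD notation: if the term–document matrix factorises as D = U \Sigma V^\top and U_k denotes the first k columns of U, then one common choice of 'semantic matrix' is the projection

S = U_k U_k^\top, \qquad \hat{\phi}(d) = U_k^\top d,

so documents are compared through their k-dimensional projections.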


Latent Semantic Kernels

Can perform the same transformation in a kernel defined feature space by performing an eigenvalue decomposition of the kernel matrix (this is equivalent to kernel PCA):


Related techniques
A number of related techniques:
Probabilistic LSI (pLSI)
Non-negative Matrix Factorisation (NMF)
Multinomial PCA (mPCA)
Discrete PCA (DPCA)
All can be viewed as alternative decompositions:


Different criteria

Vary by:
different constraints (eg non-negative entries)
different prior distributions (eg Dirichlet, Poisson)
different optimisation criteria (eg max likelihood, Bayesian)
Unlike LSI, they typically suffer from local minima and so require EM-type iterative algorithms to converge to solutions


Other subspace methods

Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition – greedy kernel PCA

Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics – takes account of labels

Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views


Paired corpora

Can we use paired corpora to extract more information?

Two views of same semantic object – hypothesise that both views contain all of the necessary information, eg document and translation to a second language:


aligned text
[Diagram: an aligned corpus – English documents E1, E2, …, EN paired with their French counterparts F1, F2, …, FN.]


Canadian parliament corpus
E12: LAND MINES – Ms. Beth Phinney (Hamilton Mountain, Lib.): Mr. Speaker, we are pleased that the Nobel peace prize has been given to those working to ban land mines worldwide. We hope this award will encourage the United States to join the over 100 countries planning to come to …
F12: LES MINES ANTIPERSONNEL – Mme Beth Phinney (Hamilton Mountain, Lib.): Monsieur le Président, nous nous réjouissons du fait que le prix Nobel ait été attribué à ceux qui oeuvrent en faveur de l'interdiction des mines antipersonnel dans le monde entier. Nous espérons que cela incitera les Américains à se joindre aux représentants de plus de 100 pays qui ont l'intention de venir à …


cross-lingual lsi via svd

[Diagram: the English and French term–document matrices E and F are stacked and factorised by a single SVD, (E; F) = U D V', so the left singular vectors U span a shared cross-lingual semantic space.]

M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross-language information retrieval. Kluwer, 1998.


cross-lingual kernel canonical correlation analysis

[Diagram: English and French documents are mapped by Φ from the input “English” and “French” spaces into the corresponding feature spaces, where paired projection directions fE1, fE2, … and fF1, fF2, … are sought.]


kernel canonical correlation analysis

\max_{f_E, f_F}\ \mathrm{corr}\big( f_E(E),\, f_F(F) \big), \qquad f_E(\cdot) = \sum_{l} \alpha_{E,l}\, \kappa_E(E_l, \cdot), \quad f_F(\cdot) = \sum_{l} \alpha_{F,l}\, \kappa_F(F_l, \cdot)

In dual variables this reduces to the generalised eigenproblem B\,\alpha = \rho\, D\,\alpha, where

B = \begin{pmatrix} 0 & K_E K_F \\ K_F K_E & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} K_E^2 & 0 \\ 0 & K_F^2 \end{pmatrix}

and K_E, K_F are the kernel matrices of the two views.


regularization

Using kernel functions may result in overfitting
Need to control the flexibility of the projections fE and fF: add a diagonal to the matrix D:

D = \begin{pmatrix} (K_E + \tau I)^2 & 0 \\ 0 & (K_F + \tau I)^2 \end{pmatrix}

\tau is the regularization parameter
Theoretical analysis shows that provided the norms of the weight vectors are small the correlation will still hold for new data
see S-T and Cristianini (2004) Kernel Methods for Pattern Analysis


pseudo query test

[Diagram: a query qei is generated from an English test document Ei and matched against the French documents F1, F2, …, FN.]

Queries were generated from each test document by extracting 5 words with the highest TFIDF weights and using them as a query.


Experimental Results
The goal was to retrieve the paired document.

Experimental procedure:

(1) LSI/KCCA trained on paired documents,

(2) All test documents projected into the LSI/KCCA semantic space,

(3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.


English-French retrieval accuracy, %

          100     200     300     400     500
CL-LSI    29.95   38.20   42.44   45.35   47.11
CL-KCCA   67.88   74.92   77.56   78.87   79.79


Applying to different data types
Data
Combined image and associated text obtained from the web
Three categories: sport, aviation and paintball
400 examples from each category (1200 overall)
Features extracted: HSV, Texture, Bag of words
Tasks
1. Classification of web pages into the 3 categories
2. Text query -> image retrieval


Classification error of baseline method

Colour    Texture    Colour + Texture    Text    All
22.9%     18.3%      13.6%               9.0%    3.0%

Previous error rates obtained using probabilistic ICA classification done for single feature groups


Classification rates using KCCA

K                  Error rate
Plain SVM          2.13% ± 0.23%
KCCA-SVM (150)     1.36% ± 0.15%
KCCA-SVM (200)     1.21% ± 0.27%


Classification with multi-views
If we use KCCA to generate a semantic feature space and then learn with an SVM, we can envisage combining the two steps into a single 'SVM-2K'
Learns two SVMs, one on each representation, but constrains their outputs to be similar across the training (and any unlabelled data) – see the constraint sketched below
Can give classification for single mode test data – applied to patent classification for Japanese patents
Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap
see Farquhar, Hardoon, Meng, S-T, Szedmak (2005) Two view learning: SVM-2K, Theory and Practice
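A sketch of the agreement constraint described above, following the SVM-2K paper (the exact form on the slide was not preserved): the two SVMs f_A and f_B are trained jointly subject to an \epsilon-insensitive similarity constraint on the training points,

\big| f_A(x_i) - f_B(x_i) \big| \le \epsilon + \eta_i, \qquad \eta_i \ge 0,

with the slacks \eta_i penalised in the joint objective alongside the two SVM objectives.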


Results

        pSVM        kcca_SVM    SVM         SVM_2k_j    Concat      SVM_2k
1       59.4±3.9    60.3±2.8    66.6±2.8    66.1±2.6    67.5±2.3    67.5±2.1
2       71.1±4.5    68.4±4.4    73.0±4.0    74.8±4.7    73.9±4.0    75.1±4.1
3       16.7±1.2    13.1±1.0    18.8±1.6    20.8±1.9    21.5±1.9    22.5±1.7
7       74.9±1.8    76.0±1.2    76.7±1.3    77.5±1.4    79.0±1.2    80.7±1.5
12      75.0±0.8    73.6±0.8    76.8±1.0    77.6±0.7    76.8±0.6    78.4±0.6
14      76.0±1.6    71.5±1.5    80.9±1.3    82.2±1.3    81.4±1.4    82.7±1.3

Column key: pSVM – dual variables; kcca_SVM – KCCA+SVM; SVM – direct SVM; SVM_2k_j – X-ling. SVM-2k; Concat – Concat+SVM; SVM_2k – Co-ling. SVM-2k


Targeted Density Learning

Learning a probability density function in an L1 sense is hard

Learning for a single classification or regression task is ‘easy’

Consider a set of tasks (Touchstone Class) F that we think may arise with a distribution emphasising those more likely to be seen

Now learn a density that does well for the tasks in F
Surprisingly, it can be shown that just including a small subset of tasks gives us convergence over all the tasks


Example plot
Consider fitting the cumulative distribution up to sample points, as proposed by Mukherjee and Vapnik
The plot shows loss as a function of the number of constraints (tasks) included:


Idealised view of progress

Study problem to develop theoretical model
Derive analysis that indicates factors that affect solution quality
Translate into optimisation maximising factors – relaxing to ensure convexity
Develop efficient solutions using specifics of the task


Role of theory

Theoretical analysis plays a critical auxiliary role:
can never be the complete picture
needs to capture the factors that most influence the quality of the result
a means to an end – only as good as the result
important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms


Conclusions and future directions
Extensions to learning data with output structure that can be used to inform the learning
More general subspace methods: algorithms and theoretical foundations
Extensions to more complex tasks, eg system identification, reinforcement learning
Linking learning into tasks with more detailed prior knowledge, eg stochastic differential equations for climate modelling
Extending the theoretical framework for more general learning tasks, eg learning a density function