Recent Advances in Machine Learning

Olivier Bousquet, Pertinence

IWFHR, La Baule, 2006

Transcript of the slides at ml.typepad.com/Talks/iwfhr.pdf

Goal of this talk

Demystifying some of the recent learning algorithms:
- Forget about how they were originally derived
- Forget about how they are "marketed"
- Rebuild them from scratch

Give hints at how to choose between them

Show how to integrate prior knowledge


Outline

1 Constructing Algorithms
  - Starting From Similarity I
  - Starting From Similarity II
  - Starting From Features

2 Choosing the Appropriate Algorithm
  - Unified view via Regularization
  - How to choose?

3 Wrap-up and Conclusion


Constructing Algorithms

Starting From Scratch

Assume we are engineers who want to build a good binary classification algorithm

Assume we have not heard about recent advances in Machine Learning

Standard notation:
- Training examples (x_1, y_1), ..., (x_n, y_n)
- x_i: an arbitrary object in X (e.g. an image)
- y_i: a binary label in {+1, −1}
- f : X → ℝ: the classification function (the decision corresponds to sgn f(x))


Starting From Similarity I

Assume some colleague of yours gives you a similarity measure on images and tells you that whenever the similarity is high, the images are likely to correspond to the same character

Similarity function: s : X × X → ℝ

Assume further that for any x, s(x, x) ≥ 0


Simplistic Approach

Compute the similarity of a new example to all training examples

Compare the average similarity to the positives with the average similarity to the negatives:

f(x) = (1/n₊) ∑_{i: y_i = +1} s(x_i, x) − (1/n₋) ∑_{i: y_i = −1} s(x_i, x)
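A minimal numpy sketch of this rule (the Gaussian similarity and the toy data are illustrative, not from the slides):

```python
import numpy as np

def simplistic_classifier(s, X_train, y_train, x):
    """Score a new example x by its mean similarity to the positive
    examples minus its mean similarity to the negative examples;
    the predicted class is the sign of the score."""
    pos = [s(xi, x) for xi, yi in zip(X_train, y_train) if yi == +1]
    neg = [s(xi, x) for xi, yi in zip(X_train, y_train) if yi == -1]
    return np.mean(pos) - np.mean(neg)

# Toy usage with a Gaussian similarity on 2-D points (illustrative):
s = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2))
X_train = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
y_train = [+1, +1, -1, -1]
print(simplistic_classifier(s, X_train, y_train, (0.2, 0.5)))  # positive score
```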


Refined Approach

Not fully satisfactory: some training examples are misclassified

Try to modify the weights, i.e. look for a function

f(x) = ∑_{i=1}^n α_i y_i s(x_i, x)

with the following constraints on the weights:

∀i, α_i ≥ 0,   ∑_i α_i y_i = 0,   ∑_i α_i = 2

which is equivalent to

∀i, α_i ≥ 0,   ∑_{i: y_i = +1} α_i = ∑_{i: y_i = −1} α_i = 1


Tuning the weights

When x_i is misclassified, y_i f(x_i) ≤ 0

y_i f(x_i) = α_i s(x_i, x_i) + ∑_{j ≠ i} α_j y_i y_j s(x_i, x_j)

In order to increase y_i f(x_i), we need to increase α_i, i.e. decrease α_i y_i f(x_i)

Let us do it simultaneously for all examples:

min_α ∑_{i=1}^n α_i y_i f(x_i)

but f itself depends on the α_i, so substituting f we get

min_α ∑_{i,j} α_i α_j y_i y_j s(x_i, x_j)
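As an illustration only, one way the update loop could look in numpy; restoring the constraints of the previous slide by per-class renormalization is an assumption of this sketch, not something the slides specify:

```python
import numpy as np

def tune_weights(S, y, steps=200, eta=0.1):
    """Iteratively raise alpha_i for misclassified examples.
    S is the symmetric n x n matrix of similarities S[i, j] = s(x_i, x_j).
    After each step, the weights of each class are renormalized to sum
    to 1, so that sum(alpha) = 2 and sum(alpha * y) = 0 hold."""
    y = np.asarray(y, dtype=float)
    alpha = np.ones(len(y))
    for _ in range(steps):
        f = S @ (alpha * y)          # f(x_i) = sum_j alpha_j y_j s(x_j, x_i)
        alpha[y * f <= 0] += eta     # increase weights of misclassified points
        for c in (+1.0, -1.0):       # project back onto the constraint set
            alpha[y == c] /= alpha[y == c].sum()
    return alpha
```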


Illustration

[Figure: evolving weights]

What did we obtain?

min_{α_i ≥ 0, ∑ α_i y_i = 0, ∑ α_i = 2}  ∑_{i,j} α_i α_j y_i y_j s(x_i, x_j)

Exactly the hard-margin SVM!

This optimization problem is convex (which implies it has a unique solution) provided s is a positive definite kernel
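For concreteness, a sketch that solves this dual with a generic constrained optimizer (production SVM code uses specialized QP solvers such as SMO; the function name is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_dual(S, y):
    """Minimize sum_{i,j} alpha_i alpha_j y_i y_j s(x_i, x_j) subject to
    alpha_i >= 0, sum_i alpha_i y_i = 0 and sum_i alpha_i = 2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Q = S * np.outer(y, y)                     # Q[i, j] = y_i y_j s(x_i, x_j)
    res = minimize(
        lambda a: a @ Q @ a, np.full(n, 2.0 / n),
        jac=lambda a: 2.0 * (Q @ a),
        bounds=[(0.0, None)] * n,
        constraints=[{"type": "eq", "fun": lambda a: a @ y},
                     {"type": "eq", "fun": lambda a: a.sum() - 2.0}],
        method="SLSQP")
    return res.x                               # the weights alpha_i
```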


Regularizing further

min_{α_i ≥ 0, ∑ α_i y_i = 0, ∑ α_i = 2}  ∑_{i,j} α_i α_j y_i y_j s(x_i, x_j)

We may want to avoid having a single α_i take all the weight

First way: add a constraint α_i ≤ c (L1 soft-margin SVM)

Second way: add a term to the objective function: ∑_{i,j} α_i α_j y_i y_j s(x_i, x_j) + c ∑_i α_i² (L2 soft-margin SVM)
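Both variants are small edits to the SLSQP sketch above (c is the regularization constant; this fragment reuses n and Q from that sketch):

```python
c = 1.0

# L1 soft margin: cap each weight by tightening the bounds
bounds = [(0.0, c)] * n

# L2 soft margin: add a ridge term on alpha to the objective instead
objective = lambda a: a @ Q @ a + c * (a @ a)
```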


Wrap-up

- Convex combination of similarities to examples
- Increase the weights of misclassified examples till convergence
- Possibly add a regularization term or a constraint on the weights
- Forget about margins, high-dimensional feature spaces, linear separators... kernels are used to make the optimization tractable


Starting From Similarity II

Assume some colleague of yours gives you a similarity measure on images and tells you that it makes sense only locally, but that it can be considered transitive (a similar to b and b similar to c implies a similar to c)

Assume further that you already know the examples to be classified (semi-supervised learning), and set y_{n+i} = 0 for those

Assume also that ∀x, x′: s(x, x′) ≥ 0


Propagating Similarity

Basic idea: predict using similarity weighting

f(x) = ∑_{i=1}^n y_i s(x_i, x) / ∑_{i=1}^n s(x_i, x)

This only uses the local similarity. To use transitivity, consider the normalized similarity matrix

S_ij = s(x_i, x_j) / ∑_{k=1}^n s(x_i, x_k)

Use transitivity to make the similarity more global:

S_{k+1} = (1 − α)I + α S S_k,   S_0 = I,   S_k → S_∞ = (1 − α)(I − αS)^{−1}

Combine predictions with this new similarity:

f(x_i) = ∑_{j=1}^n y_j s_∞(x_i, x_j)
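A compact numpy sketch of the whole pipeline, assuming a nonnegative symmetric similarity matrix with positive row sums over the labeled and unlabeled points, and y_i = 0 on the unlabeled ones:

```python
import numpy as np

def propagate_and_predict(S, y, alpha=0.9):
    """S: n x n nonnegative similarity matrix over labeled and unlabeled
    points; y: labels in {-1, +1}, with 0 for unlabeled points.
    Row-normalize S, take the propagation limit
    S_inf = (1 - alpha)(I - alpha * S)^{-1}, and predict by
    similarity-weighted label sums (the sign gives the class)."""
    S = np.asarray(S, dtype=float)
    S = S / S.sum(axis=1, keepdims=True)  # S_ij = s(x_i,x_j) / sum_k s(x_i,x_k)
    n = len(y)
    S_inf = (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)
    return S_inf @ np.asarray(y, dtype=float)
```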


[Figure: 2-D toy example of similarity propagation, shown over several stages]

Wrap-up

- Propagate the similarity to make it more global (i.e. add up all paths)
- Predict by summing all labels with similarity weights
- Forget about manifolds, the spectrum of the Laplacian...


Starting From Features

Assume some colleague of yours gives you a large set of binary features and tells you that he believes a small number of them will suffice to classify the images

Set of features H, possibly infinite, with h(x) ∈ {−1, +1} (can be generalized to [−1, 1])

Goal: construct a linear combination of features


Building a linear combination I

Idea: let us be greedy

Pick the most accurate feature:

max_h ∑_{i=1}^n y_i h(x_i)

Add it to the linear combination: f(x) = h(x)

Update (compute the error differently)


Building a linear combination II

Modify the way of choosing the next feature (in order to reduce the error): increase the weight of misclassified examples (just as for the SVM!)

Introduce weights d_i ∝ exp(−y_i f(x_i))

Pick the most accurate feature under these weights:

max_h ∑_{i=1}^n d_i y_i h(x_i)

Add it to the linear combination: f(x) = ∑_j α_j h_j(x)
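Putting the two steps together gives an AdaBoost-style loop. The step size α_j = ½ log((1 − ε_j)/ε_j) used below is the standard AdaBoost choice, taken here as one concrete option among the "various ways of choosing the α_j" mentioned on the next slide:

```python
import numpy as np

def greedy_boost(H, X, y, rounds=50):
    """H: candidate features, each h(x) in {-1, +1}; X, y: training data.
    Repeatedly pick the feature maximizing sum_i d_i y_i h(x_i), add it
    with step size alpha, and reweight the examples so that d_i is
    proportional to exp(-y_i f(x_i))."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = np.full(n, 1.0 / n)
    f = np.zeros(n)
    model = []                                    # pairs (alpha_j, h_j)
    preds = np.array([[h(x) for x in X] for h in H], dtype=float)
    for _ in range(rounds):
        j = int(np.argmax(preds @ (d * y)))       # best feature under d
        eps = d[preds[j] != y].sum()              # its weighted error
        if eps <= 0.0 or eps >= 0.5:
            break                                 # perfect or useless feature
        a = 0.5 * np.log((1.0 - eps) / eps)       # AdaBoost step size
        model.append((a, H[j]))
        f += a * preds[j]
        d = np.exp(-y * f)
        d /= d.sum()                              # d_i proportional to exp(-y_i f(x_i))
    return model                                  # predict: sign(sum_j a_j h_j(x))
```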


What do we end up with?

The choice of the best feature at a given step is called weak learning

Many variants of Boosting (including AdaBoost) work in this way (with various ways of choosing the α_j)

Also similar to iterative regression (e.g. least angle regression, LAR)


Wrap-up

- Create a linear combination of a few features
- Choose the most discriminative feature, and update the weights on the examples
- Forget about weak and strong learning, margins, ensembles...



Choosing the Appropriate Algorithm

Unified view via Regularization

The functional viewpoint

All approaches boil down to regularized functional minimization:

min_{f ∈ F} ∑_i ℓ(f(x_i), y_i) + λ Ω(f)

Key ingredients:
- Convex loss function ℓ
- Convex regularizer Ω (ensures smoothness of the function)
- Convex search space F (e.g. linear combinations)
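The constructions above differ mainly in the loss ℓ. Writing m = y f(x) for the margin, a sketch of the usual choices (matching the formulations on the next slides):

```python
import numpy as np

# Convex losses as functions of the margin m = y * f(x):
hinge       = lambda m: np.maximum(0.0, 1.0 - m)        # SVM, L1 soft margin
sq_hinge    = lambda m: np.maximum(0.0, 1.0 - m) ** 2   # SVM, L2 soft margin
squared     = lambda m: (1.0 - m) ** 2                  # regularized least squares
exponential = lambda m: np.exp(-m)                      # boosting
```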


SVM

SVM: f is a linear combination f(x) = ∑_i α_i k(x_i, x)

min_{f ∈ F} ∑_i (1 − y_i f(x_i))₊ + λ ‖f‖²_k   (L1 soft margin, hinge loss)

min_{f ∈ F} ∑_i (1 − y_i f(x_i))₊² + λ ‖f‖²_k   (L2 soft margin, squared hinge loss)

with ‖f‖²_k = ∑_{i,j} α_i α_j k(x_i, x_j)

Manifold

min_f ∑_{i=1}^n (f(x_i) − y_i)² + λ fᵀΔf

with fᵀΔf = ∑_{i,j} s(x_i, x_j) (f(x_i) − f(x_j))²
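Since both terms are quadratic in the vector f = (f(x_1), ..., f(x_n)), the minimizer has a closed form. A sketch (the constant factor between fᵀΔf and the pairwise sum is absorbed into λ):

```python
import numpy as np

def manifold_fit(S, y, lam=1.0):
    """Solve min_f sum_i (f_i - y_i)^2 + lam * f^T Delta f, where
    Delta = D - S is the graph Laplacian (D diagonal with the row sums
    of S), so that f^T Delta f = (1/2) sum_{i,j} S_ij (f_i - f_j)^2.
    Setting the gradient to zero gives f = (I + lam * Delta)^{-1} y."""
    S = np.asarray(S, dtype=float)
    Delta = np.diag(S.sum(axis=1)) - S
    n = len(y)
    return np.linalg.solve(np.eye(n) + lam * Delta, np.asarray(y, dtype=float))
```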

Boosting

Boosting: f is a linear combination f(x) = ∑_j α_j h_j(x)

min_f ∑_{i=1}^n e^{−y_i f(x_i)} + λ ‖f‖₁

with ‖f‖₁ = ∑_j |α_j|

What if you have too many features?

Yet another trick: Random Projection!

Just project down

How? Randomly

How many dimensions? Roughly log n divided by the square of the desired accuracy
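A sketch of the trick, in the spirit of the Johnson-Lindenstrauss lemma (the constant 8 in the target dimension is illustrative):

```python
import numpy as np

def random_projection(X, eps=0.25):
    """Project n points in d dimensions down to k = O(log n / eps^2)
    dimensions with a random Gaussian matrix; pairwise distances are
    preserved up to a factor (1 +/- eps) with high probability."""
    n, d = X.shape
    k = int(np.ceil(8.0 * np.log(n) / eps ** 2))  # constant chosen for illustration
    R = np.random.randn(d, k) / np.sqrt(k)
    return X @ R

X = np.random.randn(1000, 5000)        # 1000 points with 5000 features
print(random_projection(X).shape)      # (1000, k) with k ~ 8 * log(n) / eps^2
```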


How to choose?

Criteria

The main criteria for choosing the appropriate algorithm:

- Knowledge you have about the problem
- Computational constraints


Decision List

[Figure: decision list]

More Knowledge

Similarity: build sophisticated kernels
- Incremental approach: use known kernels and combine them in various ways (algebra: +, ∗, lim, convolution, exp), e.g. for sequences (see the sketch below)
- Invariances (e.g. tangent distance)
- Structured objects (sets, probability distributions, graphs, trees, sequences...)

Features: use sophisticated features (i.e. classifiers)
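For instance, the closure properties above let you compose simple kernels; a sketch with two illustrative base kernels on vectors:

```python
import numpy as np

# Two base kernels on numpy vectors:
k_lin = lambda a, b: float(np.dot(a, b))                   # linear kernel
k_rbf = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))  # Gaussian kernel

# Sums, products and exponentials of kernels are again kernels:
k_sum  = lambda a, b: k_lin(a, b) + k_rbf(a, b)
k_prod = lambda a, b: k_lin(a, b) * k_rbf(a, b)
k_exp  = lambda a, b: np.exp(k_lin(a, b))
```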



Wrap-up and Conclusion

Wrap-up

Various tools:
- Need to understand what each brings
- Can combine those basic tools to build the desired system

Trends:
- Do not refrain from using complex representations
- But avoid overfitting
- And remain tractable
- Many tools and tricks exist for doing both


Conclusion

- Forget about fancy ideas (SVM margin, implicit feature mapping, manifolds, boosting the margin...)
- The only relevant notion is regularization: it works if you have the right features/similarity and the appropriate regularization mechanism!
- Future directions: multiclass made easy, more kernel-building tools, more modularity (making it easy to combine several algorithms)
