
Transcript of: Introduction to Machine Learning, Manik Varma, Microsoft Research India.

Page 1:

Introduction to Machine Learning

Manik Varma, Microsoft Research India

http://research.microsoft.com/~manik
manik@microsoft.com

Page 2:

Binary Classification

• Is this person Madhubala or not?
• Is this person male or female?
• Is this person beautiful or not?

Page 3:

Multi-Class Classification

• Is this person Madhubala, Lalu or Rakhi Sawant?
• Is this person happy, sad, angry or bemused?

Page 4:

Ordinal Regression

• Is this person very beautiful, beautiful, ordinary or ugly?

Page 5:

Regression

• How beautiful is this person on a continuous scale of 1 to 10? 9.99?

Page 6:

Ranking

• Rank these people in decreasing order of attractiveness.

Page 7:

Multi-Label Classification

• Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}

Page 8:

Are These Problems Distinct?

• Can regression solve all these problems?
  • Binary classification – predict p(y=1|x)
  • Multi-class classification – predict p(y=k|x)
  • Ordinal regression – predict p(y=k|x)
  • Ranking – predict and sort by relevance
  • Multi-label classification – predict p(y ∈ {±1}ᴷ|x)
• Learning from experience and data
  • In what form can the training data be obtained?
  • What is known a priori?
  • Complexity of training
  • Complexity of prediction

Page 9:

In This Course

• Supervised learning
  • Classification
    • Generative methods: Nearest neighbour, Naïve Bayes
    • Discriminative methods: Logistic Regression
    • Discriminant methods: Support Vector Machines
  • Regression, Ranking, Feature Selection, etc.
• Unsupervised learning
• Semi-supervised learning
• Reinforcement learning

Page 10:

Learning from Noisy Data

• Noise and uncertainty
• Unknown generative model Y = f(X)
• Noise in measuring input and feature extraction
• Noise in labels
• Nuisance variables
• Missing data
• Finite training set size

Page 11:

Under and Over Fitting

Page 12:

Probability Theory

• Non-negativity and unit measure: 0 ≤ p(y) ≤ 1, p(Ω) = 1, p(∅) = 0
• Conditional probability p(y|x): p(x, y) = p(y|x) p(x) = p(x|y) p(y)
• Bayes' Theorem: p(y|x) = p(x|y) p(y) / p(x)
• Marginalization: p(x) = ∫ᵧ p(x, y) dy
• Independence: p(x₁, x₂) = p(x₁) p(x₂), equivalently p(x₁|x₂) = p(x₁)

Chris Bishop, “Pattern Recognition & Machine Learning”

Page 13:

The Univariate Gaussian Density

• p(x|μ,σ²) = exp(−(x − μ)² / 2σ²) / (2πσ²)^½

[Figure: plot of the univariate Gaussian density p(x) against x]

Page 14:

The Multivariate Gaussian Density

• p(x|μ,Σ) = exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ)) / ((2π)^(D/2) |Σ|^½)
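As an illustration (not from the slides), a minimal NumPy sketch that evaluates this density for a given mean vector and covariance matrix:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    # Mahalanobis term (x - mu)' Sigma^{-1} (x - mu) without forming the inverse explicitly.
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(x, mu, Sigma))
```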

[Figure: surface plot and contour plot of a two-dimensional Gaussian density p(x,y)]

Page 15:

The Beta Density

• p(θ|a,b) = θ^(a−1)(1 − θ)^(b−1) Γ(a+b) / (Γ(a)Γ(b))

[Figure: Beta densities for (a,b) = (0.1,0.1), (1.0,1.0), (2.0,2.0) and (8.0,4.0)]

Page 16:

Probability Distribution Functions

• Bernoulli: single trial with probability of success = θ
  • n ∈ {0, 1}, θ ∈ [0, 1]
  • p(n|θ) = θⁿ(1 − θ)^(1−n)

• Binomial: N iid Bernoulli trials with n successes
  • n ∈ {0, 1, …, N}, θ ∈ [0, 1]
  • p(n|N,θ) = ᴺCₙ θⁿ(1 − θ)^(N−n)

• Multinomial: N iid trials, outcome k occurs nₖ times (see the sketch below)
  • nₖ ∈ {0, 1, …, N}, Σₖ nₖ = N, θₖ ∈ [0, 1], Σₖ θₖ = 1
  • p(n|N,θ) = N! Πₖ θₖ^(nₖ) / nₖ!
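For concreteness, a small Python sketch (mine, not from the slides) of these three probability mass functions:

```python
from math import comb, factorial, prod

def bernoulli_pmf(n, theta):
    # n in {0, 1}
    return theta**n * (1 - theta)**(1 - n)

def binomial_pmf(n, N, theta):
    # n successes in N iid Bernoulli(theta) trials
    return comb(N, n) * theta**n * (1 - theta)**(N - n)

def multinomial_pmf(counts, thetas):
    # counts[k] occurrences of outcome k in N = sum(counts) iid trials
    N = sum(counts)
    return factorial(N) * prod(t**c / factorial(c) for c, t in zip(counts, thetas))

print(binomial_pmf(7, 10, 0.5))                     # P(7 heads in 10 fair flips)
print(multinomial_pmf([3, 2, 5], [0.2, 0.3, 0.5]))
```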

Page 17:

A Toy Example• We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.

• We are asked to predict whether the next coin flip will result in a head or a tail.

• Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail

• We should predict heads if p(y=1|n,N) > p(y=0|n,N)

Page 18:

The Maximum Likelihood Approach

• Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 − θ, so that we should predict heads if θ > ½

• How should we estimate θ?

• Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data

• θ_ML = argmax_θ p(n|θ) = argmax_θ ᴺCₙ θⁿ(1 − θ)^(N−n)
       = argmax_θ [n log θ + (N − n) log(1 − θ)] = n / N

• We should predict heads if n > ½ N

Page 19:

The Maximum A Posteriori Approach

• We should choose the value of θ maximizing the posterior probability of θ conditioned on the data

• We assume a
  • Binomial likelihood: p(n|θ) = ᴺCₙ θⁿ(1 − θ)^(N−n)
  • Beta prior: p(θ|a,b) = θ^(a−1)(1 − θ)^(b−1) Γ(a+b) / (Γ(a)Γ(b))

• θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b) = argmax_θ θⁿ(1 − θ)^(N−n) θ^(a−1)(1 − θ)^(b−1)
        = (n + a − 1) / (N + a + b − 2), as if we saw an extra a − 1 heads and b − 1 tails

• We should predict heads if n > ½ (N + b − a)

Page 20:

The Bayesian Approach

• We should marginalize over θ

• p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ = ∫ θ p(θ|a,b,n) dθ = ∫ θ Beta(θ|a + n, b + N − n) dθ = (n + a) / (N + a + b), as if we saw an extra a heads and b tails

• We should predict heads if n > ½ (N + b – a)

• The Bayesian and MAP prediction coincide in this case

• In the very large data limit, both the Bayesian and MAP prediction coincide with the ML prediction (n > ½ N)
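A tiny sketch (not from the slides) comparing the three point estimates for the coin example; the Beta(2, 2) prior used here is purely illustrative:

```python
def coin_predictions(n, N, a=2.0, b=2.0):
    """Probability of heads on the next flip under the ML, MAP and Bayesian
    estimates above, given n heads in N flips and a Beta(a, b) prior."""
    theta_ml = n / N
    theta_map = (n + a - 1) / (N + a + b - 2)
    theta_bayes = (n + a) / (N + a + b)      # posterior predictive
    return theta_ml, theta_map, theta_bayes

# 3 heads in 4 flips with a symmetric Beta(2, 2) prior:
ml, mp, bayes = coin_predictions(3, 4)
print(ml, mp, bayes)   # 0.75, 0.667, 0.625 -- all three predict heads
```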

Page 21:

Classification

Page 22:

Binary Classification

Page 23:

Approaches to Classification

• Memorization
  • Cannot deal with previously unseen data
  • Large scale annotated data acquisition cost might be very high
• Rule based expert system
  • Dependent on the competence of the expert
  • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
  • Rules might not transfer to similar problems
• Learning from training data and prior knowledge
  • Focuses on generalization to novel data

Page 24:

Notation

• Training Data
  • Set of N labeled examples of the form (xᵢ, yᵢ)
  • Feature vector: x ∈ ℝᴰ, X = [x₁ x₂ … x_N]
  • Label: y ∈ {±1}, y = [y₁, y₂ … y_N]ᵗ, Y = diag(y)
• Example – Gender Identification

[Figure: four example face images with labels y₁ = +1, y₂ = +1, y₃ = +1, y₄ = −1]

Page 25:

Binary Classification

Page 26:

Binary Classification

[Figure: a separating hyperplane wᵗx + b = 0 with normal w and bias b; the parameters are θ = [w; b]]

Page 27:

Bayes' Decision Rule

• Bayes' decision rule
  • p(y=+1|x) > p(y=−1|x) ? y = +1 : y = −1
  • p(y=+1|x) > ½ ? y = +1 : y = −1

[Figure: the class posteriors p(y=+1|x) and p(y=−1|x) plotted against x]

Page 28:

Issues to Think About

• Bayesian versus MAP versus ML
  • Should we choose just one function to explain the data?
  • If yes, should this be the function that explains the data the best?
  • What about prior knowledge?
• Generative versus Discriminative
  • Can we learn from "positive" data alone?
  • Should we model the data distribution?
  • Are there any missing variables?
  • Do we just care about the final decision?

Page 29:

Bayesian Approach

p(y|x,X,Y) = ∫_f p(y,f|x,X,Y) df
           = ∫_f p(y|f,x,X,Y) p(f|x,X,Y) df
           = ∫_f p(y|f,x) p(f|X,Y) df

• This integral is often intractable. To solve it we can
  • Choose the distributions so that the solution is analytic (conjugate priors)
  • Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods)
  • Sample from p(f|X,Y) (MCMC)

Page 30:

Maximum A Posteriori (MAP)

p(y|x,X,Y) = ∫_f p(y|f,x) p(f|X,Y) df
           = p(y|f_MAP,x) when p(f|X,Y) = δ(f − f_MAP)

• The more training data there is, the better p(f|X,Y) approximates a delta function
• We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP

[Figure: p(f|X,Y) becomes increasingly peaked as we move from no data to moderate data to lots of data]

Page 31:

MAP & Maximum Likelihood (ML)

f_MAP = argmax_f p(f|X,Y)
      = argmax_f p(X,Y|f) p(f) / p(X,Y)
      = argmax_f p(X,Y|f) p(f)

f_ML = argmax_f p(X,Y|f)   (Maximum Likelihood)

• Maximum Likelihood holds if
  • There is a lot of training data so that p(X,Y|f) >> p(f)
  • Or if there is no prior knowledge so that p(f) is uniform (improper)

Page 32:

IID Data

f_ML = argmax_f p(X,Y|f)
     = argmax_f Πᵢ p(xᵢ,yᵢ|f)

• The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels. In particular, p(X,Y) ≠ Πᵢ p(xᵢ,yᵢ) in general

Page 33:

Generative Methods – Naïve Bayes

Page 34:

Generative Methods

θ_MAP = argmax_θ p(θ) Πᵢ p(xᵢ,yᵢ|θ)
      = argmax_θ p(θ_x) p(θ_y) Πᵢ p(xᵢ,yᵢ|θ)
      = argmax_θ p(θ_x) p(θ_y) Πᵢ p(xᵢ|yᵢ,θ_x) p(yᵢ|θ_y)
      = [argmax_θx p(θ_x) Πᵢ p(xᵢ|yᵢ,θ_x)] × [argmax_θy p(θ_y) Πᵢ p(yᵢ|θ_y)]

• θ_x and θ_y can be solved for independently
• The parameters of each class decouple and can be solved for independently

Page 35:

Generative Methods

• The parameters of each class decouple and can be solved for independently

Page 36:

Generative Methods – Naïve Bayes

θ_MAP = [argmax_θx p(θ_x) Πᵢ p(xᵢ|yᵢ,θ_x)] × [argmax_θy p(θ_y) Πᵢ p(yᵢ|θ_y)]

• Naïve Bayes assumptions
  • Independent Gaussian features: p(xᵢ|yᵢ,θ_x) = Πⱼ p(xᵢⱼ|yᵢ,θ_x), with p(xᵢⱼ|yᵢ=±1,θ_x) = N(xᵢⱼ| μⱼ^±, σⱼ²)
  • Improper uniform priors (no prior knowledge): p(θ_x) = p(θ_y) = const
  • Bernoulli labels: p(yᵢ=+1|θ_y) = θ, p(yᵢ=−1|θ_y) = 1 − θ

Page 37:

Generative Methods – Naïve Bayes

θ_ML = [argmax_θx Πᵢ Πⱼ N(xᵢⱼ| μⱼ^yᵢ, σⱼ²)] × [argmax_θ Πᵢ θ^((1+yᵢ)/2) (1 − θ)^((1−yᵢ)/2)]

• Estimating θ_ML (a code sketch follows below)

  θ_ML = argmax_θ Πᵢ θ^((1+yᵢ)/2)(1 − θ)^((1−yᵢ)/2) = argmax_θ ½(N + Σᵢ yᵢ) log θ + ½(N − Σᵢ yᵢ) log(1 − θ)
       = N₊ / N (by differentiating and setting to zero)

• Estimating μ_ML, σ_ML

  μ⁺_ML = (1 / N₊) Σ_{yᵢ=+1} xᵢ

  σ²ⱼ_ML = [Σ_{yᵢ=+1} (xᵢⱼ − μ⁺ⱼ_ML)² + Σ_{yᵢ=−1} (xᵢⱼ − μ⁻ⱼ_ML)²] / N
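The sketch below is my own illustration (not code from the course) of these ML estimates and the resulting classifier for labels in {−1, +1}, with a per-feature variance pooled across the two classes as on the slide:

```python
import numpy as np

def train_naive_bayes(X, y):
    """ML estimates for the Gaussian Naive Bayes model above.
    X: N x D feature matrix, y: labels in {-1, +1}."""
    theta = np.mean(y == 1)                     # p(y = +1) = N+ / N
    mu_pos = X[y == 1].mean(axis=0)             # per-feature class means
    mu_neg = X[y == -1].mean(axis=0)
    # Pooled per-feature variance, as in the slide's formula.
    resid = np.where((y == 1)[:, None], X - mu_pos, X - mu_neg)
    sigma2 = (resid ** 2).mean(axis=0)
    return theta, mu_pos, mu_neg, sigma2

def predict_naive_bayes(X, theta, mu_pos, mu_neg, sigma2):
    # Class scores up to a shared constant (the Gaussian normaliser cancels
    # because the variance is shared between the two classes).
    score_pos = -0.5 * ((X - mu_pos) ** 2 / sigma2).sum(axis=1) + np.log(theta)
    score_neg = -0.5 * ((X - mu_neg) ** 2 / sigma2).sum(axis=1) + np.log(1 - theta)
    return np.where(score_pos > score_neg, 1, -1)
```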

Page 38:

Naïve Bayes – Prediction

[Figure: training data, the Naïve Bayes decision boundaries (train error = 16.67%) and the posterior p(y|x)]

Page 39:

Naïve Bayes – Prediction

[Figure: training data, the Naïve Bayes decision boundaries (train error = 0.00%) and the posterior p(y|x)]

Page 40:

Naïve Bayes – Prediction

p(y=+1|x) = p(x|y=+1) p(y=+1) / p(x)
          = 1 / (1 + exp( log(p(y=−1)/p(y=+1)) + log(p(x|y=−1)/p(x|y=+1)) ))
          = 1 / (1 + exp( log(1/θ − 1) − ½ μ₋ᵗΣ⁻¹μ₋ + ½ μ₊ᵗΣ⁻¹μ₊ + (μ₋ − μ₊)ᵗΣ⁻¹x ))
          = 1 / (1 + exp(−b − wᵗx))   (Logistic Regression)

⇒ p(y=−1|x) = exp(−b − wᵗx) / (1 + exp(−b − wᵗx))
⇒ log(p(y=−1|x) / p(y=+1|x)) = −b − wᵗx
⇒ y = sign(b + wᵗx)
⇒ The decision boundary will be linear!

Page 41:

Discriminative Methods Logistic Regression

Page 42:

Discriminative Methods

θ_MAP = argmax_θ p(θ) Πᵢ p(xᵢ,yᵢ|θ)

• We assume that, writing θ = (w, θ̃),
  • p(θ) = p(w) p(θ̃)
  • p(xᵢ,yᵢ|θ) = p(yᵢ|xᵢ,θ) p(xᵢ|θ) = p(yᵢ|xᵢ,w) p(xᵢ|θ̃)
⇒ θ_MAP = [argmax_w p(w) Πᵢ p(yᵢ|xᵢ,w)] × [argmax_θ̃ p(θ̃) Πᵢ p(xᵢ|θ̃)]

• It turns out that θ̃ plays no role in determining the posterior distribution
⇒ p(y|x,X,Y) = p(y|x,θ_MAP) = p(y|x,w_MAP) where w_MAP = argmax_w p(w) Πᵢ p(yᵢ|xᵢ,w)

Page 43:

Disc. Methods – Logistic Regression

w_MAP = argmax_{w,b} p(w) Πᵢ p(yᵢ|xᵢ,w)

• Regularized Logistic Regression
  • Gaussian prior: p(w) ∝ exp(−½ wᵗw)
  • Logistic likelihood: p(yᵢ|xᵢ,w) = 1 / (1 + exp(−yᵢ(b + wᵗxᵢ)))

[Figure: the logistic likelihoods p(y=+1|x) and p(y=−1|x) as functions of x]

Page 44:

Regularized Logistic Regression

w_MAP = argmax_{w,b} p(w) Πᵢ p(yᵢ|xᵢ,w)
      = argmin_{w,b} ½wᵗw + Σᵢ log(1 + exp(−yᵢ(b + wᵗxᵢ)))

• Bad news: No closed form solution for w and b
• Good news: We have to minimize a convex function
  • We can obtain the global optimum
  • The function is smooth

Tom Minka, “A comparison of numerical optimizers for LR” (Matlab code)
Keerthi et al., “A Fast Dual Algorithm for Kernel Logistic Regression”, ML 05
Andrew and Gao, “OWL-QN”, ICML 07
Krishnapuram et al., “SMLR”, PAMI 05

Page 45:

Regularized Logistic Regression

Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01

[Figure: regularized logistic regression decision boundary on a 2D dataset]

Page 46:

Regularized Logistic Regression

Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01

[Figure: regularized logistic regression decision boundary on a 2D dataset]

Page 47:

Naïve Bayes versus Logistic Regression

[Figure: Naïve Bayes and logistic regression decision boundaries; NB acc = 100.00%, LR acc = 100.00%]

Page 48:

Naïve Bayes versus Logistic Regression

[Figure: Naïve Bayes and logistic regression decision boundaries; NB acc = 92.65%, LR acc = 100.00%]

Page 49:

Naïve Bayes versus Logistic Regression

[Figure: Naïve Bayes and logistic regression decision boundaries; NB acc = 84.85%, LR acc = 87.88%]

Page 50:

Convex Functions

• Convex f: f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) for λ ∈ [0, 1]
• The Hessian ∇²f is always positive semi-definite
• The tangent is always a lower bound to f
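As a worked check for the function plotted below, f(x) = x log x:

```latex
f(x) = x \log x, \qquad f'(x) = \log x + 1, \qquad f''(x) = \frac{1}{x} > 0 \quad \text{for } x > 0
```

so the Hessian (here just f'') is positive and f is convex on (0, ∞); every tangent therefore lies below the curve.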

[Figure: plot of the convex function x log(x)]

Page 51:

Gradient Descent

• Iteration: xₙ₊₁ = xₙ − ηₙ∇f(xₙ)
• Step size selection: Armijo rule
• Stopping criterion: change in f is "minuscule"

[Figure: gradient descent iterates x₁, x₂, x₃, x₄ on the function x log(x)]

Page 52:

Gradient Descent – Logistic Regression

L(w, b) = ½wᵗw + Σᵢ log(1 + exp(−yᵢ(b + wᵗxᵢ)))

⇒ ∇_w L(w, b) = w − Σᵢ p(−yᵢ|xᵢ,w) yᵢ xᵢ

⇒ ∇_b L(w, b) = − Σᵢ p(−yᵢ|xᵢ,w) yᵢ

Beware of numerical issues while coding!
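A minimal sketch of this gradient descent loop (my own, using a fixed step size rather than the Armijo rule mentioned earlier); the sigmoid is computed in a numerically stable way to avoid the overflow warned about above:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])
    out[~pos] = expz / (1.0 + expz)
    return out

def train_logistic_regression(X, y, eta=0.1, iters=1000):
    """Gradient descent on L(w, b) = 0.5 w'w + sum_i log(1 + exp(-y_i (b + w'x_i)))."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        p_wrong = sigmoid(-margins)          # p(-y_i | x_i, w, b)
        grad_w = w - X.T @ (p_wrong * y)     # gradients from the slide above
        grad_b = -np.sum(p_wrong * y)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```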

Page 53:

Newton Methods

• Iteration: xₙ₊₁ = xₙ − ηₙH⁻¹∇f(xₙ)
• Approximate f by a 2nd order Taylor expansion
• The error can now decrease quadratically

[Figure: Newton iterates x₁, x₂, x₃ on the function x log(x)]

Page 54:

Quasi-Newton Methods

• Computing and inverting the Hessian is expensive
• Quasi-Newton methods can approximate H⁻¹ directly (LBFGS)
• Iteration: xₙ₊₁ = xₙ − ηₙBₙ⁻¹∇f(xₙ)
• Secant equation: ∇f(xₙ₊₁) − ∇f(xₙ) = Bₙ₊₁(xₙ₊₁ − xₙ)
• The secant equation does not fully determine B
• LBFGS updates Bₙ₊₁⁻¹ using two rank one matrices
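One way to hand the same regularized logistic regression objective to an off-the-shelf LBFGS implementation, assuming SciPy is available (an illustrative sketch, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

def lr_objective(params, X, y):
    """Value and gradient of the regularized logistic regression loss."""
    w, b = params[:-1], params[-1]
    margins = y * (X @ w + b)
    # log(1 + exp(-m)) computed stably via logaddexp(0, -m).
    loss = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))
    p_wrong = 1.0 / (1.0 + np.exp(np.clip(margins, -500, 500)))   # p(-y_i | x_i)
    grad_w = w - X.T @ (p_wrong * y)
    grad_b = -np.sum(p_wrong * y)
    return loss, np.append(grad_w, grad_b)

def train_lbfgs(X, y):
    x0 = np.zeros(X.shape[1] + 1)
    res = minimize(lr_objective, x0, args=(X, y), jac=True, method="L-BFGS-B")
    return res.x[:-1], res.x[-1]
```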

Page 55:

Generative versus Discriminative

• A discriminative model might be correct even when the corresponding generative model is not
• A discriminative model has fewer parameters than the corresponding generative model
• A generative model's parameters are uncoupled and can often be estimated in closed form
• A discriminative model's parameters are correlated and training algorithms can be relatively expensive
• A discriminative model often has lower test error given a "reasonable" amount of training data
• A generative model can deal with missing data

Page 56:

Generative versus Discriminative

• Let ε(h_{A,N}) denote the error of hypothesis h trained using algorithm A on N data points
• When the generative model is correct: ε(h_Dis,∞) = ε(h_Gen,∞)
• When the generative model is incorrect: ε(h_Dis,∞) ≤ ε(h_Gen,∞)
• For a linear classifier trained in D dimensions: ε(h_Dis,N) ≤ ε(h_Dis,∞) + O([−z log z]^½) where z = D/N ≤ 1
• It suffices to pick N = Ω(D) points for discriminative learning of linear classifiers
• For some generative models N = Ω(log D) suffices

Page 57:

Generative versus Discriminative

• A generative classifier might converge much faster to its higher asymptotic error

Ng & Jordan, “On Discriminative vs. Generative Classifiers”, NIPS 02
Tom Mitchell, “Generative and Discriminative Classifiers”

Page 58:

Multi-class Logistic Regression

• Multinomial Logistic Regression
• 1-vs-All (a sketch follows below)
  • Learn L binary classifiers for an L class problem
  • For the lth classifier, examples from class l are +ve while examples from all other classes are −ve
  • Classify new points according to max probability
• 1-vs-1
  • Learn L(L−1)/2 binary classifiers for an L class problem by considering every class pair
  • Classify novel points by majority vote
  • Classify novel points by building a DAG
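A hypothetical 1-vs-All wrapper (not from the slides); train_binary is a placeholder for any binary learner that accepts labels in {−1, +1} and returns a scoring function for p(y = +1 | x):

```python
import numpy as np

def train_one_vs_all(X, y, train_binary):
    """Train one binary classifier per class, with that class as +ve and the rest as -ve."""
    classifiers = {}
    for label in np.unique(y):
        binary_y = np.where(y == label, 1, -1)
        classifiers[label] = train_binary(X, binary_y)   # returns a scorer for p(y=+1|x)
    return classifiers

def predict_one_vs_all(X, classifiers):
    """Assign each point to the class whose classifier gives it the highest score."""
    labels = list(classifiers)
    scores = np.column_stack([classifiers[l](X) for l in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]
```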

Page 59:

Multi-class Logistic Regression

• Assume
  • Non-linear multi-class classifier
  • Number of classes = L
  • Number of training points per class = N
  • Algorithm training time for M points = O(M³)
  • Classification time given M training points = O(M)

Page 60:

Multi-class Logistic Regression

• Multinomial Logistic Regression
  • Training time = O(L⁶N³)
  • Classification time for a new point = O(L²N)
• 1-vs-All
  • Training time = O(L⁴N³)
  • Classification time for a new point = O(L²N)
• 1-vs-1
  • Training time = O(L²N³)
  • Majority vote classification time = O(L²N)
  • DAG classification time = O(LN)

Page 61:

Multinomial Logistic Regression

w_MAP = argmax_{w,b} p(w) Πᵢ p(yᵢ|xᵢ,w)

• Regularized Multinomial Logistic Regression
  • Gaussian prior: p(w) ∝ exp(−½ Σ_l w_lᵗw_l)
  • Multinomial logistic posterior: p(yᵢ = l|xᵢ,w) = e^(f_l(xᵢ)) / Σₖ e^(f_k(xᵢ)), where f_k(xᵢ) = w_kᵗxᵢ + b_k

Note that we have to learn an extra classifier by not explicitly enforcing Σ_l p(yᵢ = l|xᵢ,w) = 1

Page 62:

Multinomial Logistic Regression

L(w, b) = ½ Σₖ w_kᵗw_k + Σᵢ [log(Σₖ exp(f_k(xᵢ))) − Σₖ δ_{k,yᵢ} f_k(xᵢ)]

⇒ ∇_{w_k} L(w, b) = w_k + Σᵢ [p(yᵢ = k|xᵢ,w) − δ_{k,yᵢ}] xᵢ

⇒ ∇_{b_k} L(w, b) = Σᵢ [p(yᵢ = k|xᵢ,w) − δ_{k,yᵢ}]
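A short NumPy sketch (mine, not from the slides) of these gradients; W is the L × D matrix stacking the w_k, b holds the L biases and y holds integer class labels 0 … L−1:

```python
import numpy as np

def softmax_gradients(W, b, X, y):
    """Gradients of the regularized multinomial objective above."""
    scores = X @ W.T + b                           # f_k(x_i) for every point and class
    scores -= scores.max(axis=1, keepdims=True)    # stabilise the softmax
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)              # p(y_i = k | x_i, W, b)
    Y = np.zeros_like(P)
    Y[np.arange(len(y)), y] = 1.0                  # delta_{k, y_i}
    grad_W = W + (P - Y).T @ X
    grad_b = (P - Y).sum(axis=0)
    return grad_W, grad_b
```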

Page 63:

Multi-class Logistic Regression

[Figure: interactive demo ("press class number or Esc to quit") showing training data and the decision regions learnt by MLR, 1-vs-1 and 1-vs-All]

Page 64:

Multi-class Logistic Regression

[Figure: posterior heat maps for MLR, 1-vs-All, majority-vote 1-vs-1 and DAG 1-vs-1]

Page 65:

Multi-class Logistic Regression

[Figure: training data and the decision regions learnt by MLR, 1-vs-1 and 1-vs-All]

Page 66:

Multi-class Logistic Regression

[Figure: posterior heat maps for MLR, 1-vs-All, majority-vote 1-vs-1 and DAG 1-vs-1]

Page 67:

Multi-class Logistic Regression

[Figure: training data and the decision regions learnt by MLR, 1-vs-1 and 1-vs-All]

Page 68:

Multi-class Logistic Regression

[Figure: posterior heat maps for MLR, 1-vs-All, majority-vote 1-vs-1 and DAG 1-vs-1]

Page 69:

From Probabilities to Loss Functions

w_MAP = argmin_{w,b} ½wᵗw + Σᵢ log(1 + exp(−yᵢ(b + wᵗxᵢ)))

[Figure: the 0/1, hinge, squared hinge, log and robust losses plotted against yf(x)]

Page 70:

Support Vector Machines

Page 71:

Binary Classification

Page 72:

A Separating Hyperplane

Page 73:

Maximum Margin Hyperplane

Geometric Intuition: Choose the perpendicular bisector of the shortest line segment joining the convex hulls of the two classes

Page 74:

SVM Notation

[Figure: the hyperplane wᵗx + b = 0 with supporting hyperplanes wᵗx + b = +1 and wᵗx + b = −1, the support vectors lying on them, and margin 2/|w|]

Page 75:

Calculating the Margin

• Let x₊ be any point on the +ve supporting plane and x₋ the closest point on the −ve supporting plane

Margin = |x₊ − x₋|
       = λ|w|   (since x₊ = x₋ + λw)
       = 2|w|/|w|²   (assuming λ = 2/|w|²)
       = 2/|w|

wᵗx₊ + b = +1 and wᵗx₋ + b = −1 ⇒ wᵗ(x₊ − x₋) = 2 ⇒ λwᵗw = 2 ⇒ λ = 2/|w|²

Page 76:

Hard Margin SVM Primal

• Maximize 2/|w| such that
  wᵗxᵢ + b ≥ +1 if yᵢ = +1
  wᵗxᵢ + b ≤ −1 if yᵢ = −1

• Difficult to optimize directly

• Convex Quadratic Program (QP) reformulation
  • Minimize ½wᵗw such that yᵢ(wᵗxᵢ + b) ≥ 1

• Convex QPs can be easy to optimize

Page 77:

Linearly Inseparable Data

• Minimize ½wᵗw + C #(misclassified points) such that yᵢ(wᵗxᵢ + b) ≥ 1 (for "good" points)

• The optimization problem is NP Hard in general
• Disastrous errors are penalized the same as near misses

Page 78:

Inseparable Data – Hinge Loss

[Figure: the soft margin with supporting hyperplanes wᵗx + b = ±1; points beyond their supporting hyperplane have ξ = 0, margin violators have ξ < 1, misclassified points have ξ > 1; margin = 2/|w|]

Page 79:

The C-SVM Primal Formulation

• Minimize ½wᵗw + C Σᵢ ξᵢ such that yᵢ(wᵗxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

• The optimization is a convex QP
• The globally optimal solution will be obtained
• Number of variables = D + N + 1
• Number of constraints = 2N
• Solvers can train on 800K points in 47K (sparse) dimensions in less than 2 minutes on a standard PC

Fan et al., “LIBLINEAR”, JMLR 08
Bordes et al., “LaRank”, ICML 07
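For illustration only, a crude full-batch subgradient descent on this primal (my own sketch; it is not one of the specialised solvers cited above and would be far slower than LIBLINEAR):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, eta=0.001, iters=1000):
    """Subgradient descent on 0.5 w'w + C sum_i max(0, 1 - y_i (w'x_i + b))."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points with non-zero hinge loss
        grad_w = w - C * X[viol].T @ y[viol]    # a subgradient of the objective
        grad_b = -C * np.sum(y[viol])
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```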

Page 80:

The C-SVM Dual Formulation

• Maximize 1ᵗα − ½αᵗYKYα such that 1ᵗYα = 0, 0 ≤ α ≤ C

• K is a kernel matrix such that Kᵢⱼ = K(xᵢ, xⱼ) = xᵢᵗxⱼ
• α are the dual variables (Lagrange multipliers)
• Knowing α gives us w and b
• The dual is also a convex QP
  • Number of variables = N
  • Number of constraints = 2N + 1

Fan et al., “LIBSVM”, JMLR 05
Joachims, “SVMLight”

Page 81:

SVMs versus Regularized LR

Most of the SVM αs are zero!

[Figure: SVM and regularized logistic regression decision boundaries compared on a 2D dataset]

Page 82:

SVMs versus Regularized LR

Most of the SVM αs are zero!

[Figure: SVM and regularized logistic regression decision boundaries compared on a 2D dataset]

Page 83:

SVMs versus Regularized LR

Most of the SVM αs are not zero

[Figure: SVM and regularized logistic regression decision boundaries compared on a 2D dataset]

Page 84:

Duality

• Primal: P* = minₓ f₀(x)
  s.t. fᵢ(x) ≤ 0, 1 ≤ i ≤ N
       hᵢ(x) = 0, 1 ≤ i ≤ M

• Lagrangian: L(x,λ,ν) = f₀(x) + Σᵢ λᵢfᵢ(x) + Σᵢ νᵢhᵢ(x)

• Dual: D* = max_{λ,ν} minₓ L(x,λ,ν) s.t. λ ≥ 0

Page 85:

Duality

• The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize

• Weak duality: P* ≥ D*
  • Always holds

• Strong duality: P* = D*
  • Does not always hold
  • Usually holds for convex problems
  • Holds for the SVM QP

Page 86:

Karush-Kuhn-Tucker (KKT) Conditions

• If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold
  • Primal feasibility: fᵢ(x*) ≤ 0 and hᵢ(x*) = 0 for all i
  • Dual feasibility: λ* ≥ 0
  • Stationarity: ∇ₓL(x*,λ*,ν*) = 0
  • Complementary slackness: λᵢ*fᵢ(x*) = 0

• If x⁺, λ⁺ and ν⁺ satisfy the KKT conditions for a convex problem then they are optimal

Page 87:

SVM – Duality

• Primal: P* = min_{w,ξ,b} ½wᵗw + C1ᵗξ
  s.t. Y(Xᵗw + b1) ≥ 1 − ξ, ξ ≥ 0

• Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + C1ᵗξ − βᵗξ − αᵗ[Y(Xᵗw + b1) − 1 + ξ]

• Dual: D* = max_α 1ᵗα − ½αᵗYKYα
  s.t. 1ᵗYα = 0, 0 ≤ α ≤ C

Page 88:

SVM – KKT Conditions

• Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + C1ᵗξ − βᵗξ − αᵗ[Y(Xᵗw + b1) − 1 + ξ]

• Stationarity conditions
  • ∇_w L = 0 ⇒ w* = XYα* (Representer Theorem)
  • ∇_ξ L = 0 ⇒ C1 = α* + β*
  • ∇_b L = 0 ⇒ α*ᵗY1 = 0

• Complementary slackness conditions
  • αᵢ*[yᵢ(xᵢᵗw* + b*) − 1 + ξᵢ*] = 0
  • βᵢ*ξᵢ* = 0

Page 89:

Hinge Loss and Sparseness in α

• From the stationarity and complementary slackness conditions it is easy to show that
  • αᵢ = 0 ⇒ xᵢ has been classified correctly and lies beyond its supporting hyperplane
  • 0 < αᵢ < C ⇒ xᵢ is a support vector and lies on its supporting hyperplane
  • αᵢ = C ⇒ xᵢ has been misclassified or is a margin violator

Page 90:

Hinge Loss and Sparseness in α

• The SVM αs are sparse but the LR αs are not

[Figure: the log loss and the hinge loss plotted against yf(x)]

Page 91:

Linearly Inseparable Data

• This 1D dataset cannot be separated using a single hyperplane (threshold)
• We need a non-linear decision boundary

[Figure: a 1D dataset, plotted along x, that is not linearly separable]

Page 92:

Increasing Dimensionality Non-linearly

• The dataset is now linearly separable in φ space

[Figure: lifting the data with φ(x) = (x, x²) makes the 1D dataset linearly separable]

Page 93:

The Kernel Trick

• Let the "lifted" training set be { (φ(xᵢ), yᵢ) }
• Define the kernel such that Kᵢⱼ = K(xᵢ, xⱼ) = φ(xᵢ)ᵗφ(xⱼ)

• Primal: P* = min_{w,ξ,b} ½wᵗw + C1ᵗξ
  s.t. Y(φ(X)ᵗw + b1) ≥ 1 − ξ, ξ ≥ 0

• Dual: D* = max_α 1ᵗα − ½αᵗYKYα
  s.t. 1ᵗYα = 0, 0 ≤ α ≤ C

• Classifier: f(x) = sign(φ(x)ᵗw + b) = sign(αᵗYK(:,x) + b)
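A small sketch (not from the slides) of the resulting kernelized classifier, given the dual variables α, the bias b and any kernel function:

```python
import numpy as np

def svm_decision_function(x, X_train, y_train, alpha, b, kernel):
    """Evaluate f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    k = np.array([kernel(xi, x) for xi in X_train])   # the column K(:, x)
    return np.sign(np.sum(alpha * y_train * k) + b)
```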

Page 94:

The Kernel Trick

• Let φ(x) = [1, √2x₁, …, √2x_D, x₁², …, x_D², √2x₁x₂, …, √2x₁x_D, …, √2x_{D−1}x_D]ᵗ

• Define K(xᵢ, xⱼ) = φ(xᵢ)ᵗφ(xⱼ) = (xᵢᵗxⱼ + 1)²

• Primal
  • Number of variables = D + N + 1
  • Number of constraints = 2N
  • Number of flops for calculating φ(x)ᵗw = O(D²)
  • Number of flops for a degree 20 polynomial = O(D²⁰)
• Dual
  • Number of variables = N
  • Number of constraints = 2N + 1
  • Number of flops for calculating Kᵢⱼ = O(D)
  • Number of flops for a degree 20 polynomial = O(D)
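A quick numerical check (illustrative) that the explicit degree-2 feature map and the kernel (xᵗz + 1)² give the same inner product:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x'z + 1)^2."""
    D = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(D) for j in range(i + 1, D)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(z))        # explicit inner product in feature space
print((x @ z + 1.0) ** 2)     # same value via the kernel trick
```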

Page 95:

Some Popular Kernels

• Linear: K(xᵢ,xⱼ) = xᵢᵗΣ⁻¹xⱼ
• Polynomial: K(xᵢ,xⱼ) = (xᵢᵗΣ⁻¹xⱼ + c)^d
• Gaussian (RBF): K(xᵢ,xⱼ) = exp(−Σₖ γₖ(xᵢₖ − xⱼₖ)²)
• Chi-Squared: K(xᵢ,xⱼ) = exp(−γχ²(xᵢ, xⱼ))
• Sigmoid: K(xᵢ,xⱼ) = tanh(xᵢᵗxⱼ − c)

Σ should be positive definite, c ≥ 0, γ ≥ 0 and d should be a natural number
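Illustrative implementations (mine) of some of these kernels, simplified to Σ = I and a single scalar γ rather than the generalized forms above:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def chi2_kernel(x, z, gamma=1.0, eps=1e-12):
    # Exponential chi-squared kernel for non-negative features (e.g. histograms).
    chi2 = np.sum((x - z) ** 2 / (x + z + eps))
    return np.exp(-gamma * chi2)
```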

Page 96:

Valid Kernels – Mercer's Theorem

• Let Z be a compact subset of ℝᴰ and K a continuous symmetric function. Then K is a kernel if

  ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0

  for every square integrable real valued function f on Z.

Page 97:

Valid Kernels – Mercer's Theorem

• Let Z be a compact subset of ℝᴰ and K a continuous symmetric function. Then K is a kernel if

  ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0

  for every square integrable real valued function f on Z.

• K is a kernel if every finite symmetric matrix formed by evaluating K on pairs of points from Z is positive semi-definite

Page 98:

Operations on Kernels

• The following operations result in valid kernels
  • K(xᵢ,xⱼ) = Σₖ αₖKₖ(xᵢ,xⱼ) (αₖ ≥ 0)
  • K(xᵢ,xⱼ) = Πₖ Kₖ(xᵢ,xⱼ)
  • K(xᵢ,xⱼ) = f(xᵢ)f(xⱼ) (f : ℝᴰ → ℝ)
  • K(xᵢ,xⱼ) = p(K₁(xᵢ,xⱼ)) (p a polynomial with +ve coefficients)
  • K(xᵢ,xⱼ) = exp(K₁(xᵢ,xⱼ))

• Kernels can be defined over graphs, sets, strings and many other interesting data structures

Page 99:

Kernels

• Kernels should encode all our prior knowledge about feature similarities.

• Kernel parameters can be chosen through cross validation or learnt (see Multiple Kernel Learning).

• Non-linear kernels can sometimes boost classification performance tremendously.

• Non-linear kernels are generally expensive (both during training and for prediction)

Page 100:

Polynomial Kernel of Degree 2

[Figure: classification, decision boundaries and absolute distances for an SVM with a degree 2 polynomial kernel]

Page 101:

Polynomial Kernel of Degree 5

[Figure: classification, decision boundaries and absolute distances for an SVM with a degree 5 polynomial kernel]

Page 102:

RBF Kernel

[Figure: classification, decision boundaries and absolute distances for an SVM with an RBF kernel, γ = 1.000]

Page 103:

Exponential χ² Kernel

[Figure: classification, decision boundaries and absolute distances for an SVM with an exponential χ² kernel, γ = 1.000]

Page 104:

Kernel Parameter Setting – Underfitting

[Figure: an RBF kernel with γ = 0.001 underfits; classification, decision boundaries and absolute distances]

Page 105:

Kernel Parameter Setting

[Figure: an RBF kernel with γ = 1.000; classification, decision boundaries and absolute distances]

Page 106:

Kernel Parameter Setting – Overfitting

[Figure: an RBF kernel with γ = 100.000 overfits; classification, decision boundaries and absolute distances]

Page 107:

Structured Output Prediction

• Minimize_f ½|f|² + C Σᵢ ξᵢ
  such that f(xᵢ,yᵢ) ≥ f(xᵢ,y) + Δ(yᵢ,y) − ξᵢ for all y ≠ yᵢ
            ξᵢ ≥ 0

• Prediction: argmax_y f(x,y)

• This formulation minimizes the hinge upper bound on the loss Δ over the training set, subject to regularization on f. It can be used to predict sets, graphs, etc. for suitable choices of Δ.

Taskar et al., “Max-Margin Markov Networks”, NIPS 03
Tsochantaridis et al., “Large Margin Methods for Structured & Interdependent Output Variables”, JMLR 05

Page 108:

Multi-Class SVM

• Minimize_f ½|f|² + C Σᵢ ξᵢ
  such that f(xᵢ,yᵢ) ≥ f(xᵢ,y) + Δ(yᵢ,y) − ξᵢ for all y ≠ yᵢ
            ξᵢ ≥ 0

• Prediction: argmax_y f(x,y)

• Δ(yᵢ,y) = 1 − δ_{yᵢ,y}

• f(x,y) = wᵗ[φ(x) ⊗ ψ(y)] + bᵗψ(y) = w_yᵗφ(x) + b_y (assuming ψ(y) = e_y)

Weston and Watkins, “SVMs for Multi-Class Pattern Recognition”, ESANN 99
Bordes et al., “LaRank”, ICML 07

Page 109:

Multi-Class SVM

• Min_{w,b} ½ Σₖ wₖᵗwₖ + C Σᵢ ξᵢ
  s.t. w_{yᵢ}ᵗφ(xᵢ) + b_{yᵢ} ≥ w_yᵗφ(xᵢ) + b_y + 1 − ξᵢ for all y ≠ yᵢ
       ξᵢ ≥ 0

• Prediction: argmax_y w_yᵗφ(x) + b_y

• For L classes, with N points per class, the number of constraints is NL²

• Finding the exact solution for real world non-linear problems is often infeasible
• In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations

Page 110:

Acknowledgements

• Ankit Sagwal
• Saurabh Gupta