
Classification

Yan Pan


Under and Over Fitting

Probability Theory

• Non-negativity and unit measure: 0 ≤ p(y), p(Ω) = 1, p(∅) = 0
• Conditional probability – p(y|x); p(x, y) = p(y|x) p(x) = p(x|y) p(y)
• Bayes’ Theorem: p(y|x) = p(x|y) p(y) / p(x)
• Marginalization: p(x) = ∫ p(x, y) dy
• Independence: p(x1, x2) = p(x1) p(x2), i.e. p(x1|x2) = p(x1)

Chris Bishop, “Pattern Recognition & Machine Learning”
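
A minimal numeric sketch (not from the slides) that checks these identities on a toy 2×2 joint distribution; the table values are made up for illustration.

    import numpy as np

    # Toy joint distribution p(x, y) over x in {0,1}, y in {0,1} (illustrative values)
    p_xy = np.array([[0.10, 0.30],
                     [0.20, 0.40]])          # rows index x, columns index y
    assert np.isclose(p_xy.sum(), 1.0)       # unit measure

    p_x = p_xy.sum(axis=1)                   # marginalization: p(x) = sum_y p(x, y)
    p_y = p_xy.sum(axis=0)

    p_y_given_x = p_xy / p_x[:, None]        # conditional: p(y|x) = p(x, y) / p(x)
    p_x_given_y = p_xy / p_y[None, :]

    # Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x)
    bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
    assert np.allclose(bayes, p_y_given_x)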

The Univariate Gaussian Density

• p(x|μ,σ) = exp( -(x – μ)² / 2σ² ) / (2πσ²)^½

[Figure: a univariate Gaussian density p(x) plotted against x]

The Multivariate Gaussian Density

• p(x|μ,Σ) = exp( -½ (x – μ)^t Σ^-1 (x – μ) ) / ( (2π)^(D/2) |Σ|^½ )

[Figure: surface plot and contour plot of a bivariate Gaussian density p(x, y)]
The Beta Density

• p(θ|a,b) = θ^(a-1) (1 – θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )

[Figure: Beta densities for (a,b) = (0.1,0.1), (1.0,1.0), (2.0,2.0) and (8.0,4.0)]

Probability Distribution Functions

• Bernoulli: single trial with probability of success = θ
  • n ∈ {0, 1}, θ ∈ [0, 1]
  • p(n|θ) = θ^n (1 – θ)^(1-n)

• Binomial: N iid Bernoulli trials with n successes
  • n ∈ {0, 1, …, N}, θ ∈ [0, 1]
  • p(n|N,θ) = NCn θ^n (1 – θ)^(N-n)

A Toy Example

• We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.

• We are asked to predict whether the next coin flip will result in a head or a tail.

• Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail

• We should predict heads if p(y=1|n,N) > p(y=0|n,N)

The Maximum Likelihood Approach

• Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 – θ, so that we should predict heads if θ > ½
• How should we estimate θ?
• Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data

• θ_ML = argmax_θ p(n|θ) = argmax_θ NCn θ^n (1 – θ)^(N-n)
       = argmax_θ n log(θ) + (N – n) log(1 – θ) = n / N

• We should predict heads if n > ½ N

The Maximum A Posteriori Approach

• We should choose the value of θ maximizing the posterior probability of θ conditioned on the data
• We assume a
  • Binomial likelihood: p(n|θ) = NCn θ^n (1 – θ)^(N-n)
  • Beta prior: p(θ|a,b) = θ^(a-1) (1 – θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )

• θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b)
        = argmax_θ θ^n (1 – θ)^(N-n) θ^(a-1) (1 – θ)^(b-1)
        = (n+a-1) / (N+a+b-2), as if we saw an extra a – 1 heads and b – 1 tails

• We should predict heads if n > ½ (N + b – a)
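
A minimal sketch (not from the slides) comparing the two estimates on the coin example; the counts n, N and the Beta hyperparameters a, b below are made-up illustration values.

    # Maximum likelihood vs. MAP estimate of theta for the coin-flip example
    n, N = 7, 10          # observed heads and total flips (illustrative values)
    a, b = 2.0, 2.0       # Beta prior hyperparameters (illustrative values)

    theta_ml = n / N                            # theta_ML = n / N
    theta_map = (n + a - 1) / (N + a + b - 2)   # theta_MAP, as derived above

    print("ML :", theta_ml,  "-> predict heads" if theta_ml  > 0.5 else "-> predict tails")
    print("MAP:", theta_map, "-> predict heads" if theta_map > 0.5 else "-> predict tails")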


The Bayesian Approach


Classification


Binary Classification


Approaches to Classification

• Memorization
  • Cannot deal with previously unseen data
  • Large-scale annotated data acquisition cost might be very high
• Rule-based expert system
  • Dependent on the competence of the expert
  • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
  • Rules might not transfer to similar problems
• Learning from training data and prior knowledge
  • Focuses on generalization to novel data

Notation

• Training Data
  • Set of N labelled examples of the form (xi, yi)
  • Feature vector – xi ∈ R^D. X = [x1 x2 … xN]
  • Label – yi ∈ {±1}. y = [y1, y2, … yN]^t. Y = diag(y)

• Example – Gender Identification: four face images xi (omitted from the transcript) with labels y1 = +1, y2 = +1, y3 = +1, y4 = -1

Binary Classification

[Figure: a 2D binary dataset and a linear decision boundary w^t x + b = 0; the classifier parameters are θ = [w ; b]]

Machine Learning from the Optimization View

• Before we go into the details of classification and regression methods, we should take a close look at the objective functions of machine learning

• Machine learning: find a rule from data (choosing the best one among many candidate rules). What is the selection criterion?

• Apply each candidate rule to the training data and measure its prediction error; the rule with the fewest prediction errors is the one we are looking for.


Supervised Learning

Common Form of Supervised Learning Problems

• Minimize the following objective function: Regularization term + Loss function

• Regularization term: controls the model complexity and helps avoid overfitting

• Loss function: measures the quality of the learned function, i.e. the prediction error on the training data

Ex. 1: Linear Regression

• E(w) = ½ Σ_n (yn – w^t xn)² + ½ w^t w
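
A minimal sketch (not from the slides) of this objective: with the regularizer above, the minimizer has the closed form w = (X X^t + I)^-1 X y, where X stacks the xn as columns; the data below is synthetic.

    import numpy as np

    # Synthetic data: one column per example, matching the slide notation X = [x1 x2 ... xN]
    rng = np.random.default_rng(0)
    D, N = 3, 50
    X = rng.normal(size=(D, N))
    w_true = np.array([1.0, -2.0, 0.5])
    y = w_true @ X + 0.1 * rng.normal(size=N)

    # Minimizer of E(w) = 1/2 sum_n (yn - w^t xn)^2 + 1/2 w^t w
    w_hat = np.linalg.solve(X @ X.T + np.eye(D), X @ y)
    print(w_hat)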

Ex. 2: Logistic Regression (a classification method)

• L(w, b) = ½ w^t w + Σ_i log(1 + exp(-yi (b + w^t xi)))

Ex. 3: SVM

• E(w) = ½ w^t w + Σ_i max(0, 1 – yi w^t xi)

• Or
• E(w) = ½ w^t w + Σ_i max(0, 1 – yi w^t xi)²

How to measure error?

• True label: yi
• Predicted: w^t xi

• The closer the two, the better. Exact equality?
  – I(yi ≠ w^t xi)
  – (yi – w^t xi)²

• Assuming the values lie in [-1, 1]: make the product as large as possible
  – yi w^t xi

Approximate the Zero-One Loss

• Squared Error
• Exponential Loss
• Logistic Loss
• Hinge Loss
• Sigmoid Loss
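
A minimal sketch (not from the slides) of these surrogate losses as functions of the margin m = y f(x); exact scalings vary across references, so the forms below are one standard choice.

    import numpy as np

    def zero_one(m):    return (m <= 0).astype(float)
    def squared(m):     return (1.0 - m) ** 2
    def exponential(m): return np.exp(-m)
    def logistic(m):    return np.log1p(np.exp(-m))
    def hinge(m):       return np.maximum(0.0, 1.0 - m)
    def sigmoid(m):     return 1.0 / (1.0 + np.exp(m))   # a smooth, bounded surrogate

    m = np.linspace(-3, 3, 7)
    for loss in (zero_one, squared, exponential, logistic, hinge, sigmoid):
        print(loss.__name__, np.round(loss(m), 3))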

Regularized Logistic Regression

[Figure: regularized logistic regression decision boundary on a 2D toy dataset]

Zhu & Hastie, “Kernel Logistic Regression and the Import Vector Machine”, NIPS 01

Convex Functions

• Convex f: f(λx1 + (1 – λ)x2) ≤ λ f(x1) + (1 – λ) f(x2) for all λ ∈ [0, 1]
• The Hessian ∇²f is always positive semi-definite
• The tangent is always a lower bound to f

[Figure: the convex function x log(x) plotted on (0, 2]]

Gradient Descent

• Iteration: x_{n+1} = x_n – η_n ∇f(x_n)
• Step size selection: Armijo rule
• Stopping criterion: change in f is “minuscule”

[Figure: gradient descent iterates x1, x2, x3, x4 on the function x log(x)]

Gradient Descent – Logistic Regression

L(w, b) = ½ w^t w + Σ_i log(1 + exp(-yi (b + w^t xi)))

∇_w L(w, b) = w – Σ_i p(-yi|xi, w) yi xi
∇_b L(w, b) = – Σ_i p(-yi|xi, w) yi

Beware of numerical issues while coding!
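
A minimal sketch (not from the slides) of these updates in Python; the step size, iteration count and synthetic data are illustration choices, and scipy's expit plus np.logaddexp are used to address the numerical warning above.

    import numpy as np
    from scipy.special import expit   # numerically stable sigmoid

    def fit_logistic_gd(X, y, steps=2000, eta=0.01):
        """Gradient descent on L(w,b) = 1/2 w^t w + sum_i log(1 + exp(-yi (b + w^t xi)))."""
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(steps):
            m = y * (X @ w + b)                   # margins yi (b + w^t xi)
            p_wrong = expit(-m)                   # p(-yi | xi, w), computed without overflow
            w -= eta * (w - X.T @ (p_wrong * y))  # gradient step from the slide
            b -= eta * (-np.sum(p_wrong * y))
        m = y * (X @ w + b)
        loss = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -m))   # stable log(1 + exp(-m))
        return w, b, loss

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X @ np.array([2.0, -1.0]) + 0.3)
    w, b, loss = fit_logistic_gd(X, y)
    print(w, b, loss)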

Gradient Descent Algorithm

Input:  x_0, objective f(x), tolerance e, max iterations T
Output: x* that minimizes f(x)

t = 0
while (t == 0 || (f(x_{t-1}) - f(x_t) > e && t < T)) {
    g_t = gradient of f(x) at x_t
    for (i = 10; i >= -6; i--) {      // backtracking search over step sizes s = 2^i
        s = 2^i
        x_{t+1} = x_t - s * g_t
        if (f(x_{t+1}) < f(x_t))
            break
    }
    t++
}
output x_t

Newton Methods

• Iteration: x_{n+1} = x_n – η_n H^-1 ∇f(x_n)
• Approximate f by a 2nd order Taylor expansion
• The error can now decrease quadratically

[Figure: Newton iterates x1, x2, x3 on the function x log(x)]

Newton Descent Algorithm

Input:  x_0, objective f(x), tolerance e, max iterations T
Output: x* that minimizes f(x)

t = 0
while (t == 0 || (f(x_{t-1}) - f(x_t) > e && t < T)) {
    g_t = gradient of f(x) at x_t
    h_t = Hessian matrix of f(x) at x_t
    s   = inverse matrix of h_t
    x_{t+1} = x_t - s * g_t
    t++
}
output x_t
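
A minimal sketch (not from the slides) of one Newton step for the regularized logistic regression objective above, with the parameter collected as theta = [w; b]; the Hessian used here is the identity (with a zero in the bias position) plus sum_i p_i (1 - p_i) x̃_i x̃_i^t, where x̃_i = [xi; 1].

    import numpy as np
    from scipy.special import expit

    def newton_step(theta, X, y):
        """One Newton update for L = 1/2 w^t w + sum_i log(1 + exp(-yi (b + w^t xi)))."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
        m = y * (Xb @ theta)                            # margins
        p_wrong = expit(-m)
        reg = np.eye(theta.size); reg[-1, -1] = 0.0     # regularize w but not b
        grad = reg @ theta - Xb.T @ (p_wrong * y)
        hess = reg + Xb.T @ ((p_wrong * (1 - p_wrong))[:, None] * Xb)
        return theta - np.linalg.solve(hess, grad)      # x_{t+1} = x_t - H^-1 g

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)); y = np.sign(X[:, 0] - X[:, 1] + 0.2)
    theta = np.zeros(3)
    for _ in range(10):                                 # a few Newton steps usually suffice here
        theta = newton_step(theta, X, y)
    print(theta)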

Quasi-Newton Methods

• Computing and inverting the Hessian is expensive
• Quasi-Newton methods can approximate H^-1 directly (LBFGS)
• Iteration: x_{n+1} = x_n – η_n B_n^-1 ∇f(x_n)
• Secant equation: ∇f(x_{n+1}) – ∇f(x_n) = B_{n+1} (x_{n+1} – x_n)
• The secant equation does not fully determine B
• LBFGS updates B_{n+1}^-1 using two rank-one matrices
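
A minimal sketch (not from the slides) that fits the same regularized logistic regression objective with an off-the-shelf quasi-Newton solver; scipy's L-BFGS-B method is used here as one readily available implementation.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit

    def objective(theta, X, y):
        """Returns L(theta) and its gradient for theta = [w; b]."""
        w, b = theta[:-1], theta[-1]
        m = y * (X @ w + b)
        loss = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -m))
        p_wrong = expit(-m)
        grad_w = w - X.T @ (p_wrong * y)
        grad_b = -np.sum(p_wrong * y)
        return loss, np.append(grad_w, grad_b)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)); y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.1)
    res = minimize(objective, np.zeros(4), args=(X, y), jac=True, method="L-BFGS-B")
    print(res.x, res.fun)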

Machine Learning Problems from the Probability View

Bayes’ Decision Rule

• Bayes’ decision rule
  p(y=+1|x) > p(y=-1|x) ? y = +1 : y = -1
  p(y=+1|x) > ½ ? y = +1 : y = -1

[Figure: class posteriors p(y=+1|x) and p(y=-1|x) plotted against x]

Bayesian Approach

p(y|x,X,Y) = ∫_f p(y,f|x,X,Y) df
           = ∫_f p(y|f,x,X,Y) p(f|x,X,Y) df
           = ∫_f p(y|f,x) p(f|X,Y) df

• This integral is often intractable. To solve it we can
  • Choose the distributions so that the solution is analytic (conjugate priors)
  • Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods)
  • Sample from p(f|X,Y) (MCMC)

Maximum A Posteriori (MAP)

p(y|x,X,Y) = ∫_f p(y|f,x) p(f|X,Y) df
           = p(y|f_MAP, x)  when  p(f|X,Y) = δ(f – f_MAP)

• The more training data there is, the better p(f|X,Y) approximates a delta function
• We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP

[Figure: p(f|X,Y) becoming more sharply peaked as we go from no data, to moderate data, to lots of data]

MAP & Maximum Likelihood (ML)

f_MAP = argmax_f p(f|X,Y)
      = argmax_f p(X,Y|f) p(f) / p(X,Y)
      = argmax_f p(X,Y|f) p(f)

f_ML = argmax_f p(X,Y|f)   (Maximum Likelihood)

• Maximum Likelihood holds if
  • There is a lot of training data, so that p(X,Y|f) >> p(f)
  • Or if there is no prior knowledge, so that p(f) is uniform (improper)

IID Data

f_ML = argmax_f p(X,Y|f)
     = argmax_f Π_i p(xi,yi|f)

• The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels. In particular, p(X,Y) ≠ Π_i p(xi,yi) in general.

Discriminative Methods – Logistic Regression

Discriminative Methods

θ_MAP = argmax_θ p(θ) Π_i p(xi, yi | θ)

• We assume that
  • p(θ) = p(w) p(w̃)
  • p(xi, yi | θ) = p(yi | xi, θ) p(xi | θ) = p(yi | xi, w) p(xi | w̃)

θ_MAP = [argmax_w p(w) Π_i p(yi | xi, w)] × [argmax_w̃ p(w̃) Π_i p(xi | w̃)]

• It turns out that only w plays a role in determining the posterior distribution
  p(y|x,X,Y) = p(y|x, θ_MAP) = p(y|x, w_MAP)
  where w_MAP = argmax_w p(w) Π_i p(yi | xi, w)

Discriminative Methods – Logistic Regression

θ_MAP = argmax_{w,b} p(w) Π_i p(yi | xi, w)

• Regularized Logistic Regression
  • Gaussian prior – p(w) ∝ exp( -½ w^t w )
  • Logistic likelihood – p(yi | xi, w) = 1 / (1 + exp(-yi (b + w^t xi)))

[Figure: the class posteriors p(y=+1|x) and p(y=-1|x) plotted against x]

Regularized Logistic Regression

θ_MAP = argmax_{w,b} p(w) Π_i p(yi | xi, w)
      = argmin_{w,b} ½ w^t w + Σ_i log(1 + exp(-yi (b + w^t xi)))

• Bad news: no closed-form solution for w and b
• Good news: we have to minimize a convex function
  • We can obtain the global optimum
  • The function is smooth

Tom Minka, “A Comparison of Numerical Optimizers for Logistic Regression” (Matlab code)
Keerthi et al., “A Fast Dual Algorithm for Kernel Logistic Regression”, ML 05
Andrew and Gao, “OWL-QN”, ICML 07
Krishnapuram et al., “SMLR”, PAMI 05


Multi-Class Logistic Regression

• Multinomial Logistic Regression
• 1-vs-All
  • Learn L binary classifiers for an L-class problem
  • For the l-th classifier, examples from class l are +ve while examples from all other classes are -ve
  • Classify new points according to the maximum probability
• 1-vs-1
  • Learn L(L-1)/2 binary classifiers for an L-class problem by considering every class pair
  • Classify novel points by majority vote
  • Classify novel points by building a DAG

Multi-Class Logistic Regression

• Assume
  • Non-linear multi-class classifier
  • Number of classes = L
  • Number of training points per class = N
  • Algorithm training time for M points = O(M³)
  • Classification time given M training points = O(M)

Multi-Class Logistic Regression

• Multinomial Logistic Regression
  • Training time = O(L⁶N³)
  • Classification time for a new point = O(L²N)
• 1-vs-All
  • Training time = O(L⁴N³)
  • Classification time for a new point = O(L²N)
• 1-vs-1
  • Training time = O(L²N³)
  • Majority-vote classification time = O(L²N)
  • DAG classification time = O(LN)

Multinomial Logistic Regression

θ_MAP = argmax_{w,b} p(w) Π_i p(yi | xi, w)

• Regularized Multinomial Logistic Regression
  • Gaussian prior: p(w) ∝ exp( -½ Σ_l w_l^t w_l )
  • Multinomial logistic posterior: p(yi = l | xi, w) = e^(f_l(xi)) / Σ_k e^(f_k(xi)), where f_k(xi) = w_k^t xi + b_k

Note that we have to learn an extra classifier by not explicitly enforcing Σ_l p(yi = l | xi, w) = 1

Multinomial Logistic Regression

L(w, b) = ½ Σ_k w_k^t w_k + Σ_i [ log( Σ_k e^(f_k(xi)) ) – Σ_k δ_(k,yi) f_k(xi) ]

∇_{w_k} L(w, b) = w_k + Σ_i [ p(yi = k | xi, w) – δ_(k,yi) ] xi
∇_{b_k} L(w, b) = Σ_i [ p(yi = k | xi, w) – δ_(k,yi) ]
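
A minimal sketch (not from the slides) of these multinomial gradients, using a numerically stable softmax; the synthetic three-class data, step size and iteration count are illustration choices.

    import numpy as np

    def softmax(F):
        """Row-wise softmax, shifted by the row max for numerical stability."""
        Z = np.exp(F - F.max(axis=1, keepdims=True))
        return Z / Z.sum(axis=1, keepdims=True)

    def multinomial_lr_gd(X, y, L, steps=3000, eta=0.001):
        """Gradient descent on 1/2 sum_k w_k^t w_k + sum_i [log sum_k e^{f_k(xi)} - f_{yi}(xi)]."""
        N, D = X.shape
        W, b = np.zeros((L, D)), np.zeros(L)
        onehot = np.eye(L)[y]                    # delta_{k,yi} as an N x L matrix
        for _ in range(steps):
            P = softmax(X @ W.T + b)             # p(yi = k | xi, w)
            G = P - onehot                       # the bracketed term from the slide
            W -= eta * (W + G.T @ X)             # grad wrt w_k: w_k + sum_i [p - delta] xi
            b -= eta * G.sum(axis=0)             # grad wrt b_k: sum_i [p - delta]
        return W, b

    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
    y = rng.integers(0, 3, size=300)
    X = centers[y] + rng.normal(size=(300, 2))
    W, b = multinomial_lr_gd(X, y, L=3)
    print("training accuracy:", (softmax(X @ W.T + b).argmax(axis=1) == y).mean())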

Multi-Class Logistic Regression

[Figure: 2D multi-class toy datasets with the decision regions produced by Multinomial Logistic Regression (MLR), 1-vs-All, and 1-vs-1 with majority vote and with a DAG]

From Probabilities to Loss Functions

θ_MAP = argmin_{w,b} ½ w^t w + Σ_i log(1 + exp(-yi (b + w^t xi)))

[Figure: the 0/1, hinge, squared hinge, logistic and robust (sigmoid) losses plotted against the margin y f(x)]


Support Vector Machines


Binary Classification


A Separating Hyperplane


Maximum Margin Hyperplane

Geometric Intuition: Choose the perpendicular bisector of the shortest line segment joining the convex hulls of the two classes

SVM Notation

[Figure: maximum margin hyperplane w^t x + b = 0 with supporting hyperplanes w^t x + b = +1 and w^t x + b = -1; the points lying on the supporting hyperplanes are the support vectors, and the margin is 2 / ‖w‖]

Calculating the Margin

• Let x+ be any point on the +ve supporting plane and x- the closest point on the -ve supporting plane

w^t x+ + b = +1
w^t x- + b = -1
⇒ w^t (x+ – x-) = 2 ⇒ λ w^t w = 2 ⇒ λ = 2 / |w|²

Margin = |x+ – x-| = λ |w|   (since x+ = x- + λ w)
       = 2 |w| / |w|²        (using λ = 2 / |w|²)
       = 2 / |w|

Hard Margin SVM Primal

• Maximize 2/|w| such that
  w^t xi + b ≥ +1 if yi = +1
  w^t xi + b ≤ -1 if yi = -1
• Difficult to optimize directly

• Convex Quadratic Program (QP) reformulation
  • Minimize ½ w^t w such that yi (w^t xi + b) ≥ 1
• Convex QPs can be easy to optimize

Linearly Inseparable Data

• Minimize ½ w^t w + C #(misclassified points) such that yi (w^t xi + b) ≥ 1 (for “good” points)

• The optimization problem is NP-hard in general
• Disastrous errors are penalized the same as near misses

Inseparable Data – Hinge Loss

[Figure: soft-margin hyperplane w^t x + b = 0 with supporting hyperplanes w^t x + b = ±1 and margin 2 / ‖w‖; points on the correct side of their supporting hyperplane have slack ξ = 0, margin violators have ξ < 1, and misclassified points have ξ > 1]

The C-SVM Primal Formulation

• Minimize ½ w^t w + C Σ_i ξi
  such that yi (w^t xi + b) ≥ 1 – ξi,  ξi ≥ 0

• The optimization is a convex QP
• The globally optimal solution will be obtained
• Number of variables = D + N + 1
• Number of constraints = 2N
• Solvers can train on 800K points in 47K (sparse) dimensions in less than 2 minutes on a standard PC

Fan et al., “LIBLINEAR”, JMLR 08
Bordes et al., “LaRank”, ICML 07
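
A minimal sketch (not from the slides) that minimizes the C-SVM primal with plain subgradient descent; a dedicated QP or dual solver such as LIBLINEAR would be used in practice, and the step size, iteration count and C below are illustration values.

    import numpy as np

    def linear_svm_subgradient(X, y, C=1.0, steps=5000, eta=1e-3):
        """Subgradient descent on 1/2 w^t w + C sum_i max(0, 1 - yi (w^t xi + b))."""
        N, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(steps):
            viol = y * (X @ w + b) < 1.0           # points with non-zero hinge loss
            g_w = w - C * (X[viol].T @ y[viol])    # subgradient wrt w
            g_b = -C * np.sum(y[viol])             # subgradient wrt b
            w -= eta * g_w
            b -= eta * g_b
        return w, b

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)); y = np.sign(X[:, 0] + X[:, 1] + 0.25)
    w, b = linear_svm_subgradient(X, y)
    print(w, b, "train error:", np.mean(np.sign(X @ w + b) != y))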

The C-SVM Dual Formulation

• Maximize 1^t α – ½ α^t Y K Y α
  such that 1^t Y α = 0,  0 ≤ α ≤ C

• K is a kernel matrix such that Kij = K(xi, xj) = xi^t xj
• α are the dual variables (Lagrange multipliers)
• Knowing α gives us w and b
• The dual is also a convex QP
  • Number of variables = N
  • Number of constraints = 2N + 1

Fan et al., “LIBSVM”, JMLR 05
Joachims, “SVMLight”
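
A minimal sketch (not from the slides) that trains on the same kind of data with an off-the-shelf dual solver (scikit-learn's SVC, which wraps LIBSVM) and inspects the dual variables α; most of them are zero, which is the sparsity discussed on the following slides.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)); y = np.sign(X[:, 0] + X[:, 1] + 0.25)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print("support vectors:", len(clf.support_), "out of", len(X))
    print("alpha_i * yi for the support vectors:", np.round(clf.dual_coef_.ravel(), 3))
    print("w =", clf.coef_.ravel(), " b =", clf.intercept_)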

SVMs versus Regularized LR

[Figure: SVM and RLR decision boundaries on 2D toy datasets; on the easier datasets most of the SVM α’s are zero, while on a harder dataset most of the α’s are not zero]

Duality

• Primal: P* = min_x f0(x)
  s.t. fi(x) ≤ 0, 1 ≤ i ≤ N
       hi(x) = 0, 1 ≤ i ≤ M

• Lagrangian: L(x, λ, ν) = f0(x) + Σ_i λi fi(x) + Σ_i νi hi(x)

• Dual: D* = max_{λ,ν} min_x L(x, λ, ν)
  s.t. λ ≥ 0

Duality

• The Lagrange dual is always concave (even if the primal is not convex) and might be an easier problem to optimize

• Weak duality: P* ≥ D*
  • Always holds
• Strong duality: P* = D*
  • Does not always hold
  • Usually holds for convex problems
  • Holds for the SVM QP

Karush-Kuhn-Tucker (KKT) Conditions

• If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold
  • Primal feasibility: fi(x*) ≤ 0 and hi(x*) = 0 for all i
  • Dual feasibility: λ* ≥ 0
  • Stationarity: ∇_x L(x*, λ*, ν*) = 0
  • Complementary slackness: λi* fi(x*) = 0

• If x+, λ+ and ν+ satisfy the KKT conditions for a convex problem then they are optimal

SVM – Duality

• Primal: P* = min_{w,ξ,b} ½ w^t w + C ξ^t 1
  s.t. Y(X^t w + b 1) ≥ 1 – ξ,  ξ ≥ 0

• Lagrangian: L(α, β, w, ξ, b) = ½ w^t w + C ξ^t 1 – β^t ξ – α^t [ Y(X^t w + b 1) – 1 + ξ ]

• Dual: D* = max_α 1^t α – ½ α^t Y K Y α
  s.t. 1^t Y α = 0,  0 ≤ α ≤ C

SVM – KKT Conditions

• Lagrangian: L(α, β, w, ξ, b) = ½ w^t w + C ξ^t 1 – β^t ξ – α^t [ Y(X^t w + b 1) – 1 + ξ ]

• Stationarity conditions
  • ∇_w L = 0 ⇒ w* = X Y α*   (Representer Theorem)
  • ∇_ξ L = 0 ⇒ C = α* + β*
  • ∇_b L = 0 ⇒ α*^t Y 1 = 0

• Complementary slackness conditions
  • αi* [ yi (xi^t w* + b*) – 1 + ξi* ] = 0
  • βi* ξi* = 0

Hinge Loss and Sparseness in α

• From the stationarity and complementary slackness conditions it is easy to show that
  • αi = 0 ⇒ xi has been classified correctly and lies beyond its supporting hyperplane
  • 0 < αi < C ⇒ xi is a support vector and lies on its supporting hyperplane
  • αi = C ⇒ xi has been misclassified or is a margin violator

Hinge Loss and Sparseness in α

• SVM α’s are sparse but LR α’s are not

[Figure: the logistic and hinge losses plotted against the margin y f(x); the hinge loss is exactly zero for margins ≥ 1, while the logistic loss is never zero]

Linearly Inseparable Data

• This 1D dataset cannot be separated using a single hyperplane (threshold)
• We need a non-linear decision boundary

[Figure: a 1D dataset plotted along the x axis]

Increasing Dimensionality Non-linearly

• The dataset is now linearly separable in φ space

φ(x) = (x, x²)

[Figure: the same 1D data mapped into 2D by φ, where a line separates the classes]

The Kernel Trick

• Let the “lifted” training set be { (φ(xi), yi) }
• Define the kernel such that Kij = K(xi, xj) = φ(xi)^t φ(xj)

• Primal: P* = min_{w,ξ,b} ½ w^t w + C ξ^t 1
  s.t. Y(φ(X)^t w + b 1) ≥ 1 – ξ,  ξ ≥ 0

• Dual: D* = max_α 1^t α – ½ α^t Y K Y α
  s.t. 1^t Y α = 0,  0 ≤ α ≤ C

• Classifier: f(x) = sign(φ(x)^t w + b) = sign(α^t Y K(:, x) + b)

The Kernel Trick

• Let φ(x) = [1, √2 x1, …, √2 xD, x1², …, xD², √2 x1x2, …, √2 x1xD, …, √2 xD-1xD]^t
• Define K(xi, xj) = φ(xi)^t φ(xj) = (xi^t xj + 1)²

• Primal
  • Number of variables = D + N + 1
  • Number of constraints = 2N
  • Number of flops for calculating φ(x)^t w = O(D²)
  • Number of flops for a degree 20 polynomial = O(D²⁰)

• Dual
  • Number of variables = N
  • Number of constraints = 2N + 1
  • Number of flops for calculating Kij = O(D)
  • Number of flops for a degree 20 polynomial = O(D)
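
A minimal sketch (not from the slides) that checks the identity numerically for D = 3; the explicit map below follows the ordering given above.

    import numpy as np
    from itertools import combinations

    def phi(x):
        """Explicit degree-2 polynomial feature map with phi(x)^t phi(z) = (x^t z + 1)^2."""
        cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
        return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=3), rng.normal(size=3)
    print(phi(x) @ phi(z), (x @ z + 1) ** 2)   # the two numbers agree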

Some Popular Kernels

• Linear: K(xi, xj) = xi^t Σ^-1 xj
• Polynomial: K(xi, xj) = (xi^t Σ^-1 xj + c)^d
• Gaussian (RBF): K(xi, xj) = exp( –Σ_k γk (xik – xjk)² )
• Chi-squared: K(xi, xj) = exp( –γ χ²(xi, xj) )
• Sigmoid: K(xi, xj) = tanh( xi^t xj – c )

Σ should be positive definite; c ≥ 0, γ ≥ 0 and d should be a natural number
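
A minimal sketch (not from the slides) computing the linear, polynomial and Gaussian (RBF) kernel matrices for a small dataset, with Σ taken to be the identity and a single shared γ for simplicity.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                    # 5 points in 3 dimensions

    K_lin = X @ X.T                                # linear kernel (Sigma = I)
    K_poly = (X @ X.T + 1.0) ** 3                  # polynomial kernel, c = 1, d = 3
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_rbf = np.exp(-0.5 * sq_dists)                # Gaussian kernel, gamma = 0.5

    for K in (K_lin, K_poly, K_rbf):               # every kernel matrix is positive semi-definite
        print(np.linalg.eigvalsh(K).min() >= -1e-9)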

Valid Kernels – Mercer’s Theorem

• Let Z be a compact subset of R^D and K a continuous symmetric function. Then K is a kernel if

  ∫_Z ∫_Z f(x) K(x, z) f(z) dx dz ≥ 0

  for all square integrable real-valued functions f on Z.

• K is a kernel if every finite symmetric matrix formed by evaluating K on pairs of points from Z is positive semi-definite

Operations on Kernels

• The following operations result in valid kernels
  • K(xi, xj) = Σ_k αk Kk(xi, xj)   (αk ≥ 0)
  • K(xi, xj) = Π_k Kk(xi, xj)
  • K(xi, xj) = f(xi) f(xj)   (f : R^D → R)
  • K(xi, xj) = p(K1(xi, xj))   (p : polynomial with +ve coefficients)
  • K(xi, xj) = exp(K1(xi, xj))

• Kernels can be defined over graphs, sets, strings and many other interesting data structures

Kernels

• Kernels should encode all our prior knowledge about feature similarities
• Kernel parameters can be chosen through cross-validation or learnt (see Multiple Kernel Learning)
• Non-linear kernels can sometimes boost classification performance tremendously
• Non-linear kernels are generally expensive (both during training and for prediction)

Polynomial Kernel of Degree 2

[Figure: classification, decision boundaries and |distance to boundary| for an SVM with a degree-2 polynomial kernel on a 2D toy dataset]

Polynomial Kernel of Degree 5

[Figure: classification, decision boundaries and |distance to boundary| for a degree-5 polynomial kernel]

RBF Kernel

[Figure: classification, decision boundaries and |distance to boundary| for an RBF kernel with γ = 1.000]

Exponential χ² Kernel

[Figure: classification, decision boundaries and |distance to boundary| for an exponential χ² kernel with γ = 1.000]

Kernel Parameter Setting – Underfitting

[Figure: classification, decision boundaries and |distance to boundary| for an RBF kernel with γ = 0.001]

Kernel Parameter Setting

[Figure: classification, decision boundaries and |distance to boundary| for an RBF kernel with γ = 1.000]

Kernel Parameter Setting – Overfitting

[Figure: classification, decision boundaries and |distance to boundary| for an RBF kernel with γ = 100.000]

Structured Output Prediction

• Minimize_f ½ |f|² + C Σ_i ξi
  such that f(xi, yi) ≥ f(xi, y) + Δ(yi, y) – ξi  ∀ y ≠ yi
            ξi ≥ 0

• Prediction: argmax_y f(x, y)

• This formulation minimizes the hinge on the loss Δ on the training set, subject to regularization on f
• Can be used to predict sets, graphs, etc. for suitable choices of Δ

Taskar et al., “Max-Margin Markov Networks”, NIPS 03
Tsochantaridis et al., “Large Margin Methods for Structured & Interdependent Output Variables”, JMLR 05

Multi-Class SVM

• Minimize_f ½ |f|² + C Σ_i ξi
  such that f(xi, yi) ≥ f(xi, y) + Δ(yi, y) – ξi  ∀ y ≠ yi
            ξi ≥ 0

• Prediction: argmax_y f(x, y)

• Δ(yi, y) = 1 – δ_(yi,y)
• f(x, y) = w^t [ φ(x) ⊗ λ(y) ] + b^t λ(y) = w_y^t φ(x) + b_y   (assuming λ(y) = e_y)

Weston and Watkins, “SVMs for Multi-Class Pattern Recognition”, ESANN 99
Bordes et al., “LaRank”, ICML 07

Multi-Class SVM

• Min_{w,b} ½ Σ_k w_k^t w_k + C Σ_i ξi
  s.t. w_yi^t φ(xi) + b_yi ≥ w_y^t φ(xi) + b_y + 1 – ξi  ∀ y ≠ yi
       ξi ≥ 0

• Prediction: argmax_y w_y^t φ(x) + b_y

• For L classes, with N points per class, the number of constraints is NL²

• Finding the exact solution for real-world non-linear problems is often infeasible
• In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations