Maximum Entropy Discrimination

Page 1: Maximum Entropy Discrimination

Maximum Entropy Discrimination

Tommi Jaakkola Marina Meila Tony Jebara

MIT CMU MIT

Page 2: Maximum Entropy Discrimination

· inputs x, class y = +1, -1
· data D = { (x_1, y_1), …, (x_T, y_T) }

· learn f_opt(x), a discriminant function from F = { f }, a family of discriminants

· classify y = sign f_opt(x)

Classification

Page 3: Maximum Entropy Discrimination

Model averaging

· many f with near optimal performance

· Instead of choosing f_opt, average over all f in F

Q(f) = weight of f

y(x) = sign ∫_F Q(f) f(x) df

     = sign ⟨ f(x) ⟩_Q

· To specify: F = { f }, the family of discriminant functions

· To learn: Q(f), a distribution over F
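A minimal sketch of this averaged decision rule, assuming a small hypothetical finite family of linear discriminants (the weights and vectors below are made up for illustration):

```python
# Hypothetical finite family F of linear discriminants f(x) = w . x
F = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Q = [0.2, 0.3, 0.5]   # Q(f): the weight of each f; must sum to 1

def averaged_classify(x):
    """y(x) = sign <f(x)>_Q: average the discriminant outputs under Q."""
    avg = sum(q * (w[0] * x[0] + w[1] * x[1]) for q, w in zip(Q, F))
    return 1 if avg >= 0 else -1

print(averaged_classify((2.0, -1.0)))
```

With a finite family the average is a weighted vote; the continuous case replaces the sum over F with an integral under Q(f).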

Page 4: Maximum Entropy Discrimination

Goal of this work

· Define a discriminative criterion for averaging over models

Advantages

· can incorporate prior

· can use generative model

· computationally feasible

· generalizes to other discrimination tasks

Page 5: Maximum Entropy Discrimination

Maximum Entropy Discrimination

given data set D = { (x_1, y_1), …, (x_T, y_T) }, find

Q_ME = argmax_Q H(Q)

s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1,…,T (C)

and some γ > 0

solution: Q_ME correctly classifies D

· among all admissible Q, Q_ME has maximum entropy

· maximum entropy ⇒ least specific about f

Page 6: Maximum Entropy Discrimination

· convex problem: Q_ME unique

· solution: Q_ME(f) ∝ exp{ Σ_{t=1..T} λ_t y_t f(x_t) }

· λ_t ≥ 0 Lagrange multipliers

· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints

Solution: Q_ME as a projection

[Figure: Q_ME as the projection of the uniform Q_0 (λ = 0) onto the set of admissible Q]

Page 7: Maximum Entropy Discrimination

Finding the solution

· needed: λ_t, t = 1,…,T

· obtained by solving the dual problem

max_λ J(λ) = max_λ [ − log Z_+(λ) − log Z_−(λ) − γ Σ_t λ_t ]

s.t. λ_t ≥ 0 for t = 1,…,T

Algorithm

· start with λ_t = 0 (uniform distribution)

· iterative ascent on J(λ) until convergence
· derivative: ∂J/∂λ_t = y_t ⟨ log [ P_+(x_t) / P_−(x_t) ] + b ⟩_Q(P) − γ
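The ascent procedure can be sketched end-to-end for a toy finite family, where the posterior Q_ME(f) ∝ exp{ Σ_t λ_t y_t f(x_t) } can be normalized exactly. Everything below (the family, labels, margin, step size) is illustrative, and a fixed margin γ is used instead of the talk's soft-margin average:

```python
import math

# Hypothetical setup: a finite family F, each f listed by its outputs f(x_t)
# on the T training points (values are illustrative, not from the talk)
F_outputs = [(+1, +1, -1), (+1, -1, -1), (-1, +1, +1)]
y = (+1, +1, -1)     # labels y_t
gamma = 0.5          # fixed margin; the talk's version also averages over gamma
T = len(y)

def posterior(lam):
    """Q_ME(f) ~ exp{ sum_t lam_t y_t f(x_t) }, normalized over the family."""
    scores = [math.exp(sum(l * yt * ft for l, yt, ft in zip(lam, y, f)))
              for f in F_outputs]
    Z = sum(scores)
    return [s / Z for s in scores]

def margins(lam):
    """Constraint values y_t <f(x_t)>_Q for each training point t."""
    Q = posterior(lam)
    return [y[t] * sum(q * f[t] for q, f in zip(Q, F_outputs)) for t in range(T)]

# Projected gradient ascent on the dual: raise lam_t while constraint t is
# unsatisfied (y_t <f(x_t)>_Q < gamma), clipping at lam_t >= 0.
lam = [0.0] * T
for _ in range(5000):
    m = margins(lam)
    lam = [max(0.0, l + 0.1 * (gamma - mt)) for l, mt in zip(lam, m)]

print(margins(lam))   # all constraints hold up to a small tolerance
```

The update rule follows the gradient of the unsatisfied constraints, as on slide 6: multipliers grow only for points whose margin constraint is violated.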

Page 8: Maximum Entropy Discrimination

QME as sparse solution

· Classification rule: y(x) = sign ⟨ f(x) ⟩_Q_ME

· γ is the classification margin

· λ_t > 0 only for y_t ⟨ f(x_t) ⟩_Q = γ

⇒ x_t lies on the margin

(support vector!)

Page 9: Maximum Entropy Discrimination

QME as regularization

· uniform distribution Q_0 ⟺ λ = 0

· "smoothness" of Q = H(Q)

· Q_ME is the smoothest admissible distribution

[Figure: Q(f) vs f: a point estimate concentrates at f_opt, while Q_ME is smoother, closer to the uniform Q_0]

Page 10: Maximum Entropy Discrimination

Goal of this work

· Define a discriminative criterion for averaging over models

Extensions

· incorporate prior

· relationship to support vectors

· use generative models

· generalizes to other discrimination tasks

Page 11: Maximum Entropy Discrimination

Priors

· prior Q0( f )

· Minimum Relative Entropy Discrimination

Q_MRE = argmin_Q KL( Q || Q_0 )

s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1,…,T (C)

· prior on γ ⇒ learn Q_MRE(f, γ) ⇒ soft margin

[Figure: Q_MRE as the projection of the prior Q_0 onto the set of admissible Q, minimizing KL( Q || Q_0 )]

Page 12: Maximum Entropy Discrimination

Soft margins

· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)

· constraints: ⟨ y_t f(x_t) − γ ⟩_Q(f,γ) ≥ 0

· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)

Q_0(γ) = c exp[ c(γ − 1) ]  for γ ≤ 1

[Figure: the margin potential as a function of γ]
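Assuming the margin prior on the slide is Q_0(γ) = c·exp[ c(γ − 1) ] supported on γ ≤ 1 (as in the MED paper), its mean works out to 1 − 1/c: large c concentrates the prior near margin 1, while small c tolerates smaller or even negative margins. A quick numerical check of that closed form:

```python
import math

def prior_mean(c, lo=-40.0, n=100000):
    """E[gamma] under Q0(gamma) = c * exp(c * (gamma - 1)) on gamma <= 1,
    by trapezoidal integration; lo truncates the (negligible) left tail."""
    hi = 1.0
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        g = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * g * c * math.exp(c * (g - 1.0))
    return total * h

c = 5.0
print(prior_mean(c), 1.0 - 1.0 / c)   # numeric mean vs closed form 1 - 1/c
```

This is why c also acts as the soft-margin penalty scale: it controls how strongly the prior resists margins below 1.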

Page 13: Maximum Entropy Discrimination

Examples: support vector machines

· Theorem

For f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by

maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where

J(λ) = Σ_t [ λ_t + log( 1 − λ_t /c ) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s

· Separable D ⇒ SVM recovered exactly
· Inseparable D ⇒ SVM recovered with a different misclassification penalty

· Adaptive kernel SVM....
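To see how the theorem's objective relates to the standard SVM, note that it differs only in the log( 1 − λ_t/c ) term, a smooth barrier that replaces the SVM's hard upper bound on each multiplier. A sketch evaluating both duals on made-up data (the points, labels, and multipliers are illustrative):

```python
import math

# Toy data: x_t in R^2 with labels y_t (illustrative only)
X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.0)]
y = [+1, +1, -1]
c = 10.0

def quad(lam):
    """The shared quadratic part: 1/2 sum_{t,s} lam_t lam_s y_t y_s x_t . x_s."""
    total = 0.0
    for t in range(len(X)):
        for s in range(len(X)):
            dot = X[t][0] * X[s][0] + X[t][1] * X[s][1]
            total += lam[t] * lam[s] * y[t] * y[s] * dot
    return 0.5 * total

def med_dual(lam):
    """J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - quadratic part."""
    return sum(l + math.log(1.0 - l / c) for l in lam) - quad(lam)

def svm_dual(lam):
    """Standard SVM dual: sum_t lam_t - quadratic part (box 0 <= lam_t <= C)."""
    return sum(lam) - quad(lam)

lam = [0.1, 0.1, 0.2]   # feasible direction: sum_t lam_t y_t = 0
print(med_dual(lam), svm_dual(lam))
```

The difference between the two objectives is exactly Σ_t log( 1 − λ_t/c ), which diverges as any λ_t approaches c; that is the "different misclassification penalty" the slide refers to.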

Page 14: Maximum Entropy Discrimination

SVM extensions

· Example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)

f(x) = log [ P_+(x) / P_−(x) ] + b

with P_+(x) = Normal( x ; m_+, V_+ ), and similarly P_−(x)

⇒ quadratic classifier; Q(V_+, V_−) = distribution over kernel widths

[Figure: results comparing the MRE Gaussian, a linear SVM, and a maximum likelihood Gaussian]

Page 15: Maximum Entropy Discrimination

Using generative models

· generative models P_+(x), P_−(x) for y = +1, −1

· f(x) = log [ P_+(x) / P_−(x) ] + b

· learn Q_MRE(P_+, P_−, b, γ)

· if the prior factors, Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ),

· then the posterior factors too: Q_MRE = Q_ME(P_+) Q_ME(P_−) Q_MRE(b) Q_MRE(γ)

(factored prior ⇒ factored posterior)
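The generative discriminant above is just a log-likelihood ratio. A minimal one-dimensional sketch for the Gaussian case of slide 14 (the means, variances, and bias are made-up values; MED would average this f over Q(P_+, P_−, b) rather than use a single point estimate):

```python
import math

def gauss_logpdf(x, m, v):
    """log Normal(x; m, v) with mean m and variance v."""
    return -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)

def f(x, b=0.0):
    """f(x) = log [ P+(x) / P-(x) ] + b, with hypothetical class models."""
    return gauss_logpdf(x, m=1.0, v=1.0) - gauss_logpdf(x, m=-1.0, v=1.0) + b

# With equal variances the log-ratio is linear: f(x) = 2x + b
print(f(0.5))
```

With unequal variances V_+ ≠ V_− the same ratio becomes quadratic in x, which is the quadratic classifier mentioned on slide 14.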

Page 16: Maximum Entropy Discrimination

Examples: other distributions

· Multinomial (1 discrete variable)

· Graphical model (fixed structure, no hidden variables)

· Tree graphical model ( Q over structures and parameters)

Page 17: Maximum Entropy Discrimination

Tree graphical models

· P(x | E, θ) = P_0(x) Π_{uv∈E} P_uv(x_u, x_v | θ_uv)

· prior: Q_0(P) = Q_0(E) Q_0(θ | E)

· Q_0(E) = Π_{uv∈E} β_uv

· Q_0(θ | E) = conjugate prior

⇒ Q_MRE(P) ∝ W_0 Π_{uv∈E} W_uv

· with a conjugate prior Q_0(P), the posterior can be integrated analytically over both E and θ
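The analytic integration over structures E is possible because the normalizer is a sum over spanning trees of edge-weight products, which the Matrix-Tree theorem reduces to a determinant of the weighted Laplacian. A self-contained sketch (the weights are illustrative; a real implementation would plug in the posterior weights W_uv):

```python
def det(M):
    """Determinant by Gaussian elimination with simple partial pivoting."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for i in range(n):
        p = next((r for r in range(i, n) if abs(M[r][i]) > 1e-12), None)
        if p is None:
            return 0.0
        if p != i:
            M[i], M[p] = M[p], M[i]
            d = -d
        d *= M[i][i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            for col in range(i, n):
                M[r][col] -= factor * M[i][col]
    return d

def spanning_tree_sum(beta):
    """Sum over spanning trees E of prod_{uv in E} beta[u][v], via the
    Matrix-Tree theorem: any cofactor of the weighted Laplacian."""
    n = len(beta)
    L = [[-beta[u][v] if u != v else sum(beta[u]) for v in range(n)]
         for u in range(n)]
    minor = [row[1:] for row in L[1:]]   # delete row 0 and column 0
    return det(minor)

# Triangle with unit weights: 3 spanning trees
beta = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(spanning_tree_sum(beta))
```

The same determinant trick lets the posterior Q_MRE(P) be normalized, and edge marginals read off, without enumerating the super-exponentially many trees.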

Page 18: Maximum Entropy Discrimination

Trees: experiments

· Splice junction classification task
· 25 inputs, 400 training examples
· compared with maximum likelihood trees

[Figure: test error: ML trees err = 14%, MaxEnt trees err = 12.3%]

Page 19: Maximum Entropy Discrimination

Trees: experiments (cont'd)

[Figure: weights of the tree edges]

Page 20: Maximum Entropy Discrimination

Discrimination tasks

· Classification

· Classification with partially labeled data

· Anomaly detection

[Figure: three schematic data sets: labeled classification (+/−), partially labeled data (x marks unlabeled points), and anomaly detection]

Page 21: Maximum Entropy Discrimination

Partially labeled data

· Problem: given F, a family of discriminants, and a data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find

Q(f, γ, y) = argmin_Q KL( Q || Q_0 )

s.t. ⟨ y_t f(x_t) − γ ⟩_Q ≥ 0 for all t = 1,…,N (C)

Page 22: Maximum Entropy Discrimination

Partially labeled data : experiment

[Figure: results with complete data, 10% labeled + 90% unlabeled, and 10% labeled only]

· Splice junction classification
· 25 inputs
· T_total = 1000

Page 23: Maximum Entropy Discrimination

Anomaly detection

· Problem: given P = { P }, a family of generative models, and a data set D = { x_1, …, x_T }, find

Q(P, γ) = argmin_Q KL( Q || Q_0 )

s.t. ⟨ log P(x_t) − γ ⟩_Q ≥ 0 for all t = 1,…,T (C)

Page 24: Maximum Entropy Discrimination

Anomaly detection: experiments

[Figure: anomaly detection results: MaxEnt vs maximum likelihood]

Page 25: Maximum Entropy Discrimination

Anomaly detection: experiments

[Figure: anomaly detection results (cont'd): MaxEnt vs maximum likelihood]

Page 26: Maximum Entropy Discrimination

Conclusions

· New framework for classification
· Based on regularization in the space of distributions
· Enables use of generative models
· Enables use of priors
· Generalizes to other discrimination tasks