A majorization-minimization algorithm
for (multiple) hyperparameter learning
ICML 2009 Montreal, Canada
17th June 2009
Chuan-Sheng Foo Chuong B. Do Andrew Y. Ng
Stanford University
Supervised learning
• Training set of m IID examples {(x^(i), y^(i))}, i = 1, …, m
• Probabilistic model P(y | x; w)
– Labels y may be real-valued, discrete, or structured
• Estimate the parameters w
• Regularized maximum likelihood estimation (equivalently, maximum a posteriori (MAP) estimation): a data log-likelihood term plus a log-prior over the model parameters (see the sketch below)
– Regularization prevents overfitting
– Running example: L2-regularized logistic regression, where C is the regularization hyperparameter
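As a reference for the callouts on this slide, here is a sketch (in standard notation, not copied verbatim from the slide) of the L2-regularized maximum-likelihood objective being annotated:

```latex
% Sketch of the regularized maximum likelihood / MAP objective with a single
% L2 regularization hyperparameter C (standard notation, assumed from context).
\[
\min_{w}\;
\underbrace{-\sum_{i=1}^{m}\log P\!\left(y^{(i)}\mid x^{(i)};\,w\right)}_{\text{negative data log-likelihood}}
\;+\;
\underbrace{\tfrac{C}{2}\,\lVert w\rVert^{2}}_{\text{negative log-prior over model parameters}}
\]
```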
How to select the hyperparameter(s)?
• Grid search
+ Simple to implement
− Scales exponentially with # hyperparameters
• Gradient-based algorithms
+ Scales well with # hyperparameters
− Non-trivial to implement
Can we get the best of both worlds?
Our contribution
✓ Striking ease of implementation
✓ Simple, closed-form updates for C
✓ Leverage existing solvers
✓ Scales well to the multiple hyperparameter case
✓ Applicable to a wide range of models
Outline
1. Problem definition
2. The “integrate out” strategy
3. The Majorization-Minimization algorithm
4. Experiments
5. Discussion
The “integrate out” strategy
• Treat hyperparameter C as a random variable
• Analytically integrate out C
• Need a convenient prior p(C)
Integrating out a single hyperparameter
• For L2 regularization, the prior on w is Gaussian with precision C: p(w | C) ∝ exp(−C ||w||² / 2)
• A convenient prior: a Gamma(α, β) prior on C, p(C) ∝ C^(α−1) e^(−βC)
• The result: p(w) ∝ (β + ||w||² / 2)^−(α + n/2)  (derivation sketched below)
1. C is gone
2. Neither convex nor concave in w
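A sketch of the integration behind these bullets, assuming the Gaussian prior on w with precision C and the Gamma(α, β) hyperprior stated above (the experiments later use α = 0, β = 1):

```latex
% Marginalizing out the hyperparameter C under the assumed Gamma(alpha, beta) prior.
\begin{align*}
p(w \mid C) &= \left(\tfrac{C}{2\pi}\right)^{n/2}\exp\!\left(-\tfrac{C}{2}\lVert w\rVert^{2}\right),
\qquad p(C) \propto C^{\alpha-1}e^{-\beta C},\\
p(w) &= \int_{0}^{\infty} p(w \mid C)\,p(C)\,dC
\;\propto\; \left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right)^{-(\alpha + n/2)},\\
-\log p(w) &= \left(\alpha + \tfrac{n}{2}\right)\log\!\left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right) + \text{const.}
\end{align*}
```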
The Majorization-Minimization Algorithm
• Replace a hard problem with a series of easier ones
• EM-like; two steps:
1. Majorization
Upper bound the objective function
2. Minimization
Minimize the upper bound
MM1: Upper-bounding the new prior
• New prior: −log p(w) = (α + n/2) log(β + ||w||² / 2) + const
• Linearize the log: log(x) ≤ log(x0) + (x − x0)/x0, a tight upper bound at any expansion point x0 > 0
[Plot: log(x) with its linear upper bounds from first-order expansions at x = 1, 1.5, and 2]
MM2: Solving the resultant optimization problem
• Resultant linearized prior: [(α + n/2) / (β + ||w_t||² / 2)] · ||w||² / 2 + terms independent of w
• Get standard L2-regularization with effective hyperparameter C_t = (α + n/2) / (β + ||w_t||² / 2)
• Use existing solvers!
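A worked version of the majorization step under the same assumptions: the concave log is bounded by its tangent at the current iterate w_t, which turns the marginal prior into a standard L2 penalty and gives the closed-form C update:

```latex
% Majorization: bound log(x) by its tangent at x_t = beta + ||w_t||^2 / 2,
% i.e. log(x) <= log(x_t) + (x - x_t)/x_t, yielding an L2 penalty with
% effective hyperparameter C_t.
\[
\left(\alpha + \tfrac{n}{2}\right)\log\!\left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right)
\;\le\;
\underbrace{\frac{\alpha + n/2}{\beta + \tfrac{1}{2}\lVert w_t\rVert^{2}}}_{C_t}
\cdot \tfrac{1}{2}\lVert w\rVert^{2} \;+\; \text{const}
\]
```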
Visualization of the upper bound
[Two plots: left, log(0.5x² + 1) with its quadratic upper bounds from expansions at x = 1, 1.5, and 2; right, log(x) with its linear upper bounds from expansions at x = 1, 1.5, and 2]
Overall algorithm
Alternate two steps until convergence (see the sketch below):
1. Closed-form updates for C
2. Leverage existing solvers to update w
Converges to a local minimum
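A minimal Python sketch (not the authors' code) of this loop for binary logistic regression with a single hyperparameter, assuming the Gamma(α, β) hyperprior from the earlier slides and the closed-form C update it implies; the function names and the use of SciPy's L-BFGS solver are illustrative choices.

```python
# Minimal sketch (not the authors' code) of the MM loop for binary logistic
# regression with a single L2 hyperparameter, assuming a Gamma(alpha, beta)
# hyperprior on C (alpha = 0, beta = 1 in the experiments reported later).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of binary logistic regression, labels y in {0, 1}."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)  # sum_i log(1 + e^{z_i}) - y_i z_i

def fit_l2(X, y, C, w0):
    """Minimization step: a standard L2-regularized fit with the current C,
    handled by an existing solver (here, L-BFGS)."""
    objective = lambda w: neg_log_likelihood(w, X, y) + 0.5 * C * (w @ w)
    return minimize(objective, w0, method="L-BFGS-B").x

def mm_hyperparameter_learning(X, y, alpha=0.0, beta=1.0, n_iters=20):
    n = X.shape[1]
    w, C = np.zeros(n), 1.0
    for _ in range(n_iters):
        w = fit_l2(X, y, C, w)                          # minimize the upper bound
        C = (alpha + n / 2.0) / (beta + 0.5 * (w @ w))  # closed-form C update (majorization)
    return w, C

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) + 0.1 * rng.normal(size=200) > 0).astype(float)
w_hat, C_hat = mm_hyperparameter_learning(X, y)
print("learned hyperparameter C:", C_hat)
```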
What about multiple hyperparameters?
• Regularization groups: the weights w = (w1, w2, w3, w4, w5) are partitioned into groups, each sharing one hyperparameter, C = (C1, C2), via a mapping from weights to groups
– NLP example: unigram feature weights vs. bigram feature weights
– RNA secondary structure prediction example: hairpin loops vs. bulge loops
“To C or not to C. That is the question…”
What about multiple hyperparameters?
• Separately update each regularization group: C_k = (α + n_k/2) / (β + ½ Σ_{i in group k} w_i²)
• Sum the squared weights within each group (the ||w||² term splits per group)
• The minimization step becomes weighted L2-regularization: Σ_k (C_k/2) Σ_{i in group k} w_i² (see the sketch below)
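A hypothetical sketch of the per-group update described above; the `groups` array (mapping each weight index to its regularization group), the function name, and the single Gamma(α, β) hyperprior shared by all groups are illustrative assumptions.

```python
# Hypothetical per-group closed-form C update; names and the shared
# Gamma(alpha, beta) hyperprior are illustrative assumptions.
import numpy as np

def update_group_hyperparameters(w, groups, alpha=0.0, beta=1.0):
    """Return a closed-form C_k for every regularization group k."""
    w, groups = np.asarray(w, dtype=float), np.asarray(groups)
    C = {}
    for k in np.unique(groups):
        wk = w[groups == k]  # weights belonging to group k
        C[int(k)] = (alpha + wk.size / 2.0) / (beta + 0.5 * (wk @ wk))
    return C

# Example: weights 0-2 form one group (e.g. unigram features),
# weights 3-4 form another (e.g. bigram features).
print(update_group_hyperparameters([0.5, -1.2, 0.3, 2.0, -0.7], [0, 0, 0, 1, 1]))
```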
Experiments
• 4 probabilistic models
– Linear regression (too easy, not shown)
– Binary logistic regression
– Multinomial logistic regression
– Conditional log-linear model
• 3 competing algorithms
– Grid search
– Gradient-based algorithm (Do et al., 2007)
– Direct optimization of new objective
• Algorithm run with α = 0, β = 1
Results: Binary Logistic Regression
[Bar chart: test accuracy (%) of Grid, Grad, Direct, and MM on the australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a datasets]
Results: Multinomial Logistic Regression
[Bar chart: test accuracy (%) of Grid, Grad, Direct, and MM on the connect-4, dna, glass, iris, letter, mnist, satimage, segment, svmguide2, usps, vehicle, vowel, and wine datasets]
Results: Conditional Log-Linear Models
• RNA secondary structure prediction
• Multiple hyperparameters
[Bar chart: ROC area (roughly 0.58 to 0.65) for single vs. grouped hyperparameters, comparing Gradient, Direct, and MM]
[Illustration: an RNA sequence and its predicted secondary structure in dot-bracket notation]
Discussion
• How to choose α, β in the Gamma prior?
– Sensitivity experiments
– Simple choice reasonable
– Further investigation required
• Simple assumptions sometimes wrong
• But competitive performance with Grid, Grad
• Suited for ‘quick-and-dirty’ implementations
Thank you!