A majorization-minimization algorithm
for (multiple) hyperparameter learning
ICML 2009 Montreal, Canada
17th June 2009
Chuan-Sheng Foo Chuong B. Do Andrew Y. Ng
Stanford University
Supervised learning
• Training set of m IID examples {(x^(i), y^(i))}, i = 1, …, m
• Probabilistic model P(y | x; w)
– Labels y may be real-valued, discrete, or structured
• Estimate the parameters w
• Regularized maximum likelihood estimation (equivalently, maximum a posteriori (MAP) estimation): a data log-likelihood term plus a log-prior over the model parameters (see the sketch below)
– Regularization prevents overfitting
– Running example: L2-regularized logistic regression, where C is the regularization hyperparameter
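As a reference for the callouts on this slide, here is a sketch (in standard notation, not copied verbatim from the slide) of the L2-regularized maximum-likelihood objective being annotated:

```latex
% Sketch of the regularized maximum likelihood / MAP objective with a single
% L2 regularization hyperparameter C (standard notation, assumed from context).
\[
\min_{w}\;
\underbrace{-\sum_{i=1}^{m}\log P\!\left(y^{(i)}\mid x^{(i)};\,w\right)}_{\text{negative data log-likelihood}}
\;+\;
\underbrace{\tfrac{C}{2}\,\lVert w\rVert^{2}}_{\text{negative log-prior over model parameters}}
\]
```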
How to select the hyperparameter(s)?
• Grid search
+ Simple to implement
− Scales exponentially with # hyperparameters
• Gradient-based algorithms
+ Scales well with # hyperparameters
− Non-trivial to implement
Can we get the best of both worlds?
Our contribution
✓ Striking ease of implementation
✓ Simple, closed-form updates for C
✓ Leverage existing solvers
✓ Scales well to the multiple hyperparameter case
✓ Applicable to a wide range of models
Outline
1. Problem definition
2. The “integrate out” strategy
3. The Majorization-Minimization algorithm
4. Experiments
5. Discussion
The “integrate out” strategy
• Treat hyperparameter C as a random variable
• Analytically integrate out C
• Need a convenient prior p(C)
Integrating out a single hyperparameter
• For L2 regularization, the prior on w is Gaussian with precision C: p(w | C) ∝ exp(−C ||w||² / 2)
• A convenient prior: a Gamma(α, β) prior on C, p(C) ∝ C^(α−1) e^(−βC)
• The result: p(w) ∝ (β + ||w||² / 2)^−(α + n/2)  (derivation sketched below)
1. C is gone
2. Neither convex nor concave in w
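A sketch of the integration behind these bullets, assuming the Gaussian prior on w with precision C and the Gamma(α, β) hyperprior stated above (the experiments later use α = 0, β = 1):

```latex
% Marginalizing out the hyperparameter C under the assumed Gamma(alpha, beta) prior.
\begin{align*}
p(w \mid C) &= \left(\tfrac{C}{2\pi}\right)^{n/2}\exp\!\left(-\tfrac{C}{2}\lVert w\rVert^{2}\right),
\qquad p(C) \propto C^{\alpha-1}e^{-\beta C},\\
p(w) &= \int_{0}^{\infty} p(w \mid C)\,p(C)\,dC
\;\propto\; \left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right)^{-(\alpha + n/2)},\\
-\log p(w) &= \left(\alpha + \tfrac{n}{2}\right)\log\!\left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right) + \text{const.}
\end{align*}
```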
The Majorization-Minimization Algorithm
• Replace a hard problem with a series of easier ones
• EM-like; two steps:
1. Majorization
Upper bound the objective function
2. Minimization
Minimize the upper bound
MM1: Upper-bounding the new prior
• New prior: −log p(w) = (α + n/2) log(β + ||w||² / 2) + const
• Linearize the log: log(x) ≤ log(x0) + (x − x0)/x0, a tight upper bound at any expansion point x0 > 0
[Plot: log(x) with its linear upper bounds from first-order expansions at x = 1, 1.5, and 2]
MM2: Solving the resultant optimization problem
• Resultant linearized prior: [(α + n/2) / (β + ||w_t||² / 2)] · ||w||² / 2 + terms independent of w
• Get standard L2-regularization with effective hyperparameter C_t = (α + n/2) / (β + ||w_t||² / 2)
• Use existing solvers!
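A worked version of the majorization step under the same assumptions: the concave log is bounded by its tangent at the current iterate w_t, which turns the marginal prior into a standard L2 penalty and gives the closed-form C update:

```latex
% Majorization: bound log(x) by its tangent at x_t = beta + ||w_t||^2 / 2,
% i.e. log(x) <= log(x_t) + (x - x_t)/x_t, yielding an L2 penalty with
% effective hyperparameter C_t.
\[
\left(\alpha + \tfrac{n}{2}\right)\log\!\left(\beta + \tfrac{1}{2}\lVert w\rVert^{2}\right)
\;\le\;
\underbrace{\frac{\alpha + n/2}{\beta + \tfrac{1}{2}\lVert w_t\rVert^{2}}}_{C_t}
\cdot \tfrac{1}{2}\lVert w\rVert^{2} \;+\; \text{const}
\]
```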
Visualization of the upper bound
[Two plots: left, log(0.5x² + 1) with its quadratic upper bounds from expansions at x = 1, 1.5, and 2; right, log(x) with its linear upper bounds from expansions at x = 1, 1.5, and 2]
Overall algorithm
Alternate two steps until convergence (see the sketch below):
1. Closed-form updates for C
2. Leverage existing solvers to update w
Converges to a local minimum
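A minimal Python sketch (not the authors' code) of this loop for binary logistic regression with a single hyperparameter, assuming the Gamma(α, β) hyperprior from the earlier slides and the closed-form C update it implies; the function names and the use of SciPy's L-BFGS solver are illustrative choices.

```python
# Minimal sketch (not the authors' code) of the MM loop for binary logistic
# regression with a single L2 hyperparameter, assuming a Gamma(alpha, beta)
# hyperprior on C (alpha = 0, beta = 1 in the experiments reported later).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of binary logistic regression, labels y in {0, 1}."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)  # sum_i log(1 + e^{z_i}) - y_i z_i

def fit_l2(X, y, C, w0):
    """Minimization step: a standard L2-regularized fit with the current C,
    handled by an existing solver (here, L-BFGS)."""
    objective = lambda w: neg_log_likelihood(w, X, y) + 0.5 * C * (w @ w)
    return minimize(objective, w0, method="L-BFGS-B").x

def mm_hyperparameter_learning(X, y, alpha=0.0, beta=1.0, n_iters=20):
    n = X.shape[1]
    w, C = np.zeros(n), 1.0
    for _ in range(n_iters):
        w = fit_l2(X, y, C, w)                          # minimize the upper bound
        C = (alpha + n / 2.0) / (beta + 0.5 * (w @ w))  # closed-form C update (majorization)
    return w, C

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) + 0.1 * rng.normal(size=200) > 0).astype(float)
w_hat, C_hat = mm_hyperparameter_learning(X, y)
print("learned hyperparameter C:", C_hat)
```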
What about multiple hyperparameters?
• Regularization groups: the weights w = (w1, w2, w3, w4, w5) are partitioned into groups, each sharing one hyperparameter, C = (C1, C2), via a mapping from weights to groups
– NLP example: unigram feature weights vs. bigram feature weights
– RNA secondary structure prediction example: hairpin loops vs. bulge loops
“To C or not to C. That is the question…”
What about multiple hyperparameters?
• Separately update each regularization group: C_k = (α + n_k/2) / (β + ½ Σ_{i in group k} w_i²)
• Sum the squared weights within each group (the ||w||² term splits per group)
• The minimization step becomes weighted L2-regularization: Σ_k (C_k/2) Σ_{i in group k} w_i² (see the sketch below)
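A hypothetical sketch of the per-group update described above; the `groups` array (mapping each weight index to its regularization group), the function name, and the single Gamma(α, β) hyperprior shared by all groups are illustrative assumptions.

```python
# Hypothetical per-group closed-form C update; names and the shared
# Gamma(alpha, beta) hyperprior are illustrative assumptions.
import numpy as np

def update_group_hyperparameters(w, groups, alpha=0.0, beta=1.0):
    """Return a closed-form C_k for every regularization group k."""
    w, groups = np.asarray(w, dtype=float), np.asarray(groups)
    C = {}
    for k in np.unique(groups):
        wk = w[groups == k]  # weights belonging to group k
        C[int(k)] = (alpha + wk.size / 2.0) / (beta + 0.5 * (wk @ wk))
    return C

# Example: weights 0-2 form one group (e.g. unigram features),
# weights 3-4 form another (e.g. bigram features).
print(update_group_hyperparameters([0.5, -1.2, 0.3, 2.0, -0.7], [0, 0, 0, 1, 1]))
```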
Experiments
• 4 probabilistic models
– Linear regression (too easy, not shown)
– Binary logistic regression
– Multinomial logistic regression
– Conditional log-linear model
• 3 competing algorithms
– Grid search
– Gradient-based algorithm (Do et al., 2007)
– Direct optimization of new objective
• Algorithm run with α = 0, β = 1
Results: Binary Logistic Regression
[Bar chart: test accuracy (%) of Grid, Grad, Direct, and MM on the australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a datasets]
Results: Multinomial Logistic Regression
[Bar chart: test accuracy (%) of Grid, Grad, Direct, and MM on the connect-4, dna, glass, iris, letter, mnist, satimage, segment, svmguide2, usps, vehicle, vowel, and wine datasets]
Results: Conditional Log-Linear Models
• RNA secondary structure prediction
• Multiple hyperparameters
[Bar chart: ROC area (roughly 0.58 to 0.65) for single vs. grouped hyperparameters, comparing Gradient, Direct, and MM]
[Illustration: an RNA sequence and its predicted secondary structure in dot-bracket notation]
Discussion
• How to choose α, β in the Gamma prior?
– Sensitivity experiments
– Simple choice reasonable
– Further investigation required
• Simple assumptions sometimes wrong
• But competitive performance with Grid, Grad
• Suited for ‘quick-and-dirty’ implementations
Thank you!