New Quasi-Newton Methods for Efficient Large-Scale Machine Learning

S.V.N. Vishwanathan
Joint work with Nic Schraudolph, Simon Günter, Jin Yu, Peter Sunehag, and Jochen Trumpf

National ICT Australia and Australian National University
[email protected]

NIPS Workshop on BigML, December 8, 2007
Broyden, Fletcher, Goldfarb, Shanno
Standard BFGS - I

Locally Quadratic Model
    m_t(\theta) = f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \tfrac{1}{2} (\theta - \theta_t)^\top H_t (\theta - \theta_t)
    \theta_{t+1} = \operatorname*{argmin}_{\theta} \; m_t(\theta)
H_t is an n \times n estimate of the Hessian

Parameter Update
    \theta_{t+1} = \theta_t - \eta_t B_t \nabla f(\theta_t)
B_t \approx H_t^{-1} is a symmetric PSD matrix
\eta_t is a step size, usually found via a line search
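To make the update concrete, here is a minimal NumPy sketch of one such step. The callables f and grad_f, the current inverse-Hessian estimate B, and the backtracking (Armijo) line search are illustrative assumptions; the slides do not specify which line search is used in practice.

```python
import numpy as np

def bfgs_step(f, grad_f, theta, B, c=1e-4, rho=0.5, max_backtracks=50):
    """One quasi-Newton step: theta_{t+1} = theta_t - eta_t * B_t * grad f(theta_t)."""
    g = grad_f(theta)
    d = -B @ g                      # search direction from the inverse-Hessian estimate B_t
    f0, eta = f(theta), 1.0
    # Backtracking (Armijo) line search: shrink eta until a sufficient-decrease test holds.
    for _ in range(max_backtracks):
        if f(theta + eta * d) <= f0 + c * eta * (g @ d):
            break
        eta *= rho
    return theta + eta * d

# Toy usage on a strictly convex quadratic f(x) = 0.5 x^T A x - b^T x (illustration only):
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
theta, B = np.zeros(2), np.eye(2)   # start from B_0 = I
theta = bfgs_step(f, grad_f, theta, B)
```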
Standard BFGS - II

B Matrix Update
Update B \approx H^{-1} by
    B_{t+1} = \operatorname*{argmin}_{B} \|B - B_t\|_w \quad \text{s.t.} \quad s_t = B y_t
y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t) is the difference of gradients
s_t = \theta_{t+1} - \theta_t is the difference in parameters
This yields the update formula
    B_{t+1} = \left(I - \frac{s_t y_t^\top}{s_t^\top y_t}\right) B_t \left(I - \frac{y_t s_t^\top}{s_t^\top y_t}\right) + \frac{s_t s_t^\top}{s_t^\top y_t}
Limited-memory variant: use a low-rank approximation to B
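The rank-two update formula above maps almost directly to code. A minimal NumPy sketch, assuming s and y are the parameter and gradient differences defined on this slide; the positive-curvature check is a common safeguard added here, not something stated on the slide.

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """B_{t+1} = (I - s y^T / s^T y) B_t (I - y s^T / s^T y) + s s^T / s^T y."""
    sy = float(s @ y)                    # curvature s^T y
    if sy <= 1e-12:                      # safeguard: skip the update if curvature is not positive
        return B
    I = np.eye(len(s))
    V = I - np.outer(s, y) / sy
    return V @ B @ V.T + np.outer(s, s) / sy
```

In the limited-memory (L-BFGS) variant, B is never formed explicitly: only the most recent (s_t, y_t) pairs are stored, and the product B \nabla f is computed implicitly from them.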
The Underlying Assumptions

Objective function is strictly convex
    Objectives from energy minimization methods are non-convex
Objective function is smooth
    Regularized risk minimization with the hinge loss is not smooth
Batch gradients
    Prohibitively expensive on large datasets
Finite-dimensional parameter vector
    Kernel algorithms work in a (potentially) infinite-dimensional RKHS

Aim of this Talk: Systematically relax these assumptions
Relaxing Strict Convexity

The Problem
If the objective is not strictly convex, the Hessian has zero eigenvalues
This can blow up our estimate B \approx H^{-1}

The BFGS Invariant
The BFGS update maintains the secant equation
    H_{t+1} s_t = y_t

B Matrix Update
Update B \approx (H + \lambda I)^{-1} by
    B_{t+1} = \operatorname*{argmin}_{B} \|B - B_t\|_w \quad \text{s.t.} \quad s_t = B y_t
where y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t) + \lambda s_t and s_t = \theta_{t+1} - \theta_t.
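The only algorithmic change relative to the standard B matrix update is the modified gradient difference y_t. A minimal NumPy sketch under that reading, with lam standing in for the regularization constant \lambda (its default value here is an illustrative choice, not one from the talk):

```python
import numpy as np

def relaxed_bfgs_inverse_update(B, theta_old, theta_new, g_old, g_new, lam=1e-2):
    """BFGS-style update targeting B ≈ (H + lam*I)^{-1} instead of H^{-1}."""
    s = theta_new - theta_old
    y = (g_new - g_old) + lam * s        # modified difference of gradients
    # For a convex objective, s^T (g_new - g_old) >= 0, so adding lam * ||s||^2
    # makes the curvature s^T y strictly positive even if the Hessian is singular.
    sy = float(s @ y)
    if sy <= 1e-12:                      # degenerate step: skip the update
        return B
    V = np.eye(len(s)) - np.outer(s, y) / sy
    return V @ B @ V.T + np.outer(s, s) / sy
```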