
# New Quasi-Newton Methods for Efficient Large-Scale Machine Learning

S.V.N. Vishwanathan

Joint work with Nic Schraudolph, Simon Günter, Jin Yu, Peter Sunehag, and Jochen Trumpf

National ICT Australia and Australian National University
[email protected]

December 8, 2007


## Broyden, Fletcher, Goldfarb, Shanno

(The BFGS method takes its name from these four authors.)


## Standard BFGS - I

Locally Quadratic Model

$$m_t(\theta) = f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \tfrac{1}{2} (\theta - \theta_t)^\top H_t (\theta - \theta_t)$$

$H_t$ is an $n \times n$ estimate of the Hessian, and the next iterate minimizes this model: $\theta_{t+1} = \operatorname{argmin}_\theta \, m_t(\theta)$.

Parameter Update

$$\theta_{t+1} = \theta_t - \eta_t B_t \nabla f(\theta_t)$$

$B_t \approx H_t^{-1}$ is a symmetric PSD matrix; $\eta_t$ is a step size, usually found via a line search.
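
To make the update rule concrete, here is a minimal NumPy sketch of a single quasi-Newton step. The function names `f` and `grad_f` and the backtracking (Armijo) line search are illustrative assumptions; the slides do not commit to a particular line-search procedure.

```python
import numpy as np

def quasi_newton_step(f, grad_f, theta, B, eta0=1.0, c=1e-4, shrink=0.5):
    """One update theta_{t+1} = theta_t - eta_t * B_t * grad f(theta_t).

    B approximates the inverse Hessian; eta_t is chosen by a simple
    backtracking line search enforcing the Armijo sufficient-decrease condition.
    """
    g = grad_f(theta)
    d = -B @ g                                 # search direction -B_t grad f(theta_t)
    eta = eta0
    for _ in range(50):                        # cap the number of backtracking steps
        if f(theta + eta * d) <= f(theta) + c * eta * (g @ d):
            break
        eta *= shrink                          # shrink the step until sufficient decrease
    return theta + eta * d
```

For a quadratic objective with $B_t$ equal to the true inverse Hessian, the full step $\eta_t = 1$ is accepted immediately, which is what gives quasi-Newton methods their fast local convergence.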


## Standard BFGS - II

B Matrix Update

Update $B \approx H^{-1}$ by

$$B_{t+1} = \operatorname{argmin}_B \| B - B_t \|_w \quad \text{s.t.} \quad s_t = B y_t$$

where $y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)$ is the difference of gradients and $s_t = \theta_{t+1} - \theta_t$ is the difference in parameters. This yields the update formula

$$B_{t+1} = \left( I - \frac{s_t y_t^\top}{s_t^\top y_t} \right) B_t \left( I - \frac{y_t s_t^\top}{s_t^\top y_t} \right) + \frac{s_t s_t^\top}{s_t^\top y_t}$$

Limited-memory variant (L-BFGS): use a low-rank approximation to $B$.
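
A minimal sketch of that update formula, assuming the curvature condition $s_t^\top y_t > 0$ holds (it does for strictly convex objectives); the helper name below is mine, not from the slides.

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """BFGS update of the inverse-Hessian estimate B.

    s = theta_{t+1} - theta_t                    (parameter difference)
    y = grad f(theta_{t+1}) - grad f(theta_t)    (gradient difference)
    Implements B_{t+1} = (I - s y^T / s^T y) B (I - y s^T / s^T y) + s s^T / s^T y.
    """
    rho = 1.0 / (s @ y)                          # assumes curvature s^T y > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)
```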

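For the limited-memory variant, the low-rank approximation to $B$ is usually applied implicitly via the standard L-BFGS two-loop recursion over the last $m$ pairs $(s_i, y_i)$. The sketch below is a generic version of that recursion, not code from the talk.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Compute d = -B_t * grad implicitly from the last m (s_i, y_i) pairs.

    Standard two-loop recursion; s_list and y_list are ordered oldest to newest.
    """
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        alpha = (s @ q) / (s @ y)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:                                              # implicit B_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # oldest to newest
        beta = (y @ q) / (s @ y)
        q += (alpha - beta) * s
    return -q                                               # search direction -B_t * grad
```

Only the $m$ stored vector pairs are needed, so memory and per-iteration cost are $O(mn)$ rather than the $O(n^2)$ of a dense $B$.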

## The Underlying Assumptions

- The objective function is strictly convex; but objectives from energy-minimization methods are non-convex.
- The objective function is smooth; but regularized risk minimization with the hinge loss is not smooth.
- Batch gradients are available; but they are prohibitively expensive on large datasets.
- The parameter vector is finite-dimensional; but kernel algorithms work in a (potentially) infinite-dimensional RKHS.

Aim of this Talk: Systematically relax these assumptions.


## Relaxing Strict Convexity

The Problem

If the objective is not strictly convex, the Hessian can have zero eigenvalues. This can blow up our estimate $B \approx H^{-1}$.

The BFGS Invariant

The BFGS update maintains the secant equation

$$H_{t+1} s_t = y_t$$

B Matrix Update

Update $B \approx (H + \lambda I)^{-1}$ by

$$B_{t+1} = \operatorname{argmin}_B \| B - B_t \|_w \quad \text{s.t.} \quad s_t = B y_t$$

where $y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t) + \lambda s_t$ and $s_t = \theta_{t+1} - \theta_t$.
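
A minimal sketch of the resulting update under those definitions: the damping constant `lam` stands in for the $\lambda$ in the modified $y_t$, and the function name is illustrative rather than from the slides.

```python
import numpy as np

def damped_bfgs_inverse_update(B, theta, theta_next, grad, grad_next, lam=1e-2):
    """BFGS update with the modified difference vector y_t = grad diff + lam * s_t.

    Because y_t now approximately satisfies (H + lam*I) s_t = y_t, the estimate B
    tracks (H + lam*I)^{-1}, which stays bounded even when H has zero eigenvalues.
    """
    s = theta_next - theta                     # s_t = theta_{t+1} - theta_t
    y = grad_next - grad + lam * s             # modified y_t
    rho = 1.0 / (s @ y)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)  # standard BFGS formula applied to (s_t, y_t)
```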
