
  • New Quasi-Newton Methods for Efficient Large-Scale Machine Learning

    S.V.N. Vishwanathan
    Joint work with Nic Schraudolph, Simon Günter, Jin Yu, Peter Sunehag, and Jochen Trumpf

    National ICT Australia and Australian National University
    [email protected]

    December 8, 2007


  • Broyden, Fletcher, Goldfarb, Shanno


  • Standard BFGS - I

    Locally Quadratic Model

    $m_t(\theta) = f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t) + \tfrac{1}{2} (\theta - \theta_t)^\top H_t (\theta - \theta_t)$

    $\theta_{t+1} = \operatorname{argmin}_\theta \, m_t(\theta)$

    $H_t$ is an $n \times n$ estimate of the Hessian

    Parameter Update

    $\theta_{t+1} = \theta_t - \eta_t B_t \nabla f(\theta_t)$

    $B_t \approx H_t^{-1}$ is a symmetric PSD matrix
    $\eta_t$ is a step size, usually found via a line search
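To make the parameter update concrete, here is a minimal sketch of one quasi-Newton step in NumPy. It assumes the caller supplies the objective `f` and its gradient `grad_f`, and it uses a simple Armijo backtracking search as a stand-in for whatever line search the talk's implementation uses; all names here are illustrative.

```python
import numpy as np

def backtracking_line_search(f, theta, direction, grad, eta0=1.0, c=1e-4, shrink=0.5):
    """Armijo backtracking line search (an illustrative stand-in for the
    Wolfe-condition search usually paired with BFGS)."""
    eta = eta0
    f0 = f(theta)
    slope = grad @ direction          # directional derivative; negative along a descent direction
    while f(theta + eta * direction) > f0 + c * eta * slope:
        eta *= shrink
    return eta

def bfgs_step(f, grad_f, theta, B):
    """One parameter update: theta_{t+1} = theta_t - eta_t * B_t * grad f(theta_t)."""
    g = grad_f(theta)
    direction = -B @ g                # quasi-Newton search direction
    eta = backtracking_line_search(f, theta, direction, g)
    return theta + eta * direction
```

On a strictly convex quadratic with $B$ equal to the exact inverse Hessian, the first trial step $\eta = 1$ already satisfies the Armijo condition and lands on the minimizer, which makes a convenient sanity check.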

  • Standard BFGS - II

    B Matrix Update

    Update $B \approx H^{-1}$ by

    $B_{t+1} = \operatorname{argmin}_B \|B - B_t\|_w \quad \text{s.t.} \quad s_t = B y_t$

    $y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)$ is the difference of gradients
    $s_t = \theta_{t+1} - \theta_t$ is the difference in parameters

    This yields the update formula

    $B_{t+1} = \left(I - \frac{s_t y_t^\top}{s_t^\top y_t}\right) B_t \left(I - \frac{y_t s_t^\top}{s_t^\top y_t}\right) + \frac{s_t s_t^\top}{s_t^\top y_t}$

    Limited memory variant: use a low-rank approximation to $B$
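The rank-two update formula translates directly into a few lines of NumPy. The sketch below implements the standard dense update of $B \approx H^{-1}$, not the limited-memory variant; `s`, `y`, and `B` follow the definitions on the slide, and the curvature safeguard is an added assumption rather than something stated in the talk.

```python
import numpy as np

def bfgs_inverse_hessian_update(B, s, y, eps=1e-10):
    """Standard BFGS update of B ~ H^{-1} from the parameter difference
    s = theta_{t+1} - theta_t and the gradient difference
    y = grad f(theta_{t+1}) - grad f(theta_t):
        B_{t+1} = (I - s y^T / s^T y) B_t (I - y s^T / s^T y) + s s^T / s^T y
    """
    sy = s @ y
    if sy <= eps:                          # skip the update if the curvature condition fails (assumed safeguard)
        return B
    n = s.shape[0]
    rho = 1.0 / sy
    V = np.eye(n) - rho * np.outer(s, y)   # (I - s y^T / s^T y)
    return V @ B @ V.T + rho * np.outer(s, s)
```

The limited-memory (L-BFGS) variant never forms $B$ explicitly: it keeps only the last $m$ pairs $(s_t, y_t)$ and applies the same update implicitly, which is what makes the method practical when $n$ is large.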

  • The Underlying Assumptions

    Objective function is strictly convex
    Objectives from energy minimization methods are non-convex

    Objective function is smooth
    Regularized risk minimization with hinge loss is not smooth

    Batch gradients
    Prohibitively expensive on large datasets

    Finite-dimensional parameter vector
    Kernel algorithms work in (potentially) infinite-dimensional RKHS

    Aim of this Talk: Systematically relax these assumptions

  • Relaxing Strict Convexity

    The Problem
    If the objective is not strictly convex, then the Hessian has zero eigenvalues
    This can blow up our estimate $B \approx H^{-1}$

    The BFGS Invariant
    The BFGS update maintains the secant equation

    $H_{t+1} s_t = y_t$

    B Matrix Update

    Update $B \approx (H + \lambda I)^{-1}$ by

    $B_{t+1} = \operatorname{argmin}_B \|B - B_t\|_w \quad \text{s.t.} \quad s_t = B y_t$

    where $y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t) + \lambda s_t$ and $s_t = \theta_{t+1} - \theta_t$.
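As a rough illustration of the modified secant condition, the sketch below reuses the rank-two update from the previous block but feeds it the regularized gradient difference $y_t + \lambda s_t$, so that $B$ tracks $(H + \lambda I)^{-1}$ instead of $H^{-1}$. The name `lam` stands for the regularization constant whose symbol was lost in the PDF extraction, so treat the exact scaling as an assumption rather than the talk's stated method.

```python
import numpy as np

def regularized_bfgs_update(B, s, y, lam, eps=1e-10):
    """BFGS-style update of B ~ (H + lam*I)^{-1}: replace the gradient
    difference y_t by y_t + lam * s_t, then apply the usual rank-two update.
    `lam` is an assumed name for the regularization constant on the slide."""
    y_reg = y + lam * s               # modified gradient difference
    sy = s @ y_reg                    # = s^T y + lam * ||s||^2, positive for convex f when lam > 0 and s != 0
    if sy <= eps:
        return B
    n = s.shape[0]
    rho = 1.0 / sy
    V = np.eye(n) - rho * np.outer(s, y_reg)
    return V @ B @ V.T + rho * np.outer(s, s)
```

This keeps the estimate well conditioned even when the Hessian has zero eigenvalues, since for PSD $H$ the target $(H + \lambda I)^{-1}$ has eigenvalues bounded above by $1/\lambda$.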