
Stochastic Gradient Descent with Variance Reduction

Rie Johnson, Tong Zhang
Presenter: Jiawen Yao

March 17, 2015


Outline

1 Problem

2 Stochastic Average Gradient (SAG)

3 Accelerating SGD using Predictive Variance Reduction (SVRG)

4 Conclusion

Problem

Preliminaries

Recall a few definitions from convex analysis.

Definition 1. A function $f(x)$ is an $L$-Lipschitz continuous function if

$$\|f(x_1) - f(x_2)\| \le L\|x_1 - x_2\| \quad (1)$$

for all $x_1, x_2 \in \mathrm{dom}(f)$.

Definition 2. A convex function $f(x)$ is $\beta$-strongly convex if there exists a constant $\beta > 0$ such that for any $\alpha \in [0, 1]$,

$$f(\alpha x_1 + (1-\alpha)x_2) \le \alpha f(x_1) + (1-\alpha)f(x_2) - \frac{1}{2}\alpha(1-\alpha)\beta\|x_1 - x_2\|^2 \quad (2)$$
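As a quick sanity check of Definition 2 (a worked example added here, not on the original slide), the quadratic $f(x) = \frac{\beta}{2}\|x\|^2$ is $\beta$-strongly convex; expanding the squared norms shows (2) holds with equality:

$$\begin{aligned}
f(\alpha x_1 + (1-\alpha)x_2) &= \frac{\beta}{2}\left[\alpha^2\|x_1\|^2 + (1-\alpha)^2\|x_2\|^2 + 2\alpha(1-\alpha)\langle x_1, x_2\rangle\right] \\
&= \alpha f(x_1) + (1-\alpha)f(x_2) - \frac{1}{2}\alpha(1-\alpha)\beta\|x_1 - x_2\|^2 .
\end{aligned}$$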


Preliminaries

When $f(x)$ is differentiable, strong convexity is equivalent to

$$f(x_1) \ge f(x_2) + \langle \nabla f(x_2), x_1 - x_2 \rangle + \frac{\beta}{2}\|x_1 - x_2\|^2 \quad (3)$$

Typically, the standard Euclidean norm is used to define Lipschitz continuous and strongly convex functions.


Minimizing finite average of convex functions

Let $\psi_1, \ldots, \psi_n$ be a sequence of functions from $\mathbb{R}^d$ to $\mathbb{R}$. The goal is to solve

$$\min_{\omega} P(\omega), \qquad P(\omega) = \frac{1}{n}\sum_{i=1}^{n}\psi_i(\omega) \quad (4)$$

Assumptions:

each $\psi_i(\omega)$ is convex and differentiable

each $\psi_i(\omega)$ is smooth with Lipschitz constant $L$:

$$\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega')\| \le L\|\omega - \omega'\| \quad (5)$$

$P(\omega)$ is strongly convex:

$$P(\omega) \ge P(\omega') + \nabla P(\omega')^{\top}(\omega - \omega') + \frac{\gamma}{2}\|\omega - \omega'\|^2 \quad (6)$$
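As a concrete instance of (4) (an illustration added here, not taken from the slides), consider ridge regression with data $(x_i, y_i)$ and regularization $\lambda > 0$:

$$\psi_i(\omega) = \frac{1}{2}\left(x_i^{\top}\omega - y_i\right)^2 + \frac{\lambda}{2}\|\omega\|^2, \qquad \nabla\psi_i(\omega) = x_i\left(x_i^{\top}\omega - y_i\right) + \lambda\omega .$$

Each $\psi_i$ is convex, its gradient is Lipschitz with constant $\|x_i\|^2 + \lambda$ (so $L = \max_i\|x_i\|^2 + \lambda$ works in (5)), and the $\frac{\lambda}{2}\|\omega\|^2$ term makes $P(\omega)$ strongly convex with $\gamma \ge \lambda$ in (6).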


Gradient Descent

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\nabla P(\omega^{(t-1)}) = \omega^{(t-1)} - \frac{\eta_t}{n}\sum_{i=1}^{n}\nabla\psi_i(\omega^{(t-1)}) \quad (7)$$

Stochastic Gradient Descent: draw $i_t$ randomly from $\{1, \ldots, n\}$ and update

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\nabla\psi_{i_t}(\omega^{(t-1)}) \quad (8)$$
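To make the two update rules concrete, here is a minimal Python sketch (the least-squares objective, variable names, and step size are illustrative assumptions, not from the slides) contrasting the full-gradient step (7) with the stochastic step (8):

import numpy as np

def full_gradient_step(w, grads, eta):
    """One gradient-descent step, eq. (7): average all n per-example gradients."""
    return w - eta * np.mean([g(w) for g in grads], axis=0)

def sgd_step(w, grads, eta, rng):
    """One SGD step, eq. (8): use a single randomly drawn per-example gradient."""
    i = rng.integers(len(grads))
    return w - eta * grads[i](w)

# Illustrative finite-sum problem: psi_i(w) = 0.5 * (x_i^T w - y_i)^2 (least squares).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
grads = [lambda w, x=x, t=t: x * (x @ w - t) for x, t in zip(X, y)]

w = np.zeros(5)
for _ in range(1000):
    w = sgd_step(w, grads, eta=0.01, rng=rng)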


SGD

A more general version of SGD is the following

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\, g_t(\omega^{(t-1)}, \xi_t) \quad (9)$$

where $\xi_t$ is a random variable that may depend on $\omega^{(t-1)}$, and whose expectation satisfies $\mathbb{E}\left[g_t(\omega^{(t-1)}, \xi_t) \mid \omega^{(t-1)}\right] = \nabla P(\omega^{(t-1)})$.
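In particular (a short check added here), plain SGD (8) is the special case $\xi_t = i_t$ with $i_t$ drawn uniformly from $\{1, \ldots, n\}$, which is unbiased:

$$\mathbb{E}\left[\nabla\psi_{i_t}(\omega^{(t-1)}) \mid \omega^{(t-1)}\right] = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\omega^{(t-1)}) = \nabla P(\omega^{(t-1)}) .$$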


Variance

For general convex optimization, stochastic gradient descent methods can obtain an $O(1/\sqrt{T})$ convergence rate in expectation.

Randomness introduces variance: if $g_t(\omega^{(t-1)}, \xi_t)$ has large variance, it slows down convergence.

Stochastic Average Gradient (SAG)

SAG method (Le Roux, Schmidt, Bach 2012)

$$\omega_t = \omega_{t-1} - \frac{\eta}{n}\sum_{i=1}^{n} g_t^{(i)} \quad (10)$$

where

$$g_t^{(i)} = \begin{cases} \nabla\psi_i(\omega_{t-1}), & \text{if } i = i_t \\ g_{t-1}^{(i)}, & \text{otherwise} \end{cases} \quad (11)$$

It needs to store all $n$ gradients, which is not practical in some cases.
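A minimal Python sketch of the SAG update (10)-(11), reusing the illustrative least-squares `grads` list from the earlier sketch; the gradient table makes the $O(nd)$ storage cost explicit:

import numpy as np

def sag(grads, d, eta=0.01, iters=1000, seed=0):
    """Stochastic Average Gradient: keep a table of the last gradient seen
    for every example and step along the average of the table."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    w = np.zeros(d)
    g_table = np.zeros((n, d))        # stored per-example gradients (the memory cost of SAG)
    g_sum = np.zeros(d)               # running sum of the table, maintained incrementally
    for _ in range(iters):
        i = rng.integers(n)
        g_new = grads[i](w)           # fresh gradient only for the sampled example
        g_sum += g_new - g_table[i]   # update the running sum in O(d)
        g_table[i] = g_new
        w = w - (eta / n) * g_sum     # update (10): average of all stored gradients
    return w

# Illustrative use with the least-squares grads from the earlier sketch:
# w_hat = sag(grads, d=5)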

Accelerating SGD using Predictive Variance Reduction (SVRG)

SVRG

Motivation

Reduce the variance

Stochastic gradient descent has slow asymptotic convergence due to the inherent variance.

SAG needs to store all gradients

Contribution

No need to store the intermediate gradients

It obtains the same fast convergence rate as SAG

Under mild assumptions, it even works in nonconvex cases


Stochastic variance reduced gradient (SVRG)

SVRG (Johnson & Zhang, NIPS 2013) update form:

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\left(\nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde{\omega}) + \nabla P(\tilde{\omega})\right) \quad (12)$$

Update $\tilde{\omega}$ periodically (every $m$ SGD iterations).

Figure: Intuition of variance reduction
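The intuition behind the figure can be stated directly (a brief elaboration added here): the corrected direction is still an unbiased estimate of the full gradient, yet it vanishes as both the iterate and the snapshot approach $\omega_*$, unlike the plain SGD direction $\nabla\psi_{i_t}(\omega^{(t-1)})$, which in general does not vanish at $\omega_*$:

$$\mathbb{E}_{i_t}\left[\nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde{\omega}) + \nabla P(\tilde{\omega})\right] = \nabla P(\omega^{(t-1)}), \qquad \nabla\psi_{i_t}(\omega_*) - \nabla\psi_{i_t}(\omega_*) + \nabla P(\omega_*) = 0 .$$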


Procedure SVRG

Input: update frequency $m$ and learning rate $\eta$
Initialization: $\tilde{\omega}_0$
for $s = 1, 2, \ldots$ do
    $\tilde{\omega} = \tilde{\omega}_{s-1}$
    $\tilde{\mu} = \nabla P(\tilde{\omega}) = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\tilde{\omega})$
    $\omega_0 = \tilde{\omega}$
    for $t = 1, 2, \ldots, m$ do
        randomly pick $i_t \in \{1, \ldots, n\}$ and update
        $\omega_t = \omega_{t-1} - \eta\left(\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right)$
    end for
    Option I: set $\tilde{\omega}_s = \omega_m$
    Option II: set $\tilde{\omega}_s = \omega_t$ for randomly chosen $t \in \{0, \ldots, m-1\}$
end for
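A compact Python sketch of this procedure with Option I (the least-squares `grads` list, the fixed step size, and other names are illustrative assumptions carried over from the earlier sketches):

import numpy as np

def svrg(grads, d, eta=0.01, m=100, epochs=20, seed=0):
    """SVRG, Option I: recompute the full gradient at a snapshot w_tilde every
    m inner steps and correct each stochastic gradient with it."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    w_tilde = np.zeros(d)
    for _ in range(epochs):
        mu = np.mean([g(w_tilde) for g in grads], axis=0)   # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            v = grads[i](w) - grads[i](w_tilde) + mu        # variance-reduced direction (12)
            w = w - eta * v
        w_tilde = w                                         # Option I: take the last iterate
    return w_tilde

# Illustrative use with the least-squares grads defined earlier:
# w_hat = svrg(grads, d=5)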


Convergence for SVRG

Theorem

Consider SVRG with Option II. Assume that all $\psi_i(\omega)$ are convex and smooth, and that $P(\omega)$ is strongly convex. Let $\omega_* = \arg\min_{\omega} P(\omega)$. Assume that $m$ is sufficiently large so that

$$\alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} < 1.$$

Then we have geometric convergence in expectation for SVRG:

$$\mathbb{E}\,P(\tilde{\omega}_s) \le P(\omega_*) + \alpha^s\left[P(\tilde{\omega}_0) - P(\omega_*)\right].$$
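To make the condition on $m$ concrete (a worked example added here, not from the slides): taking $\eta = 0.1/L$ and $m = 50L/\gamma$ gives $2L\eta = 0.2$, $1 - 2L\eta = 0.8$, and

$$\alpha = \frac{1}{\gamma \cdot \frac{0.1}{L} \cdot 0.8 \cdot \frac{50L}{\gamma}} + \frac{0.2}{0.8} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} < 1,$$

so each outer stage roughly halves the expected suboptimality.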


Proof

Given any i, consider

$$g_i(\omega) = \psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^{\top}(\omega - \omega_*) \quad (13)$$

Since $\nabla g_i(\omega_*) = 0$ and $g_i$ is convex, $\omega_*$ minimizes $g_i$, i.e. $g_i(\omega_*) = \min_{\omega} g_i(\omega) = 0$. Hence

$$0 = g_i(\omega_*) \le \min_{\eta}\, g_i\!\left(\omega - \eta\nabla g_i(\omega)\right) \le \min_{\eta}\left[g_i(\omega) - \eta\|\nabla g_i(\omega)\|^2 + \frac{1}{2}L\eta^2\|\nabla g_i(\omega)\|^2\right] \quad (14)$$

Here we use a well-known inequality for a function with an $L$-Lipschitz continuous gradient:

$$f(x) - f(y) - \langle\nabla f(y), x - y\rangle \le \frac{L}{2}\|x - y\|^2$$
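Explicitly (a step filled in here), applying this inequality with $f = g_i$, $y = \omega$, and $x = \omega - \eta\nabla g_i(\omega)$ gives the second inequality in (14):

$$g_i\left(\omega - \eta\nabla g_i(\omega)\right) \le g_i(\omega) - \eta\|\nabla g_i(\omega)\|^2 + \frac{L\eta^2}{2}\|\nabla g_i(\omega)\|^2 .$$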


Proof

Minimizing the right-hand side of (14) over $\eta$ gives $\eta = 1/L$, so

$$0 = g_i(\omega_*) \le g_i(\omega) - \frac{1}{2L}\|\nabla g_i(\omega)\|^2 \quad (15)$$

which can be rewritten as

$$\|\nabla g_i(\omega)\|^2 \le 2L\, g_i(\omega) \quad (16)$$

Using the definition of $g_i(\omega)$ and $\nabla g_i(\omega) = \nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)$, (16) becomes

$$\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[\psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^{\top}(\omega - \omega_*)\right] \quad (17)$$


Proof

Summing inequality (17) over $i = 1, \ldots, n$ and using $P(\omega) = \frac{1}{n}\sum_{i=1}^{n}\psi_i(\omega)$ and $\nabla P(\omega_*) = 0$, we get

$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[P(\omega) - P(\omega_*)\right] \quad (18)$$

Write $\tilde{\mu} = \nabla P(\tilde{\omega})$ and let $v_t = \nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}$; this $v_t$ is the variance-reduced stochastic gradient used by SVRG.


Proof

Taking expectation with respect to $i_t$ (conditioned on $\omega_{t-1}$), we obtain

$$\begin{aligned}
\mathbb{E}\|v_t\|^2 &= \mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] - \tilde{\mu}\right\|^2 \\
&= 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] - \mathbb{E}\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right]\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 \\
&\le 4L\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \quad (19)
\end{aligned}$$

The first inequality uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$.
The second inequality uses $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$.
The third one uses (18).
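Two small facts are used above without comment (filled in here for completeness): $\tilde{\mu}$ equals the conditional expectation of the bracketed term because $\nabla P(\omega_*) = 0$, and the variance of a random vector is bounded by its second moment:

$$\mathbb{E}\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] = \nabla P(\tilde{\omega}) - \nabla P(\omega_*) = \tilde{\mu}, \qquad \mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2 .$$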


Proof

The SVRG update is $\omega_t = \omega_{t-1} - \eta v_t$. Conditioning on $\omega_{t-1}$,

$$\begin{aligned}
\mathbb{E}\|\omega_t - \omega_*\|^2 &= \mathbb{E}\|\omega_{t-1} - \omega_* - \eta v_t\|^2 \\
&= \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^{\top}\mathbb{E}v_t + \eta^2\,\mathbb{E}\|v_t\|^2
\end{aligned}$$

Here $\mathbb{E}v_t = \mathbb{E}\left[\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right] = \nabla P(\omega_{t-1})$. Using (19), we get

$$\mathbb{E}\|\omega_t - \omega_*\|^2 \le \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^{\top}\nabla P(\omega_{t-1}) + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \quad (20)$$

By convexity of $P(\omega)$,

$$-(\omega_{t-1} - \omega_*)^{\top}\nabla P(\omega_{t-1}) \le P(\omega_*) - P(\omega_{t-1}) \quad (21)$$


Proof

$$\begin{aligned}
\mathbb{E}\|\omega_t - \omega_*\|^2 &\le \|\omega_{t-1} - \omega_*\|^2 - 2\eta\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \\
&= \|\omega_{t-1} - \omega_*\|^2 - 2\eta(1 - 2L\eta)\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (22)
\end{aligned}$$

In each fixed stage $s$, $\tilde{\omega} = \tilde{\omega}_{s-1}$ and $\tilde{\omega}_s$ is selected after all updates have completed. Summing the inequality over $t = 1, \ldots, m$ and taking expectation over all the history,

$$\begin{aligned}
\mathbb{E}\|\omega_m - \omega_*\|^2 + 2\eta(1 - 2L\eta)m\,\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right]
&\le \mathbb{E}\|\tilde{\omega} - \omega_*\|^2 + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \\
&\le \frac{2}{\gamma}\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (23)
\end{aligned}$$
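The last inequality in (23) is strong convexity (6) evaluated at $\omega' = \omega_*$, where $\nabla P(\omega_*) = 0$ (a step spelled out here):

$$P(\tilde{\omega}) \ge P(\omega_*) + \frac{\gamma}{2}\|\tilde{\omega} - \omega_*\|^2 \quad\Longrightarrow\quad \|\tilde{\omega} - \omega_*\|^2 \le \frac{2}{\gamma}\left[P(\tilde{\omega}) - P(\omega_*)\right].$$

The factor $m\,\mathbb{E}[P(\tilde{\omega}_s) - P(\omega_*)]$ on the left appears because, under Option II, $\mathbb{E}\,P(\tilde{\omega}_s) = \frac{1}{m}\sum_{t=0}^{m-1}\mathbb{E}\,P(\omega_t)$.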


Proof

From the above inequality, we have

$$2\eta(1 - 2L\eta)m\,\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \frac{2}{\gamma}\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (24)$$

which can be rewritten as

$$\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \alpha\,\mathbb{E}\left[P(\tilde{\omega}_{s-1}) - P(\omega_*)\right] \quad (25)$$

where

$$\alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} \quad (26)$$


Proof

Iterating (25) over $s$ gives the desired bound in the Theorem:

$$\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \alpha^s\,\mathbb{E}\left[P(\tilde{\omega}_0) - P(\omega_*)\right] \quad (27)$$

The bound in Theorem 1 is comparable to those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012]. The convergence rate of SVRG is $O(1/T)$, which improves on the standard SGD rate of $O(1/\sqrt{T})$.


Experiments

Figure: (a) Training loss comparison with SGD with fixed learning rates. (b) Training loss residual $P(\omega) - P(\omega_*)$. (c) Variance of the weight update.

It is hard to find a good $\eta$ for SGD. With a single, relatively large value of $\eta$, SVRG decreases smoothly and faster than SGD.


Experiments

Figure: More convex-case results. Loss residual $P(\omega) - P(\omega_*)$ (top) and test error rates (bottom).

SVRG is competitive with SDCA and better than the best-tuned SGD.


Figure: Neural net results (nonconvex)

For nonconvex problems, it is useful to start with an initial vector $\tilde{\omega}_0$ that is close to a local minimum. The results show that SVRG reduces the variance and converges smoothly and faster than the best-tuned SGD.


Conclusion

For smooth and strongly convex functions, SVRG enjoys the same fast convergence rate as SAG

Unlike SAG, it does not require storing gradients

Unlike SAG, it is more easily applicable to complex problems


Thank you!
