
Stochastic Gradient Descent with Variance Reduction

Rie Johnson, Tong Zhang
Presenter: Jiawen Yao

March 17, 2015


Outline

1 Problem

2 Stochastic Average Gradient (SAG)

3 Accelerating SGD using Predictive Variance Reduction (SVRG)

4 Conclusion

Problem

Preliminaries

Recall a few definitions from convex analysis.

Definition 1. A function $f(x)$ is an $L$-Lipschitz continuous function if

$$\|f(x_1) - f(x_2)\| \le L\|x_1 - x_2\| \quad (1)$$

for all $x_1, x_2 \in \mathrm{dom}(f)$.

Definition 2. A convex function $f(x)$ is $\beta$-strongly convex if there exists a constant $\beta > 0$ such that for any $\alpha \in [0, 1]$,

$$f(\alpha x_1 + (1-\alpha)x_2) \le \alpha f(x_1) + (1-\alpha)f(x_2) - \frac{1}{2}\alpha(1-\alpha)\beta\|x_1 - x_2\|^2 \quad (2)$$
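As a quick sanity check of Definition 2 (a worked example added here, not on the original slide), the quadratic $f(x) = \frac{\beta}{2}\|x\|^2$ is $\beta$-strongly convex; expanding the squared norms shows (2) holds with equality:

$$\begin{aligned}
f(\alpha x_1 + (1-\alpha)x_2) &= \frac{\beta}{2}\left[\alpha^2\|x_1\|^2 + (1-\alpha)^2\|x_2\|^2 + 2\alpha(1-\alpha)\langle x_1, x_2\rangle\right] \\
&= \alpha f(x_1) + (1-\alpha)f(x_2) - \frac{1}{2}\alpha(1-\alpha)\beta\|x_1 - x_2\|^2 .
\end{aligned}$$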


Preliminaries

When $f(x)$ is differentiable, strong convexity is equivalent to

$$f(x_1) \ge f(x_2) + \langle \nabla f(x_2), x_1 - x_2 \rangle + \frac{\beta}{2}\|x_1 - x_2\|^2 \quad (3)$$

Typically, the standard Euclidean norm is used to define Lipschitz continuous and strongly convex functions.


Minimizing finite average of convex functions

Let $\psi_1, \ldots, \psi_n$ be a sequence of functions from $\mathbb{R}^d$ to $\mathbb{R}$. The goal is to solve

$$\min_{\omega} P(\omega), \qquad P(\omega) = \frac{1}{n}\sum_{i=1}^{n}\psi_i(\omega) \quad (4)$$

Assumptions:

each $\psi_i(\omega)$ is convex and differentiable

each $\psi_i(\omega)$ is smooth with Lipschitz constant $L$:

$$\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega')\| \le L\|\omega - \omega'\| \quad (5)$$

$P(\omega)$ is strongly convex:

$$P(\omega) \ge P(\omega') + \nabla P(\omega')^{\top}(\omega - \omega') + \frac{\gamma}{2}\|\omega - \omega'\|^2 \quad (6)$$
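As a concrete instance of (4) (an illustration added here, not taken from the slides), consider ridge regression with data $(x_i, y_i)$ and regularization $\lambda > 0$:

$$\psi_i(\omega) = \frac{1}{2}\left(x_i^{\top}\omega - y_i\right)^2 + \frac{\lambda}{2}\|\omega\|^2, \qquad \nabla\psi_i(\omega) = x_i\left(x_i^{\top}\omega - y_i\right) + \lambda\omega .$$

Each $\psi_i$ is convex, its gradient is Lipschitz with constant $\|x_i\|^2 + \lambda$ (so $L = \max_i\|x_i\|^2 + \lambda$ works in (5)), and the $\frac{\lambda}{2}\|\omega\|^2$ term makes $P(\omega)$ strongly convex with $\gamma \ge \lambda$ in (6).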


Gradient Descent

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\nabla P(\omega^{(t-1)}) = \omega^{(t-1)} - \frac{\eta_t}{n}\sum_{i=1}^{n}\nabla\psi_i(\omega^{(t-1)}) \quad (7)$$

Stochastic Gradient Descent: draw $i_t$ randomly from $\{1, \ldots, n\}$ and update

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\nabla\psi_{i_t}(\omega^{(t-1)}) \quad (8)$$
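To make the two update rules concrete, here is a minimal Python sketch (the least-squares objective, variable names, and step size are illustrative assumptions, not from the slides) contrasting the full-gradient step (7) with the stochastic step (8):

import numpy as np

def full_gradient_step(w, grads, eta):
    """One gradient-descent step, eq. (7): average all n per-example gradients."""
    return w - eta * np.mean([g(w) for g in grads], axis=0)

def sgd_step(w, grads, eta, rng):
    """One SGD step, eq. (8): use a single randomly drawn per-example gradient."""
    i = rng.integers(len(grads))
    return w - eta * grads[i](w)

# Illustrative finite-sum problem: psi_i(w) = 0.5 * (x_i^T w - y_i)^2 (least squares).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
grads = [lambda w, x=x, t=t: x * (x @ w - t) for x, t in zip(X, y)]

w = np.zeros(5)
for _ in range(1000):
    w = sgd_step(w, grads, eta=0.01, rng=rng)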


SGD

A more general version of SGD is the following

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\, g_t(\omega^{(t-1)}, \xi_t) \quad (9)$$

where $\xi_t$ is a random variable that may depend on $\omega^{(t-1)}$, and whose expectation satisfies $\mathbb{E}\left[g_t(\omega^{(t-1)}, \xi_t) \mid \omega^{(t-1)}\right] = \nabla P(\omega^{(t-1)})$.
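In particular (a short check added here), plain SGD (8) is the special case $\xi_t = i_t$ with $i_t$ drawn uniformly from $\{1, \ldots, n\}$, which is unbiased:

$$\mathbb{E}\left[\nabla\psi_{i_t}(\omega^{(t-1)}) \mid \omega^{(t-1)}\right] = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\omega^{(t-1)}) = \nabla P(\omega^{(t-1)}) .$$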


Variance

For general convex optimization, stochastic gradient descent methods can obtain an $O(1/\sqrt{T})$ convergence rate in expectation.

Randomness introduces variance: if $g_t(\omega^{(t-1)}, \xi_t)$ has large variance, it slows down convergence.

Stochastic Average Gradient (SAG)

SAG method (Le Roux, Schmidt, Bach 2012)

$$\omega_t = \omega_{t-1} - \frac{\eta}{n}\sum_{i=1}^{n} g_t^{(i)} \quad (10)$$

where

$$g_t^{(i)} = \begin{cases} \nabla\psi_i(\omega_{t-1}), & \text{if } i = i_t \\ g_{t-1}^{(i)}, & \text{otherwise} \end{cases} \quad (11)$$

It needs to store all $n$ gradients, which is not practical in some cases.
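A minimal Python sketch of the SAG update (10)-(11), reusing the illustrative least-squares `grads` list from the earlier sketch; the gradient table makes the $O(nd)$ storage cost explicit:

import numpy as np

def sag(grads, d, eta=0.01, iters=1000, seed=0):
    """Stochastic Average Gradient: keep a table of the last gradient seen
    for every example and step along the average of the table."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    w = np.zeros(d)
    g_table = np.zeros((n, d))        # stored per-example gradients (the memory cost of SAG)
    g_sum = np.zeros(d)               # running sum of the table, maintained incrementally
    for _ in range(iters):
        i = rng.integers(n)
        g_new = grads[i](w)           # fresh gradient only for the sampled example
        g_sum += g_new - g_table[i]   # update the running sum in O(d)
        g_table[i] = g_new
        w = w - (eta / n) * g_sum     # update (10): average of all stored gradients
    return w

# Illustrative use with the least-squares grads from the earlier sketch:
# w_hat = sag(grads, d=5)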

Accelerating SGD using Predictive Variance Reduction (SVRG)

SVRG

Motivation

Reduce the variance

Stochastic gradient descent has slow asymptotic convergence due to the inherent variance.

SAG needs to store all gradients

Contribution

No need to store the intermediate gradients

It obtains the same fast convergence rate as SAG

Under mild assumptions, it even works in nonconvex cases


Stochastic variance reduced gradient (SVRG)

SVRG (Johnson & Zhang, NIPS 2013) update form:

$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\left(\nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde{\omega}) + \nabla P(\tilde{\omega})\right) \quad (12)$$

Update $\tilde{\omega}$ periodically (every $m$ SGD iterations).

Figure: Intuition of variance reduction
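The intuition behind the figure can be stated directly (a brief elaboration added here): the corrected direction is still an unbiased estimate of the full gradient, yet it vanishes as both the iterate and the snapshot approach $\omega_*$, unlike the plain SGD direction $\nabla\psi_{i_t}(\omega^{(t-1)})$, which in general does not vanish at $\omega_*$:

$$\mathbb{E}_{i_t}\left[\nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde{\omega}) + \nabla P(\tilde{\omega})\right] = \nabla P(\omega^{(t-1)}), \qquad \nabla\psi_{i_t}(\omega_*) - \nabla\psi_{i_t}(\omega_*) + \nabla P(\omega_*) = 0 .$$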


Procedure SVRG

Input: update frequency $m$ and learning rate $\eta$
Initialization: $\tilde{\omega}_0$
for $s = 1, 2, \ldots$ do
    $\tilde{\omega} = \tilde{\omega}_{s-1}$
    $\tilde{\mu} = \nabla P(\tilde{\omega}) = \frac{1}{n}\sum_{i=1}^{n}\nabla\psi_i(\tilde{\omega})$
    $\omega_0 = \tilde{\omega}$
    for $t = 1, 2, \ldots, m$ do
        randomly pick $i_t \in \{1, \ldots, n\}$ and update
        $\omega_t = \omega_{t-1} - \eta\left(\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right)$
    end for
    Option I: set $\tilde{\omega}_s = \omega_m$
    Option II: set $\tilde{\omega}_s = \omega_t$ for randomly chosen $t \in \{0, \ldots, m-1\}$
end for
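A compact Python sketch of this procedure with Option I (the least-squares `grads` list, the fixed step size, and other names are illustrative assumptions carried over from the earlier sketches):

import numpy as np

def svrg(grads, d, eta=0.01, m=100, epochs=20, seed=0):
    """SVRG, Option I: recompute the full gradient at a snapshot w_tilde every
    m inner steps and correct each stochastic gradient with it."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    w_tilde = np.zeros(d)
    for _ in range(epochs):
        mu = np.mean([g(w_tilde) for g in grads], axis=0)   # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            v = grads[i](w) - grads[i](w_tilde) + mu        # variance-reduced direction (12)
            w = w - eta * v
        w_tilde = w                                         # Option I: take the last iterate
    return w_tilde

# Illustrative use with the least-squares grads defined earlier:
# w_hat = svrg(grads, d=5)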


Convergence for SVRG

Theorem

Consider SVRG with Option II. Assume that all $\psi_i(\omega)$ are convex and smooth, and that $P(\omega)$ is strongly convex. Let $\omega_* = \arg\min_{\omega} P(\omega)$. Assume that $m$ is sufficiently large so that

$$\alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} < 1.$$

Then we have geometric convergence in expectation for SVRG:

$$\mathbb{E}\,P(\tilde{\omega}_s) \le P(\omega_*) + \alpha^s\left[P(\tilde{\omega}_0) - P(\omega_*)\right].$$
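To make the condition on $m$ concrete (a worked example added here, not from the slides): taking $\eta = 0.1/L$ and $m = 50L/\gamma$ gives $2L\eta = 0.2$, $1 - 2L\eta = 0.8$, and

$$\alpha = \frac{1}{\gamma \cdot \frac{0.1}{L} \cdot 0.8 \cdot \frac{50L}{\gamma}} + \frac{0.2}{0.8} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} < 1,$$

so each outer stage roughly halves the expected suboptimality.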


Proof

Given any i, consider

$$g_i(\omega) = \psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^{\top}(\omega - \omega_*) \quad (13)$$

Since $\nabla g_i(\omega_*) = 0$ and $g_i$ is convex, $\omega_*$ minimizes $g_i$, i.e. $g_i(\omega_*) = \min_{\omega} g_i(\omega) = 0$. Hence

$$0 = g_i(\omega_*) \le \min_{\eta}\, g_i\!\left(\omega - \eta\nabla g_i(\omega)\right) \le \min_{\eta}\left[g_i(\omega) - \eta\|\nabla g_i(\omega)\|^2 + \frac{1}{2}L\eta^2\|\nabla g_i(\omega)\|^2\right] \quad (14)$$

Here we use a well-known inequality for a function with an $L$-Lipschitz continuous gradient:

$$f(x) - f(y) - \langle\nabla f(y), x - y\rangle \le \frac{L}{2}\|x - y\|^2$$
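Explicitly (a step filled in here), applying this inequality with $f = g_i$, $y = \omega$, and $x = \omega - \eta\nabla g_i(\omega)$ gives the second inequality in (14):

$$g_i\left(\omega - \eta\nabla g_i(\omega)\right) \le g_i(\omega) - \eta\|\nabla g_i(\omega)\|^2 + \frac{L\eta^2}{2}\|\nabla g_i(\omega)\|^2 .$$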


Proof

Minimizing the right-hand side of (14) over $\eta$ gives $\eta = 1/L$, so

$$0 = g_i(\omega_*) \le g_i(\omega) - \frac{1}{2L}\|\nabla g_i(\omega)\|^2 \quad (15)$$

which can be rewritten as

$$\|\nabla g_i(\omega)\|^2 \le 2L\, g_i(\omega) \quad (16)$$

Using the definition of $g_i(\omega)$ and $\nabla g_i(\omega) = \nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)$, (16) becomes

$$\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[\psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^{\top}(\omega - \omega_*)\right] \quad (17)$$


Proof

Summing inequality (17) over $i = 1, \ldots, n$ and using $P(\omega) = \frac{1}{n}\sum_{i=1}^{n}\psi_i(\omega)$ and $\nabla P(\omega_*) = 0$, we get

$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[P(\omega) - P(\omega_*)\right] \quad (18)$$

Write $\tilde{\mu} = \nabla P(\tilde{\omega})$ and let $v_t = \nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}$; this $v_t$ is the variance-reduced stochastic gradient used by SVRG.


Proof

Taking expectation with respect to $i_t$ (conditioned on $\omega_{t-1}$), we obtain

$$\begin{aligned}
\mathbb{E}\|v_t\|^2 &= \mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] - \tilde{\mu}\right\|^2 \\
&= 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] - \mathbb{E}\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right]\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 + 2\,\mathbb{E}\left\|\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right\|^2 \\
&\le 4L\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \quad (19)
\end{aligned}$$

The first inequality uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$.
The second inequality uses $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$.
The third one uses (18).
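Two small facts are used above without comment (filled in here for completeness): $\tilde{\mu}$ equals the conditional expectation of the bracketed term because $\nabla P(\omega_*) = 0$, and the variance of a random vector is bounded by its second moment:

$$\mathbb{E}\left[\nabla\psi_{i_t}(\tilde{\omega}) - \nabla\psi_{i_t}(\omega_*)\right] = \nabla P(\tilde{\omega}) - \nabla P(\omega_*) = \tilde{\mu}, \qquad \mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2 .$$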


Proof

The SVRG update is $\omega_t = \omega_{t-1} - \eta v_t$. Conditioning on $\omega_{t-1}$,

$$\begin{aligned}
\mathbb{E}\|\omega_t - \omega_*\|^2 &= \mathbb{E}\|\omega_{t-1} - \omega_* - \eta v_t\|^2 \\
&= \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^{\top}\mathbb{E}v_t + \eta^2\,\mathbb{E}\|v_t\|^2
\end{aligned}$$

Here $\mathbb{E}v_t = \mathbb{E}\left[\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde{\omega}) + \tilde{\mu}\right] = \nabla P(\omega_{t-1})$. Using (19), we get

$$\mathbb{E}\|\omega_t - \omega_*\|^2 \le \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^{\top}\nabla P(\omega_{t-1}) + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \quad (20)$$

By convexity of $P(\omega)$,

$$-(\omega_{t-1} - \omega_*)^{\top}\nabla P(\omega_{t-1}) \le P(\omega_*) - P(\omega_{t-1}) \quad (21)$$


Proof

$$\begin{aligned}
\mathbb{E}\|\omega_t - \omega_*\|^2 &\le \|\omega_{t-1} - \omega_*\|^2 - 2\eta\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde{\omega}) - P(\omega_*)\right] \\
&= \|\omega_{t-1} - \omega_*\|^2 - 2\eta(1 - 2L\eta)\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (22)
\end{aligned}$$

In each fixed stage $s$, $\tilde{\omega} = \tilde{\omega}_{s-1}$ and $\tilde{\omega}_s$ is selected after all updates have completed. Summing the inequality over $t = 1, \ldots, m$ and taking expectation over all the history,

$$\begin{aligned}
\mathbb{E}\|\omega_m - \omega_*\|^2 + 2\eta(1 - 2L\eta)m\,\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right]
&\le \mathbb{E}\|\tilde{\omega} - \omega_*\|^2 + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \\
&\le \frac{2}{\gamma}\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (23)
\end{aligned}$$
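The last inequality in (23) is strong convexity (6) evaluated at $\omega' = \omega_*$, where $\nabla P(\omega_*) = 0$ (a step spelled out here):

$$P(\tilde{\omega}) \ge P(\omega_*) + \frac{\gamma}{2}\|\tilde{\omega} - \omega_*\|^2 \quad\Longrightarrow\quad \|\tilde{\omega} - \omega_*\|^2 \le \frac{2}{\gamma}\left[P(\tilde{\omega}) - P(\omega_*)\right].$$

The factor $m\,\mathbb{E}[P(\tilde{\omega}_s) - P(\omega_*)]$ on the left appears because, under Option II, $\mathbb{E}\,P(\tilde{\omega}_s) = \frac{1}{m}\sum_{t=0}^{m-1}\mathbb{E}\,P(\omega_t)$.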


Proof

From the above inequality, we have

$$2\eta(1 - 2L\eta)m\,\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \frac{2}{\gamma}\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] + 4Lm\eta^2\,\mathbb{E}\left[P(\tilde{\omega}) - P(\omega_*)\right] \quad (24)$$

which can be rewritten as

$$\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \alpha\,\mathbb{E}\left[P(\tilde{\omega}_{s-1}) - P(\omega_*)\right] \quad (25)$$

where

$$\alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} \quad (26)$$


Proof

Iterating (25) over $s$ gives the desired bound in the Theorem:

$$\mathbb{E}\left[P(\tilde{\omega}_s) - P(\omega_*)\right] \le \alpha^s\,\mathbb{E}\left[P(\tilde{\omega}_0) - P(\omega_*)\right] \quad (27)$$

The bound in Theorem 1 is comparable to those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012]. The convergence rate of SVRG is $O(1/T)$, which improves on the standard SGD rate of $O(1/\sqrt{T})$.


Experiments

Figure: (a) Training loss comparison with SGD with fixed learning rates. (b) Training loss residual $P(\omega) - P(\omega_*)$. (c) Variance of the weight update.

It is hard to find a good $\eta$ for SGD. With a single, relatively large value of $\eta$, SVRG decreases smoothly and faster than SGD.


Experiments

Figure: More convex-case results. Loss residual $P(\omega) - P(\omega_*)$ (top) and test error rates (bottom).

SVRG is competitive with SDCA and better than the best-tuned SGD.


Figure: Neural net results (nonconvex)

For nonconvex problems, it is useful to start with an initial vector $\tilde{\omega}_0$ that is close to a local minimum. The results show that SVRG reduces the variance and converges smoothly and faster than the best-tuned SGD.


Conclusion

For smooth and strongly convex functions, SVRG enjoys the same fast convergence rate as SAG

Unlike SAG, it does not require storing gradients

Unlike SAG, it is more easily applicable to complex problems


Thank you!
