Stochastic Gradient Descent with Variance Reduction
Rie Johnson, Tong Zhang
Presenter: Jiawen Yao
March 17, 2015
Outline
1 Problem
2 Stochastic Average Gradient (SAG)
3 Accelerating SGD using Predictive Variance Reduction (SVRG)
4 Conclusion
Problem
Preliminaries
Recall a few definitions from convex analysis.

Definition 1. A function f(x) is L-Lipschitz continuous if

‖f(x1) − f(x2)‖ ≤ L‖x1 − x2‖ (1)

for all x1, x2 ∈ dom(f).

Definition 2. A convex function f(x) is β-strongly convex if there exists a constant β > 0 such that for any α ∈ [0, 1]:

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) − (1/2)α(1 − α)β‖x1 − x2‖² (2)
When f(x) is differentiable, strong convexity is equivalent to

f(x1) ≥ f(x2) + ⟨∇f(x2), x1 − x2⟩ + (β/2)‖x1 − x2‖² (3)

Typically, the standard Euclidean norm is used to define Lipschitz continuity and strong convexity.
Minimizing finite average of convex functions
Let ψ1, ..., ψn be a sequence of functions from R^d to R. Consider the problem

min_ω P(ω),  P(ω) = (1/n) Σ_{i=1}^n ψi(ω) (4)

Assumptions:

each ψi(ω) is convex and differentiable on R^d

each ψi(ω) is smooth with Lipschitz constant L:

‖∇ψi(ω) − ∇ψi(ω′)‖ ≤ L‖ω − ω′‖ (5)

P(ω) is γ-strongly convex:

P(ω) ≥ P(ω′) + ∇P(ω′)ᵀ(ω − ω′) + (γ/2)‖ω − ω′‖² (6)
Gradient Descent
ω(t) = ω(t−1) − ηt∇P(ω(t−1)) = ω(t−1) − (ηt/n) Σ_{i=1}^n ∇ψi(ω(t−1)) (7)

Stochastic Gradient Descent: draw it uniformly at random from {1, ..., n} and update

ω(t) = ω(t−1) − ηt∇ψit(ω(t−1)) (8)
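As a concrete sketch of updates (7) and (8), consider a least-squares instance with ψi(ω) = ½(x_iᵀω − y_i)²; the instance and all names (X, y, eta) below are my own illustrative choices, not from the slides. The two updates differ only in whether the full average gradient or one sampled gradient is used:

```python
import numpy as np

# Illustrative least-squares setup (not from the slides): psi_i(w) = 0.5*(x_i@w - y_i)^2
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def gd_step(w, eta):
    # eq. (7): step along the full average gradient (costs n gradient evaluations)
    return w - eta * X.T @ (X @ w - y) / n

def sgd_step(w, eta):
    # eq. (8): draw i_t uniformly and step along a single sampled gradient
    i = rng.integers(n)
    return w - eta * X[i] * (X[i] @ w - y[i])
```

Each `sgd_step` is n times cheaper than `gd_step`, at the price of a noisy update direction.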
SGD
A more general version of SGD is the following:

ω(t) = ω(t−1) − ηt gt(ω(t−1), ξt) (9)

where ξt is a random variable that may depend on ω(t−1), and the expectation satisfies E[gt(ω(t−1), ξt) | ω(t−1)] = ∇P(ω(t−1)).
Variance
For general convex optimization, stochastic gradient descent methods obtain an O(1/√T) convergence rate in expectation.

The randomness introduces variance: when gt(ω(t−1), ξt) is large, this variance slows down convergence.
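To make the variance issue concrete, here is a small numerical check on a least-squares instance (the instance and all names are my own illustration): at the minimizer ω∗ the full gradient vanishes, but the individual ∇ψi(ω∗) do not, so fixed-step SGD keeps jittering around ω∗ and ηt must decay to converge.

```python
import numpy as np

# Least-squares instance with label noise, so the per-sample gradients
# at the optimum do not vanish (all names are illustrative).
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # exact minimizer of P
per_sample = X * (X @ w_star - y)[:, None]      # row i = grad psi_i(w_star)

full_grad_norm = np.linalg.norm(per_sample.mean(axis=0))   # grad P(w*), essentially 0
variance = np.mean(np.sum(per_sample ** 2, axis=1))        # E ||grad psi_i(w*)||^2, strictly positive
```

The positive `variance` at ω∗ is exactly the term that variance-reduction methods try to cancel.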
Stochastic Average Gradient
SAG method (Le Roux, Schmidt, Bach 2012):

ωt = ωt−1 − (η/n) Σ_{i=1}^n g_i^(t) (10)

where

g_i^(t) = ∇ψi(ωt−1) if i = it, and g_i^(t) = g_i^(t−1) otherwise (11)

It needs to store all n gradients, which is not practical in some cases.
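A minimal sketch of the SAG recursion (10)-(11) on a least-squares instance (the instance and all variable names are mine). The gradient table `g` makes the O(n·d) storage cost explicit:

```python
import numpy as np

# Illustrative least-squares problem: psi_i(w) = 0.5*(x_i@w - y_i)^2
rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def sag(T=5000, eta=0.02):
    w = np.zeros(d)
    g = np.zeros((n, d))                  # stored gradient g_i for every i: O(n*d) memory
    for _ in range(T):
        i = rng.integers(n)
        g[i] = X[i] * (X[i] @ w - y[i])   # eq. (11): refresh only the sampled gradient
        w = w - eta * g.mean(axis=0)      # eq. (10): step along the table average
    return w
```

Only one gradient is recomputed per step, but every g_i must stay in memory, which is the drawback the slide points out.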
Rie Johnson, Tong Zhang Presenter: Jiawen YaoStochastic Gradient Descent with Variance Reduction March 17, 2015 11 / 29
Accelerating SGD using Predictive Variance Reduction (SVRG)
Outline
1 Problem
2 Stochastic Average Gradient (SAG)
3 Accelerating SGD using Predictive Variance Reduction (SVRG)
4 Conclusion
Rie Johnson, Tong Zhang Presenter: Jiawen YaoStochastic Gradient Descent with Variance Reduction March 17, 2015 12 / 29
Accelerating SGD using Predictive Variance Reduction (SVRG)
SVRG
Motivation

Reduce the variance: stochastic gradient descent converges slowly asymptotically due to the inherent variance.

SAG needs to store all gradients.

Contribution

No need to store intermediate gradients.

Obtains the same convergence rate as SAG.

Under mild assumptions, works even in nonconvex cases.
Stochastic variance reduced gradient (SVRG)
SVRG (Johnson & Zhang, NIPS 2013) update form:

ω(t) = ω(t−1) − ηt(∇ψit(ω(t−1)) − ∇ψit(ω̃) + ∇P(ω̃)) (12)

ω̃ is updated periodically (every m SGD iterations).
Figure: Intuition of variance reduction
Procedure SVRG
Input: update frequency m and learning rate η
Initialize ω̃0
for s = 1, 2, ... do
    ω̃ = ω̃s−1
    µ̃ = ∇P(ω̃) = (1/n) Σ_{i=1}^n ∇ψi(ω̃)
    ω0 = ω̃
    for t = 1, ..., m do
        randomly pick it ∈ {1, ..., n}
        ωt = ωt−1 − η(∇ψit(ωt−1) − ∇ψit(ω̃) + µ̃)
    end for
    option I: set ω̃s = ωm
    option II: set ω̃s = ωt for randomly chosen t ∈ {0, ..., m − 1}
end for
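The procedure above can be sketched directly on a least-squares instance (the instance and all names are mine; option I is used for the snapshot). Unlike SAG, only the snapshot ω̃ and its full gradient µ̃ are stored, not a per-example gradient table:

```python
import numpy as np

# Illustrative least-squares problem: psi_i(w) = 0.5*(x_i@w - y_i)^2
rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad_i(w, i):
    return X[i] * (X[i] @ w - y[i])                  # grad psi_i(w)

def svrg(S=30, m=200, eta=0.02):
    w_tilde = np.zeros(d)                            # snapshot, w-tilde_0
    for _ in range(S):                               # outer loop over stages s
        mu = X.T @ (X @ w_tilde - y) / n             # mu = grad P(w_tilde): one full pass
        w = w_tilde.copy()
        for _ in range(m):                           # m inner SGD iterations
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced direction, eq. (12)
            w = w - eta * v
        w_tilde = w                                  # option I: keep the last iterate
    return w_tilde
```

Each stage costs one full gradient pass plus m cheap stochastic steps, and the memory footprint is O(d) rather than SAG's O(n·d).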
Convergence for SVRG
Theorem
Consider SVRG with option II. Assume that all ψi(ω) are convex and smooth, and that P(ω) is γ-strongly convex. Let ω∗ = argmin_ω P(ω). Assume that m is sufficiently large so that

α = 1/(γη(1 − 2Lη)m) + 2Lη/(1 − 2Lη) < 1

Then we have geometric convergence in expectation for SVRG:

E P(ω̃s) ≤ P(ω∗) + α^s [P(ω̃0) − P(ω∗)]
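As a sanity check on the rate, one can plug illustrative constants into α (the values of L, γ, η and m below are my own choices, not from the paper): with condition number L/γ = 10, η = 0.1/L and m = 50 L/γ, each of the two terms of α equals 0.25, so α = 0.5 < 1 and every outer stage halves the expected suboptimality.

```python
# alpha from the theorem, evaluated for illustrative constants (my choices)
L, gamma = 10.0, 1.0     # smoothness and strong-convexity constants
eta = 0.1 / L            # step size, a constant fraction of 1/L
m = int(50 * L / gamma)  # inner-loop length, proportional to the condition number

alpha = 1.0 / (gamma * eta * (1.0 - 2.0 * L * eta) * m) \
        + 2.0 * L * eta / (1.0 - 2.0 * L * eta)
```

This is the usual regime: η a constant fraction of 1/L and m a multiple of the condition number L/γ keep α bounded away from 1.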
Proof
Given any i, consider

gi(ω) = ψi(ω) − ψi(ω∗) − ∇ψi(ω∗)ᵀ(ω − ω∗) (13)

Since ∇gi(ω∗) = 0, we have ω∗ = argmin_ω gi(ω), hence

0 = gi(ω∗) ≤ min_η [gi(ω − η∇gi(ω))]
          ≤ min_η [gi(ω) − η‖∇gi(ω)‖² + 0.5Lη²‖∇gi(ω)‖²] (14)

Here the second inequality uses a well-known bound for a function with L-Lipschitz continuous gradient:

f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)‖x − y‖²
Proof
Minimizing the right-hand side of (14) over η gives η = 1/L, so

0 = gi(ω∗) ≤ gi(ω) − (1/2L)‖∇gi(ω)‖² (15)

which can be rewritten as

‖∇gi(ω)‖² ≤ 2L gi(ω) (16)

Using the definition of gi(ω) and ∇gi(ω) = ∇ψi(ω) − ∇ψi(ω∗), (16) becomes

‖∇ψi(ω) − ∇ψi(ω∗)‖² ≤ 2L[ψi(ω) − ψi(ω∗) − ∇ψi(ω∗)ᵀ(ω − ω∗)] (17)
Proof
Summing inequality (17) over i = 1, ..., n and using the facts that P(ω) = (1/n) Σ_{i=1}^n ψi(ω) and ∇P(ω∗) = 0, we get

(1/n) Σ_{i=1}^n ‖∇ψi(ω) − ∇ψi(ω∗)‖² ≤ 2L[P(ω) − P(ω∗)] (18)

Let µ̃ = ∇P(ω̃) and vt = ∇ψit(ωt−1) − ∇ψit(ω̃) + µ̃; vt is the variance-reduced gradient estimate used by SVRG.
Proof
Taking expectation with respect to it, conditioned on ωt−1:

E‖vt‖² = E‖∇ψit(ωt−1) − ∇ψit(ω̃) + µ̃‖²
 ≤ 2E‖∇ψit(ωt−1) − ∇ψit(ω∗)‖² + 2E‖[∇ψit(ω̃) − ∇ψit(ω∗)] − µ̃‖²
 = 2E‖∇ψit(ωt−1) − ∇ψit(ω∗)‖² + 2E‖[∇ψit(ω̃) − ∇ψit(ω∗)] − E[∇ψit(ω̃) − ∇ψit(ω∗)]‖²
 ≤ 2E‖∇ψit(ωt−1) − ∇ψit(ω∗)‖² + 2E‖∇ψit(ω̃) − ∇ψit(ω∗)‖²
 ≤ 4L[P(ωt−1) − P(ω∗) + P(ω̃) − P(ω∗)] (19)

The first inequality uses ‖a + b‖² ≤ 2‖a‖² + 2‖b‖². The equality uses µ̃ = E[∇ψit(ω̃) − ∇ψit(ω∗)], which holds since ∇P(ω∗) = 0. The second inequality uses E‖X − EX‖² ≤ E‖X‖², and the third uses (18).
Proof
The update form of SVRG is ωt = ωt−1 − η vt. Conditioned on ωt−1,

E‖ωt − ω∗‖² = E‖ωt−1 − ω∗ − η vt‖²
 = ‖ωt−1 − ω∗‖² − 2η(ωt−1 − ω∗)ᵀ E vt + η² E‖vt‖²

Here E vt = E[∇ψit(ωt−1) − ∇ψit(ω̃) + µ̃] = ∇P(ωt−1). Using (19), we get

E‖ωt − ω∗‖² ≤ ‖ωt−1 − ω∗‖² − 2η(ωt−1 − ω∗)ᵀ∇P(ωt−1) + 4Lη²[P(ωt−1) − P(ω∗) + P(ω̃) − P(ω∗)] (20)

By convexity of P(ω),

−(ωt−1 − ω∗)ᵀ∇P(ωt−1) ≤ P(ω∗) − P(ωt−1) (21)
Proof
Combining (20) and (21),

E‖ωt − ω∗‖² ≤ ‖ωt−1 − ω∗‖² − 2η[P(ωt−1) − P(ω∗)] + 4Lη²[P(ωt−1) − P(ω∗) + P(ω̃) − P(ω∗)]
 = ‖ωt−1 − ω∗‖² − 2η(1 − 2Lη)[P(ωt−1) − P(ω∗)] + 4Lη²[P(ω̃) − P(ω∗)] (22)

In each fixed stage s, ω̃ = ω̃s−1, and ω̃s is selected after all updates have completed. Summing inequality (22) over t = 1, ..., m, taking expectation over all the history, and using option II together with the strong convexity bound ‖ω̃ − ω∗‖² ≤ (2/γ)[P(ω̃) − P(ω∗)], we get

E‖ωm − ω∗‖² + 2η(1 − 2Lη)m E[P(ω̃s) − P(ω∗)]
 ≤ E‖ω̃ − ω∗‖² + 4Lmη² E[P(ω̃) − P(ω∗)]
 ≤ (2/γ) E[P(ω̃) − P(ω∗)] + 4Lmη² E[P(ω̃) − P(ω∗)] (23)
Proof
From the above inequality we have

2η(1 − 2Lη)m E[P(ω̃s) − P(ω∗)] ≤ (2/γ) E[P(ω̃) − P(ω∗)] + 4Lmη² E[P(ω̃) − P(ω∗)] (24)

which can also be rewritten as

E[P(ω̃s) − P(ω∗)] ≤ α E[P(ω̃s−1) − P(ω∗)] (25)

where

α = 1/(γη(1 − 2Lη)m) + 2Lη/(1 − 2Lη) (26)
Proof
Applying (25) recursively gives the desired bound in the Theorem:

E[P(ω̃s) − P(ω∗)] ≤ α^s [P(ω̃0) − P(ω∗)] (27)

The bound in Theorem 1 is comparable to Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012]. SVRG converges geometrically, so only O(log(1/ε)) stages are needed to reach accuracy ε, which improves on the O(1/√T) rate of standard SGD.
Experiments
Figure: (a) Training loss comparison with SGD with fixed learning rates. (b)Training loss residual P(ω)− P(ω∗) (c) Variance of weight update
It is hard to find a single good η for SGD. With one relatively large value of η, SVRG decreases smoothly and converges faster than SGD.
Experiments
Figure: More convex-case results. Loss residual P(ω) − P(ω∗) (top) and test error rates (bottom)
SVRG is competitive with SDCA and better than the best-tuned SGD.
Figure: Neural net results (nonconvex)
For nonconvex problems, it is useful to start with an initial vector ω̃0 thatis close to a local minimum. Results show that SVRG reduces the varianceand smoothly converges faster than the best-tuned SGD.
Conclusion
For smooth and strongly convex functions, SVRG enjoys the same fast convergence rate as SAG.

Unlike SAG, SVRG does not require storing gradients.

Unlike SAG, SVRG is more easily applicable to complex problems.
Thank you!