Projected Gradient Algorithm
Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
[email protected] Homepage: angms.science
First draft: August 2, 2017. Last update: October 23, 2020
Overview
1 Constrained and unconstrained problem
2 Understanding the geometry of projection
3 PGD is a special case of proximal gradient
4 Theorem 1. An inequality of PGD with constant stepsize
5 Theorem 2. PGD converges at order O(1/√k) on Lipschitz functions
6 Summary
Constrained and unconstrained problem
- For an unconstrained minimization problem
  min_{x ∈ R^n} f(x),
  any x in R^n can be a solution.
- For a constrained minimization problem with a given set Q ⊂ R^n
  min_{x ∈ Q} f(x),
  not every x can be a solution: the solution has to be inside the set Q.
- An example of a constrained minimization problem:
  min_{x ∈ R^n} ‖Ax − b‖_2^2 s.t. ‖x‖_2 ≤ 1,
  which can also be expressed as min_{‖x‖_2 ≤ 1} ‖Ax − b‖_2^2.
Solving the unconstrained problem by gradient descent
- Gradient Descent (GD) is a standard (easy and simple) way to solve unconstrained optimization problems.
- Starting from an initial point x_0 ∈ R^n, GD iterates the following update until a stopping condition is met:
  x_{k+1} = x_k − α_k ∇f(x_k),
  where ∇f is the gradient of f, the parameter α_k ≥ 0 is the stepsize, and k is the iteration counter.
- Question: how about constrained problems? Is it possible to tune GD to fit constrained problems? Answer: yes, and the key is projection.

Remark: If f is not differentiable, we can replace the gradient by a subgradient, and we get the so-called subgradient method.
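As a quick illustration (not part of the original slides), here is a minimal GD sketch in Python/NumPy for the least-squares objective f(x) = (1/2)‖Ax − b‖_2^2 from the earlier example; the function name, the iteration count, and the stepsize 1/‖A‖_2^2 (a common safe choice for this particular f) are assumptions made for the sketch.

import numpy as np

def gradient_descent(A, b, x0, n_iter=200):
    """Minimal GD sketch for f(x) = 0.5 * ||Ax - b||_2^2 (unconstrained)."""
    x = x0.copy()
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # constant stepsize; the gradient of this f is ||A||_2^2-Lipschitz
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)              # gradient of 0.5 * ||Ax - b||_2^2
        x = x - alpha * grad                  # GD update: x_{k+1} = x_k - alpha * grad f(x_k)
    return x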
Solving constrained problem by projected gradient descent
- Projected Gradient Descent (PGD) is a standard (easy and simple) way to solve constrained optimization problems.
- Consider a constraint set Q ⊂ R^n. Starting from an initial point x_0 ∈ Q, PGD iterates the following update until a stopping condition is met:
  x_{k+1} = P_Q( x_k − α_k ∇f(x_k) ).
- P_Q(·) is the projection operator, and it is itself an optimization problem:
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2,
  i.e. given a point x_0, P_Q tries to find a point x ∈ Q that is "closest" to x_0.
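To make the update concrete, below is a minimal PGD sketch in the same Python/NumPy style; the gradient grad_f and the projection proj_Q are passed in as functions, and these names, the fixed stepsize, and the iteration count are illustrative assumptions rather than anything prescribed by the slides.

def projected_gradient_descent(grad_f, proj_Q, x0, alpha, n_iter=200):
    """Minimal PGD sketch: a gradient step followed by a projection onto Q."""
    x = x0.copy()
    for _ in range(n_iter):
        y = x - alpha * grad_f(x)   # gradient step:   y_{k+1} = x_k - alpha * grad f(x_k)
        x = proj_Q(y)               # projection step: x_{k+1} = P_Q(y_{k+1})
    return x

For the earlier example min ‖Ax − b‖_2^2 s.t. ‖x‖_2 ≤ 1, one would pass the gradient of the objective as grad_f and the projection onto the unit ball (shown on the next slide) as proj_Q.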
About the projection
- P_Q(·) is a function from R^n to R^n, and it is itself an optimization problem:
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2.
- PGD is an "economical" algorithm only when this projection problem is easy to solve. This is not true for general Q: there are lots of constraint sets that are very difficult to project onto.
- If Q is a convex set, the projection problem has a unique solution.
- If Q is nonconvex, the solution to P_Q(x_0) may not be unique: it can give more than one solution.
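As a concrete case where the projection is cheap (a sketch under the assumption that Q is the Euclidean unit ball {x : ‖x‖_2 ≤ 1}, the constraint set from the earlier least-squares example), the projection has a closed form: points outside the ball are simply rescaled onto it.

import numpy as np

def proj_unit_ball(x0):
    """Projection onto Q = {x : ||x||_2 <= 1}: rescale x0 if it lies outside the ball."""
    norm = np.linalg.norm(x0)
    return x0 if norm <= 1.0 else x0 / norm

This is the kind of proj_Q one would plug into the PGD sketch above.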
Comparing PGD to GD
- GD
  1. Pick an initial point x_0 ∈ R^n
  2. Loop until a stopping condition is met:
     2.1 Descent direction: pick the descent direction as −∇f(x_k)
     2.2 Stepsize: pick a stepsize α_k
     2.3 Update: x_{k+1} = x_k − α_k ∇f(x_k)
- PGD
  1. Pick an initial point x_0 ∈ Q
  2. Loop until a stopping condition is met:
     2.1 Descent direction: pick the descent direction as −∇f(x_k)
     2.2 Stepsize: pick a stepsize α_k
     2.3 Update: y_{k+1} = x_k − α_k ∇f(x_k)
     2.4 Projection: x_{k+1} = argmin_{x ∈ Q} (1/2)‖x − y_{k+1}‖_2^2
- PGD has one more step: the projection.
- The idea of PGD is simple: if the point x_k − α_k ∇f(x_k) after the gradient update leaves the set Q, project it back.
Understanding the geometry of projection ... (1/5)
Consider a convex set Q and a point x_0 ∈ Q.

[Figure: a convex set Q with the point x_0 inside it.]

- As x_0 ∈ Q, the closest point to x_0 in Q is x_0 itself.
- The distance from a point to itself is zero.
- Mathematically: (1/2)‖x − x_0‖_2^2 = 0 gives x = x_0.
Understanding the geometry of projection ... (2/5)
Now consider a convex set Q and a point x_0 ∉ Q, i.e. outside Q.

[Figure: a convex set Q with the point x_0 outside it.]
Understanding the geometry of projection ... (3/5)
- The circles are L2 norm balls centered at x_0 with different radii.
- Points on these circles are equidistant to x_0 (with a different L2 distance on each circle).
- Note that some points on the blue circle are inside Q.

[Figure: the point x_0 outside Q, surrounded by L2 norm balls of increasing radius; the largest (blue) circle reaches inside Q.]
Understanding the geometry of projection ... (4/5)
- The point inside Q which is closest to x_0 is the point where the L2 norm ball "touches" Q.
- In this example, the blue point y is the solution to
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2.

[Figure: the point x_0 outside Q and its projection, the blue point y, on the boundary of Q.]

In fact, it can be proved that such a point is always located on the boundary of Q when x_0 ∉ Q. That is, mathematically, argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2 ∈ bd Q if x_0 ∉ Q.
Understanding the geometry of projection ... (5/5)
Note that the projection is orthogonal: the blue point y always lies on a straight line that is tangent to both the norm ball and Q.

[Figure: the point x_0, its projection y on the boundary of Q, and the common tangent line at y.]
PGD is a special case of proximal gradient
- The indicator function of a set Q, denoted i(x), is defined as follows: if x ∈ Q, then i(x) = 0; if x ∉ Q, then i(x) = ∞.
- With the indicator function, the constrained problem has two equivalent expressions:
  min_{x ∈ Q} f(x) ≡ min_x f(x) + i(x).
- Proximal gradient is a method to solve optimization problems given as the sum of a differentiable and a non-differentiable function:
  min_x f(x) + g(x),
  where g is a non-differentiable function.
- PGD is in fact the special case of proximal gradient in which g(x) is the indicator function of the constraint set. See here for more about proximal gradient.
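To see this reduction explicitly (a short derivation using the standard definition of the proximal operator, which the slides do not restate), write one proximal step at the gradient point y_{k+1} = x_k − α∇f(x_k):

prox_{αg}(y_{k+1}) = argmin_x g(x) + (1/(2α))‖x − y_{k+1}‖_2^2.

Taking g = i, which is 0 on Q and +∞ outside, the argmin is forced into Q:

prox_{αi}(y_{k+1}) = argmin_{x ∈ Q} (1/(2α))‖x − y_{k+1}‖_2^2 = argmin_{x ∈ Q} (1/2)‖x − y_{k+1}‖_2^2 = P_Q(y_{k+1}),

so one proximal gradient step with g = i is exactly one PGD step.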
On PGD convergence rate
- Theorem 1. If f is convex, PGD with constant stepsize α satisfies
  f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2,
  where f* = f(x*) is the optimal cost value, x* is the (global) minimizer, α is the constant stepsize, and K is the total number of iterations performed.
- Interpretation of this theorem: the term (1/(K+1)) ∑_{k=0}^{K} x_k is the "average" of the sequence x_k after K iterations, hence we can denote it as x̄ and f(x̄) as f̄. Then the theorem reads:
  f̄ − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + something positive.
  Hence the convergence rate is like O(1/K).
- For the second term on the right-hand side: as long as ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 does not diverge to infinity, or grows more slowly than K, the term (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 goes to zero as K grows.
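As a hedged numerical illustration of this inequality (not from the original slides), the sketch below runs PGD on f(x) = (1/2)‖x − c‖_2^2 over the unit ball with ‖c‖_2 > 1; for this toy problem the constrained minimizer x* = c/‖c‖_2 is known in closed form, so both sides of Theorem 1 can be evaluated. The data, stepsize, and iteration count are arbitrary choices for the demonstration.

import numpy as np

# Toy problem: min (1/2)||x - c||_2^2  s.t.  ||x||_2 <= 1, with c outside the ball,
# so the minimizer is known in closed form: x* = c / ||c||_2.
rng = np.random.default_rng(0)
c = 3.0 * rng.standard_normal(5)

def f(x):    return 0.5 * np.sum((x - c) ** 2)
def grad(x): return x - c
def proj(y): return y / max(1.0, np.linalg.norm(y))   # projection onto the unit ball

x_star = c / np.linalg.norm(c)
f_star = f(x_star)

alpha, K = 0.1, 200
x = np.zeros_like(c)                      # x_0 = 0 is feasible
xs, grad_sq = [x], []
for _ in range(K):                        # generates x_1, ..., x_K
    grad_sq.append(np.sum(grad(x) ** 2))
    x = proj(x - alpha * grad(x))
    xs.append(x)
grad_sq.append(np.sum(grad(x) ** 2))      # gradient norm at x_K, completing k = 0, ..., K

x_bar = np.mean(xs, axis=0)               # average of x_0, ..., x_K
lhs = f(x_bar) - f_star
rhs = np.sum((xs[0] - x_star) ** 2) / (2 * alpha * (K + 1)) \
      + alpha / (2 * (K + 1)) * sum(grad_sq)
print(f"f(x_bar) - f* = {lhs:.4f} <= bound = {rhs:.4f}")   # the inequality of Theorem 1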
Proof of theorem 1 ... (1/5)
- f is convex, so f(x) − f(z) ≤ ⟨∇f(x), x − z⟩.
- Put x = x_k, z = x* and f(x*) = f*:
  f(x_k) − f* ≤ ⟨∇f(x_k), x_k − x*⟩.
- The PGD update y_{k+1} = x_k − α_k ∇f(x_k) gives ∇f(x_k) = (x_k − y_{k+1}) / α_k and
  f(x_k) − f* ≤ (1/α_k) ⟨x_k − y_{k+1}, x_k − x*⟩.
- As we use a constant stepsize:
  f(x_k) − f* ≤ (1/α) ⟨x_k − y_{k+1}, x_k − x*⟩.
Proof of theorem 1 ... (2/5)
- A trick:
  (a − b)(a − c) = a² − ac − ab + bc
                 = (2a² − 2ac − 2ab + 2bc) / 2
                 = (a² − 2ac + a² − 2ab + 2bc + c² − c² + b² − b²) / 2
                 = ( (a − c)² + (a − b)² − (b − c)² ) / 2
- Hence, applying the trick with a = x_k, b = y_{k+1}, c = x* (the same identity holds for inner products of vectors):
  f(x_k) − f* ≤ (1/α) ⟨x_k − y_{k+1}, x_k − x*⟩
             = (1/(2α)) ( ‖x_k − x*‖_2^2 + ‖x_k − y_{k+1}‖_2^2 − ‖y_{k+1} − x*‖_2^2 )
         (*) = (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖y_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2,
  where (*) is due to the PGD update x_k − y_{k+1} = α ∇f(x_k).
Proof of theorem 1 ... (3/5)
Note that ‖y_{k+1} − x*‖_2^2 ≥ ‖x_{k+1} − x*‖_2^2: since x* ∈ Q, projecting y_{k+1} onto Q cannot move the point farther away from x*.

[Figure: a point y outside Q, its projection P_Q(y) = x_{k+1} on the boundary of Q, and the minimizer x* inside Q.]

Hence −‖y_{k+1} − x*‖_2^2 ≤ −‖x_{k+1} − x*‖_2^2 and

f(x_k) − f* ≤ (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖y_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2
           ≤ (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖x_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2.

It forms a telescoping series!
Proof of theorem 1 ... (4/5)

k = 0:  f(x_0) − f* ≤ ( ‖x_0 − x*‖_2^2 − ‖x_1 − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_0)‖_2^2
k = 1:  f(x_1) − f* ≤ ( ‖x_1 − x*‖_2^2 − ‖x_2 − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_1)‖_2^2
...
k = K:  f(x_K) − f* ≤ ( ‖x_K − x*‖_2^2 − ‖x_{K+1} − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_K)‖_2^2

Summing them all, the intermediate distance terms cancel:

∑_{k=0}^{K} ( f(x_k) − f* ) ≤ ( ‖x_0 − x*‖_2^2 − ‖x_{K+1} − x*‖_2^2 ) / (2α) + (α/2) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
Proof of theorem 1 ... (5/5)

As 0 ≤ (1/(2α)) ‖x_{K+1} − x*‖_2^2,

∑_{k=0}^{K} ( f(x_k) − f* ) ≤ ‖x_0 − x*‖_2^2 / (2α) + (α/2) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.

Expand the summation on the left and divide the whole inequality by K + 1:

(1/(K+1)) ∑_{k=0}^{K} f(x_k) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.

Consider the left-hand side: as f is convex, by Jensen's inequality

f( (1/(K+1)) ∑_{k=0}^{K} x_k ) ≤ (1/(K+1)) ∑_{k=0}^{K} f(x_k).

Therefore

f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
PGD converges at order O(1/√k) on Lipschitz functions

Theorem 2. If f is Lipschitz, then for the point x̄_K = (1/(K+1)) ∑_{k=0}^{K} x_k and constant stepsize α = ‖x_0 − x*‖ / (L √(K+1)) we have

f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).

Proof. Put x̄_K and α into Theorem 1 directly, and note that ‖∇f‖ ≤ L.

Remarks
- If f is Lipschitz, then ∇f is bounded: ‖∇f‖ ≤ L, where L is the Lipschitz constant.
- In the stepsize α, note that it uses K (the total number of steps), not k (the current iteration number).
- The stepsize requires knowing x*, so this theorem is practically useless: knowing x* already solves the problem.
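For completeness, here is a sketch of the one-line substitution behind the proof (the algebra is not spelled out on the slide). Plugging α = ‖x_0 − x*‖ / (L√(K+1)) into Theorem 1 and bounding every ‖∇f(x_k)‖ by L:

‖x_0 − x*‖_2^2 / (2α(K+1)) = L ‖x_0 − x*‖ √(K+1) / (2(K+1)) = L ‖x_0 − x*‖ / (2√(K+1)),
(α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 ≤ (α/(2(K+1))) (K+1) L² = α L² / 2 = L ‖x_0 − x*‖ / (2√(K+1)).

Adding the two terms gives exactly f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).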
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (the gradient is β-Lipschitz)
2. Convergence rate O(1/k).

In the convergence analysis of PGD:
1. f is convex and L-Lipschitz (the gradient is bounded above)
2. Convergence rate O(1/√k).
3. The convergence rate is stated for the averaged point x̄_K.

If f is convex and β-smooth, the convergence of PGD will be the same as that of GD.
- The theoretical convergence rate of PGD on convex and β-smooth f will also be O(1/k).
- However, in practice it depends on the complexity of the projection: some Q are difficult to project onto.
Last page - summary
- PGD = GD + projection
- PGD with constant stepsize α satisfies:
  f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
- If f is Lipschitz (bounded gradient), then for the point x̄_K = (1/(K+1)) ∑_{k=0}^{K} x_k and constant stepsize α = ‖x_0 − x*‖ / (L√(K+1)),
  f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).
End of document