Projected Gradient Algorithm
Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
[email protected] Homepage: angms.science
First draft: August 2, 2017. Last update: October 23, 2020
Overview
1 Constrained and unconstrained problem
2 Understanding the geometry of projection
3 PGD is a special case of proximal gradient
4 Theorem 1. An inequality of PGD with constant stepsize
5 Theorem 2. PGD converges at order O(1/√k) on Lipschitz functions
6 Summary
Constrained and unconstrained problem
- For an unconstrained minimization problem
  min_{x ∈ R^n} f(x),
  any x in R^n can be a solution.
- For a constrained minimization problem with a given set Q ⊂ R^n
  min_{x ∈ Q} f(x),
  not every x can be a solution: the solution has to be inside the set Q.
- An example of a constrained minimization problem:
  min_{x ∈ R^n} ‖Ax − b‖_2^2 s.t. ‖x‖_2 ≤ 1,
  which can also be expressed as min_{‖x‖_2 ≤ 1} ‖Ax − b‖_2^2.
Solving the unconstrained problem by gradient descent
- Gradient Descent (GD) is a standard (easy and simple) way to solve unconstrained optimization problems.
- Starting from an initial point x_0 ∈ R^n, GD iterates the following update until a stopping condition is met:
  x_{k+1} = x_k − α_k ∇f(x_k),
  where ∇f is the gradient of f, the parameter α_k ≥ 0 is the stepsize, and k is the iteration counter.
- Question: how about constrained problems? Is it possible to tune GD to fit constrained problems? Answer: yes, and the key is projection.

Remark: If f is not differentiable, we can replace the gradient by a subgradient, and we get the so-called subgradient method.
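As a quick illustration (not part of the original slides), here is a minimal GD sketch in Python/NumPy for the least-squares objective f(x) = (1/2)‖Ax − b‖_2^2 from the earlier example; the function name, the iteration count, and the stepsize 1/‖A‖_2^2 (a common safe choice for this particular f) are assumptions made for the sketch.

import numpy as np

def gradient_descent(A, b, x0, n_iter=200):
    """Minimal GD sketch for f(x) = 0.5 * ||Ax - b||_2^2 (unconstrained)."""
    x = x0.copy()
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # constant stepsize; the gradient of this f is ||A||_2^2-Lipschitz
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)              # gradient of 0.5 * ||Ax - b||_2^2
        x = x - alpha * grad                  # GD update: x_{k+1} = x_k - alpha * grad f(x_k)
    return x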
Solving constrained problem by projected gradient descent
- Projected Gradient Descent (PGD) is a standard (easy and simple) way to solve constrained optimization problems.
- Consider a constraint set Q ⊂ R^n. Starting from an initial point x_0 ∈ Q, PGD iterates the following update until a stopping condition is met:
  x_{k+1} = P_Q( x_k − α_k ∇f(x_k) ).
- P_Q(·) is the projection operator, and it is itself an optimization problem:
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2,
  i.e. given a point x_0, P_Q tries to find a point x ∈ Q that is "closest" to x_0.
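To make the update concrete, below is a minimal PGD sketch in the same Python/NumPy style; the gradient grad_f and the projection proj_Q are passed in as functions, and these names, the fixed stepsize, and the iteration count are illustrative assumptions rather than anything prescribed by the slides.

def projected_gradient_descent(grad_f, proj_Q, x0, alpha, n_iter=200):
    """Minimal PGD sketch: a gradient step followed by a projection onto Q."""
    x = x0.copy()
    for _ in range(n_iter):
        y = x - alpha * grad_f(x)   # gradient step:   y_{k+1} = x_k - alpha * grad f(x_k)
        x = proj_Q(y)               # projection step: x_{k+1} = P_Q(y_{k+1})
    return x

For the earlier example min ‖Ax − b‖_2^2 s.t. ‖x‖_2 ≤ 1, one would pass the gradient of the objective as grad_f and the projection onto the unit ball (shown on the next slide) as proj_Q.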
About the projection
- P_Q(·) is a function from R^n to R^n, and it is itself an optimization problem:
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2.
- PGD is an "economical" algorithm only when this projection problem is easy to solve. This is not true for general Q: there are lots of constraint sets that are very difficult to project onto.
- If Q is a convex set, the projection problem has a unique solution.
- If Q is nonconvex, the solution to P_Q(x_0) may not be unique: it can give more than one solution.
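As a concrete case where the projection is cheap (a sketch under the assumption that Q is the Euclidean unit ball {x : ‖x‖_2 ≤ 1}, the constraint set from the earlier least-squares example), the projection has a closed form: points outside the ball are simply rescaled onto it.

import numpy as np

def proj_unit_ball(x0):
    """Projection onto Q = {x : ||x||_2 <= 1}: rescale x0 if it lies outside the ball."""
    norm = np.linalg.norm(x0)
    return x0 if norm <= 1.0 else x0 / norm

This is the kind of proj_Q one would plug into the PGD sketch above.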
Comparing PGD to GD
- GD
  1. Pick an initial point x_0 ∈ R^n
  2. Loop until a stopping condition is met:
     2.1 Descent direction: pick the descent direction as −∇f(x_k)
     2.2 Stepsize: pick a stepsize α_k
     2.3 Update: x_{k+1} = x_k − α_k ∇f(x_k)
- PGD
  1. Pick an initial point x_0 ∈ Q
  2. Loop until a stopping condition is met:
     2.1 Descent direction: pick the descent direction as −∇f(x_k)
     2.2 Stepsize: pick a stepsize α_k
     2.3 Update: y_{k+1} = x_k − α_k ∇f(x_k)
     2.4 Projection: x_{k+1} = argmin_{x ∈ Q} (1/2)‖x − y_{k+1}‖_2^2
- PGD has one more step: the projection.
- The idea of PGD is simple: if the point x_k − α_k ∇f(x_k) after the gradient update leaves the set Q, project it back.
Understanding the geometry of projection ... (1/5)
Consider a convex set Q and a point x_0 ∈ Q.

[Figure: a convex set Q with the point x_0 inside it.]

- As x_0 ∈ Q, the closest point to x_0 in Q is x_0 itself.
- The distance from a point to itself is zero.
- Mathematically: (1/2)‖x − x_0‖_2^2 = 0 gives x = x_0.
Understanding the geometry of projection ... (2/5)
Now consider a convex set Q and a point x_0 ∉ Q, i.e. outside Q.

[Figure: a convex set Q with the point x_0 outside it.]
Understanding the geometry of projection ... (3/5)
- The circles are L2 norm balls centered at x_0 with different radii.
- Points on these circles are equidistant to x_0 (with a different L2 distance on each circle).
- Note that some points on the blue circle are inside Q.

[Figure: the point x_0 outside Q, surrounded by L2 norm balls of increasing radius; the largest (blue) circle reaches inside Q.]
Understanding the geometry of projection ... (4/5)
- The point inside Q which is closest to x_0 is the point where the L2 norm ball "touches" Q.
- In this example, the blue point y is the solution to
  P_Q(x_0) = argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2.

[Figure: the point x_0 outside Q and its projection, the blue point y, on the boundary of Q.]

In fact, it can be proved that such a point is always located on the boundary of Q when x_0 ∉ Q. That is, mathematically, argmin_{x ∈ Q} (1/2)‖x − x_0‖_2^2 ∈ bd Q if x_0 ∉ Q.
Understanding the geometry of projection ... (5/5)
Note that the projection is orthogonal: the blue point y always lies on a straight line that is tangent to both the norm ball and Q.

[Figure: the point x_0, its projection y on the boundary of Q, and the common tangent line at y.]
PGD is a special case of proximal gradient
- The indicator function of a set Q, denoted i(x), is defined as follows: if x ∈ Q, then i(x) = 0; if x ∉ Q, then i(x) = ∞.
- With the indicator function, the constrained problem has two equivalent expressions:
  min_{x ∈ Q} f(x) ≡ min_x f(x) + i(x).
- Proximal gradient is a method to solve optimization problems given as the sum of a differentiable and a non-differentiable function:
  min_x f(x) + g(x),
  where g is a non-differentiable function.
- PGD is in fact the special case of proximal gradient in which g(x) is the indicator function of the constraint set. See here for more about proximal gradient.
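To see this reduction explicitly (a short derivation using the standard definition of the proximal operator, which the slides do not restate), write one proximal step at the gradient point y_{k+1} = x_k − α∇f(x_k):

prox_{αg}(y_{k+1}) = argmin_x g(x) + (1/(2α))‖x − y_{k+1}‖_2^2.

Taking g = i, which is 0 on Q and +∞ outside, the argmin is forced into Q:

prox_{αi}(y_{k+1}) = argmin_{x ∈ Q} (1/(2α))‖x − y_{k+1}‖_2^2 = argmin_{x ∈ Q} (1/2)‖x − y_{k+1}‖_2^2 = P_Q(y_{k+1}),

so one proximal gradient step with g = i is exactly one PGD step.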
On PGD convergence rate
- Theorem 1. If f is convex, PGD with constant stepsize α satisfies
  f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2,
  where f* = f(x*) is the optimal cost value, x* is the (global) minimizer, α is the constant stepsize, and K is the total number of iterations performed.
- Interpretation of this theorem: the term (1/(K+1)) ∑_{k=0}^{K} x_k is the "average" of the sequence x_k after K iterations, hence we can denote it as x̄ and f(x̄) as f̄. Then the theorem reads:
  f̄ − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + something positive.
  Hence the convergence rate is like O(1/K).
- For the second term on the right-hand side: as long as ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 does not diverge to infinity, or grows more slowly than K, the term (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 goes to zero as K grows.
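As a hedged numerical illustration of this inequality (not from the original slides), the sketch below runs PGD on f(x) = (1/2)‖x − c‖_2^2 over the unit ball with ‖c‖_2 > 1; for this toy problem the constrained minimizer x* = c/‖c‖_2 is known in closed form, so both sides of Theorem 1 can be evaluated. The data, stepsize, and iteration count are arbitrary choices for the demonstration.

import numpy as np

# Toy problem: min (1/2)||x - c||_2^2  s.t.  ||x||_2 <= 1, with c outside the ball,
# so the minimizer is known in closed form: x* = c / ||c||_2.
rng = np.random.default_rng(0)
c = 3.0 * rng.standard_normal(5)

def f(x):    return 0.5 * np.sum((x - c) ** 2)
def grad(x): return x - c
def proj(y): return y / max(1.0, np.linalg.norm(y))   # projection onto the unit ball

x_star = c / np.linalg.norm(c)
f_star = f(x_star)

alpha, K = 0.1, 200
x = np.zeros_like(c)                      # x_0 = 0 is feasible
xs, grad_sq = [x], []
for _ in range(K):                        # generates x_1, ..., x_K
    grad_sq.append(np.sum(grad(x) ** 2))
    x = proj(x - alpha * grad(x))
    xs.append(x)
grad_sq.append(np.sum(grad(x) ** 2))      # gradient norm at x_K, completing k = 0, ..., K

x_bar = np.mean(xs, axis=0)               # average of x_0, ..., x_K
lhs = f(x_bar) - f_star
rhs = np.sum((xs[0] - x_star) ** 2) / (2 * alpha * (K + 1)) \
      + alpha / (2 * (K + 1)) * sum(grad_sq)
print(f"f(x_bar) - f* = {lhs:.4f} <= bound = {rhs:.4f}")   # the inequality of Theorem 1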
Proof of theorem 1 ... (1/5)
- f is convex, so f(x) − f(z) ≤ ⟨∇f(x), x − z⟩.
- Put x = x_k, z = x* and f(x*) = f*:
  f(x_k) − f* ≤ ⟨∇f(x_k), x_k − x*⟩.
- The PGD update y_{k+1} = x_k − α_k ∇f(x_k) gives ∇f(x_k) = (x_k − y_{k+1}) / α_k and
  f(x_k) − f* ≤ (1/α_k) ⟨x_k − y_{k+1}, x_k − x*⟩.
- As we use a constant stepsize:
  f(x_k) − f* ≤ (1/α) ⟨x_k − y_{k+1}, x_k − x*⟩.
Proof of theorem 1 ... (2/5)
- A trick:
  (a − b)(a − c) = a² − ac − ab + bc
                 = (2a² − 2ac − 2ab + 2bc) / 2
                 = (a² − 2ac + a² − 2ab + 2bc + c² − c² + b² − b²) / 2
                 = ( (a − c)² + (a − b)² − (b − c)² ) / 2
- Hence, applying the trick with a = x_k, b = y_{k+1}, c = x* (the same identity holds for inner products of vectors):
  f(x_k) − f* ≤ (1/α) ⟨x_k − y_{k+1}, x_k − x*⟩
             = (1/(2α)) ( ‖x_k − x*‖_2^2 + ‖x_k − y_{k+1}‖_2^2 − ‖y_{k+1} − x*‖_2^2 )
         (*) = (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖y_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2,
  where (*) is due to the PGD update x_k − y_{k+1} = α ∇f(x_k).
Proof of theorem 1 ... (3/5)
Note that ‖y_{k+1} − x*‖_2^2 ≥ ‖x_{k+1} − x*‖_2^2: since x* ∈ Q, projecting y_{k+1} onto Q cannot move the point farther away from x*.

[Figure: a point y outside Q, its projection P_Q(y) = x_{k+1} on the boundary of Q, and the minimizer x* inside Q.]

Hence −‖y_{k+1} − x*‖_2^2 ≤ −‖x_{k+1} − x*‖_2^2 and

f(x_k) − f* ≤ (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖y_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2
           ≤ (1/(2α)) ( ‖x_k − x*‖_2^2 − ‖x_{k+1} − x*‖_2^2 ) + (α/2) ‖∇f(x_k)‖_2^2.

It forms a telescoping series!
Proof of theorem 1 ... (4/5)

k = 0:  f(x_0) − f* ≤ ( ‖x_0 − x*‖_2^2 − ‖x_1 − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_0)‖_2^2
k = 1:  f(x_1) − f* ≤ ( ‖x_1 − x*‖_2^2 − ‖x_2 − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_1)‖_2^2
...
k = K:  f(x_K) − f* ≤ ( ‖x_K − x*‖_2^2 − ‖x_{K+1} − x*‖_2^2 ) / (2α) + (α/2) ‖∇f(x_K)‖_2^2

Summing them all, the intermediate distance terms cancel:

∑_{k=0}^{K} ( f(x_k) − f* ) ≤ ( ‖x_0 − x*‖_2^2 − ‖x_{K+1} − x*‖_2^2 ) / (2α) + (α/2) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
Proof of theorem 1 ... (5/5)

As 0 ≤ (1/(2α)) ‖x_{K+1} − x*‖_2^2,

∑_{k=0}^{K} ( f(x_k) − f* ) ≤ ‖x_0 − x*‖_2^2 / (2α) + (α/2) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.

Expand the summation on the left and divide the whole inequality by K + 1:

(1/(K+1)) ∑_{k=0}^{K} f(x_k) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.

Consider the left-hand side: as f is convex, by Jensen's inequality

f( (1/(K+1)) ∑_{k=0}^{K} x_k ) ≤ (1/(K+1)) ∑_{k=0}^{K} f(x_k).

Therefore

f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
PGD converges at order O(1/√k) on Lipschitz functions

Theorem 2. If f is Lipschitz, then for the point x̄_K = (1/(K+1)) ∑_{k=0}^{K} x_k and constant stepsize α = ‖x_0 − x*‖ / (L √(K+1)) we have

f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).

Proof. Put x̄_K and α into Theorem 1 directly, and note that ‖∇f‖ ≤ L.

Remarks
- If f is Lipschitz, then ∇f is bounded: ‖∇f‖ ≤ L, where L is the Lipschitz constant.
- In the stepsize α, note that it uses K (the total number of steps), not k (the current iteration number).
- The stepsize requires knowing x*, so this theorem is practically useless: knowing x* already solves the problem.
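For completeness, here is a sketch of the one-line substitution behind the proof (the algebra is not spelled out on the slide). Plugging α = ‖x_0 − x*‖ / (L√(K+1)) into Theorem 1 and bounding every ‖∇f(x_k)‖ by L:

‖x_0 − x*‖_2^2 / (2α(K+1)) = L ‖x_0 − x*‖ √(K+1) / (2(K+1)) = L ‖x_0 − x*‖ / (2√(K+1)),
(α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2 ≤ (α/(2(K+1))) (K+1) L² = α L² / 2 = L ‖x_0 − x*‖ / (2√(K+1)).

Adding the two terms gives exactly f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).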
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (the gradient is β-Lipschitz)
2. Convergence rate O(1/k).

In the convergence analysis of PGD:
1. f is convex and L-Lipschitz (the gradient is bounded above)
2. Convergence rate O(1/√k).
3. The convergence rate is stated for the averaged point x̄_K.

If f is convex and β-smooth, the convergence of PGD will be the same as that of GD.
- The theoretical convergence rate of PGD on convex and β-smooth f will also be O(1/k).
- However, in practice it depends on the complexity of the projection: some Q are difficult to project onto.
Last page - summary
- PGD = GD + projection
- PGD with constant stepsize α satisfies:
  f( (1/(K+1)) ∑_{k=0}^{K} x_k ) − f* ≤ ‖x_0 − x*‖_2^2 / (2α(K+1)) + (α/(2(K+1))) ∑_{k=0}^{K} ‖∇f(x_k)‖_2^2.
- If f is Lipschitz (bounded gradient), then for the point x̄_K = (1/(K+1)) ∑_{k=0}^{K} x_k and constant stepsize α = ‖x_0 − x*‖ / (L√(K+1)),
  f(x̄_K) − f* ≤ L ‖x_0 − x*‖ / √(K+1).
End of document