Transcript of: On Dropping Convexity for Faster Optimization (Semidefinite Optimization)
On Dropping Convexity for Faster Optimization
Sujay Sanghavi, UT Austin
Srinadh Bhojanapalli, UT Austin → TTI Chicago
Anastasios Kyrillidis, UT Austin
Dohyung Park, UT Austin
Motivation
Sample problem: matrix completion
[Figure: a users × items ratings matrix, approximated by a low-rank factorization X = UV′. Annotated sizes: data size Õ(nr); output size Õ(n²) for the full matrix (convex optimization), but only Õ(nr) in factored form]
A Comparison
[Plot: ratio of success vs. fraction of observations (0 to 0.45), comparing AltMin and the nuclear norm approach]
Nuclear norm approach: a leading theoretical approach.
Empirically, AltMin has similar sample complexity and better computational complexity.
(Praneeth Netrapalli, “Provable Matrix Completion using Alternating Minimization”)
… and empirically often statistically worse …
Similar stories in phase retrieval, matrix regression, …
Step 1: Semidefinite Optimization
min_X f(X)   s.t.  X ⪰ 0
convex, nice ...

Natural method: projected gradient descent

X⁺ ← P₊( X − η ∇f(X) )

“First-order oracle access to f”
P₊: projection onto the PSD cone (computationally intensive); η: step size
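As a concrete sketch of this update, here is a minimal NumPy implementation of projected gradient descent onto the PSD cone. The quadratic objective f(X) = ½‖X − B‖_F², the matrix B, and the step size are illustrative assumptions, not the talk's example:

```python
import numpy as np

def project_psd(X):
    """P+: project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues to zero -- one full eigendecomposition per
    iteration, which is the computationally intensive step."""
    w, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.maximum(w, 0)) @ V.T

def projected_gradient_descent(grad_f, X0, eta, iters=100):
    """X+ <- P+(X - eta * grad f(X)): only first-order oracle access."""
    X = X0
    for _ in range(iters):
        X = project_psd(X - eta * grad_f(X))
    return X

# Illustrative objective: f(X) = 0.5 * ||X - B||_F^2, so grad f(X) = X - B.
# The constrained minimizer is the PSD projection of B.
B = np.diag([2.0, -1.0, 3.0])
X_hat = projected_gradient_descent(lambda X: X - B, np.zeros((3, 3)), eta=0.5)
```

Note how every iteration pays for an eigendecomposition just to stay feasible; removing that cost is the motivation for the factored approach below.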
First order oracle access
Access to the function is only as follows: given a query point X, the oracle returns f(X) and ∇f(X).
Oracle access is a standard abstraction in the study of methods in convex optimization.
Typical result: if f satisfies <properties>, then the convergence rate of <method that uses the first-order oracle> is <...>.
Classic Result 1: Smoothness
Suppose ‖∇f(X) − ∇f(Y)‖_F ≤ M ‖X − Y‖_F for all X, Y (M-smoothness).
Then, for (projected) gradient descent with step size η = 1/M, the objective error decays at a 1/k rate.
Classic Result 2: Strong Convexity
Suppose f is strongly convex, i.e. its Hessian satisfies m·I ⪯ ∇²f(X) ⪯ M·I for all X.
Then for gradient descent with a suitable step size, the error in every step reduces by a constant factor.
So: the “best” choice of step size gives reduction by a factor (1 − 1/κ),
where κ = M/m is the condition number of f (“linear convergence”).
Effect of Condition Number
[Contour plots of two quadratics over (x1, x2) ∈ [−10, 10]²: nearly circular contours in the well-conditioned case, elongated contours in the badly conditioned case]

Error decreases by (1 − 1/κ) in every iteration (with the best step size).
Low κ: “well conditioned”. High κ: “badly conditioned”.
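A tiny numerical illustration of this effect (the 2-D quadratic and the textbook step size η = 2/(m + M) are my own illustrative choices, not from the talk):

```python
import numpy as np

def gd_final_error(kappa, iters=50):
    """Run gradient descent on f(x) = 0.5 * x' diag(1, kappa) x with the
    best fixed step size eta = 2/(1 + kappa). Each iteration contracts the
    error by (kappa - 1)/(kappa + 1), roughly 1 - 1/kappa for large kappa."""
    H = np.diag([1.0, kappa])
    eta = 2.0 / (1.0 + kappa)
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - eta * (H @ x)
    return np.linalg.norm(x)

well = gd_final_error(kappa=2.0)     # well conditioned: error shrinks fast
badly = gd_final_error(kappa=100.0)  # badly conditioned: barely moves
```

With κ = 2 the error is at machine precision after 50 steps; with κ = 100 it has shrunk by less than a factor of 3.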
Dropping Convexity
X ⪰ 0  ⟺  ∃ U s.t. X = UU′   (U an n × n matrix)

min_U f(UU′)

This problem is “equivalent” to the original problem because every PSD X factors as UU′.
It is non-convex, but “only” due to the UU′ parameterization.
[Burer & Monteiro]: with linear f, and constraints, eventual convergence to the correct answer; no indication of how fast.
Factored Gradient Descent
= Gradient descent on g(U) := f(UU′).
By the chain rule, ∇g(U) = 2 ∇f(UU′) U, so
(Factored) gradient descent: U⁺ ← U − η ∇f(UU′) U
Again, first-order oracle access to f.
No projection step ...
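A minimal NumPy sketch of this factored update. The rank-1 quadratic objective f(X) = ½‖X − X*‖_F², the initial point, and the step size are illustrative assumptions:

```python
import numpy as np

def factored_gradient_descent(grad_f, U0, eta, iters=500):
    """U+ <- U - eta * grad f(UU') U. The chain-rule factor of 2 in
    grad_U f(UU') = 2 grad f(UU') U is absorbed into the step size here.
    No projection needed: UU' is PSD by construction."""
    U = U0
    for _ in range(iters):
        U = U - eta * grad_f(U @ U.T) @ U
    return U

# Illustrative objective: f(X) = 0.5 * ||X - X_star||_F^2 with rank-1 X_star.
u = np.array([[2.0], [1.0]])
X_star = u @ u.T
U_hat = factored_gradient_descent(lambda X: X - X_star,
                                  U0=np.array([[1.0], [0.5]]), eta=0.05)
```

Each iteration is a few matrix multiplies; the eigendecomposition of the projected method is gone entirely.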
Non-convexity: Issue 1
[Contour and surface plots of g(U) = f(UU′) over U ∈ [−2, 2]²: the level sets are rotationally symmetric in U]
For any rotation matrix R, i.e. a matrix such that RR′ = I, we have that f((UR)(UR)′) = f(UU′).
Idea: new definition of distance: dist(U, V) = min_R ‖U − VR‖_F.
“Only the contour level matters”
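A rotation-invariant distance of this kind, dist(U, V) = min over orthogonal R of ‖U − VR‖_F, has a closed-form solution via the orthogonal Procrustes problem; a minimal NumPy sketch (the implementation is my own, using the standard Procrustes formula):

```python
import numpy as np

def dist(U, V):
    """min_R ||U - V R||_F over orthogonal R: the optimal R is recovered
    from the SVD of V' U (orthogonal Procrustes problem)."""
    A, _, Bt = np.linalg.svd(V.T @ U)
    R = A @ Bt
    return np.linalg.norm(U - V @ R)

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # a random orthogonal matrix

# U and UQ parameterize the same X = UU', and dist(U @ Q, U) is ~0:
# only the contour level matters.
```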
Non-convexity: Issue 2
g(U) = f(UU′) has spurious stationary points, even for a strongly convex original f:
saddle points, local minima, local maxima ...
e.g. U = 0 is always a stationary point: ∇g(0) = 2 ∇f(0) · 0 = 0.
More generally, one can have ∇f(UU′) U = 0 but ∇f(UU′) ≠ 0.
Does it have bad local minima when U is n × n? We don't know ...
Non-convexity: Issue 2
[Contour plots of a non-convex g(U) over U ∈ [−2, 2]², with a zoomed view near U ≈ (−1, −1) around a spurious stationary point]
Idea 1: look for local convergence, i.e. error contraction once dist(U, U*) is small enough. Note: still not “locally convex” in U space.
Idea 2: find a way to initialize (using first-order oracle)
Step size
Idea: let us find a bound on the Hessian of g with respect to U.
Special case (only for intuition): a separable function f.
... after some algebra ... the bound depends on X and the gradient of f.
And then set the step size η accordingly.
Step size
η is computed at the initial point.
[Plot: effect of this step-size choice in an example, compared with the step sizes of De Sa et al. 2014, Zheng and Lafferty 2015, and Tu et al. 2015]
Summary so far …
Given:  min_X f(X)  s.t.  X ⪰ 0
Convert to:  min_U f(UU′)
Do factored gradient descent:  U⁺ ← U − η ∇f(UU′) U
Idea: use the curvature-based step size η from the previous slides
Pushing further …
Artificially restrict the size of U to be n x r
U⁺ ← U − η ∇f(UU′) U
Reason 1: Computational. Smaller r = fewer variables, faster in every iteration.
Reason 2: Statistical. Prevents over-fitting (in cases where f is a data-dependent loss function).
Issue 0: What does it converge to ?
In the following: let X* be the optimum of the original problem, and consider the matrix of its top r eigen-components, X*_r.
We will show convergence of UU′ to X*_r.
Restricted Strong Convexity
(Regular) strong convexity: f(Y) ≥ f(X) + ⟨∇f(X), Y − X⟩ + (m/2)‖Y − X‖_F²  for all X, Y.
Restricted strong convexity (RSC): the above holds only for low-rank X, Y [Negahban et al.].
A weaker assumption on f, common in high-dimensional machine learning.
E.g. matrix regression.
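A minimal sketch of a matrix-regression loss of this kind, f(X) = ½ Σᵢ (⟨Aᵢ, X⟩ − bᵢ)². This specific form is a standard instance I am assuming for illustration; with m ≪ n² measurements the Hessian has a large null space, so only restricted strong convexity can hold:

```python
import numpy as np

def matrix_regression(X, A_list, b):
    """f(X) = 0.5 * sum_i (<A_i, X> - b_i)^2 and its gradient
    sum_i (<A_i, X> - b_i) A_i. With few measurements this f is not
    strongly convex, but RSC over low-rank directions can still hold."""
    residuals = np.array([np.sum(A * X) for A in A_list]) - b
    grad = sum(r * A for r, A in zip(residuals, A_list))
    return 0.5 * np.sum(residuals ** 2), grad

# Illustrative data: m = 3 random measurements of a rank-2 ground truth.
rng = np.random.default_rng(1)
n, m = 4, 3
A_list = [rng.standard_normal((n, n)) for _ in range(m)]
W = rng.standard_normal((n, 2))
X_true = W @ W.T
b = np.array([np.sum(A * X_true) for A in A_list])

loss, grad = matrix_regression(X_true, A_list, b)  # residuals vanish at X_true
```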
Main Result
Theorem: With the step size choice as above, and (m, M) RSC, the distance of the next iterate to the optimum contracts by a constant factor.
Provided: the current iterate is close enough, and r is appropriately chosen.
Linear convergence once close enough.
In practice: increase r in stages.
Initialization for Strongly Convex f
We propose:
1. Find the negative gradient at 0: −∇f(0).
2. Keep at most the r most positive eigen-components (i.e. eigenvalues and their corresponding eigenvectors); remove all negative eigen-components.
Requires one SVD.
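These two steps can be sketched directly (the toy objective f(X) = ½‖X − X*‖_F² is an illustrative assumption; eigendecomposition of the symmetric matrix −∇f(0) plays the role of the single SVD):

```python
import numpy as np

def initialize(grad_f, n, r):
    """U0 from the top-r positive eigen-components of -grad f(0):
    eigendecompose, keep at most the r most positive eigenvalues (and
    their vectors), drop all negative ones, set U0 = V_r sqrt(Lambda_r)."""
    w, V = np.linalg.eigh(-grad_f(np.zeros((n, n))))
    idx = np.argsort(w)[::-1][:r]      # r most positive eigenvalues
    w_top = np.maximum(w[idx], 0.0)    # remove negative eigen-components
    return V[:, idx] * np.sqrt(w_top)

# Illustrative objective: f(X) = 0.5 * ||X - X_star||_F^2, so -grad f(0) = X_star.
X_star = np.diag([3.0, 1.0, 0.0])
U0 = initialize(lambda X: X - X_star, n=3, r=1)
# U0 U0' is the best rank-1 approximation of X_star: diag(3, 0, 0)
```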
Initialization
[Plot: initialization quality for increasing problem difficulty]
Theorem: [initialization guarantee: the initial point lands close enough to the optimum]
Specializations of this have already been used in matrix completion, phase retrieval, etc.
A strange phenomenon
Different convergence rates for a function and a shifted version of it:
[Plot: relative error ‖X̂ − X*‖_F / ‖X*‖_F vs. number of iterations (0 to 1000), for σ1/σ3 = 100, 10, and 5]
Shift the function, get a differentconvergence behavior (!)
Smooth Convex Functions
Theorem: Local 1/k convergence rate: the objective gap decays as O(1/k).
Summary so far …
min_X f(X)  s.t.  X ⪰ 0   →   min_U f(UU′)
Factored gradient descent: U⁺ ← U − η ∇f(UU′) U, with the curvature-based step size η.
Restricted strongly convex f: 1. local linear convergence to the top-r approximation of the optimum; 2. initialization.
Smooth f: local 1/k convergence.
General (unconstrained) FGD
FGD: U⁺ ← U − η ∇f(UU′) U
Now, bigger uncertainty sets:
General (unconstrained) FGD
Immediate corollary: 1/k convergence for smooth f
But this trick cannot be used for strongly convex f ...
Strongly Convex
Smooth, strongly convex, global min at 0.
(Borrowing from [Tu et al., 2016])
Theorem: Local linear convergence to a neighborhood of the optimum.
Open Problems
1. Constraints
2. Acceleration
Summary
This work: factored gradient descent under the first-order oracle model
- new step size rule
- local convergence rates for smooth, and for restricted strongly convex, functions
- new initialization scheme

Implication: Correctness + convergence rates for phase recovery, matrix regression, matrix sensing ... and almost, for matrix completion ...
Under similar statistical settings as those already used in the analysis of convex optimization and Alt-Min.

Claim: convex optimization is a bad idea for statistical inference problems involving low-rank matrix estimation ...
All of these already use (special cases of) our initialization …