
Solving Linear Systems of Equations via Random Kaczmarz/Stochastic Gradient Descent

Stefan Steinerberger

August 2020

Outline

1. The Kaczmarz method

2. Random Kaczmarz (or: Stochastic Gradient Descent)

3. What happens to the singular vectors?

4. Getting stuck between a rock and a hard place

5. Changing the likelihoods

6. The energy cascade

7. What’s next?

Throughout this talk, we will try to solve Ax = b where A ∈ R^{n×n}.

Let us assume that A is invertible and use a_i ∈ R^n to denote the i-th row, so we can also write

  (a_1; a_2; …; a_n) x = b    or equivalently    ∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i.

Most of the results are more general and apply to overdetermined systems that have a solution. I don't quite know what happens when there are no solutions: do things converge properly to a least-squares solution? (Question: how many arguments survive?)


The Kaczmarz method

Stefan Kaczmarz (1895 - 1939/1940), Polish mathematician

- PhD in 1924 for functional equations
- 1930s: visits Hardy and Paley in Cambridge
- 1937: "Approximate Solutions of Linear Equations" (3 pages)
- His colleagues described him as "tall and skinny", "calm and quiet", and a "modest man with rather moderate scientific ambitions". (A bit strange; taken verbatim from the MacTutor Math Biographies.)
- The circumstances of his death in WW2 (either 1939 or 1940) are unclear.


The Kaczmarz method

The method is remarkably simple: we want

∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i.

Geometrically, we want to find the intersection of hyperplanes.

[Figure: hyperplanes ⟨a_i, x⟩ = b_i intersecting in the solution x.]


The Kaczmarz method

[Figure: successive projections producing iterates x_k, x_{k+1}, x_{k+2} that approach the solution x.]

Project iteratively on the hyperplanes given by

⟨a_i, x⟩ = b_i.

The Pythagorean theorem implies that the distance to the solution always decreases (unless you are already on that hyperplane).

The Kaczmarz method

This is very simple in terms of equations. If we project onto the hyperplane given by the i-th equation, we have

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

- This is cheap: it's an inner product! We do not even have to load the full matrix into memory.
- This is thus useful for large matrices.

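In code, one projection step is a one-liner. Here is a minimal NumPy sketch (the function name kaczmarz_step and the variable names are mine, not from the talk):

```python
import numpy as np

def kaczmarz_step(x, a_i, b_i):
    """Project x onto the hyperplane <a_i, y> = b_i."""
    return x + (b_i - a_i @ x) / (a_i @ a_i) * a_i
```

Each step only touches a single row a_i, which is exactly the point about not having to load the full matrix.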

Standard Kaczmarz. We cycle through the indices i and set

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

(The convergence of this method is geometrically obvious) - but the convergence rate is not.

Random Kaczmarz. We pick a random equation i and set

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

- somehow behaves a little better
- used since the 1980s in tomography
- stochastic gradient descent for ‖Ax - b‖^2 → min

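As a quick illustration of the two sweeps, here is a sketch that runs both on a random test system (the 100×100 Gaussian matrix and all names are my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def kaczmarz_step(x, a_i, b_i):
    # project x onto the hyperplane <a_i, y> = b_i
    return x + (b_i - a_i @ x) / (a_i @ a_i) * a_i

x_cyclic = np.zeros(n)   # Standard Kaczmarz: cycle through the equations
x_random = np.zeros(n)   # Random Kaczmarz: pick an equation at random
for k in range(20000):
    x_cyclic = kaczmarz_step(x_cyclic, A[k % n], b[k % n])
    j = rng.integers(n)
    x_random = kaczmarz_step(x_random, A[j], b[j])

print(np.linalg.norm(x_cyclic - x_true), np.linalg.norm(x_random - x_true))
```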

Theorem (Strohmer & Vershynin, 2007)

If we pick the i-th equation with likelihood proportional to ‖a_i‖^2, then

  E ‖x_k - x‖_2^2 ≤ (1 - 1/(‖A‖_F^2 ‖A^{-1}‖_2^2))^k ‖x_0 - x‖_2^2.

- ‖A‖_F is the Frobenius norm, i.e. ‖A‖_F^2 = Σ_{i,j=1}^n a_{ij}^2.
- ‖A^{-1}‖_2 is the inverse of the smallest singular value, i.e.

  ‖A^{-1}‖_2 = sup_{x ≠ 0} ‖A^{-1}x‖ / ‖x‖ = sup_{x ≠ 0} ‖x‖ / ‖Ax‖.
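In code, the only change relative to the sketch above is the sampling rule; a minimal sketch with the row probabilities of the theorem (the test system is again my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# pick row i with likelihood ||a_i||^2 / ||A||_F^2
probs = np.sum(A**2, axis=1) / np.sum(A**2)

x = np.zeros(n)
for _ in range(20000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a

print(np.linalg.norm(x - x_true))
```

For Gaussian test matrices all rows have comparable norm, so this behaves much like uniform sampling; the weighting matters when the row norms differ a lot.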

Sketch of the Proof

Strohmer & Vershynin's argument is very short and elegant (one of the reasons it has inspired a lot of subsequent work).


With Z = a_j/‖a_j‖ denoting the (random) normalized row used in step k (row j being chosen with likelihood ‖a_j‖^2/‖A‖_F^2), orthogonality of the projection gives

  ‖x_{k+1} - x‖_2^2 ≤ (1 - |⟨(x_k - x)/‖x_k - x‖, Z⟩|^2) ‖x_k - x‖_2^2.

Taking the expectation over the choice of row,

  E |⟨(x_k - x)/‖x_k - x‖, Z⟩|^2 = Σ_{j=1}^n (‖a_j‖^2/‖A‖_F^2) ⟨(x_k - x)/‖x_k - x‖, a_j/‖a_j‖⟩^2

    = (1/‖A‖_F^2) Σ_{j=1}^n ⟨(x_k - x)/‖x_k - x‖, a_j⟩^2

    = (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2 ≥ 1/(‖A‖_F^2 ‖A^{-1}‖_2^2),

since ‖Au‖ ≥ ‖u‖/‖A^{-1}‖_2 for every u.


3. What happens to the singular vectors?

Here's what I really wanted to know: what does x_k - x do? Looking at the picture, it should be sort of jumping around.

[Figure: the projection picture again, with iterates x_k, x_{k+1}, x_{k+2} approaching x.]

But in numerical experiments, I didn’t see that.


Numerically, the (random) sequence of vectors

  (x_k - x) / ‖x_k - x‖

tends to be mainly a linear combination of singular vectors with small singular values. Of course it can jump around, but it doesn't seem to do so very much.

Theorem (Small Singular Values Dominate, S. 2020)

Let v_ℓ be a (right) singular vector of A associated to the singular value σ_ℓ. Then

  E ⟨x_k - x, v_ℓ⟩ = (1 - σ_ℓ^2/‖A‖_F^2)^k ⟨x_0 - x, v_ℓ⟩.

- The slowest rate of decay is given by the smallest singular value σ_n. Since σ_n = 1/‖A^{-1}‖_2, that rate is 1 - 1/(‖A‖_F^2 ‖A^{-1}‖_2^2), which is the (optimal) rate of Strohmer & Vershynin.
- Open problem: this is only a statement about the expectation. What about the variance?


This actually suggests that the method can be used to find the smallest singular vector: solve the problem Ax = 0.

Figure: A sample evolution of ‖Ax_k‖/‖x_k‖ over 20000 iterations.
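A rough sketch of that experiment (my own illustration, not code from the talk): run the iteration on Ax = 0 from a random start and watch the direction of the iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
probs = np.sum(A**2, axis=1) / np.sum(A**2)

x = rng.standard_normal(n)            # the true solution of Ax = 0 is x = 0
for _ in range(100000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x - (a @ x) / (a @ a) * a     # project onto the hyperplane <a_i, y> = 0

v = x / np.linalg.norm(x)             # direction of the current iterate
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
print(np.linalg.norm(A @ v), sigma_min)   # should drift toward sigma_min as k grows
```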

4. Stuck between a rock and a hard place

You get trapped in the narrow regions and it's hard to escape. This seems strange because, after all, it is a random process and you might end up on any hyperplane.



Theorem (Slowing down in Bad Areas, S. 2020)

If x_k ≠ x and P(x_{k+1} = x) = 0, then

  E ⟨(x_k - x)/‖x_k - x‖, (x_{k+1} - x)/‖x_{k+1} - x‖⟩^2 = 1 - (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2.

- The left-hand side measures how much you change your angle, i.e. how much you change the direction from which you are approaching the solution.
- The right-hand side measures how your current direction relates to the singular vectors: it is close to 1 when (x_k - x)/‖x_k - x‖ is dominated by singular vectors with small singular values.

Once you approach along small singular vectors, you slow down and are unlikely to change direction. That's bad! How do we fix it?

E ⟨(x_k - x)/‖x_k - x‖, (x_{k+1} - x)/‖x_{k+1} - x‖⟩^2 = 1 - (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2.

Proof. Littlewood's Lemma: identities are always trivial. We may assume x = 0 (hence b = 0) and ‖x_k‖ = 1. Picking row i with likelihood ‖a_i‖^2/‖A‖_F^2, so that x_{k+1} = x_k - (⟨a_i, x_k⟩/‖a_i‖^2) a_i, we compute

  E ⟨x_k, x_{k+1}/‖x_{k+1}‖⟩^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ⟨x_k, x_{k+1}⟩^2 / ‖x_{k+1}‖^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ‖x_{k+1}‖^4 / ‖x_{k+1}‖^2        (since ⟨x_k, x_{k+1}⟩ = ‖x_{k+1}‖^2: the step x_k - x_{k+1} is orthogonal to x_{k+1})
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ‖x_k - (⟨a_i, x_k⟩/‖a_i‖^2) a_i‖^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) (1 - ⟨a_i, x_k⟩^2/‖a_i‖^2)
    = (1/‖A‖_F^2) Σ_{i=1}^n (‖a_i‖^2 - ⟨a_i, x_k⟩^2)
    = 1 - (1/‖A‖_F^2) Σ_{i=1}^n ⟨a_i, x_k⟩^2 = 1 - ‖Ax_k‖^2/‖A‖_F^2.
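Since this is an identity rather than an inequality, it can be checked to machine precision by summing over all rows instead of sampling. A small sketch (the random test data and the ‖a_i‖^2-sampling assumption are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
fro2 = np.sum(A**2)
probs = np.sum(A**2, axis=1) / fro2      # row i chosen with likelihood ||a_i||^2 / ||A||_F^2

x_k = rng.standard_normal(n)             # an arbitrary current iterate
u = (x_k - x_true) / np.linalg.norm(x_k - x_true)

# left-hand side: exact expectation over the choice of row i
lhs = 0.0
for i in range(n):
    a = A[i]
    x_next = x_k + (b[i] - a @ x_k) / (a @ a) * a
    w = (x_next - x_true) / np.linalg.norm(x_next - x_true)
    lhs += probs[i] * (u @ w) ** 2

rhs = 1 - np.sum((A @ u) ** 2) / fro2
print(lhs, rhs)                          # should agree up to rounding
```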

5. Changing the likelihoods

New idea: maybe we shouldn't pick the equations at random with fixed likelihoods. We want

  ∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i,

so maybe we should pick equations where the residual |⟨a_i, x_k⟩ - b_i| is large? This is known as the maximum residual method. It has been known since (at least) the 1990s that this is faster (Feichtinger, Cenker, Mayer, Steier and Strohmer, 1992; Griebel and Oswald, 2012), ...

Wait a second... those inner products are expensive. If you can precompute an n×n matrix of inner products between the rows, then it's cheap: an iteration step gets basically twice as expensive.


Proposed fix: choose the i-th equation with likelihood proportional to the p-th power of its current residual,

  P(we choose equation i) = |⟨a_i, x_k⟩ - b_i|^p / ‖Ax_k - b‖_{ℓ^p}^p.

- for p = 0, every equation is picked with equal likelihood
- for large p, equations with large residuals are much more likely to be picked
- in practice, there is no visible difference between p = 20 and p = 10^100
- the method 'converges' to the maximum residual method as p → ∞

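A sketch of the weighted sampling (my own illustration; note that forming all residuals costs a full matrix-vector product per step, which is exactly the cost issue mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize the rows, ||a_i|| = 1
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(20000):
    r = np.abs(A @ x - b) ** p                   # p-th powers of the residuals
    weights = r / r.sum()
    i = rng.choice(n, p=weights)                 # likelihood ~ |<a_i, x_k> - b_i|^p
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a

print(np.linalg.norm(x - x_true))
```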

Figure: ‖x_k - x‖_{ℓ^2} over 10000 iterations for the Randomized Kaczmarz method (blue) and for p = 1 (orange), p = 2 (green) and p = 20 (red).

Theorem (Weighting is better, S. 2020)

Let 0 < p < ∞ and let A be normalized so that every row has norm ‖a_i‖ = 1. Then

  E ‖x_k - x‖_2^2 ≤ (1 - inf_{x ≠ 0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p ‖x‖_{ℓ^2}^2))^k ‖x_0 - x‖_2^2.

This is at least the rate of Randomized Kaczmarz since

  inf_{x ≠ 0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p ‖x‖_{ℓ^2}^2) ≥ 1 / (‖A‖_F^2 ‖A^{-1}‖_2^2),

with equality if and only if the singular vector v_n corresponding to the smallest singular value of A has the property that Av_n is a constant vector.


6. The Energy Cascade

Back to Randomized Kaczmarz. (Open question: how much of this is true for the weighted case?)

Figure: The size of ‖Ax_k - b‖_{ℓ^2} for k = 1, …, 10000. We observe rapid initial decay which then slows down.


The underlying story is somehow clear: the part of x_k - x that belongs to the larger singular values decays faster. But the theorem above does not actually rigorously prove this: it is a statement about the expectation (no concentration!).

No concentration indeed!

Figure: The evolution of ⟨(x_k - x)/‖x_k - x‖, v_1⟩, the normalized residual against the leading singular vector v_1, over the first 1000 iterations: fluctuations around the mean.

Theorem (Energy Cascade, S. 2020)

Abbreviating

  α = max_{1 ≤ i ≤ n} ‖A a_i‖^2 / ‖a_i‖^2,

we have

  E ‖Ax_{k+1} - b‖_2^2 ≤ (1 + α/‖A‖_F^2) ‖Ax_k - b‖_2^2 - (2/‖A‖_F^2) ‖A^T (Ax_k - b)‖_2^2.

- The constant α is usually quite small, maybe α ~ 2.
- The most important part is the A^T term in the decay quantity: it amplifies the contributions from large singular values and leads to even more decay.


Figure: The theorem in action: the numbers work out. Left: ‖Ax_k - b‖_{ℓ^2} over 3000 iterations. Right: the quantity (α/‖A‖_F^2) ‖Ax_k - b‖_2^2 - (2/‖A‖_F^2) ‖A^T(Ax_k - b)‖_2^2 over the same iterations.
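One way to see the inequality at work numerically (a sketch with a random test matrix; the exact enumeration over all rows replaces the expectation, and the ‖a_i‖^2-sampling is assumed as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
fro2 = np.sum(A**2)
probs = np.sum(A**2, axis=1) / fro2
alpha = max(np.sum((A @ A[i])**2) / np.sum(A[i]**2) for i in range(n))

x = rng.standard_normal(n)             # an arbitrary iterate x_k
r = A @ x - b

# exact conditional expectation of ||A x_{k+1} - b||^2, summing over all rows
exp_next = 0.0
for i in range(n):
    a = A[i]
    x_next = x + (b[i] - a @ x) / (a @ a) * a
    exp_next += probs[i] * np.sum((A @ x_next - b)**2)

bound = (1 + alpha / fro2) * np.sum(r**2) - (2 / fro2) * np.sum((A.T @ r)**2)
print(exp_next, bound)                 # the expectation should not exceed the bound
```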

7. What’s next?

Summary

- Random Kaczmarz converges preferentially along small singular vectors (we had several theorems about this). This is good for ‖A(x_k - x)‖ and bad for ‖x_k - x‖ (the cascade phenomenon).
- We also proved some results for the weighted variant showing that it is better.


So now we have a much better understanding: how can we improve things?

The main problem is, somehow: how do you avoid the bad subspaces? Maybe detect them and try to remove them?

Strohmer & Vershynin remark in their paper that in many instances it is better not to project onto the hyperplane but to go a little bit further:

  x_{k+1} = x_k + 1.1 · (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

Why?

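Trying this out is a one-line change in the update; a sketch with an over-relaxation factor (omega is my name for it, 1.1 is the value quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
probs = np.sum(A**2, axis=1) / np.sum(A**2)

omega = 1.1                     # omega = 1 recovers the plain projection
x = np.zeros(n)
for _ in range(20000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x + omega * (b[i] - a @ x) / (a @ a) * a    # overshoot the hyperplane slightly

print(np.linalg.norm(x - x_true))
```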

References

1. Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978

2. A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910

3. Stochastic Gradient Descent applied to Least Squares ..., arXiv:2007.13288

Thank you!