Solving Linear Systems of Equations via Random Kaczmarz/Stochastic Gradient Descent

Stefan Steinerberger

August 2020


Outline

1. The Kaczmarz method

2. Random Kaczmarz (or: Stochastic Gradient Descent)

3. What happens to the singular vectors?

4. Getting stuck between a rock and a hard place

5. Changing the likelihoods

6. The energy cascade

7. What’s next?


Throughout this talk, we will try to solve Ax = b where A ∈ Rn×n.

Let us assume that A is invertible and use ai ∈ Rn to denote the i-th row, so we can also write

⎛ a1 ⎞
⎜ a2 ⎟
⎜ ⋮  ⎟ x = b
⎝ an ⎠

or, equivalently,

∀ 1 ≤ i ≤ n : 〈ai , x〉 = bi .

Most of the results are more general and apply to overdetermined systems that have a solution. I don't quite know what happens when there are no solutions: do things converge properly to a least squares solution? (Question: how many arguments survive?)



The Kaczmarz method

Stefan Kaczmarz (1895 - 1939/1940)

Polish Mathematician

PhD in 1924 for Functional Equations
1930s: visit Hardy and Paley in Cambridge
1937: Approximate Solutions of Linear Equations (3 pages)
His colleagues described him as "tall and skinny", "calm and quiet", and a "modest man with rather moderate scientific ambitions". (bit strange, taken verbatim from MacTutor Math Biographies)
The circumstances of his death in WW2 (either 1939 or 1940) are unclear.



The Kaczmarz method

The method is remarkably simple: we want

∀ 1 ≤ i ≤ n : 〈ai , x〉 = bi .

Geometrically, we want to find the intersection of hyperplanes.




The Kaczmarz method

[Figure: the iterates xk, xk+1, xk+2 are projected successively onto the hyperplanes and approach the solution x.]

Project iteratively on the hyperplanes given by

〈ai , x〉 = bi .

The Pythagorean Theorem implies that the distance to the solution always decreases (unless you are already on that hyperplane).


The Kaczmarz method

This is very simple in terms of equations. If we project onto the hyperplane given by the i-th equation, we have

xk+1 = xk + (bi − 〈ai , xk〉)/‖ai‖² · ai .

• This is cheap: it's an inner product! We do not even have to load the full matrix into memory.

• This is thus useful for large matrices.
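As a sanity check, the projection step above can be run exactly as written. The following is a minimal Python sketch (my illustration, not code from the talk) of the cyclic method; the small, well-conditioned test matrix is my own choice so that convergence is quick.

```python
import numpy as np

def kaczmarz_cyclic(A, b, x0, sweeps=200):
    """Cyclic Kaczmarz: sweep through the rows in order and project the
    iterate onto each hyperplane {y : <a_i, y> = b_i} in turn."""
    x = np.array(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(A.shape[0]):
            a = A[i]
            # x_{k+1} = x_k + (b_i - <a_i, x_k>) / ||a_i||^2 * a_i
            x += (b[i] - a @ x) / (a @ a) * a
    return x

rng = np.random.default_rng(0)
n = 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))  # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true
x = kaczmarz_cyclic(A, b, np.zeros(n))
print(np.linalg.norm(x - x_true))  # small: the iterates approach the solution
```

Note that each projection touches only one row of A, which is the point made above: the matrix never needs to be held in memory all at once.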


Standard Kaczmarz. We cycle through the indices i and set

xk+1 = xk + (bi − 〈ai , xk〉)/‖ai‖² · ai .

The convergence of this method is geometrically obvious, but the convergence rate is not.

Random Kaczmarz. We pick a random equation i and set

xk+1 = xk + (bi − 〈ai , xk〉)/‖ai‖² · ai .

• somehow behaves a little better

• used since the 1980s in Tomography

• stochastic gradient descent for ‖Ax − b‖² → min
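The random variant differs from the cyclic sketch only in how the row is chosen. A minimal sketch (mine, not from the talk; here the row is picked uniformly at random):

```python
import numpy as np

def kaczmarz_random(A, b, x0, iters=5000, seed=0):
    """Random Kaczmarz: at each step pick one equation i at random
    (uniformly in this sketch) and project onto its hyperplane."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = A.shape[0]
    for _ in range(iters):
        i = rng.integers(n)                    # random equation
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a      # same projection step as before
    return x

rng = np.random.default_rng(1)
n = 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
x = kaczmarz_random(A, A @ x_true, np.zeros(n))
print(np.linalg.norm(x - x_true))  # small
```

Since the true solution lies on every hyperplane, each projection can only decrease the distance to it; randomness only affects how fast.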


Theorem (Strohmer & Vershynin, 2007)

If we pick the i-th equation with likelihood proportional to ‖ai‖², then

E ‖xk − x‖₂² ≤ ( 1 − 1/(‖A‖F² · ‖A⁻¹‖₂²) )^k ‖x0 − x‖₂².

• ‖A‖F is the Frobenius norm, i.e.

‖A‖F² = Σ_{i,j=1}^{n} aij².

• ‖A⁻¹‖₂ is the reciprocal of the smallest singular value, i.e.

‖A⁻¹‖₂ = sup_{x≠0} ‖A⁻¹x‖/‖x‖ = ( inf_{x≠0} ‖Ax‖/‖x‖ )⁻¹.
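To connect the theorem to something runnable, here is a sketch (my own, with my own helper name) of the Strohmer–Vershynin sampling, computing the guaranteed per-step contraction factor alongside:

```python
import numpy as np

def kaczmarz_sv(A, b, x0, iters=5000, seed=0):
    """Random Kaczmarz with the Strohmer-Vershynin sampling:
    row i is chosen with probability ||a_i||^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    probs = np.sum(A**2, axis=1) / np.sum(A**2)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.choice(A.shape[0], p=probs)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x

rng = np.random.default_rng(2)
n = 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
x = kaczmarz_sv(A, A @ x_true, np.zeros(n))

# the theorem's contraction factor for E||x_k - x||^2 per step
rate = 1 - 1 / (np.linalg.norm(A, 'fro')**2 * np.linalg.norm(np.linalg.inv(A), 2)**2)
print(np.linalg.norm(x - x_true), rate)
```

For this well-conditioned test matrix the rate is close to, but strictly less than, 1, and the observed error decays at least that fast in expectation.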


Sketch of the Proof

Strohmer & Vershynin's argument is very short and elegant (one of the reasons it has inspired a lot of subsequent work). Let Z be the random vector that takes the value aj/‖aj‖ with probability ‖aj‖²/‖A‖F². Then

‖xk+1 − x‖₂² ≤ ( 1 − |〈 (xk − x)/‖xk − x‖ , Z 〉|² ) ‖xk − x‖₂²

and

E |〈 (xk − x)/‖xk − x‖ , Z 〉|² = Σ_{j=1}^{m} (‖aj‖²/‖A‖F²) 〈 (xk − x)/‖xk − x‖ , aj/‖aj‖ 〉²

= (1/‖A‖F²) Σ_{j=1}^{m} 〈 (xk − x)/‖xk − x‖ , aj 〉²

= (1/‖A‖F²) ‖ A (xk − x)/‖xk − x‖ ‖².


3. What happens to the singular vectors?

Here's what I really wanted to know: what does xk − x do? Looking at the picture, it should be sort of jumping around.

[Figure: the iterates xk, xk+1, xk+2 bouncing between hyperplanes around the solution x.]

But in numerical experiments, I didn't see that.


Numerically, the (random) sequence of vectors

(xk − x)/‖xk − x‖

tends to mainly be a linear combination of singular vectors with small singular values. Of course it can jump around, but it doesn't seem to do it very much.


Theorem (Small Singular Values Dominate, (S. 2020))

Let vℓ be a (right) singular vector of A associated to the singular value σℓ. Then

E 〈xk − x , vℓ〉 = ( 1 − σℓ²/‖A‖F² )^k 〈x0 − x , vℓ〉 .

• The slowest rate of decay is given by the smallest singular value σn. Since σn = 1/‖A⁻¹‖₂, the rate is 1 − 1/(‖A‖F² ‖A⁻¹‖₂²), which is the (optimal) rate of Strohmer & Vershynin.

• Open Problem: only expectation; what about the variance?


This actually suggests that the method can be used to find the smallest singular vector: solve the problem Ax = 0.

Figure: A sample evolution of ‖Axk‖/‖xk‖.
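A quick numerical illustration of this idea (my sketch, with an artificial spectrum of my choosing so the effect is visible): run random Kaczmarz on Ax = 0 and watch the Rayleigh quotient ‖Axk‖/‖xk‖ drift down toward the smallest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
# artificial test matrix: nine singular values equal to 5, one equal to 0.1
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag([5.0] * (n - 1) + [0.1]) @ V.T

probs = np.sum(A**2, axis=1) / np.sum(A**2)   # Strohmer-Vershynin sampling
x = rng.standard_normal(n)
for _ in range(5000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x -= (a @ x) / (a @ a) * a                # project onto <a_i, x> = 0

# the quotient starts near 5 and drifts toward the smallest singular value
q = np.linalg.norm(A @ x) / np.linalg.norm(x)
print(q)
```

The iterate itself shrinks toward 0 (the only solution of Ax = 0), but its direction aligns with the small singular vectors, exactly as the theorem predicts; the quotient q is always at least σn.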


4. Stuck between a rock and a hard place

You get trapped in the narrow regions and it's hard to escape. This seems strange because, after all, it is a random process and you might end up on any hyperplane.


Theorem (Slowing down in Bad Areas, (S. 2020))

If xk ≠ x and P(xk+1 = x) = 0, then

E 〈 (xk − x)/‖xk − x‖ , (xk+1 − x)/‖xk+1 − x‖ 〉² = 1 − (1/‖A‖F²) ‖ A (xk − x)/‖xk − x‖ ‖².

• The left-hand side corresponds to how much you change your angle, how much you change the direction from which you are approaching.

• The right-hand side checks how your current angle is related to the singular vectors.

Once you approach from small singular vectors, you slow down and are unlikely to change directions. That's bad! How to fix it?

E 〈 (xk − x)/‖xk − x‖ , (xk+1 − x)/‖xk+1 − x‖ 〉² = 1 − (1/‖A‖F²) ‖ A (xk − x)/‖xk − x‖ ‖².

Proof. Littlewood's Lemma: identities are always trivial. Assume w.l.o.g. that x = 0 (so b = 0) and ‖xk‖ = 1; then xk+1 = xk − (〈ai , xk〉/‖ai‖²) ai for the chosen row i, and

E 〈 xk , xk+1/‖xk+1‖ 〉² = Σ_{i=1}^{m} (‖ai‖²/‖A‖F²) 〈 xk , xk − (〈ai , xk〉/‖ai‖²) ai 〉² / ‖xk − (〈ai , xk〉/‖ai‖²) ai‖²

= Σ_{i} (‖ai‖²/‖A‖F²) ‖xk − (〈ai , xk〉/‖ai‖²) ai‖⁴ / ‖xk − (〈ai , xk〉/‖ai‖²) ai‖²

(since the projection is orthogonal, 〈 xk , xk − (〈ai , xk〉/‖ai‖²) ai 〉 = ‖xk − (〈ai , xk〉/‖ai‖²) ai‖²)

= Σ_{i} (‖ai‖²/‖A‖F²) ‖xk − (〈ai , xk〉/‖ai‖²) ai‖²

= Σ_{i} (‖ai‖²/‖A‖F²) ( 1 − 〈ai , xk〉²/‖ai‖² )

= (1/‖A‖F²) Σ_{i} ( ‖ai‖² − 〈ai , xk〉² )

= 1 − (1/‖A‖F²) Σ_{i} 〈ai , xk〉² = 1 − ‖Axk‖²/‖A‖F².

5. Changing the likelihoods

New idea: maybe we shouldn't pick the equations blindly at random. We want

∀ 1 ≤ i ≤ n : 〈ai , x〉 = bi

so maybe we should pick equations where |〈ai , xk〉 − bi| is large? This is known as the maximum residual method. It has been known since (at least) the 1990s that this is faster (Feichtinger, Cenker, Mayer, Steier and Strohmer, 1992), (Griebel and Oswald, 2012), ...

Wait a second... those inner products are expensive. If you can precompute an N×N matrix, then it's cheap. An iteration step gets basically twice as expensive.


Proposed fix: choose the i-th equation with likelihood proportional to

P(we choose equation i) = |〈ai , xk〉 − bi|^p / ‖Axk − b‖_{ℓ^p}^p.

• for p = 0, every equation is picked with equal likelihood

• for p large, the large deviations are more likely to be picked

• in practice, no difference between p = 20 and p = 10^100

• the method 'converges' to maximum residual as p → ∞.
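A sketch of this sampling rule in code (mine, not from the talk). This naive version recomputes the full residual A·x − b at every step, which is exactly the extra cost discussed above; the talk's point is that with a precomputed N×N matrix the residual can be kept up to date more cheaply.

```python
import numpy as np

def kaczmarz_residual_weighted(A, b, x0, iters=5000, p=2.0, seed=0):
    """Kaczmarz with P(pick equation i) proportional to |<a_i, x_k> - b_i|^p.
    p = 0 is uniform sampling; p -> infinity approaches maximum residual.
    Recomputing A @ x each step roughly doubles the per-iteration cost."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        r = np.abs(A @ x - b) ** p        # residual-based weights
        s = r.sum()
        if s == 0:                        # residual is zero: already solved
            break
        i = rng.choice(A.shape[0], p=r / s)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x

rng = np.random.default_rng(3)
n = 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
x = kaczmarz_residual_weighted(A, A @ x_true, np.zeros(n))
print(np.linalg.norm(x - x_true))  # small
```

Varying p interpolates between plain random Kaczmarz (p = 0) and the maximum residual method (p large).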

Figure: ‖xk − x‖ℓ² for the Randomized Kaczmarz method (blue), for p = 1 (orange), p = 2 (green) and p = 20 (red).

Theorem (Weighting is better (S. 2020))

Let 0 < p < ∞ and let A be normalized so that each row has norm ‖ai‖ = 1. Then

E ‖xk − x‖₂² ≤ ( 1 − inf_{x≠0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p · ‖x‖₂²) )^k ‖x0 − x‖₂².

This is at least the rate of Randomized Kaczmarz since

inf_{x≠0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p · ‖x‖₂²) ≥ 1/(‖A‖F² · ‖A⁻¹‖₂²)

with equality if and only if the singular vector vn corresponding to the smallest singular value of A has the property that Avn is a constant vector.


6. The Energy Cascade

Back to Randomized Kaczmarz. (Open Question: how much of this is true for the weighted case?)

Figure: The size of ‖Axk − b‖ℓ² for k = 1, . . . , 10000. We observe rapid initial decay which then slows down.

The underlying story is somehow clear: the part of xk − x that belongs to larger singular vectors decays faster. But the Theorem above does not actually rigorously prove this: it's a statement about expectation (no concentration!).

No concentration indeed!

Figure: The evolution of 〈 (xk − x)/‖xk − x‖ , v1 〉, the normalized error against the leading singular vector v1: fluctuations around the mean.

Theorem (Energy Cascade (S, 2020))

Abbreviating

α = max_{1≤i≤n} ‖Aai‖²/‖ai‖²,

we have

E ‖Axk+1 − b‖₂² ≤ ( 1 + α/‖A‖F² ) ‖Axk − b‖₂² − (2/‖A‖F²) ‖Aᵀ(Axk − b)‖₂².

• The constant α is usually quite small, maybe α ∼ 2.

• The most important part is that there is an Aᵀ term in the decay quantity: this makes contributions from large singular vectors even bigger and leads to even more decay.
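The inequality is easy to probe numerically. The sketch below (my own construction, with an arbitrary test matrix and iterate) computes the exact one-step expectation on the left-hand side by averaging over all row choices with their sampling probabilities, and compares it to the stated bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15
A = rng.standard_normal((n, n))
b = A @ rng.standard_normal(n)
xk = rng.standard_normal(n)          # an arbitrary current iterate

row_norms2 = np.sum(A**2, axis=1)
fro2 = row_norms2.sum()
# alpha = max_i ||A a_i||^2 / ||a_i||^2, where a_i is the i-th row of A
alpha = np.max(np.sum((A @ A.T)**2, axis=0) / row_norms2)

# exact expectation of ||A x_{k+1} - b||^2 over the choice of row i
r = A @ xk - b
lhs = 0.0
for i in range(n):
    xi = xk + (b[i] - A[i] @ xk) / row_norms2[i] * A[i]
    lhs += (row_norms2[i] / fro2) * np.sum((A @ xi - b)**2)

rhs = (1 + alpha / fro2) * np.sum(r**2) - (2 / fro2) * np.sum((A.T @ r)**2)
print(lhs <= rhs + 1e-9)  # the stated bound holds
```

Expanding ‖A xk+1 − b‖² for one step and averaging reproduces the bound directly: the cross term yields exactly the ‖Aᵀ(Axk − b)‖² contribution, and the quadratic term is controlled by α.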


Figure: The Theorem in action: ‖Axk − b‖ℓ² (left) and (α/‖A‖F²) ‖Axk − b‖₂² − (2/‖A‖F²) ‖Aᵀ(Axk − b)‖₂² (right). The numbers work out.

7. What's next?

Summary

• Random Kaczmarz converges preferentially along small singular vectors (we had several theorems about this). This is good for ‖A(xk − x)‖ and bad for ‖xk − x‖ (the cascade phenomenon).

• We also proved some good results for the weighted variant showing that it's better.


So now we have a much better understanding: how can we improve things?

The main problem is somehow: how do you avoid the bad subspaces? Maybe detect them and try to remove them?

Strohmer & Vershynin remark in their paper that in many instances it is better not to project onto the hyperplane but to go a little bit further:

xk+1 = xk + 1.1 · (bi − 〈ai , xk〉)/‖ai‖² · ai .

Why?
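An easy way to experiment with this overshoot (my sketch; the factor ω = 1.1 is the one from the slide, and classically relaxed projection methods still converge for 0 < ω < 2):

```python
import numpy as np

def kaczmarz_relaxed(A, b, x0, iters=5000, omega=1.1, seed=0):
    """Random Kaczmarz with relaxation: omega = 1 projects exactly onto the
    hyperplane, omega > 1 deliberately overshoots it a little."""
    rng = np.random.default_rng(seed)
    probs = np.sum(A**2, axis=1) / np.sum(A**2)   # Strohmer-Vershynin sampling
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.choice(A.shape[0], p=probs)
        a = A[i]
        x += omega * (b[i] - a @ x) / (a @ a) * a
    return x

rng = np.random.default_rng(4)
n = 20
A = np.eye(n) + 0.05 * rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
errs = []
for omega in (1.0, 1.1):
    x = kaczmarz_relaxed(A, A @ x_true, np.zeros(n), omega=omega)
    errs.append(np.linalg.norm(x - x_true))
print(errs)  # both small
```

Whether ω > 1 actually wins depends on the matrix; the talk's question "Why?" is exactly about understanding when and why the overshoot helps.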


References

1. Randomized Kaczmarz converges along small singular vectors,arXiv:2006.16978

2. A Weighted Randomized Kaczmarz Method for Solving LinearSystems, arXiv:2007.02910

3. Stochastic Gradient Descent applied to Least Squares ...,arXiv:2007.13288

Thank you!