
Solving Linear Systems of Equations via Random Kaczmarz/Stochastic Gradient Descent

Stefan Steinerberger

August 2020

Outline

1. The Kaczmarz method

2. Random Kaczmarz (or: Stochastic Gradient Descent)

3. What happens to the singular vectors?

4. Getting stuck between a rock and a hard place

5. Changing the likelihoods

6. The energy cascade

7. What’s next?

Throughout this talk, we will try to solve Ax = b where A ∈ R^{n×n}.

Let us assume that A is invertible and use a_i ∈ R^n to denote the i-th row, so we can also write

  (a_1; a_2; …; a_n) x = b    or equivalently    ∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i.

Most of the results are more general and apply to overdetermined systems that have a solution. I don't quite know what happens when there are no solutions: do things converge properly to a least-squares solution? (Question: how many arguments survive?)


The Kaczmarz method

Stefan Kaczmarz (1895 - 1939/1940), Polish mathematician

- PhD in 1924 for functional equations
- 1930s: visits Hardy and Paley in Cambridge
- 1937: "Approximate Solutions of Linear Equations" (3 pages)
- His colleagues described him as "tall and skinny", "calm and quiet", and a "modest man with rather moderate scientific ambitions". (A bit strange; taken verbatim from the MacTutor Math Biographies.)
- The circumstances of his death in WW2 (either 1939 or 1940) are unclear.


The Kaczmarz method

The method is remarkably simple: we want

∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i.

Geometrically, we want to find the intersection of hyperplanes.

[Figure: hyperplanes ⟨a_i, x⟩ = b_i intersecting in the solution x.]


The Kaczmarz method

[Figure: successive projections producing iterates x_k, x_{k+1}, x_{k+2} that approach the solution x.]

Project iteratively on the hyperplanes given by

⟨a_i, x⟩ = b_i.

The Pythagorean theorem implies that the distance to the solution always decreases (unless you are already on that hyperplane).

The Kaczmarz method

This is very simple in terms of equations. If we project onto the hyperplane given by the i-th equation, we have

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

- This is cheap: it's an inner product! We do not even have to load the full matrix into memory.
- This is thus useful for large matrices.

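In code, one projection step is a one-liner. Here is a minimal NumPy sketch (the function name kaczmarz_step and the variable names are mine, not from the talk):

```python
import numpy as np

def kaczmarz_step(x, a_i, b_i):
    """Project x onto the hyperplane <a_i, y> = b_i."""
    return x + (b_i - a_i @ x) / (a_i @ a_i) * a_i
```

Each step only touches a single row a_i, which is exactly the point about not having to load the full matrix.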

Standard Kaczmarz. We cycle through the indices i and set

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

(The convergence of this method is geometrically obvious) - but the convergence rate is not.

Random Kaczmarz. We pick a random equation i and set

  x_{k+1} = x_k + (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

- somehow behaves a little better
- used since the 1980s in tomography
- stochastic gradient descent for ‖Ax - b‖^2 → min

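As a quick illustration of the two sweeps, here is a sketch that runs both on a random test system (the 100×100 Gaussian matrix and all names are my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def kaczmarz_step(x, a_i, b_i):
    # project x onto the hyperplane <a_i, y> = b_i
    return x + (b_i - a_i @ x) / (a_i @ a_i) * a_i

x_cyclic = np.zeros(n)   # Standard Kaczmarz: cycle through the equations
x_random = np.zeros(n)   # Random Kaczmarz: pick an equation at random
for k in range(20000):
    x_cyclic = kaczmarz_step(x_cyclic, A[k % n], b[k % n])
    j = rng.integers(n)
    x_random = kaczmarz_step(x_random, A[j], b[j])

print(np.linalg.norm(x_cyclic - x_true), np.linalg.norm(x_random - x_true))
```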

Theorem (Strohmer & Vershynin, 2007)

If we pick the i-th equation with likelihood proportional to ‖a_i‖^2, then

  E ‖x_k - x‖_2^2 ≤ (1 - 1/(‖A‖_F^2 ‖A^{-1}‖_2^2))^k ‖x_0 - x‖_2^2.

- ‖A‖_F is the Frobenius norm, i.e. ‖A‖_F^2 = Σ_{i,j=1}^n a_{ij}^2.
- ‖A^{-1}‖_2 is the inverse of the smallest singular value, i.e.

  ‖A^{-1}‖_2 = sup_{x ≠ 0} ‖A^{-1}x‖ / ‖x‖ = sup_{x ≠ 0} ‖x‖ / ‖Ax‖.
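In code, the only change relative to the sketch above is the sampling rule; a minimal sketch with the row probabilities of the theorem (the test system is again my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# pick row i with likelihood ||a_i||^2 / ||A||_F^2
probs = np.sum(A**2, axis=1) / np.sum(A**2)

x = np.zeros(n)
for _ in range(20000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a

print(np.linalg.norm(x - x_true))
```

For Gaussian test matrices all rows have comparable norm, so this behaves much like uniform sampling; the weighting matters when the row norms differ a lot.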

Sketch of the Proof

Strohmer & Vershynin's argument is very short and elegant (one of the reasons it has inspired a lot of subsequent work).


With Z = a_j/‖a_j‖ denoting the (random) normalized row used in step k (row j being chosen with likelihood ‖a_j‖^2/‖A‖_F^2), orthogonality of the projection gives

  ‖x_{k+1} - x‖_2^2 ≤ (1 - |⟨(x_k - x)/‖x_k - x‖, Z⟩|^2) ‖x_k - x‖_2^2.

Taking the expectation over the choice of row,

  E |⟨(x_k - x)/‖x_k - x‖, Z⟩|^2 = Σ_{j=1}^n (‖a_j‖^2/‖A‖_F^2) ⟨(x_k - x)/‖x_k - x‖, a_j/‖a_j‖⟩^2

    = (1/‖A‖_F^2) Σ_{j=1}^n ⟨(x_k - x)/‖x_k - x‖, a_j⟩^2

    = (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2 ≥ 1/(‖A‖_F^2 ‖A^{-1}‖_2^2),

since ‖Au‖ ≥ ‖u‖/‖A^{-1}‖_2 for every u.


3. What happens to the singular vectors?

Here's what I really wanted to know: what does x_k - x do? Looking at the picture, it should be sort of jumping around.

[Figure: the projection picture again, with iterates x_k, x_{k+1}, x_{k+2} approaching x.]

But in numerical experiments, I didn’t see that.


Numerically, the (random) sequence of vectors

  (x_k - x) / ‖x_k - x‖

tends to be mainly a linear combination of singular vectors with small singular values. Of course it can jump around, but it doesn't seem to do so very much.

Theorem (Small Singular Values Dominate, S. 2020)

Let v_ℓ be a (right) singular vector of A associated to the singular value σ_ℓ. Then

  E ⟨x_k - x, v_ℓ⟩ = (1 - σ_ℓ^2/‖A‖_F^2)^k ⟨x_0 - x, v_ℓ⟩.

- The slowest rate of decay is given by the smallest singular value σ_n. Since σ_n = 1/‖A^{-1}‖_2, that rate is 1 - 1/(‖A‖_F^2 ‖A^{-1}‖_2^2), which is the (optimal) rate of Strohmer & Vershynin.
- Open problem: this is only a statement about the expectation. What about the variance?


This actually suggests that the method can be used to find the smallest singular vector: solve the problem Ax = 0.

Figure: A sample evolution of ‖Ax_k‖/‖x_k‖ over 20000 iterations.
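A rough sketch of that experiment (my own illustration, not code from the talk): run the iteration on Ax = 0 from a random start and watch the direction of the iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
probs = np.sum(A**2, axis=1) / np.sum(A**2)

x = rng.standard_normal(n)            # the true solution of Ax = 0 is x = 0
for _ in range(100000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x - (a @ x) / (a @ a) * a     # project onto the hyperplane <a_i, y> = 0

v = x / np.linalg.norm(x)             # direction of the current iterate
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
print(np.linalg.norm(A @ v), sigma_min)   # should drift toward sigma_min as k grows
```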

4. Stuck between a rock and a hard place

You get trapped in the narrow regions and it's hard to escape. This seems strange because, after all, it is a random process and you might end up on any hyperplane.



Theorem (Slowing down in Bad Areas, S. 2020)

If x_k ≠ x and P(x_{k+1} = x) = 0, then

  E ⟨(x_k - x)/‖x_k - x‖, (x_{k+1} - x)/‖x_{k+1} - x‖⟩^2 = 1 - (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2.

- The left-hand side measures how much you change your angle, i.e. how much you change the direction from which you are approaching the solution.
- The right-hand side measures how your current direction relates to the singular vectors: it is close to 1 when (x_k - x)/‖x_k - x‖ is dominated by singular vectors with small singular values.

Once you approach along small singular vectors, you slow down and are unlikely to change direction. That's bad! How do we fix it?

E ⟨(x_k - x)/‖x_k - x‖, (x_{k+1} - x)/‖x_{k+1} - x‖⟩^2 = 1 - (1/‖A‖_F^2) ‖A (x_k - x)/‖x_k - x‖‖^2.

Proof. Littlewood's Lemma: identities are always trivial. We may assume x = 0 (hence b = 0) and ‖x_k‖ = 1. Picking row i with likelihood ‖a_i‖^2/‖A‖_F^2, so that x_{k+1} = x_k - (⟨a_i, x_k⟩/‖a_i‖^2) a_i, we compute

  E ⟨x_k, x_{k+1}/‖x_{k+1}‖⟩^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ⟨x_k, x_{k+1}⟩^2 / ‖x_{k+1}‖^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ‖x_{k+1}‖^4 / ‖x_{k+1}‖^2        (since ⟨x_k, x_{k+1}⟩ = ‖x_{k+1}‖^2: the step x_k - x_{k+1} is orthogonal to x_{k+1})
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) ‖x_k - (⟨a_i, x_k⟩/‖a_i‖^2) a_i‖^2
    = Σ_{i=1}^n (‖a_i‖^2/‖A‖_F^2) (1 - ⟨a_i, x_k⟩^2/‖a_i‖^2)
    = (1/‖A‖_F^2) Σ_{i=1}^n (‖a_i‖^2 - ⟨a_i, x_k⟩^2)
    = 1 - (1/‖A‖_F^2) Σ_{i=1}^n ⟨a_i, x_k⟩^2 = 1 - ‖Ax_k‖^2/‖A‖_F^2.
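Since this is an identity rather than an inequality, it can be checked to machine precision by summing over all rows instead of sampling. A small sketch (the random test data and the ‖a_i‖^2-sampling assumption are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
fro2 = np.sum(A**2)
probs = np.sum(A**2, axis=1) / fro2      # row i chosen with likelihood ||a_i||^2 / ||A||_F^2

x_k = rng.standard_normal(n)             # an arbitrary current iterate
u = (x_k - x_true) / np.linalg.norm(x_k - x_true)

# left-hand side: exact expectation over the choice of row i
lhs = 0.0
for i in range(n):
    a = A[i]
    x_next = x_k + (b[i] - a @ x_k) / (a @ a) * a
    w = (x_next - x_true) / np.linalg.norm(x_next - x_true)
    lhs += probs[i] * (u @ w) ** 2

rhs = 1 - np.sum((A @ u) ** 2) / fro2
print(lhs, rhs)                          # should agree up to rounding
```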

5. Changing the likelihoods

New idea: maybe we shouldn't pick the equations at random with fixed likelihoods. We want

  ∀ 1 ≤ i ≤ n : ⟨a_i, x⟩ = b_i,

so maybe we should pick equations where the residual |⟨a_i, x_k⟩ - b_i| is large? This is known as the maximum residual method. It has been known since (at least) the 1990s that this is faster (Feichtinger, Cenker, Mayer, Steier and Strohmer, 1992; Griebel and Oswald, 2012), ...

Wait a second... those inner products are expensive. If you can precompute an n×n matrix of inner products between the rows, then it's cheap: an iteration step gets basically twice as expensive.


Proposed fix: choose the i-th equation with likelihood proportional to the p-th power of its current residual,

  P(we choose equation i) = |⟨a_i, x_k⟩ - b_i|^p / ‖Ax_k - b‖_{ℓ^p}^p.

- for p = 0, every equation is picked with equal likelihood
- for large p, equations with large residuals are much more likely to be picked
- in practice, there is no visible difference between p = 20 and p = 10^100
- the method 'converges' to the maximum residual method as p → ∞

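A sketch of the weighted sampling (my own illustration; note that forming all residuals costs a full matrix-vector product per step, which is exactly the cost issue mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize the rows, ||a_i|| = 1
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(20000):
    r = np.abs(A @ x - b) ** p                   # p-th powers of the residuals
    weights = r / r.sum()
    i = rng.choice(n, p=weights)                 # likelihood ~ |<a_i, x_k> - b_i|^p
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a

print(np.linalg.norm(x - x_true))
```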

Figure: ‖x_k - x‖_{ℓ^2} over 10000 iterations for the Randomized Kaczmarz method (blue) and for p = 1 (orange), p = 2 (green) and p = 20 (red).

Theorem (Weighting is better, S. 2020)

Let 0 < p < ∞ and let A be normalized so that every row has norm ‖a_i‖ = 1. Then

  E ‖x_k - x‖_2^2 ≤ (1 - inf_{x ≠ 0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p ‖x‖_{ℓ^2}^2))^k ‖x_0 - x‖_2^2.

This is at least the rate of Randomized Kaczmarz since

  inf_{x ≠ 0} ‖Ax‖_{ℓ^{p+2}}^{p+2} / (‖Ax‖_{ℓ^p}^p ‖x‖_{ℓ^2}^2) ≥ 1 / (‖A‖_F^2 ‖A^{-1}‖_2^2),

with equality if and only if the singular vector v_n corresponding to the smallest singular value of A has the property that Av_n is a constant vector.


6. The Energy Cascade

Back to Randomized Kaczmarz. (Open question: how much of this is true for the weighted case?)

Figure: The size of ‖Ax_k - b‖_{ℓ^2} for k = 1, …, 10000. We observe rapid initial decay which then slows down.


The underlying story is somehow clear: the part of x_k - x that belongs to the larger singular values decays faster. But the theorem above does not actually rigorously prove this: it is a statement about the expectation (no concentration!).

No concentration indeed!

Figure: The evolution of ⟨(x_k - x)/‖x_k - x‖, v_1⟩, the normalized residual against the leading singular vector v_1, over the first 1000 iterations: fluctuations around the mean.

Theorem (Energy Cascade, S. 2020)

Abbreviating

  α = max_{1 ≤ i ≤ n} ‖A a_i‖^2 / ‖a_i‖^2,

we have

  E ‖Ax_{k+1} - b‖_2^2 ≤ (1 + α/‖A‖_F^2) ‖Ax_k - b‖_2^2 - (2/‖A‖_F^2) ‖A^T (Ax_k - b)‖_2^2.

- The constant α is usually quite small, maybe α ~ 2.
- The most important part is the A^T term in the decay quantity: it amplifies the contributions from large singular values and leads to even more decay.


Figure: The theorem in action: the numbers work out. Left: ‖Ax_k - b‖_{ℓ^2} over 3000 iterations. Right: the quantity (α/‖A‖_F^2) ‖Ax_k - b‖_2^2 - (2/‖A‖_F^2) ‖A^T(Ax_k - b)‖_2^2 over the same iterations.
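One way to see the inequality at work numerically (a sketch with a random test matrix; the exact enumeration over all rows replaces the expectation, and the ‖a_i‖^2-sampling is assumed as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
fro2 = np.sum(A**2)
probs = np.sum(A**2, axis=1) / fro2
alpha = max(np.sum((A @ A[i])**2) / np.sum(A[i]**2) for i in range(n))

x = rng.standard_normal(n)             # an arbitrary iterate x_k
r = A @ x - b

# exact conditional expectation of ||A x_{k+1} - b||^2, summing over all rows
exp_next = 0.0
for i in range(n):
    a = A[i]
    x_next = x + (b[i] - a @ x) / (a @ a) * a
    exp_next += probs[i] * np.sum((A @ x_next - b)**2)

bound = (1 + alpha / fro2) * np.sum(r**2) - (2 / fro2) * np.sum((A.T @ r)**2)
print(exp_next, bound)                 # the expectation should not exceed the bound
```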

7. What’s next?

Summary

- Random Kaczmarz converges preferentially along small singular vectors (we had several theorems about this). This is good for ‖A(x_k - x)‖ and bad for ‖x_k - x‖ (the cascade phenomenon).
- We also proved some results for the weighted variant showing that it is better.


So now we have a much better understanding: how can we improve things?

The main problem is, somehow: how do you avoid the bad subspaces? Maybe detect them and try to remove them?

Strohmer & Vershynin remark in their paper that in many instances it is better not to project onto the hyperplane but to go a little bit further:

  x_{k+1} = x_k + 1.1 · (b_i - ⟨a_i, x_k⟩) / ‖a_i‖^2 · a_i.

Why?

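Trying this out is a one-line change in the update; a sketch with an over-relaxation factor (omega is my name for it, 1.1 is the value quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true
probs = np.sum(A**2, axis=1) / np.sum(A**2)

omega = 1.1                     # omega = 1 recovers the plain projection
x = np.zeros(n)
for _ in range(20000):
    i = rng.choice(n, p=probs)
    a = A[i]
    x = x + omega * (b[i] - a @ x) / (a @ a) * a    # overshoot the hyperplane slightly

print(np.linalg.norm(x - x_true))
```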

References

1. Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978

2. A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910

3. Stochastic Gradient Descent applied to Least Squares ..., arXiv:2007.13288

Thank you!