Transcript of “Greedy and Randomized Projection Methods” (jhaddock/Slides/CAMOct19.pdf)

  • Greedy and Randomized Projection Methods

    Jamie Haddock

    UCLA CAM Colloquium,

    October 30, 2019

    Computational and Applied Mathematics

    UCLA

    joint with Jesús A. De Loera, Deanna Needell, and Anna Ma

    https://arxiv.org/abs/1802.03126 (BIT Numerical Mathematics 2019)

    https://arxiv.org/abs/1605.01418 (SISC 2017)

    1

  • BIG Data

    2


  • Motivation: Average Consensus

    Image credit: Everaldo Coelho (YellowIcon), Crystal icons, LGPL, https://commons.wikimedia.org/w/index.php?curid=7288951.

    3


  • Motivation: Average Consensus

    AC solution:

    argmin_x ½‖x − c‖²  s.t.  Qx = 0,

    where Q is the oriented edge-incidence matrix of the network: the row for edge e_ij has a 1 in column i, a −1 in column j, and 0 elsewhere.

    Gossip Method: Begin with x_0 = c ∈ R^|V|:

    1. Choose e_ij ∈ E.

    2. Define x_k^(i) = x_k^(j) = (x_{k−1}^(i) + x_{k−1}^(j)) / 2.

    3. Repeat.

    Applications:

    . clock synchronization

    . PageRank

    . opinion formation

    . blockchain technology

    4
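The gossip update above can be sketched in a few lines. This is a toy illustration, not the slides' code; the cycle graph, the initial values, and the iteration count are all made-up assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cycle graph on 6 nodes; node i holds private value c[i].
n = 6
edges = [(i, (i + 1) % n) for i in range(n)]
c = rng.standard_normal(n)

x = c.copy()
for _ in range(2000):
    # 1. Choose an edge e_ij uniformly at random.
    i, j = edges[rng.integers(len(edges))]
    # 2. Both endpoints replace their values by the pairwise average.
    x[i] = x[j] = (x[i] + x[j]) / 2
    # 3. Repeat.

# Every node's value converges to the average of the initial values.
consensus_error = np.abs(x - c.mean()).max()
```

Each pairwise average preserves the network sum, so the only fixed point reachable this way is the all-average vector.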


  • General Setup

    We are interested in solving highly overdetermined systems of equations (or inequalities), Ax = b (Ax ≤ b), where A ∈ R^{m×n}, b ∈ R^m, and m ≫ n. Rows are denoted a_i^T.

    5

  • Iterative Projection Methods

    If {x ∈ R^n : Ax = b} is nonempty, these methods construct an approximation to a solution:

    1. Randomized Kaczmarz Method

    2. Motzkin’s Method

    3. Sampling Kaczmarz-Motzkin Methods (SKM)

    Applications:

    1. Tomography (Algebraic Reconstruction Technique)

    2. Linear programming

    3. Average consensus (greedy gossip with eavesdropping)

    6


  • Kaczmarz Method

    (figure: iterates x_0, x_1, x_2, x_3 projecting onto successive hyperplanes)

    Given x_0 ∈ R^n:

    1. Choose i_k ∈ [m] with probability ‖a_{i_k}‖² / ‖A‖²_F.

    2. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    3. Repeat.

    7
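The randomized Kaczmarz iteration above is a few lines of numpy. A minimal sketch on a synthetic consistent system (the dimensions and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent, highly overdetermined system Ax = b with m >> n.
m, n = 500, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

row_norms_sq = np.sum(A**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()  # ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for _ in range(5000):
    # 1. Choose row i_k with probability ||a_{i_k}||^2 / ||A||_F^2.
    i = rng.choice(m, p=probs)
    # 2. Project onto the hyperplane a_i^T x = b_i.
    x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    # 3. Repeat.

rk_error = np.linalg.norm(x - x_true)
```

Each step is an orthogonal projection onto one constraint, so the error to any solution never increases.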


  • Motzkin’s Method

    (figure: iterates x_0, x_1, x_2 projecting onto the most violated hyperplane)

    Given x_0 ∈ R^n:

    1. Choose i_k ∈ [m] as i_k := argmax_{i ∈ [m]} |a_i^T x_{k−1} − b_i|.

    2. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    3. Repeat.

    8
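Motzkin's method differs only in the selection rule: project onto the most violated constraint. A sketch on the same kind of synthetic system (sizes are illustrative assumptions), with rows normalized as in the later rate statements:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 500, 20
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(2000):
    # 1. Greedy choice: i_k = argmax_i |a_i^T x - b_i|.
    r = A @ x - b
    i = np.argmax(np.abs(r))
    # 2. Project onto that hyperplane (rows have unit norm).
    x -= r[i] * A[i]
    # 3. Repeat.

mm_error = np.linalg.norm(x - x_true)
```

The full residual recomputation each iteration is the price of the greedy choice; that cost is exactly what SKM's sampling amortizes.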


  • Our Hybrid Method (SKM)

    (figure: iterates x_0, x_1, x_2)

    Given x_0 ∈ R^n:

    1. Choose τ_k ⊂ [m] to be a sample of β constraints chosen uniformly at random among the rows of A.

    2. From the β rows, choose i_k := argmax_{i ∈ τ_k} |a_i^T x_{k−1} − b_i|.

    3. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    4. Repeat.

    9
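The SKM hybrid interpolates between the two: sample β rows, then act greedily within the sample. A sketch under the same illustrative assumptions (synthetic system, β = 50 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, beta = 500, 20, 50
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(3000):
    # 1. Sample beta row indices uniformly at random.
    tau = rng.choice(m, size=beta, replace=False)
    # 2. Among the sample, pick the most violated constraint.
    r = A[tau] @ x - b[tau]
    i = tau[np.argmax(np.abs(r))]
    # 3. Project onto that hyperplane.
    x -= (A[i] @ x - b[i]) * A[i]
    # 4. Repeat.

skm_error = np.linalg.norm(x - x_true)
```

β = 1 recovers (uniform) Kaczmarz and β = m recovers Motzkin, so the sample size trades per-iteration cost against greediness.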


  • Glimpse of HUGE Body of Literature

    RK: [Strohmer-Vershynin ’09], [Needell-Srebro-Ward ’16]

    Greedy: [Censor ’81], [Nutini et al ’16], [Bai-Wu ’18], [Du-Gao ’19]

    Accel.: [Hanke-Niethammer ’90], [Liu-Wright ’16], [Morshed-Islam ’19]

    Block: [Popa et al ’12], [Needell-Tropp ’14], [Needell-Zhao-Zouzias ’15]

    Sketching: [Gower-Richtárik ’15], [Needell-Rebrova ’19]

    Phase retrieval: [Tan-Vershynin ’17], [Jeong-Güntürk ’17]

    LP: [Motzkin-Schoenberg ’54], [Agmon ’54], [Goffin ’80], [Chubanov ’12]

    10

  • Experimental Convergence

    . β: sample size

    . A is 50000× 100 Gaussian matrix, consistent system

    . ‘faster’ convergence for larger sample size

    11


  • Convergence Rates

    Below are the convergence rates for the methods on a consistent system Ax = b with unique solution x, whose rows have been normalized to have unit norm.

    . RK (Strohmer - Vershynin ’09): E‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    . MM (Agmon ’54): ‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    . SKM (De Loera - H. - Needell ’17): E‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    Why are these all the same?

    12
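The shared contraction factor 1 − σ²_min(A)/m is easy to compute for a concrete matrix. A sketch for a row-normalized Gaussian matrix (dimensions are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Contraction factor 1 - sigma_min(A)^2 / m for a row-normalized matrix.
m, n = 500, 20
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)

sigma_min = np.linalg.svd(A, compute_uv=False).min()
rate = 1 - sigma_min**2 / m  # strictly between 0 and 1 for full-rank A
```

Since ‖A‖²_F = m for unit-norm rows, σ²_min(A) ≤ m/n, so for m ≫ n this factor sits very close to 1; that is what the dynamic-range analysis later improves on.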


  • A Pathological Example

    Because.

    (figure: a pathological system, shown with starting iterate x_0)

    13

  • Structure of the Residual

    Several works have used sparsity of the residual to improve the convergence rate of greedy methods.

    [De Loera, H., Needell ’17], [Bai, Wu ’18], [Du, Gao ’19]

    However, not much sparsity can be expected in most cases. Instead, we’d like to use the dynamic range of the residual to guarantee faster convergence:

    γ_k := ‖Ax_k − Ax‖²_2 / ‖Ax_k − Ax‖²_∞

    14
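The dynamic range is a one-liner to compute; a sketch on a made-up iterate of a synthetic consistent system (all sizes and the iterate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# gamma_k = ||A x_k - A x||_2^2 / ||A x_k - A x||_inf^2 for a consistent system.
m, n = 500, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
xk = rng.standard_normal(n)  # stand-in for some iterate x_k

residual = A @ (xk - x_true)
gamma = np.linalg.norm(residual)**2 / np.linalg.norm(residual, np.inf)**2
```

A sum of m squares is at least its largest term and at most m times it, so γ always lies in [1, m]: γ near 1 means one dominant residual entry (great for greedy selection), γ near m means a flat residual.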


  • An Accelerated Convergence Rate

    Theorem (H. - Needell ’19)

    Let x denote the solution of the consistent, normalized system Ax = b. Motzkin’s method exhibits the (possibly highly accelerated) convergence rate:

    ‖x_T − x‖² ≤ ∏_{k=0}^{T−1} (1 − σ²_min(A)/(4γ_k)) · ‖x_0 − x‖²

    Here γ_k bounds the dynamic range of the kth residual, γ_k := ‖Ax_k − Ax‖²_2 / ‖Ax_k − Ax‖²_∞.

    . improvement over the previous result when 4γ_k < m

    15

  • Netlib LP Systems

    (figure: four plots of ‖x_k − x‖² vs. iterations on Netlib LP systems, comparing Motzkin and RK against the accelerated per-iteration bound (1 − σ²_min/(4γ_k))‖x_{k−1} − x‖² and the classical bound (1 − σ²_min/m)^k ‖x_0 − x‖²)

    16

  • Extending to SKM

    Corollary (H. - Ma 2019+)

    Let A be normalized so ‖a_i‖ = 1 for all rows i = 1, ..., m. If the system Ax = b is consistent with the unique solution x*, then the SKM method converges at least linearly in expectation, and the rate depends on the dynamic range of the random sample of rows of A, τ_j. Precisely, in the (j+1)st iteration of SKM, we have

    E_{τ_j} ‖x_{j+1} − x*‖²_2 ≤ (1 − βσ²_min(A)/(γ_j m)) ‖x_j − x*‖²_2

    where γ_j = Σ_{τ_j ∈ ([m] choose β)} ‖A_{τ_j} x_j − b_{τ_j}‖²_2 / Σ_{τ_j ∈ ([m] choose β)} ‖A_{τ_j} x_j − b_{τ_j}‖²_∞.

    17

  • Extending to SKM

    . A is 50000× 100 Gaussian matrix, consistent system

    . bound uses dynamic range of sample of β rows

    18

  • What can we say about γ_j?

    1 ≤ γ_j ≤ β

    “Best” case: Ax_k − b = e_i (a single nonzero residual entry, so γ = 1). “Worst” case: Ax_k − b = c1 (a constant residual, so γ = β).

    Table 1: Contraction terms α such that E_{τ_k} ‖e_k‖² ≤ α‖e_{k−1}‖².

          Best Case           Worst Case         Previous Best      Previous Worst
    MM    1 − σ²_min(A)       1 − σ²_min(A)/m    1 − σ²_min(A)/4    1 − σ²_min(A)/m
    SKM   1 − βσ²_min(A)/m    1 − σ²_min(A)/m
    RK    1 − σ²_min(A)/m

    Nervous? γ_k ≥ (β/m)σ²_min(A) when A is row-normalized, so the contraction term never drops below zero.

    19
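The two endpoints of 1 ≤ γ_j ≤ β can be checked directly on the residual shapes named above (β = 10 and the residual values are illustrative assumptions):

```python
import numpy as np

beta = 10

# "Best" case: residual is a standard basis vector e_i -> gamma = 1.
e = np.zeros(beta)
e[3] = 1.0
gamma_best = np.linalg.norm(e)**2 / np.linalg.norm(e, np.inf)**2

# "Worst" case: residual is a constant vector c*1 -> gamma = beta.
c = 2.5 * np.ones(beta)
gamma_worst = np.linalg.norm(c)**2 / np.linalg.norm(c, np.inf)**2
```

In the best case the greedy step removes the entire residual mass at once; in the worst case greediness buys nothing over a uniform random choice.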


  • γ_k: Gaussian systems

    γ_k ≲ nβ / log β

    20


  • Gaussian Convergence

    . A is 50000× 100 Gaussian matrix, consistent system

    21

  • γ_k: AC systems

    γ_k ≤ β Σ_{e_ij ∈ E} (x_k^(i) − x_k^(j))² / ( |E| (x_k^(β_i) − x_k^(β_j))² )

    β_i, β_j: nodes connected by the βth smallest residual

    22

  • Generalizing the Convergence Result

    . can immediately generalize to varying β (SKM with β_k)

    . to generalize to non-normalized A, we need

    • a sampling distribution that depends upon ‖a_i‖²

    • a sampling distribution that depends upon x_k

    Generalized SKM sampling distribution: p_x : ([m] choose β_k) → [0, 1)

    • t(τ, x) = argmax_{t ∈ τ} (a_t x − b_t)²  (greedy subresidual choice)

    • p_x(τ_k) = ‖a_{t(τ_k, x)}‖² / Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x)}‖²  (proportional to the norm of the selected row)

    23


  • Generalized SKM

    Given x_0 ∈ R^n:

    1. Choose τ_k ∈ ([m] choose β_k) according to p_{x_{k−1}}.

    2. Choose i_k := t(τ_k, x_{k−1}).

    3. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    4. Repeat.

    . If β_k = 1, this is the distribution in [Strohmer, Vershynin ’09].

    . If ‖a_i‖ = 1, this is the uniform distribution over ([m] choose β_k).

    24
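The weighted sampling over subsets can be demonstrated exactly when m is tiny enough to enumerate every τ ∈ ([m] choose β). A toy sketch (the matrix construction, sizes, and iteration count are all illustrative assumptions; a practical implementation would not enumerate subsets):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Tiny non-normalized consistent system, small enough to list all samples.
m, n, beta = 8, 3, 2
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
A = rng.uniform(0.5, 3.0, size=(m, 1)) * Q        # mixed row norms
x_true = rng.standard_normal(n)
b = A @ x_true

subsets = list(combinations(range(m), beta))

def t_select(tau, x):
    # t(tau, x): the index in tau with the largest squared subresidual.
    r2 = (A[list(tau)] @ x - b[list(tau)]) ** 2
    return tau[int(np.argmax(r2))]

x = np.zeros(n)
for _ in range(5000):
    # p_x(tau) proportional to the squared norm of the row greed would select.
    w = np.array([np.sum(A[t_select(t, x)] ** 2) for t in subsets])
    tau = subsets[rng.choice(len(subsets), p=w / w.sum())]
    i = t_select(tau, x)
    x += (b[i] - A[i] @ x) / np.sum(A[i] ** 2) * A[i]

gskm_error = np.linalg.norm(x - x_true)
```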


  • Generalized Result

    Theorem (H. - Ma 2019+)

    Let x* denote the unique solution to the system of equations Ax = b. Then generalized SKM converges at least linearly in expectation, and the bound on the rate depends on the dynamic range γ_k of the random sample τ_k of β_k rows of A. Precisely, in the kth iteration of generalized SKM, we have

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k (m choose β_k) σ²_min(A) / (γ_k m Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x_{k−1})}‖²)) ‖x_{k−1} − x*‖².

    . If all rows of A have the same norm, then

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k σ²_min(A) / (γ_k ‖A‖²_F)) ‖x_{k−1} − x*‖².

    25


  • Sketch of Proof

    By the Pythagorean theorem (the update is an orthogonal projection onto a hyperplane containing x*),

    ‖x_k − x*‖² = ‖x_{k−1} − x*‖² − ‖A_{τ_k} x_{k−1} − b_{τ_k}‖²_∞ / ‖a_{t(τ_k, x_{k−1})}‖².

    Taking the expectation of the loss term over τ_k:

    E_{τ_k} [ ‖A_{τ_k} x_{k−1} − b_{τ_k}‖²_∞ / ‖a_{t(τ_k, x_{k−1})}‖² ]

    = Σ_{τ ∈ ([m] choose β_k)} ( ‖a_{t(τ, x_{k−1})}‖² / Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) · ‖A_τ x_{k−1} − b_τ‖²_∞ / ‖a_{t(τ, x_{k−1})}‖²

    = ( 1 / Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) Σ_{τ ∈ ([m] choose β_k)} ‖A_τ x_{k−1} − b_τ‖²_∞

    = ( (m choose β_k) β_k / ( γ_k m Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) ) ‖A x_{k−1} − b‖².

    (figure: Pythagorean theorem diagram, https://commons.wikimedia.org/wiki/File:Pythagorean.svg)

    26
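The first (Pythagorean) identity in the sketch holds exactly for any single SKM step, and is easy to verify numerically on a made-up iterate (sizes, the iterate, and the sample are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One SKM step loses exactly ||A_tau x - b_tau||_inf^2 / ||a_t||^2 of squared
# error, because the update is an orthogonal projection.
m, n, beta = 200, 10, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x = rng.standard_normal(n)                       # some current iterate
tau = rng.choice(m, size=beta, replace=False)    # the sampled rows
r = A[tau] @ x - b[tau]
t = tau[np.argmax(np.abs(r))]                    # greedy choice within tau
x_new = x + (b[t] - A[t] @ x) / np.sum(A[t] ** 2) * A[t]

lhs = np.linalg.norm(x_new - x_true) ** 2
rhs = np.linalg.norm(x - x_true) ** 2 - np.max(np.abs(r)) ** 2 / np.sum(A[t] ** 2)
```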


  • Now can we determine the optimal β?

    Roughly, if we know the value of γj , we can (just) do it.

    27


  • Conclusions and Future Work

    . presented a convergence rate for SKM which generalizes (and improves) results for RK and Motzkin:

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k (m choose β_k) σ²_min(A) / (γ_k m Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x_{k−1})}‖²)) ‖x_{k−1} − x*‖²

    . proved a result which “easily” yields results illustrating the SKM improvement for specific types of measurement matrices

    . specialized our result to Gaussian matrices and AC systems

    . identify useful bounds on γ_k for other useful systems

    . identify the optimal β for systems where γ_k is known

    28


  • Thanks for listening!

    Questions?

    [1] J. A. De Loera, J. Haddock, and D. Needell. A sampling Kaczmarz-Motzkin

    algorithm for linear feasibility. SIAM Journal on Scientific Computing,

    39(5):S66–S87, 2017.

    [2] J. Haddock and D. Needell. On Motzkin’s method for inconsistent linear systems.

    BIT Numerical Mathematics, 59(2):387–401, 2019.

    [3] N. Loizou and P. Richtárik. Revisiting randomized gossip algorithms:

    General framework, convergence rates and novel block and accelerated

    protocols. arXiv preprint arXiv:1905.08645, 2019.

    [4] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear

    inequalities. Canadian J. Math., 6:393–404, 1954.

    [5] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with

    exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.

    29

  • Block Kaczmarz

    30