Transcript of “Greedy and Randomized Projection Methods” (jhaddock/Slides/CAMOct19.pdf)

  • Greedy and Randomized Projection Methods

    Jamie Haddock

    UCLA CAM Colloquium,

    October 30, 2019

    Computational and Applied Mathematics

    UCLA

    joint with Jesús A. De Loera, Deanna Needell, and Anna Ma

    https://arxiv.org/abs/1802.03126 (BIT Numerical Mathematics 2019)

    https://arxiv.org/abs/1605.01418 (SISC 2017)

    1

  • BIG Data

    2


  • Motivation: Average Consensus

    Image credit: Everaldo Coelho (YellowIcon), Crystal icons, LGPL, https://commons.wikimedia.org/w/index.php?curid=7288951.

    3


  • Motivation: Average Consensus

    AC solution:

    argmin_x ½‖x − c‖²  s.t.  Qx = 0,

    where Q is the oriented edge-incidence matrix of the network: the row for edge e_ij has a 1 in column i, a −1 in column j, and 0 elsewhere.

    Gossip Method: Begin with x_0 = c ∈ R^|V|:

    1. Choose e_ij ∈ E.

    2. Define x_k^(i) = x_k^(j) = (x_{k−1}^(i) + x_{k−1}^(j)) / 2.

    3. Repeat.

    Applications:

    . clock synchronization

    . PageRank

    . opinion formation

    . blockchain technology

    4
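The gossip update above can be sketched in a few lines. This is a toy illustration, not the slides' code; the cycle graph, the initial values, and the iteration count are all made-up assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cycle graph on 6 nodes; node i holds private value c[i].
n = 6
edges = [(i, (i + 1) % n) for i in range(n)]
c = rng.standard_normal(n)

x = c.copy()
for _ in range(2000):
    # 1. Choose an edge e_ij uniformly at random.
    i, j = edges[rng.integers(len(edges))]
    # 2. Both endpoints replace their values by the pairwise average.
    x[i] = x[j] = (x[i] + x[j]) / 2
    # 3. Repeat.

# Every node's value converges to the average of the initial values.
consensus_error = np.abs(x - c.mean()).max()
```

Each pairwise average preserves the network sum, so the only fixed point reachable this way is the all-average vector.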


  • General Setup

    We are interested in solving highly overdetermined systems of equations (or inequalities), Ax = b (Ax ≤ b), where A ∈ R^{m×n}, b ∈ R^m, and m ≫ n. Rows are denoted a_i^T.

    5

  • Iterative Projection Methods

    If {x ∈ R^n : Ax = b} is nonempty, these methods construct an approximation to a solution:

    1. Randomized Kaczmarz Method

    2. Motzkin’s Method

    3. Sampling Kaczmarz-Motzkin Methods (SKM)

    Applications:

    1. Tomography (Algebraic Reconstruction Technique)

    2. Linear programming

    3. Average consensus (greedy gossip with eavesdropping)

    6


  • Kaczmarz Method

    (figure: iterates x_0, x_1, x_2, x_3 projecting onto successive hyperplanes)

    Given x_0 ∈ R^n:

    1. Choose i_k ∈ [m] with probability ‖a_{i_k}‖² / ‖A‖²_F.

    2. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    3. Repeat.

    7
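The randomized Kaczmarz iteration above is a few lines of numpy. A minimal sketch on a synthetic consistent system (the dimensions and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent, highly overdetermined system Ax = b with m >> n.
m, n = 500, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

row_norms_sq = np.sum(A**2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()  # ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for _ in range(5000):
    # 1. Choose row i_k with probability ||a_{i_k}||^2 / ||A||_F^2.
    i = rng.choice(m, p=probs)
    # 2. Project onto the hyperplane a_i^T x = b_i.
    x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    # 3. Repeat.

rk_error = np.linalg.norm(x - x_true)
```

Each step is an orthogonal projection onto one constraint, so the error to any solution never increases.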


  • Motzkin’s Method

    (figure: iterates x_0, x_1, x_2 projecting onto the most violated hyperplane)

    Given x_0 ∈ R^n:

    1. Choose i_k ∈ [m] as i_k := argmax_{i ∈ [m]} |a_i^T x_{k−1} − b_i|.

    2. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    3. Repeat.

    8
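Motzkin's method differs only in the selection rule: project onto the most violated constraint. A sketch on the same kind of synthetic system (sizes are illustrative assumptions), with rows normalized as in the later rate statements:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 500, 20
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(2000):
    # 1. Greedy choice: i_k = argmax_i |a_i^T x - b_i|.
    r = A @ x - b
    i = np.argmax(np.abs(r))
    # 2. Project onto that hyperplane (rows have unit norm).
    x -= r[i] * A[i]
    # 3. Repeat.

mm_error = np.linalg.norm(x - x_true)
```

The full residual recomputation each iteration is the price of the greedy choice; that cost is exactly what SKM's sampling amortizes.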


  • Our Hybrid Method (SKM)

    (figure: iterates x_0, x_1, x_2)

    Given x_0 ∈ R^n:

    1. Choose τ_k ⊂ [m] to be a sample of β constraints chosen uniformly at random among the rows of A.

    2. From the β rows, choose i_k := argmax_{i ∈ τ_k} |a_i^T x_{k−1} − b_i|.

    3. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    4. Repeat.

    9
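The SKM hybrid interpolates between the two: sample β rows, then act greedily within the sample. A sketch under the same illustrative assumptions (synthetic system, β = 50 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, beta = 500, 20, 50
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.zeros(n)
for _ in range(3000):
    # 1. Sample beta row indices uniformly at random.
    tau = rng.choice(m, size=beta, replace=False)
    # 2. Among the sample, pick the most violated constraint.
    r = A[tau] @ x - b[tau]
    i = tau[np.argmax(np.abs(r))]
    # 3. Project onto that hyperplane.
    x -= (A[i] @ x - b[i]) * A[i]
    # 4. Repeat.

skm_error = np.linalg.norm(x - x_true)
```

β = 1 recovers (uniform) Kaczmarz and β = m recovers Motzkin, so the sample size trades per-iteration cost against greediness.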


  • Glimpse of HUGE Body of Literature

    RK: [Strohmer-Vershynin ’09], [Needell-Srebro-Ward ’16]

    Greedy: [Censor ’81], [Nutini et al ’16], [Bai-Wu ’18], [Du-Gao ’19]

    Accel.: [Hanke-Niethammer ’90], [Liu-Wright ’16], [Morshed-Islam ’19]

    Block: [Popa et al ’12], [Needell-Tropp ’14], [Needell-Zhao-Zouzias ’15]

    Sketching: [Gower-Richtárik ’15], [Needell-Rebrova ’19]

    Phase retrieval: [Tan-Vershynin ’17], [Jeong-Güntürk ’17]

    LP: [Motzkin-Schoenberg ’54], [Agmon ’54], [Goffin ’80], [Chubanov ’12]

    10

  • Experimental Convergence

    . β: sample size

    . A is 50000× 100 Gaussian matrix, consistent system

    . ‘faster’ convergence for larger sample size

    11


  • Convergence Rates

    Below are the convergence rates for the methods on a consistent system Ax = b with unique solution x, whose rows have been normalized to have unit norm.

    . RK (Strohmer - Vershynin ’09): E‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    . MM (Agmon ’54): ‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    . SKM (De Loera - H. - Needell ’17): E‖x_k − x‖²_2 ≤ (1 − σ²_min(A)/m)^k ‖x_0 − x‖²_2

    Why are these all the same?

    12
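The shared contraction factor 1 − σ²_min(A)/m is easy to compute for a concrete matrix. A sketch for a row-normalized Gaussian matrix (dimensions are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Contraction factor 1 - sigma_min(A)^2 / m for a row-normalized matrix.
m, n = 500, 20
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)

sigma_min = np.linalg.svd(A, compute_uv=False).min()
rate = 1 - sigma_min**2 / m  # strictly between 0 and 1 for full-rank A
```

Since ‖A‖²_F = m for unit-norm rows, σ²_min(A) ≤ m/n, so for m ≫ n this factor sits very close to 1; that is what the dynamic-range analysis later improves on.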


  • A Pathological Example

    Because.

    (figure: a pathological system, shown with starting iterate x_0)

    13

  • Structure of the Residual

    Several works have used sparsity of the residual to improve the convergence rate of greedy methods.

    [De Loera, H., Needell ’17], [Bai, Wu ’18], [Du, Gao ’19]

    However, not much sparsity can be expected in most cases. Instead, we’d like to use the dynamic range of the residual to guarantee faster convergence:

    γ_k := ‖Ax_k − Ax‖²_2 / ‖Ax_k − Ax‖²_∞

    14
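The dynamic range is a one-liner to compute; a sketch on a made-up iterate of a synthetic consistent system (all sizes and the iterate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# gamma_k = ||A x_k - A x||_2^2 / ||A x_k - A x||_inf^2 for a consistent system.
m, n = 500, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
xk = rng.standard_normal(n)  # stand-in for some iterate x_k

residual = A @ (xk - x_true)
gamma = np.linalg.norm(residual)**2 / np.linalg.norm(residual, np.inf)**2
```

A sum of m squares is at least its largest term and at most m times it, so γ always lies in [1, m]: γ near 1 means one dominant residual entry (great for greedy selection), γ near m means a flat residual.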


  • An Accelerated Convergence Rate

    Theorem (H. - Needell ’19)

    Let x denote the solution of the consistent, normalized system Ax = b. Motzkin’s method exhibits the (possibly highly accelerated) convergence rate:

    ‖x_T − x‖² ≤ ∏_{k=0}^{T−1} (1 − σ²_min(A)/(4γ_k)) · ‖x_0 − x‖²

    Here γ_k bounds the dynamic range of the kth residual, γ_k := ‖Ax_k − Ax‖²_2 / ‖Ax_k − Ax‖²_∞.

    . improvement over the previous result when 4γ_k < m

    15

  • Netlib LP Systems

    (figure: four plots of ‖x_k − x‖² vs. iterations on Netlib LP systems, comparing Motzkin and RK against the accelerated per-iteration bound (1 − σ²_min/(4γ_k))‖x_{k−1} − x‖² and the classical bound (1 − σ²_min/m)^k ‖x_0 − x‖²)

    16

  • Extending to SKM

    Corollary (H. - Ma 2019+)

    Let A be normalized so ‖a_i‖ = 1 for all rows i = 1, ..., m. If the system Ax = b is consistent with the unique solution x*, then the SKM method converges at least linearly in expectation, and the rate depends on the dynamic range of the random sample of rows of A, τ_j. Precisely, in the (j+1)st iteration of SKM, we have

    E_{τ_j} ‖x_{j+1} − x*‖²_2 ≤ (1 − βσ²_min(A)/(γ_j m)) ‖x_j − x*‖²_2

    where γ_j = Σ_{τ_j ∈ ([m] choose β)} ‖A_{τ_j} x_j − b_{τ_j}‖²_2 / Σ_{τ_j ∈ ([m] choose β)} ‖A_{τ_j} x_j − b_{τ_j}‖²_∞.

    17

  • Extending to SKM

    . A is 50000× 100 Gaussian matrix, consistent system

    . bound uses dynamic range of sample of β rows

    18

  • What can we say about γ_j?

    1 ≤ γ_j ≤ β

    “Best” case: Ax_k − b = e_i (a single nonzero residual entry, so γ = 1). “Worst” case: Ax_k − b = c1 (a constant residual, so γ = β).

    Table 1: Contraction terms α such that E_{τ_k} ‖e_k‖² ≤ α‖e_{k−1}‖².

          Best Case           Worst Case         Previous Best      Previous Worst
    MM    1 − σ²_min(A)       1 − σ²_min(A)/m    1 − σ²_min(A)/4    1 − σ²_min(A)/m
    SKM   1 − βσ²_min(A)/m    1 − σ²_min(A)/m
    RK    1 − σ²_min(A)/m

    Nervous? γ_k ≥ (β/m)σ²_min(A) when A is row-normalized, so the contraction term never drops below zero.

    19
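The two endpoints of 1 ≤ γ_j ≤ β can be checked directly on the residual shapes named above (β = 10 and the residual values are illustrative assumptions):

```python
import numpy as np

beta = 10

# "Best" case: residual is a standard basis vector e_i -> gamma = 1.
e = np.zeros(beta)
e[3] = 1.0
gamma_best = np.linalg.norm(e)**2 / np.linalg.norm(e, np.inf)**2

# "Worst" case: residual is a constant vector c*1 -> gamma = beta.
c = 2.5 * np.ones(beta)
gamma_worst = np.linalg.norm(c)**2 / np.linalg.norm(c, np.inf)**2
```

In the best case the greedy step removes the entire residual mass at once; in the worst case greediness buys nothing over a uniform random choice.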


  • γ_k: Gaussian systems

    γ_k ≲ nβ / log β

    20


  • Gaussian Convergence

    . A is 50000× 100 Gaussian matrix, consistent system

    21

  • γ_k: AC systems

    γ_k ≤ β Σ_{e_ij ∈ E} (x_k^(i) − x_k^(j))² / ( |E| (x_k^(β_i) − x_k^(β_j))² )

    β_i, β_j: nodes connected by the βth smallest residual

    22

  • Generalizing the Convergence Result

    . can immediately generalize to varying β (SKM with β_k)

    . to generalize to non-normalized A, we need

    • a sampling distribution that depends upon ‖a_i‖²

    • a sampling distribution that depends upon x_k

    Generalized SKM sampling distribution: p_x : ([m] choose β_k) → [0, 1)

    • t(τ, x) = argmax_{t ∈ τ} (a_t x − b_t)²  (greedy subresidual choice)

    • p_x(τ_k) = ‖a_{t(τ_k, x)}‖² / Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x)}‖²  (proportional to the norm of the selected row)

    23


  • Generalized SKM

    Given x_0 ∈ R^n:

    1. Choose τ_k ∈ ([m] choose β_k) according to p_{x_{k−1}}.

    2. Choose i_k := t(τ_k, x_{k−1}).

    3. Define x_k := x_{k−1} + ((b_{i_k} − a_{i_k}^T x_{k−1}) / ‖a_{i_k}‖²) a_{i_k}.

    4. Repeat.

    . If β_k = 1, this is the distribution in [Strohmer, Vershynin ’09].

    . If ‖a_i‖ = 1, this is the uniform distribution over ([m] choose β_k).

    24
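The weighted sampling over subsets can be demonstrated exactly when m is tiny enough to enumerate every τ ∈ ([m] choose β). A toy sketch (the matrix construction, sizes, and iteration count are all illustrative assumptions; a practical implementation would not enumerate subsets):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Tiny non-normalized consistent system, small enough to list all samples.
m, n, beta = 8, 3, 2
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
A = rng.uniform(0.5, 3.0, size=(m, 1)) * Q        # mixed row norms
x_true = rng.standard_normal(n)
b = A @ x_true

subsets = list(combinations(range(m), beta))

def t_select(tau, x):
    # t(tau, x): the index in tau with the largest squared subresidual.
    r2 = (A[list(tau)] @ x - b[list(tau)]) ** 2
    return tau[int(np.argmax(r2))]

x = np.zeros(n)
for _ in range(5000):
    # p_x(tau) proportional to the squared norm of the row greed would select.
    w = np.array([np.sum(A[t_select(t, x)] ** 2) for t in subsets])
    tau = subsets[rng.choice(len(subsets), p=w / w.sum())]
    i = t_select(tau, x)
    x += (b[i] - A[i] @ x) / np.sum(A[i] ** 2) * A[i]

gskm_error = np.linalg.norm(x - x_true)
```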


  • Generalized Result

    Theorem (H. - Ma 2019+)

    Let x* denote the unique solution to the system of equations Ax = b. Then generalized SKM converges at least linearly in expectation, and the bound on the rate depends on the dynamic range γ_k of the random sample τ_k of β_k rows of A. Precisely, in the kth iteration of generalized SKM, we have

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k (m choose β_k) σ²_min(A) / (γ_k m Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x_{k−1})}‖²)) ‖x_{k−1} − x*‖².

    . If all rows of A have the same norm, then

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k σ²_min(A) / (γ_k ‖A‖²_F)) ‖x_{k−1} − x*‖².

    25


  • Sketch of Proof

    By the Pythagorean theorem (the update is an orthogonal projection onto a hyperplane containing x*),

    ‖x_k − x*‖² = ‖x_{k−1} − x*‖² − ‖A_{τ_k} x_{k−1} − b_{τ_k}‖²_∞ / ‖a_{t(τ_k, x_{k−1})}‖².

    Taking the expectation of the loss term over τ_k:

    E_{τ_k} [ ‖A_{τ_k} x_{k−1} − b_{τ_k}‖²_∞ / ‖a_{t(τ_k, x_{k−1})}‖² ]

    = Σ_{τ ∈ ([m] choose β_k)} ( ‖a_{t(τ, x_{k−1})}‖² / Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) · ‖A_τ x_{k−1} − b_τ‖²_∞ / ‖a_{t(τ, x_{k−1})}‖²

    = ( 1 / Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) Σ_{τ ∈ ([m] choose β_k)} ‖A_τ x_{k−1} − b_τ‖²_∞

    = ( (m choose β_k) β_k / ( γ_k m Σ_{π ∈ ([m] choose β_k)} ‖a_{t(π, x_{k−1})}‖² ) ) ‖A x_{k−1} − b‖².

    (figure: Pythagorean theorem diagram, https://commons.wikimedia.org/wiki/File:Pythagorean.svg)

    26
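The first (Pythagorean) identity in the sketch holds exactly for any single SKM step, and is easy to verify numerically on a made-up iterate (sizes, the iterate, and the sample are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One SKM step loses exactly ||A_tau x - b_tau||_inf^2 / ||a_t||^2 of squared
# error, because the update is an orthogonal projection.
m, n, beta = 200, 10, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x = rng.standard_normal(n)                       # some current iterate
tau = rng.choice(m, size=beta, replace=False)    # the sampled rows
r = A[tau] @ x - b[tau]
t = tau[np.argmax(np.abs(r))]                    # greedy choice within tau
x_new = x + (b[t] - A[t] @ x) / np.sum(A[t] ** 2) * A[t]

lhs = np.linalg.norm(x_new - x_true) ** 2
rhs = np.linalg.norm(x - x_true) ** 2 - np.max(np.abs(r)) ** 2 / np.sum(A[t] ** 2)
```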


  • Now can we determine the optimal β?

    Roughly, if we know the value of γj , we can (just) do it.

    27


  • Conclusions and Future Work

    . presented a convergence rate for SKM which generalizes (and improves) results for RK and Motzkin:

    E_{τ_k} ‖x_k − x*‖² ≤ (1 − β_k (m choose β_k) σ²_min(A) / (γ_k m Σ_{τ ∈ ([m] choose β_k)} ‖a_{t(τ, x_{k−1})}‖²)) ‖x_{k−1} − x*‖²

    . proved a result which “easily” yields results illustrating the SKM improvement for specific types of measurement matrices

    . specialized our result to Gaussian matrices and AC systems

    . identify useful bounds on γ_k for other useful systems

    . identify the optimal β for systems where γ_k is known

    28


  • Thanks for listening!

    Questions?

    [1] J. A. De Loera, J. Haddock, and D. Needell. A sampling Kaczmarz-Motzkin

    algorithm for linear feasibility. SIAM Journal on Scientific Computing,

    39(5):S66–S87, 2017.

    [2] J. Haddock and D. Needell. On Motzkin’s method for inconsistent linear systems.

    BIT Numerical Mathematics, 59(2):387–401, 2019.

    [3] N. Loizou and P. Richtárik. Revisiting randomized gossip algorithms:

    General framework, convergence rates and novel block and accelerated

    protocols. arXiv preprint arXiv:1905.08645, 2019.

    [4] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear

    inequalities. Canadian J. Math., 6:393–404, 1954.

    [5] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with

    exponential convergence. J. Fourier Anal. Appl., 15:262–278, 2009.

    29

  • Block Kaczmarz

    30