Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra...

221
Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl´ ement Pernet Universit´ e Grenoble Alpes, Inria, LIP-AriC July 6, 2015 C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 1 / 73

Transcript of Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra...

Page 1: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact Linear Algebra Algorithmic: Theory and PracticeISSAC’15 Tutorial

Clement Pernet

Universite Grenoble Alpes, Inria, LIP-AriC

July 6, 2015

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 1 / 73

Page 2: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Matrices can be

Dense: store all coefficients

Sparse: store the non-zero coefficients only

Black-box: no access to the storage, only apply to a vector

Coefficient domains:

Word size: I integers with a priori boundsI Z/pZ for p of ≈ 32 bits

Multi-precision: Z/pZ for p of ≈ 100, 200, 1000, 2000, . . . bits

Arbitrary precision: Z,QPolynomials: K[X] for K any of the above

Several implemenations for the same domain: better fits FFT, LinAlg, etc

Need to structure the design.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 2 / 73

Page 3: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Matrices can be

Dense: store all coefficients

Sparse: store the non-zero coefficients only

Black-box: no access to the storage, only apply to a vector

Coefficient domains:

Word size: I integers with a priori boundsI Z/pZ for p of ≈ 32 bits

Multi-precision: Z/pZ for p of ≈ 100, 200, 1000, 2000, . . . bits

Arbitrary precision: Z,QPolynomials: K[X] for K any of the above

Several implemenations for the same domain: better fits FFT, LinAlg, etc

Need to structure the design.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 2 / 73

Page 4: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Matrices can be

Dense: store all coefficients

Sparse: store the non-zero coefficients only

Black-box: no access to the storage, only apply to a vector

Coefficient domains:

Word size: I integers with a priori boundsI Z/pZ for p of ≈ 32 bits

Multi-precision: Z/pZ for p of ≈ 100, 200, 1000, 2000, . . . bits

Arbitrary precision: Z,QPolynomials: K[X] for K any of the above

Several implemenations for the same domain: better fits FFT, LinAlg, etc

Need to structure the design.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 2 / 73

Page 5: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Matrices can be

Dense: store all coefficients

Sparse: store the non-zero coefficients only

Black-box: no access to the storage, only apply to a vector

Coefficient domains:

Word size: I integers with a priori boundsI Z/pZ for p of ≈ 32 bits

Multi-precision: Z/pZ for p of ≈ 100, 200, 1000, 2000, . . . bits

Arbitrary precision: Z,QPolynomials: K[X] for K any of the above

Several implemenations for the same domain: better fits FFT, LinAlg, etc

Need to structure the design.C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 2 / 73

Page 6: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Motivations

Comp. Number Theory: CharPoly, LinSys, Echelon, over Z,Q,Z/pZ, Dense

Graph Theory: MatMul, CharPoly, Det, over Z, Sparse

Discrete log.: LinSys, over Z/pZ, p ≈ 120 bits, Sparse

Integer Factorization: NullSpace, over Z/2Z, Sparse

Algebraic Attacks: Echelon, LinSys, over Z/pZ, p ≈ 20 bits, Sparse & Dense

List decoding of RS codes: Lattice reduction, over GF(q)[X], Structured

Need for high performance.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 3 / 73

Page 7: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Exact linear algebra

Motivations

Comp. Number Theory: CharPoly, LinSys, Echelon, over Z,Q,Z/pZ, Dense

Graph Theory: MatMul, CharPoly, Det, over Z, Sparse

Discrete log.: LinSys, over Z/pZ, p ≈ 120 bits, Sparse

Integer Factorization: NullSpace, over Z/2Z, Sparse

Algebraic Attacks: Echelon, LinSys, over Z/pZ, p ≈ 20 bits, Sparse & Dense

List decoding of RS codes: Lattice reduction, over GF(q)[X], Structured

Need for high performance.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 3 / 73

Page 8: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Content

The scope of this presentation:

I not an exhaustive overview on linear algebra algorithmic andcomplexity improvements

I a few guidelines, for the use and design of exact linear algebra inpractice

I bridging the theoretical algorihmic development and practicalefficiency concerns

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 4 / 73

Page 9: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Outline

1 Choosing the underlying arithmeticUsing boolean arithmeticUsing machine word arithmeticLarger field sizes

2 Reductions and building blocksIn dense linear algebraIn blackbox linear algebra

3 Size dimension trade-offsHermite normal formFrobenius normal form

4 Parallel exact linear algebraIngredients for the parallelizationParallel dense linear algebra mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 5 / 73

Page 10: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Outline

1 Choosing the underlying arithmeticUsing boolean arithmeticUsing machine word arithmeticLarger field sizes

2 Reductions and building blocksIn dense linear algebraIn blackbox linear algebra

3 Size dimension trade-offsHermite normal formFrobenius normal form

4 Parallel exact linear algebraIngredients for the parallelizationParallel dense linear algebra mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 6 / 73

Page 11: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Achieving high practical efficiency

Most of linear algebra operations boil down to (a lot of)

y ← y ± a * b

I dot-product

I matrix-matrix multiplication

I rank 1 update in Gaussian elimination

I Schur complements, . . .

Efficiency relies on

I fast arithmetic

I fast memory accesses

Here: focus on dense linear algebra

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 7 / 73

Page 12: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit

bit-packing

Z3,5,7 2-3 bits

bit-slicing, bit-packing

Zp 4-26 bits

CPU arithmetic

Z,Q > 32 bits

multiprec. ints, big ints,CRT, lifting

Zp > 32 bits

multiprec. ints, big ints, CRTGF(pk) ≡ Zp[X]/(Q) Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic

I boolean

I integer (fixed size)

I floating point

I .. and their vectorization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 8 / 73

Page 13: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit

bit-packing

Z3,5,7 2-3 bits

bit-slicing, bit-packing

Zp 4-26 bits

CPU arithmetic

Z,Q > 32 bits

multiprec. ints, big ints,CRT, lifting

Zp > 32 bits

multiprec. ints, big ints, CRTGF(pk) ≡ Zp[X]/(Q) Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic

I boolean

I integer (fixed size)

I floating point

I .. and their vectorization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 8 / 73

Page 14: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit bit-packingZ3,5,7 2-3 bits bit-slicing, bit-packingZp 4-26 bits CPU arithmeticZ,Q > 32 bits multiprec. ints, big ints,CRT, liftingZp > 32 bits multiprec. ints, big ints, CRT

GF(pk) ≡ Zp[X]/(Q) Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic

I boolean

I integer (fixed size)

I floating point

I .. and their vectorization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 8 / 73

Page 15: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit bit-packingZ3,5,7 2-3 bits bit-slicing, bit-packingZp 4-26 bits CPU arithmeticZ,Q > 32 bits multiprec. ints, big ints,CRT, liftingZp > 32 bits multiprec. ints, big ints, CRTGF(pk) ≡ Zp[X]/(Q) Polynomial, Kronecker, Zech log, ...

Available CPU arithmetic

I boolean

I integer (fixed size)

I floating point

I .. and their vectorization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 8 / 73

Page 16: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z2: bit-packing

uint64 t ≡ (Z2)64 ^ : bit-wise XOR, (+ mod 2)

& : bit-wise AND, (* mod 2)

dot-product (a,b)

uint64_t t = 0;

for (int k=0; k < N/64; ++k)

t ^= a[k] & b[k];

c = parity(t)

parity(x)

if (size(x) == 1)

return x;

else return parity (High(x) ^ Low(x))

Can be parallelized on 64 instances.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 9 / 73

Page 17: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Tabulation:

I avoid computing parities

I balance computation vs communication

I (slight) complexity improvement possible

The Four Russian method [Arlazarov, Dinic, Kronrod, Faradzev 70]

1 compute all 2k linearcombinations of k rows of B.

Gray code: each new linecosts 1 vector add (vs k/2)

2 multiply chunks of length kof A by table look-up

1 10

1 111 00

1 100 101 10

0 100 110 011 011 111 101 00

0 00

B1B2B3

0

B3B2+B3

B2B1+B2

B1+B2+B3B1+B3

B1

I For k = log n O(n3/ log n).

I In pratice: choice of k s.t. the table fits in L2 cache.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 10 / 73

Page 18: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Tabulation:

I avoid computing parities

I balance computation vs communication

I (slight) complexity improvement possible

The Four Russian method [Arlazarov, Dinic, Kronrod, Faradzev 70]

1 compute all 2k linearcombinations of k rows of B.Gray code: each new linecosts 1 vector add (vs k/2)

2 multiply chunks of length kof A by table look-up

1 10

1 111 00

1 100 101 10

0 100 110 011 011 111 101 00

0 00

B1B2B3

0

B3B2+B3

B2B1+B2

B1+B2+B3B1+B3

B1

I For k = log n O(n3/ log n).

I In pratice: choice of k s.t. the table fits in L2 cache.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 10 / 73

Page 19: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Tabulation:

I avoid computing parities

I balance computation vs communication

I (slight) complexity improvement possible

The Four Russian method [Arlazarov, Dinic, Kronrod, Faradzev 70]

1 compute all 2k linearcombinations of k rows of B.Gray code: each new linecosts 1 vector add (vs k/2)

2 multiply chunks of length kof A by table look-up

1 10

1 111 00

1 100 101 10

0 100 110 011 011 111 101 00

0 00

B1B2B3

0

B3B2+B3

B2B1+B2

B1+B2+B3B1+B3

B1

I For k = log n O(n3/ log n).

I In pratice: choice of k s.t. the table fits in L2 cache.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 10 / 73

Page 20: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z2

The M4RI library [Albrecht Bard Hart 10]

I bit-packing

I Method of the Four Russians

I SIMD vectorization of boolean instructions (128 bits registers)

I Cache optimization

I Strassen’s O(n2.81) algorithm

n 7000 14 000 28 000

SIMD bool arithmetic 2.109s 15.383s 111.82SIMD + 4 Russians 0.256s 2.829s 29.28sSIMD + 4 Russians + Strassen 0.257s 2.001s 15.73

Table : Matrix product n× n by n× n, on an i5 SandyBridge 2.6Ghz.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 11 / 73

Page 21: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10}

add/sub in 7 bool ops= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 22: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10} add/sub in 7 bool ops

= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 23: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10} add/sub in 7 bool ops= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 24: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10} add/sub in 7 bool ops= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 25: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10} add/sub in 7 bool ops= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 26: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using boolean arithmetic

Dense linear algebra over Z3,Z5 [Boothby & Bradshaw 09]

Z3 = {0, 1,−1} = {00, 01, 10} add/sub in 7 bool ops= {00, 10, 11} add/sub in 6 bool ops

Bit-slicing

(−1, 0, 1, 0, 1,−1,−1, 0) ∈ Z83 → (11, 00, 10, 00, 10, 11, 00)

Stored as 2 words(1,0,1,0,1,1,0)(1,0,0,0,0,1,0)

~y ← ~y + x~b for x ∈ Z3, ~y,~b ∈ Z643 in 6 boolean word ops.

Recipe for Z5

I Use redundant representations on 3 bits + bit-slicing

I integer add + bool operations

I Pseudo-reduction mod 5 (4→ 3 bits) in 8 bool ops found bycomputer assisted search.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 12 / 73

Page 27: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic

2 Reduce modulo p only when necessary

When to reduce ?

Bound the values of all intermediate computations.

I A priori:

Representation of Zp {0 . . . p− 1} {− p−12

. . . p−12}

Scalar product, Classic MatMul n(p− 1)2 n(p−1

2

)2

Strassen-Winograd MatMul (` rec. levels) ( 1+3`

2)2b n

2` c (p− 1)2 9`b n2` c(p−1

2

)2I Maintain locally a bounding interval on all matrices computed

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 13 / 73

Page 28: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic

2 Reduce modulo p only when necessary

When to reduce ?

Bound the values of all intermediate computations.

I A priori:

Representation of Zp {0 . . . p− 1} {− p−12

. . . p−12}

Scalar product, Classic MatMul n(p− 1)2 n(p−1

2

)2

Strassen-Winograd MatMul (` rec. levels) ( 1+3`

2)2b n

2` c (p− 1)2 9`b n2` c(p−1

2

)2I Maintain locally a bounding interval on all matrices computed

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 13 / 73

Page 29: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic

2 Reduce modulo p only when necessary

When to reduce ?

Bound the values of all intermediate computations.

I A priori:

Representation of Zp {0 . . . p− 1} {− p−12

. . . p−12}

Scalar product, Classic MatMul n(p− 1)2 n(p−1

2

)2Strassen-Winograd MatMul (` rec. levels) ( 1+3`

2)2b n

2` c (p− 1)2 9`b n2` c(p−1

2

)2

I Maintain locally a bounding interval on all matrices computed

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 13 / 73

Page 30: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic

2 Reduce modulo p only when necessary

When to reduce ?

Bound the values of all intermediate computations.

I A priori:

Representation of Zp {0 . . . p− 1} {− p−12

. . . p−12}

Scalar product, Classic MatMul n(p− 1)2 n(p−1

2

)2Strassen-Winograd MatMul (` rec. levels) ( 1+3`

2)2b n

2` c (p− 1)2 9`b n2` c(p−1

2

)2I Maintain locally a bounding interval on all matrices computed

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 13 / 73

Page 31: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units

+ vectorization

y += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

+ vectorization

y += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 32: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units

+ vectorization

y += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

+ vectorization

y += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 33: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorizationy += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

+ vectorization

y += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 34: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorizationy += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

+ vectorization

y += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 35: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorizationy += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

+ vectorization

y += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 36: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorizationy += a * b: correct if |ab+ y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax

imulq -56(%rbp), %rax

addq %rax, %rcx

movq -80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0

vpaddq %xmm2,%xmm0,%xmm0

vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units + vectorizationy += a * b: correct if |ab+ y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0

mulsd -56(%rbp), %xmm0

addsd %xmm0, %xmm1

movq %xmm1, %rax

vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0

vmulpd %ymm1, %ymm0, %ymm0

vaddpd (%rsi,%rax),%ymm0, %ymm0

vmovapd %ymm0, (%rsi,%rax)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 14 / 73

Page 37: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients

SIMD: Single Instruction Multiple Data:1 arith. unit acting on a vector of data

MMX 64 bitsSSE 128bitsAVX 256 bitsAVX-512 512 bits

4 x 64 = 256 bits

+

y[3]

x[3]x[2]

y[2]y[1]

x[1]x[0]

y[0]

x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

Pipeline: amortize the latency of an operation when used repeatedlythroughput of 1 op/ Cycle for all

arithmetic ops considered here

Execution Unit parallelism: multiple arith. units acting simulatneously ondistinct registers

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 15 / 73

Page 38: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients

SIMD: Single Instruction Multiple Data:1 arith. unit acting on a vector of data

MMX 64 bitsSSE 128bitsAVX 256 bitsAVX-512 512 bits

4 x 64 = 256 bits

+

y[3]

x[3]x[2]

y[2]y[1]

x[1]x[0]

y[0]

x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

Pipeline: amortize the latency of an operation when used repeatedlythroughput of 1 op/ Cycle for all

arithmetic ops considered here

Execution Unit parallelism: multiple arith. units acting simulatneously ondistinct registers

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 15 / 73

Page 39: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients

SIMD: Single Instruction Multiple Data:1 arith. unit acting on a vector of data

MMX 64 bitsSSE 128bitsAVX 256 bitsAVX-512 512 bits

4 x 64 = 256 bits

+

y[3]

x[3]x[2]

y[2]y[1]

x[1]x[0]

y[0]

x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

Pipeline: amortize the latency of an operation when used repeatedlythroughput of 1 op/ Cycle for all

arithmetic ops considered here

Execution Unit parallelism: multiple arith. units acting simulatneously ondistinct registers

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 15 / 73

Page 40: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Sandybridge micro-architecture

FMULPort 0

FADDPort 1

Port 5

4 x 64 = 256 bits Scheduler

2 x 64 = 128 bits

Int ADD

Int MUL

Int ADD

Performs at every clock cycle:

I 1 Floating Pt. Mul × 4

I 1 Floating Pt. Add × 4

Or:

I 1 Integer Mul × 2

I 2 Integer Add × 2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 16 / 73

Page 41: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Haswell micro-architecture

FMA

Int MUL

Port 0

FMA

Int ADD

Port 1

Int ADDPort 5

4 x 64 = 256 bits Scheduler

Performs at every clock cycle:

I 1 Floating Pt. Mul & Add × 4

I 1 Floating Pt. Add & Add × 4

Or:

I 1 Integer Mul × 4

I 2 Integer Add × 4

FMA: Fused Multiplying & Accumulate, c += a * b

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 17 / 73

Page 42: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

AMD Bulldozer micro-architecture

Port 0

Port 1

Port 3

Scheduler

2 x 64 = 128 bits

Int MUL

Int ADD

FMA

FMA

Int ADDPort 2

2 x 64 = 128 bits

Performs at every clock cycle:

I 2 Floating Pt. Mul & Add × 2

Or:

I 1 Integer Mul × 2

I 2 Integer Add × 2

FMA: Fused Multiplying & Accumulate, c += a * b

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 18 / 73

Page 43: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Nehalem micro-architecture

Port 0

Port 1

Port 5

Scheduler

2 x 64 = 128 bits

Int MUL

Int ADD

FADD

FMUL

2 x 64 = 128 bits

Int ADD

Performs at every clock cycle:

I 1 Floating Pt. Mul × 2

I 1 Floating Pt. Add × 2

Or:

I 1 Integer Mul × 2

I 2 Integer Add × 2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 19 / 73

Page 44: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28

23.3

AVX2 FP 256 2 8 3.5 56

49.2

Intel Sandybridge INT

128 2 1 2 3.3 13.2 12.1

AVX1 FP

256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT

128 2 1 2 2.1 8.4 6.44

FMA4 FP

128 2 4 2.1 16.8 13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 45: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT

128 2 1 2 3.3 13.2 12.1

AVX1 FP

256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT

128 2 1 2 2.1 8.4 6.44

FMA4 FP

128 2 4 2.1 16.8 13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 46: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2

12.1

AVX1 FP 256 1 1 4 3.3 26.4

24.6

AMD Bulldozer INT

128 2 1 2 2.1 8.4 6.44

FMA4 FP

128 2 4 2.1 16.8 13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 47: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT

128 2 1 2 2.1 8.4 6.44

FMA4 FP

128 2 4 2.1 16.8 13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 48: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4

6.44

FMA4 FP 128 2 4 2.1 16.8

13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 49: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44FMA4 FP 128 2 4 2.1 16.8 13.1

Intel Nehalem INT

128 2 1 2 2.66 10.6 4.47

SSE4 FP

128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 50: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44FMA4 FP 128 2 4 2.1 16.8 13.1

Intel Nehalem INT 128 2 1 2 2.66 10.6

4.47

SSE4 FP 128 1 1 2 2.66 10.6

9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 51: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44FMA4 FP 128 2 4 2.1 16.8 13.1

Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47SSE4 FP 128 1 1 2 2.66 10.6 9.6

AMD K10 INT

64 2 1 1 2.4 4.8

SSE4a FP

128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 52: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44FMA4 FP 128 2 4 2.1 16.8 13.1

Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47SSE4 FP 128 1 1 2 2.66 10.6 9.6

AMD K10 INT 64 2 1 1 2.4 4.8SSE4a FP 128 1 1 2 2.4 9.6

8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 53: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Reg

iste

rsi

ze

#A

dd

ers

#M

ult

iplie

rs

#F

MA

#d

axp

y/

Cyc

le

CP

UF

req

.(G

hz)

Sp

eed

of

Lig

ht

(Gfo

ps)

Sp

eed

inp

ract

ice

(Gfo

ps)

Intel Haswell INT 256 2 1 4 3.5 28 23.3AVX2 FP 256 2 8 3.5 56 49.2

Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1AVX1 FP 256 1 1 4 3.3 26.4 24.6

AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44FMA4 FP 128 2 4 2.1 16.8 13.1

Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47SSE4 FP 128 1 1 2 2.66 10.6 9.6

AMD K10 INT 64 2 1 1 2.4 4.8SSE4a FP 128 1 1 2 2.4 9.6 8.93

Speed of light: CPU freq × ( # daxpy / Cycle) ×2

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 20 / 73

Page 54: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers: ressources

Micro-architecture bible: Agner Fog’s software optimization resources[www.agner.org/optimize]

Experiments:

dgemm (double): OpenBLAS [http://www.openblas.net/]

igemm (int64 t): Eigen [http://eigen.tuxfamily.org/] &FFLAS-FFPACK [linalg.org/projects/fflas-ffpack]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 21 / 73

Page 55: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Integer Packing

32 bits: half the precision twice the speed

double

float float float float float float float float

double double double

Gfops double float int64 t int32 t

Intel SandyBridge 24.7 49.1 12.1 24.7Intel Haswell 49.2 77.6 23.3 27.4AMD Bulldozer 13.0 20.7 6.63 11.8

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 22 / 73

Page 56: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25 30 35

Sp

ee

d (

Gfo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod p

0

100

200

300

400

500

600

0 5 10 15 20 25 30 35

Bit s

peed (

Gbops)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod p

SandyBridge [email protected]. n = 2000.

Take home message

I Floating pt. arith. delivers the highest speed (except in corner cases)

I 32 bits twice as fast as 64 bits

I best bit computation throughput for double precision floating points.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 23 / 73

Page 57: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25 30 35

Sp

ee

d (

Gfo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod p

0

100

200

300

400

500

600

0 5 10 15 20 25 30 35

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod p

SandyBridge [email protected]. n = 2000.

Take home message

I Floating pt. arith. delivers the highest speed (except in corner cases)

I 32 bits twice as fast as 64 bits

I best bit computation throughput for double precision floating points.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 23 / 73

Page 58: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Larger field sizes

Larger finite fields: log2 p ≥ 32

As before:1 Use adequate integer arithmetic2 reduce modulo p only when necessary

Which integer arithmetic?

1 big integers (GMP)

2 fixed size multiprecision (Givaro-RecInt)

3 Residue Number Systems (Chinese Remainder theorem) using moduli delivering optimum bitspeed

log2 p 50 100 150

GMP 58.1s 94.6s 140sRecInt 5.7s 28.6s 837sRNS 0.785s 1.42s 1.88s

n = 1000, on an Intel SandyBridge.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 24 / 73

Page 59: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Larger field sizes

Larger finite fields: log2 p ≥ 32

As before:1 Use adequate integer arithmetic2 reduce modulo p only when necessary

Which integer arithmetic?

1 big integers (GMP)

2 fixed size multiprecision (Givaro-RecInt)

3 Residue Number Systems (Chinese Remainder theorem) using moduli delivering optimum bitspeed

log2 p 50 100 150

GMP 58.1s 94.6s 140sRecInt 5.7s 28.6s 837sRNS 0.785s 1.42s 1.88s

n = 1000, on an Intel SandyBridge.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 24 / 73

Page 60: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Larger field sizes

In practice

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25 30 35

Sp

ee

d (

Gfo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 5 10 15 20 25 30 35

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 20 40 60 80 100 120 140

Bit s

peed (

Gbops)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 25 / 73

Page 61: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Larger field sizes

In practice

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25 30 35

Sp

ee

d (

Gfo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 5 10 15 20 25 30 35

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 20 40 60 80 100 120 140

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 25 / 73

Page 62: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Choosing the underlying arithmetic Larger field sizes

In practice

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25 30 35

Sp

ee

d (

Gfo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 5 10 15 20 25 30 35

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

0

100

200

300

400

500

600

0 20 40 60 80 100 120 140

Bit s

pe

ed

(G

bo

ps)

Modulo p (bitsize)

fgemm C = A x B n = 2000

float mod pdouble mod p

int64 mod pRNS mod p

RecInt mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 25 / 73

Page 63: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks

Outline

1 Choosing the underlying arithmeticUsing boolean arithmeticUsing machine word arithmeticLarger field sizes

2 Reductions and building blocksIn dense linear algebraIn blackbox linear algebra

3 Size dimension trade-offsHermite normal formFrobenius normal form

4 Parallel exact linear algebraIngredients for the parallelizationParallel dense linear algebra mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 26 / 73

Page 64: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks

Reductions to building blocks

Huge number of algorithmic variants for a given computation in O(n3).Need to structure the design of set of routines :

I Focus tuning effort on a single routineI Some operations deliver better efficiency:

B in practice: memory access patternB in theory: better algorithms

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 27 / 73

Page 65: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Memory access pattern

I The memory wall: communication speedimproves slower than arithmetic

I Deep memory hierarchy

Need to overlap communications bycomputation

Design of BLAS 3 [Dongarra & Al. 87]

I Group all ops in Matrix products gemm:Work O(n3) >> Data O(n2)

MatMul has become a building block in practice

2

4

6

8

10

12

14

16

18

20

22

1980 1985 1990 1995 2000 2005 2010 2015

Speed

Time

CPU arithmetic throughputMemory speed

CPU

L1

RAM

L2

L3

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 28 / 73

Page 66: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Memory access pattern

I The memory wall: communication speedimproves slower than arithmetic

I Deep memory hierarchy

Need to overlap communications bycomputation

Design of BLAS 3 [Dongarra & Al. 87]

I Group all ops in Matrix products gemm:Work O(n3) >> Data O(n2)

MatMul has become a building block in practice

2

4

6

8

10

12

14

16

18

20

22

1980 1985 1990 1995 2000 2005 2010 2015

Speed

Time

CPU arithmetic throughputMemory speed

CPU

L1

RAM

L2

L3

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 28 / 73

Page 67: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Memory access pattern

I The memory wall: communication speedimproves slower than arithmetic

I Deep memory hierarchy

Need to overlap communications bycomputation

Design of BLAS 3 [Dongarra & Al. 87]

I Group all ops in Matrix products gemm:Work O(n3) >> Data O(n2)

MatMul has become a building block in practice

2

4

6

8

10

12

14

16

18

20

22

1980 1985 1990 1995 2000 2005 2010 2015

Speed

Time

CPU arithmetic throughputMemory speed

CPU

L1

RAM

L2

L3

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 28 / 73

Page 68: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevskıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807)

...

[Schonhage 81] O(n2.52)

...

[Coppersmith, Winograd 90] O(n2.375)

...

[Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω)

[Schonhage 72]: QR in O(nω)

[Bunch, Hopcroft 74]: LU in O(nω)

[Ibarra & al. 82]: Rank in O(nω)

[Keller-Gehrig 85]: CharPoly inO(nω log n)

MatMul has become a building block in theory theoretical reductions

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 29 / 73

Page 69: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevskıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807)

...

[Schonhage 81] O(n2.52)

...

[Coppersmith, Winograd 90] O(n2.375)

...

[Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω)

[Schonhage 72]: QR in O(nω)

[Bunch, Hopcroft 74]: LU in O(nω)

[Ibarra & al. 82]: Rank in O(nω)

[Keller-Gehrig 85]: CharPoly inO(nω log n)

MatMul has become a building block in theory theoretical reductions

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 29 / 73

Page 70: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevskıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807)

...

[Schonhage 81] O(n2.52)

...

[Coppersmith, Winograd 90] O(n2.375)

...

[Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω)

[Schonhage 72]: QR in O(nω)

[Bunch, Hopcroft 74]: LU in O(nω)

[Ibarra & al. 82]: Rank in O(nω)

[Keller-Gehrig 85]: CharPoly inO(nω log n)

MatMul has become a building block in theory theoretical reductions

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 29 / 73

Page 71: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevskıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807)

...

[Schonhage 81] O(n2.52)

...

[Coppersmith, Winograd 90] O(n2.375)

...

[Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω)

[Schonhage 72]: QR in O(nω)

[Bunch, Hopcroft 74]: LU in O(nω)

[Ibarra & al. 82]: Rank in O(nω)

[Keller-Gehrig 85]: CharPoly inO(nω log n)

MatMul has become a building block in theory theoretical reductions

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 29 / 73

Page 72: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions: theory

and practice

HNF(Z)

Det(Z)

LinSys(Z)

MM(Z)

SNF(Z)

Det(Zp)

LU(Zp) CharPoly(Zp)

MinPoly(Zp)

TRSM(Zp)

MM(Zp)

LinSys(Zp)

Common mistrust

Fast linear algebra is

7 never faster

7 numerically unstable

Lucky coincidence

3 same building block in theoryand in practice

reduction trees are still relevant

Road map towards efficiency inpractice

1 Tune the MatMul building block.

2 Tune the reductions.

3 New reductions.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 30 / 73

Page 73: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions: theory

and practice

HNF(Z)

Det(Z)

LinSys(Z)

MM(Z)

SNF(Z)

Det(Zp)

LU(Zp) CharPoly(Zp)

MinPoly(Zp)

TRSM(Zp)

MM(Zp)

LinSys(Zp)

Common mistrust

Fast linear algebra is

7 never faster

7 numerically unstable

Lucky coincidence

3 same building block in theoryand in practice

reduction trees are still relevant

Road map towards efficiency inpractice

1 Tune the MatMul building block.

2 Tune the reductions.

3 New reductions.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 30 / 73

Page 74: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions: theory and practice

HNF(Z)

Det(Z)

LinSys(Z)

MM(Z)

SNF(Z)

Det(Zp)

LU(Zp) CharPoly(Zp)

MinPoly(Zp)

TRSM(Zp)

MM(Zp)

LinSys(Zp)

Common mistrust

Fast linear algebra is

7 never faster

7 numerically unstable

Lucky coincidence

3 same building block in theoryand in practice

reduction trees are still relevant

Road map towards efficiency inpractice

1 Tune the MatMul building block.

2 Tune the reductions.

3 New reductions.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 30 / 73

Page 75: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions: theory and practice

HNF(Z)

Det(Z)

LinSys(Z)

MM(Z)

SNF(Z)

Det(Zp)

LU(Zp) CharPoly(Zp)

MinPoly(Zp)

TRSM(Zp)

MM(Zp)

LinSys(Zp)

Common mistrust

Fast linear algebra is

7 never faster

7 numerically unstable

Lucky coincidence

3 same building block in theoryand in practice

reduction trees are still relevant

Road map towards efficiency inpractice

1 Tune the MatMul building block.

2 Tune the reductions.

3 New reductions.C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 30 / 73

Page 76: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

I Compute over Z and delay modular reductions

k(p−12

)2< 2mantissa

I Fastest integer arithmetic: double

I Cache optimizations numerical BLAS

I Strassen-Winograd 6n2.807 + . . .

with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]Tradeoffs:

Extra memory

Overwriting input Leading constant

Fully in-place in

7.2n2.807 + . . .

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 31 / 73

Page 77: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

I Compute over Z and delay modular reductions

k(p−12

)2< 253

I Fastest integer arithmetic: double

I Cache optimizations numerical BLAS

I Strassen-Winograd 6n2.807 + . . .

with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]Tradeoffs:

Extra memory

Overwriting input Leading constant

Fully in-place in

7.2n2.807 + . . .

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 31 / 73

Page 78: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

I Compute over Z and delay modular reductions

9`⌊k2`

⌋ (p−12

)2< 253

I Fastest integer arithmetic: double

I Cache optimizations numerical BLAS

I Strassen-Winograd 6n2.807 + . . .

with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]Tradeoffs:

Extra memory

Overwriting input Leading constant

Fully in-place in

7.2n2.807 + . . .

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 31 / 73

Page 79: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

I Compute over Z and delay modular reductions

9`⌊k2`

⌋ (p−12

)2< 253

I Fastest integer arithmetic: double

I Cache optimizations numerical BLAS

I Strassen-Winograd 6n2.807 + . . .

with memory efficient schedules [Boyer, Dumas, P. and Zhou 09]Tradeoffs:

Extra memory

Overwriting input Leading constant

Fully in-place in

7.2n2.807 + . . .

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 31 / 73

Page 80: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sequential Matrix Multiplication

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

2n

3/t

ime

/10

9 (

Gfo

ps e

qu

iv.)

matrix dimension

i5−3320M at 2.6Ghz with AVX 1

OpenBLAS sgemm

p = 83, 1 mod / 10000 mul.p = 821, 1 mod / 100 mul.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 32 / 73

Page 81: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sequential Matrix Multiplication

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

2n

3/t

ime

/10

9 (

Gfo

ps e

qu

iv.)

matrix dimension

i5−3320M at 2.6Ghz with AVX 1

FFLAS fgemm over Z/83ZOpenBLAS sgemm

p = 83, 1 mod / 10000 mul.

p = 821, 1 mod / 100 mul.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 32 / 73

Page 82: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sequential Matrix Multiplication

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

2n

3/t

ime

/10

9 (

Gfo

ps e

qu

iv.)

matrix dimension

i5−3320M at 2.6Ghz with AVX 1

FFLAS fgemm over Z/83ZFFLAS fgemm over Z/821Z

OpenBLAS sgemm

p = 83, 1 mod / 10000 mul.p = 821, 1 mod / 100 mul.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 32 / 73

Page 83: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Sequential Matrix Multiplication

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

2n

3/t

ime

/10

9 (

Gfo

ps e

qu

iv.)

matrix dimension

i5−3320M at 2.6Ghz with AVX 1

FFLAS fgemm over Z/83ZFFLAS fgemm over Z/821Z

OpenBLAS sgemmFFLAS fgemm over Z/1898131Z

FFLAS fgemm over Z/18981307ZOpenBLAS dgemm

p = 83, 1 mod / 10000 mul.p = 821, 1 mod / 100 mul.

p = 1898131, 1 mod / 10000 mul.p = 18981307, 1 mod / 100 mul.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 32 / 73

Page 84: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions in dense linear algebra

LU decomposition

I Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000

LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s

Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

×7.63

×6.59

Characteristic Polynomial

I A new reduction to matrix multiplication in O(nω).

n 1000 2000 5000 10000

magma-v2.19-9 1.38s 24.28s 332.7s 2497sfflas-ffpack 0.532s 2.936s 32.71s 219.2s

Intel Ivy-Bridge i5-3320 2.6Ghz using OpenBLAS-0.2.9

×7.5

×6.7

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 33 / 73

Page 85: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions in dense linear algebra

LU decomposition

I Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000

LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s

Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

×7.63

×6.59

Characteristic Polynomial

I A new reduction to matrix multiplication in O(nω).

n 1000 2000 5000 10000

magma-v2.19-9 1.38s 24.28s 332.7s 2497sfflas-ffpack 0.532s 2.936s 32.71s 219.2s

Intel Ivy-Bridge i5-3320 2.6Ghz using OpenBLAS-0.2.9

×7.5

×6.7

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 33 / 73

Page 86: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Reductions in dense linear algebra

LU decomposition

I Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000

LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s

Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

×7.63

×6.59

Characteristic Polynomial

I A new reduction to matrix multiplication in O(nω).

n 1000 2000 5000 10000

magma-v2.19-9 1.38s 24.28s 332.7s 2497sfflas-ffpack 0.532s 2.936s 32.71s 219.2s

Intel Ivy-Bridge i5-3320 2.6Ghz using OpenBLAS-0.2.9

×7.5

×6.7

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 33 / 73

Page 87: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

The case of Gaussian elimination

Which reduction to MatMul ?

Slab iterative Slab recursiveLAPACK FFLAS-FFPACK

Tile iterative Tile recursivePLASMA FFLAS-FFPACK

I Sub-cubic complexity: recursive algorithmsI Data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 34 / 73

Page 88: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

The case of Gaussian elimination

Which reduction to MatMul ?

Slab recursiveFFLAS-FFPACK

Tile recursiveFFLAS-FFPACK

I Sub-cubic complexity: recursive algorithms

I Data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 34 / 73

Page 89: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

The case of Gaussian elimination

Which reduction to MatMul ?

Tile recursiveFFLAS-FFPACK

I Sub-cubic complexity: recursive algorithmsI Data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 34 / 73

Page 90: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 91: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 92: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,Utrsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 93: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,Utrsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 94: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 95: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 96: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 97: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 98: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 99: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 100: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 101: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 102: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 103: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 104: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

trsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 105: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,Utrsm: B ← BU−1,B ← L−1Bgemm: C ← C −A×B

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 106: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Block algorithms

Tiled Iterative Slab Recursive Tiled Recursive

getrf: A→ L,U

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 35 / 73

Page 107: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Counting Modular Reductionsk≥

1 Tiled Iter. Right looking 13k

n3 +(1− 1

k

)n2 +

(16k − 5

2+ 3

k

)n

Tiled Iter. Left looking(2− 1

2k

)n2 +

(− 5

2k − 1 + 2

k

)n + 2k2 − 2k + 1

Tiled Iter. Crout(52− 1

k

)n2 +

(−2k − 5

2+ 3

k

)n + k2

k=

1 Iter. Right looking 13n3 − 1

3n

Iter. Left Looking 32n2 − 3

2n + 1

Iter. Crout 32n2 − 7

2n + 3

Tiled Recursive 2n2 − n log2 n− n

Slab Recursive (1 + 14

log2 n)n2 − 12n log2 n− n

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 36 / 73

Page 108: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Counting Modular Reductionsk≥

1 Tiled Iter. Right looking 13k

n3 +(1− 1

k

)n2 +

(16k − 5

2+ 3

k

)n

Tiled Iter. Left looking(2− 1

2k

)n2 +

(− 5

2k − 1 + 2

k

)n + 2k2 − 2k + 1

Tiled Iter. Crout(52− 1

k

)n2 +

(−2k − 5

2+ 3

k

)n + k2

k=

1 Iter. Right looking 13n3 − 1

3n

Iter. Left Looking 32n2 − 3

2n + 1

Iter. Crout 32n2 − 7

2n + 3

Tiled Recursive 2n2 − n log2 n− n

Slab Recursive (1 + 14

log2 n)n2 − 12n log2 n− n

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 36 / 73

Page 109: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Counting Modular Reductionsk≥

1 Tiled Iter. Right looking 13k

n3 +(1− 1

k

)n2 +

(16k − 5

2+ 3

k

)n

Tiled Iter. Left looking(2− 1

2k

)n2 +

(− 5

2k − 1 + 2

k

)n + 2k2 − 2k + 1

Tiled Iter. Crout(52− 1

k

)n2 +

(−2k − 5

2+ 3

k

)n + k2

k=

1 Iter. Right looking 13n3 − 1

3n

Iter. Left Looking 32n2 − 3

2n + 1

Iter. Crout 32n2 − 7

2n + 3

Tiled Recursive 2n2 − n log2 n− n

Slab Recursive (1 + 14

log2 n)n2 − 12n log2 n− n

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 36 / 73

Page 110: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Impact in practice

0

2

4

6

8

10

12

3000 5000 7000

(2/3

)n3/t

ime/1

09

matrix dimension

sequential LU decomposition variants on one core

Right(k=212)Left(k=212)

Crout(k=212)Tiled-RecSlab-Rec

I As anticipated : Right-looking < Crout < Left-looking

I Recursive algorithms stand out with large matrices (Strassen’smultiplication) despite their worse mod. reduction complexity.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 37 / 73

Page 111: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Impact in practice

0

2

4

6

8

10

12

3000 5000 7000

(2/3

)n3/t

ime/1

09

matrix dimension

sequential LU decomposition variants on one core

Right(k=n/3)Left(k=n/3)

Crout(k=n/3)Tiled-RecSlab-Rec

I As anticipated : Right-looking < Crout < Left-lookingI Recursive algorithms stand out with large matrices (Strassen’s

multiplication) despite their worse mod. reduction complexity.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 37 / 73

Page 112: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Dealing with rank deficiencies and computing rank profiles

Rank profiles: first linearly independent columns

I Major invariant of a matrix (echelon form)

I Grobner basis computations (Macaulay matrix)

I Krylov methods

Gaussian elimination revealing echelon forms:

[Ibarra, Moran and Hui 82]S=A L P

[Keller-Gehrig 85]=A RX

[Jeannerod, P. and Storjohann 13]A E= P L

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 38 / 73

Page 113: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing rank profiles

Lessons learned (or what we thought was necessary):

I treat rows in order

I exhaust all columns before considering the next row

I slab block splitting required (recursive or iterative) similar to partial pivoting

Tiled recursive PLUQ [Dumas P. Sultan 13,15]

1 Generalized to handle rank deficiencyB 4 recursive calls necessaryB in-place computation

2 Pivoting strategies exist to recover rank profile and echelon forms

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 39 / 73

Page 114: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing rank profiles

Lessons learned (or what we thought was necessary):

I treat rows in order

I exhaust all columns before considering the next row

I slab block splitting required (recursive or iterative) similar to partial pivoting

Tiled recursive PLUQ [Dumas P. Sultan 13,15]

1 Generalized to handle rank deficiencyB 4 recursive calls necessaryB in-place computation

2 Pivoting strategies exist to recover rank profile and echelon forms

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 39 / 73

Page 115: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

2× 2 block splitting

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 116: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

Recursive call

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 117: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

TRSM: B ← BU−1

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 118: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

TRSM: B ← L−1B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 119: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 120: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 121: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 122: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

2 independent recursive calls

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 123: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

TRSM: B ← BU−1

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 124: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

TRSM: B ← L−1B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 125: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 126: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 127: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

MatMul: C ← C −A×B

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 128: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

Recursive call

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 129: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

Puzzle game (block cyclic rotations)

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 130: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

A tiled recursive algorithm

[Dumas, P. and Sultan 13]

I O(mnrω−2) (degenerating to 2/3n3)I computing col. and row rank profiles of all leading sub-matricesI fewer modular reductions than slab algorithmsI rank deficiency introduces parallelism

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 40 / 73

Page 131: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

A = PLUQ = P

[L 0M Im−r

] [Ir

0

] [U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 132: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1}ColRP = {1}

A = PLUQ = P

[L 0M Im−r

] [Ir

0

] [U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 133: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,2}ColRP = {1,3}

A = PLUQ = P

[L 0M Im−r

] [Ir

0

] [U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 134: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,4}ColRP = {1,2}

A = PLUQ = P

[L 0M Im−r

] [Ir

0

] [U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 135: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,4}ColRP = {1,2}

A = PLUQ = P

[L 0M Im−r

] [Ir

0

] [U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 136: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,4}ColRP = {1,2}

A = PLUQ = P

[L 0M Im−r

]PTP

[Ir

0

]QQT

[U V

In−r

]Q

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 137: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,4}ColRP = {1,2}

A = PLUQ = P

[L 0M Im−r

]PT

︸ ︷︷ ︸L

P

[Ir

0

]Q︸ ︷︷ ︸

ΠP,Q

QT

[U V

In−r

]Q︸ ︷︷ ︸

U

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 138: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In dense linear algebra

Computing all rank profiles at once

Dumas, P. and Sultan ISSAC’15 (Thursday 9 @ 3PM )

Definition (Rank Profile matrix)

The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leadingsub-matrix of RA and of A have the same rank.

Theorem

I RowRP and ColRP read directly on R(A)I Same holds for any (i, j)-leading submatrix.

1 2 3 42 4 5 81 2 3 43 5 9 12

1 0 0 00 0 1 00 0 0 00 1 0 0

A R

RowRP = {1,4}ColRP = {1,2}

A = PLUQ = P

[L 0M Im−r

]PT

︸ ︷︷ ︸L

P

[Ir

0

]Q︸ ︷︷ ︸

ΠP,Q

QT

[U V

In−r

]Q︸ ︷︷ ︸

U

With appropriate pivoting: ΠP,Q = R(A)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 41 / 73

Page 139: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In blackbox linear algebra

Reductions in black box linear algebra

v ∈ Kn

A ∈ Kn×n

Av ∈ Kn

Matrix-Vector Product: building block, costs E(n)

Minimal polynomial: [Wiedemann 86] iterative Krylov/Lanczos methods O(nE(n) + n2)

Rank, Det, Solve: [ Chen& Al. 02] reduces to MinPoly + preconditioners

O (nE(n) + n2)

Characteristic Poly.: [Dumas P. Saunders 09] reduces to MinPoly, Rank, . . .

CharPoly(Zp)

Rank(Zp)

MinPoly(Zp)

MatVecProd(Zp)

Det(Zp) LinSys(Zp)

MinPoly(Z)

CharPoly(Z)

[Dumas, P. Saunders 09]

[Chen & al. 02]

[Wiedemann 86]

[Kaltofen Saunders 91]

[Wiedemann 86]

[Chen & al. 02]

[Dumas, P. and Saunders 09]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 42 / 73

Page 140: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In blackbox linear algebra

Reductions in black box linear algebra

v ∈ Kn

A ∈ Kn×n

Av ∈ Kn

Matrix-Vector Product: building block, costs E(n)

Minimal polynomial: [Wiedemann 86] iterative Krylov/Lanczos methods O(nE(n) + n2)

Rank, Det, Solve: [ Chen& Al. 02] reduces to MinPoly + preconditioners O (nE(n) + n2)

Characteristic Poly.: [Dumas P. Saunders 09] reduces to MinPoly, Rank, . . .

CharPoly(Zp)

Rank(Zp)

MinPoly(Zp)

MatVecProd(Zp)

Det(Zp) LinSys(Zp)

MinPoly(Z)

CharPoly(Z)

[Dumas, P. Saunders 09]

[Chen & al. 02]

[Wiedemann 86]

[Kaltofen Saunders 91]

[Wiedemann 86]

[Chen & al. 02]

[Dumas, P. and Saunders 09]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 42 / 73

Page 141: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In blackbox linear algebra

Reductions in black box linear algebra

v ∈ Kn

A ∈ Kn×n

Av ∈ Kn

Matrix-Vector Product: building block, costs E(n)

Minimal polynomial: [Wiedemann 86] iterative Krylov/Lanczos methods O(nE(n) + n2)

Rank, Det, Solve: [ Chen& Al. 02] reduces to MinPoly + preconditioners O (nE(n) + n2)

Characteristic Poly.: [Dumas P. Saunders 09] reduces to MinPoly, Rank, . . .

CharPoly(Zp)

Rank(Zp)

MinPoly(Zp)

MatVecProd(Zp)

Det(Zp) LinSys(Zp)

MinPoly(Z)

CharPoly(Z)

[Dumas, P. Saunders 09]

[Chen & al. 02]

[Wiedemann 86]

[Kaltofen Saunders 91]

[Wiedemann 86]

[Chen & al. 02]

[Dumas, P. and Saunders 09]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 42 / 73

Page 142: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Reductions and building blocks In blackbox linear algebra

Reductions in black box linear algebra

v ∈ Kn

A ∈ Kn×n

Av ∈ Kn

Matrix-Vector Product: building block, costs E(n)

Minimal polynomial: [Wiedemann 86] iterative Krylov/Lanczos methods O(nE(n) + n2)

Rank, Det, Solve: [ Chen& Al. 02] reduces to MinPoly + preconditioners O (nE(n) + n2)

Characteristic Poly.: [Dumas P. Saunders 09] reduces to MinPoly, Rank, . . .

CharPoly(Zp)

Rank(Zp)

MinPoly(Zp)

MatVecProd(Zp)

Det(Zp) LinSys(Zp)

MinPoly(Z)

CharPoly(Z)

[Dumas, P. Saunders 09]

[Chen & al. 02]

[Wiedemann 86]

[Kaltofen Saunders 91]

[Wiedemann 86]

[Chen & al. 02]

[Dumas, P. and Saunders 09]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 42 / 73

Page 143: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Outline

1 Choosing the underlying arithmeticUsing boolean arithmeticUsing machine word arithmeticLarger field sizes

2 Reductions and building blocksIn dense linear algebraIn blackbox linear algebra

3 Size dimension trade-offsHermite normal formFrobenius normal form

4 Parallel exact linear algebraIngredients for the parallelizationParallel dense linear algebra mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 43 / 73

Page 144: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Size Dimension trade-offs

Computing with coefficients of varying size: Z,Q,K[X], . . .

Multimodular methods

over K[X]: evaluation-interpolation

over Z,Q: Chinese Remainder Theorem

Cost = Algebraic Cost× Size(Output)

3 avoids coefficient blow-up

7 uniform (worst case) cost for all arithmetic ops

Example

Hadamard’s bound: |det(A)| ≤ (‖A‖∞√n)n.

LinSysZ(n) = O(nω × n(log n+ log ‖A‖∞))

= O (nω+1 log ‖A‖∞)]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 44 / 73

Page 145: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Size Dimension trade-offs

Computing with coefficients of varying size: Z,Q,K[X], . . .

Multimodular methods

over K[X]: evaluation-interpolation

over Z,Q: Chinese Remainder Theorem

Cost = Algebraic Cost× Size(Output)

3 avoids coefficient blow-up

7 uniform (worst case) cost for all arithmetic ops

Example

Hadamard’s bound: |det(A)| ≤ (‖A‖∞√n)n.

LinSysZ(n) = O(nω × n(log n+ log ‖A‖∞))

= O (nω+1 log ‖A‖∞)]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 44 / 73

Page 146: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Size Dimension trade-offs

Computing with coefficients of varying size: Z,Q,K[X], . . .

Multimodular methods

over K[X]: evaluation-interpolation

over Z,Q: Chinese Remainder Theorem

Cost = Algebraic Cost× Size(Output)

3 avoids coefficient blow-up

7 uniform (worst case) cost for all arithmetic ops

Example

Hadamard’s bound: |det(A)| ≤ (‖A‖∞√n)n.

LinSysZ(n) = O(nω × n(log n+ log ‖A‖∞)) = O (nω+1 log ‖A‖∞)]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 44 / 73

Page 147: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Size Dimension trade-offs

Computing with coefficients of varying size: Z,Q,K[X], . . .

Lifting techniques

p-adic lifting: [Moenck & Carter 79, Dixon 82]

I One computation over ZpI Iterative lifting of the solution to Z,Q

High order lifting : [Storjohann 02,03]

I Fewer iteration stepsI larger dimension in the lifting

Example

LinSysZ(n) = O(n3 log ‖A‖1+ε∞ )

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 45 / 73

Page 148: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Size Dimension trade-offs

Computing with coefficients of varying size: Z,Q,K[X], . . .

Lifting techniques

p-adic lifting: [Moenck & Carter 79, Dixon 82]

I One computation over ZpI Iterative lifting of the solution to Z,Q

High order lifting : [Storjohann 02,03]

I Fewer iteration stepsI larger dimension in the lifting

Example

LinSysZ(n) = O (nω log ‖A‖∞)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 45 / 73

Page 149: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Improving time Complexities

Matrix multiplication: door to fast linear algebra

I over Z: O(nωM(log ‖A‖)) = O (nω log ‖A‖)

Equivalence over Z or K[X]: Hermite normal form

I [Kannan & Bachem 79]: ∈ PI [Chou & Collins 82]: O (n6 log ‖A‖)I [Domich & Al. 87], [Illiopoulos 89]: O (n4 log ‖A‖)I [Micciancio & Warinschi01]: O (n5 log ‖A‖2),

O (n3 log ‖A‖) heur.I [Storjohann & Labahn 96]: O (nω+1 log ‖A‖)I [Gupta & Storjohann 11]: O (n3 log ‖A‖)

Similarity over a field: Frobenius normal form

I [Giesbrecht 93]: O (nω) probabilisticI [Storjohann 00]: O (nω) deterministicI [P. & Storjohann 07]: O(nω) probabilistic

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 46 / 73

Page 150: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Improving time Complexities

Matrix multiplication: door to fast linear algebra

I over Z: O(nωM(log ‖A‖)) = O (nω log ‖A‖)

Equivalence over Z or K[X]: Hermite normal form

I [Kannan & Bachem 79]: ∈ PI [Chou & Collins 82]: O (n6 log ‖A‖)I [Domich & Al. 87], [Illiopoulos 89]: O (n4 log ‖A‖)I [Micciancio & Warinschi01]: O (n5 log ‖A‖2),

O (n3 log ‖A‖) heur.I [Storjohann & Labahn 96]: O (nω+1 log ‖A‖)I [Gupta & Storjohann 11]: O (n3 log ‖A‖)

Similarity over a field: Frobenius normal form

I [Giesbrecht 93]: O (nω) probabilisticI [Storjohann 00]: O (nω) deterministicI [P. & Storjohann 07]: O(nω) probabilistic

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 46 / 73

Page 151: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Improving time Complexities

Matrix multiplication: door to fast linear algebra

I over Z: O(nωM(log ‖A‖)) = O (nω log ‖A‖)

Equivalence over Z or K[X]: Hermite normal form

I [Kannan & Bachem 79]: ∈ PI [Chou & Collins 82]: O (n6 log ‖A‖)I [Domich & Al. 87], [Illiopoulos 89]: O (n4 log ‖A‖)I [Micciancio & Warinschi01]: O (n5 log ‖A‖2),

O (n3 log ‖A‖) heur.I [Storjohann & Labahn 96]: O (nω+1 log ‖A‖)I [Gupta & Storjohann 11]: O (n3 log ‖A‖)

Similarity over a field: Frobenius normal form

I [Giesbrecht 93]: O (nω) probabilisticI [Storjohann 00]: O (nω) deterministicI [P. & Storjohann 07]: O(nω) probabilistic

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 46 / 73

Page 152: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Improving time Complexities

Matrix multiplication: door to fast linear algebra

I over Z: O(nωM(log ‖A‖)) = O (nω log ‖A‖)

Equivalence over Z or K[X]: Hermite normal form

I [Kannan & Bachem 79]: ∈ PI [Chou & Collins 82]: O (n6 log ‖A‖)I [Domich & Al. 87], [Illiopoulos 89]: O (n4 log ‖A‖)I [Micciancio & Warinschi01]: O (n5 log ‖A‖2),

O (n3 log ‖A‖) heur.I [Storjohann & Labahn 96]: O (nω+1 log ‖A‖)I [Gupta & Storjohann 11]: O (n3 log ‖A‖)

Similarity over a field: Frobenius normal form

I [Giesbrecht 93]: O (nω) probabilisticI [Storjohann 00]: O (nω) deterministicI [P. & Storjohann 07]: O(nω) probabilistic

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 46 / 73

Page 153: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs

Building blocks and reductions

In brief

Reductions to a building block

Matrix Mult: block rec.∑logn

i=1 n(n2i

)ω−1= O(nω) (Gauss, REF)

Matrix Mult: Iterative∑n

k=1 k(nk

)ω= O(nω) (Frobenius)

Linear Sys: over Z (Hermite Normal Form)

Size/dimension compromises

I Hermite normal form : rank 1 updates reducing the determinant

I Frobenius normal form : degree k, dimension n/k for k = 1 . . . n

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 47 / 73

Page 154: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Hermite normal form: naive algorithm

Reduced Echelon form over a ring:

p1 ∗ x1,2 ∗ ∗ x1,3 ∗

p2 ∗ ∗ x2,3 ∗p3 ∗

with

0 ≤ x∗,j < pj .

for i = 1 . . . n do(g, ti, . . . , tn) = xgcd(Ai,i, Ai+1,i, . . . , An,i)Li ←

∑nj=i+1 tjLj

for j = i+ 1 . . . n doLj ← Lj − Aj,i

g Li . eliminateend forfor j = 1 . . . i− 1 do

Lj ← Lj − bAj,i

g cLi . reduceend for

end forC. Pernet Exact Linear Algebra Algorithmic July 6, 2015 48 / 73

Page 155: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Computing modulo the determinant [Domich & Al. 87]

Property

I For A non-singular: maxi∑

j Hij ≤ detH = detA

I Every computation can be done modulo d = detA:

U ′[AdIn In

]=

[H

In

]

Example

A =

−5 8 −3 −9 5 5−2 8 −2 −2 8 57 −5 −8 4 3 −41 −1 6 0 8 −3

, H =

1 0 3 237 −299 900 1 1 103 −130 400 0 4 352 −450 1350 0 0 486 −627 188

detA = 1944

O(n3)×M(n(log n+ log ‖A‖)) = O (n5 log ‖A‖2)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 49 / 73

Page 156: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Computing modulo the determinant [Domich & Al. 87]

Property

I For A non-singular: maxi∑

j Hij ≤ detH = detA

I Every computation can be done modulo d = detA:

U ′[AdIn In

]=

[H

In

]

Example

A =

−5 8 −3 −9 5 5−2 8 −2 −2 8 57 −5 −8 4 3 −41 −1 6 0 8 −3

, H =

1 0 3 237 −299 900 1 1 103 −130 400 0 4 352 −450 1350 0 0 486 −627 188

detA = 1944

O(n3)×M(n(log n+ log ‖A‖)) = O (n5 log ‖A‖2)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 49 / 73

Page 157: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Computing modulo the determinant

I Pessimistic estimate on the arithmetic size

I d large but most coefficients of H are small

I On average : only the last few columns are large

Compute H ′ close to H but with small determinant

[Micciancio & Warinschi 01]

A =

B bcT an−1,n

dT an,n

d1 = det

([BcT

]), d2 = det

([BdT

])g = gcd(d1, d2) = sd1 + td2 Then

det

([B

scT + tdT

])= g

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 50 / 73

Page 158: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Computing modulo the determinant

I Pessimistic estimate on the arithmetic size

I d large but most coefficients of H are small

I On average : only the last few columns are large

Compute H ′ close to H but with small determinant

[Micciancio & Warinschi 01]

A =

B bcT an−1,n

dT an,n

d1 = det

([BcT

]), d2 = det

([BdT

])g = gcd(d1, d2) = sd1 + td2 Then

det

([B

scT + tdT

])= g

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 50 / 73

Page 159: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Micciancio & Warinschi algorithm

Compute d1 = det

([BcT

]), d2 = det

([BdT

]). Double Det

(g, s, t) = xgcd(d1, d2)

Compute H1 the HNF of

[B

scT + tdT

]mod g . Modular HNF

Recover H2 the HNF of

[B b

scT + tdT san−1,n + tan,n

]. AddCol

Recover H3 the HNF of

B bcT an−1,ndT an,n

. AddRow

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 51 / 73

Page 160: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Micciancio & Warinschi algorithm

Compute d1 = det

([BcT

]), d2 = det

([BdT

]). Double Det

(g, s, t) = xgcd(d1, d2)

Compute H1 the HNF of

[B

scT + tdT

]mod g . Modular HNF

Recover H2 the HNF of

[B b

scT + tdT san−1,n + tan,n

]. AddCol

Recover H3 the HNF of

B bcT an−1,ndT an,n

. AddRow

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 51 / 73

Page 161: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Double Determinant

First approach: LU mod p1, . . . , pk + CRT

I Only one elimination for the n− 2 first rows

I 2 updates for the last rows (triangular back substitution)

I k large such that∏ki=1 pi > nn log ‖A‖n/2

Second approach: [Abbott Bronstein Mulders 99]

I Solve Ax = b.

I δ = lcm(q1, . . . , qn) s.t. xi = pi/qi

Then δ is a large divisor of D = detA.

I Compute D/δ by LU mod p1, . . . , pk + CRT

I k small , such that∏ki=1 pi > nn log ‖A‖n/2/δ

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 52 / 73

Page 162: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Double Determinant

First approach: LU mod p1, . . . , pk + CRT

I Only one elimination for the n− 2 first rows

I 2 updates for the last rows (triangular back substitution)

I k large such that∏ki=1 pi > nn log ‖A‖n/2

Second approach: [Abbott Bronstein Mulders 99]

I Solve Ax = b.

I δ = lcm(q1, . . . , qn) s.t. xi = pi/qi

Then δ is a large divisor of D = detA.

I Compute D/δ by LU mod p1, . . . , pk + CRT

I k small , such that∏ki=1 pi > nn log ‖A‖n/2/δ

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 52 / 73

Page 163: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

Double Determinant : improved

Property

Let x = [x1, . . . , xn] be the solution of[A c

]x = d. Then

y = [− x1xn, . . . ,−xn−1

xn, 1xn

] is the solution of[A d

]y = c.

I 1 system solve

I 1 LU for each pi

d1, d2 computed at about the cost of 1 determinant

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 53 / 73

Page 164: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

AddCol

Problem

Find a vector e such that[H1 e

]= U

[B b

scT + tdT san−1,n + tan,n

]

e = U

[b

san−1,n + tan,n

]= H1

[B

scT + tdT

]−1 [b

san−1,n + tan,n

] Solve a system.

I n− 1 first rows are small

I last row is large

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 54 / 73

Page 165: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Hermite normal form

AddCol

Idea:

replace the last row by a random small one wT .[BwT

]y =

[b

an−1,n−1

]Let {k} be a basis of the kernel of B. Then

x = y + αk.

where

α =an−1,n−1 − (scT + tdT ) · y

(scT + tdT ) · k limits the expensive arithmetic to a few dot products

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 55 / 73

Page 166: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

Definition

Unique F = U−1AU = Diag(Cf0 , . . . , Cfk) with fk|fk−1| . . . |f0.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 167: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

k-shifted form:

1

0

1

0

1

1

0

1

1

k k ≤ k

I From k to k + 1-shifted in O(k(nk )ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 168: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

k + 1-shifted form:

1

0

1

1

1

0

1

1

0

k + 1 k + 1 dl

I From k to k + 1-shifted in O(k(nk )ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 169: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

k + 1-shifted form:

1

0

1

1

1

0

1

1

0

k + 1 k + 1 dlI From k to k + 1-shifted in O(k(nk )

ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 170: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

Hessenberg polycyclic:

0

1

1

0

1

1

1

0

1

d2d1 dlI From k to k + 1-shifted in O(k(nk )

ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 171: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

Hessenberg polycyclic:

0

1

1

0

1

1

1

0

1

d2d1 dlI From k to k + 1-shifted in O(k(nk )

ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 172: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

Computing the Frobenius normal form

[P. & Storjohann 07]

Hessenberg polycyclic:

0

1

1

0

1

1

1

0

1

d2d1 dlI From k to k + 1-shifted in O(k(nk )

ω)

I Compute iteratively from a 1-shifted form

I Invariant factors appear by increasing degree

I Until the Hessenberg polycyclic form

nωn∑k=1

(1

k

)ω−1≤ ζ(ω − 1)nω = O(nω)

I Generalized to the Frobenius form as well

I Transformation matrix in O(nω log log n)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 56 / 73

Page 173: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

A new type size dimension trade-off

xIn −A

det(xIn −A)

dimension = 1degree = n

degree = 1

dimension = n

Keller-Gehrig 2

1

1

0

1

1

0 dimension = n

2i

degree = 2i

New algorithm

1

0

1

0

1

0

1

0

1

1

0

1

1

0

0

1

dimension = nk

degree = k

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 57 / 73

Page 174: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

A new type size dimension trade-off

xIn −A

det(xIn −A)

dimension = 1degree = n

degree = 1

dimension = n

Keller-Gehrig 2

1

1

0

1

1

0 dimension = n

2i

degree = 2i

New algorithm

1

0

1

0

1

0

1

0

1

1

0

1

1

0

0

1

dimension = nk

degree = k

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 57 / 73

Page 175: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Size dimension trade-offs Frobenius normal form

A new type size dimension trade-off

xIn −A

det(xIn −A)

dimension = 1degree = n

degree = 1

dimension = n

Keller-Gehrig 2

1

1

0

1

1

0 dimension = n

2i

degree = 2i

New algorithm

1

0

1

0

1

0

1

0

1

1

0

1

1

0

0

1

dimension = nk

degree = k

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 57 / 73

Page 176: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra

Outline

1 Choosing the underlying arithmeticUsing boolean arithmeticUsing machine word arithmeticLarger field sizes

2 Reductions and building blocksIn dense linear algebraIn blackbox linear algebra

3 Size dimension trade-offsHermite normal formFrobenius normal form

4 Parallel exact linear algebraIngredients for the parallelizationParallel dense linear algebra mod p

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 58 / 73

Page 177: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Parallelization

Parallel numerical linear algebra

I cost invariant wrt. splittingB O(n3)

fine grain block iterative algorithms

I regular task load

I Numerical stability constraints

Exact linear algebra specificities

I cost affected by the splittingB O(nω) for ω < 3B modular reductions

coarse grain recursive algorithms

I rank deficiencies unbalanced task loads

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 59 / 73

Page 178: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Parallelization

Parallel numerical linear algebra

I cost invariant wrt. splittingB O(n3)

fine grain block iterative algorithms

I regular task load

I Numerical stability constraints

Exact linear algebra specificities

I cost affected by the splittingB O(nω) for ω < 3B modular reductions

coarse grain recursive algorithms

I rank deficiencies unbalanced task loads

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 59 / 73

Page 179: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Ingredients for the parallelization

Criteria

I good performances

I portability across architectures

I abstraction for simplicity

Challenging key point: scheduling as a plugin

Program: only describes where the parallelism lies

Runtime: scheduling & mapping, depending on the context ofexecution

3 main models:

1 Parallel loop [data parallelism]

2 Fork-Join (independent tasks) [task parallelism]

3 Dependent tasks with data flow dependencies [task parallelism]

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 60 / 73

Page 180: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Data Parallelism

OMP

for (int step = 0; step < 2; ++step){

#pragma omp parallel for

for (int i = 0; i < count; ++i)

A[i] = (B[i+1] + B[i-1] + 2.0*B[i])*0.25;

}

Limitation: very un-efficient with recursive parallel regions

I Limited to iterative algorithms

I No composition of routines

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 61 / 73

Page 181: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Task parallelism with fork-Join

I Task based program: spawn + sync

I Especially suited for recursive programs

OMP (since v3)

void fibonacci(long* result , long n) {

if (n < 2)

*result = n;

else {

long x,y;

#pragma omp task

fibonacci( &x, n-1 );

fibonacci( &y, n-2 );

#pragma omp taskwait

*result = x + y;

}

}

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 62 / 73

Page 182: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Tasks with dataflow dependencies

I Task based model avoiding synchronizationsI Infer synchronizations from the read/write specifications

B A task is ready for execution when all its inputs variables are readyB A variable is ready when it has been written

I Recently supported: Athapascan [96], Kaapi [06], StarSs [07], StarPU[08], Quark [10], OMP since v4 [14]...

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 63 / 73

Page 183: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Illustration: Cholesky factorization

v o i d C h o l e s k y ( double∗ A, i n t N, s i z e t NB ) {

f o r ( s i z e t k=0; k < N; k += NB){

c l a p a c k d p o t r f ( CblasRowMajor , CblasLower , NB, &A [ k∗N+k ] , N ) ;

f o r ( s i z e t m=k+ NB; m < N; m += NB){

c b l a s d t r s m ( CblasRowMajor , C b l a s L e f t , CblasLower , CblasNoTrans , C b l a s U n i t ,NB, NB, 1 . , &A [ k∗N+k ] , N, &A [m∗N+k ] , N ) ;

}

f o r ( s i z e t m=k+ NB; m < N; m += NB){

c b l a s d s y r k ( CblasRowMajor , CblasLower , CblasNoTrans ,NB, NB, −1.0 , &A [m∗N+k ] , N, 1 . 0 , &A [m∗N+m] , N ) ;

f o r ( s i z e t n=k+NB; n < m; n += NB){

cblas dgemm ( CblasRowMajor , CblasNoTrans , CblasTrans ,NB, NB, NB, −1.0 , &A [m∗N+k ] , N, &A [ n∗N+k ] , N, 1 . 0 , &A [m∗N+n ] , N ) ;

}}

}}

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 64 / 73

Page 184: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Illustration: Cholesky factorization

v o i d C h o l e s k y ( double∗ A, i n t N, s i z e t NB ) {#pragma omp p a r a l l e l#pragma omp s i n g l e nowa i t

f o r ( s i z e t k=0; k < N; k += NB){

c l a p a c k d p o t r f ( CblasRowMajor , CblasLower , NB, &A [ k∗N+k ] , N ) ;

f o r ( s i z e t m=k+ NB; m < N; m += NB){

#pragma omp t a s k f i r s t p r i v a t e ( k , m) s h a r e d (A)c b l a s d t r s m ( CblasRowMajor , C b l a s L e f t , CblasLower , CblasNoTrans , C b l a s U n i t ,

NB, NB, 1 . , &A [ k∗N+k ] , N, &A [m∗N+k ] , N ) ;}

#pragma omp t a s k w a i t // B a r r i e r : no c o n c u r r e n c y with next t a s k sf o r ( s i z e t m=k+ NB; m < N; m += NB){

#pragma omp t a s k f i r s t p r i v a t e ( k , m) s h a r e d (A)c b l a s d s y r k ( CblasRowMajor , CblasLower , CblasNoTrans ,

NB, NB, −1.0 , &A [m∗N+k ] , N, 1 . 0 , &A [m∗N+m] , N ) ;

f o r ( s i z e t n=k+NB; n < m; n += NB){

#pragma omp t a s k f i r s t p r i v a t e ( k , m) s h a r e d (A)cblas dgemm ( CblasRowMajor , CblasNoTrans , CblasTrans ,

NB, NB, NB, −1.0 , &A [m∗N+k ] , N, &A [ n∗N+k ] , N, 1 . 0 , &A [m∗N+n ] , N ) ;}

}#pragma omp t a s k w a i t // B a r r i e r : no c o n c u r r e n c y with t a s k s at i t e r a t i o n k+1}

}

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 65 / 73

Page 185: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 186: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 187: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 188: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 189: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 190: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 191: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

SYNC.

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 192: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 66 / 73

Page 193: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

Illustration: Cholesky factorization

v o i d C h o l e s k y ( double∗ A, i n t N, s i z e t NB ){#pragma k a a p i p a r a l l e l

f o r ( s i z e t k=0; k < N; k += NB){

#pragma k a a p i t a s k r e a d w r i t e (&A [ k∗N+k ]{ l d=N; [NB ] [ NB]} )c l a p a c k d p o t r f ( CblasRowMajor , CblasLower , NB, &A [ k∗N+k ] , N ) ;

f o r ( s i z e t m=k+ NB; m < N; m += NB){

#pragma k a a p i t a s k r e a d (&A [ k∗N+k ]{ l d=N ; [ NB ] [ NB]} ) r e a d w r i t e (&A [m∗N+k ]{ l d=N; [NB ] [ NB]} )c b l a s d t r s m ( CblasRowMajor , C b l a s L e f t , CblasLower , CblasNoTrans , C b l a s U n i t ,

NB, NB, 1 . , &A [ k∗N+k ] , N, &A [m∗N+k ] , N ) ;}

f o r ( s i z e t m=k+ NB; m < N; m += NB){

#pragma k a a p i t a s k r e a d (&A [m∗N+k ]{ l d=N ; [ NB ] [ NB]} ) r e a d w r i t e (&A [m∗N+m]{ l d=N; [NB ] [ NB]} )c b l a s d s y r k ( CblasRowMajor , CblasLower , CblasNoTrans ,

NB, NB, −1.0 , &A [m∗N+k ] , N, 1 . 0 , &A [m∗N+m] , N ) ;

f o r ( s i z e t n=k+NB; n < m; n += NB){

#pragma k a a p i t a s k r e a d (&A [m∗N+k ]{ l d=N; [NB ] [ NB]} , &A [ n∗N+k ]{ l d=N; [NB ] [ NB]})\r e a d w r i t e (&A [m∗N+n ]{ l d=N; [NB ] [ NB]} )

cblas dgemm ( CblasRowMajor , CblasNoTrans , CblasTrans ,NB, NB, NB, −1.0 , &A [m∗N+k ] , N, &A [ n∗N+k ] , N, 1 . 0 , &A [m∗N+n ] , N ) ;

}}

}// I m p l i c i t b a r r i e r o n l y at the end o f Kaapi p a r a l l e l r e g i o n

}C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 67 / 73

Page 194: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 195: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 196: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 197: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 198: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 199: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 200: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Ingredients for the parallelization

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 68 / 73

Page 201: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Parallel matrix multiplication

[Dumas, Gautier, P. & Sultan 14]

A1

A2

B1 B2

C11 C12

C21 C22

1st recursion cutting

2nd recursion cutting

0

100

200

300

400

500

600

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

pfgemm over Z/131071Z on a Xeon E5-4620 2.2Ghz 32 cores

MKL dgemmPLASMA-QUARK dgemm

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 69 / 73

Page 202: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Parallel matrix multiplication

[Dumas, Gautier, P. & Sultan 14]

A1

A2

B1 B2

C11 C12

C21 C22

1st recursion cutting

2nd recursion cutting

0

100

200

300

400

500

600

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

pfgemm over Z/131071Z on a Xeon E5-4620 2.2Ghz 32 cores

pfgemm doubleMKL dgemm

PLASMA-QUARK dgemm

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 69 / 73

Page 203: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Parallel matrix multiplication

[Dumas, Gautier, P. & Sultan 14]

A1

A2

B1 B2

C11 C12

C21 C22

1st recursion cutting

2nd recursion cutting

0

100

200

300

400

500

600

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

pfgemm over Z/131071Z on a Xeon E5-4620 2.2Ghz 32 cores

pfgemm doublepfgemm mod 131071

MKL dgemmPLASMA-QUARK dgemm

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 69 / 73

Page 204: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Gaussian elimination

Slab iterative Slab recursiveLAPACK FFLAS-FFPACK

Tile iterative Tile recursivePLASMA FFLAS-FFPACK

I Prefer recursive algorithmsI Better data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 70 / 73

Page 205: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Gaussian elimination

Slab recursiveFFLAS-FFPACK

Tile recursiveFFLAS-FFPACK

I Prefer recursive algorithms

I Better data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 70 / 73

Page 206: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Gaussian elimination

Tile recursiveFFLAS-FFPACK

I Prefer recursive algorithmsI Better data locality

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 70 / 73

Page 207: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Full rank Gaussian elimination

[Dumas, Gautier, P. and Sultan 14] Comparing numerical efficiency (nomodulo)

0

50

100

150

200

250

300

350

400

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

parallel PLUQ over double on full rank matrices on 32 cores

PLASMA-Quark dgetrf tiled storage (k=212)PLASMA-Quark dgetrf (k=212)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 71 / 73

Page 208: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Full rank Gaussian elimination

[Dumas, Gautier, P. and Sultan 14] Comparing numerical efficiency (nomodulo)

0

50

100

150

200

250

300

350

400

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

parallel PLUQ over double on full rank matrices on 32 cores

Intel MKL dgetrfPLASMA-Quark dgetrf tiled storage (k=212)

PLASMA-Quark dgetrf (k=212)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 71 / 73

Page 209: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Full rank Gaussian elimination

[Dumas, Gautier, P. and Sultan 14] Comparing numerical efficiency (nomodulo)

0

50

100

150

200

250

300

350

400

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

parallel PLUQ over double on full rank matrices on 32 cores

FFLAS-FFPACK<double> explicit synchFFLAS-FFPACK<double> dataflow synch

Intel MKL dgetrfPLASMA-Quark dgetrf tiled storage (k=212)

PLASMA-Quark dgetrf (k=212)

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 71 / 73

Page 210: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Full rank Gaussian elimination

[Dumas, Gautier, P. and Sultan 14] Over the finite field Z/131071Z

0

50

100

150

200

250

300

350

400

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

Parallel PLUQ over Z/131071Z with full rank matrices on 32 cores

Tiled Rec explicit syncTiled Rec dataflow sync

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 71 / 73

Page 211: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Full rank Gaussian elimination

[Dumas, Gautier, P. and Sultan 14] Over the finite field Z/131071Z

0

50

100

150

200

250

300

350

400

0 5000 10000 15000 20000 25000 30000

Gfo

ps

matrix dimension

Parallel PLUQ over Z/131071Z with full rank matrices on 32 cores

Tiled Rec explicit syncTiled Rec dataflow syncTiled Iter dataflow sync

Tiled Iter explicit sync

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 71 / 73

Page 212: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 213: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 214: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 215: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 216: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 217: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Conclusion

Design framework for high performance exact linear algebra

Asymptotic reduction > algorithm tuning > building block implementation

I So far, floating point arithmetic delivers best speed

I Medium size arithmetic: RNS harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

I Favor tiled recursive algorithms architecture oblivious vs aware algorithms [Gustavson 07]

I New pivoting strategies revealing all rank profile informations tournament pivoting? [Demmel, Grigori and Xiang 11]

I Seek size-dimension trade-offs, even heuristic ones,

I Recursive tasks and coarse grain parallelization Light weight task workstealing management required Need for an improved recursive dataflow scheduling

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 72 / 73

Page 218: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Perspectives

Large scale distributed exact linear algebraI reducing communications [Demmel, Grigori and Xiang 11]

I sparse and hybrid [Faugere and Lachartre 10]

Structured linear algebraI A lot of action recently [Jeannerod Schost 08], [Chowdhury & Al. 15]

I Combined with recent advances in linear algebra over K[X]

I Applications to list decoding

Symbolic-numeric computationI High precision floating point linear algebra via exact rational arithmetic and RNS

Thank you

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 73 / 73

Page 219: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Perspectives

Large scale distributed exact linear algebraI reducing communications [Demmel, Grigori and Xiang 11]

I sparse and hybrid [Faugere and Lachartre 10]

Structured linear algebraI A lot of action recently [Jeannerod Schost 08], [Chowdhury & Al. 15]

I Combined with recent advances in linear algebra over K[X]

I Applications to list decoding

Symbolic-numeric computationI High precision floating point linear algebra via exact rational arithmetic and RNS

Thank you

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 73 / 73

Page 220: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Perspectives

Large scale distributed exact linear algebraI reducing communications [Demmel, Grigori and Xiang 11]

I sparse and hybrid [Faugere and Lachartre 10]

Structured linear algebraI A lot of action recently [Jeannerod Schost 08], [Chowdhury & Al. 15]

I Combined with recent advances in linear algebra over K[X]

I Applications to list decoding

Symbolic-numeric computationI High precision floating point linear algebra via exact rational arithmetic and RNS

Thank you

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 73 / 73

Page 221: Exact Linear Algebra Algorithmic: Theory and Practice - ISSAC'15 … · Exact Linear Algebra Algorithmic: Theory and Practice ISSAC’15 Tutorial Cl ement Pernet Universit e Grenoble

Parallel exact linear algebra Parallel dense linear algebra mod p

Perspectives

Large scale distributed exact linear algebraI reducing communications [Demmel, Grigori and Xiang 11]

I sparse and hybrid [Faugere and Lachartre 10]

Structured linear algebraI A lot of action recently [Jeannerod Schost 08], [Chowdhury & Al. 15]

I Combined with recent advances in linear algebra over K[X]

I Applications to list decoding

Symbolic-numeric computationI High precision floating point linear algebra via exact rational arithmetic and RNS

Thank you

C. Pernet Exact Linear Algebra Algorithmic July 6, 2015 73 / 73