Source: jncf2013.imag.fr/exposes/sultan/JNCF13_slides_sultan.pdf

Adaptive Parallel Exact dense LU factorization

Ziad SULTAN

15 May 2013

JNCF2013, Université de Grenoble


Outline

1 Introduction
2 Exact Gaussian elimination
    Gaussian elimination in numerical computation
    Exact Gaussian elimination
    Rank profile
3 Dense linear algebra
    Optimized building block
4 Block generic full rank matrices
    Tiled LU factorization
    Parallelization of block LU
    Speedup
5 Any rank profile
    Tiled CUP decomposition
    Parallelization with OpenMP
    Parallelization of Block CUP with KAAPI
6 Perspective
7 Conclusion


Gaussian elimination in Dense Computer algebra

Dense
- benchmarking of supercomputers (www.top500.org)
- basis of linear algebra

Sparse
Large sparse matrix problems → smaller dense problems (still large!)
- Sparse iterative: induces dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form)
- Sparse direct: switch to dense after fill-in [FGB]


Gaussian elimination in numerical computation

Pivoting strategies
- search for the best pivot → good numerical stability
- good data locality
- reduce the fill-in → reduce additional memory needs → reduce induced computation costs


Exact gaussian elimination applications

Exact
- Rank
  → Algebraic topology (Smith normal form)
- Rank profile
  → Gröbner basis computation [FGB]
  → Computational number theory [Stein]
- Characteristic polynomial
  → Graph theory [G. Royle]
- Coding theory
  → Semi-fields


Rank profile

Row/column rank profile
Definition: the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent (r being the rank of A).

Generic rank profile
A matrix of rank r has generic rank profile if its first r leading principal minors are nonzero.

Example: the sequence {1,...,r} is the row rank profile of a generic rank profile matrix.
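A minimal sketch of the row rank profile definition above, not the FFPACK routine: rows are scanned from top to bottom, and a row index is kept whenever the row is linearly independent of the rows already kept, which yields the lexicographically smallest sequence. The names `row_rank_profile`, `pow_mod` and `inv_mod` are illustrative, and p is assumed to be prime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<std::int64_t>>;

// Modular exponentiation by repeated squaring.
std::int64_t pow_mod(std::int64_t base, std::int64_t exp, std::int64_t p) {
    std::int64_t result = 1 % p;
    base %= p;
    while (exp > 0) {
        if (exp & 1) result = result * base % p;
        base = base * base % p;
        exp >>= 1;
    }
    return result;
}

// Inverse modulo a prime p (Fermat's little theorem). Assumption: p is prime.
std::int64_t inv_mod(std::int64_t a, std::int64_t p) { return pow_mod(a, p - 2, p); }

// Row rank profile (1-based indices) of A over Z/pZ.
// Each row is reduced against the pivot rows found so far; it is kept
// if and only if it is still nonzero after that reduction.
std::vector<int> row_rank_profile(Matrix A, std::int64_t p) {
    assert(p > 1);
    const std::size_t m = A.size();
    const std::size_t n = m ? A[0].size() : 0;
    for (auto& row : A)
        for (auto& x : row) x = ((x % p) + p) % p;  // normalize entries to [0, p)

    std::vector<int> profile;
    std::vector<std::size_t> pivot_row, pivot_col;
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = 0; k < pivot_row.size(); ++k) {
            const std::size_t c = pivot_col[k];
            if (A[i][c] == 0) continue;
            const std::vector<std::int64_t>& prow = A[pivot_row[k]];
            const std::int64_t factor = A[i][c] * inv_mod(prow[c], p) % p;
            for (std::size_t j = 0; j < n; ++j)
                A[i][j] = ((A[i][j] - factor * prow[j]) % p + p) % p;
        }
        for (std::size_t j = 0; j < n; ++j) {
            if (A[i][j] != 0) {  // row i is independent of the rows kept so far
                pivot_row.push_back(i);
                pivot_col.push_back(j);
                profile.push_back(static_cast<int>(i) + 1);
                break;
            }
        }
    }
    return profile;
}
```

The column rank profile can be obtained the same way by applying the function to the transpose.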


Optimized building block in Dense linear algebra

Matrix multiplication
- Algorithmic complexity: recursive → Strassen $O(n^{2.8})$, ..., $O(n^{\omega})$
- Optimized hardware implementation: iterative → pipeline, SSE, AVX, ...
- Implementation: block versions cascading
  - cache optimization
  - reduce dependencies on the bus speed
  - faster computation for blocks loaded in cache
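To make the block version concrete, here is a hedged sketch of a tiled matrix product over Z/pZ. It only illustrates the loop structure (three tiles are worked on at a time, so they can stay in cache when the tile size is chosen accordingly); it is not the FFLAS implementation. The function name `blocked_matmul_mod` and the parameter `bs` (tile size) are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using i64 = std::int64_t;

// C += A * B over Z/pZ for n x n row-major matrices with entries in [0, p).
// The three outer loops walk over tiles; the three inner loops multiply one
// tile of A by one tile of B and accumulate into one tile of C.
void blocked_matmul_mod(const std::vector<i64>& A, const std::vector<i64>& B,
                        std::vector<i64>& C, std::size_t n, std::size_t bs, i64 p) {
    assert(bs > 0 && A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t i_end = std::min(ii + bs, n);
                const std::size_t k_end = std::min(kk + bs, n);
                const std::size_t j_end = std::min(jj + bs, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const i64 a = A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * n + j] = (C[i * n + j] + a * B[k * n + j]) % p;
                    }
            }
}
```

The result is identical to the naive triple loop; only the order in which memory is visited changes.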


Gaussian elimination concerns

Same concerns as matrix multiplication (M.M.) ⇒ block versions
- Implementation optimization → benefits from matrix multiplication
- Reduce dependencies on bus speed (cache optimization)

Possible best versions adapted for parallel computing
- Tiled iterative implementation
- Block recursive implementation


Exact Gaussian elimination adapted for parallel computing

Block versions trade-off

Common point: fewer memory accesses if the block size fits the cache
→ $N^3/B$ memory accesses ($N$ the dimension of the matrix, $B$ the block size)

Trade-off
- block recursive: → more adaptive
- tiled iterative: → fewer synchronizations
→ Historically, it has been more difficult to parallelize a recursive implementation with the existing models of parallel computing (OpenMP, ...)


State of the art

- Sequential exact: → FFLAS-FFPACK, M4RI, within FGB
- Parallel numeric: → ScaLAPACK, PLASMA-Quark
- Parallel exact: ?? → this work


LU factorization of generic rank profile matrices

[Figure: A = L · U · P]

LU decomposition applications
- Solving a system: $A x = b$; $L (U x) = b$; $L y = b$
- Rank: Rank(A) is the number of rows of U
- Inverse of A: $A^{-1} = U^{-1} L^{-1}$
- Determinant: $\det(A) = \pm\det(U)$
- Row or column rank profile: given by the positions of the row or column permutations
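As a hedged illustration of these applications, here is a small worked example over Z/7Z. The 2×2 matrix and the right-hand side are hypothetical and chosen so that no permutation is needed (P is the identity).

```latex
% Hypothetical 2x2 example over Z/7Z, P = identity
A = \begin{pmatrix} 2 & 3 \\ 4 & 1 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix}}_{L}
    \underbrace{\begin{pmatrix} 2 & 3 \\ 0 & 2 \end{pmatrix}}_{U}
  \pmod 7

% Solve A x = b with b = (1, 2)^T: first L y = b, then U x = y
L y = b \;\Rightarrow\; y = (1,\; 2 - 2 \cdot 1)^T = (1, 0)^T
U x = y \;\Rightarrow\; x_2 = 0,\quad x_1 = 2^{-1} \cdot 1 = 4
        \;\Rightarrow\; x = (4, 0)^T

% Determinant and rank are read off U
\det(A) = \det(U) = 2 \cdot 2 = 4 \pmod 7, \qquad \operatorname{rank}(A) = 2
```

Checking the solution: $2 \cdot 4 + 3 \cdot 0 = 8 \equiv 1$ and $4 \cdot 4 + 1 \cdot 0 = 16 \equiv 2 \pmod 7$.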


Tiled iterative LU decomposition

LU decomposition on the first block: $A_{11} = L_1 U_1$

Updates:
$A'_{21} = A_{21} U_1^{-1}$ ; $A'_{31} = A_{31} U_1^{-1}$
$A'_{12} = L_1^{-1} A_{12}$ ; $A'_{13} = L_1^{-1} A_{13}$ ; $A'_{22} = A_{22} - A'_{21} A'_{12}$ ...


Tiled iterative LU Decomposition

[Figure: 3×3 block partition of A]
A11 A12 A13
A21 A22 A23
A31 A32 A33

Routines of the FFLAS-FFPACK library over a finite field Z/pZ

FTRSM (updates the blocks in the same column and the same row)
  $A_{ik} = A_{ik} U_{kk}^{-1}$
  $A_{ki} = L_{kk}^{-1} A_{ki}$

FGEMM (matrix multiplication, updates the remaining blocks)
  $A_{ij} = A_{ij} - A_{ik} A_{kj}$

applyP (applies the permutation matrix)
  $A_{ik} = A_{ik} P_{kk}^{-1}$
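The tiled iteration can be sketched in plain C++ as follows. This is a simplified stand-in, not the FFLAS-FFPACK code: it assumes a full rank matrix with generic rank profile, so no permutation (applyP) is needed, a prime p, and a dimension divisible by the tile size. All names (`ModMatrix`, `lu_tile`, `trsm_right_upper`, `trsm_left_lower`, `gemm_update`, `tiled_lu`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using i64 = std::int64_t;

// Square matrix over Z/pZ, row-major, entries kept in [0, p). Illustrative type.
struct ModMatrix {
    std::size_t n;
    i64 p;
    std::vector<i64> a;
    i64& at(std::size_t i, std::size_t j) { return a[i * n + j]; }
};

i64 mod_p(i64 x, i64 p) { return ((x % p) + p) % p; }

// Inverse modulo a prime p via Fermat's little theorem.
i64 inv_mod(i64 x, i64 p) {
    i64 r = 1, e = p - 2;
    x = mod_p(x, p);
    while (e > 0) {
        if (e & 1) r = r * x % p;
        x = x * x % p;
        e >>= 1;
    }
    return r;
}

// Diagonal tile: A_kk = L_kk * U_kk in place (unit lower L, upper U), no pivoting.
void lu_tile(ModMatrix& A, std::size_t k0, std::size_t bs) {
    for (std::size_t k = k0; k < k0 + bs; ++k) {
        assert(A.at(k, k) != 0);  // holds under the generic rank profile assumption
        const i64 inv = inv_mod(A.at(k, k), A.p);
        for (std::size_t i = k + 1; i < k0 + bs; ++i) {
            A.at(i, k) = A.at(i, k) * inv % A.p;
            for (std::size_t j = k + 1; j < k0 + bs; ++j)
                A.at(i, j) = mod_p(A.at(i, j) - A.at(i, k) * A.at(k, j), A.p);
        }
    }
}

// A_ik <- A_ik * U_kk^{-1} for the tile at rows i0.., columns k0..
void trsm_right_upper(ModMatrix& A, std::size_t i0, std::size_t k0, std::size_t bs) {
    for (std::size_t r = i0; r < i0 + bs; ++r)
        for (std::size_t c = k0; c < k0 + bs; ++c) {
            i64 s = A.at(r, c);
            for (std::size_t t = k0; t < c; ++t)
                s = mod_p(s - A.at(r, t) * A.at(t, c), A.p);
            A.at(r, c) = s * inv_mod(A.at(c, c), A.p) % A.p;
        }
}

// A_kj <- L_kk^{-1} * A_kj for the tile at rows k0.., columns j0.. (L_kk unit lower)
void trsm_left_lower(ModMatrix& A, std::size_t k0, std::size_t j0, std::size_t bs) {
    for (std::size_t r = k0; r < k0 + bs; ++r)
        for (std::size_t t = k0; t < r; ++t)
            for (std::size_t c = j0; c < j0 + bs; ++c)
                A.at(r, c) = mod_p(A.at(r, c) - A.at(r, t) * A.at(t, c), A.p);
}

// A_ij <- A_ij - A_ik * A_kj
void gemm_update(ModMatrix& A, std::size_t i0, std::size_t j0, std::size_t k0,
                 std::size_t bs) {
    for (std::size_t r = i0; r < i0 + bs; ++r)
        for (std::size_t t = k0; t < k0 + bs; ++t)
            for (std::size_t c = j0; c < j0 + bs; ++c)
                A.at(r, c) = mod_p(A.at(r, c) - A.at(r, t) * A.at(t, c), A.p);
}

// In-place tiled LU: L ends up strictly below the diagonal, U on and above it.
void tiled_lu(ModMatrix& A, std::size_t bs) {
    assert(bs > 0 && A.n % bs == 0);
    for (std::size_t k0 = 0; k0 < A.n; k0 += bs) {
        lu_tile(A, k0, bs);
        for (std::size_t i0 = k0 + bs; i0 < A.n; i0 += bs) {
            trsm_right_upper(A, i0, k0, bs);  // block in the same column
            trsm_left_lower(A, k0, i0, bs);   // block in the same row
        }
        for (std::size_t i0 = k0 + bs; i0 < A.n; i0 += bs)
            for (std::size_t j0 = k0 + bs; j0 < A.n; j0 += bs)
                gemm_update(A, i0, j0, k0, bs);
    }
}
```

Within one iteration, all the triangular solves are independent of each other, and so are all the matrix-multiplication updates; that independence is what the parallel versions exploit.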

[Figure, animation step 1: A11 is replaced by L1 and U1; the other blocks are updated to A'12, A'13, A'21, A'22, A'23, A'31, A'32, A'33.]

[Figure, animation step 2: A'22 is replaced by L2 and U2; the remaining blocks become A''23, A''32, A''33.]

[Figure, animation step 3: A''33 is replaced by L3 and U3.]

OpenMP parallel loop Synchronizations

[Figure: task timeline of the OpenMP parallel-loop version. Each phase ends with a synchronization that waits for all of its tasks: LU(A11); ApplyP/FTRSM on A12, A13, A21, A31; FGEMM on A22, A23, A32, A33; LU(A22); ApplyP/FTRSM on A23, A32; FGEMM(A33); LU(A33).]


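    for (int k = 0; k < nblocks; k++) {
        // Sequential factorization of the diagonal block
        R = FFPACK::LUdivine(...);
        // First panel update; the parallel region ends with an implicit barrier
        #pragma omp parallel shared(A, P)
        {
            #pragma omp for nowait
            for (int i = k+1; i < nblocks; i++)
                FFLAS::ftrsm(...);
        }
        // Second panel update: apply the permutation, then solve
        #pragma omp parallel for shared(A, P)
        for (int i = k+1; i < nblocks; i++) {
            FFPACK::applyP(...);
            FFLAS::ftrsm(...);
        }
        // Update of the remaining blocks (nested parallel loops)
        #pragma omp parallel for shared(A, P, T)
        for (int i = k+1; i < nblocks; i++) {
            #pragma omp parallel for shared(A)
            for (int j = k+1; j < nblocks; j++) {
                FFLAS::fgemm(...);
            }
        }
    }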


KAAPI dataflow scheduling for Tiled LUP

[Figure: task timeline under KAAPI dataflow scheduling, with the same tasks as the OpenMP version: LU(A11); ApplyP/FTRSM on A12, A13, A21, A31; FGEMM on A22, A23, A32, A33; LU(A22); ApplyP/FTRSM on A23, A32; FGEMM(A33); LU(A33). No global synchronization points are shown.]


    // Each annotated call becomes a task; the access modes (read, write,
    // readwrite) declare how the task uses its arguments.
    for (int k = 0; k < nblocks; k++) {
        #pragma kaapi task readwrite(&A) write(&P, &Q)
        R = FFPACK::LUdivine(...);
        for (int i = k+1; i < nblocks; i++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::ftrsm(...);
        }
        for (int i = k+1; i < nblocks; i++) {
            #pragma kaapi task readwrite(&A) read(&P)
            FFPACK::applyP(...);
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::ftrsm(...);
        }
        for (int i = k+1; i < nblocks; i++) {
            for (int j = k+1; j < nblocks; j++) {
                #pragma kaapi task readwrite(&A) read(&A)
                FFLAS::fgemm(...);
            }
        }
    }


KAAPI vs OpenMP

HPAC: Intel SandyBridge E5-4620, 2.2 GHz, 32 cores, L3 cache (16384 KB). Computations over Z/1009Z.

[Figure: Overcost parallel vs sequential for matrix dimension 10000×10000. x-axis: number of cores (0 to 35); y-axis: timings in seconds (0 to 80). Curves: LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]


KAAPI version speed-up

[Figure: Speed-up of KAAPI and OpenMP for matrix dimension 10000×10000. x-axis: number of cores (0 to 35); y-axis: speed-up (0 to 35). Curves: KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, Ideal.]


Parallelization overcost on LU algorithm

[Figure: Gain factor KAAPI vs OMP on dense full rank matrices (32 cores). x-axis: matrix dimension (2K to 20K); left y-axis: timings in seconds (0 to 70); right y-axis: gain factor (-30 % to 100 %). Curves: OpenMP, KAAPI, 1-KAAPI/OMP.]
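Reading the legend entry 1-KAAPI/OMP as a ratio of timings (an assumption, since the slide does not spell it out), the gain factor would be computed as below. The numbers are hypothetical and serve only to show the arithmetic; they are not measurements from the talk.

```latex
\text{gain} = 1 - \frac{T_{\mathrm{KAAPI}}}{T_{\mathrm{OMP}}}

% Hypothetical timings, for illustration only: T_OMP = 10 s, T_KAAPI = 8 s
\text{gain} = 1 - \frac{8}{10} = 0.2 = 20\,\%
```

With this reading, a positive value means the KAAPI version is faster and a negative value means the OpenMP version is faster.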


CUP decomposition (Rank deficient matrices)

[Figure: A = C · U · P]


block CUP decomposition

Block CUP: less parallelism
- some independent tasks removed
- a big, costly sequential task


Parallelization of block CUP with OpenMP

[Figure: CUP (n=10000, R=5000, block size=212) over Z/1009Z. x-axis: number of cores (0 to 35); y-axis: speed-up (0 to 35). Curves: OpenMP CUP speedup, Ideal.]


Parallelization of block CUP with KAAPI: dynamic scheduling

Dependencies
- The task dependency graph is computed at runtime.
- Dependencies between tasks are determined by the referent of each task.
- In this implementation, the referent is the pointer to the block, i.e. the pointer to the upper-left corner of each block.

[Figure: three blocks labelled X]
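The idea of keying dependencies on the upper-left element of each block can be sketched with a different, widely available mechanism: OpenMP task `depend` clauses (OpenMP 4.0 and later). This is a stand-in chosen for illustration, not KAAPI code, and the toy kernels `scale_tile` and `add_tile` as well as `dataflow_demo` are hypothetical. The dependency shape mimics one step of the tiled factorization: one task on the diagonal block, two tasks that read it, and a last task that reads both of those results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst += src for bs x bs tiles stored inside a row-major matrix of leading dimension lda.
void add_tile(long* dst, const long* src, std::size_t bs, std::size_t lda) {
    for (std::size_t i = 0; i < bs; ++i)
        for (std::size_t j = 0; j < bs; ++j)
            dst[i * lda + j] += src[i * lda + j];
}

// t *= f for a bs x bs tile.
void scale_tile(long* t, long f, std::size_t bs, std::size_t lda) {
    for (std::size_t i = 0; i < bs; ++i)
        for (std::size_t j = 0; j < bs; ++j)
            t[i * lda + j] *= f;
}

// m is a (2*bs) x (2*bs) row-major matrix seen as a 2 x 2 grid of tiles.
// Each task names the upper-left element of the tiles it reads or writes;
// the runtime derives the task graph from these addresses.
// Without OpenMP support the pragmas are ignored and the code runs sequentially.
void dataflow_demo(std::vector<long>& m, std::size_t bs) {
    assert(bs > 0);
    const std::size_t n = 2 * bs;
    assert(m.size() == n * n);
    long* a00 = m.data();
    long* a01 = a00 + bs;
    long* a10 = a00 + bs * n;
    long* a11 = a10 + bs;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: a00[0])
        scale_tile(a00, 2, bs, n);

        #pragma omp task depend(in: a00[0]) depend(inout: a01[0])
        add_tile(a01, a00, bs, n);

        #pragma omp task depend(in: a00[0]) depend(inout: a10[0])
        add_tile(a10, a00, bs, n);

        #pragma omp task depend(in: a01[0], a10[0]) depend(inout: a11[0])
        {
            add_tile(a11, a01, bs, n);
            add_tile(a11, a10, bs, n);
        }
    }  // all tasks have completed at the barrier that ends the single construct
}
```

The two middle tasks may run concurrently because they only share a read of the first tile; the last task starts once both have finished, with no global barrier in between.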


Parallelization of block CUP with KAAPI: static scheduling

The task dependency graph is precomputed before execution (faster).

[Figure: three blocks labelled X]

X is a task parameter, set as CW. The CW mode for static scheduling is not yet defined in the current KAAPI version.


Quad-recursive PLUQ algorithm

- Recursive splitting along both rows and columns
- The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time.
- Rank profile of all leading sub-matrices

[Dumas, Pernet, Sultan, ISSAC 2013]


Conclusion

- Exact computation
- Parallelization in exact computation
- Trade-off: (tiled, block) <=> (adaptive, less sync.)
- Specificity of exact vs numeric: rank, rank profile → new issues and trade-offs compared with numeric and parallel numeric
- Dataflow synchronization for LUP: → better adaptability → more parallelism
- PLUQ
- Dynamic scheduling for CUP: → dynamic block size, parallelism?
- New algorithm to parallelize: recursive, tiled?


Thank you for your attention!