Adaptive Parallel Exact dense LU factorization
Ziad SULTAN
15 May 2013
JNCF 2013, Université de Grenoble
1/35
Outline
1 Introduction
2 Exact Gaussian elimination (Gaussian elimination in numerical computation; exact Gaussian elimination; rank profile)
3 Dense linear algebra (optimized building blocks)
4 Block generic full-rank matrices (tiled LU factorization; parallelization of block LU; speedup)
5 Any rank profile (tiled CUP decomposition; parallelization with OpenMP; parallelization of block CUP with KAAPI)
6 Perspective
7 Conclusion
2/35
Gaussian elimination in dense computer algebra

Dense:
- benchmarking of supercomputers (www.top500.org)
- a basis of linear algebra

Sparse: large sparse matrix problems → smaller dense problems (still large!)
- Sparse iterative: induces dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form)
- Sparse direct: switch to dense after fill-in [FGB]
4/35
Gaussian elimination in numerical computation
Pivoting strategies:
- search for the best pivot → good numerical stability
- good data locality
- reduce fill-in → reduce additional memory needs → reduce induced computation costs
6/35
Exact Gaussian elimination: applications

Exact:
- Rank → algebraic topology (Smith normal form)
- Rank profile → Gröbner basis computation [FGB] → computational number theory [Stein]
- Characteristic polynomial → graph theory [G. Royle]
- Coding theory → semi-fields
7/35
Rank profile
Row/column rank profile
Definition: the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent.

Generic rank profile: A has generic rank profile if its first r leading principal minors are nonzero.

Example: the sequence {1,...,r} is the row rank profile of a matrix with generic rank profile.
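A concrete illustration (not on the slide): over any field, the matrix A = [[0, 1], [1, 0]] has rank r = 2 and row rank profile {1, 2}, yet its 1×1 leading principal minor is 0, so A does not have generic rank profile.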
8/35
Optimized building blocks in dense linear algebra

Matrix multiplication
- Algorithmic complexity: recursive → Strassen O(n^2.8), ..., O(n^ω)
- Optimized hardware implementation: iterative → pipelining, SSE, AVX, ...
- Implementation: cascading block versions
  - cache optimization
  - reduced dependency on bus speed
  - faster computation for blocks loaded in cache
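To make the cache-blocking idea concrete, here is a minimal sketch of a blocked product C ← C + A·B over Z/p in plain C++ (not FFLAS-FFPACK's actual kernel, which also cascades Strassen and BLAS routines); the block size BS, the naive inner loop, and the assumption that p is small enough for products to fit in 64 bits are all assumptions of the sketch.

#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal cache-blocked product over Z/p: C <- C + A.B, all n x n, row-major,
// entries assumed already reduced mod p.
// BS is chosen so that three BS x BS tiles fit in cache (assumption).
void blocked_fgemm(std::vector<int64_t>& C, const std::vector<int64_t>& A,
                   const std::vector<int64_t>& B, int n, int64_t p, int BS) {
    for (int ii = 0; ii < n; ii += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int jj = 0; jj < n; jj += BS)
          // inner kernel: the three tiles stay in cache across these loops
          for (int i = ii; i < std::min(ii + BS, n); ++i)
            for (int k = kk; k < std::min(kk + BS, n); ++k) {
              int64_t a = A[i * n + k];
              for (int j = jj; j < std::min(jj + BS, n); ++j)
                C[i * n + j] = (C[i * n + j] + a * B[k * n + j]) % p;
            }
}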
10/35
Gaussian elimination concerns
Same concerns as matrix multiplication ⇒ block versions
- Implementation optimization → benefits from fast matrix multiplication
- Reduced dependency on bus speed (cache optimization)

Candidate versions adapted to parallel computing:
- tiled iterative implementation
- block recursive implementation
11/35
Exact Gaussian elimination adapted to parallel computing

Block versions trade-off
Common point: fewer memory accesses if the block size fits the cache
→ N³/B memory accesses (N: matrix dimension, B: block size).
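As an order-of-magnitude illustration (figures not from the slides): with N = 10000 and B = 512, this is about N³/B ≈ 2·10⁹ cache-resident block accesses instead of the N³ = 10¹² accesses of an unblocked elimination.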
Trade-off:
- block recursive → more adaptive
- tiled iterative → fewer synchronizations
→ Historically, it is more difficult to parallelize a recursive implementation with existing parallel-computing models (OpenMP, ...).
12/35
State of the art
- Sequential exact → FFLAS-FFPACK, M4RI, within FGB
- Parallel numeric → ScaLAPACK, Plasma-Quark
- Parallel exact → ?? → this work
13/35
LU factorization of generic rank profile matrices
[Figure: the factorization A = L·U (with permutation matrix P)]
LU decomposition: applications
- Solving a system: A·x = b; with A = L·U, solve L·y = b, then U·x = y
- Rank: rank(A) is the number of rows of U
- Inverse of A: A⁻¹ = U⁻¹·L⁻¹
- Determinant: det(A) = ±det(U)
- Row or column rank profile: given by the positions of the row or column permutations
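A minimal sketch of the "solving a system" bullet over Z/p, assuming L (unit lower triangular) and U (upper triangular, full rank) are already stored as dense n×n row-major arrays; inv_mod is an assumed modular-inverse helper, and p is assumed small enough that products fit in 64 bits.

#include <cstdint>
#include <vector>

// Assumed helper: modular inverse of a mod p (e.g. via the extended Euclidean algorithm).
int64_t inv_mod(int64_t a, int64_t p);

// Solve A.x = b given A = L.U over Z/p: forward substitution L.y = b, then back substitution U.x = y.
std::vector<int64_t> lu_solve(const std::vector<int64_t>& L,
                              const std::vector<int64_t>& U,
                              std::vector<int64_t> b, int n, int64_t p) {
    // forward substitution: L.y = b (L has unit diagonal)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            b[i] = ((b[i] - L[i * n + j] * b[j]) % p + p) % p;
    // back substitution: U.x = y
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j)
            b[i] = ((b[i] - U[i * n + j] * b[j]) % p + p) % p;
        b[i] = (b[i] * inv_mod(U[i * n + i], p)) % p;
    }
    return b;
}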
15/35
Tiled iterative LU decomposition
LU decomposition of the first block: A11 = L1·U1
Updates: A21' = A21·U1⁻¹ ; A31' = A31·U1⁻¹ ; A12' = L1⁻¹·A12 ; A13' = L1⁻¹·A13 ; A22' = A22 − A21'·A12' ; ...
16/35
Tiled iterative LU Decomposition
[Figure: 3×3 tiling of A into blocks A11 ... A33; successive steps replace the eliminated diagonal blocks by L1/U1, L2/U2, L3/U3 and update the remaining blocks.]
Routines of the FFLAS-FFPACK library over a finite field Z/pZ:
- FTRSM (update blocks in the same column and the same row): Aik = Aik·Ukk⁻¹ ; Aki = Lkk⁻¹·Aki
- FGEMM (matrix multiplication, update the remaining blocks): Aij = Aij − Aik·Akj
- applyP (apply the permutation matrix): Aik = Aik·Pkk⁻¹
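Put together, one elimination step k chains these routines as sketched below (sequential version, before the parallel OpenMP and KAAPI loops shown later in the deck); the argument lists are elided with (...) as on those later slides, and the comments only restate the block formulas above.

for (int k = 0; k < nblocks; ++k) {
    FFPACK::LUdivine(...);                 // factor the pivot block: Akk = Lkk . Ukk (permutation Pkk)
    for (int i = k+1; i < nblocks; ++i)
        FFLAS::ftrsm(...);                 // Aik = Aik . Ukk^{-1}
    for (int i = k+1; i < nblocks; ++i) {
        FFPACK::applyP(...);               // apply the step-k permutation
        FFLAS::ftrsm(...);                 // Aki = Lkk^{-1} . Aki
    }
    for (int i = k+1; i < nblocks; ++i)
        for (int j = k+1; j < nblocks; ++j)
            FFLAS::fgemm(...);             // Aij = Aij - Aik . Akj
}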
17/35
OpenMP parallel loop synchronizations

[Figure: task timeline for a 3×3 tiling — LU(A11); then ApplyP/FTRSM on A12, A21, A13, A31; then FGEMM on A22, A23, A32, A33; then LU(A22), ApplyP/FTRSM on A23, A32, FGEMM on A33; finally LU(A33). A synchronization point (waiting for all tasks) ends each parallel loop.]
18/35
for (k = 0; k < nblocks; k++) {
    R = FFPACK::LUdivine(...);              // factor the pivot block A_kk
    #pragma omp parallel shared(A, P)
    {
        #pragma omp for nowait
        for (i = k+1; i < nblocks; i++)
            FFLAS::ftrsm(...);              // update blocks of the k-th block column
    }
    #pragma omp parallel for shared(A, P)
    for (i = k+1; i < nblocks; i++) {
        FFPACK::applyP(...);                // apply the step-k permutation
        FFLAS::ftrsm(...);                  // update blocks of the k-th block row
    }
    #pragma omp parallel for shared(A, P, T)
    for (i = k+1; i < nblocks; i++) {
        #pragma omp parallel for shared(A)
        for (j = k+1; j < nblocks; j++) {
            FFLAS::fgemm(...);              // update the trailing blocks
        }
    }
}
19/35
KAAPI dataflow scheduling for Tiled LUP
[Figure: task timeline for the same 3×3 tiling with KAAPI dataflow scheduling — the ApplyP/FTRSM, FGEMM and later LU(A22), LU(A33) tasks start as soon as their input blocks are ready, without global synchronizations between the loops.]
20/35
for (int k = 0; k < nblocks; k++) {
    #pragma kaapi task readwrite(&A) write(&P, &Q)
    R = FFPACK::LUdivine(...);              // factor the pivot block A_kk
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                  // update blocks of the k-th block column
    }
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&P)
        FFPACK::applyP(...);                // apply the step-k permutation
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                  // update blocks of the k-th block row
    }
    for (int i = k+1; i < nblocks; i++) {
        for (int j = k+1; j < nblocks; j++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::fgemm(...);              // update the trailing blocks
        }
    }
}
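The design difference with the OpenMP loops above: each call is declared as a KAAPI task with read/readwrite access modes, so the runtime builds the dependency graph itself, and tasks such as LU(A22) can start as soon as the blocks they read have been updated, instead of waiting for a barrier at the end of each loop (as the dataflow timeline on the previous slide illustrates).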
21/35
KAAPI vs OpenMP
HPAC: Intel Sandy Bridge E5-4620, 2.2 GHz, 32 cores, 16384 KB L3 cache; matrices over Z/1009Z.

[Figure: timings in seconds vs number of cores (up to 32), parallel overhead vs sequential for matrix dimension 10000×10000; curves: LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]
22/35
KAAPI version speed-up
[Figure: speed-up vs number of cores (up to 32) for KAAPI and OpenMP, matrix dimension 10000×10000; curves: KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, ideal.]
23/35
Parallelization overhead of the LU algorithm

[Figure: timings in seconds and gain factor (−30% to 100%) vs matrix dimension (2K–20K); gain factor of KAAPI vs OpenMP on dense full-rank matrices (32 cores); curves: OpenMP, KAAPI, 1 − KAAPI/OMP.]
24/35
CUP decomposition (rank-deficient matrices)

[Figure: decomposition A = C·U·P]
26/35
Block CUP decomposition

Block CUP ⇒ less parallelism:
- some independent tasks are removed
- one big, costly sequential task
27/35
Parallelization of block CUP with OpenMP
[Figure: speed-up vs number of cores (up to 32) for CUP (n=10000, R=5000, block size 212) over Z/1009; curves: OpenMP CUP speed-up, ideal.]
28/35
Parallelization of block CUP with KAAPI: dynamic scheduling

Dependencies
- The graph of task dependencies is computed at runtime.
- Dependencies between tasks are determined from the referent of each task.
- In this implementation, the referent of a block is its pointer, i.e. the pointer to the upper-left element of the block.

[Figure: three tasks sharing the same parameter X]
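A sketch of this convention; row-major storage with leading dimension lda, square B×B tiles, and the double element type are assumptions of the sketch, not stated on the slide.

#include <cstddef>

// Referent of tile (i, j): the pointer to its upper-left element.
// The runtime derives read/write dependencies between tasks from this pointer.
double* tile_referent(double* A, std::size_t lda, std::size_t B,
                      std::size_t i, std::size_t j) {
    return A + i * B * lda + j * B;
}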
29/35
Parallelization of block CUP with KAAPI: static scheduling

The graph of task dependencies is precomputed before execution (faster).

[Figure: three tasks sharing the same parameter X]

X is a task parameter, set with the CW access mode. The CW mode is not yet defined for static scheduling in the current KAAPI version.
30/35
Quad-recursive PLUQ algorithm

- Recursive cutting along both rows and columns.
- The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time.
- Yields the rank profile of all leading sub-matrices.

[Figure: successive recursive splittings of the matrix into quadrants, with zero blocks appearing as the elimination proceeds.]

[Dumas, Pernet, Sultan, ISSAC 2013]
32/35
Conclusion
- Exact computation
- Parallelization in the exact setting
- Trade-off: tiled iterative vs. block recursive ⇔ fewer synchronizations vs. adaptivity
- Specificities of exact vs. numeric: rank, rank profile → new issues and trade-offs compared to numeric and parallel numeric computation
- Dataflow-synchronized LUP → better adaptivity → more parallelism
- PLUQ
- Dynamic scheduling of CUP → dynamic block size, parallelism?
- New algorithms to parallelize: recursive, tiled?
34/35
Thank you for your attention!
35/35