Adaptive Parallel Exact dense LU factorization
Ziad SULTAN
15 May 2013
JNCF 2013, Université de Grenoble
1/35
Outline
1 Introduction
2 Exact Gaussian elimination (Gaussian elimination in numerical computation; exact Gaussian elimination; rank profile)
3 Dense linear algebra (optimized building blocks)
4 Block generic full-rank matrices (tiled LU factorization; parallelization of block LU; speedup)
5 Any rank profile (tiled CUP decomposition; parallelization with OpenMP; parallelization of block CUP with KAAPI)
6 Perspective
7 Conclusion
2/35
Gaussian elimination in dense computer algebra

Dense:
- benchmarking of supercomputers (www.top500.org)
- a basis of linear algebra

Sparse: large sparse matrix problems → smaller dense problems (still large!)
- Sparse iterative: induces dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form)
- Sparse direct: switch to dense after fill-in [FGB]
4/35
Gaussian elimination in numerical computation
Pivoting strategies:
- search for the best pivot → good numerical stability
- good data locality
- reduce fill-in → reduce additional memory needs → reduce induced computation costs
6/35
Exact Gaussian elimination: applications

Exact:
- Rank → algebraic topology (Smith normal form)
- Rank profile → Gröbner basis computation [FGB] → computational number theory [Stein]
- Characteristic polynomial → graph theory [G. Royle]
- Coding theory → semi-fields
7/35
Rank profile
Row/column rank profile
Definition: the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent.

Generic rank profile: A has generic rank profile if its first r leading principal minors are nonzero.

Example: the sequence {1,...,r} is the row rank profile of a matrix with generic rank profile.
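A concrete illustration (not on the slide): over any field, the matrix A = [[0, 1], [1, 0]] has rank r = 2 and row rank profile {1, 2}, yet its 1×1 leading principal minor is 0, so A does not have generic rank profile.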
8/35
Optimized building blocks in dense linear algebra

Matrix multiplication
- Algorithmic complexity: recursive → Strassen O(n^2.8), ..., O(n^ω)
- Optimized hardware implementation: iterative → pipelining, SSE, AVX, ...
- Implementation: cascading block versions
  - cache optimization
  - reduced dependency on bus speed
  - faster computation for blocks loaded in cache
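To make the cache-blocking idea concrete, here is a minimal sketch of a blocked product C ← C + A·B over Z/p in plain C++ (not FFLAS-FFPACK's actual kernel, which also cascades Strassen and BLAS routines); the block size BS, the naive inner loop, and the assumption that p is small enough for products to fit in 64 bits are all assumptions of the sketch.

#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal cache-blocked product over Z/p: C <- C + A.B, all n x n, row-major,
// entries assumed already reduced mod p.
// BS is chosen so that three BS x BS tiles fit in cache (assumption).
void blocked_fgemm(std::vector<int64_t>& C, const std::vector<int64_t>& A,
                   const std::vector<int64_t>& B, int n, int64_t p, int BS) {
    for (int ii = 0; ii < n; ii += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int jj = 0; jj < n; jj += BS)
          // inner kernel: the three tiles stay in cache across these loops
          for (int i = ii; i < std::min(ii + BS, n); ++i)
            for (int k = kk; k < std::min(kk + BS, n); ++k) {
              int64_t a = A[i * n + k];
              for (int j = jj; j < std::min(jj + BS, n); ++j)
                C[i * n + j] = (C[i * n + j] + a * B[k * n + j]) % p;
            }
}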
10/35
Gaussian elimination concerns
Same concerns as matrix multiplication ⇒ block versions
- Implementation optimization → benefits from fast matrix multiplication
- Reduced dependency on bus speed (cache optimization)

Candidate versions adapted to parallel computing:
- tiled iterative implementation
- block recursive implementation
11/35
Exact Gaussian elimination adapted to parallel computing

Block versions trade-off
Common point: fewer memory accesses if the block size fits the cache
→ N³/B memory accesses (N: matrix dimension, B: block size).
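As an order-of-magnitude illustration (figures not from the slides): with N = 10000 and B = 512, this is about N³/B ≈ 2·10⁹ cache-resident block accesses instead of the N³ = 10¹² accesses of an unblocked elimination.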
Trade-off:
- block recursive → more adaptive
- tiled iterative → fewer synchronizations
→ Historically, it is more difficult to parallelize a recursive implementation with existing parallel-computing models (OpenMP, ...).
12/35
State of the art
- Sequential exact → FFLAS-FFPACK, M4RI, within FGB
- Parallel numeric → ScaLAPACK, Plasma-Quark
- Parallel exact → ?? → this work
13/35
LU factorization of generic rank profile matrices
[Figure: the factorization A = L·U (with permutation matrix P)]
LU decomposition: applications
- Solving a system: A·x = b; with A = L·U, solve L·y = b, then U·x = y
- Rank: rank(A) is the number of rows of U
- Inverse of A: A⁻¹ = U⁻¹·L⁻¹
- Determinant: det(A) = ±det(U)
- Row or column rank profile: given by the positions of the row or column permutations
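A minimal sketch of the "solving a system" bullet over Z/p, assuming L (unit lower triangular) and U (upper triangular, full rank) are already stored as dense n×n row-major arrays; inv_mod is an assumed modular-inverse helper, and p is assumed small enough that products fit in 64 bits.

#include <cstdint>
#include <vector>

// Assumed helper: modular inverse of a mod p (e.g. via the extended Euclidean algorithm).
int64_t inv_mod(int64_t a, int64_t p);

// Solve A.x = b given A = L.U over Z/p: forward substitution L.y = b, then back substitution U.x = y.
std::vector<int64_t> lu_solve(const std::vector<int64_t>& L,
                              const std::vector<int64_t>& U,
                              std::vector<int64_t> b, int n, int64_t p) {
    // forward substitution: L.y = b (L has unit diagonal)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            b[i] = ((b[i] - L[i * n + j] * b[j]) % p + p) % p;
    // back substitution: U.x = y
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j)
            b[i] = ((b[i] - U[i * n + j] * b[j]) % p + p) % p;
        b[i] = (b[i] * inv_mod(U[i * n + i], p)) % p;
    }
    return b;
}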
15/35
Tiled iterative LU decomposition
LU decomposition of the first block: A11 = L1·U1
Updates: A21' = A21·U1⁻¹ ; A31' = A31·U1⁻¹ ; A12' = L1⁻¹·A12 ; A13' = L1⁻¹·A13 ; A22' = A22 − A21'·A12' ; ...
16/35
Tiled iterative LU Decomposition
[Figure: 3×3 tiling of A into blocks A11 ... A33; successive steps replace the eliminated diagonal blocks by L1/U1, L2/U2, L3/U3 and update the remaining blocks.]
Routines of the FFLAS-FFPACK library over a finite field Z/pZ:
- FTRSM (update blocks in the same column and the same row): Aik = Aik·Ukk⁻¹ ; Aki = Lkk⁻¹·Aki
- FGEMM (matrix multiplication, update the remaining blocks): Aij = Aij − Aik·Akj
- applyP (apply the permutation matrix): Aik = Aik·Pkk⁻¹
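Put together, one elimination step k chains these routines as sketched below (sequential version, before the parallel OpenMP and KAAPI loops shown later in the deck); the argument lists are elided with (...) as on those later slides, and the comments only restate the block formulas above.

for (int k = 0; k < nblocks; ++k) {
    FFPACK::LUdivine(...);                 // factor the pivot block: Akk = Lkk . Ukk (permutation Pkk)
    for (int i = k+1; i < nblocks; ++i)
        FFLAS::ftrsm(...);                 // Aik = Aik . Ukk^{-1}
    for (int i = k+1; i < nblocks; ++i) {
        FFPACK::applyP(...);               // apply the step-k permutation
        FFLAS::ftrsm(...);                 // Aki = Lkk^{-1} . Aki
    }
    for (int i = k+1; i < nblocks; ++i)
        for (int j = k+1; j < nblocks; ++j)
            FFLAS::fgemm(...);             // Aij = Aij - Aik . Akj
}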
17/35
OpenMP parallel loop synchronizations

[Figure: task timeline for a 3×3 tiling — LU(A11); then ApplyP/FTRSM on A12, A21, A13, A31; then FGEMM on A22, A23, A32, A33; then LU(A22), ApplyP/FTRSM on A23, A32, FGEMM on A33; finally LU(A33). A synchronization point (waiting for all tasks) ends each parallel loop.]
18/35
for (k = 0; k < nblocks; k++) {
    R = FFPACK::LUdivine(...);              // factor the pivot block A_kk
    #pragma omp parallel shared(A, P)
    {
        #pragma omp for nowait
        for (i = k+1; i < nblocks; i++)
            FFLAS::ftrsm(...);              // update blocks of the k-th block column
    }
    #pragma omp parallel for shared(A, P)
    for (i = k+1; i < nblocks; i++) {
        FFPACK::applyP(...);                // apply the step-k permutation
        FFLAS::ftrsm(...);                  // update blocks of the k-th block row
    }
    #pragma omp parallel for shared(A, P, T)
    for (i = k+1; i < nblocks; i++) {
        #pragma omp parallel for shared(A)
        for (j = k+1; j < nblocks; j++) {
            FFLAS::fgemm(...);              // update the trailing blocks
        }
    }
}
19/35
KAAPI dataflow scheduling for Tiled LUP
[Figure: task timeline for the same 3×3 tiling with KAAPI dataflow scheduling — the ApplyP/FTRSM, FGEMM and later LU(A22), LU(A33) tasks start as soon as their input blocks are ready, without global synchronizations between the loops.]
20/35
for (int k = 0; k < nblocks; k++) {
    #pragma kaapi task readwrite(&A) write(&P, &Q)
    R = FFPACK::LUdivine(...);              // factor the pivot block A_kk
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                  // update blocks of the k-th block column
    }
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&P)
        FFPACK::applyP(...);                // apply the step-k permutation
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                  // update blocks of the k-th block row
    }
    for (int i = k+1; i < nblocks; i++) {
        for (int j = k+1; j < nblocks; j++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::fgemm(...);              // update the trailing blocks
        }
    }
}
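The design difference with the OpenMP loops above: each call is declared as a KAAPI task with read/readwrite access modes, so the runtime builds the dependency graph itself, and tasks such as LU(A22) can start as soon as the blocks they read have been updated, instead of waiting for a barrier at the end of each loop (as the dataflow timeline on the previous slide illustrates).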
21/35
KAAPI vs OpenMP
HPAC: Intel Sandy Bridge E5-4620, 2.2 GHz, 32 cores, 16384 KB L3 cache; matrices over Z/1009Z.

[Figure: timings in seconds vs number of cores (up to 32), parallel overhead vs sequential for matrix dimension 10000×10000; curves: LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]
22/35
KAAPI version speed-up
[Figure: speed-up vs number of cores (up to 32) for KAAPI and OpenMP, matrix dimension 10000×10000; curves: KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, ideal.]
23/35
Parallelization overhead of the LU algorithm

[Figure: timings in seconds and gain factor (−30% to 100%) vs matrix dimension (2K–20K); gain factor of KAAPI vs OpenMP on dense full-rank matrices (32 cores); curves: OpenMP, KAAPI, 1 − KAAPI/OMP.]
24/35
CUP decomposition (rank-deficient matrices)

[Figure: decomposition A = C·U·P]
26/35
Block CUP decomposition

Block CUP ⇒ less parallelism:
- some independent tasks are removed
- one big, costly sequential task
27/35
Parallelization of block CUP with OpenMP
[Figure: speed-up vs number of cores (up to 32) for CUP (n=10000, R=5000, block size 212) over Z/1009; curves: OpenMP CUP speed-up, ideal.]
28/35
Parallelization of block CUP with KAAPI: dynamic scheduling

Dependencies
- The graph of task dependencies is computed at runtime.
- Dependencies between tasks are determined from the referent of each task.
- In this implementation, the referent of a block is its pointer, i.e. the pointer to the upper-left element of the block.

[Figure: three tasks sharing the same parameter X]
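A sketch of this convention; row-major storage with leading dimension lda, square B×B tiles, and the double element type are assumptions of the sketch, not stated on the slide.

#include <cstddef>

// Referent of tile (i, j): the pointer to its upper-left element.
// The runtime derives read/write dependencies between tasks from this pointer.
double* tile_referent(double* A, std::size_t lda, std::size_t B,
                      std::size_t i, std::size_t j) {
    return A + i * B * lda + j * B;
}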
29/35
Parallelization of block CUP with KAAPI: static scheduling

The graph of task dependencies is precomputed before execution (faster).

[Figure: three tasks sharing the same parameter X]

X is a task parameter, set with the CW access mode. The CW mode is not yet defined for static scheduling in the current KAAPI version.
30/35
Quad-recursive PLUQ algorithm

- Recursive cutting along both rows and columns.
- The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time.
- Yields the rank profile of all leading sub-matrices.

[Figure: successive recursive splittings of the matrix into quadrants, with zero blocks appearing as the elimination proceeds.]

[Dumas, Pernet, Sultan, ISSAC 2013]
32/35
Conclusion
- Exact computation
- Parallelization in the exact setting
- Trade-off: tiled iterative vs. block recursive ⇔ fewer synchronizations vs. adaptivity
- Specificities of exact vs. numeric: rank, rank profile → new issues and trade-offs compared to numeric and parallel numeric computation
- Dataflow-synchronized LUP → better adaptivity → more parallelism
- PLUQ
- Dynamic scheduling of CUP → dynamic block size, parallelism?
- New algorithms to parallelize: recursive, tiled?
34/35
Thank you for your attention!
35/35