SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan
The University of Texas at Austin
Motivation
Motivating Example: Cholesky Factorization A → L L^T
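As a concrete instance (a small worked example added here for illustration; A must be symmetric positive definite, and L is lower triangular):

A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}
  = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
    \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
  = L\,L^T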
Motivation
[Performance graph: peak performance 96 Gflops; higher is better]
Outline
Performance
FLAME
SuperMatrix
Conclusion
Performance
Target Architecture: 16-CPU Itanium2 NUMA system (8 dual-processor nodes)
OpenMP via Intel Compiler 9.0
BLAS: GotoBLAS 1.06 and Intel MKL 8.1
Performance
Implementations:
Multithreaded BLAS (sequential algorithm): LAPACK dpotrf, FLAME var3
Serial BLAS (parallel algorithm): data-flow (SuperMatrix)
Performance
Implementations use column-major order storage
Varying block sizes { 64, 96, 128, 160, 192, 224, 256 }; the best-performing block size is selected for each problem size
Performance
[Performance graph]
FLAME
Formal Linear Algebra Methods Environment:
High-level abstraction away from indices
"Views" into matrices
Seamless transition from algorithms to code
FLAME
Cholesky Factorization
for ( j = 0; j < n; j++ ) {
  A[j,j] = sqrt( A[j,j] );              /* factor the diagonal element */
  for ( i = j+1; i < n; i++ )
    A[i,j] = A[i,j] / A[j,j];           /* scale the column below it */
  for ( k = j+1; k < n; k++ )           /* symmetric rank-1 update of */
    for ( i = k; i < n; i++ )           /* the trailing submatrix */
      A[i,k] = A[i,k] - A[i,j] * A[k,j];
}
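For concreteness, here is a minimal runnable C rendering of the pseudocode above, assuming column-major storage with leading dimension lda (the IDX macro and the name chol_unb are illustrative assumptions, not part of FLAME):

#include <math.h>

/* Unblocked right-looking Cholesky: overwrites the lower triangle of
   the n x n column-major matrix A with L such that A = L * L^T.
   Returns 0 on success, j+1 if the order-(j+1) leading minor is not
   positive definite. */
#define IDX(i, j, lda) ((i) + (j) * (size_t)(lda))

int chol_unb( int n, double *A, int lda )
{
    for ( int j = 0; j < n; j++ ) {
        double d = A[ IDX(j, j, lda) ];
        if ( d <= 0.0 ) return j + 1;          /* matrix is not SPD */
        d = sqrt( d );
        A[ IDX(j, j, lda) ] = d;
        for ( int i = j + 1; i < n; i++ )      /* scale column j */
            A[ IDX(i, j, lda) ] /= d;
        for ( int k = j + 1; k < n; k++ )      /* trailing update */
            for ( int i = k; i < n; i++ )
                A[ IDX(i, k, lda) ] -= A[ IDX(i, j, lda) ] * A[ IDX(k, j, lda) ];
    }
    return 0;
}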
FLAME
LAPACK dpotrf uses a different variant (right-looking):
      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
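For reference, the blocked routine can be driven from C through the conventional Fortran binding; a minimal sketch follows (the trailing-underscore name mangling is compiler-dependent, and this driver is not from the slides):

#include <stdio.h>

/* LAPACK's Fortran dpotrf; the trailing underscore is the usual, but
   compiler-dependent, name-mangling convention. */
extern void dpotrf_( const char *uplo, const int *n, double *A,
                     const int *lda, int *info );

int main( void )
{
    int n = 3, lda = 3, info;
    /* A 3 x 3 symmetric positive definite matrix, column-major. */
    double A[9] = { 4, 2, 2,
                    2, 5, 3,
                    2, 3, 6 };

    dpotrf_( "L", &n, A, &lda, &info );   /* lower Cholesky in place */
    printf( "info = %d, L(1,1) = %g\n", info, A[0] );
    return 0;
}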
FLAME
Partitioning Matrices
[Partitioning diagrams]
FLAME
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  b = min( FLA_Obj_length( ABR ), nb_alg );
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                      /* ************* */    /* ******************** */
                                             &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                         /* ************** */     /* ***************** */
                            &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                            FLA_TL );
}
Outline
Performance
FLAME
SuperMatrix
  Data-flow
  2D data affinity
  Contiguous storage
Conclusion
SuperMatrix
Cholesky Factorization: Iteration 1
[Diagram: Chol on the current diagonal block, Trsm on the blocks below it, and Syrk/Gemm updates to the trailing submatrix]
SuperMatrix
Cholesky Factorization: Iteration 2
[Diagram: the same pattern of Chol, Trsm, and Syrk tasks applied to the smaller trailing submatrix]
SuperMatrix
Cholesky Factorization: Iteration 3
[Diagram: the final Chol on the last diagonal block]
SuperMatrix
Analyzer:
Delay execution and place tasks on a queue
Tasks are function pointers annotated with input/output information
Compute dependence information (flow, anti, output) between all tasks
Create DAG of tasks
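As a rough illustration of the idea (not the actual SuperMatrix API; the struct layout and every name below are invented for exposition), each queued task records which blocks it reads and writes, and the analyzer counts a dependence whenever two tasks touch the same block and at least one writes it:

#include <stdbool.h>

#define MAX_OPERANDS 4
#define MAX_TASKS    1024

typedef struct task {
    void  (*func)( struct task * );   /* the operation to execute         */
    void   *in [MAX_OPERANDS];        /* blocks read (inputs)             */
    void   *out[MAX_OPERANDS];        /* blocks written (outputs)         */
    int     n_in, n_out;
    int     n_deps;                   /* count of unfinished predecessors */
} task_t;

/* A later task depends on an earlier one if it writes a block the
   earlier task touches (output/anti dependence) or reads a block the
   earlier task writes (flow dependence). */
static bool conflicts( const task_t *earlier, const task_t *later )
{
    for ( int i = 0; i < later->n_out; i++ ) {
        for ( int j = 0; j < earlier->n_out; j++ )          /* output */
            if ( later->out[i] == earlier->out[j] ) return true;
        for ( int j = 0; j < earlier->n_in; j++ )           /* anti   */
            if ( later->out[i] == earlier->in[j] ) return true;
    }
    for ( int i = 0; i < later->n_in; i++ )                 /* flow   */
        for ( int j = 0; j < earlier->n_out; j++ )
            if ( later->in[i] == earlier->out[j] ) return true;
    return false;
}

/* Analyzer: for each queued task, count the earlier tasks it waits on. */
void analyze( task_t *queue[], int n_tasks )
{
    for ( int t = 0; t < n_tasks; t++ ) {
        queue[t]->n_deps = 0;
        for ( int s = 0; s < t; s++ )
            if ( conflicts( queue[s], queue[t] ) )
                queue[t]->n_deps++;
    }
}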
SuperMatrix
Analyzer
[Diagram: the linear task queue (Chol, Trsm, Trsm, Syrk, Syrk, Chol, Gemm, ...) and the DAG of tasks constructed from it]
SuperMatrix
FLASH: a matrix of matrices
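A minimal sketch of the idea behind such hierarchical storage (the types and names here are illustrative assumptions, not the FLASH API): the top-level matrix is a small grid whose elements are descriptors of contiguously stored blocks:

#include <stdlib.h>

/* One storage block: b x b elements, stored contiguously, column-major. */
typedef struct {
    int     b;
    double *buf;
} block_t;

/* A matrix of matrices: an m x n grid of blocks. */
typedef struct {
    int      m, n;      /* grid dimensions, in blocks          */
    block_t *grid;      /* block (i, j) lives at grid[i + j*m] */
} hmatrix_t;

/* Allocate an (m*b) x (n*b) matrix stored by blocks, each contiguous. */
hmatrix_t *hmatrix_create( int m, int n, int b )
{
    hmatrix_t *H = malloc( sizeof *H );
    H->m = m;  H->n = n;
    H->grid = malloc( (size_t) m * n * sizeof *H->grid );
    for ( int k = 0; k < m * n; k++ ) {
        H->grid[k].b   = b;
        H->grid[k].buf = calloc( (size_t) b * b, sizeof( double ) );
    }
    return H;
}

/* Element (i, j) of the overall matrix: locate the block, then the
   entry within it.  Library users never index this way themselves. */
double *hmatrix_elem( hmatrix_t *H, int i, int j )
{
    int      b   = H->grid[0].b;
    block_t *blk = &H->grid[ (i / b) + (j / b) * H->m ];
    return &blk->buf[ (i % b) + (j % b) * b ];
}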
SuperMatrix
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                      /* ************* */    /* ******************** */
                                             &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                         /* ************** */     /* ***************** */
                            &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );
SuperMatrix
Dispatcher:
Use DAG to execute tasks out-of-order in parallel
Akin to Tomasulo's algorithm and instruction-level parallelism on blocks of computation
SuperScalar vs. SuperMatrix
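A greatly simplified sequential sketch of out-of-order dispatch, reusing the hypothetical task_t, MAX_TASKS, and conflicts() from the analyzer sketch above (the real dispatcher runs this loop concurrently in several threads):

/* Dispatcher: repeatedly pick any task whose dependences are satisfied
   and run it -- out-of-order with respect to the original queue. */
void dispatch( task_t *queue[], int n_tasks )
{
    bool finished[MAX_TASKS] = { false };
    int  done = 0;

    while ( done < n_tasks ) {
        for ( int t = 0; t < n_tasks; t++ ) {
            if ( finished[t] || queue[t]->n_deps > 0 )
                continue;                        /* not ready yet */
            queue[t]->func( queue[t] );          /* execute task  */
            finished[t] = true;
            done++;
            /* Release successors: each later conflicting task now has
               one fewer unfinished predecessor. */
            for ( int s = t + 1; s < n_tasks; s++ )
                if ( !finished[s] && conflicts( queue[t], queue[s] ) )
                    queue[s]->n_deps--;
        }
    }
}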
SuperMatrix
Dispatcher: 4 threads, 5 x 5 matrix of blocks, 35 tasks, 14 stages
[Diagram: the DAG of Chol, Trsm, Syrk, and Gemm tasks for the 5 x 5 blocked Cholesky factorization]
SuperMatrix
Dispatcher: tasks write to block [2,2]
No data affinity
[Diagram: the DAG with the tasks that overwrite block [2,2] highlighted]
SuperMatrix
[Diagram: mapping blocks of matrices → tasks → threads → processors]
Blocks of matrices → tasks: denote tasks by the blocks overwritten
Tasks → threads: data affinity, i.e., assigning tasks to threads (owner computes rule)
Threads → processors: CPU affinity, i.e., binding threads to processors
SuperMatrix
Data Affinity: 2D block cyclic decomposition (ScaLAPACK)
4 x 4 matrix of blocks assigned to a 2 x 2 mesh
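A sketch of the mapping under those assumptions (p = r * c threads arranged as an r x c mesh; the function name is illustrative):

/* Owner of block (i, j) under a 2D block-cyclic decomposition onto an
   r x c mesh of threads, numbered row-major across the mesh. */
int block_owner( int i, int j, int r, int c )
{
    return ( i % r ) * c + ( j % c );
}

/* For the slide's example (4 x 4 blocks on a 2 x 2 mesh): blocks (0,0)
   and (2,2) both map to thread 0, so under the owner-computes rule
   every task overwriting either block runs on thread 0. */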
SuperMatrix
Contiguous Storage: one level of blocking
The user does not need to know about the underlying storage of the data
SuperMatrix
GotoBLAS vs. MKL: all previous graphs link with GotoBLAS
MKL is better tuned for small matrices on Itanium2
SuperMatrix
Results:
LAPACK chose a bad variant
Data affinity and contiguous storage have a clear advantage
Multithreaded GotoBLAS is tuned for large matrices; MKL is better tuned for small matrices
Conclusion
Key Points:
View blocks of matrices as units of computation instead of scalars
Apply instruction-level parallelism to blocks
Abstractions away from low-level details of scheduling
Authors
Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn
Universidad Jaume I and The University of Texas at Austin
Acknowledgements
We thank the other members of the FLAME team for their support: Field Van Zee
Funding: NSF grant CCF-0540926
References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Computational Science – Para 2006, B. Kågström and E. Elmroth, eds., Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25-33, 1967.
Conclusion
More Information
http://www.cs.utexas.edu/users/flame
Questions?