SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures
Ernie Chan
The University of Texas at Austin
Motivation
Motivating Example: Cholesky Factorization A → L L^T
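As a concrete instance (a small worked example added here for illustration; A must be symmetric positive definite, and L is lower triangular):

A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}
  = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
    \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
  = L\,L^T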
Motivation
[Performance graph: peak performance 96 Gflops; higher is better]
Outline
Performance
FLAME
SuperMatrix
Conclusion
Performance
Target Architecture: 16-CPU Itanium2 NUMA system (8 dual-processor nodes)
OpenMP via Intel Compiler 9.0
BLAS: GotoBLAS 1.06 and Intel MKL 8.1
Performance
Implementations:
Multithreaded BLAS (sequential algorithm): LAPACK dpotrf, FLAME var3
Serial BLAS (parallel algorithm): data-flow (SuperMatrix)
Performance
Implementations use column-major order storage
Varying block sizes { 64, 96, 128, 160, 192, 224, 256 }; the best-performing block size is selected for each problem size
Performance
[Performance graph]
FLAME
Formal Linear Algebra Methods Environment:
High-level abstraction away from indices
"Views" into matrices
Seamless transition from algorithms to code
FLAME
Cholesky Factorization
for ( j = 0; j < n; j++ ) {
  A[j,j] = sqrt( A[j,j] );              /* factor the diagonal element */
  for ( i = j+1; i < n; i++ )
    A[i,j] = A[i,j] / A[j,j];           /* scale the column below it */
  for ( k = j+1; k < n; k++ )           /* symmetric rank-1 update of */
    for ( i = k; i < n; i++ )           /* the trailing submatrix */
      A[i,k] = A[i,k] - A[i,j] * A[k,j];
}
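For concreteness, here is a minimal runnable C rendering of the pseudocode above, assuming column-major storage with leading dimension lda (the IDX macro and the name chol_unb are illustrative assumptions, not part of FLAME):

#include <math.h>

/* Unblocked right-looking Cholesky: overwrites the lower triangle of
   the n x n column-major matrix A with L such that A = L * L^T.
   Returns 0 on success, j+1 if the order-(j+1) leading minor is not
   positive definite. */
#define IDX(i, j, lda) ((i) + (j) * (size_t)(lda))

int chol_unb( int n, double *A, int lda )
{
    for ( int j = 0; j < n; j++ ) {
        double d = A[ IDX(j, j, lda) ];
        if ( d <= 0.0 ) return j + 1;          /* matrix is not SPD */
        d = sqrt( d );
        A[ IDX(j, j, lda) ] = d;
        for ( int i = j + 1; i < n; i++ )      /* scale column j */
            A[ IDX(i, j, lda) ] /= d;
        for ( int k = j + 1; k < n; k++ )      /* trailing update */
            for ( int i = k; i < n; i++ )
                A[ IDX(i, k, lda) ] -= A[ IDX(i, j, lda) ] * A[ IDX(k, j, lda) ];
    }
    return 0;
}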
FLAME
LAPACK dpotrf uses a different variant (right-looking):
      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose',
     $               'Non-unit', N-J-JB+1, JB, ONE,
     $               A( J, J ), LDA, A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose',
     $               N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO
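For reference, the blocked routine can be driven from C through the conventional Fortran binding; a minimal sketch follows (the trailing-underscore name mangling is compiler-dependent, and this driver is not from the slides):

#include <stdio.h>

/* LAPACK's Fortran dpotrf; the trailing underscore is the usual, but
   compiler-dependent, name-mangling convention. */
extern void dpotrf_( const char *uplo, const int *n, double *A,
                     const int *lda, int *info );

int main( void )
{
    int n = 3, lda = 3, info;
    /* A 3 x 3 symmetric positive definite matrix, column-major. */
    double A[9] = { 4, 2, 2,
                    2, 5, 3,
                    2, 3, 6 };

    dpotrf_( "L", &n, A, &lda, &info );   /* lower Cholesky in place */
    printf( "info = %d, L(1,1) = %g\n", info, A[0] );
    return 0;
}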
FLAME
Partitioning Matrices
[Partitioning diagrams]
FLAME
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  b = min( FLA_Obj_length( ABR ), nb_alg );
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                      /* ************* */    /* ******************** */
                                             &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                         b, b, FLA_BR );
  /*------------------------------------------------------------------*/
  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                         /* ************** */     /* ***************** */
                            &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                            FLA_TL );
}
Outline
Performance
FLAME
SuperMatrix
  Data-flow
  2D data affinity
  Contiguous storage
Conclusion
SuperMatrix
Cholesky Factorization: Iteration 1
[Diagram: Chol on the current diagonal block, Trsm on the blocks below it, and Syrk/Gemm updates to the trailing submatrix]
SuperMatrix
Cholesky Factorization: Iteration 2
[Diagram: the same pattern of Chol, Trsm, and Syrk tasks applied to the smaller trailing submatrix]
SuperMatrix
Cholesky Factorization: Iteration 3
[Diagram: the final Chol on the last diagonal block]
SuperMatrix
Analyzer:
Delay execution and place tasks on a queue
Tasks are function pointers annotated with input/output information
Compute dependence information (flow, anti, output) between all tasks
Create DAG of tasks
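As a rough illustration of the idea (not the actual SuperMatrix API; the struct layout and every name below are invented for exposition), each queued task records which blocks it reads and writes, and the analyzer counts a dependence whenever two tasks touch the same block and at least one writes it:

#include <stdbool.h>

#define MAX_OPERANDS 4
#define MAX_TASKS    1024

typedef struct task {
    void  (*func)( struct task * );   /* the operation to execute         */
    void   *in [MAX_OPERANDS];        /* blocks read (inputs)             */
    void   *out[MAX_OPERANDS];        /* blocks written (outputs)         */
    int     n_in, n_out;
    int     n_deps;                   /* count of unfinished predecessors */
} task_t;

/* A later task depends on an earlier one if it writes a block the
   earlier task touches (output/anti dependence) or reads a block the
   earlier task writes (flow dependence). */
static bool conflicts( const task_t *earlier, const task_t *later )
{
    for ( int i = 0; i < later->n_out; i++ ) {
        for ( int j = 0; j < earlier->n_out; j++ )          /* output */
            if ( later->out[i] == earlier->out[j] ) return true;
        for ( int j = 0; j < earlier->n_in; j++ )           /* anti   */
            if ( later->out[i] == earlier->in[j] ) return true;
    }
    for ( int i = 0; i < later->n_in; i++ )                 /* flow   */
        for ( int j = 0; j < earlier->n_out; j++ )
            if ( later->in[i] == earlier->out[j] ) return true;
    return false;
}

/* Analyzer: for each queued task, count the earlier tasks it waits on. */
void analyze( task_t *queue[], int n_tasks )
{
    for ( int t = 0; t < n_tasks; t++ ) {
        queue[t]->n_deps = 0;
        for ( int s = 0; s < t; s++ )
            if ( conflicts( queue[s], queue[t] ) )
                queue[t]->n_deps++;
    }
}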
SuperMatrix
Analyzer
[Diagram: the linear task queue (Chol, Trsm, Trsm, Syrk, Syrk, Chol, Gemm, ...) and the DAG of tasks constructed from it]
SuperMatrix
FLASH: a matrix of matrices
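A minimal sketch of the idea behind such hierarchical storage (the types and names here are illustrative assumptions, not the FLASH API): the top-level matrix is a small grid whose elements are descriptors of contiguously stored blocks:

#include <stdlib.h>

/* One storage block: b x b elements, stored contiguously, column-major. */
typedef struct {
    int     b;
    double *buf;
} block_t;

/* A matrix of matrices: an m x n grid of blocks. */
typedef struct {
    int      m, n;      /* grid dimensions, in blocks          */
    block_t *grid;      /* block (i, j) lives at grid[i + j*m] */
} hmatrix_t;

/* Allocate an (m*b) x (n*b) matrix stored by blocks, each contiguous. */
hmatrix_t *hmatrix_create( int m, int n, int b )
{
    hmatrix_t *H = malloc( sizeof *H );
    H->m = m;  H->n = n;
    H->grid = malloc( (size_t) m * n * sizeof *H->grid );
    for ( int k = 0; k < m * n; k++ ) {
        H->grid[k].b   = b;
        H->grid[k].buf = calloc( (size_t) b * b, sizeof( double ) );
    }
    return H;
}

/* Element (i, j) of the overall matrix: locate the block, then the
   entry within it.  Library users never index this way themselves. */
double *hmatrix_elem( hmatrix_t *H, int i, int j )
{
    int      b   = H->grid[0].b;
    block_t *blk = &H->grid[ (i / b) + (j / b) * H->m ];
    return &blk->buf[ (i % b) + (j % b) * b ];
}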
SuperMatrix
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                      /* ************* */    /* ******************** */
                                             &A10, /**/ &A11, &A12,
                         ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                         1, 1, FLA_BR );
  /*------------------------------------------------------------------*/
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
  /*------------------------------------------------------------------*/
  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                         /* ************** */     /* ***************** */
                            &ABL, /**/ &ABR,      A20, A21, /**/ A22,
                            FLA_TL );
}
FLASH_Queue_exec( );
SuperMatrix
Dispatcher:
Use DAG to execute tasks out-of-order in parallel
Akin to Tomasulo's algorithm and instruction-level parallelism on blocks of computation
SuperScalar vs. SuperMatrix
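A greatly simplified sequential sketch of out-of-order dispatch, reusing the hypothetical task_t, MAX_TASKS, and conflicts() from the analyzer sketch above (the real dispatcher runs this loop concurrently in several threads):

/* Dispatcher: repeatedly pick any task whose dependences are satisfied
   and run it -- out-of-order with respect to the original queue. */
void dispatch( task_t *queue[], int n_tasks )
{
    bool finished[MAX_TASKS] = { false };
    int  done = 0;

    while ( done < n_tasks ) {
        for ( int t = 0; t < n_tasks; t++ ) {
            if ( finished[t] || queue[t]->n_deps > 0 )
                continue;                        /* not ready yet */
            queue[t]->func( queue[t] );          /* execute task  */
            finished[t] = true;
            done++;
            /* Release successors: each later conflicting task now has
               one fewer unfinished predecessor. */
            for ( int s = t + 1; s < n_tasks; s++ )
                if ( !finished[s] && conflicts( queue[t], queue[s] ) )
                    queue[s]->n_deps--;
        }
    }
}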
SuperMatrix
Dispatcher: 4 threads, 5 x 5 matrix of blocks, 35 tasks, 14 stages
[Diagram: the DAG of Chol, Trsm, Syrk, and Gemm tasks for the 5 x 5 blocked Cholesky factorization]
SuperMatrix
Dispatcher: tasks write to block [2,2]
No data affinity
[Diagram: the DAG with the tasks that overwrite block [2,2] highlighted]
SuperMatrix
[Diagram: mapping blocks of matrices → tasks → threads → processors]
Blocks of matrices → tasks: denote tasks by the blocks overwritten
Tasks → threads: data affinity, i.e., assigning tasks to threads (owner computes rule)
Threads → processors: CPU affinity, i.e., binding threads to processors
SuperMatrix
Data Affinity: 2D block cyclic decomposition (ScaLAPACK)
4 x 4 matrix of blocks assigned to a 2 x 2 mesh
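A sketch of the mapping under those assumptions (p = r * c threads arranged as an r x c mesh; the function name is illustrative):

/* Owner of block (i, j) under a 2D block-cyclic decomposition onto an
   r x c mesh of threads, numbered row-major across the mesh. */
int block_owner( int i, int j, int r, int c )
{
    return ( i % r ) * c + ( j % c );
}

/* For the slide's example (4 x 4 blocks on a 2 x 2 mesh): blocks (0,0)
   and (2,2) both map to thread 0, so under the owner-computes rule
   every task overwriting either block runs on thread 0. */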
SuperMatrix
Contiguous Storage: one level of blocking
The user does not need to know about the underlying storage of the data
SuperMatrix
GotoBLAS vs. MKL: all previous graphs link with GotoBLAS
MKL is better tuned for small matrices on Itanium2
SuperMatrix
Results:
LAPACK chose a bad variant
Data affinity and contiguous storage have a clear advantage
Multithreaded GotoBLAS is tuned for large matrices; MKL is better tuned for small matrices
Conclusion
Key Points:
View blocks of matrices as units of computation instead of scalars
Apply instruction-level parallelism to blocks
Abstractions away from low-level details of scheduling
Authors
Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Robert van de Geijn
Universidad Jaume I and The University of Texas at Austin
Acknowledgements
We thank the other members of the FLAME team for their support: Field Van Zee
Funding: NSF grant CCF-0540926
References
[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.
[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.
[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.
[4] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.
[5] F. G. Gustavson, L. Karlsson, and B. Kågström. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Computational Science – Para 2006, B. Kågström and E. Elmroth, eds., Lecture Notes in Computer Science. Springer-Verlag, 2007.
[6] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25-33, 1967.
Conclusion
More Information
http://www.cs.utexas.edu/users/flame
Questions?