SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures

June 9-11, 2007, SPAA 2007


Transcript of SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures

Page 1

SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures

Ernie Chan

The University of Texas at Austin

Page 2

Motivation

Motivating Example: Cholesky Factorization, A → L Lᵀ

Page 3

Motivation

[Performance graph: peak performance 96 GFLOPS; higher is better]

Page 4

Outline

Performance
FLAME
SuperMatrix
Conclusion

Page 5

Performance

Target Architecture: 16-CPU Itanium2
NUMA: 8 dual-processor nodes
OpenMP: Intel Compiler 9.0
BLAS: GotoBLAS 1.06, Intel MKL 8.1

Page 6

Performance

Implementations
Multithreaded BLAS (sequential algorithm): LAPACK dpotrf, FLAME var3
Serial BLAS (parallel algorithm): data-flow

Page 7

Performance

Implementations
Column-major order storage
Varying block sizes { 64, 96, 128, 160, 192, 224, 256 }; best performance selected for each problem size

Page 8

Performance

Page 9

Outline

Performance
FLAME
SuperMatrix
Conclusion

Page 10

FLAME

Formal Linear Algebra Methods Environment
High-level abstraction away from indices: "views" into matrices
Seamless transition from algorithms to code

Page 11

FLAME

Cholesky Factorization

for ( j = 0; j < n; j++ ) {
    A[j,j] = sqrt( A[j,j] );
    for ( i = j+1; i < n; i++ )
        A[i,j] = A[i,j] / A[j,j];
    for ( k = j+1; k < n; k++ )
        for ( i = k; i < n; i++ )
            A[i,k] = A[i,k] - A[i,j] * A[k,j];
}

Page 12

FLAME

LAPACK dpotrf implements a different variant (right-looking):

      DO J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $               N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $               A( J+JB, J ), LDA )
         CALL DSYRK( 'Lower', 'No transpose', N-J-JB+1, JB,
     $               -ONE, A( J+JB, J ), LDA,
     $               ONE, A( J+JB, J+JB ), LDA )
      ENDDO

Page 13

FLAME

Partitioning Matrices

Page 14

FLAME

Page 15

FLAME

Page 16

FLAME

FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR, 0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*---------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                           /* ********** */       /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
}

Page 17

FLAME

Page 18

Outline

Performance
FLAME
SuperMatrix
    Data-flow
    2D data affinity
    Contiguous storage
Conclusion

Page 19

SuperMatrix

Cholesky Factorization, Iteration 1
[Figure: Chol on the diagonal block, Trsm tasks below it, and Syrk and Gemm updates to the trailing matrix]

Page 20

SuperMatrix

Cholesky Factorization, Iteration 2
[Figure: Chol, Trsm, and Syrk tasks on the remaining submatrix]

Page 21

SuperMatrix

Cholesky Factorization, Iteration 3
[Figure: final Chol task on the last diagonal block]

Page 22

SuperMatrix

Analyzer
Delay execution and place tasks on a queue
Tasks are function pointers annotated with input/output information
Compute dependence information (flow, anti, output) between all tasks
Create DAG of tasks

Page 23

SuperMatrix

Analyzer
[Figure: task queue (Chol, Trsm, Trsm, Syrk, Syrk, Gemm, Chol, ...) mapped to the corresponding DAG of tasks]

Page 24

SuperMatrix

FLASH: matrix of matrices

Page 25

SuperMatrix

FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR, 0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
{
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00, /**/ &A01, &A02,
                        /* ******** */       /* **************** */
                                             &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,    &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*---------------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*---------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00, A01, /**/ A02,
                                                  A10, A11, /**/ A12,
                           /* ********** */       /* ************* */
                              &ABL, /**/ &ABR,    A20, A21, /**/ A22,
                              FLA_TL );
}
FLASH_Queue_exec( );

Page 26

SuperMatrix

Dispatcher
Use DAG to execute tasks out-of-order in parallel
Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
SuperScalar vs. SuperMatrix

Page 27

SuperMatrix

Dispatcher: 4 threads, 5 x 5 matrix of blocks, 35 tasks, 14 stages
[Figure: DAG of Chol, Trsm, Syrk, and Gemm tasks for the 5 x 5 blocked Cholesky factorization]

Page 28

SuperMatrix

Page 29

SuperMatrix

Dispatcher: tasks write to block [2,2]; no data affinity
[Figure: the same DAG with the tasks that overwrite block [2,2] highlighted]

Page 30

SuperMatrix

Blocks of matrices map to tasks, tasks to threads, and threads to processors:
Owner-computes rule: denote tasks by the blocks they overwrite
Data affinity: assigning tasks to threads
CPU affinity: binding threads to processors

Page 31

SuperMatrix

Data Affinity
2D block-cyclic decomposition (as in ScaLAPACK)
4 x 4 matrix of blocks assigned to a 2 x 2 mesh of threads

Page 32

SuperMatrix

Page 33

SuperMatrix

Contiguous Storage
One level of blocking
The user inherently does not need to know about the underlying storage of data

Page 34

SuperMatrix

Page 35

SuperMatrix

GotoBLAS vs. MKL
All previous graphs link with GotoBLAS
MKL is better tuned for small matrices on Itanium2

Page 36

SuperMatrix

Page 37

SuperMatrix

Page 38

SuperMatrix

Page 39

SuperMatrix

Page 40

SuperMatrix

Results
LAPACK chose a bad variant
Data affinity and contiguous storage have a clear advantage
Multithreaded GotoBLAS is tuned for large matrices
MKL is better tuned for small matrices

Page 41

SuperMatrix

Page 42

Outline

Performance
FLAME
SuperMatrix
Conclusion

Page 43

Conclusion

Key Points
View blocks of matrices as units of computation instead of scalars
Apply instruction-level parallelism to blocks
Abstract away from the low-level details of scheduling

Page 44

Authors

Ernie Chan and Robert van de Geijn, The University of Texas at Austin
Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí, Universidad Jaume I

Page 45

Acknowledgements

We thank the other members of the FLAME team for their support, in particular Field Van Zee.

Funding: NSF grant CCF-0540926

Page 46

References

[1] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In Supercomputing ’89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225-233, New York, NY, USA, 1989.

[2] B. S. Andersen, J. Waśniewski, and F. G. Gustavson. A recursive formulation for Cholesky factorization of a matrix in packed storage. ACM Trans. Math. Soft., 27(2):214-244, 2001.

[3] E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3-45, 2004.

[4] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Soft., 27(4):422-455, 2001.

[5] F. G. Gustavson, L. Karlsson, and B. Kagstrom. Three algorithms on distributed memory using packed storage. Computational Science – Para 2006. B. Kagstrom, E. Elmroth, eds., accepted for Lecture Notes in Computer Science. Springer-Verlag, 2007.

[6] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM J. of Research and Development, 11(1), 1967.

Page 47

Conclusion

More Information

http://www.cs.utexas.edu/users/flame

Questions?

[email protected]