The DOE ACTS Collection
SIAM Parallel Processing for Scientific Computing, Savannah, Georgia, Feb 15, 2012
Tony Drummond
Computational Research Division
Lawrence Berkeley National Laboratory
Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection
Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection SIAM PP12, Savannah, GA - February 15, 2012
The DOE ACTS Collection Project
Goal: The Advanced CompuTational Software Collection (ACTS) makes reliable and efficient software tools more widely used, and more effective in solving the nation’s engineering and scientific problems.
Tony Drummond and Osni Marques
Computational Research Division
Lawrence Berkeley National Laboratory
What is the Role of ACTS in the HPC Software Stack?
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
ACTS Plays a Critical Role in the HPC Software Stack
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
Accelerate application code development by maintaining a solid collection with some of the best numerical kernels and support tools for code development, run-time, and library optimization.
Tools: PETSc, SLEPc, SuperLU, AztecOO, ScaLAPACK, TAO, Hypre, ATLAS, PyACTS, Global Arrays, SUNDIALS, Overture, TAU
The DOE ACTS Collection
Category | Tool | Functionalities
Numerical | AztecOO | Scalable linear and non-linear solvers using iterative schemes.
Numerical | Hypre | A family of scalable preconditioners.
Numerical | PETSc | Scalable linear and non-linear solvers and additional support for PDE related work.
Numerical | OPT++ | Object-oriented nonlinear optimization solvers.
Numerical | SUNDIALS | Solvers for systems of ordinary differential equations, nonlinear algebraic equations, and differential-algebraic equations.
Numerical | ScaLAPACK | High performance parallel dense linear algebra.
Numerical | SLEPc | Scalable algorithms for the solution of large sparse eigenvalue problems.
Numerical | SuperLU | Scalable direct solution of large, sparse, nonsymmetric linear systems of equations.
Numerical | TAO | Large-scale optimization software.
Code Development | Global Arrays | Supports the development of parallel programs.
Code Development | Overture | Supports the development of computational fluid dynamics codes in complex geometries.
Run Time Support | TAU | Portable and scalable performance analysis and tracing tools for C, C++, Fortran and Java programs.
Library Development | ATLAS | Automatic generation of optimized dense numerical linear algebra for scalar processors.
Numerical Functionality in the ACTS Collection
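• Linear least squares: $\min_x \| b - Ax \|_2$
• Minimum-norm solutions: $\min_x \| x \|_2$
• Standard eigenvalue problems: $Az = \lambda z$
• Singular value decomposition: $A = U \Sigma V^T$, $A = U \Sigma V^H$
• Generalized eigenvalue problems: $Az = \lambda Bz$, $ABz = \lambda z$, $BAz = \lambda z$
• Linear systems: $Ax = b$ or $AX = B$
• $Hx = b'$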
Commonalities among ACTS Tools:
• General purpose user interfaces
• Parallel and scalable implementations of numerical algorithms
• Modular design (kernel reusability)
• Parallelism exploited at the MPI_TASK level (newer versions under development to support other levels of concurrency)
Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
TOOL DEVELOPERS: Hand-Tuned Codes, Compiler Optimization, Auto-tuning, New Programming Environments, CS Community Efforts
APPLICATION DEVELOPERS: APPLICATIONS
The challenge is to avoid rewriting code for performance.
Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
TOOL DEVELOPERS:
• Development of new numerical functionalities and implementation
• Integration of auto-tuned kernels (BLAS, LAPACK, TORCH, etc.)
• Adoption of new programming models and paradigms
APPLICATION DEVELOPERS:
• Tool and Functionality Selection
• Choosing functional parameters
• Compile/Link: Integrating Optimized Kernels
• Runtime: Dynamic Kernel Selection
• Verification: Robustness and Scalability
Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
Q. Can Performance Scalability be passed from libraries/tools to applications and across platforms and configurations?
LU Solve (ScaLAPACK)
CRAY - XTE6
Execution time improvement vs. performance scalability
Compiler Optimized vs Highly Tuned
[Figure: % improvement vs. Number of MPI_TASKS (np), for np = 4 (2x2), np = 8 (2x4), np = 16 (4x4), np = 24 (4x6); legend: 11000, 13000, 15000]
[Figure: second chart over 4, 8, 16, 24 MPI tasks]
Portable Performance is no longer straightforward
Doubling Problem Size and Adding 1 node
1 node (24 cores):
NP | Threads/MPI_TASK
4 | 6
8 | 3
16 | 1
24 | 1
2 nodes (48 cores):
NP | Threads/MPI_TASK
4 | 12
8 | 6
16 | 2
24 | 2
[Figure: chart by process grid, np = 4 (2x2), np = 8 (2x4), np = 16 (4x4), np = 24 (4x6); legend: 22000, 26000, 30000]
[Figure: chart over Total MPI Tasks (4, 8, 16, 24)]
Portable Performance is no longer straightforward
Tripling the Problem Size on 3 nodes
1 node (24 cores):
NP | Threads/MPI_TASK
4 | 6
8 | 3
16 | 1
24 | 1
3 nodes (72 cores):
NP | Threads/MPI_TASK
4 | 18
8 | 9
16 | 4
24 | 3
[Figure: chart by process grid, np = 4 (2x2), np = 8 (2x4), np = 16 (4x4), np = 24 (4x6); legend: 33000, 39000, 45000]
[Figure: chart over Total MPI Tasks (4, 8, 16, 24)]
Providing Sustainable and Scalable Performance for ACTS Tools in Multicore Systems
Q. Can Performance Scalability be passed from libraries/tools to applications and across platforms and configurations?
A. Yes, maybe, with a lot of automatic work
• Fully operational through various parameters and levels of automation
• Application developers are very reluctant to change their code, so preserve the current structure of the APIs as much as possible
ACTS Parametric Research and Collaborations
Use of ACTS parameters to ensure application scalability (pACTS)
Pre-Installation
Library Installation
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
Run Time
Compile + link
Application
job submit options
APPLICATIONS
GENERAL PURPOSE TOOLS
ACTS Parametric Research and Integration
Without pACTS:
• Auto-tuning produces a single tuned library (n = max cores/node)
• Hand-tuning algorithmic parameters can be cumbersome
• Some applications won’t scale
With pACTS:
• Auto-tuning produces multiple tuned libraries using steering parameters (#cores/node)
• Run-time selection of tuned executables (#cores/node)
• Auto-tune algorithmic parameters (smart-tuning)
• Sustainable and scalable performance for all applications
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
Multi-Level Tuning to Attain Scalable Performance
• Optimized dense BLAS kernels
• Algorithmic optimization
• Minimize computational costs (storage + ops)
• Sustain numerical stability and reliability
• Specialized problem solving techniques
• Software Implementations:
  • Specialized Data Structures
  • Maximize Load balancing
  • Minimize Latencies, Idle time, etc.
Auto-tuning
Auto-tuning
GENERAL PURPOSE TOOLS
APPLICATIONS
Smart-tuning
Tuning at Library Installation Level
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
Software Resources:
• Compiler level optimizations
• Specialized communication libraries and other custom-made support libraries
• Auto-tuners
ACTS PARAMETERS:
Performance Tuning Parameters (PT-pACTS) and Software Dependencies (SD-pACTS):
• arithmetic and arithmetic precision
• automatic threading
• compiler
• communication libraries and paradigms
• software requirements
Parametrize Optimized Installation of Libraries+Apps
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
Software Resources:
• Auto-tuners
• Performance Monitors
• Functional Performance Parameter Derivation (FP-pACTS)
ACTS PARAMETERS:
PT-pACTS, SD-pACTS
• NUMA Aware, Thread, cache, TLB and local store blocking, padding, register and format selection
FP-pACTS
• Output of performance monitoring
• Optimized library and kernels labeling
Runtime Dynamic Selection of Kernels/Libraries
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
• Smart-tuning tools
• Choice of ACTS tool(s) and functionality
• Choice of calling parameters (RT-pACTS)
Software Resources:
• Runtime scripts (e.g., job submission scripts)
• Runtime Parameters impacting application performance (RT-pACTS)
• Functional Performance Parameter Derivation (FP-pACTS)
ACTS PARAMETERS:
PT-pACTS, SD-pACTS, FP-pACTS and RT-pACTS
• algorithmic (functional calls)
• application numerical requirements
• problem size
• resource utilization
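The four parameter families named on these slides (PT-, SD-, FP- and RT-pACTS) can be pictured as one record per run. The sketch below is only an illustration under stated assumptions: the class names, field names and example values are hypothetical and are not taken from any ACTS tool.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PTParams:
    """Performance tuning parameters, fixed when the library is installed."""
    precision: str
    compiler: str
    cores_per_node: int


@dataclass(frozen=True)
class SDParams:
    """Software dependencies the tuned library was built against."""
    blas: str
    communication: str


@dataclass(frozen=True)
class FPParams:
    """Functional performance parameters derived from performance monitoring."""
    blocking_factor: int
    process_grid: tuple


@dataclass(frozen=True)
class RTParams:
    """Run-time parameters, set per job (for example in a submission script)."""
    problem_size: int
    nodes: int
    mpi_tasks_per_node: int


@dataclass(frozen=True)
class PactsConfig:
    """Hypothetical bundle of all four parameter families for one run."""
    pt: PTParams
    sd: SDParams
    fp: FPParams
    rt: RTParams

    def total_tasks(self) -> int:
        # Total MPI tasks requested across all nodes.
        return self.rt.nodes * self.rt.mpi_tasks_per_node


# Example values are made up for illustration.
config = PactsConfig(
    pt=PTParams(precision="double", compiler="gcc", cores_per_node=24),
    sd=SDParams(blas="ACML", communication="MPI"),
    fp=FPParams(blocking_factor=8, process_grid=(4, 4)),
    rt=RTParams(problem_size=45000, nodes=1, mpi_tasks_per_node=16),
)
print(asdict(config))
```

Keeping the records immutable makes a configuration usable as a lookup key when tuned variants are cataloged.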
Simple Example of ScaLAPACK LU
Library Installation Time
SD-pACTS:
• PDGETRF and PDGETRS implementations
• BLAS DGEMM, DTRSM implementations
PT-pACTS:
• Number of cores/node
ScaLAPACK software hierarchy:
Global: ScaLAPACK, PBLAS
Local: LAPACK, BLACS
Platform specific: BLAS, MPI/PVM/...
OUTPUT: Tuned kernels
PBLAS_DEFAULT, PBLAS_A01V1, PBLAS_A01V2, PBLAS_A01V3, ..., PBLAS_A01Vn
BLACS_DEFAULT, BLACS_A01V1, BLACS_A01V2, BLACS_A01V3, ..., BLACS_A01Vn
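PDGETRF computes an LU factorization with partial pivoting, and PDGETRS solves a linear system from those factors. The sketch below shows the same factor/solve split in serial, unblocked plain Python, purely for illustration. The names `lu_factor` and `lu_solve` are hypothetical, and nothing here reflects ScaLAPACK's distributed, blocked implementation.

```python
def lu_factor(a):
    """Factor a square matrix (list of rows) as P*A = L*U with partial pivoting.

    Returns (lu, piv): lu stores the unit-lower L below the diagonal and U on
    and above it; piv[i] is the original row that ended up in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy
    piv = list(range(n))
    for k in range(n):
        # Choose the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        # Eliminate below the pivot, storing the multipliers in place.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A*x = b using the factors returned by lu_factor."""
    n = len(lu)
    x = [b[piv[i]] for i in range(n)]   # apply the row permutation to b
    for i in range(n):                  # forward substitution with unit-lower L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):        # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
b = [5.0, -2.0, 9.0]
lu, piv = lu_factor(A)
x = lu_solve(lu, piv, b)
print(x)
```

Factoring once and reusing the factors for several right-hand sides is the reason the two steps are separate routines.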
Simple Example of ScaLAPACK LU
Application Code Link Time
SD-pACTS:
• PDGETRF and PDGETRS implementations
• BLAS DGEMM, DTRSM implementations
FP-pACTS:
• MPI_TASKS/node
• Blocking factor
APPLICATIONS
GENERAL PURPOSE TOOLS
PLATFORM SUPPORT TOOLS AND UTILITIES
HARDWARE
OUTPUT: Tuned kernels
PBLAS_DEFAULT, PBLAS_A01V1, PBLAS_A01V2, PBLAS_A01V3, ..., PBLAS_A01Vn
BLACS_DEFAULT, BLACS_A01V1, BLACS_A01V2, BLACS_A01V3, ..., BLACS_A01Vn
kernelSelector
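The slide does not show how kernelSelector works, so the following is only a guess at its shape: a lookup from functional parameters to one of the labeled tuned builds, falling back to the default build when no tuned variant matches. The catalog contents, the key layout and the function name `kernel_selector` are assumptions made for illustration; only the label style (PBLAS_A01V1, BLACS_DEFAULT, ...) comes from the slide.

```python
# Hypothetical catalog written at library installation time:
# (MPI tasks per node, blocking factor) -> (PBLAS label, BLACS label)
TUNED_KERNELS = {
    (4, 8): ("PBLAS_A01V1", "BLACS_A01V1"),
    (16, 8): ("PBLAS_A01V2", "BLACS_A01V2"),
    (24, 16): ("PBLAS_A01V3", "BLACS_A01V3"),
}

DEFAULT_KERNELS = ("PBLAS_DEFAULT", "BLACS_DEFAULT")


def kernel_selector(tasks_per_node, blocking_factor):
    """Return the (PBLAS, BLACS) labels to link against for this configuration.

    Falls back to the default builds when no tuned variant was produced for
    the requested parameters.
    """
    return TUNED_KERNELS.get((tasks_per_node, blocking_factor), DEFAULT_KERNELS)


print(kernel_selector(16, 8))   # a tuned pair from the catalog
print(kernel_selector(2, 64))   # no tuned variant, so the defaults
```

A table lookup keeps the application's calls to the library unchanged; only the link line or the loaded library differs between configurations.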
Example of ScaLAPACK LU
RT-pACTS:
• Number of cores and Number of nodes
• Problem size
• Matrix Blocking
• Process Grid
FP-pACTS:
• Matrix 2D Blocking
• Process Grid
SD-pACTS:
• PDGETRF and PDGETRS implementations
• BLAS Implementation ACML
16x16
[Figure: Best in Node Performance, % of Peak vs. Total MPI Tasks (1, 2, 4, 6, 16, 24); series: Using RT-pACTS, Without RT-pACTS]
[Figure: chart over Total MPI Tasks (4, 8, 16, 24)]
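The process grid is listed as both a run-time and a functional parameter. One common heuristic, shown here as an assumption rather than as the rule used in these experiments, is to pick the most nearly square P x Q factorization of the number of MPI tasks. The function name `process_grid` is hypothetical.

```python
import math


def process_grid(num_tasks):
    """Return (P, Q) with P * Q == num_tasks, P <= Q, and P as large as possible."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be positive")
    p = math.isqrt(num_tasks)
    # Walk down from floor(sqrt(n)) to the nearest divisor.
    while num_tasks % p != 0:
        p -= 1
    return p, num_tasks // p


for n in (4, 8, 16, 24):
    print(n, process_grid(n))
```

For a prime task count the heuristic degenerates to a 1 x n grid, which is one reason the grid is worth treating as a tunable parameter instead of a fixed rule.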
Example of ScaLAPACK LU
Optimized kernels: 2-cores
4 MPI_TASKS/node
45000
NB=8
RT-pACTS:
• Number of cores and Number of nodes
• Problem size
• Matrix 2D Blocking
• Process Grid
FP-pACTS:
• Matrix 2D Blocking
• Process Grid
SD-pACTS:
• PDGETRF and PDGETRS implementations
• BLAS Implementation
[Figure: Best in Node Performance, % of Peak vs. Total MPI Tasks (1, 2, 4, 6, 16, 24); series: Using RT-pACTS, Without RT-pACTS]
[Figure: chart over Total MPI Tasks (4, 8, 16, 24)]
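Matrix 2D blocking with a blocking factor NB refers to ScaLAPACK's two-dimensional block-cyclic distribution. The sketch below shows the index arithmetic under simplifying assumptions: 0-based indices, square NB x NB blocks, and the first block placed on process (0, 0). The function names are hypothetical helpers, not ScaLAPACK routines.

```python
def owner(i, j, nb, p, q):
    """Grid coordinates of the process that owns global entry (i, j)."""
    return (i // nb) % p, (j // nb) % q


def local_index(g, nb, nprocs):
    """Local index of global index g along one dimension of the grid."""
    block = g // nb                      # global block number
    return (block // nprocs) * nb + g % nb


# NB = 8 on a 2 x 2 process grid.
print(owner(0, 0, 8, 2, 2))    # first block lives on process (0, 0)
print(owner(8, 0, 8, 2, 2))    # next block row moves to process row 1
print(owner(16, 9, 8, 2, 2))   # wraps back to process row 0, column 1
print(local_index(16, 8, 2))   # third global block is the second local block
```

A small NB improves load balance in the later stages of the factorization, while a large NB gives the local BLAS bigger blocks to work on, which is why the blocking factor is a tuning parameter.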
Example of ScaLAPACK LU
Optimized kernels: 16-cores
16 MPI_TASKS/node
45000
NB=8
RT-pACTS:
• Number of cores and Number of nodes
• Problem size
• Matrix 2D Blocking
• Process Grid
FP-pACTS:
• Matrix 2D Blocking
• Process Grid
SD-pACTS:
• PDGETRF and PDGETRS implementations
• BLAS Implementation
[Figure: Best in Node Performance, % of Peak vs. Total MPI Tasks (1, 2, 4, 6, 16, 24); series: Using RT-pACTS, Without RT-pACTS]
[Figure: chart over Total MPI Tasks (4, 8, 16, 24)]
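The charts report LU performance as a percentage of peak. The slides do not say how that figure was computed, so the sketch below uses the usual leading-order operation count for LU, about 2n^3/3 floating-point operations, divided by run time and by the aggregate peak rate. The function names, the per-core peak and the timing are all hypothetical inputs chosen for illustration.

```python
def lu_flops(n):
    """Leading-order floating-point operation count of an n x n LU factorization."""
    return 2.0 * n ** 3 / 3.0


def percent_of_peak(n, seconds, cores, peak_gflops_per_core):
    """Achieved rate as a percentage of the aggregate theoretical peak."""
    achieved_gflops = lu_flops(n) / seconds / 1e9
    return 100.0 * achieved_gflops / (cores * peak_gflops_per_core)


# Made-up numbers: a 3000 x 3000 factorization in 1 s on 4 cores of 9 GFLOP/s each.
print(percent_of_peak(3000, 1.0, 4, 9.0))
```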
Concluding Remarks
• HPC centers vs. installation on your laptop
• Ongoing work in parametric research
• TORCH Kernels
• Parameter derivation and selection
S. Petiton and C. Calvin
• Current tests used an older version of OSKI, hand-tuned kernels, and ACML (BLAS)
• Enlarge the set of auto-tuners
• Incorporate new ACTS tool developments