Thinking Outside of the Tera-Scale Box
description
Transcript of Thinking Outside of the Tera-Scale Box
![Page 1: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/1.jpg)
Thinking Outside of the Tera-Scale Box
Piotr Luszczek
![Page 2: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/2.jpg)
Brief History of Tera-flop: 1997
1997
ASCI Red
![Page 3: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/3.jpg)
Brief History of Tera-flop: 2007
1997
2007
ASCI Red
Intel Polaris
![Page 4: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/4.jpg)
Brief History of Tera-flop: GPGPU
2008
1997
2007
ASCI Red
Intel Polaris
GPGPU
![Page 5: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/5.jpg)
Tera-flop: a Dictionary Perspective• Belongs to
supercomputer expert• Consumes 500 kW• Costs over $1M• Requires distributed
memory programming
• Memory: 1 GiB or more
• Belongs togaming geek
• Consumes under 500 W• Costs under $200• Requires only a CUDA
kernel
• Memory: 1 GiB or more
![Page 6: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/6.jpg)
Double-Dip Recession?
20072005
Time
Productivity
Hybrid MulticoreMulticore
Cilk, Intel TBB, Intel Ct, Apple
GCD, MS Concrt, …
HMPP, PGI Accelerator, PathScale?
![Page 7: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/7.jpg)
Locality Space in HPC ChallengeHPLMat-Mat-Mul
Parallel-TransposeSTREAM
FFTRandomAccess
![Page 8: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/8.jpg)
Locality Space in HPC Challenge: BoundsHPLMat-Mat-Mul
Parallel-TransposeSTREAM
FFTRandomAccessLatency-bound
Bandwidth-bound
Compute-bound
AORSA2D
PCIe
![Page 9: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/9.jpg)
CUDA vs. OpenCL: Similar Software Stacks
![Page 10: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/10.jpg)
Matrix-Matrix Multiply: the Simple Formfor i = 1:Mfor j = 1:N for k = 1:T C(i, j) += A(i, k) * B(k, j)
CUBLAS Volkov et al. MAGMA Nakasato
![Page 11: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/11.jpg)
Fermi C2050: from CUDA to OpenCL
![Page 12: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/12.jpg)
Fermi C2050: Simple OpenCL Porting
![Page 13: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/13.jpg)
ATI 5870: Simple OpenCL Porting
![Page 14: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/14.jpg)
Making Case for Autotuning: Use of Texture
![Page 15: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/15.jpg)
Making Case for Autotuning: Blocking Factors
Provided by NVIDIA
![Page 16: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/16.jpg)
Mat-Mul Portability Summary
Achieved PerformanceSingle precision
Achieved PerformanceDouble precision
ATI OpenCL 52% 57%
ATI IL 74% 87%
NVIDIA CUDA 63% 58%
NVIDIA OpenCL 57% 49%
Multicore (vendor) 79% 79%
Multicore (autotuned) 46% 40%
![Page 17: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/17.jpg)
Now, let’s think outside of the box
![Page 18: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/18.jpg)
Recipe for Performance Before Multicore
Wider superscalar instr. issue
Increase clock frequency
Increase number of transistor
Load/Store
Instr. scheduling
![Page 19: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/19.jpg)
Recipe for Performance After Multicore
More cores
Parallel software
![Page 20: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/20.jpg)
Recipe for Future Performance
More cores
Better sequential software
DAG Runtime
![Page 21: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/21.jpg)
From Assembly to DAG: PLASMA• Typical loop in assembly
– load R0, #1000– divf F1, F2, F3– load R1, #1000– divf F2, F4, F5– dec R1– jump …
• LU factorization– for i = 1:N– GETR2(A[i, i])– for j = i+1:N– TRSM(A[i, i], A[i, j])– for k = i+1:N– GEMM(A[k,i],A[i,j],A[k,j])Instr. scheduling
GETRF2
TRSM
GETRF2
GEMM GEMM
TRSM GEMM GEMMDAG scheduling
Sequential instruction
stream
Sequential task
stream
![Page 22: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/22.jpg)
Scalability of LU Factorization on Multicore
PLASMA
MKL 11.0
LAPACK
6 cores = 1 socket
24 cores = 4 sockets
48 cores = 8 sockets
AMD Istanbul 2.8 GHz
![Page 23: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/23.jpg)
Combining Multicore and GPU for QR Fact.
AMD Instanbul 2.8 GHz, 24 cores and Fermi GTX 480
![Page 24: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/24.jpg)
Performance on Distributed Memory
QR LU
Cholesky
ORNL KrakenCray XT5AMD Istanbul 2.6 GHzDual-socet 6-coreInterconnect: Seastar, 3D torus
Cray LibSci
DPLASMA
Peak
![Page 25: Thinking Outside of the Tera-Scale Box](https://reader035.fdocuments.net/reader035/viewer/2022062310/5681560c550346895dc3d014/html5/thumbnails/25.jpg)
People and Projects• Jack Dongarra• Mathieu Faverge• Jakub Kurzak• Hatem Ltaief• Stan Tomov• Asim YarKhan• Peng Du• Rick Weber
• MAGMA• PLASMA• DPLASMA and DAGuE
– George Bosilca– Aurelien Bouteiller– Anthony Danalis– Thomas Herault– Pierre Lemarinier