Parallel Implementations for Solving Tridiagonal Systems

Page 1:

Parallel Implementations for Solving Tridiagonal Systems

(Sparse Days’18)

Pedro Valero-Lara, Ph.D. Researcher

27/Sep/2018

Page 2:

Motivation & Introduction

Motivation:
• Accelerate the computation of batches of tridiagonal problems ➔ Simulation of the Human Brain
• Accelerate the computation of one large tridiagonal problem

Tridiagonal Systems:
• Ax = b, where A is a tridiagonal matrix

Thomas Algorithm:
• The optimal algorithm ➔ 8n operations in 2n-1 steps ➔ Forward & Backward sweeps
• Sequential

Thomas(l, d, u, rhs, n)
// Backward sweep: eliminate the upper diagonal, bottom-up
for i = n-1 → 1 do
    factor = u[i-1] / d[i]
    d[i-1] -= factor × l[i]
    rhs[i-1] -= factor × rhs[i]
end for
rhs[0] /= d[0]
// Forward sweep: substitute, top-down
for i = 1 → n-1 do
    rhs[i] -= l[i] × rhs[i-1]
    rhs[i] /= d[i]
end for
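For reference, a minimal C sketch of the same sequential kernel (not from the slides; array names follow the pseudocode above, and no pivoting is performed, so d is assumed to stay non-zero):

// l = lower diagonal, d = main diagonal, u = upper diagonal,
// rhs = right-hand side, overwritten with the solution x.
void thomas_solver(double *l, double *d, double *u, double *rhs, int n)
{
    // Backward sweep: eliminate the upper diagonal, bottom-up
    for (int i = n - 1; i > 0; --i) {
        double factor = u[i - 1] / d[i];
        d[i - 1]   -= factor * l[i];
        rhs[i - 1] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    // Forward sweep: substitute, top-down
    for (int i = 1; i < n; ++i) {
        rhs[i] -= l[i] * rhs[i - 1];
        rhs[i] /= d[i];
    }
}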

Page 3:

Accelerate the computation of batches of tridiagonal problems
Implementation of cuThomasBatch

Page 4:

Simulation of the Human Brain
Human Brain Project (HBP)

● Computing about 1.5×10¹¹ (150 billion) neurons!!

void hines_solver(double *a, double *b, double *d, double *rhs, int *p, int cell_size)
{
    int i;
    double factor;

    // Backward sweep: eliminate each compartment into its parent p[i]
    for (i = cell_size - 1; i > 0; --i) {
        factor     = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];

    // Forward sweep: substitute from the root towards the leaves
    for (i = 1; i < cell_size; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}

● Ax = b
  ➔ A is sparse (3 vectors) and symmetric
  ➔ Similar to a Tridiagonal System (Thomas algorithm)
● 8×N operations (N = size of the neuron)
● Vector p → branches
  ➔ Jumps in the memory access pattern
  ➔ p stores the “shape” of the neuron
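As an illustration (not from the slides), a hypothetical 7-compartment neuron with one branch point at the soma could be encoded as:

// Toy morphology (illustrative only): compartment 0 is the soma,
// compartments 1-3 form one branch and compartments 4-6 a second branch.
// p[i] is the parent of compartment i; p[0] is unused (the root).
int p[7] = { 0, 0, 1, 2, 0, 4, 5 };
// For an unbranched cable p[i] == i-1 and hines_solver reduces to Thomas;
// the jump at i = 4 (p[4] == 0) is what breaks the regular access pattern.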


Page 5:

Implementation of cuThomasBatch
Parallel Tridiagonal Solve on GPU:

• 1 CUDA Block per Tridiagonal System
  ➔ gtsvStridedBatch (cuSparse)
  ➔ Parallel methods (CR, PCR, …)
  ➔ Saturates the GPU with a “low” number of Tridiagonal Systems
• 1 CUDA thread per Tridiagonal System
  ➔ cuThomasBatch
  ➔ Thomas method
  ➔ Modification of the data layout (once per simulation)
  ➔ Saturates the GPU with a “high” number of Tridiagonal Systems
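A minimal CUDA sketch of the one-thread-per-system idea (not the actual cuThomasBatch source; it assumes the interleaved layout in which element i of system j is stored at i*batchCount + j, so consecutive threads make coalesced accesses):

__global__ void thomas_batch_kernel(double *l, double *d, double *u,
                                    double *rhs, int n, int batchCount)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per system
    if (j >= batchCount) return;

    // Backward sweep over system j
    for (int i = n - 1; i > 0; --i) {
        double factor = u[(i - 1) * batchCount + j] / d[i * batchCount + j];
        d[(i - 1) * batchCount + j]   -= factor * l[i * batchCount + j];
        rhs[(i - 1) * batchCount + j] -= factor * rhs[i * batchCount + j];
    }
    rhs[j] /= d[j];
    // Forward sweep over system j
    for (int i = 1; i < n; ++i) {
        rhs[i * batchCount + j] -= l[i * batchCount + j] * rhs[(i - 1) * batchCount + j];
        rhs[i * batchCount + j] /= d[i * batchCount + j];
    }
}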


Page 6:

Implementation of cuThomasBatch
Test platform:
• MinoTauro (one K80 node)
  ➔ 1 logical K40
  ➔ 2,496 CUDA cores
  ➔ 12 GB GDDR5
  ➔ 240 GB/s

Test case:
• cuThomasBatch vs gtsvStridedBatch (cuSparse)
  ➔ For different system sizes
    • Small (from 64 to 512)
    • Big (from 1,024 to 8,192)
  ➔ For different numbers of Tridiagonal Systems (Batch count)
    • 256, 2,560, 25,600, 256,000 (small systems)
    • 20, 200, 2,000, 20,000 (big systems)


Page 7:

Implementation of cuThomasBatch
cuThomasBatch vs gtsvStridedBatch (cuSparse)

Conclusions:
• cuThomasBatch needs a “high” Batch count
  ➔ From 2,560 for “small” systems
  ➔ From 200 for “big” systems


Page 8:

Implementation of cuThomasBatch
GPU vs Multicore

Conclusions:
• cuThomasBatch continues scaling even when computing a high Batch count
• gtsvStridedBatch (cuSparse) saturates the GPU with a “low” Batch count
  ➔ From 2,560 for “small” systems
  ➔ From 200 for “big” systems


Page 9:

Implementation of cuThomasBatch
Memory Occupancy

Conclusions:
• gtsvStridedBatch (cuSparse) needs much more memory
  ➔ Temporary buffers to store the data of the different levels
• cuThomasBatch does not need temporary buffers
  ➔ 1 CUDA thread per system
• cuThomasBatch is more accurate than gtsvStridedBatch (cuSparse)
  ➔ The Thomas method is sequential (the error is not amplified)


Page 10:

Implementation of cuThomasBatch
NVIDIA Visual Profiler

• Occupancy (92.6%)
  ➔ No divergence
  ➔ Coalesced memory accesses
• Memory bandwidth (140 GB/s)
  ➔ Theoretical peak: 240 GB/s
  ➔ 25% lost to ECC (60 GB/s) → 180 GB/s real peak
  ➔ 140 GB/s ≈ 80% of the real peak

cuThomasBatch is part of the NVIDIA cuSparse library
● gtsvInterleavedBatch
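For reference, a hedged sketch of how the interleaved-batch routine is called through the cuSparse API (error handling is omitted; dl, d, du, x are assumed to already hold batchCount interleaved systems of size m in device memory, and algo = 0 selects the Thomas-based variant according to the cuSparse documentation):

#include <cuda_runtime.h>
#include <cusparse.h>

void solve_interleaved_batch(double *dl, double *d, double *du, double *x,
                             int m, int batchCount)
{
    cusparseHandle_t handle;
    size_t bufSize = 0;
    void *pBuffer = NULL;

    cusparseCreate(&handle);
    // Query and allocate the work buffer, then solve all systems in one call
    cusparseDgtsvInterleavedBatch_bufferSizeExt(handle, 0, m, dl, d, du, x,
                                                batchCount, &bufSize);
    cudaMalloc(&pBuffer, bufSize);
    cusparseDgtsvInterleavedBatch(handle, 0, m, dl, d, du, x,
                                  batchCount, pBuffer);
    cudaFree(pBuffer);
    cusparseDestroy(handle);
}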


Page 11:

Accelerate the computation of one large tridiagonal system
Implementation of dsss_dgtsv@LASs

Page 12:

Implementation of dsss_dgtsv@LASs
Parallel Tridiagonal Solve on Multicore:

• Parallel methods
  ➔ CR and PCR
  ➔ High number of operations
  ➔ Synchronization between steps
• Our approach: combine PCR and Thomas
  ➔ PCR: at every step s, PCR generates 2^s independent systems of size n/2^s
  ➔ Thomas: the optimal method in terms of operations (8n)
  ➔ What is the best switch point between PCR and Thomas?
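To make the PCR half concrete, a minimal C sketch of one PCR reduction step (illustrative only, not the LASs code; the OpenMP pragma stands in for the multicore parallelism):

// One PCR step with stride s: every equation i is combined with equations
// i-s and i+s, doubling the coupling distance. After SP such steps the
// system splits into 2^SP independent tridiagonal systems of size n/2^SP,
// each of which is then solved with Thomas.
void pcr_step(const double *l, const double *d, const double *u, const double *rhs,
              double *nl, double *nd, double *nu, double *nrhs, int n, int s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double k1 = (i - s >= 0) ? l[i] / d[i - s] : 0.0;
        double k2 = (i + s <  n) ? u[i] / d[i + s] : 0.0;
        double um = (i - s >= 0) ? u[i - s]   : 0.0;
        double lp = (i + s <  n) ? l[i + s]   : 0.0;
        double rm = (i - s >= 0) ? rhs[i - s] : 0.0;
        double rp = (i + s <  n) ? rhs[i + s] : 0.0;
        nd[i]   = d[i]   - k1 * um - k2 * lp;
        nrhs[i] = rhs[i] - k1 * rm - k2 * rp;
        nl[i]   = (i - s >= 0) ? -k1 * l[i - s] : 0.0;
        nu[i]   = (i + s <  n) ? -k2 * u[i + s] : 0.0;
    }
}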


Page 13:

Implementation of dsss_dgtsv@LASs
Best Switch Point (SP)

• We compute the best SP (in terms of number of operations) from the following cost model (theoretical prediction):

  cost(SP) = (12 × n / #cores) × SP + 8 × (n / 2^SP)

• Test platform
  • One node of the MareNostrum IV supercomputer
  • 2x 24-core Intel Xeon Platinum 8160
• Experimental results
  • 4 different variants of PCR+Thomas
  • The best variants are in agreement with the theoretical study
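A minimal C sketch of how the best SP can be picked from this model (illustrative only; the function and variable names are not from LASs):

#include <math.h>

// Cost model above: SP PCR steps of 12*n/cores operations each, plus a
// Thomas solve of size n/2^SP per core.
static double sp_cost(double n, double cores, int sp)
{
    return (12.0 * n / cores) * (double)sp + 8.0 * (n / pow(2.0, sp));
}

// Return the switch point in [1, max_sp] that minimizes the modeled cost.
int best_switch_point(double n, double cores, int max_sp)
{
    int best = 1;
    double best_cost = sp_cost(n, cores, 1);
    for (int sp = 2; sp <= max_sp; ++sp) {
        double c = sp_cost(n, cores, sp);
        if (c < best_cost) { best_cost = c; best = sp; }
    }
    return best;
}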


Page 14:

Implementation of dsss_dgtsv@LASs
Performance Analysis

• CR, PCR and PCR+Thomas
  ➔ Extrae + Paraver (traces)
• Error
  ➔ PCR+Thomas < full PCR
• Speedup w.r.t. MKL
  ➔ The MKL routine is sequential and performs pivoting


Page 15:

Conclusions
• Implementation of cuThomasBatch
  ➔ Up to ~2.5x faster than the reference routine in NVIDIA cuSparse (gtsvStridedBatch)
  ➔ Less memory space required
  ➔ More numerical accuracy
  ➔ Included in NVIDIA cuSparse under the name gtsvInterleavedBatch
  ➔ Accessible in the BSC repo: https://pm.bsc.es/gitlab/run-math/cuThomasBatch
  ➔ Accessible in cuSparse: https://docs.nvidia.com/cuda/cusparse/
• Implementation of dsss_dgtsv@LASs
  ➔ It is possible (and easy) to compute the best “Switch Point”
    • Auto-tuning code
  ➔ Much faster and numerically more accurate than the other parallel variants (PCR and CR)
  ➔ Up to ~4x faster than the reference routine in Intel MKL
  ➔ LASs is not accessible yet, coming soon!


Page 16:

Acknowledgment
• European Flagship project “Human Brain Project”
  ➔ Raül Sirvent
• NVIDIA cuSparse team
  ➔ gtsvInterleavedBatch
  ➔ Lung Sheng Chien, Harun Bayraktar, Alex Fit-Florea
• Announcements
  ➔ PUMPS + AI Summer School
    • Advanced CUDA + AI
    • Wen-Mei Hwu and David Kirk
    • http://pumps.bsc.es
  ➔ Open positions:
    • https://www.bsc.es/join-us/job-opportunities/


Page 17:

Thank you!

27/Sep/2018

[email protected]