SpeedIT : GPU-based acceleration of sparse linear algebra

Multi-GPU simulations in OpenFOAM with SpeedIT technology.

Description

SpeedIT provides partial acceleration of sparse linear solvers. Acceleration is achieved with a single, reasonably priced NVIDIA Graphics Processing Unit (GPU) that supports CUDA, combined with proprietary advanced optimisation techniques. Check also SpeedIT FLOW, our RANS single-phase flow solver that runs fully on the GPU: vratis.com/blog

Transcript of SpeedIT : GPU-based acceleration of sparse linear algebra

Page 1: SpeedIT : GPU-based acceleration of sparse linear algebra

Multi-GPU simulations in OpenFOAM with SpeedIT technology.

Page 2: SpeedIT : GPU-based acceleration of sparse linear algebra

Attempt I: SpeedIT

• GPU-based library of iterative solvers for Sparse Linear Algebra and CFD.
• Current version: 2.2; version 1.0 released in 2008.
• CMRS format for fast SpMV, plus other storage formats.
• BiCGSTAB and CG iterative solvers (a minimal CG sketch follows below).
• Preconditioners: Jacobi, AMG.
• Runs on one or more GPU cards.
• OpenFOAM compatible.

• Possible applications:
  • Quantum Chemistry,
  • Semiconductor and Power Network Design,
  • CFD (OpenFOAM),
  • etc.
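To make the solver list concrete, below is a minimal CPU sketch of a Jacobi-preconditioned Conjugate Gradient on a CSR matrix, i.e. the kind of solver SpeedIT runs on the GPU. This is an illustrative sketch only; the Csr struct and pcgJacobi function are made up for the example and are not SpeedIT's API.

    // Jacobi-preconditioned CG for A x = b, A stored in CSR (illustrative sketch).
    #include <cmath>
    #include <vector>

    struct Csr {                          // compressed sparse row storage
        int n;                            // number of rows
        std::vector<int> rowPtr, col;     // row pointers (n+1) and column indices
        std::vector<double> val;          // nonzero values
    };

    static void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
        for (int i = 0; i < A.n; ++i) {
            double s = 0.0;
            for (int j = A.rowPtr[i]; j < A.rowPtr[i + 1]; ++j) s += A.val[j] * x[A.col[j]];
            y[i] = s;
        }
    }

    static double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    void pcgJacobi(const Csr& A, const std::vector<double>& b, std::vector<double>& x,
                   double tol = 1e-8, int maxIter = 1000) {
        const int n = A.n;
        std::vector<double> invDiag(n, 1.0), r(n), z(n), p(n), Ap(n);
        for (int i = 0; i < n; ++i)                       // Jacobi preconditioner: 1/diag(A)
            for (int j = A.rowPtr[i]; j < A.rowPtr[i + 1]; ++j)
                if (A.col[j] == i) invDiag[i] = 1.0 / A.val[j];

        spmv(A, x, Ap);                                   // r = b - A x, z = M^-1 r, p = z
        for (int i = 0; i < n; ++i) { r[i] = b[i] - Ap[i]; z[i] = invDiag[i] * r[i]; p[i] = z[i]; }

        double rz = dot(r, z);
        for (int it = 0; it < maxIter && std::sqrt(dot(r, r)) > tol; ++it) {
            spmv(A, p, Ap);
            const double alpha = rz / dot(p, Ap);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            for (int i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
            const double rzNew = dot(r, z);
            const double beta  = rzNew / rz;
            rz = rzNew;
            for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
    }

Each iteration is dominated by one SpMV plus a few dot products and vector updates, which is why the SpMV storage format discussed on the following slides largely determines GPU performance.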

Page 3: SpeedIT : GPU-based acceleration of sparse linear algebra

SpeedIT and OpenFOAM

Plug-in to OpenFOAM

• Provides conversion between the CSR and OpenFOAM LDU data formats.

• Easy substitution of CG/BCG with their GPU versions.

• OpenFOAM source code is unchanged thanks to the plugin concept. No recompilation of OpenFOAM is necessary.

• GPL-based generic Plugin to OpenFOAM.

• Any acceleration toolkit can be integrated now (see the configuration sketch after these steps):

1. Edit the system/controlDict file and add libSpeedIT.so.
2. Edit system/fvSolution and replace CG with CG_accel.
3. Compile the Plugin to OpenFOAM (wmake).
4. Complete the library dependencies: $FOAM_USER_LIBBIN should contain the GPU-based libraries.
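A minimal configuration sketch of steps 1 and 2, assuming the solver name CG_accel quoted above; the preconditioner keyword shown here is a placeholder, and the exact entries shipped with the plugin may differ:

    // system/controlDict -- step 1: load the plugin at run time
    libs ("libSpeedIT.so");

    // system/fvSolution -- step 2: switch a solver to its GPU version
    solvers
    {
        p
        {
            solver          CG_accel;   // GPU-accelerated CG provided by the plugin
            preconditioner  AMG;        // placeholder name for the GPU AMG preconditioner
            tolerance       1e-06;
            relTol          0;
        }
    }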

Page 4: SpeedIT : GPU-based acceleration of sparse linear algebra

SpeedIT vs. CUDA 5.0

SpMV is a key operation in scientific software.

Figure 1: (color online) The speedup of our CMRS implementation of the SpMV kernel against the scalar, vector and HYB kernels as a function of the mean number of nonzero elements per row (µ). Results for lp1-like matrices are omitted for clarity.

The scalar and hybrid kernels give the shortest SpMV times for small µ, but their efficiency decreases as µ is increased, with the scalar kernel being very inefficient for large µ. This property of the scalar kernel is related to its inability to coalesce data transfers if µ is large. The vector kernel behaves in just the opposite way: its efficiency relative to other kernels is very good for large µ, but it decreases as µ drops below ≈ 100. This is related to the fact that the vector kernel processes each row with a warp of 32 threads, and hence is most efficient for sparse matrices with µ far larger than 32. Interestingly, for µ ≳ 120 the speedup of the CMRS over the vector kernel can be approximated as 6µ^(−0.35), i.e. by a power law.
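For reference, here is a minimal CUDA sketch of the "scalar" one-thread-per-row CSR kernel discussed above; it illustrates the standard approach, not SpeedIT's CMRS implementation:

    // Scalar CSR SpMV: y = A*x, one thread per matrix row (illustrative sketch).
    __global__ void spmv_csr_scalar(int rows,
                                    const int*    __restrict__ rowPtr,  // size rows+1
                                    const int*    __restrict__ colIdx,  // size nnz
                                    const double* __restrict__ val,     // size nnz
                                    const double* __restrict__ x,
                                    double*       __restrict__ y)
    {
        const int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < rows)
        {
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colIdx[j]];
            y[row] = sum;
        }
    }

Neighbouring threads start their loops roughly µ elements apart in val and colIdx, so for large µ the global-memory reads cannot be coalesced, which is the inefficiency described above; the vector kernel avoids it by assigning a 32-thread warp to each row.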

From Fig. 1 it can be immediately seen that the CMRS format generally does not yield much improvement over the CRS format for µ ≳ 150, and tends to be systematically slower than the HYB format for µ ≲ 20. Hence one expects that the advantages of the CMRS will be most pronounced for moderate values of µ. This is confirmed by Fig. 2, which depicts the speedup of our CMRS SpMV implementation against the best of all five alternative SpMV implementations considered here, calculated individually for each matrix. Using CMRS can improve the SpMV performance by up to 43%. However, at the moment there is no simple criterion that could be used to tell in advance whether using it can be of any benefit. Nevertheless, since the CMRS format introduces practically no overhead over the vector kernel, we suggest that for the Fermi GPU architecture one should use the CMRS format for matrices with µ ≳ 70 and test its efficiency against the HYB format for 20 ≲ µ ≲ 70.

A quite large dispersion of the data in Fig. 1 suggests that µ is not the only parameter controlling the relative performance of various implementations. As a second such parameter we propose the standard deviation (σ) of the number of nonzero elements (n_i) per row,

$\sigma = \sqrt{\dfrac{\sum_{i=0}^{\mathrm{rows}-1} (n_i - \mu)^2}{\mathrm{rows}}}$.    (3)

Figure 2: (color online) The speedup of our CMRS implementation against the best of all five alternative SpMV implementations for a given matrix.

Figure 3: (color online) The speedup of CMRS against the HYB implementation as a function of σ. Only matrices such that σ < µ are taken into account. Inset: enlargement of the plot for σ ≲ 5.

Figure 3 depicts the σ-dependence of the efficiency of the CMRS-based implementation of the SpMV kernel relative to that of HYB. As could be expected, the HYB kernel is the most efficient for small values of σ. In particular, for σ = 0 all rows have the same number of nonzero elements and the HYB format reduces to the ELL format without any memory overhead. As σ grows, the memory overhead of ELL begins to reduce the efficiency of HYB, and for very large σ an increasing part of the matrix is processed using the COO format, which is computationally inefficient. Our results suggest that as long as σ < µ, the HYB format should be used if σ ≲ 1.5 and the CMRS is more efficient for σ ≳ 3. Matrices for which σ > µ are 'atypical', power-law [14] or lp1-like matrices for which the HYB format is preferable.
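The rule of thumb above translates directly into a small host-side helper. The sketch below is illustrative (the thresholds come from the text, but the function itself is not part of SpeedIT):

    // Choose an SpMV storage format from mu and sigma of the row lengths
    // (Fermi-era heuristic quoted above; illustrative sketch).
    #include <cmath>
    #include <string>
    #include <vector>

    std::string suggestFormat(const std::vector<int>& rowPtr)   // CSR row pointers, size rows+1
    {
        const int    rows = static_cast<int>(rowPtr.size()) - 1;
        const double nnz  = rowPtr[rows] - rowPtr[0];
        const double mu   = nnz / rows;                         // mean nonzeros per row

        double var = 0.0;
        for (int i = 0; i < rows; ++i) {
            const double ni = rowPtr[i + 1] - rowPtr[i];
            var += (ni - mu) * (ni - mu);
        }
        const double sigma = std::sqrt(var / rows);             // Eq. (3)

        if (sigma >= mu)  return "HYB";                         // 'atypical' matrices
        if (mu    >  70)  return "CMRS";
        if (sigma <= 1.5) return "HYB";
        if (sigma >= 3.0) return "CMRS";
        return "benchmark CMRS against HYB";                    // grey zone, roughly 20 < mu < 70
    }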


Fig. Speedup of our CMRS implementation against CUDA 5.0 (HYB/CSR formats).

Fig. Speedup of our CMRS implementation against scalar, vector (CSR) and HYB kernels as a function of the mean number of nonzero elements per row (µ).


Page 5: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & single GPU
Low-cost hardware

Test cases:
• Cavity 3D, 512K cells, icoFoam.
• Human aorta, 500K cells, simpleFoam.
• Hardware and software: CPU Intel Core 2 Duo E8400, 3 GHz, 8 GB RAM @ 800 MHz; GPU NVIDIA GTX 460, 1 GB VRAM; Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1.

• Pressure equation solved with CG preconditioned with DIC and with GAMG (CPU), and with CG preconditioned with AMG (GPU).

GPU cost: $200

Page 6: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & single GPU
Validation

Fig. Velocity and pressure profiles along the x axis for the aorta (top left) and the pressure profile along the x axis for cavity3D (top right); solutions for all preconditioners. Cross-sections of the velocity in the X and Y directions for cavity3D run with OpenFOAM and SpeedIT (lines) and from the literature for Re=100 (dotted line) and Re=400 (solid line), without a turbulence model (icoFoam).

Page 7: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & single GPU
Performance

[Bar charts] Execution times of 50 time steps for CG+DIC (one and two cores), GAMG (one and two cores) and SpeedIT, with the acceleration factors of SpeedIT over each CPU configuration:

simpleFoam: 3.77x vs CG+DIC 1C, 2.40x vs CG+DIC 2C, 1.27x vs GAMG 1C, 0.76x vs GAMG 2C.
pisoFoam: 2.68x vs CG+DIC 1C, 2.88x vs CG+DIC 2C, 0.87x vs GAMG 2C, 0.91x vs GAMG 1C.

Fig. Execution times of 50 time steps (top) and acceleration factors for simpleFoam (left) and pisoFoam (right). Performed on an Intel E8400 @ 3 GHz and a GTX 460. 1C and 2C stand for one and two cores, respectively.

Page 8: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & multi-GPU

• Motivation: large geometries (> 8M cells) do not fit on a single GPU card (max. 6 GB of RAM).

• OpenFOAM performed the domain decomposition; MPI was used for communication between the GPU cards.

• Tests were performed on the IBM PLX cluster at CINECA, with two six-core Intel Westmere 2.40 GHz CPUs and two Tesla M2070 cards per node (supported by NVIDIA). One MPI thread manages one GPU card (see the sketch below).

• Tests focused on multi-GPU preconditioned CG with AMG to solve the pressure equation in steady-state flows.
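A minimal sketch of the "one MPI rank per GPU" binding described above (illustrative only, not SpeedIT source; a production code would use the node-local rank to pick the device):

    // Each MPI rank working on one OpenFOAM subdomain binds to its own GPU.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int devices = 0;
        cudaGetDeviceCount(&devices);          // e.g. 2 Tesla M2070 per PLX node
        cudaSetDevice(rank % devices);         // one rank (thread) manages one GPU card

        // ... build the local matrix from the OpenFOAM decomposition, run the
        //     multi-GPU preconditioned CG, exchange halo values via MPI ...

        printf("rank %d/%d uses GPU %d\n", rank, size, rank % devices);
        MPI_Finalize();
        return 0;
    }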

Page 9: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & multi-GPU

Fig. Time in seconds of the first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal (blue) and GAMG (green) preconditioners; SpeedIT is marked in red. The geometry has 32 million cells. Tests were performed on PLX at CINECA.

[Bar chart data] x-axis: 6, 8, 16, 32 GPUs/cores; series: CPU (PCG with GAMG), CPU (PCG with diagonal) and SpeedIT (GPU). Recoverable times: 1353, 1207, 699, 394 s and 848, 736, 429, 342 s; SpeedIT speedup annotations: x1.15, x1.59, x1.63, x1.62.

Page 10: SpeedIT : GPU-based acceleration of sparse linear algebra

Scaling of OpenFOAM & multi-GPU

• Complex car geometry, external steady-state flow, simpleFoam, 32M cells; pressure: GAMG, velocity: smoothGS.

[Bar charts] Scaling relative to 8 GPUs:

GPUs                 8      16     32     64     88
SpeedIT multi-GPU    1.00   1.93   3.63   7.11   9.23
Ideal scaling        1.00   2.00   4.00   8.00   11.00

Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPUs (right).

Page 11: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & GPU for industries

Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.

Authors: Saeed Iqbal and Kevin Tubbs

The full report can be obtained at the Dell Tech Center.

Page 12: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & GPU for industries

Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.

Authors: Saeed Iqbal and Kevin Tubbs

Page 13: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & GPU for industries

Figure 1: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.

Authors: Saeed Iqbal and Kevin Tubbs

Page 14: SpeedIT : GPU-based acceleration of sparse linear algebra

OpenFOAM & GPU for industries

Figure 2: Total Power and Power Efficiency of 3D cavity case on 8 million cells on a single node.

Authors: Saeed Iqbal and Kevin Tubbs

Page 15: SpeedIT : GPU-based acceleration of sparse linear algebra

SpeedIT 3.0

What’s new:

• New ILU-based preconditioner.

• Support for Kepler NVIDIA cards:
  • much higher memory bandwidth (300 GB/s).

• 5x faster GPU-to-GPU communication thanks to GPUDirect 2.0.

• Release at the beginning of 2013.

Page 16: SpeedIT : GPU-based acceleration of sparse linear algebra

Vratis Ltd.
Muchoborska 18, Wroclaw, Poland
Email: [email protected]

More information: speed-it.vratis.com, vratis.com/blog

Acknowledgments

• Saeed Iqbal and Kevin Tubbs (DELL)
• Stan Posey, Edmondo Orlotti (NVIDIA)
• CINECA, PRACE