Page 1:

Kate Clark, July 28th 2016

ACCELERATING LATTICE QCD MULTIGRID ON GPUS USING FINE-GRAINED PARALLELIZATION

Page 2:

CONTENTS

QUDA Library
Adaptive Multigrid
QUDA Multigrid
Results
Conclusion

Page 3:

QUDA

• “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
  – Latest release 0.8.0 (8th February 2016)
• Provides:
  – Various solvers for all major fermionic discretizations, with multi-GPU support
  – Additional performance-critical routines needed for gauge-field generation
• Maximize performance
  – Exploit physical symmetries to minimize memory traffic
  – Mixed-precision methods
  – Autotuning for high performance on all CUDA-capable architectures
  – Domain-decomposed (Schwarz) preconditioners for strong scaling
  – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
  – Multi-source solvers
  – Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale


Page 4:

Page 5:

QUDA CONTRIBUTORS

§ Ron Babich (NVIDIA)
§ Michael Baldhauf (Regensburg)
§ Kip Barros (LANL)
§ Rich Brower (Boston University)
§ Nuno Cardoso (NCSA)
§ Kate Clark (NVIDIA)
§ Michael Cheng (Boston University)
§ Carleton DeTar (Utah University)
§ Justin Foley (Utah -> NIH)
§ Joel Giedt (Rensselaer Polytechnic Institute)
§ Arjun Gambhir (William and Mary)
§ Steve Gottlieb (Indiana University)
§ Dean Howarth (Rensselaer Polytechnic Institute)
§ Bálint Joó (JLab)
§ Hyung-Jin Kim (BNL -> Samsung)
§ Claudio Rebbi (Boston University)
§ Guochun Shi (NCSA -> Google)
§ Mario Schröck (INFN)
§ Alexei Strelchenko (FNAL)
§ Alejandro Vaquero (INFN)
§ Mathias Wagner (NVIDIA)
§ Frank Winter (JLab)

(multigrid collaborators in green)


Page 6:

MULTIGRID FOR QCD

Page 7:

WHY MULTIGRID?

[Figure: Dirac operator applications vs. mass parameter for CG, EigCG, and MG-GCR solvers on 16³x64, 24³x64, and 32³x96 lattices (Babich et al 2010).]

[Figure: Chroma propagator benchmark (figure by Balint Joo): wallclock time per source in single precision for light-quark solves (QUDA on 32 XK nodes vs. MultiGrid on 16 XE nodes) and strange-quark solves (QUDA on 16 XK nodes vs. MultiGrid on 16 XE nodes). MG Chroma integration by Saul Cohen; MG algorithm by James Osborn (Osborn et al 2010).]

Optimality, Speed, Stability

Page 8:

ADAPTIVE GEOMETRIC MULTIGRID

Adaptively find candidate null-space vectors

Dynamically learn the null space and use this to define the prolongator

Algorithm is self learning

Setup

1. Set solver to be simple smoother

2. Apply current solver to random vector vi = P(D) ηi

3. If convergence good enough, solver setup complete

4. Construct prolongator using fixed coarsening (1 - P R) vk = 0

➡ Typically use 4^4 geometric blocks

➡ Preserve chirality when coarsening: R = γ5 P† γ5 = P†

5. Construct coarse operator (Dc = R D P)

6. Recurse on coarse problem

7. Set solver to be augmented V-cycle, goto 2

Falgout

Babich et al 2010
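For concreteness, a schematic host-side rendering of steps 1-7 above. All types, hooks, and function names here are hypothetical stand-ins, not the QUDA setup interface; it is a sketch of the control flow only.

#include <cstddef>
#include <functional>
#include <vector>

// Schematic of the adaptive setup loop (steps 1-7 on the slide above).
using Vector = std::vector<double>;
using Solver = std::function<Vector(const Vector &)>;

struct SetupHooks {                                      // illustrative stand-ins
  std::function<Vector()> random_vector;                               // step 2 source
  std::function<Solver(int level)> make_smoother;                      // steps 1 and 7
  std::function<bool(const std::vector<Vector> &)> good_enough;        // step 3
  std::function<void(int level, const std::vector<Vector> &)> build_prolongator; // step 4
  std::function<void(int level)> build_coarse_op;                      // step 5: Dc = R D P
};

void adaptive_setup(const SetupHooks &h, int level, int n_vec, int max_level) {
  Solver smoother = h.make_smoother(level);              // 1. simple smoother
  std::vector<Vector> null_vecs;
  for (int i = 0; i < n_vec; i++)
    null_vecs.push_back(smoother(h.random_vector()));    // 2. v_i = P(D) eta_i
  if (h.good_enough(null_vecs)) return;                  // 3. setup complete
  h.build_prolongator(level, null_vecs);                 // 4. fixed 4^4 blocks, R = gamma5 P† gamma5
  h.build_coarse_op(level);                              // 5. Dc = R D P
  if (level + 1 < max_level)
    adaptive_setup(h, level + 1, n_vec, max_level);      // 6. recurse on the coarse problem
  // 7. in the full algorithm the solver becomes an augmented V-cycle and the
  //    loop returns to step 2; that refinement iteration is omitted here.
}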

Page 9:

MULTIGRID ON GPUS

Page 10:

THE CHALLENGE OF MULTIGRID ON GPU

GPU requirements very different from CPU
Each thread is slow, but O(10,000) threads per GPU

Fine grids run very efficiently
High parallel throughput problem

Coarse grids are the worst possible scenario
More cores than degrees of freedom

Increasingly serial and latency bound

Little’s law (bytes = bandwidth * latency)

Amdahl’s law limiter

Multigrid exposes many of the problems expected at the Exascale

Page 11:

INGREDIENTS FOR PARALLEL ADAPTIVE MULTIGRID

▪ Multigrid setup
  – Block orthogonalization of null-space vectors
  – Batched QR decomposition

▪ Smoothing (relaxation on a given grid)
  – Repurpose existing solvers

▪ Prolongation
  – Interpolation from coarse grid to fine grid
  – One-to-many mapping (see the sketch after this list)

▪ Restriction
  – Restriction from fine grid to coarse grid
  – Many-to-one mapping

▪ Coarse operator construction (setup)
  – Evaluate R A P locally
  – Batched (small) dense matrix multiplication

▪ Coarse-grid solver
  – Need optimal coarse-grid operator
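To illustrate the one-to-many character of prolongation noted above, here is a minimal CUDA sketch under assumed names and layout (prolongate_stub, block_of, V, nfine_dof, and ncoarse_dof are illustrative, not QUDA's data structures): every fine site of a block reads the same coarse degrees of freedom.

#include <thrust/complex.h>
using cplx = thrust::complex<float>;

// One thread per fine lattice site: each fine site reads the coarse dofs of
// its block and applies its stored interpolation coefficients (a one-to-many map).
__global__ void prolongate_stub(cplx *fine, const cplx *coarse, const cplx *V,
                                const int *block_of, int n_fine_sites,
                                int nfine_dof, int ncoarse_dof) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  if (x >= n_fine_sites) return;
  const cplx *c = coarse + block_of[x] * ncoarse_dof;  // coarse dofs of x's block
  const cplx *v = V + x * nfine_dof * ncoarse_dof;     // interpolation rows for site x
  for (int i = 0; i < nfine_dof; i++) {                // each fine dof is a small dot product
    cplx sum = 0.0f;
    for (int j = 0; j < ncoarse_dof; j++)
      sum += v[i * ncoarse_dof + j] * c[j];
    fine[x * nfine_dof + i] = sum;
  }
}

A real implementation would additionally parallelize over the fine degrees of freedom, in the spirit of the fine-grained coarse-operator parallelization described on the following slides.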

[Figure: nearest-neighbour Dslash stencil with link matrices U_μ.]


Page 12:

COARSE GRID OPERATOR

▪ Coarse operator looks like a Dirac operator (many more colors)
  – Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)

▪ Fine vs. coarse grid parallelization
  – Fine grid operator has plenty of grid-level parallelism, e.g., 16x16x16x16 = 65,536 lattice sites
  – Coarse grid operator has diminishing grid-level parallelism
    – first coarse grid 4x4x4x4 = 256 lattice sites
    – second coarse grid 2x2x2x2 = 16 lattice sites

▪ Current GPUs have up to 3840 processing cores

▪ Need to consider finer-grained parallelization
  – Increase parallelism to use all GPU resources
  – Load balancing

dofs (geometry). We start by defining the fields

$$W^{\pm\mu}_{ksc,\,ls'c'} = V^{\dagger}_{ksc,\,ksc}\, P^{\pm\mu}_{s,s'}\, U_{(k+\mu)c,\,lc'}\,\delta_{k+\mu,l}\, V_{ls'c',\,ls'c'},$$

note that here we are defining different links for forward and backward; they are not simply the conjugate of each other (because of the different spin projection between the two). Also note that these effective link matrices also have a spin index: this is because the vectors used to define the V rotation matrices now have spin dependence. In this form we can write down the coarse Dirac operator as

$$D_{isc,\,js'c'} = -\,\delta_{i,k/B}\,\delta_{s,s/B_s}\left[\sum_{\mu}\left( W^{-\mu}_{ksc,\,ls'c'}\,\delta_{k+\mu,l} + W^{+\mu\,\dagger}_{ksc,\,ls'c'}\,\delta_{k-\mu,l}\right)\right]\delta_{l/B,j}\,\delta_{s'/B_s,s'} + M\,\delta_{isc,\,js'c'}.$$

We now finish up by blocking the geometry and spin onto the coarse lattice, defining the effective link matrices Y^{±µ} that connect sites on the coarse lattice:

$$Y^{\pm\mu}_{isc,\,js'c'} = \delta_{i,k/B}\,\delta_{s,s/B_s}\, W^{\pm\mu}_{ksc,\,ls'c'}\,\delta_{l/B,j}\,\delta_{s'/B_s,s'}\,\delta_{i\pm\mu,j}, \qquad (2)$$

$$X_{isc,\,js'c'} = \delta_{i,k/B}\,\delta_{s,s/B_s}\left[\sum_{\mu}\left( W^{-\mu}_{isc,\,ks'c'} + W^{+\mu\,\dagger}_{isc,\,ks'c'}\right)\right]\delta_{l/B,j}\,\delta_{s'/B_s,s'}\,\delta_{i,j},$$

where we note now that the matrix X is not Hermitian. Thus the coarse operator is written

$$D_{isc,\,js'c'} = -\sum_{\mu}\left( Y^{-\mu}_{isc,\,js'c'}\,\delta_{i+\mu,j} + Y^{+\mu\,\dagger}_{isc,\,js'c'}\,\delta_{i-\mu,j}\right) + \left(M - X_{isc,\,js'c'}\right)\delta_{isc,\,js'c'}. \qquad (3)$$

For the explicit form of these matrices we refer the reader to Appendix A. After the first blocking, subsequent blockings require that B_s = 1, i.e., we cannot block the spin dimension again since we cannot remove the chirality. Apart from this observation, the next coarse operator will have a similar form to the current one: it will be a nearest-neighbour non-Hermitian operator connecting sites with d_s = 2 spin dimension (in 2d and 4d anyway).

We note here in passing that because the definition of the matrix field V includes explicit spin dependence, this destroys the tensor product structure of spin and colour on the coarse operator, i.e., we have to define an effective link matrix that rotates in spin and colour space. If this were not the case, i.e., if V were spin independent, then this structure would be preserved.

[Figure: nearest-neighbour Dslash stencil with link matrices U_μ.]

Page 13:

SOURCE OF PARALLELISM

1. Grid parallelism: volume of threads

2. Link matrix-vector partitioning: 2 Nvec-way parallelism (spin * color); the thread y index selects one output row:

$$\begin{pmatrix} c_0\\ c_1\\ c_2\\ c_3 \end{pmatrix} \mathrel{+}= \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03}\\ a_{10} & a_{11} & a_{12} & a_{13}\\ a_{20} & a_{21} & a_{22} & a_{23}\\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{pmatrix}$$

3. Stencil direction: 8-way parallelism

[Figure: Dslash stencil contributions assigned to warps 0–3.]

4. Dot-product partitioning: 4-way parallelism

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0\\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2\\ b_3 \end{pmatrix}$$
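One way to picture how these four sources map onto a CUDA launch is the hypothetical stub below; the block shape, kernel name, and bounds are assumptions for illustration, not QUDA's kernels.

#include <cstdio>

// Hypothetical mapping of the parallelism sources above onto CUDA thread dimensions.
__global__ void coarse_dslash_stub(int volume, int nvec) {
  int site     = blockIdx.x * blockDim.x + threadIdx.x;  // 1. grid parallelism
  int spin_col = blockIdx.y * blockDim.y + threadIdx.y;  // 2. output spin*color row
  int dim_dir  = blockIdx.z * blockDim.z + threadIdx.z;  // 3. stencil dimension/direction
  if (site >= volume || spin_col >= 2 * nvec || dim_dir >= 8) return;
  // 4. dot-product partitioning would additionally split the summation over the
  //    input spin*color index between lanes of a warp (see the later slides).
}

int main() {
  int volume = 4 * 4 * 4 * 4, nvec = 24;            // a first-level coarse grid
  dim3 block(64, 2, 1);
  dim3 grid((volume + block.x - 1) / block.x,       // sites
            (2 * nvec + block.y - 1) / block.y,     // spin * color
            8);                                     // 4 dimensions x 2 directions
  coarse_dslash_stub<<<grid, block>>>(volume, nvec);
  cudaDeviceSynchronize();
  printf("launched %u x %u x %u blocks of %u x %u x %u threads\n",
         grid.x, grid.y, grid.z, block.x, block.y, block.z);
  return 0;
}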

Page 14:

COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24

[Figure: GFLOPS vs. lattice length (2 to 10) for the baseline kernel and for successively finer-grained parallelization (color-spin, dimension + direction, dot product); chart annotations mark the 16-way and 24,576-way parallel cases.]

Page 15:

COARSE GRID OPERATOR PERFORMANCE

▪ Autotuner finds optimum degree of parallelization

▪ Larger grids favor less fine-grained parallelization

▪ Coarse grids favor the most fine-grained parallelization

▪ GPU is nearly always faster than CPU

▪ Expect in future that coarse grids will favor CPUs

▪ For now, use GPU exclusively

8-core Haswell 2.4 GHz (solid lines) vs. M6000 (dashed lines), FP32

[Figure: Coarse Dslash performance, GFLOPS vs. 2N (0 to 100), for coarse grid sizes 2x2x2x2, 4x2x2x2, 4x2x2x4, 4x2x4x4, and 4x4x4x4; solid symbols: CPU (8-core Haswell 2.4 GHz), open symbols / dashed lines: GPU (M6000).]


Page 16:

RESULTS

Page 17:

MULTIGRID VERSUS BICGSTAB

• Compare MG against the state-of-the-art traditional Krylov solver on GPU
  • BiCGstab in double/half precision with reliable updates
  • 12/8 reconstruct
  • Symmetric red-black preconditioning

• Adaptive Multigrid algorithm
  • GCR outer solver wraps a 3-level MG preconditioner
  • GCR restarts done in double, everything else in single
  • 24 null-space vectors on the fine grid
  • Minimum Residual smoother
  • Symmetric red-black preconditioning on each level
  • GCR coarse-grid solver

$$M_{ee} = 1_{ee} - \kappa^2\, A_{ee}^{-1} D_{eo}\, A_{oo}^{-1} D_{oe}$$
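As a reading aid for the even-odd (Schur-complement) operator above, here is a schematic of how its matrix-vector product could be applied, written against placeholder operator callbacks; apply_Mee and the hooks are illustrative, not QUDA's API.

#include <cstddef>
#include <functional>
#include <vector>

// Schematic application of M_ee = 1 - kappa^2 A_ee^{-1} D_eo A_oo^{-1} D_oe.
using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec &)>;

Vec apply_Mee(const Op &D_eo, const Op &D_oe,
              const Op &A_ee_inv, const Op &A_oo_inv,
              double kappa, const Vec &x_e) {
  Vec tmp = D_oe(x_e);      // hop from even to odd sites
  tmp = A_oo_inv(tmp);      // invert the clover/mass term on odd sites
  tmp = D_eo(tmp);          // hop back from odd to even sites
  tmp = A_ee_inv(tmp);      // invert the clover/mass term on even sites
  Vec out(x_e);
  for (std::size_t i = 0; i < out.size(); i++)
    out[i] -= kappa * kappa * tmp[i];   // x_e - kappa^2 (...) x_e
  return out;
}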

Page 18:

MULTIGRID VERSUS BICGSTAB
Wilson, V = 24³x64, single workstation (3x M6000)

[Figure: Time to solution vs. mass parameter (-0.42 to -0.40) for BiCGstab (double-half) and GCR-MG (double-single).]

            Iterations            GFLOPS
mass     BiCGstab  GCR-MG    BiCGstab  GCR-MG
-0.400        251      15         980     376
-0.405        372      16         980     372
-0.410        510      17         980     353
-0.415        866      18         980     314
-0.420       3103      19         980     293

Page 19:

MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X)

[Figure: Time to solution vs. number of nodes for BiCGstab and MG on two lattices, V = 40³x256 (mπ = 230 MeV) and V = 48³x96 (mπ = 197 MeV); MG speedups of 6.6x/6.3x at 24/48 nodes and 7.9x/6.1x at 20/32 nodes.]

Page 20:

MULTIGRID VERSUS BICGSTAB
Wilson-clover, strong scaling on Titan (K20X), V = 64³x128, mπ = 197 MeV

[Figure: Time to solution vs. number of nodes (32, 64, 128, 256, 512) for BiCGstab and MG; MG speedups of 5.5x, 10.2x, 8.9x, and 7.4x.]

Page 21:

MULTIGRID TIMING BREAKDOWN
Wilson-clover, strong scaling on Titan (K20X), V = 64³x128, 12 linear solves

[Figure: Time vs. number of nodes (64, 128, 256, 512), broken down by multigrid level (level 1, level 2, level 3).]

Page 22:

MULTIGRID FUTURE WORK

• Absolute performance tuning, e.g., half precision on coarse grids

• Strong scaling improvements:
  • Combine with Schwarz preconditioner
  • Accelerate coarse-grid solver: CA-GMRES instead of GCR, deflation
  • More flexible coarse-grid distribution, e.g., redundant nodes

• Investigate offload of coarse grids to the CPU
  • Use CPU and GPU simultaneously using additive MG

• Full offload of setup phase to GPU – required for HMC

Page 23:

MULTI-SRC SOLVERS

• Multi-src solvers increase locality through link-field reuse • see previous talk by M. Wagner

• Multi-grid operators even more so since link matrices are 48x48 • Coarse Dslash / Prolongator / Restrictor

• Coarsest grids also latency limited • Kernel level latency • Network latency

• Multi-src solvers are a solution • More parallelism • Bigger messages

[Figure: Coarse Dslash performance (GFLOPS) on M6000 vs. number of right-hand sides (1 to 128) for 2^4 and 4^4 coarse grids; > 3x speedup from batching right-hand sides.]
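A schematic of the locality argument above (the loop structure and names are illustrative only, not QUDA's multi-src kernels): one link block is loaded per (site, direction) and applied to every right-hand side, so the memory traffic for the 48x48 link matrices is amortized as the number of sources grows.

#include <thrust/complex.h>
using cplx = thrust::complex<float>;

// Inner loop only: the n x n link block for one (site, direction) is reused
// for all n_rhs right-hand sides, raising arithmetic intensity with n_rhs.
__device__ void apply_link_multi_rhs(cplx *out, const cplx *link,
                                     const cplx *in, int n, int n_rhs) {
  for (int row = 0; row < n; row++) {
    for (int rhs = 0; rhs < n_rhs; rhs++) {
      cplx sum = 0.0f;
      for (int col = 0; col < n; col++)
        sum += link[row * n + col] * in[rhs * n + col];  // same link element for every rhs
      out[rhs * n + row] += sum;
    }
  }
}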

Page 24:

PASCAL

• Newly launched Pascal P100 GPU provides a significant performance uplift through stacked HBM memory

• Wilson-clover dslash: 3x speedup vs. K40
  • double ~500 GFLOPS
  • single ~1000 GFLOPS
  • half ~2000 GFLOPS

• The same kernel from 2008 runs unchanged
  • 5 generations ago

• Similar gains for MG building blocks

[Figure: Wilson-clover Dslash and coarse Dslash performance (GFLOPS) vs. lattice length (2 to 10) on Kepler, Maxwell, and Pascal (P100).]

Page 25:

DGX-1

• Pascal includes NVLink interconnect technology
  • 4x NVLink connections per GPU
  • 160 GB/s bi-directional bandwidth per GPU

• NVIDIA DGX-1
  • 8x P100 GPUs
  • 2³ NVLink topology
  • 4x EDR IB (via PCIe switches)
  • 2x Intel Xeon hosts

• 1 DGX-1 has inter-GPU bandwidth equivalent to 64 nodes of Titan

[Figure: DGX-1 NVLink and PCIe topology.]

Page 26:

COMING SOON

• >10x speedup from MG
  • A lower bound, since there is much more optimization to do

• 6x node-to-node speedup
  • DGX-1 (8x GP100) vs Pi0g (4x K40)

• 3x speedup from multi-src solvers
  • Increased temporal locality from links and increased parallelism for MG

• Expect >100x speedup for analysis workloads versus the previous GPU workflow

Page 27:

CONCLUSIONS AND OUTLOOK

• Multigrid algorithms for LQCD are running well on GPUs
  • In production by multiple groups
  • Wilson, Wilson-clover, twisted-mass, twisted-clover all supported
  • Staggered in progress (see next talk by E. Weinberg)
  • 5-d operators when I get a chance…

• Up to 10x speedup
  • Fine-grained parallelization was key
  • To be presented / published at SC16

• Lots more to come
  • Combine with multi-right-hand-side solvers and the Pascal architecture

Page 28:

GRID PARALLELISM

__device__ void grid_idx(int x[], const int X[]) {
  // X[] holds the local lattice dimensions
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  int za = (idx / X[0]);
  int zb = (za / X[1]);
  x[1] = za - zb * X[1];
  x[3] = (zb / X[2]);
  x[2] = zb - x[3] * X[2];
  x[0] = idx - za * X[0];
  // x[] now holds the thread's lattice coordinates
}

Thread x dimension maps to location on the grid
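For context, a hypothetical kernel showing how a coordinate helper like grid_idx above might be used; fill_site_index, the output array, and the launch shape are assumptions, not QUDA code. It assumes the grid_idx listing above is in scope.

// One thread per lattice site, the x thread dimension linearized over the
// local 4-d volume; recomputes the lexicographical index from the coordinates.
__global__ void fill_site_index(int *out, int X0, int X1, int X2, int X3) {
  int X[4] = {X0, X1, X2, X3};
  int volume = X[0] * X[1] * X[2] * X[3];
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= volume) return;
  int x[4];
  grid_idx(x, X);                       // recover (x[0], x[1], x[2], x[3])
  out[idx] = ((x[3] * X[2] + x[2]) * X[1] + x[1]) * X[0] + x[0];  // equals idx
}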

Page 29:

MATRIX-VECTOR PARALLELISM

template <int Nv>
__device__ void color_spin_idx(int &s, int &c) {
  int yIdx = blockDim.y*blockIdx.y + threadIdx.y;
  s = yIdx / Nv;   // s is now the spin index for this thread
  c = yIdx % Nv;   // c is now the color index for this thread
}

Each stencil application is a sum of matrix-vector products

Parallelize over output vector indices (parallelization over color and spin)

Thread y dimension maps to vector indices

Up to 2 x Nv more parallelism

$$\begin{pmatrix} c_0\\ c_1\\ c_2\\ c_3 \end{pmatrix} \mathrel{+}= \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03}\\ a_{10} & a_{11} & a_{12} & a_{13}\\ a_{20} & a_{21} & a_{22} & a_{23}\\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{pmatrix}$$

(the thread y index runs over the output rows)
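Putting the pieces together, a sketch of how the y-indexed thread computes its one output row, using color_spin_idx from the listing above; coarse_mv_row and the data layout are illustrative assumptions, not QUDA's kernel.

#include <thrust/complex.h>
using cplx = thrust::complex<float>;

// The thread's y index selects one output (spin, color) component; that thread
// forms the dot product of the corresponding link-matrix row with the input vector.
template <int Nv>
__device__ void coarse_mv_row(cplx *out, const cplx *link, const cplx *in) {
  int s, c;
  color_spin_idx<Nv>(s, c);            // output row owned by this thread
  int row = s * Nv + c;                // row in the 2*Nv x 2*Nv link matrix (2 coarse spins)
  cplx sum = 0.0f;
  for (int col = 0; col < 2 * Nv; col++)
    sum += link[row * 2 * Nv + col] * in[col];
  out[row] += sum;
}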

Page 30:

STENCIL DIRECTION PARALLELISM

Write result to shared memory
Synchronize
dim=0/dir=0 threads combine and write out the result

Introduces up to 8x more parallelism

__device__ void dim_dir_idx(int &dim, int &dir) {
  int zIdx = blockDim.z*blockIdx.z + threadIdx.z;
  dir = zIdx % 2;   // dir is now the fwd/back direction for this thread
  dim = zIdx / 2;   // dim is now the dimension for this thread
}
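A minimal sketch of the shared-memory combine described above, assuming blockDim.z == 8 and gridDim.y == gridDim.z == 1; stencil_combine_stub and its layout are illustrative, not QUDA's kernel. It uses dim_dir_idx from the listing above.

// Each z-thread computes the contribution of one (dim, dir) pair, stores it in
// shared memory, and the dim=0/dir=0 thread for that site sums all eight.
__global__ void stencil_combine_stub(float *out) {
  __shared__ float partial[256];                 // assumes blockDim.x*y*z <= 256
  int dim, dir;
  dim_dir_idx(dim, dir);
  float contribution = 1.0f;                     // placeholder for this direction's work
  int lane = (threadIdx.y * blockDim.x + threadIdx.x) * blockDim.z + threadIdx.z;
  partial[lane] = contribution;
  __syncthreads();
  if (dim == 0 && dir == 0) {                    // one thread per site combines
    float sum = 0.0f;
    for (int i = 0; i < (int)blockDim.z; i++) sum += partial[lane + i];
    int site = blockIdx.x * blockDim.x * blockDim.y
             + threadIdx.y * blockDim.x + threadIdx.x;
    out[site] = sum;
  }
}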

Partition the computation over stencil direction and dimension onto different threads

[Figure: Dslash stencil contributions assigned to warps 0–3.]

Page 31:

DOT PRODUCT PARALLELIZATION I

const int warp_size = 32;                     // warp size
const int n_split = 4;                        // four-way warp split
const int grid_points = warp_size/n_split;    // grid points per warp

complex<real> sum = 0.0;
for (int i=0; i<N; i+=n_split)
  sum += a[i] * b[i];

// cascading reduction
for (int offset = warp_size/2; offset >= grid_points; offset /= 2)
  sum += __shfl_down(sum, offset);
// first grid_points threads now hold the desired result

Partition dot product between threads in the same warp
Use warp shuffle for final result
Useful when not enough grid parallelism to fill a warp

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0\\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2\\ b_3 \end{pmatrix}$$

Page 32:

DOT PRODUCT PARALLELIZATION II

const int n_ilp = 2;   // two-way ILP
complex<real> sum[n_ilp] = { };
for (int i=0; i<N; i+=n_ilp)
  for (int j=0; j<n_ilp; j++)
    sum[j] += a[i+j] * b[i+j];

complex<real> total = static_cast<real>(0.0);
for (int j=0; j<n_ilp; j++) total += sum[j];

Degree of ILP exposed
Multiple computations with no dependencies
Compute final result

Partition the dot-product computation within a thread
Hide dependent arithmetic latency within a thread
More important for Kepler than Maxwell / Pascal

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0\\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2\\ b_3 \end{pmatrix}$$

Page 33:

ERROR REDUCTION AND VARIANCE
V = 40³x256, mπ = 230 MeV