![Page 1: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/1.jpg)
Optimizing Geometric Multigrid Method Computation using a DSL Approach
Vinay Vasista, Kumudha Narasimhan, Siddharth Bhat, Uday Bondhugula
Department of Computer Science and Automation, Indian Institute of Science, Bengaluru 560012, India
14th Nov 2017
![Page 2: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/2.jpg)
Outline
![Page 3: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/3.jpg)
Iterative Numerical Methods
Visualization of heat transfer in a pump casing; created by solving the heat equation.
Numerical approximation for partial differential equations
Approximation is improved every iteration
Applications in fields of engineering and physical sciences
![Page 4: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/4.jpg)
Multigrid Methods
Poisson’s equation
$\nabla^2 u = f$
Finite difference discretization; approximate second derivative
Solve $f = Au$, where $A$ is a sparse banded matrix ($u$ is a linearization of the unknown on the multi-dimensional grid)
What about $A^{-1}$?
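As a minimal sketch of what the finite-difference discretization computes, the function below applies the standard 5-point approximation of the Laplacian on a uniform 2D grid with spacing h. The function name and the test field u = x² + y² are illustrative assumptions and do not come from the slides.

```python
def laplacian_5pt(u, h):
    """Apply the 5-point finite-difference Laplacian to the interior of u.

    u is a list of rows (u[i][j]); boundary entries are read but not written.
    Returns a grid of the same shape with zeros on the boundary.
    """
    n_rows, n_cols = len(u), len(u[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            out[i][j] = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                         - 4.0 * u[i][j]) / (h * h)
    return out


# Illustrative check: for u = x^2 + y^2 the exact Laplacian is 4 everywhere,
# and the second-order stencil reproduces it for a quadratic field.
h = 0.25
n = 5
u = [[(i * h) ** 2 + (j * h) ** 2 for j in range(n)] for i in range(n)]
lap = laplacian_5pt(u, h)
```

With the unknowns linearized row by row, each interior row of A holds exactly these five coefficients, which is why A is sparse and banded.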
![Page 7: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/7.jpg)
Multigrid Methods
Involving discretizations at multiple scales
A fast-forward technique in the iterative solving process
Used as direct solvers as well as preconditioners
![Page 8: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/8.jpg)
Multigrid Methods
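[Flowchart: inputs V and F feed the MG-Cycle block; a test on niters < N || Verror < ρ, with True and False branches, decides whether another cycle runs]
Unit of execution: multigrid cycle
Iterate until convergence or for a fixed number of cycles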
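The outer loop around the multigrid cycle can be sketched as follows. This is a hedged illustration: `run_cycles`, `cycle`, `error_of`, `max_cycles`, and `tol` are hypothetical names, and the stopping rule (stop when the cycle budget is exhausted or the error estimate drops below the tolerance) is one reasonable reading of the flowchart, not a transcription of it.

```python
def run_cycles(cycle, v, error_of, max_cycles, tol):
    """Repeat one multigrid cycle until the error is small or the budget is spent.

    cycle:    callable mapping the current approximation to an improved one
    error_of: callable returning a non-negative error estimate
    """
    n_iters = 0
    while n_iters < max_cycles and error_of(v) >= tol:
        v = cycle(v)
        n_iters += 1
    return v, n_iters


# Stand-in "cycle" that halves a scalar error each time (illustration only).
v_conv, iters_conv = run_cycles(lambda v: 0.5 * v, 1.0, abs, max_cycles=50, tol=1e-3)
v_fixed, iters_fixed = run_cycles(lambda v: 0.5 * v, 1.0, abs, max_cycles=3, tol=1e-3)
```

The first call stops on convergence, the second on the fixed cycle count.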
![Page 9: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/9.jpg)
Multigrid Cycles: V-Cycle
[Figure: V-cycle diagram. Legend: Smoother; Defect/Residual; Correction; Restrict/Reciprocate; Interpolate/Prolongation]
![Page 10: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/10.jpg)
Multigrid Methods
Heterogeneous composition of basic computations
Typically expressed as a recursive algorithm
Optimizations limited to one multigrid cycle
Algorithm 1: V-cycle$^h$
Input: $v^h$, $f^h$
    Relax $v^h$ for $n_1$ iterations                      // pre-smoothing
    if coarsest level then
        Relax $v^h$ for $n_2$ iterations                  // coarse smoothing
    else
        $r^h \leftarrow f^h - A^h v^h$                    // residual
        $r^{2h} \leftarrow I_h^{2h} r^h$                  // restriction
        $e^{2h} \leftarrow 0$
        $e^{2h} \leftarrow$ V-cycle$^{2h}(e^{2h}, r^{2h})$
        $e^h \leftarrow I_{2h}^h e^{2h}$                  // interpolation
        $v^h \leftarrow v^h + e^h$                        // correction
        Relax $v^h$ for $n_3$ iterations                  // post-smoothing
    return $v^h$
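To make the recursion concrete, here is a small self-contained sketch of a V-cycle for the 1D model problem −u″ = f with zero Dirichlet boundaries. The choice of a 1D problem, weighted Jacobi (ω = 2/3) as the relaxation, full-weighting restriction, linear interpolation, and the default values of n1, n2, n3 are all assumptions made for illustration; the slides do not fix them.

```python
import math


def relax(v, f, h, iters, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for -u'' = f; boundary values stay fixed."""
    for _ in range(iters):
        new = v[:]
        for i in range(1, len(v) - 1):
            jacobi = 0.5 * (v[i - 1] + v[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * v[i] + omega * jacobi
        v = new
    return v


def residual(v, f, h):
    """r = f - A v, with A the 3-point discretization of -d^2/dx^2."""
    r = [0.0] * len(v)
    for i in range(1, len(v) - 1):
        r[i] = f[i] - (2.0 * v[i] - v[i - 1] - v[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting: coarse point i averages fine points 2i-1, 2i, 2i+1."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1])
    return rc


def interpolate(e):
    """Linear interpolation from a coarse grid to the next finer grid."""
    nf = 2 * (len(e) - 1)
    ef = [0.0] * (nf + 1)
    for i in range(len(e) - 1):
        ef[2 * i] = e[i]
        ef[2 * i + 1] = 0.5 * (e[i] + e[i + 1])
    ef[nf] = e[-1]
    return ef


def v_cycle(v, f, h, n1=2, n2=20, n3=2):
    v = relax(v, f, h, n1)                            # pre-smoothing
    if len(v) <= 3:                                   # coarsest level
        return relax(v, f, h, n2)                     # coarse smoothing
    r_2h = restrict(residual(v, f, h))                # residual + restriction
    e_2h = v_cycle([0.0] * len(r_2h), r_2h, 2.0 * h, n1, n2, n3)
    e_h = interpolate(e_2h)                           # interpolation
    v = [vi + ei for vi, ei in zip(v, e_h)]           # correction
    return relax(v, f, h, n3)                         # post-smoothing


# Illustrative run: f = pi^2 sin(pi x), whose exact solution is sin(pi x).
n = 64
h = 1.0 / n
xs = [i * h for i in range(n + 1)]
f = [math.pi ** 2 * math.sin(math.pi * x) for x in xs]
v = [0.0] * (n + 1)
r0 = max(abs(r) for r in residual(v, f, h))
for _ in range(10):
    v = v_cycle(v, f, h)
r_final = max(abs(r) for r in residual(v, f, h))
err = max(abs(vi - math.sin(math.pi * x)) for vi, x in zip(v, xs))
```

The recursion mirrors the algorithm step by step, which is also why each cycle is a heterogeneous chain of smoothing, residual, restriction, interpolation, and correction stages.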
![Page 11: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/11.jpg)
Multigrid Methods: Basic Steps
Smoothing
$$f(x, y) = \sum_{\sigma_x=-1}^{+1} \sum_{\sigma_y=-1}^{+1} g(x+\sigma_x,\, y+\sigma_y) \cdot w(\sigma_x, \sigma_y)$$
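A direct, unoptimized reading of the smoothing formula can be sketched as below. The function name, the list-of-rows layout, and the uniform averaging weights in the example are assumptions for illustration only.

```python
def apply_stencil_3x3(g, w):
    """f(x, y) = sum over sx, sy in {-1, 0, +1} of g(x+sx, y+sy) * w(sx, sy).

    g is a list of rows; w is a 3x3 list indexed as w[sx + 1][sy + 1].
    Only interior points are computed; the border of the result stays 0.
    """
    nx, ny = len(g), len(g[0])
    f = [[0.0] * ny for _ in range(nx)]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            f[x][y] = sum(g[x + sx][y + sy] * w[sx + 1][sy + 1]
                          for sx in (-1, 0, 1) for sy in (-1, 0, 1))
    return f


# Illustrative weights: a plain 3x3 average applied to a constant grid.
w_avg = [[1.0 / 9.0] * 3 for _ in range(3)]
g_const = [[1.0] * 4 for _ in range(4)]
f_smooth = apply_stencil_3x3(g_const, w_avg)
```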
![Page 12: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/12.jpg)
Multigrid Methods: Basic Steps
Restriction/Decimation
$$f(x, y) = \sum_{\sigma_x=-1}^{+1} \sum_{\sigma_y=-1}^{+1} g(2x+\sigma_x,\, 2y+\sigma_y) \cdot w(\sigma_x, \sigma_y)$$
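The only difference from smoothing is the factor 2 in the index, so the output grid is half the resolution of the input. A minimal sketch, assuming a square fine grid with an odd number of points per side and the (1, 2, 1; 2, 4, 2; 1, 2, 1)/16 full-weighting kernel as the weights:

```python
FULL_WEIGHTING = [[1 / 16, 2 / 16, 1 / 16],
                  [2 / 16, 4 / 16, 2 / 16],
                  [1 / 16, 2 / 16, 1 / 16]]


def restrict_2d(g, w=FULL_WEIGHTING):
    """f(x, y) = sum of g(2x+sx, 2y+sy) * w(sx, sy) for sx, sy in {-1, 0, +1}.

    g is a fine grid with 2*m + 1 points per side; the result has m + 1.
    Coarse boundary points are left at 0.
    """
    nc = (len(g) - 1) // 2 + 1
    f = [[0.0] * nc for _ in range(nc)]
    for x in range(1, nc - 1):
        for y in range(1, nc - 1):
            f[x][y] = sum(g[2 * x + sx][2 * y + sy] * w[sx + 1][sy + 1]
                          for sx in (-1, 0, 1) for sy in (-1, 0, 1))
    return f


fine = [[1.0] * 9 for _ in range(9)]
coarse = restrict_2d(fine)
```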
![Page 13: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/13.jpg)
Multigrid Methods: Basic Steps
Interpolation/Prolongation
$$f(x, y) = \sum_{\sigma_x=0}^{+1} \sum_{\sigma_y=0}^{+1} g\big((x+\sigma_x)/2,\, (y+\sigma_y)/2\big) \cdot w(\sigma_x, \sigma_y, x, y)$$
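One way to read the interpolation formula is with integer (floor) division in the index. Under that assumption, bilinear interpolation corresponds to a weight of 1/4 on every term: at an even coordinate both terms read the same coarse point, and at an odd coordinate they read the two neighbours. The slide's weight w(σx, σy, x, y) may be defined differently; the sketch below only illustrates this particular reading.

```python
def interpolate_2d(g):
    """Bilinear prolongation: f(x, y) = sum of 0.25 * g((x+sx)//2, (y+sy)//2)
    over sx, sy in {0, 1}. g has nc points per side; the result has 2*(nc-1)+1.
    """
    nc = len(g)
    nf = 2 * (nc - 1) + 1
    f = [[0.0] * nf for _ in range(nf)]
    for x in range(nf):
        for y in range(nf):
            f[x][y] = sum(0.25 * g[(x + sx) // 2][(y + sy) // 2]
                          for sx in (0, 1) for sy in (0, 1))
    return f


# A linear coarse field g[i][j] = i + 2j is reproduced exactly on the fine grid
# as x/2 + y, since bilinear interpolation is exact for linear functions.
coarse_lin = [[float(i + 2 * j) for j in range(3)] for i in range(3)]
fine_lin = interpolate_2d(coarse_lin)
```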
![Page 14: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/14.jpg)
Multigrid Methods: Basic Steps
Point-wise
f(x, y) = α · g(x, y)
![Page 15: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/15.jpg)
Multigrid Cycles: W-Cycle
[Figure: W-cycle diagram. Legend: Smoother; Defect/Residual; Correction; Restrict/Reciprocate; Interpolate/Prolongation]
![Page 16: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/16.jpg)
Naive vs Optimized Implementation
Multigrid-2D (Jacobi), 24 cores

| Version | Execution time (s) |
|---|---|
| par | 2.39 |
| opt | 1.05 |
| opt+ | 0.4 |

par: Naive parallelization (OpenMP, vector pragmas; icpc): 5.4×
opt: Optimized code generated by a DSL compiler (PolyMage): tiling + fusion + scratchpads, etc. (2.27× over naive)
opt+: Enhancements in PolyMage: + storage opt + pooled memory + multidim parallelism, etc. (5.92× over naive)
Problem: Speeding up complex pipelines like multigrid cycles manually is hard and tedious
Goal: Automatically determine the right, possibly complex, set of optimizations
![Page 18: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/18.jpg)
Related Work
Semi-automatic ones
– S. Williams, D. Kalamkar et al. [SC 2012]
    Communication aggregation: for dist-mem
    Wavefront schedule at node
    Fusion of Residual and Restrict
    Manual optimization and implementations
– Ghysels and Vanroose [SIAM J. Scientific Computing 2015]
    Pluto tiling for smoothing steps
    Manual optimization and implementation
    Code modifications for array access
– P. Basu, M. Hall et al. [HiPC’13, IPDPS’15]
    Wavefront schedule, fusion
    Semi-automatic (script-driven)
    No tiling transformations
![Page 19: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/19.jpg)
Outline
![Page 20: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/20.jpg)
Using PolyMage DSL to Optimize MG Cycles
PolyMage [ASPLOS 2015]: DSL/Compiler for image processing pipelines
DSL embedded in Python
Automatic Optimizing Code Generator
– Uses domain-specific cost models to apply complex combinations of tiling and fusion to optimize for parallelism and locality
![Page 21: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/21.jpg)
PolyMage for Multigrid
Express grid computations as functions
Abstracts away computations from schedule and storage mapping
Added new constructs for time-iterated stencils (smoothing)
Use the polyhedral framework to transform, optimize, and parallelize
Group functions, tile each group, generate storage mappings
![Page 22: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/22.jpg)
V-cycle Algorithm in PolyMage
N = Parameter(Int, 'n')
V = Grid(Double, "V", [N+2, N+2])
F = Grid(Double, "F", [N+2, N+2])
...
def rec_v_cycle(v, f, l):
    # coarsest level
    if l == 0:
        smooth_p1[l] = smoother(v, f, l, n3)
        return smooth_p1[l][n3]
    # finer levels
    else:
        smooth_p1[l] = smoother(v, f, l, n1)
        r_h[l] = defect(smooth_p1[l][n1], f, l)
        r_2h[l] = restrict(r_h[l], l)
        e_2h[l] = rec_v_cycle(None, r_2h[l], l-1)
        e_h[l] = interpolate(e_2h[l], l)
        v_c[l] = correct(smooth_p1[l][n1], e_h[l], l)
        smooth_p2[l] = smoother(v_c[l], f, l, n2)
        return smooth_p2[l][n2]

def restrict(v, l):
    R = Restrict(([y, x], [extent[l], extent[l]]),
                 Double)
    R.defn = [ Stencil(v, (y, x),
                       [[1, 2, 1],
                        [2, 4, 2],
                        [1, 2, 1]], 1.0/16, factor=1//2) ]
    return R

def smoother(v, f, l, n):
    ...
    W = TStencil(([y, x], [extent[l], extent[l]]),
                 Double, T)
    W.defn = [ v(y, x) - weight *
               (Stencil(v, [y, x],
                        [[ 0, -1,  0],
                         [-1,  4, -1],
                         [ 0, -1,  0]], 1.0/h[l]**2) - f(y, x)) ]
    ...
    return W

def interpolate(v, l):
    ...
    expr = [{}, {}]
    expr[0][0] = Stencil(v, (y, x), [1])
    expr[0][1] = Stencil(v, (y, x), [1, 1]) \
                 * 0.5
    expr[1][0] = Stencil(v, (y, x), [[1], [1]]) \
                 * 0.5
    expr[1][1] = Stencil(v, (y, x), [[1, 1], [1, 1]]) \
                 * 0.25
    P = Interp(([y, x], [extent[l], extent[l]]),
               Double)
    P.defn = [ expr ]
    return P
![Page 23: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/23.jpg)
Polyhedral Representation
[Figure: stages f1, f2, and fout stacked over the x axis]
x = Variable()
fin = Grid(Double, [18])
f1 = Function(([x], [Interval(0, 17)]), Double)
f1.defn = [ fin(x) + 1 ]
f2 = Function(([x], [Interval(1, 16)]), Double)
f2.defn = [ f1(x-1) + f1(x+1) ]
fout = Function(([x], [Interval(2, 15)]), Double)
fout.defn = [ f2(x-1) * f2(x+1) ]
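To see what the three functions compute and why their domains shrink, the pipeline can be evaluated directly in plain Python (no PolyMage involved). This sketch assumes that Interval bounds are inclusive, that the last definition is the product f2(x−1)·f2(x+1), and that the input values 0..17 are arbitrary illustrative data.

```python
# Illustrative input: 18 values, matching Grid(Double, [18]).
fin = [float(i) for i in range(18)]

# Each stage is stored as {x: value} over its (inclusive) interval.
f1 = {x: fin[x] + 1 for x in range(0, 18)}                 # Interval(0, 17)
f2 = {x: f1[x - 1] + f1[x + 1] for x in range(1, 17)}      # Interval(1, 16)
fout = {x: f2[x - 1] * f2[x + 1] for x in range(2, 16)}    # Interval(2, 15)

domain_sizes = (len(f1), len(f2), len(fout))
```

Each stage reads one point to the left and right of x in the previous stage, so the valid domain loses one point per side per stage: 18, 16, then 14 points.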
![Page 24: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/24.jpg)
Polyhedral Representation
[Figure: stages f1, f2, and fout over the x axis with the domain of each function marked; the DSL specification is the same as on the previous slide]
Domains
![Page 25: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/25.jpg)
Polyhedral Representation
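[Figure: stages f1, f2, and fout over the x axis with dependence vectors drawn between stages]
Dependence vectors

| Function | Dependence Vectors |
|---|---|
| fout(x) = f2(x − 1) · f2(x + 1) | (1, 1), (1, −1) |
| f2(x) = f1(x − 1) + f1(x + 1) | (1, 1), (1, −1) |
| f1(x) = fin(x) | |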
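The vectors in the table can be derived mechanically from the access offsets. The sketch below assumes the usual convention that a dependence vector is the consumer's (stage, x) coordinate minus the producer's, with each function one stage after the function it reads; the helper name is hypothetical.

```python
def dependence_vectors(offsets):
    """Dependence vectors for a stage that reads the previous stage at x + o.

    A consumer at (s, x) reading a producer at (s - 1, x + o) gives the
    vector (s - (s - 1), x - (x + o)) = (1, -o).
    """
    return sorted(((1, -o) for o in offsets), reverse=True)


# f2(x) reads f1 at offsets -1 and +1; fout(x) reads f2 at the same offsets.
vectors_f2 = dependence_vectors([-1, +1])
vectors_fout = dependence_vectors([-1, +1])
```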
![Page 26: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/26.jpg)
Polyhedral Representation
[Figure: stages f1, f2, and fout over the x axis with live-outs marked; the dependence table is the same as on the previous slide]
Live-outs
![Page 27: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/27.jpg)
Tiling Smoothing Steps: Overlapped Tiling
[Figure: overlapped tiles drawn across stages f1, f2, f3, f4, f5, and fout over the x axis]
Good for parallelism, locality, local buffers
Redundant computation at boundaries
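The trade-off can be illustrated on a 1D time-iterated 3-point stencil. In this sketch each tile copies its inputs plus a halo into a private scratch buffer and recomputes the halo region itself, so tiles are independent of each other. The averaging stencil, the function names, and the tile size are assumptions chosen for illustration; this is not PolyMage's generated code.

```python
def reference(a, steps):
    """Untiled version: apply the 3-point average `steps` times, ends fixed."""
    for _ in range(steps):
        a = ([a[0]]
             + [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]
             + [a[-1]])
    return a


def overlapped_tiled(a, steps, tile):
    """Same computation, one independent tile of outputs at a time."""
    n = len(a)
    out = a[:]
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo, hi = max(0, start - steps), min(n, end + steps)
        buf = a[lo:hi]                      # scratchpad: tile plus its halo
        for t in range(1, steps + 1):
            # After step t only a region shrunk by one point per side is needed.
            need_lo = max(0, start - (steps - t))
            need_hi = min(n, end + (steps - t))
            new = buf[:]
            for i in range(need_lo, need_hi):
                if 0 < i < n - 1:           # boundary points stay fixed
                    new[i - lo] = (buf[i - 1 - lo] + buf[i - lo] + buf[i + 1 - lo]) / 3.0
            buf = new
        out[start:end] = buf[start - lo:end - lo]
    return out


data = [float((7 * i) % 11) for i in range(20)]
expected = reference(data, 3)
tiled = overlapped_tiled(data, 3, 4)
```

Every tile could run on a different thread with its own small buffer, at the cost of computing halo points that neighbouring tiles also compute.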
![Page 28: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/28.jpg)
Tiling Smoothing Steps: Diamond Tiling
[Figure: diamond tiles drawn across stages f1, f2, f3, f4, f5, and fout over the x axis]
Good for parallelism, locality, but not local buffers
No redundant computation
![Page 29: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/29.jpg)
Overlapped Tiling for Multiscale
[Figure: an overlapped tile spanning stages f, f↓1, f↓2, f↑, and fout over the x axis, annotated with φl, φr, h, o, and τ]
![Page 30: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/30.jpg)
Grouping Stages for Overlapped Tiling
Grouping criteria
– Exponential number of valid groupings
– Redundant computation vs reuse (locality)
– Overlap, tile sizes, parameter estimates
Grouping heuristic
– Greedy iterative algorithm
– Only fuse stages that can be overlap tiled, up to a maximum group size limit
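A deliberately simplified, single-pass version of such a greedy heuristic for a linear chain of stages might look as follows. The function name, the `can_fuse` predicate, and the size limit of 6 are hypothetical; the actual heuristic is iterative and works on a stage graph with cost estimates, which this sketch does not model.

```python
def greedy_group(stages, can_fuse, max_group_size):
    """Walk a linear pipeline and fuse each stage into the current group
    when the predicate allows it and the group is below the size limit."""
    groups = []
    for stage in stages:
        current = groups[-1] if groups else None
        if current and len(current) < max_group_size and can_fuse(current, stage):
            current.append(stage)
        else:
            groups.append([stage])
    return groups


# Stand-in predicate: pretend every pair of neighbouring stages can be
# overlap tiled together (an assumption made only for this example).
always_fusable = lambda group, stage: True

pipeline = ["S", "S", "S", "S", "d", "r", "S", "S", "S", "S"]
groups = greedy_group(pipeline, always_fusable, max_group_size=6)
```

The size limit is what keeps the redundant computation of an overlapped group from growing without bound.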
![Page 31: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/31.jpg)
Grouping
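[Figure: V-cycle stage graph partitioned into groups. Inputs F and V; going down, each level runs four smoother stages (S) and a defect stage (d), followed by restrict (↓r); the coarsest level runs four smoother stages; going up, each level runs interpolate (↑i), correction (c), and four smoother stages]
Legend (markings): Scratchpad, Liveout, Input, Output, Group
Legend (stages): S Smoother; d Defect; ↓r Restrict; ↑i Interpolate; c Correction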
![Page 32: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/32.jpg)
Storage Optimizations
Storage for Functions
– Reuse of buffers across groups and within groups
– Precise liveness information is available
![Page 33: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/33.jpg)
Intra-Group Storage Reuse
Not just about byte count
– Accommodate larger tile sizes
– Maximum performance gains for a configuration
Approach
– Simple greedy algorithm to colour the nodes
– Requires schedule information
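A greedy colouring over live ranges can be sketched as below. The live ranges are assumed to come from the schedule as (time defined, time of last use) pairs, and all functions are assumed to need buffers of the same size; both are simplifying assumptions, and the names are hypothetical.

```python
def assign_buffers(live_ranges):
    """Greedy interval colouring.

    live_ranges: {name: (defined_at, last_use_at)} in schedule order.
    Returns {name: buffer index}; two functions share a buffer only when
    the earlier one is dead before the later one is defined.
    """
    assignment = {}
    busy_until = []                      # buffer index -> last use of occupant
    for name, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        for idx, last_use in enumerate(busy_until):
            if last_use < start:         # occupant is dead: reuse this buffer
                busy_until[idx] = end
                assignment[name] = idx
                break
        else:                            # no free buffer: open a new one
            busy_until.append(end)
            assignment[name] = len(busy_until) - 1
    return assignment


# Five chained smoothing steps: step k is defined at time k and last read
# by step k + 1.
ranges = {"T%d" % k: (k, k + 1) for k in range(5)}
buffers = assign_buffers(ranges)
```

For a chain of smoothing steps this yields two alternating buffers regardless of the number of steps, which frees scratchpad space for larger tiles.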
![Page 34: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/34.jpg)
Inter-Group Storage Reuse
Larger gains in memory used
– Significant reduction in memory consumption
– Less page activity
– Higher chances of fitting in LLC
Approach
– Classification based on parametric bounds
– Alloc/free at group granularity
Pooled allocation and early freeing: across multiple calls and within pipeline call
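The idea of pooled allocation with early freeing can be illustrated with a tiny size-keyed pool. The class and method names are hypothetical and the pool is keyed on exact byte size for simplicity; the generated code's own pool routines are not shown in the slides, so this is only a sketch of the concept.

```python
class BufferPool:
    """Hands out buffers by size, reusing freed ones instead of reallocating."""

    def __init__(self):
        self._idle = {}                  # size in bytes -> list of idle buffers
        self.fresh_allocations = 0

    def allocate(self, size):
        idle = self._idle.get(size)
        if idle:
            return idle.pop()            # reuse a previously freed buffer
        self.fresh_allocations += 1
        return bytearray(size)

    def deallocate(self, buf):
        self._idle.setdefault(len(buf), []).append(buf)


pool = BufferPool()
first = pool.allocate(64)
pool.deallocate(first)                   # early free: the group is done with it
second = pool.allocate(64)               # a later group gets the same storage
third = pool.allocate(128)               # different size: needs a fresh buffer
```

Because the pool outlives a single pipeline call, a second call can be served from buffers freed by the first.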
![Page 35: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/35.jpg)
Storage Allocation for Grouping
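[Figure: the V-cycle stage graph from the Grouping slide, annotated with storage allocation. Inputs F and V; stage rows of smoother (S), defect (d), restrict (↓r), interpolate (↑i), and correction (c) stages]
Legend (markings): Scratchpad, Liveout, Input, Output, Group
Legend (stages): S Smoother; d Defect; ↓r Restrict; ↑i Interpolate; c Correction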
![Page 36: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/36.jpg)
PolyMage Compiler Flow
DSL Spec
→ Build stage graph
→ Static bounds check
→ Inlining
→ Polyhedral representation
→ Initial schedule
→ Alignment, Scaling, Grouping
→ Schedule transformation
→ Storage optimization
→ Code generation
![Page 37: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/37.jpg)
PolyMage Compiler Flow
DSL spec
→ Build stage graph
→ Static bounds check
→ Inlining
→ Auto-merge
→ Alignment
→ Scaling
→ Grouping
→ Polyhedral representation
→ Initial schedule
→ Schedule transformation (fusion, tiling); LibPluto
→ Storage optimization: intra-group reuse opt, inter-group reuse opt
→ Code generation: Multipar, early freeing, pooled memory allocation
![Page 38: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/38.jpg)
Availability
Available as part of PolyMage repository: https://bitbucket.org/udayb/polymage.git
Multigrid benchmarks: sandbox/apps/python/multigrid/
![Page 39: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/39.jpg)
Outline
![Page 40: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/40.jpg)
Benchmarks
Multigrid applications solving Poisson’s equation
$\nabla^2 u = f$
Configurations
– V-cycle and W-cycle
– 10-0-0 and 4-4-4
– Four levels of discretization
NAS-MG from NAS-PB 3.2
![Page 41: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/41.jpg)
Benchmarks
Table: Problem size configurations: the same problem sizes were used for V-cycle and W-cycle and for 4-4-4 and 10-0-0

| Benchmark | Class B: grid size (cycle #iters) | Class C: grid size (cycle #iters) |
|---|---|---|
| 2D | 8192² (10) | 16384² (10) |
| 3D | 256³ (25) | 512³ (10) |
| NAS-MG | 256³ (20) | 512³ (20) |
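For a sense of scale, the memory needed by a single grid at each problem size can be worked out as follows. The sketch assumes 8-byte double-precision values and ignores ghost cells; the helper name is hypothetical.

```python
def grid_mib(points_per_dim, dims, bytes_per_value=8):
    """Size in MiB of one grid of double-precision values (ghost cells ignored)."""
    return points_per_dim ** dims * bytes_per_value / 2 ** 20


footprints = {
    "2D class B": grid_mib(8192, 2),     # 512 MiB
    "2D class C": grid_mib(16384, 2),    # 2048 MiB
    "3D class B": grid_mib(256, 3),      # 128 MiB
    "3D class C": grid_mib(512, 3),      # 1024 MiB
}
```

Even the smallest of these grids is several times larger than a last-level cache of a few tens of megabytes, so the finest level of each cycle streams from main memory unless work is tiled and fused.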
![Page 42: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/42.jpg)
Experimental Setup
Table: Architecture details

| | 2-socket Intel Xeon E5-2690 v3 |
|---|---|
| Clock | 2.60 GHz |
| Cores / socket | 12 |
| Total cores | 24 |
| Hyperthreading | disabled |
| L1 cache / core | 64 KB |
| L2 cache / core | 512 KB |
| L3 cache / socket | 30,720 KB |
| Memory | 96 GB DDR4 ECC 2133 MHz |
| Compiler | Intel C/C++ and Fortran compiler (icc/icpc and ifort) 16.0.0 |
| Compiler flags | -O3 -xhost -openmp -ipo |
| Linux kernel | 3.10.0 (64-bit) (CentOS 7.1) |
![Page 43: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/43.jpg)
Lines of Code
![Page 44: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/44.jpg)
Comparison with Hand Opt and Pluto
Ghysels and Vanroose [SIAM J. Sci. Comp. 2015]
– Manual storage reuse optimization
– Multidimensional parallelism
– Pluto
    – Diamond tiling for concurrent startup
    – Tuned over 25 tile size configurations
![Page 45: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/45.jpg)
2D Benchmarks: V-cycle
[Figure: bar charts of speedup over PolyMG naive (24 cores) by problem class (B, C); panels (a) 2D-V-10-0-0 and (b) 2D-V-4-4-4; series: polymg-opt, polymg-opt+, polymg-dtile-opt+, handopt, handopt+pluto]
Bar labels in panel (a), as listed: 4.18×, 4.1×, 5.5×, 5.35×, 2.69×, 2.58×, 1.19×, 1.25×, 3.17×, 3.27×
Bar labels in panel (b), as listed: 3.15×, 3.25×, 4.22×, 4.15×, 1.24×, 1.19×, 1.41×, 1.24×, 2.21×, 2.24×
![Page 46: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/46.jpg)
2D Benchmarks: W-cycle
[Figure: bar charts of speedup by problem class (B, C); panels (c) 2D-W-10-0-0 and (d) 2D-W-4-4-4; series: polymg-opt, polymg-opt+, polymg-dtile-opt+, handopt, handopt+pluto]
Bar labels in panel (c), as listed: 4.09×, 4.55×, 5.34×, 6.19×, 2.71×, 2.81×, 1.2×, 1.38×, 3.19×, 3.58×
Bar labels in panel (d), as listed: 2.61×, 2.96×, 3.49×, 3.66×, 1.28×, 1.29×, 1.29×, 1.3×, 2.33×, 2.41×
![Page 47: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/47.jpg)
3D Benchmarks: V-cycle
[Figure: bar charts of speedup over PolyMG naive (24 cores) by problem class (B, C); panels (e) 3D-V-10-0-0 and (f) 3D-V-4-4-4; series: polymg-opt, polymg-opt+, polymg-dtile-opt+, handopt, handopt+pluto]
Bar labels in panel (e), as listed: 1.15×, 1.77×, 2.21×, 2.29×, 2.18×, 2.06×, 1.29×, 1.21×, 2.86×, 2.7×
Bar labels in panel (f), as listed: 1.03×, 1.83×, 1.99×, 2.42×, 1.15×, 1.25×, 1.18×, 1.2×, 2.04×, 2.29×
![Page 48: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/48.jpg)
3D Benchmarks: W-cycle
[Figure: bar charts of speedup by problem class (B, C); panels (g) 3D-W-10-0-0 and (h) 3D-W-4-4-4; series: polymg-opt, polymg-opt+, polymg-dtile-opt+, handopt, handopt+pluto]
Bar labels in panel (g), as listed: 1.32×, 1.6×, 2.31×, 1.94×, 2.39×, 2.13×, 1.67×, 1.28×, 2.9×, 2.64×
Bar labels in panel (h), as listed: 1.33×, 1.38×, 1.96×, 2.1×, 1.54×, 1.28×, 2.01×, 1.31×, 2.52×, 2.08×
![Page 49: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/49.jpg)
NAS-MG
[Figure: bar chart of speedup over PolyMG naive (24 cores) by problem class (B, C); series: polymg-opt, polymg-opt+, reference]
Bar labels, as listed: 1.12×, 1.25×, 1.24×, 1.53×, 1.37×, 1.02×
![Page 50: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/50.jpg)
Storage Optimization Gains
[Figure: bar chart of speedup breakdown (24 cores) by benchmark (2D, 3D), V-cycle for 10-0-0; series: Intra Group, +Pool, +Inter Group]
Bar labels, as listed: 1.43×, 1.49×, 1.98×, 1.91×, 2.07×, 2.08×
![Page 51: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/51.jpg)
Summary of Improvements
1.31× Mean speedup of Opt+ over Opt
– 1.30× : 2D
– 1.33× : 3D
3.21× Mean speedup of Opt+ over Naive
– 4.74× : 2D
– 2.18× : 3D
1.23× Mean speedup of Opt+ over HandOpt+Pluto
NAS-MG 17% better over Opt+
![Page 52: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/52.jpg)
Conclusions
– Significant performance improvement (automatic) through DSL approach
– Overlapped tiling: very effective for 2D GMG
– Diamond tiling better than overlapped tiling for most of the 3D benchmarks
– Storage allocation and optimization: important for performance
Future Directions
– Distributed memory
![Page 53: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/53.jpg)
Thank You
Acknowledgments
![Page 54: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/54.jpg)
Tiling Smoothing Steps
[Figure: bar chart of speedup over PolyMage naive (24 cores) by number of smoothing steps (4, 10), 3D stencil for class C; series: PolyMG Opt+, Pluto]
Bar labels, as listed: 4.03×, 2.04×, 3.48×, 2.24×
![Page 55: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/55.jpg)
Sample Output
void pipeline_Vcycle(int N, double * F,
                     double * V, double *& W)
{
  /* Live out allocation */
  /* users : ['T9_pre_L3'] */
  double * _arr_10_2;
  _arr_10_2 = (double *) (pool_allocate(sizeof(double) *
                                        (2+N)*(2+N)));
  #pragma omp parallel for schedule(static) collapse(2)
  for (int Ti = -1; Ti <= N/32; Ti+=1) {
    for (int Tj = -1; Tj <= N/512; Tj+=1) {
      /* Scratchpads */
      /* users : ['T8_pre_L3', 'T6_pre_L3', 'T4_pre_L3',
                  'T2_pre_L3', 'T0_pre_L3'] */
      double _buf_2_0[(50 * 530)];
      /* users : ['T7_pre_L3', 'T5_pre_L3', 'T3_pre_L3',
                  'T1_pre_L3'] */
      double _buf_2_1[(50 * 530)];
      int ubi = min(N, 32*Ti + 49);
      int lbi = max(1, 32*Ti);
      for (int i = lbi; i <= ubi; i+=1) {
        int ubj = min(N, 512*Tj + 529);
        int lbj = max(1, 512*Tj);
        #pragma ivdep
        for (int j = lbj; (j <= ubj); j+=1) {
          _buf_2_0[(-32*Ti+i)*530 + -512*Tj+j] = ...;
        }
      }
      int ubi = min(N, 32*Ti + 48);
      int lbi = max(1, 32*Ti);
      for (int i = lbi; i <= ubi; i+=1) {
        int ubj = min(N, 512*Tj + 528);
        int lbj = max(1, 512*Tj);
        #pragma ivdep
        for (int j = lbj; (j <= ubj); j+=1) {
          _buf_2_1[(-32*Ti+i)*530 + -512*Tj+j] = ...;
        }
      }
      ...
    }
  }
  ...
  pool_deallocate(_arr_10_2);
  ...
}
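The scratchpad dimensions in the sample output follow from the loop bounds: a 32 × 512 tile whose first loop nest extends 49 and 529 points past the tile origin needs a 50 × 530 buffer. The short Python check below reproduces that index arithmetic; the helper names and the choice N = 2048 are assumptions made for illustration.

```python
TILE_I, TILE_J = 32, 512        # tile sizes taken from the loop strides
BUF_I, BUF_J = 50, 530          # scratchpad extents from the declarations


def tile_bounds(N, Ti, Tj):
    """Bounds of the first loop nest for tile (Ti, Tj), as in the listing."""
    lbi, ubi = max(1, TILE_I * Ti), min(N, TILE_I * Ti + 49)
    lbj, ubj = max(1, TILE_J * Tj), min(N, TILE_J * Tj + 529)
    return lbi, ubi, lbj, ubj


def local_index(i, j, Ti, Tj):
    """Flat scratchpad index: global (i, j) shifted to the tile origin."""
    return (-TILE_I * Ti + i) * BUF_J + (-TILE_J * Tj + j)


lbi, ubi, lbj, ubj = tile_bounds(2048, 1, 1)
first_index = local_index(lbi, lbj, 1, 1)
last_index = local_index(ubi, ubj, 1, 1)
overlap_i, overlap_j = BUF_I - TILE_I, BUF_J - TILE_J
```

The 18 extra points per dimension are the overlap region that each tile recomputes so that it does not depend on its neighbours.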
![Page 56: Optimizing Geometric Multigrid Method Computation using a ...](https://reader030.fdocuments.net/reader030/viewer/2022012615/619e2c58203d1f11aa178b1d/html5/thumbnails/56.jpg)
NAS-MG V-Cycle
[Figure: NAS-MG V-cycle diagram. Legend: Smoother; Defect/Residual; Correction; Restrict/Reciprocate; Interpolate/Prolongation]