Parallel FMM
Matthew Knepley
Computation Institute, University of Chicago
Department of Molecular Biology and Physiology, Rush University Medical Center
Conference on High Performance Scientific Computing
In Honor of Ahmed Sameh's 70th Birthday
Purdue University, October 11, 2010
M. Knepley (UC) SC Sameh ’10 1 / 37
Main Point
Using estimates and proofs,
a simple software architecture,
gets good scaling, efficiency,
and adaptive load balance.
Collaborators
The PetFMM team:
Prof. Lorena Barba, Dept. of Mechanical Engineering, Boston University
Dr. Felipe Cruz, developer of GPU extension, Nagasaki Advanced Computing Center, Nagasaki University
Dr. Rio Yokota, developer of 3D extension, Dept. of Mechanical Engineering, Boston University
Collaborators
Chicago Automated Scientific Computing Group:
Prof. Ridgway Scott, Dept. of Computer Science and Dept. of Mathematics, University of Chicago
Peter Brune (biological DFT), Dept. of Computer Science, University of Chicago
Dr. Andy Terrel (Rheagen), Dept. of Computer Science and TACC, University of Texas at Austin
Complementary Work
Outline
1 Complementary Work
2 Short Introduction to FMM
3 Parallelism
4 What Changes on a GPU?
5 PetFMM
Complementary Work
FMM Work
Queue-based hybrid execution:
OpenMP for multicore processors
CUDA for GPUs
Adaptive hybrid Treecode-FMM:
Treecode competitive only for very low accuracy
Very high flop rates for the treecode M2P operation
Computation/communication overlap FMM:
Provably scalable formulation
Overlap P2P with M2L
Short Introduction to FMM
Outline
1 Complementary Work
2 Short Introduction to FMM
3 Parallelism
4 What Changes on a GPU?
5 PetFMM
Short Introduction to FMM
FMM Applications
FMM can accelerate both integral and boundary element methods for:
Laplace
Stokes
Elasticity
Advantages:
Mesh-free
O(N) time
Distributed and multicore (GPU) parallelism
Small memory bandwidth requirement
Short Introduction to FMM
Fast Multipole Method
FMM accelerates the calculation of the function

Φ(x_i) = Σ_j K(x_i, x_j) q(x_j)    (1)

Accelerates O(N^2) to O(N) time
The kernel K(x_i, x_j) must decay quickly away from the diagonal (x_i, x_i)
Can be singular on the diagonal (Calderón-Zygmund operator)
Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
Very similar to recent wavelet techniques
For the Laplace kernel, the sum specializes to

Φ(x_i) = Σ_j q_j / |x_i − x_j|
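The direct evaluation of sum (1) is the O(N^2) baseline that FMM accelerates. A minimal sketch of that baseline, using the Laplace kernel from the slides (the function names here are illustrative, not PetFMM's API):

```python
import numpy as np

def direct_sum(x, q, kernel):
    """Direct O(N^2) evaluation of Phi(x_i) = sum_{j != i} K(x_i, x_j) q_j."""
    N = len(x)
    phi = np.zeros(N)
    for i in range(N):
        for j in range(N):
            if i != j:  # the kernel may be singular on the diagonal
                phi[i] += kernel(x[i], x[j]) * q[j]
    return phi

# The Laplace kernel K(x_i, x_j) = 1 / |x_i - x_j| from the slides:
laplace = lambda xi, xj: 1.0 / np.linalg.norm(xi - xj)

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
q = np.array([1.0, 2.0, 3.0])
phi = direct_sum(x, q, laplace)  # phi[0] = 2/1 + 3/1 = 5.0
```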
Short Introduction to FMM
Spatial Decomposition
Pairs of boxes are divided into near and far:
Neighbors are treated as very near.
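In a quadtree the near/far split can be computed from integer box indices at each level. A small sketch of the standard construction (helper names are hypothetical, not PetFMM code):

```python
def children(box):
    """Children of box (ix, iy) at the next finer quadtree level."""
    ix, iy = box
    return [(2 * ix + dx, 2 * iy + dy) for dx in (0, 1) for dy in (0, 1)]

def near(a, b):
    """Boxes at the same level are near (neighbors) if they touch."""
    return abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1

def interaction_list(box, parent_neighbors):
    """Far boxes handled by M2L: children of the parent's neighbors
    that are not themselves neighbors of this box."""
    return [c for p in parent_neighbors for c in children(p)
            if not near(box, c)]

# An interior box has at most 6^2 - 3^2 = 27 far boxes in 2D:
il = interaction_list((4, 4), [(i, j) for i in (1, 2, 3) for j in (1, 2, 3)])
```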
Short Introduction to FMM
Functional Decomposition
Upward sweep: create multipole expansions (P2M, M2M).
Downward sweep: evaluate local expansions (M2L, L2L, L2P).
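The two sweeps can be sketched on a small uniform quadtree. The kernels below are counters standing in for the real translation operators, so the sketch only shows the control flow, not PetFMM's implementation:

```python
from collections import Counter
from itertools import product

calls = Counter()
def kernel(name):
    # Stand-in for a real FMM operator; just records that it was applied.
    def apply(*boxes):
        calls[name] += 1
    return apply

P2M, M2M, M2L, L2L, L2P, P2P = (
    kernel(n) for n in ("P2M", "M2M", "M2L", "L2L", "L2P", "P2P"))

L = 2  # finest level of a uniform quadtree (level k has 4^k boxes)
def boxes(k): return list(product(range(2 ** k), repeat=2))
def parent(b): return (b[0] // 2, b[1] // 2)
def children(b): return [(2*b[0]+dx, 2*b[1]+dy) for dx in (0, 1) for dy in (0, 1)]
def near(a, b): return abs(a[0]-b[0]) <= 1 and abs(a[1]-b[1]) <= 1

# Upward sweep: create multipole expansions.
for b in boxes(L):
    P2M(b)                                  # particles -> multipole
for k in range(L, 0, -1):
    for b in boxes(k):
        M2M(b, parent(b))                   # shift multipole to parent

# Downward sweep: evaluate local expansions.
for k in range(1, L + 1):
    for b in boxes(k):
        pn = [p for p in boxes(k - 1) if near(parent(b), p)]
        for s in (c for p in pn for c in children(p) if not near(b, c)):
            M2L(s, b)                       # far field: multipole -> local
for k in range(0, L):
    for b in boxes(k):
        for c in children(b):
            L2L(b, c)                       # shift local expansion to child
for b in boxes(L):
    L2P(b)                                  # local expansion -> particles
    for n in (n for n in boxes(L) if near(b, n)):
        P2P(n, b)                           # direct near-field interaction
```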
Parallelism
Outline
1 Complementary Work
2 Short Introduction to FMM
3 Parallelism
4 What Changes on a GPU?
5 PetFMM
Parallelism
FMM in Sieve
The Quadtree is a Sieve with optimized operations
Multipoles are stored in Sections
Two Overlaps are defined: Neighbors and Interaction List
Completion moves data for Neighbors and the Interaction List
Parallelism
FMM Control Flow
Upward sweep: create multipole expansions (P2M, M2M).
Downward sweep: evaluate local expansions (M2L, L2L, L2P).
Kernel operations will map to GPU tasks.
Parallelism
FMM Control Flow: Parallel Operation
[Figure: a root tree above level k, with sub-trees 1-8 forming the local domains below; M2M and L2L translations and M2L transformations act between levels.]
Parallelism
Parallel Tree Implementation
Divide tree into a root and local trees
Distribute local trees among processes
Provide communication pattern for local sections (overlap): both neighbor and interaction-list overlaps
Sieve generates MPI from a high-level description
Parallelism
Parallel Tree Implementation: How should we distribute trees?
Multiple local trees per process allow good load balance
Partition a weighted graph to minimize load imbalance and communication
Computation estimate:
Leaf: N_i p (P2M) + n_I p^2 (M2L) + N_i p (L2P) + 3^d N_i^2 (P2P)
Interior: n_c p^2 (M2M) + n_I p^2 (M2L) + n_c p^2 (L2L)
Communication estimate:
Diagonal: n_c (L − k − 1)
Lateral: 2^d (2^{m(L−k−1)} − 1) / (2^m − 1) for incidence dimension m
Leverage existing work on graph partitioning (ParMETIS)
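The computation estimates above translate directly into vertex weights for the partitioning graph. A hedged sketch (variable names follow the slide; `tree_weight` is a hypothetical helper, not PetFMM's API):

```python
def tree_weight(N_i, p, n_I, n_c=4, d=2, leaf=True):
    """Estimated work for one local-tree node, from the slide's model:
    leaf:     N_i p (P2M) + n_I p^2 (M2L) + N_i p (L2P) + 3^d N_i^2 (P2P)
    interior: n_c p^2 (M2M) + n_I p^2 (M2L) + n_c p^2 (L2L)
    N_i = particles in the box, p = expansion terms,
    n_I = interaction-list size, n_c = children, d = dimension."""
    if leaf:
        return N_i * p + n_I * p**2 + N_i * p + 3**d * N_i**2
    return n_c * p**2 + n_I * p**2 + n_c * p**2

# These weights label the graph vertices; edges carry the communication
# estimates, and the weighted graph is handed to a partitioner (ParMETIS).
w_leaf = tree_weight(N_i=10, p=8, n_I=27)               # 80 + 1728 + 80 + 900
w_int  = tree_weight(N_i=0, p=8, n_I=27, leaf=False)    # 256 + 1728 + 256
```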
Parallelism
Parallel Tree Implementation: Why should a good partition exist?
Shang-hua Teng, Provably good partitioning and load balancing algorithmsfor parallel adaptive N-body simulation, SIAM J. Sci. Comput., 19(2), 1998.
Good partitions exist for non-uniform distributions:
2D: O(n^{1/2} (log n)^{3/2}) edge cut
3D: O(n^{2/3} (log n)^{4/3}) edge cut
As scalable as regular grids
As efficient as uniform distributions
ParMETIS will find a nearly optimal partition
Parallelism
Parallel Tree Implementation: Will ParMETIS find it?
George Karypis and Vipin Kumar, Analysis of Multilevel Graph Partitioning,Supercomputing, 1995.
Good partitions exist for non-uniform distributions:
2D: C_i = 1.24^i C_0 for random matching
3D: C_i = 1.21^i C_0 (??) for random matching
The 3D proof needs assurance that the average degree does not increase
Efficient in practice
Parallelism
Parallel Tree Implementation: Advantages
Simplicity
Complete serial code reuse
Provably good performance and scalability
Parallelism
Distributing Local Trees
The interaction of local trees is represented by a weighted graph:
[Figure: local trees as graph vertices with weights w_i and w_j, joined by edges with communication weights c_ij.]
This graph is partitioned, and trees are assigned to processes.
Parallelism
Local Tree Distribution
Here local trees are assigned to processes:
[Figure: a 32×32 grid of leaf cells labeled by owning process (letters a-p); each of 16 processes owns a contiguous region.]
Parallelism
Parallel Data Movement
1 Complete neighbor section
2 Upward sweep
  1 Upward sweep on local trees
  2 Gather to root tree
  3 Upward sweep on root tree
3 Complete interaction list section
4 Downward sweep
  1 Downward sweep on root tree
  2 Scatter to local trees
  3 Downward sweep on local trees
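The phase ordering above can be sketched as straight-line code. The functions below are stubs that only record execution order (illustrative; the real phases are Sieve completion plus the serial sweeps):

```python
log = []
def phase(name):
    # Stub standing in for the real operation; records execution order.
    def run():
        log.append(name)
    return run

complete_neighbor_section  = phase("complete neighbor section")
upward_sweep_local_trees   = phase("upward sweep on local trees")
gather_to_root_tree        = phase("gather to root tree")
upward_sweep_root_tree     = phase("upward sweep on root tree")
complete_interaction_list  = phase("complete interaction list section")
downward_sweep_root_tree   = phase("downward sweep on root tree")
scatter_to_local_trees     = phase("scatter to local trees")
downward_sweep_local_trees = phase("downward sweep on local trees")

# Communication (completion, gather, scatter) alternates with
# fully serial sweeps on the root and local trees.
complete_neighbor_section()
upward_sweep_local_trees()
gather_to_root_tree()
upward_sweep_root_tree()
complete_interaction_list()
downward_sweep_root_tree()
scatter_to_local_trees()
downward_sweep_local_trees()
```

Because each local tree reuses the serial sweep unchanged, only the completion, gather, and scatter phases involve MPI.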
Parallelism
PetFMM Load Balance
[Figure: load balance (0-1) versus number of processors (2-256) for four cases: uniform 4M L8R5, uniform 10M L9R5, spiral 1M L8R5, and spiral with space-filling 1M L8R5.]
Parallelism
Local Tree Distribution
Here local trees are assigned to processes for a spiral distribution:
[Figure: leaf cells labeled by owning process for the spiral distribution: (a) 2 cores, (b) 4 cores.]
Parallelism
Local Tree Distribution
[Figure: leaf cells labeled by owning process for the spiral distribution: (c) 8 cores, (d) 16 cores.]
Parallelism
Local Tree Distribution
Here local trees are assigned to processes for a spiral distribution:
[Figure: spiral assignment of local trees to processes; quadtree leaf boxes labeled by owning process]
(e) 32 cores
[Figure: spiral assignment of local trees to processes; quadtree leaf boxes labeled by owning process]
(f) 64 cores
M. Knepley (UC) SC Sameh ’10 24 / 37
![Page 41: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/41.jpg)
What Changes on a GPU?
Outline
1 Complementary Work
2 Short Introduction to FMM
3 Parallelism
4 What Changes on a GPU?
5 PetFMM
M. Knepley (UC) SC Sameh ’10 25 / 37
![Page 42: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/42.jpg)
What Changes on a GPU?
Multipole-to-Local Transformation
Re-expands a multipole series as a Taylor series
Up to 85% of time in FMM; tradeoff with direct interaction
Dense matrix multiplication with 2p^2 rows
Each interaction list box: (6^d − 3^d) 2^{dL} transforms
d = 2, L = 8 gives 1,769,472 matvecs
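The matvec count above can be checked with a tiny script (illustrative, not part of PetFMM):

```python
# Number of M2L matrix-vector products in a uniform FMM tree:
# each of the 2^{dL} finest-level boxes has an interaction list of
# up to 6^d - 3^d boxes (children of the parent's neighbors, minus
# the box's own near neighbors).
def m2l_count(d, L):
    return (6**d - 3**d) * 2**(d * L)

print(m2l_count(2, 8))  # 27 * 65536 = 1,769,472
```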
M. Knepley (UC) SC Sameh ’10 26 / 37
![Page 43: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/43.jpg)
What Changes on a GPU?
GPU M2LVersion 0
One thread per M2L transform
Thread block (TB) transforms one Multipole Expansion (ME) for each Interaction List (IL) box — 27 times
p = 12
Matrix size is 2304 bytes
Plenty of work per thread (81 Kflops or 36 flops/byte)
BUT, 16K shared memory only holds 7 matrices
Memory limits concurrency!
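The 7-matrix limit is simple arithmetic; a sketch (assuming the 2p^2 matrix entries are complex single precision, 8 bytes each, which reproduces the quoted 2304 bytes):

```python
p = 12
# Assumed layout: 2*p**2 complex single-precision entries (8 bytes each)
entries = 2 * p * p            # 288
matrix_bytes = entries * 8     # 2304 bytes, as on the slide
shared_mem = 16 * 1024         # 16K shared memory per multiprocessor
print(shared_mem // matrix_bytes)  # only 7 matrices fit
```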
M. Knepley (UC) SC Sameh ’10 27 / 37
![Page 49: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/49.jpg)
What Changes on a GPU?
GPU M2LVersion 1
Apply M2L transform matrix-free
m2l_{ij} = (-1)^i \binom{i+j}{j} t^{-i-j-1}    (2)
Traverse matrix by perdiagonals
Same work
No memory limit on concurrency
8 concurrent TBs per MultiProcessor (MP)
27 × 8 = 216 threads, BUT max is 512
M2L ME = LE
Algorithm limits concurrency!
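The matrix-free entry of Equation (2) is cheap to evaluate; a serial Python sketch (illustrative, not the CUDA kernel; `t` stands for the separation between box centers):

```python
from math import comb

def m2l(i, j, t):
    # Entry (i, j) of the M2L operator, Eq. (2)
    return (-1)**i * comb(i + j, j) * t ** (-i - j - 1)

# Entries on one perdiagonal (constant i + j) share the power of t,
# which is what the diagonal traversal exploits. For i + j = 2:
t = 2.0
print([m2l(i, 2 - i, t) for i in range(3)])  # [0.125, -0.25, 0.125]
```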
M. Knepley (UC) SC Sameh ’10 28 / 37
![Page 53: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/53.jpg)
What Changes on a GPU?
GPU M2LVersion 1
Apply M2L transform matrix-free
m2l_{ij} = (-1)^i \binom{i+j}{j} t^{-i-j-1}    (2)
Traverse matrix by perdiagonals
Same work
No memory limit on concurrency
8 concurrent TBs per MultiProcessor (MP)
27 × 8 = 216 threads, BUT max is 512
20 GFlops
5× Speedup of Downward Sweep
Algorithm limits concurrency!
M. Knepley (UC) SC Sameh ’10 28 / 37
![Page 55: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/55.jpg)
What Changes on a GPU?
GPU M2LVersion 1
Apply M2L transform matrix-free
m2l_{ij} = (-1)^i \binom{i+j}{j} t^{-i-j-1}    (2)
Additional problem: not enough parallelism for data movement
Move 27 LEs to global memory per TB
27 × 2p = 648 floats
With 32 threads, takes 21 memory transactions
Algorithm limits concurrency!
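The 21-transaction figure follows from coalescing granularity; a rough check (assuming one fully coalesced 128-byte transaction per 32-thread access):

```python
import math

p = 12
floats = 27 * 2 * p          # 648 local-expansion coefficients per TB
bytes_total = floats * 4     # single precision
# One fully coalesced transaction moves 128 bytes (32 threads x 4 bytes)
transactions = math.ceil(bytes_total / 128)
print(transactions)  # 21
```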
M. Knepley (UC) SC Sameh ’10 28 / 37
![Page 56: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/56.jpg)
What Changes on a GPU?
GPU M2LVersion 2
One thread per element of the LE
m2l_{ij} = (-1)^i \binom{i+j}{j} t^{-i-j-1}    (3)
Each thread does a dot product
Cannot use diagonal traversal, more work
Avoid branching:
Each row precomputes t^{-i-1}
All threads loop to p + 1, only store t^{-i-1}
Loop unrolling
No thread synchronization
M2L ME = LE
Examine memory access
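One thread per LE element means each output coefficient is an independent dot product over the ME. A serial sketch of what a single thread computes (illustrative Python, using Eq. (3) with the row-constant factor t^{-i-1} pulled out, as on the slide):

```python
from math import comb

def le_entry(i, me, t):
    # Local-expansion coefficient i as a dot product of M2L row i
    # with the multipole expansion `me`. The factor t**(-i-1) is
    # constant along the row, so it is precomputed once; the inner
    # loop then has no branches.
    ti = t ** (-i - 1)
    return sum((-1)**i * comb(i + j, j) * t**(-j) * ti * me[j]
               for j in range(len(me)))

me = [1.0, 0.5, 0.25]
le = [le_entry(i, me, 2.0) for i in range(len(me))]
```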
M. Knepley (UC) SC Sameh ’10 29 / 37
![Page 60: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/60.jpg)
What Changes on a GPU?
GPU M2LVersion 2
One thread per element of the LE
m2l_{ij} = (-1)^i \binom{i+j}{j} t^{-i-j-1}    (3)
Each thread does a dot product
Cannot use diagonal traversal, more work
Avoid branching:
Each row precomputes t^{-i-1}
All threads loop to p + 1, only store t^{-i-1}
Loop unrolling
No thread synchronization
300 GFlops
15× Speedup of Downward Sweep
Examine memory access
M. Knepley (UC) SC Sameh ’10 29 / 37
![Page 62: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/62.jpg)
What Changes on a GPU?
Memory Bandwidth
Superior GPU memory bandwidth is due to both bus width and clock speed.

                          CPU    GPU
Bus Width (bits)           64    512
Bus Clock Speed (MHz)     400   1600
Memory Bandwidth (GB/s)     3    102
Latency (cycles)          240    600
Tesla always accesses blocks of 64 or 128 bytes
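The bandwidth rows follow directly from width × clock (a rough check; DDR prefetch factors are folded into the quoted effective clock):

```python
def bandwidth_gb_s(bus_bits, clock_mhz):
    # bytes per transfer * transfers per second, in GB/s (10^9 bytes)
    return (bus_bits / 8) * clock_mhz * 1e6 / 1e9

print(bandwidth_gb_s(512, 1600))  # ~102 GB/s (GPU row)
print(bandwidth_gb_s(64, 400))    # ~3 GB/s  (CPU row)
```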
M. Knepley (UC) SC Sameh ’10 30 / 37
![Page 63: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/63.jpg)
What Changes on a GPU?
GPU M2LVersion 3
Coalesce and overlap memory accesses
Coalescing is
a group of 16 threads
accessing consecutive addresses of 4, 8, or 16 bytes
in the same block of memory (32, 64, or 128 bytes)
480 GFlops
25× Speedup of Downward Sweep
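Two of the block sizes above come directly from half-warp width times word size (an illustrative check; the 32-byte case corresponds to narrower accesses):

```python
HALF_WARP = 16  # Tesla coalesces memory accesses per half-warp
# A fully coalesced access: 16 threads reading consecutive words of one
# size land in a single memory block.
block_bytes = {w: HALF_WARP * w for w in (4, 8)}
print(block_bytes)  # {4: 64, 8: 128} -- two of the slide's block sizes
```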
M. Knepley (UC) SC Sameh ’10 31 / 37
![Page 64: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/64.jpg)
What Changes on a GPU?
GPU M2LVersion 3
Coalesce and overlap memory accesses
Memory accesses can be overlapped with computation:
when a TB is waiting for data from main memory
another TB can be scheduled on the SM
512 TBs can be active at once on Tesla
480 GFlops
25× Speedup of Downward Sweep
M. Knepley (UC) SC Sameh ’10 31 / 37
![Page 65: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/65.jpg)
What Changes on a GPU?
GPU M2LVersion 3
Coalesce and overlap memory accesses
Note that to reach the theoretical peak (1 TF)
MULT and FMA must execute simultaneously
346 GOps
Without this, peak can be closer to 600 GFlops
480 GFlops
25× Speedup of Downward Sweep
M. Knepley (UC) SC Sameh ’10 31 / 37
![Page 66: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/66.jpg)
What Changes on a GPU?
Design Principles
M2L required all of these optimization steps:
Many threads per kernel
Avoid branching
Unroll loops
Coalesce memory accesses
Overlap main memory access with computation
M. Knepley (UC) SC Sameh ’10 32 / 37
![Page 67: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/67.jpg)
PetFMM
Outline
1 Complementary Work
2 Short Introduction to FMM
3 Parallelism
4 What Changes on a GPU?
5 PetFMM
M. Knepley (UC) SC Sameh ’10 33 / 37
![Page 68: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/68.jpg)
PetFMM
PetFMM
PetFMM is a freely available implementation of the Fast Multipole Method
http://barbagroup.bu.edu/Barba_group/PetFMM.html
Leverages PETSc
Same open source license
Uses Sieve for parallelism
Extensible design in C++
Templated over the kernel
Templated over traversal for evaluation
MPI implementation
Novel parallel strategy for anisotropic/sparse particle distributions
PetFMM: A dynamically load-balancing parallel fast multipole library
86% efficient strong scaling on 64 procs
Example application using the Vortex Method for fluids
(coming soon) GPU implementation
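The 86% figure is strong-scaling parallel efficiency, i.e. speedup divided by process count (a quick illustration; on 64 processes it implies roughly a 55× speedup):

```python
def parallel_efficiency(t_serial, t_parallel, nprocs):
    # Strong scaling: same problem size, increasing process count
    return (t_serial / t_parallel) / nprocs

# 86% efficiency on 64 procs corresponds to about a 55x speedup:
print(0.86 * 64)  # 55.04
```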
M. Knepley (UC) SC Sameh ’10 34 / 37
![Page 69: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/69.jpg)
PetFMM
PetFMM CPU PerformanceStrong Scaling
[Figure: speedup vs. number of processors (2 to 256, log-log) for uniform 4M L8R5, uniform 10M L9R5, spiral 1M L8R5, and spiral with space-filling 1M L8R5, compared against perfect speedup]
M. Knepley (UC) SC Sameh ’10 35 / 37
![Page 70: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/70.jpg)
PetFMM
PetFMM CPU PerformanceStrong Scaling
[Figure: time (sec, 10^-2 to 10^3) vs. number of processors (2 to 256) for ME Initialization, Upward Sweep, Downward Sweep, Evaluation, Load balancing stage, and Total time]
M. Knepley (UC) SC Sameh ’10 35 / 37
![Page 71: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/71.jpg)
PetFMM
Largest Calculation With Development Code
10,648 randomly oriented lysozyme molecules
102,486 boundary elements/molecule
More than 1 billion unknowns
1 minute on 512 GPUs
M. Knepley (UC) SC Sameh ’10 36 / 37
![Page 73: Parallel FMM - Rice Umk51/presentations/PresSameh2010.pdf · Parallel FMM Matthew Knepley Computation Institute University of Chicago Department of Molecular Biology and Physiology](https://reader033.fdocuments.net/reader033/viewer/2022052923/5f03f4d77e708231d40b9a5d/html5/thumbnails/73.jpg)
PetFMM
What do we need for Parallel FMM?
Urgent need for reduction in complexity
Complete serial code reuse
Modeling integral to optimization
Unstructured communication
Uses optimization to automatically generate
Provided by ParMetis and PETSc
Massive concurrency is necessary
Mix of vector and thread paradigms
Demands new analysis
M. Knepley (UC) SC Sameh ’10 37 / 37