Torus networks · Algorithms · Implementation · Performance · Conclusion
Matrix multiplication on multidimensional torus networks
Edgar Solomonik and James Demmel
University of California, Berkeley
VECPAR, July 2012
Edgar Solomonik · Split-Dimensional Cannon’s algorithm · 1/23
Outline
Torus networks: BlueGene architecture, Collective communication
Algorithms: SUMMA, Cannon’s algorithm, SD-Cannon’s algorithm
Implementation: One-sided MPI communication, Charm++ virtualization
Performance
Conclusion
BlueGene architecture · Collective communication
BlueGene/P and BlueGene/Q
Direct torus networks
- BG/P is 3D, BG/Q is 5D
- Both are bidirectional networks (6 and 10 links per node)
- Injection bandwidth sufficient to saturate all links
- Topology-aware partition allocation and collectives
Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) of a 1 MB multicast vs. number of nodes (8 to 4096) on BG/P, Cray XT5, and Cray XE6]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular (pipelined) multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
  - Route in different dimensional order
  - Use both directions of the bidirectional network
2D rectangular pipelined multicast trees
[Figure: a 2D 4x4 torus with a multicast root; four edge-disjoint spanning trees, shown separately and combined]
[Watts and Van De Geijn 95]
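The 2n tree schedules above can be enumerated directly. Below is a minimal Python sketch assuming the standard cyclic-rotation construction (the function name and the rotation scheme are illustrative, not taken from the talk): each tree routes in a rotated dimension order, once per link direction.

```python
def rectangular_multicast_schedules(ndims):
    """Enumerate the 2n tree schedules of a rectangular multicast on a
    k-ary n-cube: n cyclic rotations of the dimension order, each used
    in both directions of the bidirectional links."""
    schedules = []
    for direction in ('+', '-'):            # both directions of each link
        for r in range(ndims):              # n rotations of the dimension order
            order = tuple((d + r) % ndims for d in range(ndims))
            schedules.append((direction, order))
    return schedules
```

For ndims=2 this yields the four spanning trees in the figure; the schedules are pairwise distinct, so the trees can be pipelined concurrently on disjoint links.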
SUMMA · Cannon’s algorithm · SD-Cannon’s algorithm
Matrix multiplication
[Figure: animation of blocked matrix multiplication of A and B]
[Van De Geijn and Watts 97]
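The SUMMA decomposition behind this figure can be sketched serially. Below is a minimal numpy sketch (the function name is illustrative): C is accumulated as p block outer products; in the parallel algorithm, step k multicasts block column k of A along process rows and block row k of B along process columns.

```python
import numpy as np

def summa_matmul(A, B, p):
    """Serial sketch of SUMMA on an n x n matrix with a p x p grid:
    accumulate C from p block outer products. The row/column
    multicasts are implicit here, since all data is local."""
    n = A.shape[0]
    b = n // p
    C = np.zeros_like(A)
    for k in range(p):
        # Block column k of A times block row k of B.
        C += A[:, k*b:(k+1)*b] @ B[k*b:(k+1)*b, :]
    return C
```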
SUMMA and LU with rectangular vs binomial collectives
[Figure: percentage of machine peak (0–80%) for MM, LU, and LU+PVT with binomial vs. rectangular collectives on BG/P (n=131,072, p=16,384)]
Cannon’s algorithm

[Figure: animation of Cannon’s algorithm on a 2D processor grid of blocks of A and B, starting from and returning to the starting position]
- Stagger A left (A[i,j] := A[i,j+1]), then shift right each step (A[i,j] := A[i,j-1])
- Stagger B up (B[i,j] := B[i+1,j]), then shift down each step (B[i,j] := B[i-1,j])
[Cannon 69]
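The stagger-and-shift schedule above can be simulated serially. Below is a minimal numpy sketch (illustrative, not the talk’s MPI code) that models the initial stagger and the per-step near-neighbor shifts as index rolls around the virtual torus.

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Serial simulation of Cannon's algorithm on a virtual p x p grid.

    Each "process" (i,j) holds one b x b block of A and of B; the
    near-neighbor sends are modeled by remapping block indices
    modulo p around the torus.
    """
    n = A.shape[0]
    b = n // p
    Ablk = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Bblk = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    C = np.zeros_like(A)
    # Initial stagger: row i of A left by i, column j of B up by j.
    Ablk = [[Ablk[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bblk = [[Bblk[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        # Local multiply-accumulate on every process.
        for i in range(p):
            for j in range(p):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ablk[i][j] @ Bblk[i][j]
        # Shift A right and B down by one (near-neighbor sends).
        Ablk = [[Ablk[i][(j - 1) % p] for j in range(p)] for i in range(p)]
        Bblk = [[Bblk[(i - 1) % p][j] for j in range(p)] for i in range(p)]
    return C
```

After the stagger, process (i,j) holds A(i, i+j mod p) and B(i+j mod p, j), so the p shifted local products sum to C(i,j).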
Cannon’s algorithm
Advantages over SUMMA
- Uses only near-neighbor sends rather than multicasts
  - Lower latency cost
- Can be done in place given near-neighbor data swaps

Disadvantages with respect to SUMMA
- Does not generalize well to non-square processor grids
- Cannot exploit multiple links via rectangular multicasts
Split-dimensional Cannon’s algorithm
Improves over Cannon by saturating all network links
- Subdivide the whole multiply into 2n block outer products
- Use bidirectional links by shifting half the outer products in the opposite direction
- Perform each outer product in a different dimensional order
- Accumulation of outer products into one buffer allows for same-sized local multiplications as pure Cannon’s algorithm
- Does not require pipelined multicasts
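The subdivision into 2n block outer products is purely algebraic, so the partial products can be scheduled in any direction and still sum to C. Below is a minimal numpy sketch (illustrative; it models only the split, not SD-Cannon’s communication schedule, and the even slab partition is an assumption).

```python
import numpy as np

def split_outer_products(A, B, ndims):
    """Decompose C = A @ B into 2n block outer products, one per
    directed link of an n-dimensional torus. SD-Cannon shifts each
    group along a different dimension order/direction; the sum of
    the partial products is direction-independent."""
    groups = 2 * ndims
    n = A.shape[1]
    # Evenly partition the contraction dimension into 2n slabs.
    bounds = np.linspace(0, n, groups + 1, dtype=int)
    return [A[:, lo:hi] @ B[lo:hi, :]
            for lo, hi in zip(bounds[:-1], bounds[1:])]
```

On a 3D torus (6 links per node), this yields 6 partial products that can be accumulated into one C buffer, keeping the local multiply size the same as in plain Cannon.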
Split-dimensional Cannon’s algorithm
SD-Cannon on a 3-ary 6-cube
[Figure: shift pattern along dim 1, dim 2, and dim 3; each circle corresponds to a shift along a dimension, each color corresponds to an outer product]
One-sided MPI communication · Charm++ virtualization
MPI implementation
SD-Cannon parallel implementation using MPI
- All communication done with near-neighbor one-sided puts
- Code is simple (200 lines)
- Limited to square (k-ary n-cube) processor grids
Charm++
Charm++ is an asynchronous dynamic runtime system
- Provides object-based virtualization via chares (decouples the decomposition from the process grid)
- Message-driven task invocation
- Allows topology-aware task mapping
- Provides additional features such as dynamic load balancing and performance profiling
Virtual topology via Charm++
Implemented Cannon and SD-Cannon in Charm++
- Can map to any torus process topology
- Code more complex, but not significantly so
- Not using one-sided communication (though possible via CkDirect)
- Virtualization lowers task granularity
2D BlueGene/P performance
[Figure: fraction of theoretical peak vs. matrix dimension (1024–16384) on an 8x8 torus partition of BG/P, comparing SUMMA, MPI SD-Cannon, Charm++ SD-Cannon, MPI Cannon, and Charm++ Cannon]
3D BlueGene/P performance
[Figure: fraction of theoretical peak vs. matrix dimension (4096–65536) on an 8x8x8 torus partition of BG/P, comparing SUMMA, Charm++ Cannon VN, Charm++ SD-Cannon VN, Charm++ Cannon SMP, and Charm++ SD-Cannon SMP]
Preliminary 5D BlueGene/Q performance
[Figure: teraflops (12–30) vs. matrix dimension (8192–65536) on a 2x4x4x2x4 torus partition of BG/Q, comparing SUMMA, MPI SD-Cannon, and MPI Cannon]
K computer Tofu network
[Figure: Tofu unit and node topology; T = torus dimension, M = mesh dimension]
10 links per node / 4 torus dimensions / 2 mesh dimensions of length 2
Tofu network is 5D not 6D
A two-by-two mesh is a 1D ring of length 4
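This equivalence is just a Gray-code walk around the four mesh nodes. Below is a small Python check (illustrative; the function name is mine) that the 4-cycle through the 2x2 mesh uses only single mesh links, so every ring hop is a physical hop.

```python
def mesh2x2_as_ring():
    """Walk the 2x2 mesh in Gray-code order: consecutive stops differ
    in exactly one coordinate, so each edge of the length-4 ring is a
    legal mesh link."""
    walk = [(0, 0), (0, 1), (1, 1), (1, 0)]
    for a, b in zip(walk, walk[1:] + walk[:1]):
        hops = sum(x != y for x, y in zip(a, b))
        assert hops == 1  # each ring edge is a single mesh link
    return walk
```

This is why the two length-2 mesh dimensions of Tofu fold into one torus dimension, making the network effectively 5D.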
SD-Cannon K computer potential
The 5D K computer torus is different from the 5D BG/Q torus
- K computer injection bandwidth can saturate only 4 of its 10 links

SD-Cannon could still be beneficial
- Can saturate 4 links rather than 2 (up to a 2X speed-up)
- Does not require a pipelined broadcast implementation
Conclusion
SD-Cannon
- Bridges the performance gap between Cannon’s algorithm and SUMMA
- Is uniquely asymptotically communication-optimal on a k-ary n-cube
- Virtualization allows general mapping support but incurs overhead

Topology-aware mapping and algorithm design
- Allows zero network contention
- Permits saturation of much more bandwidth on torus networks
- Essential for parallel scalability on high-end supercomputers
Finish
Acknowledgements:
- Krell CSGF DOE fellowship (DE-AC02-06CH11357)
- Resources at Argonne National Lab
- Anonymous VECPAR reviewers
- Discussions with collaborators:
  - James Demmel (UC Berkeley)
  - Jeff Hammond (Argonne National Lab)
  - Grey Ballard (UC Berkeley)

Also, see my talk on 2.5D algorithms at the University of Tokyo
- Tuesday, July 24, 10:00-12:00
Backup slides