Parallel Matrix Multiplication - Cannon's Algorithm and 2 ...
Transcript of Parallel Matrix Multiplication - Cannon's Algorithm and 2 ...
Parallel Matrix Multiplication
Cannon’s Algorithm and 2.5D Matrix Multiplication
Charles and Dulac
Thursday April 2, 2020
Questions
1. If we are calculating the product of two 16× 16 matrices using
16 processors, what are the dimensions of the submatrices
used in Cannon’s Algorithm?
2. What is a downside of Cannon’s Algorithm?
3. How many iterations are required for 2.5D matrix
multiplication?
1
Outline
Introductions
Parallelizing Matrix Multiplication
Cannon’s Algorithm
3D Matrix Multiplication
2.5D Matrix Multiplication
Summary
2
About Me: Liz Dulac
Major:
Physics
−→ [Applied] Mathematics (BS)
−→ Computer Science (BS)
Minor:
Fine Arts
−→ French
−→ Theatre (minor)
4
Why Matrix Multiplication?
Applications
• Physics
• Graph theory
• Recurrence relations
• Tensors
15
Intro to Parallelization
Serial
Instructions Processor
I8 I7 I6 I5 I4 I3 I2 I1 I0 −→ P
Intro to Parallel
• Balance workload
• Avoid dependencies
• Limit Communication
Parallel
Instructions Processors
I6 I3 I0 −→ P0
I7 I4 I1 −→ P1
I8 I5 I2 −→ P2
16
Review: Matrix Multiplication
Am×n × Bn×p = Cm×pa11 a12 a13 . . . a1na21 a22 a23 . . . a2na31 a32 a33 . . . a3n...
......
...
am1 am2 am3 . . . amn
b11 b12 b13 . . . b1pb21 b22 b23 . . . b2pb31 b32 b33 . . . b3p...
......
...
bn1 bn2 bn3 . . . bnp
17
Review: Matrix Multiplication
Some Take-Aways
• Naively O(n3) operations• No dependencies between cij
• Summation can occur in any order
cij =n∑
k=1
aikbkj
• Will need to calculate aik × bkj , ∀i ≤ m, j ≤ n
18
Background
Cannon’s Algorithm
• Lynn Elliot Cannon
• Ph.D. Thesis, Montana State University,
14 July 1969
• A cellular computer to implement the
Kalman Filter Algorithm
19
Cannon’s Algorithm
Cannon’s Algorithm
• Each processor calculates
block of Cm×n
• Calculate one piece of dotproduct each iteration• Calculate index
k = (i + j + iter)(mod√p)
• Increment result by Aik × Bkj
P00 P01 P02 . . . P0√p
P10 P11 P12 . . . P1√p
P20 P21 P22 . . . P2√p
. . . . . . . . . . . . . . .
P√p0 P√
p1 P√p2 . . . P√
p√p
20
Cannon’s Algorithm: Example
Calculate: C 8×8 = A8×8 ∗ B8×8 using 16 processors.
• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •
• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •• • • • • • • •
21
Cannon’s Algorithm: Example
Processor Grid:
• 16 processors
=⇒ 4× 4 processor grid
P00 P01 P02 P03
P10 P11 P12 P13
P20 P21 P22 P23
P30 P31 P32 P33
22
Cannon’s Algorithm: Example
Processor Grid:
• 4× 4 processor grid
=⇒ 4× 4 block matrix dimensionsC00 C01 C02 C03
C10 C11 C12 C13
C20 C21 C22 C23
C30 C31 C32 C33
23
Cannon’s Algorithm: Example
Processor Grid:
• 4× 4 block matrix to represent an 8× 8 matrix
=⇒ 2× 2 submatrix per processor
c00 c01 c02 c03 c04 c05 c06 c07c10 c11 c12 c13 c14 c15 c16 c17
c20 c21 c22 c23 c24 c25 c26 c27c30 c31 c32 c33 c34 c35 c36 c37
c40 c41 c42 c43 c44 c45 c46 c47c50 c51 c52 c53 c54 c55 c56 c57
c60 c61 c62 c63 c64 c65 c66 c67c70 c71 c72 c73 c74 c75 c76 c77
24
Cannon’s Algorithm: Example
1. Partition Input Matrices:A00 A01 A02 A03
A10 A11 A12 A13
A20 A21 A22 A23
A30 A31 A32 A33
x
B00 B01 B02 B03
B10 B11 B12 B13
B20 B21 B22 B23
B30 B31 B32 B33
25
Cannon’s Algorithm: Example
2. Pivot on Diagonals. Distribute to Processor Grid.
A00 A01 A02 A03
A11 A12 A13 A10
A22 A23 A20 A21
A33 A30 A31 A32
B00 B11 B22 B33
B10 B21 B32 B03
B20 B31 B02 B13
B30 B01 B12 B23
26
Cannon’s Algorithm: Example
3. Shift Matrices
←−A01 A02 A03 A00
A12 A13 A10 A11
A23 A20 A21 A22
A30 A31 A32 A33
↑
B10 B21 B32 B03
B20 B31 B02 B13
B30 B01 B12 B23
B00 B11 B22 B33
27
Cannon’s Algorithm: Example
3. Shift Matrices
←−A02 A03 A00 A01
A13 A10 A11 A12
A20 A21 A22 A23
A31 A32 A33 A30
↑
B20 B31 B02 B13
B30 B01 B12 B23
B00 B11 B22 B33
B10 B21 B32 B03
28
Cannon’s Algorithm: Example
3. Shift Matrices
←−A03 A00 A01 A02
A10 A11 A12 A13
A21 A22 A23 A20
A32 A33 A30 A31
↑
B30 B01 B12 B23
B00 B11 B22 B33
B10 B21 B32 B03
B20 B31 B02 B13
29
Cannon’s Algorithm −→ 3D
P P P P P
P P P P P
P P P P P
P P P P P
P P P P P
−→
P P P
P P P P
P P P P P
P P P P
P P P
Cannon (2D)
• n√p ×
n√p blocks
• √p Aij ∗ Bjk per processor
3D
• n3√p ×
n3√p blocks
• 1 Aij ∗ Bjk per processor
31
Cost
Time
• O(n3/p)Space
• O(n2/p2/3)
n2 mem/matrix copy ∗ 3√p copies /p processors
Communication Cost: only 1 iteration
32
What if we don’t QUITE have enough space for 3√p copies, but
would like to use the memory we do have?
33
Background
2.5D Matrix Multiplication
• Edgar Solomnik & James Demmel
• Communication-optimal parallel
2.5D matrix multiplication and LU
factorization algorithms
• Published in 2011
34
2.5D Matrix Multiplication
Goal:
• Take advantage of any extra memory to reduce amount of
communication
35
2.5D Matrix Multiplication
P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P P
P P P P P P P P
2.5D Copies
• Generalize to use
c copies
• c ∈ [1, 3√p]
36
Partitioning
Consider: square n × n matrices, p processors,
and c copies.
• pc processors per copy
•√
pc ×
√pc processor grid
• n√p/c× n√
p/cblocks
√pc
√pc
c11 c12 c13 . . . c1nc21 c22 c23 . . . c2nc31 c32 c33 . . . c3n...
......
...
cn1 cn2 cn3 . . . cnn
n
n
37