Parallel Matrix Multiplication - Cannon's Algorithm and 2 ...

Parallel Matrix Multiplication

Cannon’s Algorithm and 2.5D Matrix Multiplication

Charles and Dulac

Thursday April 2, 2020

Questions

1. If we are calculating the product of two 16 × 16 matrices using 16 processors, what are the dimensions of the submatrices used in Cannon’s Algorithm?

2. What is a downside of Cannon’s Algorithm?

3. How many iterations are required for 2.5D matrix multiplication?

1

Outline

Introductions

Parallelizing Matrix Multiplication

Cannon’s Algorithm

3D Matrix Multiplication

2.5D Matrix Multiplication

Summary

2

Introductions

Introduction: Liz Dulac

3

About Me: Liz Dulac

Major:

Physics

−→ [Applied] Mathematics (BS)

−→ Computer Science (BS)

Minor:

Fine Arts

−→ French

−→ Theatre (minor)

4

Hobbies: Theatre

5

Hobbies: Guard

6

The Bay State: Wicked Awesome

7

Amherst: Five College Consortium

8

The Bay State: Amherst

9

In Conclusion...

10

Introduction: MeiLi Charles

11

Maryville College

12

Hobbies: Cosplay

13

Meet my little friends

14

Parallelizing Matrix Multiplication

Why Matrix Multiplication?

Applications

• Physics

• Graph theory

• Recurrence relations

• Tensors

15

Intro to Parallelization

Serial

Instructions Processor

I8 I7 I6 I5 I4 I3 I2 I1 I0 −→ P

Intro to Parallel

• Balance workload

• Avoid dependencies

• Limit Communication

Parallel

Instructions Processors

I6 I3 I0 −→ P0

I7 I4 I1 −→ P1

I8 I5 I2 −→ P2
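The parallel diagram deals the nine instructions I0 to I8 out to three processors in turn. A minimal Python sketch of that round-robin assignment, assuming the instructions are independent of one another (the function name `distribute` is hypothetical, introduced only for illustration):

```python
def distribute(num_instructions, num_processors):
    """Assign instruction indices to processors in round-robin order."""
    queues = [[] for _ in range(num_processors)]
    for instr in range(num_instructions):
        # Instruction k goes to processor k mod (number of processors).
        queues[instr % num_processors].append(instr)
    return queues


if __name__ == "__main__":
    for proc, queue in enumerate(distribute(9, 3)):
        print(f"P{proc}:", " ".join(f"I{i}" for i in queue))
```

Each processor receives three instructions, so the workload is balanced and no processor waits on another.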

16

Review: Matrix Multiplication

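$$A_{m \times n} \times B_{n \times p} = C_{m \times p}$$

$$
A = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn}
\end{pmatrix},
\qquad
B = \begin{pmatrix}
b_{11} & b_{12} & b_{13} & \cdots & b_{1p} \\
b_{21} & b_{22} & b_{23} & \cdots & b_{2p} \\
b_{31} & b_{32} & b_{33} & \cdots & b_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{n1} & b_{n2} & b_{n3} & \cdots & b_{np}
\end{pmatrix}
$$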

17

Review: Matrix Multiplication

Some Take-Aways

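• Naively $O(n^3)$ operations for square $n \times n$ matrices ($O(mnp)$ in general)

• No dependencies between the $c_{ij}$

• Summation can occur in any order

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

• Will need to calculate $a_{ik} \times b_{kj}$, $\forall\, i \le m,\ j \le p,\ k \le n$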
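As a reference point for the parallel algorithms, here is a minimal serial sketch of the definition $c_{ij} = \sum_k a_{ik} b_{kj}$ in plain Python. It assumes matrices are stored as lists of rows; the function name `matmul` is chosen for illustration only.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (both lists of rows)."""
    m, n, p = len(a), len(b), len(b[0])
    assert all(len(row) == n for row in a), "inner dimensions must agree"
    c = [[0] * p for _ in range(m)]
    for i in range(m):          # each c[i][j] is independent of the others
        for j in range(p):
            for k in range(n):  # the k terms may be summed in any order
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The two outer loops carry no dependencies, which is what makes it possible to hand different entries (or blocks) of C to different processors.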

18

Background


• Lynn Elliot Cannon

• Ph.D. Thesis, Montana State University, 14 July 1969

• A cellular computer to implement the Kalman Filter Algorithm

19



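• Each processor calculates one block of $C$

• Calculate one piece of the dot product each iteration

• Calculate index $k = (i + j + \mathit{iter}) \bmod \sqrt{p}$

• Increment result by $A_{ik} \times B_{kj}$

$$
\begin{pmatrix}
P_{0,0} & P_{0,1} & P_{0,2} & \cdots & P_{0,\sqrt{p}-1} \\
P_{1,0} & P_{1,1} & P_{1,2} & \cdots & P_{1,\sqrt{p}-1} \\
P_{2,0} & P_{2,1} & P_{2,2} & \cdots & P_{2,\sqrt{p}-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
P_{\sqrt{p}-1,0} & P_{\sqrt{p}-1,1} & P_{\sqrt{p}-1,2} & \cdots & P_{\sqrt{p}-1,\sqrt{p}-1}
\end{pmatrix}
$$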
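To see how the index rule $k = (i + j + \mathit{iter}) \bmod \sqrt{p}$ plays out, the sketch below lists the block index used by one processor in each iteration. It assumes $p$ is a perfect square and 0-based processor coordinates; `k_schedule` is a hypothetical helper name.

```python
import math


def k_schedule(i, j, p):
    """Block index k used by processor (i, j) in each iteration, for p processors."""
    q = math.isqrt(p)  # side length of the sqrt(p) x sqrt(p) processor grid
    assert q * q == p, "p is assumed to be a perfect square"
    return [(i + j + it) % q for it in range(q)]


if __name__ == "__main__":
    # Processor (1, 2) in a 4 x 4 grid.
    print(k_schedule(1, 2, 16))
```

Over the $\sqrt{p}$ iterations every processor visits each value of $k$ exactly once, only starting from a different offset, so each block of C accumulates its complete block dot product.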

20

Cannon’s Algorithm: Example

Calculate: $C_{8 \times 8} = A_{8 \times 8} * B_{8 \times 8}$ using 16 processors.

[Figure: two 8 × 8 grids of matrix elements]

21


Processor Grid:

• 16 processors

=⇒ 4× 4 processor grid

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

22


Processor Grid:

• 4 × 4 processor grid

=⇒ 4 × 4 block matrix dimensions

C00 C01 C02 C03

C10 C11 C12 C13

C20 C21 C22 C23

C30 C31 C32 C33

23


Processor Grid:

• 4 × 4 block matrix to represent an 8 × 8 matrix

=⇒ 2 × 2 submatrix per processor

c00 c01 c02 c03 c04 c05 c06 c07

c10 c11 c12 c13 c14 c15 c16 c17
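The sizing argument on the last three slides (16 processors, hence a 4 × 4 grid, hence 2 × 2 submatrices of an 8 × 8 matrix) can be sketched as a small calculation. This assumes $p$ is a perfect square and $n$ is divisible by $\sqrt{p}$; `block_shape` is a hypothetical helper name.

```python
import math


def block_shape(n, p):
    """Return (grid_side, block_side) for an n x n matrix on p processors."""
    q = math.isqrt(p)
    assert q * q == p, "p is assumed to be a perfect square"
    assert n % q == 0, "n is assumed to be divisible by sqrt(p)"
    return q, n // q


if __name__ == "__main__":
    grid, block = block_shape(8, 16)
    print(f"{grid} x {grid} processor grid, {block} x {block} submatrix per processor")
```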




24


1. Partition Input Matrices:

A00 A01 A02 A03

A10 A11 A12 A13

A20 A21 A22 A23

A30 A31 A32 A33

x

B00 B01 B02 B03

B10 B11 B12 B13

B20 B21 B22 B23

B30 B31 B32 B33

25


2. Pivot on Diagonals. Distribute to Processor Grid.

A00 A01 A02 A03

A11 A12 A13 A10

A22 A23 A20 A21

A33 A30 A31 A32

B00 B11 B22 B33

B10 B21 B32 B03

B20 B31 B02 B13

B30 B01 B12 B23
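The initial alignment shown here can be written as an index rule: row $i$ of A is rotated left by $i$ positions and column $j$ of B is rotated up by $j$ positions. The sketch below reproduces the two grids under that reading of the slide; `skew` is a hypothetical helper name and coordinates are 0-based.

```python
def skew(q):
    """Return the block indices each processor holds after the initial alignment.

    a_grid[i][j] and b_grid[i][j] are the (row, column) indices of the blocks
    of A and B held by processor (i, j) in a q x q grid.
    """
    a_grid = [[(i, (i + j) % q) for j in range(q)] for i in range(q)]
    b_grid = [[((i + j) % q, j) for j in range(q)] for i in range(q)]
    return a_grid, b_grid


if __name__ == "__main__":
    a_grid, b_grid = skew(4)
    for row in a_grid:
        print(" ".join(f"A{r}{c}" for r, c in row))
    print()
    for row in b_grid:
        print(" ".join(f"B{r}{c}" for r, c in row))
```

After this step, processor $(i, j)$ holds $A_{ik}$ and $B_{kj}$ with the same $k = (i + j) \bmod \sqrt{p}$, so the two local blocks can be multiplied directly.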

26


3. Shift Matrices

←−

A01 A02 A03 A00

A12 A13 A10 A11

A23 A20 A21 A22

A30 A31 A32 A33

↑

B10 B21 B32 B03

B20 B31 B02 B13

B30 B01 B12 B23

B00 B11 B22 B33

27


3. Shift Matrices

←−

A02 A03 A00 A01

A13 A10 A11 A12

A20 A21 A22 A23

A31 A32 A33 A30

↑

B20 B31 B02 B13

B30 B01 B12 B23

B00 B11 B22 B33

B10 B21 B32 B03

28


3. Shift Matrices

←−

A03 A00 A01 A02

A10 A11 A12 A13

A21 A22 A23 A20

A32 A33 A30 A31

↑

B30 B01 B12 B23

B00 B11 B22 B33

B10 B21 B32 B03

B20 B31 B02 B13
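Putting the steps together (partition, align on diagonals, then multiply and shift $\sqrt{p}$ times), the following is a serial simulation of the data movement in plain Python. It is a sketch, not a parallel implementation: the nested lists `a_blk`, `b_blk`, and `c_blk` stand in for the per-processor storage, and the function names are hypothetical. It assumes square matrices whose size is divisible by the grid side `q`.

```python
def dense_matmul(a, b):
    """Reference product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cannon(a, b, q):
    """Simulate Cannon's algorithm on a q x q grid of 'processors'."""
    n = len(a)
    assert n % q == 0, "n is assumed to be divisible by the grid side q"
    s = n // q  # side length of each submatrix

    def block(m, r, c):
        return [row[c * s:(c + 1) * s] for row in m[r * s:(r + 1) * s]]

    # Steps 1-2: partition, then align so processor (i, j) holds
    # A[i][(i+j) mod q] and B[(i+j) mod q][j].
    a_blk = [[block(a, i, (i + j) % q) for j in range(q)] for i in range(q)]
    b_blk = [[block(b, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c_blk = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Every processor multiplies its local blocks and accumulates.
        for i in range(q):
            for j in range(q):
                prod = dense_matmul(a_blk[i][j], b_blk[i][j])
                for r in range(s):
                    for t in range(s):
                        c_blk[i][j][r][t] += prod[r][t]
        # Step 3: shift A one position left and B one position up (with wraparound).
        a_blk = [[a_blk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_blk = [[b_blk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble the full result from the blocks of C.
    return [[c_blk[i // s][j // s][i % s][j % s] for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[i * 8 + j for j in range(8)] for i in range(8)]
    b = [[(i + 2 * j) % 5 for j in range(8)] for i in range(8)]
    print(cannon(a, b, 4) == dense_matmul(a, b))
```

In a real distributed run, the two list comprehensions that perform the shifts would be neighbour-to-neighbour messages, which is where the per-iteration communication comes from.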

29

Cost

Time: $O(n^3/p)$

Space: $O(n^2/p)$ per processor

Note: the matrices are redistributed in each of the $\sqrt{p}$ iterations.

30

3D Matrix Multiplication

Cannon’s Algorithm −→ 3D

[Figure: a flat 2D grid of processors (left) −→ a 3D arrangement of processors (right)]

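Cannon (2D)

• $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ blocks

• $\sqrt{p}$ products $A_{ij} * B_{jk}$ per processor

3D

• $\frac{n}{\sqrt[3]{p}} \times \frac{n}{\sqrt[3]{p}}$ blocks

• 1 product $A_{ij} * B_{jk}$ per processor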

31

Cost

Time

• $O(n^3/p)$

Space

• $O(n^2/p^{2/3})$

$n^2$ memory per matrix copy × $\sqrt[3]{p}$ copies / $p$ processors

Communication Cost: only 1 iteration
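The space bound for the 3D algorithm follows from the replication count. As a worked step, assuming each of the $\sqrt[3]{p}$ copies stores a full $n \times n$ matrix and the total is spread evenly over all $p$ processors:

```latex
\[
\text{memory per processor}
  = \frac{n^2 \cdot \sqrt[3]{p}}{p}
  = \frac{n^2 \, p^{1/3}}{p}
  = \frac{n^2}{p^{2/3}}
\]
```

With a single copy the same expression reduces to $n^2/p$, the per-processor space of Cannon's algorithm.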

32

What if we don’t QUITE have enough space for $\sqrt[3]{p}$ copies, but would like to use the memory we do have?

33

Background


• Edgar Solomonik & James Demmel

• Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

• Published in 2011

34


Goal:

• Take advantage of any extra memory to reduce the amount of communication

35


[Figure: arrangement of processors for 2.5D matrix multiplication]

2.5D Copies

• Generalize to use $c$ copies

• $c \in [1, \sqrt[3]{p}]$

36

Partitioning

Consider: square $n \times n$ matrices, $p$ processors, and $c$ copies.

• $p/c$ processors per copy

• $\sqrt{p/c} \times \sqrt{p/c}$ processor grid

• $\frac{n}{\sqrt{p/c}} \times \frac{n}{\sqrt{p/c}}$ blocks

[Figure: the $n \times n$ matrix $C$ partitioned into a $\sqrt{p/c} \times \sqrt{p/c}$ grid of blocks]
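The 2.5D partitioning gives each of the $c$ copies $p/c$ processors, arranged as a $\sqrt{p/c} \times \sqrt{p/c}$ grid, with blocks of side $n/\sqrt{p/c}$. A minimal sketch of that arithmetic, assuming $p/c$ is a perfect square and $n$ divides evenly (`partition_25d` is a hypothetical helper name):

```python
import math


def partition_25d(n, p, c):
    """Return (processors_per_copy, grid_side, block_side) for the 2.5D layout."""
    assert p % c == 0, "p is assumed to be divisible by c"
    per_copy = p // c
    q = math.isqrt(per_copy)
    assert q * q == per_copy, "p / c is assumed to be a perfect square"
    assert n % q == 0, "n is assumed to be divisible by sqrt(p / c)"
    return per_copy, q, n // q


if __name__ == "__main__":
    print(partition_25d(16, 32, 2))
    print(partition_25d(8, 16, 1))
```

With $c = 1$ the layout is the 2D layout used by Cannon's algorithm.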

37

Cost

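Note:

Each block dot product has $\sqrt{p/c}$ terms.

With $c$ copies at work, each copy handles $\sqrt{p/c}\,/\,c = \sqrt{p/c^3}$ of them, so $\sqrt{p/c^3}$ iterations are required.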
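A quick numeric check of the iteration count $\sqrt{p/c^3}$, showing that it matches Cannon's $\sqrt{p}$ iterations when $c = 1$ and the single iteration of the 3D algorithm when $c = \sqrt[3]{p}$. The helper name `iterations_25d` is hypothetical, and the result is returned as a float because $p/c^3$ need not be a perfect square.

```python
import math


def iterations_25d(p, c):
    """Number of iterations for 2.5D multiplication: sqrt(p / c^3)."""
    assert 1 <= c and c ** 3 <= p, "c is assumed to lie in [1, cbrt(p)]"
    return math.sqrt(p / c ** 3)


if __name__ == "__main__":
    for p, c in [(16, 1), (32, 2), (64, 4)]:
        print(f"p={p}, c={c}: {iterations_25d(p, c)} iterations")
```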

38

Summary

Summary

39

Questions?

40

Questions

1. If we are calculating the product of two 16 × 16 matrices using 16 processors, what are the dimensions of the submatrices used in Cannon’s Algorithm?

2. What is a downside of Cannon’s Algorithm?

3. How many iterations are required for 2.5D matrix multiplication?

41
