Systolic Arrays1

35
Systolic Arrays & Their Applications

Transcript of Systolic Arrays1

Page 1: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 1/35

Systolic Arrays & Their

Applications

Page 2: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 2/35

What Is a Systolic Array?

A systolic array is an arrangement of processors in an array

where data flows synchronously across the array between

neighbors, usually with different data flowing in different directions

Each processor at each step takes in data from one or more

neighbors (e.g. North and West), processes it and, in the next

step, outputs results in the opposite direction (South and East).

H. T. Kung and Charles Leiserson were the first to publish

a paper on systolic arrays in 1978, and coined the name.

Page 3: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 3/35

What Is a Systolic Array?

A specialized form of parallel computing.

Multiple processors connected by short

wires.

Unlike many forms of parallelism which

lose speed through their connection.

Cells(processors), compute data and store it

independently of each other.

Page 4: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 4/35

Systolic Unit(cell)

Each unit is an independent processor.

Every processor has some registers and an

ALU.

The cells share information with their

neighbors, after performing the needed

operations on the data.

Page 5: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 5/35

Some simple examples of systolic array models.

Page 6: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 6/35

Matrix Multiplication

a11 a12 a13

a21 a22 a23

a31 a32 a33

*

b11 b12 b13

b21 b22 b23

b31 b32 b33

=

c11 c12 c13

c21 c22 c23

c31 c32 c33

Conventional Method: N3

For I = 1 to N

For J = 1 to N

For K = 1 to N

C[I,J] = C[I,J] + A[J,K] * B[K,J];

Page 7: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 7/35

Systolic Method

This will run in O(n) time!

To run in N time we need N x N processing units, in this case

we need 9.

P9P8P7

P6P5P4

P1 P2 P3

Page 8: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 8/35

We need to modify the input data, like so:

a13 a12 a11

a23 a22 a21

a33 a32 a31

b31 b32 b33

b21 b22 b23

b11 b12 b13

Flip columns 1 & 3

Flip rows 1 & 3

and finally stagger the data sets for input.

Page 9: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 9/35

P9P8P7

P6P5P4

P1 P2 P3a13 a12 a11

a23 a22 a21

a33 a32 a31

b31

b21

b11

b32

b22

b12

b33

b23

b13

At every tick of the global system clock data is passed to each

processor from two different directions, then it is multiplied

and the result is saved in a register.

Page 10: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 10/35

3 4 2 

2 5 3

3 2 5 *=

3 4 2 

2 5 3

3 2 5

23 36 28 

25 39 34

28 32 37

Lets try this using a systolic array.

P9P8P7

P6P5P4

P1 P2 P32 4 3

3 5 2

3

2

3

5 2 3

5

3

2

2

5

4

Page 11: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 11/35

3*32 4

3 5 2

32

5 2 3

5

3

2

2

5

4

Clock tick: 1

9 0 0 0 0 0 0 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 12: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 12/35

2*3

4*2 3*42

3 5

3

5 2 3

5

3

225

Clock tick: 2

17 12 0 6 0 0 0 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 13: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 13/35

3*3

2*45*2

2*3 4*5 3*2

3

5 2

5

32

Clock tick: 3

23 32 6 16 8 0 9 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 14: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 14/35

3*42*2

2*25*53*3

2*2 4*3

5

5

Clock tick: 4

23 36 18 25 33 4 13 12 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 15: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 15/35

3*25*25*3

5*33*2

2*5

Clock tick: 5

23 36 28 25 39 19 28 22 6

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 16: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 16/35

2*35*2

3*5

Clock tick: 6

23 36 28 25 39 34 28 32 12

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 17: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 17/35

5*5

Clock tick: 7

23 36 28 25 39 34 28 32 37

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 18: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 18/35

373228

343925

23 36 28

23 36 28 25 39 34 28 32 37

P1 P2 P3 P4 P6P5 P7 P8 P9

Same answer! In 2n + 1 time, can we do better?

The answer is yes, there is an optimization.

Page 19: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 19/35

Why Systolic?

Extremely fast.

Easily scalable architecture.

Can do many tasks single processor

machines cannot attain.

Turns some exponential problems into

linear or polynomial time.

Page 20: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 20/35

Why Not Systolic?

Expensive.

Not needed on most applications, they are a

highly specialized processor type.

Difficult to implement and build.

Page 21: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 21/35

MIMD: Multiple instruction

multiple data

Page 22: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 22/35

 

Flynn’s Classical Taxonomy  Distinguishes multi-processor architecture

by instruction and data

SISD – 

Single Instruction, Single Data

SIMD – Single Instruction, Multiple Data

MISD – Multiple Instruction, Single Data

MIMD – Multiple Instruction, Multiple

Data

Page 23: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 23/35

Flynn’s Classical Taxonomy: MIMD

Can execute different

instructions on different

data elements.

Most common type of parallel computer.

Page 24: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 24/35

Multi-computer:Structure of Distributed Memory MIMD Architectures

P ll l C t M

Page 25: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 25/35

Parallel Computer MemoryArchitectures:

Shared Memory Architecture All processors access all

memory as a single

global address space.

Data sharing is fast. Lack of scalability

between memory and

CPUs

P ll l C t M

Page 26: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 26/35

Parallel Computer MemoryArchitectures:

Distributed Memory Each processor has its

own memory.

Is scalable, no overhead

for cache coherency. Programmer is

responsible for many

details of 

communication between

processors.

Page 27: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 27/35

Multi-computer (distributed memory system):Advantages and Disadvantages

+Highly Scalable

+Message passing solves memory access

synchronization problem

-Load balancing problem

-Deadlock in message passing

-Need to physically copying data between

processes

Page 28: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 28/35

Multi-processor (shared memory system):Advantages and Disadvantages

+May use uniprocessor programming

techniques

+Communication between processor isefficient

-Synchronized access to share data in

memory needed

-Lack of scalability due to (memory)

contention problem

B t f B th W ld

Page 29: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 29/35

Best of Both Worlds(Multicomputer using virtual shared memory)

Also called distributed shared memory

architecture

The local memories of multi-computer are

components of global address space:

 –  any processor can access the local memory of 

any other processor

Three approaches:

 – Non-uniform memory access (NUMA) machines

 –  Cache-only memory access (COMA) machines

 –  Cache-coherent non-uniform memory access

(CC-NUMA) machines

Page 30: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 30/35

Structure of NUMA

Architectures

Page 31: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 31/35

NUMA: remote load

Page 32: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 32/35

Structure of COMAArchitectures

Structure of CC NUMA

Page 33: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 33/35

Structure of CC-NUMAArchitectures

Classification of MIMD

Page 34: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 34/35

Classification of MIMDcomputers

Page 35: Systolic Arrays1

8/3/2019 Systolic Arrays1

http://slidepdf.com/reader/full/systolic-arrays1 35/35

THANK YOU