Systolic Arrays1
Transcript of Systolic Arrays1
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 1/35
Systolic Arrays & Their
Applications
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 2/35
What Is a Systolic Array?
A systolic array is an arrangement of processors in an array
where data flows synchronously across the array between
neighbors, usually with different data flowing in different directions
Each processor at each step takes in data from one or more
neighbors (e.g. North and West), processes it and, in the next
step, outputs results in the opposite direction (South and East).
H. T. Kung and Charles Leiserson were the first to publish
a paper on systolic arrays in 1978, and coined the name.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 3/35
What Is a Systolic Array?
A specialized form of parallel computing.
Multiple processors connected by short
wires.
Unlike many forms of parallelism which
lose speed through their connection.
Cells(processors), compute data and store it
independently of each other.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 4/35
Systolic Unit(cell)
Each unit is an independent processor.
Every processor has some registers and an
ALU.
The cells share information with their
neighbors, after performing the needed
operations on the data.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 5/35
Some simple examples of systolic array models.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 6/35
Matrix Multiplication
a11 a12 a13
a21 a22 a23
a31 a32 a33
*
b11 b12 b13
b21 b22 b23
b31 b32 b33
=
c11 c12 c13
c21 c22 c23
c31 c32 c33
Conventional Method: N3
For I = 1 to N
For J = 1 to N
For K = 1 to N
C[I,J] = C[I,J] + A[J,K] * B[K,J];
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 7/35
Systolic Method
This will run in O(n) time!
To run in N time we need N x N processing units, in this case
we need 9.
P9P8P7
P6P5P4
P1 P2 P3
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 8/35
We need to modify the input data, like so:
a13 a12 a11
a23 a22 a21
a33 a32 a31
b31 b32 b33
b21 b22 b23
b11 b12 b13
Flip columns 1 & 3
Flip rows 1 & 3
and finally stagger the data sets for input.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 9/35
P9P8P7
P6P5P4
P1 P2 P3a13 a12 a11
a23 a22 a21
a33 a32 a31
b31
b21
b11
b32
b22
b12
b33
b23
b13
At every tick of the global system clock data is passed to each
processor from two different directions, then it is multiplied
and the result is saved in a register.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 10/35
3 4 2
2 5 3
3 2 5 *=
3 4 2
2 5 3
3 2 5
23 36 28
25 39 34
28 32 37
Lets try this using a systolic array.
P9P8P7
P6P5P4
P1 P2 P32 4 3
3 5 2
3
2
3
5 2 3
5
3
2
2
5
4
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 11/35
3*32 4
3 5 2
32
5 2 3
5
3
2
2
5
4
Clock tick: 1
9 0 0 0 0 0 0 0 0
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 12/35
2*3
4*2 3*42
3 5
3
5 2 3
5
3
225
Clock tick: 2
17 12 0 6 0 0 0 0 0
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 13/35
3*3
2*45*2
2*3 4*5 3*2
3
5 2
5
32
Clock tick: 3
23 32 6 16 8 0 9 0 0
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 14/35
3*42*2
2*25*53*3
2*2 4*3
5
5
Clock tick: 4
23 36 18 25 33 4 13 12 0
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 15/35
3*25*25*3
5*33*2
2*5
Clock tick: 5
23 36 28 25 39 19 28 22 6
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 16/35
2*35*2
3*5
Clock tick: 6
23 36 28 25 39 34 28 32 12
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 17/35
5*5
Clock tick: 7
23 36 28 25 39 34 28 32 37
P1 P2 P3 P4 P6P5 P7 P8 P9
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 18/35
373228
343925
23 36 28
23 36 28 25 39 34 28 32 37
P1 P2 P3 P4 P6P5 P7 P8 P9
Same answer! In 2n + 1 time, can we do better?
The answer is yes, there is an optimization.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 19/35
Why Systolic?
Extremely fast.
Easily scalable architecture.
Can do many tasks single processor
machines cannot attain.
Turns some exponential problems into
linear or polynomial time.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 20/35
Why Not Systolic?
Expensive.
Not needed on most applications, they are a
highly specialized processor type.
Difficult to implement and build.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 21/35
MIMD: Multiple instruction
multiple data
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 22/35
Flynn’s Classical Taxonomy Distinguishes multi-processor architecture
by instruction and data
SISD –
Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple
Data
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 23/35
Flynn’s Classical Taxonomy: MIMD
Can execute different
instructions on different
data elements.
Most common type of parallel computer.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 24/35
Multi-computer:Structure of Distributed Memory MIMD Architectures
P ll l C t M
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 25/35
Parallel Computer MemoryArchitectures:
Shared Memory Architecture All processors access all
memory as a single
global address space.
Data sharing is fast. Lack of scalability
between memory and
CPUs
P ll l C t M
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 26/35
Parallel Computer MemoryArchitectures:
Distributed Memory Each processor has its
own memory.
Is scalable, no overhead
for cache coherency. Programmer is
responsible for many
details of
communication between
processors.
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 27/35
Multi-computer (distributed memory system):Advantages and Disadvantages
+Highly Scalable
+Message passing solves memory access
synchronization problem
-Load balancing problem
-Deadlock in message passing
-Need to physically copying data between
processes
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 28/35
Multi-processor (shared memory system):Advantages and Disadvantages
+May use uniprocessor programming
techniques
+Communication between processor isefficient
-Synchronized access to share data in
memory needed
-Lack of scalability due to (memory)
contention problem
B t f B th W ld
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 29/35
Best of Both Worlds(Multicomputer using virtual shared memory)
Also called distributed shared memory
architecture
The local memories of multi-computer are
components of global address space:
– any processor can access the local memory of
any other processor
Three approaches:
– Non-uniform memory access (NUMA) machines
– Cache-only memory access (COMA) machines
– Cache-coherent non-uniform memory access
(CC-NUMA) machines
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 30/35
Structure of NUMA
Architectures
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 31/35
NUMA: remote load
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 32/35
Structure of COMAArchitectures
Structure of CC NUMA
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 33/35
Structure of CC-NUMAArchitectures
Classification of MIMD
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 34/35
Classification of MIMDcomputers
8/3/2019 Systolic Arrays1
http://slidepdf.com/reader/full/systolic-arrays1 35/35
THANK YOU