Post on 10-Mar-2020
ARCHITECTURES FOR PARALLEL COMPUTATION
1. Why Parallel Computation
2. Parallel Programs
3. A Classification of Computer Architectures
4. Performance of Parallel Architectures
5. The Interconnection Network
6. SIMD Computers: Array Processors
7. MIMD Computers
9. Multicore Architectures
10. Multithreading
11. General Purpose Graphic Processing Units
12. Vector Processors
13. Multimedia Extensions to Microprocessors
Datorarkitektur Fö 11-12
The Need for High Performance
Two main factors contribute to the high performance of modern processors:
1. Fast circuit technology
2. Architectural features:
- large caches
- multiple fast buses
- pipelining
- superscalar architectures (multiple functional units)
However, computers running with a single CPU are often unable to meet performance needs in certain areas:
- fluid flow analysis and aerodynamics;
- simulation of large complex systems, for example in physics, economics, biology, engineering;
- computer aided design;
- multimedia.
A Solution: Parallel Computers
One solution to the need for high performance: architectures in which several CPUs run in parallel in order to solve a given application.
Such computers have been organized in different ways. Some key features:
- number and complexity of individual CPUs
- availability of common (shared) memory
- interconnection topology
- performance of the interconnection network
- I/O devices
To use parallel computers efficiently you need to write parallel programs.
Parallel Programs
Parallel sorting

var t: array [1..1000] of integer;
- - - - - - - - - - -
procedure sort (i, j: integer);
  - - sort elements between t[i] and t[j] - -
end sort;
procedure merge;
  - - merge the four sub-arrays - -
end merge;
- - - - - - - - - - -
begin
  - - - - - - - -
  cobegin
    sort (1,250) |
    sort (251,500) |
    sort (501,750) |
    sort (751,1000)
  coend;
  merge;
  - - - - - - - -
end;

[Figure: four unsorted sub-arrays (Unsorted-1 to Unsorted-4) are sorted in parallel by Sort-1 to Sort-4 into Sorted-1 to Sorted-4; Merge then combines them into the final SORTED array.]
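The cobegin/coend structure above can be sketched in Python. This is only an illustration under assumptions: the names `parallel_sort` and `parts` are not from the slides, a thread pool stands in for cobegin, and in CPython the threads show the program structure rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(t, parts=4):
    """Sort t by sorting `parts` slices concurrently, then merging them."""
    size = max(1, -(-len(t) // parts))  # ceiling division, at least 1
    chunks = [t[k:k + size] for k in range(0, len(t), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # list(...) waits for every sort to finish, like coend
        sorted_chunks = list(pool.map(sorted, chunks))
    # the merge step runs sequentially after all sorts are done
    return list(heapq.merge(*sorted_chunks))
```

Calling `parallel_sort(t)` on a 1000-element list sorts four slices of 250 elements each, mirroring the four `sort` calls of the slide.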
Parallel Programs
Matrix addition:

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nm} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nm} \end{pmatrix}$$

Sequential version:

var a: array [1..n, 1..m] of integer;
    b: array [1..n, 1..m] of integer;
    c: array [1..n, 1..m] of integer;
    i, j: integer;
- - - - - - - - - - -
begin
  - - - - - - - -
  for i:=1 to n do
    for j:=1 to m do
      c[i,j] := a[i,j] + b[i,j];
    end for
  end for
  - - - - - - - -
end;
Parallel version:

var a: array [1..n, 1..m] of integer;
    b: array [1..n, 1..m] of integer;
    c: array [1..n, 1..m] of integer;
    i: integer;
- - - - - - - - - - -
procedure add_vector(n_ln: integer);
var j: integer;
begin
  for j:=1 to m do
    c[n_ln, j] := a[n_ln, j] + b[n_ln, j];
  end for
end add_vector;
begin
  - - - - - - - -
  cobegin for i:=1 to n do
    add_vector(i);
  coend;
  - - - - - - - -
end;
Vector computation version 1:

var a: array [1..n, 1..m] of integer;
    b: array [1..n, 1..m] of integer;
    c: array [1..n, 1..m] of integer;
    i: integer;
- - - - - - - - - - -
begin
  - - - - - - - -
  for i:=1 to n do
    c[i,1:m] := a[i,1:m] + b[i,1:m];
  end for;
  - - - - - - - -
end;
Vector computation version 2:

var a: array [1..n, 1..m] of integer;
    b: array [1..n, 1..m] of integer;
    c: array [1..n, 1..m] of integer;
- - - - - - - - - - -
begin
  - - - - - - - -
  c[1:n,1:m] := a[1:n,1:m] + b[1:n,1:m];
  - - - - - - - -
end;
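The sequential and parallel versions of the matrix addition can be compared side by side in Python. This is a minimal sketch under assumptions: matrices are lists of rows, `add_sequential` and `add_parallel` are illustrative names, and a thread pool plays the role of cobegin (one `add_vector` task per row).

```python
from concurrent.futures import ThreadPoolExecutor


def add_vector(a_row, b_row):
    """Add one row of a to the matching row of b."""
    return [x + y for x, y in zip(a_row, b_row)]


def add_sequential(a, b):
    """Two nested loops over rows and columns, as in the sequential version."""
    return [[a[i][j] + b[i][j] for j in range(len(a[i]))]
            for i in range(len(a))]


def add_parallel(a, b):
    """One add_vector task per row; the tasks are independent of each other."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(add_vector, a, b))
```

Both functions return the same matrix; only the way the row computations are scheduled differs.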
Parallel Programs
Pipeline model computation:

$$y = 5 \times \sqrt{45 + \log x}$$

[Figure: two-stage pipeline. Stage 1 reads x and computes $a = 45 + \log x$; stage 2 receives a and computes $y = 5 \times \sqrt{a}$.]

channel ch: real;
- - - - - - - - -
cobegin
  var x: real;
  while true do
    read(x);
    send(ch, 45+log(x));
  end while |
  var v: real;
  while true do
    receive(ch, v);
    write(5 * sqrt(v));
  end while
coend;
- - - - - - - - -
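The two pipeline stages can be sketched with two Python threads and a queue acting as the channel. Several details are assumptions made for the sake of a terminating example: the input is a finite list rather than an endless `read`, a sentinel object marks the end of the stream, and `log` is taken to be the natural logarithm.

```python
import math
import queue
import threading


def run_pipeline(xs):
    """Compute 5 * sqrt(45 + log(x)) for each x using two concurrent stages."""
    ch = queue.Queue()   # plays the role of the channel ch
    done = object()      # sentinel: no more values will be sent (assumption)
    results = []

    def stage1():
        for x in xs:
            ch.put(45 + math.log(x))   # send(ch, 45 + log(x))
        ch.put(done)

    def stage2():
        while True:
            v = ch.get()               # receive(ch, v)
            if v is done:
                break
            results.append(5 * math.sqrt(v))

    threads = [threading.Thread(target=stage1),
               threading.Thread(target=stage2)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

While stage 2 works on one value, stage 1 can already prepare the next one, which is the point of the pipeline model.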
Flynn’s Classification of Computer Architectures
Flynn’s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
Single Instruction stream, Single Data stream (SISD)
[Figure: a CPU consisting of a control unit and a processing unit, connected to the memory by one instruction stream and one data stream.]
Single Instruction stream, Multiple Data stream (SIMD)
SIMD with shared memory
[Figure: one control unit sends the instruction stream (IS) to processing units 1 to n; each processing unit has its own data stream (DS1 to DSn) and accesses the shared memory through the interconnection network.]
Single Instruction stream, Multiple Data stream (SIMD)
SIMD with no shared memory
[Figure: one control unit (with local memory LM) sends the instruction stream (IS) to processing units 1 to n; each processing unit has its own data stream (DS1 to DSn) and local memory (LM1 to LMn); the processing units communicate through the interconnection network.]
Multiple Instruction stream, Multiple Data stream (MIMD)
MIMD with shared memory
[Figure: CPU_1 to CPU_n, each consisting of a control unit and a processing unit with its own instruction stream (IS1 to ISn), data stream (DS1 to DSn) and local memory (LM1 to LMn); the CPUs access the shared memory through the interconnection network.]
Multiple Instruction stream, Multiple Data stream (MIMD)
MIMD with no shared memory
[Figure: CPU_1 to CPU_n, each consisting of a control unit and a processing unit with its own instruction stream (IS1 to ISn), data stream (DS1 to DSn) and local memory (LM1 to LMn); the CPUs communicate through the interconnection network.]
Performance of Parallel Architectures
Important questions:
How fast does a parallel computer run at its maximal potential?
How fast an execution can we expect from a parallel computer for a concrete application?
How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?
Performance Metrics
Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized. The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing of their computers.
Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem.
$$S = \frac{T_S}{T_P}$$
$T_S$: execution time needed with the best sequential algorithm;
$T_P$: execution time needed with the parallel algorithm.
Efficiency: this metric relates the speedup to the number of processors used; it thereby measures how efficiently the processors are used.
$$E = \frac{S}{p}$$
S: speedup;
p: number of processors.
For the ideal situation, in theory:
$$S = \frac{T_S}{T_S / p} = p$$
which means E = 1.
In practice the ideal efficiency of 1 cannot be achieved!
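The two metrics translate directly into code. The sketch below uses illustrative function names and made-up timings; none of the numbers come from the slides.

```python
def speedup(t_s, t_p):
    """S = T_S / T_P: sequential time divided by parallel time."""
    return t_s / t_p


def efficiency(t_s, t_p, p):
    """E = S / p: speedup per processor used."""
    return speedup(t_s, t_p) / p
```

For a hypothetical program that takes 100 s sequentially and 25 s on 8 processors, `speedup(100, 25)` is 4 and `efficiency(100, 25, 8)` is 0.5: half of the processing capacity is lost to sequential parts and overheads.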
Amdahl’s Law
Consider f to be the fraction of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1); p is the number of processors.
$$T_P = f \times T_S + \frac{(1 - f) \times T_S}{p}$$
$$S = \frac{T_S}{f \times T_S + (1 - f) \times \frac{T_S}{p}} = \frac{1}{f + \frac{1 - f}{p}}$$
[Figure: plot of the speedup S (vertical axis, 1 to 10) against f (horizontal axis, 0.2 to 1.0).]
Amdahl’s law: even a small fraction of sequential computation imposes a limit on the speedup; a speedup higher than 1/f cannot be achieved, regardless of the number of processors.
To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).
$$E = \frac{S}{p} = \frac{1}{f \times (p - 1) + 1}$$
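A short numerical check makes the 1/f limit concrete. The function names below are illustrative; the formulas are the ones stated for S and E.

```python
def amdahl_speedup(f, p):
    """S = 1 / (f + (1 - f) / p) for sequential fraction f and p processors."""
    return 1.0 / (f + (1.0 - f) / p)


def amdahl_efficiency(f, p):
    """E = S / p = 1 / (f * (p - 1) + 1)."""
    return 1.0 / (f * (p - 1) + 1.0)
```

With f = 0.1, ten processors give a speedup of about 5.26 and an efficiency of about 0.53; adding more processors moves the speedup towards, but never beyond, 1/f = 10.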
Other Aspects which Limit the Speedup
Besides the intrinsic sequentiality of some parts of an algorithm, there are also other factors that limit the achievable speedup:
- communication cost
- load balancing of processors
- costs of creating and scheduling processes
- I/O operations
There are many algorithms with a high degree of parallelism; for such algorithms the value of f is very small and can be ignored. These algorithms are suited for massively parallel systems; in such cases the other limiting factors, like the cost of communications, become critical.
The Interconnection Network
The interconnection network (IN) is a key component of the architecture. It has a decisive influence on the overall performance and cost.
The traffic in the IN consists of data transfer and transfer of commands and requests.
The key parameters of the IN are:
- total bandwidth: transferred bits/second
- cost
Single Bus
Single bus networks are simple and cheap. Only one communication is allowed at a time; the bandwidth is shared by all nodes. Performance is relatively poor. To maintain acceptable performance, the number of nodes is limited (16 - 20).
[Figure: Node1, Node2, ..., Noden attached to a single bus.]
Completely connected network
Each node is connected to every other one. Communications can be performed in parallel between any pair of nodes. Both performance and cost are high. Cost increases rapidly with the number of nodes.
[Figure: five nodes (Node1 to Node5), each linked directly to all the others.]
Crossbar network
The crossbar is a dynamic network: the interconnection topology can be modified by positioning of switches.
The crossbar network is completely connected: any node can be directly connected to any other.
Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
Several communications can be performed in parallel.
[Figure: Node1, Node2, ..., Noden connected through a grid of switches.]
Mesh network
Mesh networks are cheaper than completely connected ones and provide relatively good performance.
To transmit information between two nodes, routing through intermediate nodes is needed (at most 2*(n-1) hops for an n*n mesh without wraparound connections).
It is possible to provide wraparound connections: between nodes 1 and 13, 2 and 14, etc.
Three-dimensional meshes have also been implemented.
[Figure: 16 nodes (Node1 to Node16) arranged as a 4*4 mesh.]
Hypercube network
2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.
To transmit information between two nodes, routing through intermediate nodes is needed (at most n hops).
[Figure: a 4-dimensional hypercube with 16 nodes, N0 to N15.]
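The routing distances quoted for the mesh and the hypercube can be checked with a small sketch. The node encodings are assumptions made for illustration: mesh nodes are (row, column) pairs starting at 0, and hypercube nodes are integers whose binary digits are the node coordinates, so neighbours differ in exactly one bit.

```python
def mesh_hops(src, dst, n, wraparound=False):
    """Number of links crossed between two nodes of an n*n mesh."""
    hops = 0
    for a, b in zip(src, dst):
        d = abs(a - b)
        # with wraparound links the shorter way around the ring is used
        hops += min(d, n - d) if wraparound else d
    return hops


def hypercube_hops(src, dst):
    """Hops in a hypercube: number of bit positions in which the labels differ."""
    return bin(src ^ dst).count("1")


def hypercube_route(src, dst):
    """One shortest route: correct the differing bits from the lowest upwards."""
    path, cur, bit = [src], src, 0
    diff = src ^ dst
    while diff >> bit:
        if (diff >> bit) & 1:
            cur ^= 1 << bit
            path.append(cur)
        bit += 1
    return path
```

In a 4*4 mesh the two opposite corners are 2*(4-1) = 6 hops apart, and in a 4-dimensional hypercube nodes 0 and 15 are 4 hops apart, which matches the worst cases stated for the two topologies.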
SIMD Computers
- SIMD computers are usually called array processors.
- PUs are very simple: an ALU which executes the instruction broadcast by the CU, a few registers, and some local memory.
- The first SIMD computer: ILLIAC IV (1970s), 64 relatively powerful processors (mesh connection, see above).
- Newer SIMD computer: CM-2 (Connection Machine, by Thinking Machines Corporation), 65 536 very simple processors (connected as a hypercube).
- Array processors are specialized for numerical problems formulated as matrix or vector calculations. Each PU computes one element of the result.
[Figure: a control unit driving a grid of PUs.]
MIMD computers
MIMD with shared memory
- Communication between processors is through shared memory. One processor can change the value in a location and the other processors can read the new value.
- With many processors, memory contention seriously degrades performance; such architectures therefore do not support a high number of processors.
- Classical parallel mainframe computers (1970-1980-1990): IBM 370/390 Series; CRAY X-MP, CRAY Y-MP, CRAY 3.
- Modern multicore chips: Intel Core Duo, i5, i7; ARM MPCore.
[Figure: Processor 1 to Processor n, each with a local memory, connected through the interconnection network to the shared memory.]
MIMD computers
MIMD with no shared memory
- Communication between processors is only by passing messages over the interconnection network.
- The processors do not compete for a shared memory; the number of processors is therefore not limited by memory contention.
- The speed of the interconnection network is an important parameter for the overall performance.
- Modern large parallel computers do not have a system-wide shared memory.
[Figure: Processor 1 to Processor n, each with a private memory, connected through the interconnection network.]
Multicore Architectures
The Parallel Computer in Your Pocket
Multicore chips:
- Several processors on the same chip.
- A parallel computer on a chip.
This is the only way to increase chip performance without an excessive increase in power consumption:
- Instead of increasing processor frequency, use several processors and run each at a lower frequency.
Examples:
- Intel x86 multicore architectures: Intel Core Duo, Intel Core i7
- ARM11 MPCore
Intel Core Duo
Composed of two Intel Core superscalar processors.
[Figure: two processor cores, each with a 32-KB L1 instruction cache and a 32-KB L1 data cache, sharing a 2 MB L2 cache with cache coherence logic; main memory is off chip.]
Intel Core i7
Contains four Nehalem processors.
[Figure: four processor cores, each with a 32-KB L1 instruction cache, a 32-KB L1 data cache and a private 256 KB L2 cache; the cores share an 8 MB L3 cache with cache coherence logic; main memory is off chip.]
ARM11 MPCore
[Figure: four Arm11 processor cores, each with a 32-KB L1 instruction cache and a 32-KB L1 data cache, connected through a cache coherence unit; main memory is off chip.]
Multithreading
A running program:
- one or several processes; each process:
  - one or several threads
- thread: a piece of sequential code executed in parallel with other threads.
[Figure: process 1 (threads 1_1, 1_2, 1_3), process 2 (threads 2_1, 2_2) and process 3 (thread 3_1) running on processor 1 and processor 2.]
Multithreading
Several threads can be active simultaneously on the same processor.
Typically, the operating system schedules threads on the processor.
The OS switches between threads so that one thread is active (running) on a processor at a time.
Switching between threads implies saving/restoring the program counter, registers, status flags, etc.
Switching overhead is considerable!
Hardware Multithreading
Multithreaded processors provide hardware support for executing multithreaded code:
- separate program counter & register set for individual threads;
- instruction fetching on thread basis;
- hardware supported context switching.
This gives:
- efficient execution of multithreaded software;
- efficient utilisation of processor resources.
By handling several threads:
- There is a greater chance to find instructions to execute in parallel on the available resources.
- When one thread is blocked, due to e.g. memory access or data dependencies, instructions from another thread can be executed.
Multithreading can be implemented on both scalar and superscalar processors.
Approaches to Multithreaded Execution
Interleaved multithreading:
The processor switches from one thread to another at each clock cycle; if any thread is blocked due to dependency or memory latency, it is skipped and an instruction from a ready thread is executed.
Blocked multithreading:
Instructions of the same thread are executed until the thread is blocked; blocking of a thread triggers a switch to another thread ready to execute.
Interleaved and blocked multithreading can be applied to both scalar and superscalar processors. When applied to superscalars, several instructions are issued simultaneously. However, all instructions issued during a cycle have to be from the same thread.
Simultaneous multithreading (SMT):
Applies only to superscalar processors. Several instructions are fetched during a cycle and they can belong to different threads.
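The difference between interleaved and blocked multithreading on a scalar processor can be illustrated with a toy simulation. Everything about the model is an assumption made for illustration: each thread is a list of stall lengths (the number of cycles the thread stays blocked after issuing that instruction), one instruction is issued per cycle, and `-` marks an idle cycle.

```python
def simulate(threads, policy):
    """Return the issue trace, one character per cycle.

    threads: dict mapping a thread name to a list of stall cycles,
             one entry per instruction.
    policy:  "interleaved" or "blocked".
    """
    names = list(threads)
    pc = {n: 0 for n in names}        # next instruction of each thread
    ready_at = {n: 0 for n in names}  # first cycle the thread may issue again
    trace, cur, cycle = [], 0, 0
    while any(pc[n] < len(threads[n]) for n in names):
        issued = None
        for k in range(len(names)):   # first ready thread, starting at cur
            idx = (cur + k) % len(names)
            n = names[idx]
            if pc[n] < len(threads[n]) and ready_at[n] <= cycle:
                issued = idx
                break
        if issued is None:
            trace.append("-")         # every unfinished thread is blocked
        else:
            n = names[issued]
            ready_at[n] = cycle + 1 + threads[n][pc[n]]
            pc[n] += 1
            trace.append(n)
            # interleaved: move on every cycle; blocked: stay until it stalls
            cur = (issued + 1) % len(names) if policy == "interleaved" else issued
        cycle += 1
    return "".join(trace)
```

With `{"A": [0, 2, 0], "B": [0, 0, 0]}`, the interleaved policy alternates between the threads, while the blocked policy keeps issuing from A until its second instruction stalls. A single thread with the same stalls leaves two cycles idle, which is exactly what multithreading is meant to hide.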
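Approaches to Multithreaded Execution: scalar (non-superscalar) processors
[Figure: instruction issue over time for threads A, B, C, D on a scalar processor. No multithreading: only thread A issues, with idle cycles while the thread is blocked. Interleaved multithreading: issue order A B C D B C D A. Blocked multithreading: issue order A B B B C C D A.]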
Approaches to Multithreaded Execution: superscalar processors
[Figure: issue-slot diagrams (several slots per cycle) for threads X, Y, Z, V under four schemes: no multithreading (only thread X issues and many slots stay empty), interleaved multithreading, blocked multithreading, and simultaneous multithreading (slots of the same cycle are filled with instructions from different threads).]
Multithreaded Processors
Are multithreaded processors parallel computers?
Yes:
- they execute parallel threads;
- certain sections of the processor are available in several copies (e.g. program counter, instruction registers + other registers);
- the processor appears to the operating system as several processors.
No:
- only certain sections of the processor are available in several copies but we do not have several processors; the execution resources (e.g. functional units) are common.
In fact, a single physical processor appears as multiple logical processors to the operating system.
Multithreaded Processors
IBM Power5, Power6:
- simultaneous multithreading;
- two threads/core;
- both Power5 and Power6 are dual core chips.
Intel Montecito (Itanium 2 family):
- blocked multithreading (called temporal multithreading by Intel);
- two threads/core;
- Montecito processors are dual core.
Intel Pentium 4, Nehalem:
- Pentium 4 was the first Intel processor to implement multithreading;
- simultaneous multithreading (called Hyper-Threading by Intel);
- two threads/core (8 simultaneous threads per quad core).
General Purpose GPUs
The first GPUs (graphic processing units) were non-programmable 3D-graphic accelerators.
Today’s GPUs are highly programmable and efficient.
NVIDIA, AMD, etc. have introduced high performance GPUs that can be used for general purpose high performance computing: general purpose graphic processing units (GPGPUs).
GPGPUs are multicore, multithreaded processors which also include SIMD capabilities.
The NVIDIA Tesla GPGPU
[Figure: the host system and memory connected to eight TPCs; each TPC contains two SMs.]
The NVIDIA Tesla (GeForce 8800) architecture:
- 8 independent processing units (texture processor clusters - TPCs);
- each TPC consists of 2 streaming multiprocessors (SMs);
- each SM consists of 8 streaming processors (SPs).
Each TPC consists of:
- two SMs, controlled by the SM controller (SMC);
- a texture unit (TU), specialised for graphic processing; the TU can serve four threads per cycle.
[Figure: a TPC with its SM controller (SMC), texture unit and two SMs (SM1, SM2); each SM contains 8 SPs, 2 SFUs, a cache, an MT issue unit and shared memory.]
Each SM consists of:
- 8 streaming processors (SPs);
- a multithreaded instruction fetch and issue unit (MT issue);
- cache and shared memory;
- two Special Function Units (SFUs); each SFU also includes 4 floating point multipliers.
Each SM is a multithreaded processor; it executes up to 768 concurrent threads.
The SM architecture is based on a combination of SIMD and multithreading, called single instruction multiple thread (SIMT):
- a warp consists of 32 parallel threads;
- each SM manages a group of 24 warps at a time (768 threads in total).
At each instruction issue the SM selects a ready warp for execution.
Individual threads belonging to the active warp are mapped for execution on the SP cores.
The 32 threads of the same warp execute code starting from the same address.
Once a warp has been selected for execution by the SM, all its threads execute the same instruction at a time; some threads can stay idle (due to branching).
The SP (streaming processor) is the primary processing unit of an SM. It performs:
- floating point add, multiply, multiply-add;
- integer arithmetic, comparison, conversion.
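Why some threads of a warp stay idle on a branch can be illustrated with a small model. This is a sketch under assumptions, not a description of the actual Tesla hardware: the kernel (double even values, add one to odd values) is invented for the example, and each of the two branch paths is modelled as a single warp-wide instruction executed under a per-thread mask.

```python
WARP_SIZE = 32


def warp_execute(data):
    """Run 'if v is even: v = v * 2 else: v = v + 1' on one warp.

    Returns the results and the number of warp-wide instructions issued.
    """
    assert len(data) == WARP_SIZE
    out = list(data)
    issued = 1                            # the branch condition itself
    taken = [v % 2 == 0 for v in out]     # evaluated by all 32 threads
    if any(taken):
        # "then" path: threads with taken == False stay idle
        out = [v * 2 if m else v for v, m in zip(out, taken)]
        issued += 1
    if not all(taken):
        # "else" path: threads with taken == True stay idle
        out = [v if m else v + 1 for v, m in zip(out, taken)]
        issued += 1
    return out, issued
```

When all 32 values are even, only one path is issued (2 instructions in total). With mixed values both paths are issued one after the other (3 instructions), and in each of them part of the warp is idle.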
The NVIDIA Tesla GPGPU
GPGPUs are optimised for throughput:
- each thread may take longer to execute but there are hundreds of threads active;
- a large amount of activity in parallel and a large amount of data produced at the same time.
Common CPU based parallel computers are primarily optimised for latency:
- each thread runs as fast as possible, but only a limited number of threads are active.
Throughput oriented applications for GPGPUs:
- extensive data parallelism: thousands of computations on independent data elements;
- limited process-level parallelism: large groups of threads run the same program; not so many different groups that run different programs;
- latency tolerant; the primary goal is the amount of work completed.
Vector Processors
Vector processors include in their instruction set, besides scalar instructions, also instructions operating on vectors.
Array processors (SIMD computers) can operate on vectors by executing the same instruction simultaneously on pairs of vector elements; each instruction is executed by a separate processing element.
Several computer architectures have implemented vector operations using the parallelism provided by pipelined functional units. Such architectures are called vector processors.
Vector Processors
- Strictly speaking, vector processors are not parallel processors, although they behave like SIMD computers. There are not several CPUs in a vector processor, running in parallel. They are SISD processors which have implemented vector instructions executed on pipelined functional units.
- Vector computers usually have vector registers, each of which can store 64 up to 128 words.
- Vector instructions:
  - load vector from memory into vector register
  - store vector into memory
  - arithmetic and logic operations between vectors
  - operations between vectors and scalars
  - etc.
- The programmer is allowed to use operations on vectors; the compiler translates these operations into vector instructions at machine level.
Vector Processors
[Figure: an instruction decoder sends scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]
Vector computers:
- CDC Cyber 205
- CRAY
- IBM 3090
- NEC SX
- Fujitsu VP
- HITACHI S8000
Vector Unit
A vector unit consists of:
- pipelined functional units
- vector registers
Vector registers:
- n general purpose vector registers Ri, 0 ≤ i ≤ n-1, each of length s;
- vector length register VL: stores the length l (0 ≤ l ≤ s) of the currently processed vectors;
- mask register M: stores a set of l bits, interpreted as boolean values; vector instructions can be executed in masked mode: vector register elements corresponding to a false value in M are ignored.
Vector Instructions
LOAD-STORE instructions:
  R ← A(x1:x2:incr)        load
  A(x1:x2:incr) ← R        store
  R ← MASKED(A)            masked load
  A ← MASKED(R)            masked store
  R ← INDIRECT(A(X))       indirect load
  A(X) ← INDIRECT(R)       indirect store
Arithmetic - logic:
  R ← R' b_op R''
  R ← S b_op R'
  R ← u_op R'
  M ← R rel_op R'
  WHERE(M) R ← R' b_op R''
Chaining:
  R2 ← R0 + R1
  R3 ← R2 * R4
Execution of the vector multiplication does not have to wait until the vector addition has terminated; as elements of the sum are generated by the addition pipeline they enter the multiplication pipeline; addition and multiplication are performed (partially) in parallel.
Vector Instructions
In a language with vector computation instructions:
  if T[1..50] > 0 then
    T[1..50] := T[1..50] + 1;
A compiler for a vector computer generates something like:
  R0 ← T(0:49:1)
  VL ← 50
  M ← R0 > 0
  WHERE(M) R0 ← R0 + 1
  T(0:49:1) ← R0
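The effect of the masked vector sequence can be emulated with Python lists. This is a sketch under assumptions: the vector register, the mask register and the function name `masked_increment` are modelled with plain lists and chosen for illustration, and each list comprehension stands for one vector instruction.

```python
def masked_increment(t, vl):
    """Add 1 to every positive element among the first vl elements of t."""
    r0 = t[0:vl]                          # R0 <- T(0:vl-1:1)  (vector load)
    m = [r > 0 for r in r0]               # M  <- R0 > 0       (mask register)
    # WHERE(M) R0 <- R0 + 1: elements with a false mask bit are left unchanged
    r0 = [r + 1 if bit else r for r, bit in zip(r0, m)]
    t[0:vl] = r0                          # T(0:vl-1:1) <- R0  (vector store)
    return t
```

For example, `masked_increment([3, -1, 0, 5], 4)` leaves `-1` and `0` untouched and returns `[4, -1, 0, 6]`.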
Multimedia Extensions to General Purpose Microprocessors
Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).
Such applications exhibit a large potential of SIMD (vector) parallelism. General purpose microprocessors have been equipped with special instructions to exploit this potential of parallelism.
The specialised multimedia instructions perform vector computations on bytes, half-words, or words.
Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:
- MMX for Intel x86 family
- VIS for UltraSparc
- MDMX for MIPS
- MAX-2 for Hewlett-Packard PA-RISC
- NEON for ARM Cortex-A8, ARM Cortex-A9
The Pentium family provides 57 MMX instructions. They treat data in a SIMD fashion.
The basic idea: subword execution
Use the entire width of the data path (32 or 64 bits) when processing small data types used in signal processing (8, 12, or 16 bits).
With a word size of 64 bits, the adders can perform eight 8-bit additions in parallel.
This is in practice a kind of SIMD parallelism, at a reduced scale.
Three packed data types are defined for parallel operations: packed byte, packed word, packed double word.
[Figure: a 64-bit register viewed as eight packed bytes (q7 to q0), four packed words (q3 to q0), two packed double words (q1, q0), or one quadword (q0).]
Examples of SIMD arithmetic with the MMX instruction set:
ADD R3 ← R1,R2
  R1:  a7     a6     a5     a4     a3     a2     a1     a0
  R2:  b7     b6     b5     b4     b3     b2     b1     b0
  R3:  a7+b7  a6+b6  a5+b5  a4+b4  a3+b3  a2+b2  a1+b1  a0+b0
MPYADD R3 ← R1,R2
  R1:  a7  a6  a5  a4  a3  a2  a1  a0
  R2:  b7  b6  b5  b4  b3  b2  b1  b0
  R3:  (a6xb6)+(a7xb7)  (a4xb4)+(a5xb5)  (a2xb2)+(a3xb3)  (a0xb0)+(a1xb1)
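Subword execution of the packed ADD can be imitated in software on a 64-bit integer. The sketch below is an illustration under assumptions: the helper names are invented, byte 0 is stored in the lowest 8 bits, and each byte addition wraps around modulo 256 (saturating variants are not modelled).

```python
HIGH = 0x8080808080808080   # top bit of every byte
LOW = 0x7F7F7F7F7F7F7F7F    # low 7 bits of every byte


def pack_bytes(values):
    """Pack eight 8-bit values into one 64-bit word (values[0] is lowest)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit word into its eight bytes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_bytes(r1, r2):
    """Eight independent 8-bit additions performed by one 64-bit addition."""
    # add the low 7 bits of every byte: the carry cannot leave the byte
    low = (r1 & LOW) + (r2 & LOW)
    # combine the top bits with XOR so that no carry reaches the next byte
    return low ^ ((r1 ^ r2) & HIGH)
```

One ordinary 64-bit addition plus a few logical operations thus replaces eight separate byte additions, which is the gain the multimedia extensions provide in hardware.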
How to get the data ready for computation? How to get the results back in the right format?
Packing and Unpacking:
PACK.W R3 ← R1,R2
  R1:  a1  a0
  R2:  b1  b0
  R3:  truncated b1  truncated b0  truncated a1  truncated a0
UNPACK R3 ← R1
  R1:  a3  a2  a1  a0
  R3:  a1  a0