CS 770G - Parallel Algorithms in Scientific Computing
References
• Parallel Computer Architecture: A Hardware/Software Approach. Culler, Singh, Gupta. Morgan Kaufmann.
• Introduction to Parallel Computing: Design and Analysis of Algorithms. Kumar, Grama, Gupta, Karypis. Benjamin/Cummings.
What is a Parallel Computer?
• Almasi and Gottlieb (1989): a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast".
• Why parallel architecture?
  • Adds a new dimension to the design space: the number of processors.
  • In principle, higher performance is achieved by using more processors.
  • How much additional performance is gained, and at what additional cost, depends on several factors.
Questions
• How large is the collection?
• How powerful are the individual processing elements (PEs)?
• Can the number be increased in a straightforward manner?
• How do they communicate and cooperate?
• How is data transmitted between PEs?
• What interconnection topology?
Taxonomy of Parallel Architectures
I. By control mechanism: instruction stream and data stream
II. By process granularity: coarse vs. fine grain
III. By address space organization: shared vs. distributed memory
IV. By interconnection network: dynamic vs. static
(I) Control Mechanism (Flynn's taxonomy)
• SISD: Single Instruction stream, Single Data stream, e.g. conventional sequential computers.
• SIMD: Single Instruction stream, Multiple Data streams.
• MIMD: Multiple Instruction streams, Multiple Data streams.
• MISD: Multiple Instruction streams, Single Data stream.
SIMD
• Multiple processing elements operate under the supervision of a single control unit, e.g. Thinking Machines CM-2, MasPar MP-2, Quadrics.
• SIMD extensions are now present in commercial microprocessors (MMX and SSE, code-named Katmai, in Intel x86; 3DNow! in the AMD K6 and Athlon; AltiVec in the Motorola G4).
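As a minimal illustration of the SIMD idea at the instruction level, the sketch below (my example, assuming a compiler with SSE intrinsic support; not from the course material) adds four pairs of floats with a single data-parallel instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* One instruction stream, multiple data: 4 float additions at once. */
    void add4(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_loadu_ps(a);          /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b);          /* load 4 floats from b */
        _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* c[0..3] = a[0..3] + b[0..3] */
    }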
MIMD
• Each processing element is capable of executing a different program, independent of the other processors.
• Most multiprocessors can be classified in this category.
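A hedged sketch of how MIMD machines are usually programmed in practice (SPMD style with MPI; assumes an MPI installation, not taken from the course notes): every process runs the same executable but follows a different path depending on its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("rank 0: acting as coordinator\n");          /* one role ...  */
        else
            printf("rank %d: doing local computation\n", rank); /* ... another   */

        MPI_Finalize();
        return 0;
    }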
(II) Process Granularity
• Coarse grain: small number of very powerful processors, e.g. Cray C90, Fujitsu.
• Fine grain: large number of relatively less powerful processors, e.g. CM-2, Quadrics.
• Medium grain: between the two extremes, e.g. IBM SP2, CM-5.
• Communication cost >> computation cost → coarse grain.
• Communication cost << computation cost → fine grain.
(III) Address Space Organization
• Single/shared address space
  • Uniform Memory Access (UMA): SMP
  • Non-Uniform Memory Access (NUMA)
• Message passing
  • Distributed memory
SMP Architecture
[Figure: several CPUs, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.]
• SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors.
• Cache coherence is maintained.
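A minimal sketch of programming such a shared-memory machine with OpenMP (my example, not from the slides): all threads read and write the same array through the single shared address space.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double x[N];        /* one array, visible to every thread */
        double sum = 0.0;
        int i;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            x[i] = 1.0;            /* each thread touches its share of x */
            sum += x[i];
        }

        printf("sum = %f (threads share x and memory)\n", sum);
        return 0;
    }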
NUMA Architecture
• Shared address space.
• Memory latency varies depending on whether you access local or remote memory.
• Cache coherence is maintained using a hardware or software protocol.
[Figure: several CPUs, each with its own cache and a local memory, connected through a bus or crossbar switch.]
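One practical consequence of the varying memory latency is data placement. The sketch below assumes a NUMA operating system with a first-touch page-placement policy (an assumption on my part, not stated in the slides): initializing data with the same threads and schedule that later use it keeps most accesses local.

    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *x = malloc(N * sizeof(double));
        int i;

        /* First touch: pages of x are placed near the thread that writes them. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            x[i] = 0.0;

        /* Same static schedule: each thread mostly accesses local memory. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            x[i] += 1.0;

        free(x);
        return 0;
    }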
Message-Passing Architecture
• Local address space.
• No cache coherence.
[Figure: several CPUs, each with its own cache and local memory, connected through a communication network.]
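Since each processor sees only its own local address space, data is exchanged by explicit messages. A minimal MPI sketch (assumes an MPI installation; illustrative, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* copy the data explicitly into process 1's address space */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }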
(IV) Interconnection Networks
• Dynamic
  • Switches and communication links.
  • Communication links are connected to one another dynamically by switches.
• Static
  • Point-to-point communication links.
  • Message-passing computers.
Dynamic Interconnections
• Crossbar switching: the most expensive and extensive interconnection.
• Bus connected: processors are connected to memory through a common datapath.
• Multistage interconnection: butterfly, omega network, perfect shuffle, etc.
[Figure: a butterfly multistage network connecting processors P1, P2 to memories M1, M2.]
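To make the multistage idea concrete, here is a hedged sketch of destination-tag routing in an omega network built from perfect-shuffle wiring and 2x2 switches (my illustration; the constant K and the helper shuffle are assumptions, not from the slides). Each stage shuffles the lines and then sets the switch output from one bit of the destination address.

    #include <stdio.h>

    #define K 3            /* log2 of the number of inputs */
    #define P (1 << K)     /* p = 8 inputs/outputs         */

    /* Perfect shuffle: left rotation of the K address bits. */
    static int shuffle(int a)
    {
        return ((a << 1) | (a >> (K - 1))) & (P - 1);
    }

    int main(void)
    {
        int src = 2, dst = 5, line = src, stage;

        for (stage = 0; stage < K; stage++) {
            line = shuffle(line);                    /* wiring before the switches  */
            int bit = (dst >> (K - 1 - stage)) & 1;  /* destination-tag bit         */
            line = (line & ~1) | bit;                /* switch: 0 = upper, 1 = lower */
            printf("stage %d: message on line %d\n", stage, line);
        }
        /* After K stages, line == dst. */
        return 0;
    }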
Static Interconnection
• Completely-connected
• Star-connected
• Linear array
• Mesh: 2D/3D mesh, 2D/3D torus
• Tree and fat-tree network
• Hypercube network
Characteristics of Static Networks
• Diameter: maximum distance between any two processors in the network.
  D = 1            completely connected
  D = N - 1        linear array
  D = N/2          ring
  D = 2(√N - 1)    2D mesh
  D = 2⌊√N/2⌋      2D torus
  D = log N        hypercube
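As a quick check of these formulas, the snippet below (illustrative only; assumes N is a power of two with an even integer square root, e.g. N = 64) evaluates each diameter:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int N = 64;
        int side = (int)sqrt((double)N);   /* sqrt(64) = 8 */

        printf("completely connected : %d\n", 1);
        printf("linear array         : %d\n", N - 1);                /* 63 */
        printf("ring                 : %d\n", N / 2);                /* 32 */
        printf("2D mesh              : %d\n", 2 * (side - 1));       /* 14 */
        printf("2D torus             : %d\n", 2 * (side / 2));       /*  8 */
        printf("hypercube            : %d\n", (int)log2((double)N)); /*  6 */
        return 0;
    }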
Characteristics of Static Networks (cont.)
• Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
• Channel rate: peak rate at which a single wire can deliver bits.
• Channel width: the number of wires (bits delivered simultaneously) in a link.
• Channel bandwidth: the product of channel rate and channel width.
• Bisection bandwidth B: the product of bisection width and channel bandwidth.
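A small worked example of these definitions (the numbers are made up for illustration): with a channel rate of 1 Gbit/s per wire, a channel width of 16 wires, and a hypercube of p = 64 processors (bisection width p/2 = 32 links):

    #include <stdio.h>

    int main(void)
    {
        double channel_rate    = 1e9;     /* bits/s per wire (assumed)   */
        int    channel_width   = 16;      /* wires per link (assumed)    */
        int    p               = 64;      /* processors                  */
        int    bisection_width = p / 2;   /* hypercube: p/2 links cut    */

        double channel_bw   = channel_rate * channel_width;   /* rate x width  */
        double bisection_bw = channel_bw * bisection_width;   /* width x bw    */

        printf("channel bandwidth   = %g bits/s\n", channel_bw);
        printf("bisection bandwidth = %g bits/s\n", bisection_bw);
        return 0;
    }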
Linear Array, Ring, Mesh, Torus
Processors are arranged as a d-dimensional grid or torus.
[Figure: examples of a linear array, a ring, a 2D mesh, and a 2D torus of processors.]
Tree, Fat-Tree
• Tree network: there is only one path between any pair of processors.
• Fat-tree network: increases the number of communication links close to the root.
Binary Reflected Gray Code
G(i,d) denotes the i-th entry in the sequence of d-bit Gray codes. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing each reflected entry with 1 and each original entry with 0.
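A small sketch of this construction (my code, not from the notes): the closed form i ^ (i >> 1) produces exactly the reflect-and-prefix sequence described above.

    #include <stdio.h>

    /* G(i,d): i-th entry of the d-bit binary reflected Gray code. */
    static unsigned gray(unsigned i)
    {
        return i ^ (i >> 1);
    }

    int main(void)
    {
        const int d = 3;
        unsigned i;
        for (i = 0; i < (1u << d); i++) {
            int b;
            printf("G(%u,%d) = ", i, d);
            for (b = d - 1; b >= 0; b--)
                putchar(((gray(i) >> b) & 1) ? '1' : '0');
            putchar('\n');
        }
        return 0;
    }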
Binary Reflected Gray Code (cont.)
• Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube.
• Mapping a 2^r x 2^s mesh onto a hypercube: processor (i,j) → G(i,r) || G(j,s), where || denotes concatenation.
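A hedged sketch of the mesh embedding (the function name mesh_to_hypercube is mine): mesh node (i,j) goes to the hypercube node labelled by the r-bit Gray code of i concatenated with the s-bit Gray code of j. Because consecutive Gray codes differ in exactly one bit, mesh neighbours land on hypercube neighbours.

    #include <stdio.h>

    static unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Node (i,j) of a 2^r x 2^s mesh -> (r+s)-dimensional hypercube node
       G(i,r) || G(j,s); only s is needed for the shift, r fixes the range of i. */
    static unsigned mesh_to_hypercube(unsigned i, unsigned j, int s)
    {
        return (gray(i) << s) | gray(j);
    }

    int main(void)
    {
        const int r = 2, s = 2;            /* 4 x 4 mesh, 16-node hypercube */
        unsigned i, j;
        for (i = 0; i < (1u << r); i++)
            for (j = 0; j < (1u << s); j++)
                printf("(%u,%u) -> hypercube node %2u\n",
                       i, j, mesh_to_hypercube(i, j, s));
        return 0;
    }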
Example of BRG Code

  i   1-bit   2-bit   3-bit   8p ring   8p hypercube node
  0     0      00      000       0            0
  1     1      01      001       1            1
  2            11      011       2            3
  3            10      010       3            2
  4                    110       4            6
  5                    111       5            7
  6                    101       6            5
  7                    100       7            4
Trade-off Among Different Networks

  Network                Min latency   Max BW per proc   Wires        Switches     Example
  Completely connected   Constant      Constant          O(p^2)       -            -
  Crossbar               Constant      Constant          O(p)         O(p^2)       Cray
  Bus                    Constant      O(1/p)            O(p)         O(p)         SGI Challenge
  Mesh                   O(√p)         Constant          O(p)         -            Intel ASCI Red
  Hypercube              O(log p)      Constant          O(p log p)   -            SGI Origin
  Switched               O(log p)      Constant          O(p log p)   O(p log p)   IBM SP-2
Beowulf
• Cluster built with commodity hardware components.
• PC hardware (x86, Alpha, PowerPC).
• Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI).
• Linux or FreeBSD operating system.
http://www.beowulf.org
Clusters of SMP
• The next generation of supercomputers will have thousands of SMP nodes connected.
• Increase the computational power of the single node.
• Keep the number of nodes "low".
• A new programming approach is needed: MPI + threads (OpenMP, Pthreads, ...), as sketched below.
• ASCI White, Compaq SC, IBM SP3, ...
http://www.llnl.gov/asci
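A hedged sketch of the MPI + threads model referred to above (assumes an MPI-2 implementation with thread support plus OpenMP; my example, not from the slides): one MPI process per SMP node, OpenMP threads inside each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* One MPI process per SMP node; only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OpenMP threads exploit the CPUs inside the node. */
        #pragma omp parallel
        printf("MPI rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }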
Multithread Architectures
• The MTA system provides scalable shared memory, in which every processor has equal access to every memory location.
• No concerns about the layout of memory.
• Each MTA processor has up to 128 RISC-like virtual processors.
• Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word, and target and trap registers. A different hardware stream is activated every clock period.
http://www.tera.com