CS 770G - Parallel Algorithms in Scientific Computing
References
• Parallel Computer Architecture: A Hardware/Software Approach. Culler, Singh, Gupta. Morgan Kaufmann.
• Introduction to Parallel Computing: Design and Analysis of Algorithms. Kumar, Grama, Gupta, Karypis. Benjamin/Cummings.
What is a Parallel Computer?
• Almasi and Gottlieb (1989): a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast".
• Why parallel architecture?
  • Adds a new dimension to the design space: the number of processors.
  • In principle, higher performance is achieved by using more processors.
  • How much additional performance is gained, and at what additional cost, depends on several factors.
Questions
• How large is the collection?
• How powerful are the individual processing elements (PEs)?
• Can the number be increased in a straightforward manner?
• How do they communicate and cooperate?
• How is data transmitted between PEs?
• What interconnection topology?
Taxonomy of Parallel Architectures
I. By control mechanism: instruction stream and data stream
II. By process granularity: coarse vs. fine grain
III. By address space organization: shared vs. distributed memory
IV. By interconnection network: dynamic vs. static
(I) Control Mechanism (Flynn's taxonomy)
• SISD: Single Instruction stream, Single Data stream, e.g. conventional sequential computers.
• SIMD: Single Instruction stream, Multiple Data streams.
• MIMD: Multiple Instruction streams, Multiple Data streams.
• MISD: Multiple Instruction streams, Single Data stream.
SIMD
• Multiple processing elements operate under the supervision of a single control unit, e.g. Thinking Machines CM-2, MasPar MP-2, Quadrics.
• SIMD extensions are now present in commercial microprocessors (MMX and SSE, code-named Katmai, in Intel x86; 3DNow! in the AMD K6 and Athlon; AltiVec in the Motorola G4).
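As a minimal illustration of the SIMD idea at the instruction level, the sketch below (my example, assuming a compiler with SSE intrinsic support; not from the course material) adds four pairs of floats with a single data-parallel instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* One instruction stream, multiple data: 4 float additions at once. */
    void add4(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_loadu_ps(a);          /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b);          /* load 4 floats from b */
        _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* c[0..3] = a[0..3] + b[0..3] */
    }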
MIMD
• Each processing element is capable of executing a different program, independent of the other processors.
• Most multiprocessors can be classified in this category.
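A hedged sketch of how MIMD machines are usually programmed in practice (SPMD style with MPI; assumes an MPI installation, not taken from the course notes): every process runs the same executable but follows a different path depending on its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("rank 0: acting as coordinator\n");          /* one role ...  */
        else
            printf("rank %d: doing local computation\n", rank); /* ... another   */

        MPI_Finalize();
        return 0;
    }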
(II) Process Granularity
• Coarse grain: small number of very powerful processors, e.g. Cray C90, Fujitsu.
• Fine grain: large number of relatively less powerful processors, e.g. CM-2, Quadrics.
• Medium grain: between the two extremes, e.g. IBM SP2, CM-5.
• Communication cost >> computation cost → coarse grain.
• Communication cost << computation cost → fine grain.
(III) Address Space Organization
• Single/shared address space
  • Uniform Memory Access (UMA): SMP
  • Non-Uniform Memory Access (NUMA)
• Message passing
  • Distributed memory
SMP Architecture
[Figure: several CPUs, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.]
• SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors.
• Cache coherence is maintained.
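A minimal sketch of programming such a shared-memory machine with OpenMP (my example, not from the slides): all threads read and write the same array through the single shared address space.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double x[N];        /* one array, visible to every thread */
        double sum = 0.0;
        int i;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            x[i] = 1.0;            /* each thread touches its share of x */
            sum += x[i];
        }

        printf("sum = %f (threads share x and memory)\n", sum);
        return 0;
    }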
NUMA Architecture
• Shared address space.
• Memory latency varies depending on whether you access local or remote memory.
• Cache coherence is maintained using a hardware or software protocol.
[Figure: several CPUs, each with its own cache and a local memory, connected through a bus or crossbar switch.]
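One practical consequence of the varying memory latency is data placement. The sketch below assumes a NUMA operating system with a first-touch page-placement policy (an assumption on my part, not stated in the slides): initializing data with the same threads and schedule that later use it keeps most accesses local.

    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *x = malloc(N * sizeof(double));
        int i;

        /* First touch: pages of x are placed near the thread that writes them. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            x[i] = 0.0;

        /* Same static schedule: each thread mostly accesses local memory. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            x[i] += 1.0;

        free(x);
        return 0;
    }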
Message-Passing Architecture
• Local address space.
• No cache coherence.
[Figure: several CPUs, each with its own cache and local memory, connected through a communication network.]
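Since each processor sees only its own local address space, data is exchanged by explicit messages. A minimal MPI sketch (assumes an MPI installation; illustrative, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* copy the data explicitly into process 1's address space */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }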
(IV) Interconnection Networks
• Dynamic
  • Switches and communication links.
  • Communication links are connected to one another dynamically by switches.
• Static
  • Point-to-point communication links.
  • Message-passing computers.
Dynamic Interconnections
• Crossbar switching: the most expensive and extensive interconnection.
• Bus connected: processors are connected to memory through a common datapath.
• Multistage interconnection: butterfly, omega network, perfect shuffle, etc.
[Figure: a butterfly multistage network connecting processors P1, P2 to memories M1, M2.]
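To make the multistage idea concrete, here is a hedged sketch of destination-tag routing in an omega network built from perfect-shuffle wiring and 2x2 switches (my illustration; the constant K and the helper shuffle are assumptions, not from the slides). Each stage shuffles the lines and then sets the switch output from one bit of the destination address.

    #include <stdio.h>

    #define K 3            /* log2 of the number of inputs */
    #define P (1 << K)     /* p = 8 inputs/outputs         */

    /* Perfect shuffle: left rotation of the K address bits. */
    static int shuffle(int a)
    {
        return ((a << 1) | (a >> (K - 1))) & (P - 1);
    }

    int main(void)
    {
        int src = 2, dst = 5, line = src, stage;

        for (stage = 0; stage < K; stage++) {
            line = shuffle(line);                    /* wiring before the switches  */
            int bit = (dst >> (K - 1 - stage)) & 1;  /* destination-tag bit         */
            line = (line & ~1) | bit;                /* switch: 0 = upper, 1 = lower */
            printf("stage %d: message on line %d\n", stage, line);
        }
        /* After K stages, line == dst. */
        return 0;
    }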
Static Interconnection
• Completely-connected
• Star-connected
• Linear array
• Mesh: 2D/3D mesh, 2D/3D torus
• Tree and fat-tree network
• Hypercube network
Characteristics of Static Networks
• Diameter: maximum distance between any two processors in the network.
  D = 1            completely connected
  D = N - 1        linear array
  D = N/2          ring
  D = 2(√N - 1)    2D mesh
  D = 2⌊√N/2⌋      2D torus
  D = log N        hypercube
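As a quick check of these formulas, the snippet below (illustrative only; assumes N is a power of two with an even integer square root, e.g. N = 64) evaluates each diameter:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int N = 64;
        int side = (int)sqrt((double)N);   /* sqrt(64) = 8 */

        printf("completely connected : %d\n", 1);
        printf("linear array         : %d\n", N - 1);                /* 63 */
        printf("ring                 : %d\n", N / 2);                /* 32 */
        printf("2D mesh              : %d\n", 2 * (side - 1));       /* 14 */
        printf("2D torus             : %d\n", 2 * (side / 2));       /*  8 */
        printf("hypercube            : %d\n", (int)log2((double)N)); /*  6 */
        return 0;
    }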
Characteristics of Static Networks (cont.)
• Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
• Channel rate: peak rate at which a single wire can deliver bits.
• Channel width: the number of wires (bits delivered simultaneously) in a link.
• Channel bandwidth: the product of channel rate and channel width.
• Bisection bandwidth B: the product of bisection width and channel bandwidth.
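A small worked example of these definitions (the numbers are made up for illustration): with a channel rate of 1 Gbit/s per wire, a channel width of 16 wires, and a hypercube of p = 64 processors (bisection width p/2 = 32 links):

    #include <stdio.h>

    int main(void)
    {
        double channel_rate    = 1e9;     /* bits/s per wire (assumed)   */
        int    channel_width   = 16;      /* wires per link (assumed)    */
        int    p               = 64;      /* processors                  */
        int    bisection_width = p / 2;   /* hypercube: p/2 links cut    */

        double channel_bw   = channel_rate * channel_width;   /* rate x width  */
        double bisection_bw = channel_bw * bisection_width;   /* width x bw    */

        printf("channel bandwidth   = %g bits/s\n", channel_bw);
        printf("bisection bandwidth = %g bits/s\n", bisection_bw);
        return 0;
    }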
Linear Array, Ring, Mesh, Torus
Processors are arranged as a d-dimensional grid or torus.
[Figure: examples of a linear array, a ring, a 2D mesh, and a 2D torus of processors.]
Tree, Fat-Tree
• Tree network: there is only one path between any pair of processors.
• Fat-tree network: increases the number of communication links close to the root.
Binary Reflected Gray Code
G(i,d) denotes the i-th entry in the sequence of d-bit Gray codes. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing each reflected entry with 1 and each original entry with 0.
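A small sketch of this construction (my code, not from the notes): the closed form i ^ (i >> 1) produces exactly the reflect-and-prefix sequence described above.

    #include <stdio.h>

    /* G(i,d): i-th entry of the d-bit binary reflected Gray code. */
    static unsigned gray(unsigned i)
    {
        return i ^ (i >> 1);
    }

    int main(void)
    {
        const int d = 3;
        unsigned i;
        for (i = 0; i < (1u << d); i++) {
            int b;
            printf("G(%u,%d) = ", i, d);
            for (b = d - 1; b >= 0; b--)
                putchar(((gray(i) >> b) & 1) ? '1' : '0');
            putchar('\n');
        }
        return 0;
    }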
Binary Reflected Gray Code (cont.)
• Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube.
• Mapping a 2^r x 2^s mesh onto a hypercube: processor (i,j) → G(i,r) || G(j,s), where || denotes concatenation.
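A hedged sketch of the mesh embedding (the function name mesh_to_hypercube is mine): mesh node (i,j) goes to the hypercube node labelled by the r-bit Gray code of i concatenated with the s-bit Gray code of j. Because consecutive Gray codes differ in exactly one bit, mesh neighbours land on hypercube neighbours.

    #include <stdio.h>

    static unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Node (i,j) of a 2^r x 2^s mesh -> (r+s)-dimensional hypercube node
       G(i,r) || G(j,s); only s is needed for the shift, r fixes the range of i. */
    static unsigned mesh_to_hypercube(unsigned i, unsigned j, int s)
    {
        return (gray(i) << s) | gray(j);
    }

    int main(void)
    {
        const int r = 2, s = 2;            /* 4 x 4 mesh, 16-node hypercube */
        unsigned i, j;
        for (i = 0; i < (1u << r); i++)
            for (j = 0; j < (1u << s); j++)
                printf("(%u,%u) -> hypercube node %2u\n",
                       i, j, mesh_to_hypercube(i, j, s));
        return 0;
    }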
Example of BRG Code

  i   1-bit   2-bit   3-bit   8p ring   8p hypercube node
  0     0      00      000       0            0
  1     1      01      001       1            1
  2            11      011       2            3
  3            10      010       3            2
  4                    110       4            6
  5                    111       5            7
  6                    101       6            5
  7                    100       7            4
Trade-off Among Different Networks

  Network                Min latency   Max BW per proc   Wires        Switches     Example
  Completely connected   Constant      Constant          O(p^2)       -            -
  Crossbar               Constant      Constant          O(p)         O(p^2)       Cray
  Bus                    Constant      O(1/p)            O(p)         O(p)         SGI Challenge
  Mesh                   O(√p)         Constant          O(p)         -            Intel ASCI Red
  Hypercube              O(log p)      Constant          O(p log p)   -            SGI Origin
  Switched               O(log p)      Constant          O(p log p)   O(p log p)   IBM SP-2
Beowulf
• Cluster built with commodity hardware components.
• PC hardware (x86, Alpha, PowerPC).
• Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI).
• Linux or FreeBSD operating system.
http://www.beowulf.org
Clusters of SMP
• The next generation of supercomputers will have thousands of SMP nodes connected.
• Increase the computational power of the single node.
• Keep the number of nodes "low".
• A new programming approach is needed: MPI + threads (OpenMP, Pthreads, ...), as sketched below.
• ASCI White, Compaq SC, IBM SP3, ...
http://www.llnl.gov/asci
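A hedged sketch of the MPI + threads model referred to above (assumes an MPI-2 implementation with thread support plus OpenMP; my example, not from the slides): one MPI process per SMP node, OpenMP threads inside each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* One MPI process per SMP node; only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OpenMP threads exploit the CPUs inside the node. */
        #pragma omp parallel
        printf("MPI rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }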
Multithread Architectures
• The MTA system provides scalable shared memory, in which every processor has equal access to every memory location.
• No concerns about the layout of memory.
• Each MTA processor has up to 128 RISC-like virtual processors.
• Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word, and target and trap registers. A different hardware stream is activated every clock period.
http://www.tera.com