
Parallel Computer Architectures
Duncan A. Buell
Computer Science and Engineering


Rules for Parallel Computing


There are no rules


Parallel Computing History

Late 1960s: ILLIAC IV
1970: CDC STAR-100

1980s: Denelcor HEP, Tera Computer Corp. MTA, Alliant, Sequent, Stardent, Kendall Square Research (KSR), Intel Hypercube, NCube, BBN Butterfly, NASA MPP, Thinking Machines CM-2, MasPar

1990s and forward: Cray T3D, T3E; Thinking Machines CM-5; Tera Computer Corp. MTA; SGI Challenge; Sun Enterprise; SGI Origin; HP-Convex; DEC 84xx; Pittsburgh Terascale; ASCI machines; Beowulf clusters; IBM SP-1, SP-2; new DoE-inspired machines


Memory Latency is the Problem

Instructions execute in nanoseconds

Memory provides data in 100s of nanoseconds

The problem is keeping processors fed with data

Standard machines use levels of cache

How do we keep lots of processors fed?
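A rough back-of-the-envelope illustration (the specific numbers are illustrative, not from the slides): with a 1 ns instruction time and a 100 ns memory latency, one missed reference costs about 100 / 1 = 100 instruction times, so roughly 100 independent operations (or a cache hit) are needed to cover every trip to memory, and a parallel machine multiplies that demand by the number of processors.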


Solutions(?) to the Latency Problem

Connect all the processors to all the memory

SMP: Sun Enterprise, SGI Challenge, Cray multiprocessors

Provide fast, constant-time memory fetch from anywhere to anywhere

Requires a fast, expensive, full crossbar switch


Solutions(?) to the Latency Problem (2)

Build a machine that is physically structured like the computations to be performed

Vectors: Cray, CDC
SIMD: MPP, CM-2, MasPar
2D/3D grid: CRAY T3D, T3E
Butterfly: BBN
Meiko "computing surface"

Works well on problems on which it works well
Works badly on problems that don't fit


Solutions(?) to the Latency Problem (3)

Build a machine with “generic” structure and software support for computations that may not fit well

Butterfly: BBN
Log network: CM-2, CM-5

Relies on magic
Magic has always been hard to do


Solutions(?) to the Latency Problem (4)

Build an SMP and then connect SMPs together in clusters

SGI: Origin (NUMA, ccNUMA)
DoE: ASCI Red, Blue Pacific, White, etc.

Performance requires distributable computations, because the memory access is slow off the local node


Solutions(?) to the Latency Problem (5)

Ignore performance and concentrate on cost

Beowulf clusters
Networks of workstations

If the machine is cheap, and works very well on some (distributable) computations, then maybe no one will notice that it’s not so great on other computations.


The Vector Dinosaurs


Vector Computers

Much of high end computing is for scientific and engineering applications

Many of these involve linear algebra
We happen to know how to do linear algebra
Many solutions can be expressed with lin alg
(Lin alg is both the hammer and the nail)

The basic operation is a dot product, i.e. a vector multiplication (a sketch follows below)

Vector computers do blocks of arithmetic ops as one operation

Register-based (CRAY) or memory-to-memory (CDC)
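A minimal sketch of that basic operation, written as a plain C loop (my own illustration, not code from the slides); a vector machine would issue the multiplies over a whole block of elements as one vector instruction and treat the running sum as a hardware reduction:

/* Dot product: the kernel a vector computer is built to run fast. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];   /* the multiplies are independent; the sum is a reduction */
    return sum;
}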


Programming Vector Computers

Everything reduces to a compiler’s recognizing (or being told to recognize) a loop whose ops can be done in parallel.

for (i = 0; i < n; i++)       /* works just fine */
    a[i] = b[i] * c[i];

for (i = 1; i < n; i++)       /* fails: a[.] values not independent */
    a[i] = a[i-1] * b[i];

Programming involves contortions of code to make the operations inside the loops independent (one such contortion is sketched below).
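One hedged illustration of such a contortion (my own example, not from the slides): loop distribution. The recurrence below cannot be vectorized, but splitting the loop lets the independent statement run as vector operations while the recurrence stays scalar.

/* Original: the recurrence on a[] drags the independent work down with it. */
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];     /* recurrence: each iteration needs the previous one */
    c[i] = d[i] * e[i];       /* independent, but stuck in the same loop */
}

/* Distributed: the second loop now vectorizes; only the first stays scalar. */
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];
for (i = 1; i < n; i++)
    c[i] = d[i] * e[i];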


Vector Computing History

1960s: Seymour R. Cray does the CDC 6400
Cray leaves CDC, forms Cray Research, Inc. (CRI), produces the CRAY-1 (1976)
CDC Cyber 205 (late 1970s)
CDC spins off ETA; the liquid-nitrogen ETA-10 fails, ETA fails
CRAY X-MP (1983?), CRAY-2 runs Unix (1985)
Convex C-1 and a host of "Cray-ettes", now HP-Convex
CRAY Y-MP (1988?), C90, T90, J series (1990s)
Steve Chen leaves CRI, forms SSC, fails spectacularly
Cray leaves CRI, forms Cray Computer Corp. (CCC)
CCC CRAY-3 fails, CRAY-4 fails, CCC SSS fails
CRI sold to SGI, then sold to Tera Computer Corp.
1996: S. R. Cray killed in an auto wreck by a teenager


True Parallel Computing


Parallel Computers

• The theoretic model of a PRAM
• Symmetric Multi-Processors
• Distributed memory machines
• Machines with an inherent structure
• Non-Uniform Memory Access machines
• Massively parallel machines
• Grid computing


Theory – The PRAM Model

PRAM (Parallel Random Access Machine):

• Control unit
• Global memory
• Unbounded set of processors
• Private memory for each processor


PRAM

Types of PRAM:
• EREW (Exclusive Read, Exclusive Write)
• CREW (Concurrent Read, Exclusive Write)
• CRCW (Concurrent Read, Concurrent Write)

Flaws with PRAM:
• Logical flaw:
  – Must deal with the concurrent-write problem
• Practicality flaw:
  – Can't really assume an unbounded number of processors
  – Can't really afford to build the interconnect switch

Nonetheless, it's a good starting place (a small sketch of a PRAM-style algorithm follows).


Standard Single Processor Machine

• One processor
• One memory block
• Bus to memory
• All addresses visible

(Diagram: one processor connected over a bus to one memory block)


(Michael) Flynn’s Taxonomy

SISD (Single Instruction, Single Data) – The ordinary computer

MIMD (Multiple Instruction, Multiple Data) – True, symmetric, parallel computing (Sun Enterprise)

SIMD (Single Instruction, Multiple Data) – Massively parallel army-of-ants approach – Processors execute the same sequence of instructions (or else NO-OP) in lockstep (TMC CM-2)

SCMD/SPMD (Single Code/Program Multiple Data) – Processors run the same program, but on their own local data (Beowulf clusters)


Symmetric Multi-Processor (SMP) (MIMD)

• Lots of processors (32? 64? 128? 1024?)

• Multiple “ordinary” processors

• Lots of global memory
• All addresses visible to all processors

• Closest thing to a PRAM

• This is the holy grail

(Diagram: many processors, all connected to one shared global memory)


SMP Characteristics

Middle level parallel execution

Processors spawn “threads” at or below the size of a function

Compiler magic to extract parallelism (if there are no pointers in the code, then at the function level one can determine independence of variable use)

Compiler directives to force parallelism (a sketch follows below)

Sun Enterprise, SGI Challenge, …

(Diagram: processors sharing one global memory)
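As a hedged illustration of a compiler directive forcing parallelism (this uses OpenMP, one common directive system; the slides do not name a specific one, and the function and names here are my own):

/* A minimal sketch, assuming an OpenMP-capable compiler (e.g. built with
   -fopenmp or -xopenmp).  The directive asserts that the loop iterations
   are independent, so the compiler spawns a team of threads to split the
   loop among processors. */
#include <omp.h>

void scale(double *a, const double *b, int n, double s)
{
    #pragma omp parallel for      /* directive: run these iterations in parallel */
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];          /* each thread handles its own chunk of i */
}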


But SMPs Are Hard to Build

• N processors

• M memory blocks

• N*M connections

• This is hard and expensive

(Diagram: every processor P wired directly to every memory block M)
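A quick worked example of the N*M cost (the numbers are my own illustration, not from the slides): with N = 128 processors and M = 128 memory blocks, a full interconnect needs 128 * 128 = 16,384 processor-to-memory paths, and every doubling of N doubles the wiring again.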


But SMPs Are Hard to Build (2)

For large N and M, we do this as a switch, not point to point

But it’s still hard and expensive

Half the cost of a CRAY was the switch between processors and memory

Beyond 128 processors, almost impossible

(Diagram: processors P and memory blocks M connected through a central switch)


Memory Banking Issues

Many processors requesting data
Processors generate addresses faster than memory can respond

Memory banking: use the low bits of the address to select the physical bank, so consecutive addresses go to physically different banks

But a power-of-2 stride (as in an FFT) hits the same bank repeatedly (see the sketch below)

CDC deliberately used 17 memory banks to randomize accesses
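A minimal sketch of low-bit interleaving and the power-of-2 stride problem (my own illustration; 8 banks are assumed purely for concreteness):

#include <stdio.h>

#define NBANKS 8                       /* power of 2: bank = low 3 address bits */

int bank_of(unsigned addr) { return addr % NBANKS; }

int main(void)
{
    /* Unit stride: consecutive words land in banks 0,1,2,...,7,0,1,... */
    for (unsigned a = 0; a < 8; a++)
        printf("stride 1: addr %u -> bank %d\n", a, bank_of(a));

    /* Stride 8 (a power of 2, as in an FFT butterfly): every reference
       hits bank 0, so the banks give no overlap at all. */
    for (unsigned a = 0; a < 64; a += 8)
        printf("stride 8: addr %u -> bank %d\n", a, bank_of(a));

    /* With 17 banks, as CDC used, a stride of 8 spreads over all the banks,
       because gcd(8, 17) = 1. */
    return 0;
}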


FFT Butterfly Communication


Distributed Parallelism

• Beowulf cluster of Linux nodes (requires an identifiable "computer" to be a Beowulf?)

• SNOW (Scalable Network of Workstations)
• SETI@home, GIMPS, …

• Beowulfs programmed with MPI or PVM
• MPI uses explicit processor-to-processor message passing (a sketch follows below)
• Sun (and others) have tools for networks
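A minimal MPI sketch of that explicit message-passing style (my own illustration, not code from the slides; it assumes an MPI installation, compiled with mpicc and launched with mpirun using two or more processes). Every node runs the same program (SPMD), and data moves only by matching sends and receives.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which node am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* how many nodes total? */

    int token = 0;
    if (rank == 0) {
        token = 42;                                           /* node 0 owns the data    */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send to node 1 */
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* explicit matching receive */
        printf("node 1 of %d received %d\n", nprocs, token);
    }
    MPI_Finalize();
    return 0;
}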


Distributed Parallel Computers

Usually we can’t get to the memory except through the processor, but we would like to have memory-to-memory connections.

(Diagram: several nodes, each a processor P with its own memory M, joined only by a network)


Parallel Computers With Structure

• If it’s hard/expensive to build an SMP, is it useful to build the structure into the machine?

• Build in a communication pattern that you expect to see in the computations, but keep things simple enough to make them buildable

• Make sure that you have efficient algorithms for the common computational tasks


Parallel Computers With Structure (2)

• Ring-connected machines (Alliant)
• 2-dimensional meshes (CRAY T3D, T3E)
• 3-D mesh with missing links (Tera MTA)
• Logarithmic tree interconnections
  – Thinking Machines Connection Machine CM-2, CM-5
  – MasPar MP-1, MP-2
• Bolt, Beranek, and Newman (BBN) Butterfly


2-dimensional Mesh with Wraparound

A vector multiply can be done very efficiently (shift column data up past row data), but what about a matrix transpose?


Logarithmic Tree Communications


Parallel Computers With Structure (3)

Machines with structure that were intended to be SMPs were generally not successful

Alliant, Sequent, BBN Butterfly, etc.

CM-5 claimed magical compilers, but efficiency only came by using the structure explicitly

T3D, T3E were the ONLY machines that allowed shared memory with clusters of nodes—and had it work


NUMA Clusters of SMPs

• 2-4 processors, 2-4 Gbytes of memory on a node
• 4 (plus or minus) nodes per cabinet with a switch
• Cabinets interconnected with another switch
• Non-Uniform Memory Access
  – Fast access to node memory
  – Slower access elsewhere in the cabinet
  – Yet slower access off-cabinet
• Nearly all large machines are NUMA (DoE ASCI, SGI Origin, Pittsburgh Terascale, …)


Massively Parallel SIMD Computers

• NASA Massively Parallel Processor
  – Built by Goodyear in 1984 for image processing
  – 16384 1-bit procs, 1024 bits/proc of memory
  – Mesh connections
• Thinking Machines CM-2 (1986)
  – 65536 1-bit procs, 8192 bits/proc
  – Log network
  – Compute cost = communication cost?
• MasPar MP-1, MP-2 (late 1980s)
  – 8192 4-bit processors
  – Log network


Massively Parallel SIMD Computers (2)

• Plane of processors each sitting above an array of memory bits

• Usually a log network connecting the processors

• Usually also some local connections (e.g., 16 procs/node on CM-2)

(Diagram: a control processor driving a plane of compute processors, each sitting above its own column of memory)


Massively Parallel SIMD Computers (3)

• Control processor sends instructions clock by clock to the compute processors
• All compute processors execute the instruction (or NO-OP) on the same relative data location (see the masking sketch below)
• Obvious image-processing model
• Allows variable data types (although TMC didn't do this until told to)


Massively Parallel SIMD Computers (4)

Processor in Memory (PIM)
Take half the memory off a chip
Use the silicon for implementing SIMD processors

Extra address bit toggles the mode (see the sketch below):
  If 0, use the address as an address
  If 1, use the "address" as a SIMD instruction

2048 processors per memory chip
Cray Computer Corp. SSS would have provided millions of processors
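A hedged sketch of that mode-bit idea (my own reading of the slide, not documented SSS behaviour; the bit position and names are hypothetical):

#include <stdint.h>

#define MODE_BIT  (1u << 20)              /* hypothetical position of the extra bit */

enum pim_op { PIM_MEMORY_ACCESS, PIM_SIMD_INSTRUCTION };

/* One extra address bit decides whether a bus transaction is a normal
   memory access or a broadcast SIMD instruction for the in-memory procs. */
enum pim_op decode(uint32_t bus_address, uint32_t *payload)
{
    *payload = bus_address & ~MODE_BIT;   /* low bits: address or instruction code */
    return (bus_address & MODE_BIT) ? PIM_SIMD_INSTRUCTION
                                    : PIM_MEMORY_ACCESS;
}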


Grid Computing