MBG 1, CIS501 Fall 99

Lecture 23: Intro to Multi-processors

Michael B. Greenwald

Computer Architecture

CIS 501

Fall 1999


MBG 2, CIS501 Fall 99

Administrative stuff

• Final exam will be in room Moore 216, 8:30-10:30am on Thursday, December 16th.

• HW #6 delayed until Thursday, Dec. 9th.

• Project extension: no penalty if I get it by the time I show up tomorrow morning (Friday, 9am-ish).

• Final: open book? Vote

• Penn CISter’s women’s luncheon on Wednesday, December 8th, 12:30-2:30,

– Polar Bear Lounge (129 Pender)

– Hosted by Professors Martha Palmer & Susan Davidson

– questions? [email protected]


MBG 3, CIS501 Fall 99

Why multiprocessors?

• Exploit parallelism (duplicate every resource, so no structural hazards).

• Increase availability (single processors may fail but system remains robust).

• Simplify parallelization

Goal: increase performance by a factor of N if there are N processors.

Pay more money, increase speedup! Rarely achievable.


MBG 4, CIS501 Fall 99

Barriers to factor of N speedup

• Not all resources are duplicated (structural hazards)

– High cost or low utilization

– Need to maintain a single identity, or are used for sharing information.

• Data dependencies:
– A depends upon the result of B: true dependencies
– Name dependencies: false sharing

• Synchronization
– x := 25; x := x+1; x := x+1; => the outcome depends on how the individual reads and writes interleave (see the sketch after this list)

– Timing, Barriers
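To make the interleaving point concrete, here is a minimal sketch (not from the slides; the POSIX-threads setup and names are illustrative): two threads each execute x := x+1 on a shared x that starts at 25, and without the lock the final value can be 26 instead of 27.

/* Sketch of the "x := x+1" hazard: each increment is really a read,
   an add, and a write, and the reads and writes of two processors
   can interleave.  Build with: cc race.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static int x = 25;                      /* shared, as in "x := 25" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *unsafe_inc(void *arg) {
    x = x + 1;                          /* read and write may interleave with the other thread's */
    return NULL;
}

static void *safe_inc(void *arg) {
    pthread_mutex_lock(&lock);          /* explicit synchronization around the read-modify-write */
    x = x + 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, unsafe_inc, NULL);   /* swap in safe_inc to always get 27 */
    pthread_create(&b, NULL, unsafe_inc, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %d (27 expected; 26 possible without the lock)\n", x);
    return 0;
}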


MBG 5, CIS501 Fall 99

Impact of barriers: lack of duplication / structural hazards

• Well understood in CIS501:
– Stalls

– Bottleneck (e.g. shared bus)

– Cost of arbitration


MBG 6, CIS501 Fall 99

Impact of barriers: Data Dependencies

• Increased memory costs
– Cache misses as data moves from cache 1 to cache 2.

• Proc A stalls waiting for B to finish (lack of parallelism)

• Communication costs between subtasks
– Stall waiting for data to be transmitted

– Increased memory costs (more misses)

• False sharing
– Example: 2 objects in 1 cache line (see the sketch after this list).

– Increases memory costs
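A hedged sketch of the 2-objects-in-1-cache-line case (the 64-byte line size and the padding idiom are assumptions about the machine, not something the slide specifies): two threads update independent counters, but when the counters sit in the same line the line ping-pongs between their caches.

/* Sketch: two independent counters that share a cache line cause the
   line to bounce between the two processors' caches; padding pushes
   each counter onto its own line.  Line size (64 bytes) is assumed. */
#include <pthread.h>
#include <stdio.h>

struct shared_line {            /* both fields likely land in one cache line */
    long a;
    long b;
} falsely_shared;

struct padded {                 /* padding gives each counter its own line */
    long v;
    char pad[64 - sizeof(long)];
} separate[2];

static void *bump_a(void *arg) {
    for (long i = 0; i < 10000000; i++) {
        falsely_shared.a++;     /* invalidates the other core's copy of the shared line */
        separate[0].v++;        /* private line: no interference */
    }
    return NULL;
}

static void *bump_b(void *arg) {
    for (long i = 0; i < 10000000; i++) {
        falsely_shared.b++;
        separate[1].v++;
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", falsely_shared.a, falsely_shared.b);
    return 0;
}

The two threads never touch the same variable, so the results are correct either way; the cost of false sharing shows up only as extra coherence misses on the shared line.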


MBG 7, CIS501 Fall 99

Impact of barriers: Synchronization

• Hotspot/Bottleneck (leads to data dependencies on lock)

• Increased communication

• Lack of parallelism (mutual exclusion)
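A hedged sketch of the hotspot and mutual-exclusion points above (thread count and names are illustrative, not from the slides): every increment of the lock-protected counter serializes the threads, while the per-thread partial counters run fully in parallel and are merged once at the end.

/* Sketch: a single lock-protected counter is a hotspot (all threads
   serialize on its lock); per-thread partial counts avoid the bottleneck. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static long hot_total;                      /* hotspot: every increment takes the lock */
static pthread_mutex_t hot_lock = PTHREAD_MUTEX_INITIALIZER;
static long partial[NTHREADS];              /* one slot per thread: no sharing during the loop */

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&hot_lock);      /* mutual exclusion: only one thread makes progress here */
        hot_total++;
        pthread_mutex_unlock(&hot_lock);
        partial[id]++;                      /* fully parallel alternative */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    long sum = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];                  /* merge once, after the parallel phase */
    }
    printf("locked total = %ld, merged partial total = %ld\n", hot_total, sum);
    return 0;
}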


MBG 8, CIS501 Fall 99

Structure of Multiprocessors

• A multiprocessor has N processors, with some manner of shared memory or communication

• In what sense do they “run the same program”? (How do they process Instructions/Data?)

• Memory Hierarchy: How is the memory organized?

• Memory/Communication Interface: How is state shared?


MBG 9, CIS501 Fall 99

Popular Flynn Categories

• SISD (Single Instruction Single Data)
– Uniprocessors

• MISD (Multiple Instruction Single Data)
– ??? (Image processing? Cellular automata?)

• SIMD (Single Instruction Multiple Data)
– Examples: Illiac-IV, CM-2 (early multiprocessors, special purpose)
» Simple programming model
» Low overhead
» Flexibility
» All custom integrated circuits

• MIMD (Multiple Instruction Multiple Data)
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
» Flexible
» Economy of scale (each microprocessor is the same as a commodity off-the-shelf uniprocessor)
» Independent tasks can operate independently


MBG 10, CIS501 Fall 99

Memory Organization

• Centralized Shared-memory architecture; also known as “UMA (Uniform Memory Access)”:

– Shared bus (low latency, high throughput)

– Shared physical memory (shared L3 cache?)

– Shared I/O system

– Separate L1 (and L2?) caches

• Distributed Memory architecture; NUMA, “cluster”:

– Independent I/O, memory, and caches per processor

– Scales memory bandwidth and I/O bandwidth; fast access to local memory

– Large spectrum of interconnection networks (each node may be a UMA multiprocessor)


MBG 11, CIS501 Fall 99

Memory Architecture, Communication Models

• Distributed Shared Memory vs. Message passing

• DSM
– Load/Store

– Addressing:

» One physical address space

» One virtual address space

• Message passing
– Synchronous (RPC)

– Asynchronous (Pure message passing)

» (Null RPC makes this distinction less important).


MBG 12, CIS501 Fall 99

Communication Models

• Shared Memory

– Processors communicate with shared address space

– Easy on small-scale machines

– Advantages:

» Model of choice for uniprocessors, small-scale MPs

» Ease of programming

» Lower latency

» Easier to use hardware controlled caching

• Message passing
– Processors have private memories, communicate via messages

– Advantages:

» Less hardware, easier to design

» Focuses attention on costly non-local operations

• Can support either SW model on either HW base
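To make the last bullet concrete, here is a hedged sketch (all names are illustrative, not from the slides) of the message-passing software model implemented on a shared-memory hardware base: a one-slot channel built from a mutex and a condition variable carries "messages" between two threads.

/* Sketch: the message-passing SW model layered on shared-memory HW.
   A one-slot channel built from a mutex and a condition variable
   carries messages, even though the only hardware mechanism here is
   a shared address space. */
#include <pthread.h>
#include <stdio.h>

struct channel {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int full;                            /* 1 if a message is waiting in the slot */
    int msg;
};

static struct channel ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

static void chan_send(struct channel *c, int v) {
    pthread_mutex_lock(&c->m);
    while (c->full)                      /* wait for the receiver to drain the slot */
        pthread_cond_wait(&c->cv, &c->m);
    c->msg = v;
    c->full = 1;
    pthread_cond_signal(&c->cv);
    pthread_mutex_unlock(&c->m);
}

static int chan_recv(struct channel *c) {
    pthread_mutex_lock(&c->m);
    while (!c->full)                     /* wait for a message to arrive */
        pthread_cond_wait(&c->cv, &c->m);
    int v = c->msg;
    c->full = 0;
    pthread_cond_signal(&c->cv);
    pthread_mutex_unlock(&c->m);
    return v;
}

static void *producer(void *arg) {
    for (int i = 0; i < 5; i++)
        chan_send(&ch, i);
    chan_send(&ch, -1);                  /* end-of-stream marker */
    return NULL;
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int v; (v = chan_recv(&ch)) != -1; )
        printf("received %d\n", v);
    pthread_join(p, NULL);
    return 0;
}

The reverse direction (shared memory emulated on message-passing hardware) is what distributed shared memory systems do in software.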


MBG 13, CIS501 Fall 99

Parallel Applications: What programs can usefully use a multiprocessor?

• What applications can we make parallel?

• Need independent computations

• SPLASH benchmark


MBG 14, CIS501 Fall 99

Structure of parallel programs

• (Amdahl’s Law): never faster than setup + cleanup

[Slide diagram: a serial run (Setup, Loop body 1 … Loop body n, Cleanup) next to a parallel run in which the loop bodies execute side by side between the serial Setup and Cleanup.]
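A worked instance of the Amdahl's Law bound above (the 5% serial fraction and 8 processors are illustrative numbers, not from the slides): if setup and cleanup are the serial fraction 1-f of the work and the loop bodies (fraction f) parallelize perfectly over P processors, then

\[
\mathrm{Speedup}(P) \;=\; \frac{1}{(1-f) + f/P},
\qquad
f = 0.95,\ P = 8:\quad
\mathrm{Speedup} = \frac{1}{0.05 + 0.95/8} \approx 5.9,
\qquad
\lim_{P\to\infty} \mathrm{Speedup} = \frac{1}{1-f} = 20.
\]

No matter how many processors are added, the serial setup and cleanup cap the speedup (at 20x in this example).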


MBG 15, CIS501 Fall 99

Structure of parallel programs

• (Amdahl’s Law): never faster than setup + cleanup

[Same diagram as the previous slide: serial Setup and Cleanup around loop bodies 1 … n run in parallel.]

Too Simple!


MBG 16, CIS501 Fall 99

Effect of parallelization

• If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?

• Simple answer is “yes”, but...

• In reality, no: data dependencies mean results must be communicated from one sub-computation to another

– Must spend the time transmitting data (throughput)

– Must wait for data to arrive (latency)


MBG 17, CIS501 Fall 99

Effect of parallelization

• If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?

• Computation cost scales as 1/P

• Communication cost scales in an algorithm-specific way.

• Example: particle simulation.
– 2-D grid: communication cost is O(1/sqrt(P)) per processor, so aggregate communication cost increases as we add processors and the performance increase is sublinear (see the worked example below).
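A worked version of that scaling argument (the n-by-n grid size is an illustrative parameter, not from the slides): each of the P processors owns an (n/sqrt(P)) × (n/sqrt(P)) block of the grid, computes on its n²/P cells, and exchanges only its boundary of roughly 4n/sqrt(P) cells with its neighbours, so

\[
\frac{\text{communication per processor}}{\text{computation per processor}}
\;\propto\; \frac{4n/\sqrt{P}}{\,n^{2}/P\,} \;=\; \frac{4\sqrt{P}}{n},
\qquad
\text{aggregate communication} \;\propto\; P \cdot \frac{4n}{\sqrt{P}} \;=\; 4n\sqrt{P}.
\]

Computation per processor falls as 1/P, but communication per processor falls only as 1/sqrt(P); the communication share therefore grows as sqrt(P) and the speedup is sublinear.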


MBG 18, CIS501 Fall 99

Effect of parallelization (continued)

• Inter-processor communication is expensive.
– Inter-processor communication costs (the computation/communication ratio is only a 1st-order effect)

– Memory costs (locality)

– Redundant computation

• Trade off computation for communication

• Change memory layout (more cache misses on uni-processor, but fewer on multi-proc).


MBG 19, CIS501 Fall 99

Fundamental Issues

• 4 issues to characterize parallel machines/systems

1) Naming

2) Synchronization

3) Latency and Bandwidth

4) Consistency


MBG 20, CIS501 Fall 99

Fundamental Issue #1: Naming

• Naming: how to solve a large problem fast

– what data is shared

– how it is addressed

– what operations can access data

– how processes refer to each other

• Choice of naming affects the code produced by a compiler: via a load where you just remember the address, or by keeping track of a processor number and a local virtual address for message passing

• Choice of naming affects replication of data: via loads in the cache memory hierarchy, or via SW replication and consistency


MBG 21, CIS501 Fall 99

Fundamental Issue #1: Naming

• Global physical address space: any processor can generate an address and access it in a single operation

– memory can be anywhere: virtual addr. translation handles it

• Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program

• Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program


MBG 22, CIS501 Fall 99

Fundamental Issue #2: Synchronization

• To cooperate, processes must coordinate

• Message passing is implicit coordination with transmission or arrival of data

• Shared address => additional operations to explicitly coordinate:

e.g., write a flag, awaken a thread, interrupt a processor, atomic operation
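A hedged sketch of the "write a flag" and "atomic operation" forms of explicit coordination (C11 atomics and the variable names are assumptions chosen for illustration, not something the slide specifies): one thread writes the data, bumps a counter with an atomic fetch-and-add, then publishes a flag with a release store; the other spins on the flag with an acquire load before reading.

/* Sketch: explicit coordination through a shared address space.
   The producer writes the data, does an atomic fetch-and-add, and
   then sets a flag; the consumer spins on the flag before reading. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;                 /* ordinary shared data */
static atomic_int hits = 0;      /* updated with an atomic read-modify-write */
static atomic_int ready = 0;     /* the coordination flag */

static void *producer(void *arg) {
    data = 42;                                               /* write the data first */
    atomic_fetch_add(&hits, 1);                              /* atomic operation */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish */
    return NULL;
}

int main(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin until the flag is set */
    printf("data = %d, hits = %d\n", data, (int)atomic_load(&hits));
    pthread_join(p, NULL);
    return 0;
}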


MBG 23, CIS501 Fall 99

Fundamental Issue #3: Latency and Bandwidth

• Bandwidth
– Need high bandwidth in communication

– Cannot scale, but stay close

– Match limits in network, memory, and processor

– Overhead to communicate is a problem in many machines

• Latency
– Affects performance, since the processor may have to wait

– Affects ease of programming, since requires more thought to overlap communication and computation

• Latency Hiding
– How can a mechanism help hide latency?

– Examples: overlap message send with computation, prefetch data, switch to other tasks
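A hedged sketch of the prefetch form of latency hiding (__builtin_prefetch is a GCC/Clang extension, and the lookahead distance of 8 elements is a guess rather than a tuned value): the fetch of a future element is started early so its miss latency overlaps with useful computation.

/* Sketch: hide memory latency by prefetching data a few iterations
   ahead of its use, so the miss overlaps with useful computation. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);   /* start the fetch early (GCC/Clang extension) */
        sum += a[i];                          /* by the time we get here, a[i] is (hopefully) in cache */
    }
    return sum;
}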


MBG 24, CIS501 Fall 99

SMP Interconnect

• Processors to Memory AND to I/O

• Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”

– Sharing limited BW as we add processors and I/O

– (see Chapter 1, Figs 1-18/19, page 42-43 of [CSG96])

• Crossbar: expensive to expand

• Multistage network (less expensive to expand than crossbar with more BW)

• “Dance Hall” designs: All processors on the left, all memories on the right