Lecture 23: Intro to Multi-processors
Michael B. Greenwald
Computer Architecture
CIS 501
Fall 1999
Administrative stuff
• Final exam will be in room Moore 216, 8:30-10:30am on Thursday, December 16th.
• HW #6 delayed until Thursday, Dec. 9th.
• Project extension: no penalty if I get it by the time I show up tomorrow morning (Friday, 9am-ish).
• Final: open book? Vote.
• Penn CISters' women's luncheon on Wednesday, December 8th, 12:30-2:30
– Polar Bear Lounge (129 Pender)
– Hosted by Professors Martha Palmer & Susan Davidson
– Questions? [email protected]
Why multiprocessors?
• Exploit parallelism (duplicate every resource, so no structural hazards).
• Increase availability (single processors may fail but system remains robust).
• Simplify parallelization
Goal: increase performance by a factor of N if there are N processors.
Pay more money, increase speedup! Rarely achievable.
Barriers to factor of N speedup
• Not all resources are duplicated (structural hazards)
– High cost or low utilization
– Need to maintain a single identity, or the resource is used for sharing information.
• Data dependencies:
– A depends upon the result of B: true dependencies
– Name dependencies: false sharing
• Synchronization (see the sketch after this list)
– x := 25; then x := x+1 on two processors => each update decomposes into individual reads and writes
– Timing, barriers
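To see why the x := x+1 updates are a problem, here is a minimal C/pthreads sketch (mine, not from the slides): each increment compiles into a separate read and write, so two processors can interleave them and lose updates.

```c
/* Minimal sketch of the x := x+1 race from the slide above. */
#include <pthread.h>
#include <stdio.h>

static int x = 25;                  /* x := 25 */

static void *incr(void *arg) {
    for (int i = 0; i < 100000; i++)
        x = x + 1;                  /* load x; add 1; store x -- not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, incr, NULL);
    pthread_create(&b, NULL, incr, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 200025, but interleaved reads/writes often lose updates. */
    printf("x = %d\n", x);
    return 0;
}
```

Guarding the increment with a lock (or an atomic fetch-and-add) restores the expected result, at the cost of the synchronization bottlenecks discussed two slides below.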
Impact of barriers: lack of duplication/structural hazards
• Well understood in CIS501:
– Stalls
– Bottleneck (e.g. shared bus)
– Cost of arbitration
Impact of barriers: Data Dependencies
• Increased memory costs
– Cache misses as data moves from cache 1 to cache 2
• Proc A stalls waiting for B to finish (lack of parallelism)
• Communication costs between subtasks
– Stall waiting for data to be transmitted
– Increased memory costs (more misses)
• False sharing
– Example: 2 objects in 1 cache line (see the sketch below)
– Increases memory costs
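A minimal C sketch (hypothetical, not from the slides) of the 2-objects-in-1-cache-line example: each thread touches only its own counter, yet the line ping-pongs between caches because both counters share it.

```c
#include <pthread.h>
#include <stdio.h>

struct counters {
    long a;                  /* updated only by thread A */
    long b;                  /* updated only by thread B, but in the same
                                cache line as a: false sharing */
};

struct padded_counters {
    long a;
    char pad[64];            /* assumes a 64-byte cache line; puts b in a
                                different line and removes the sharing */
    long b;
};

static struct counters c;

static void *bump_a(void *arg) {
    for (long i = 0; i < 10000000; i++) c.a++;
    return NULL;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < 10000000; i++) c.b++;
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, bump_a, NULL);
    pthread_create(&tb, NULL, bump_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}
```

Switching to the padded layout trades space (and possibly more misses on a uniprocessor) for fewer coherence misses on a multiprocessor, the same trade-off the "change memory layout" slide makes later.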
Impact of barriers: Synchronization
• Hotspot/Bottleneck (leads to data dependencies on lock)
• Increased communication
• Lack of parallelism (mutual exclusion)
Structure of Multiprocessors
• A multiprocessor has N processors, with some manner of shared memory or communications
• In what sense do they “run the same program”? (How do they process Instructions/Data?)
• Memory Hierarchy: How is the memory organized?
• Memory/Communication interface: How is state shared?
Popular Flynn Categories
• SISD (Single Instruction, Single Data)
– Uniprocessors
• MISD (Multiple Instruction, Single Data)
– ??? (Image processing? Cellular automata?)
• SIMD (Single Instruction, Multiple Data)
– Examples: Illiac-IV, CM-2 (early multiprocessors, special purpose)
» Simple programming model
» Low overhead
» Flexibility
» All custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
– Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
» Flexible
» Economy of scale (each processor is the same as a commodity off-the-shelf uniprocessor)
» Independent tasks can operate independently
Memory Organization
• Centralized Shared-memory architecture; also known as “UMA (Uniform Memory Access)”:
– Shared bus (low latency, high throughput)
– Shared physical memory (shared L3 cache?)
– Shared I/O system
– Separate L1 (and L2?) caches
• Distributed Memory architecture; NUMA, “cluster”:
– Independent I/O, memory, and caches per processor
– Scales memory bandwidth and I/O bandwidth; fast access to local memory
– Large spectrum of interconnection networks (each node may be a UMA multiprocessor)
Memory Architecture, Communication Models
• Distributed Shared Memory vs. Message passing
• DSM
– Load/Store
– Addressing:
» One physical address space
» One virtual address space
• Message passing
– Synchronous (RPC)
– Asynchronous (pure message passing)
» (Null RPC makes this distinction less important).
Communication Models
• Shared Memory
– Processors communicate with shared address space
– Easy on small-scale machines
– Advantages:
» Model of choice for uniprocessors, small-scale MPs
» Ease of programming
» Lower latency
» Easier to use hardware controlled caching
• Message passing
– Processors have private memories, communicate via messages
– Advantages:
» Less hardware, easier to design
» Focuses attention on costly non-local operations
• Can support either SW model on either HW base (the two styles are contrasted in the sketch below)
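As a minimal illustration of the two models, here is a hypothetical C sketch (mine, not from the slides): the same integer is communicated once through a shared variable (load/store plus a flag) and once through a pipe standing in for a message-passing interconnect.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* --- Shared-memory style: communicate by storing to a shared address --- */
static int shared_value;
static volatile int ready;          /* flag for coordination; a real
                                       implementation needs memory barriers,
                                       volatile is only a sketch */

static void *sm_producer(void *arg) {
    shared_value = 42;              /* ordinary store */
    ready = 1;                      /* signal the consumer */
    return NULL;
}

/* --- Message-passing style: communicate by explicit send/receive --- */
static int fds[2];                  /* pipe standing in for a network link */

static void *mp_producer(void *arg) {
    int v = 42;
    write(fds[1], &v, sizeof v);    /* "send" */
    return NULL;
}

int main(void) {
    pthread_t t;

    pthread_create(&t, NULL, sm_producer, NULL);
    while (!ready)                  /* spin until the flag is set */
        ;
    printf("shared memory: got %d\n", shared_value);   /* ordinary load */
    pthread_join(t, NULL);

    pipe(fds);
    pthread_create(&t, NULL, mp_producer, NULL);
    int v;
    read(fds[0], &v, sizeof v);     /* "receive" blocks until data arrives */
    printf("message passing: got %d\n", v);
    pthread_join(t, NULL);
    return 0;
}
```

Note how the shared-memory version needs an extra flag to coordinate, while in message passing the arrival of the data is itself the coordination; this is exactly Fundamental Issue #2 below.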
Parallel Applications: What programs can usefully use a multiprocessor?
• What applications can we make parallel?
• Need independent computations
• SPLASH benchmark
Structure of parallel programs
• (Amdahl’s Law): never faster than setup + cleanup
[Figure: two timelines. A serial run executes Setup, then Loop body 1 through Loop body n, then Cleanup; the parallelized run executes the same Setup and Cleanup, with the n loop bodies running concurrently in between.]
Too Simple!
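In symbols (my restatement, not on the original slide): if setup and cleanup are inherently serial and the loop bodies split perfectly across P processors, then

```latex
T(P) = T_{setup} + \frac{T_{loop}}{P} + T_{cleanup},
\qquad
Speedup(P) = \frac{T_{setup} + T_{loop} + T_{cleanup}}
                  {T_{setup} + T_{loop}/P + T_{cleanup}}
< \frac{T_{setup} + T_{loop} + T_{cleanup}}{T_{setup} + T_{cleanup}}.
```

Even with infinitely many processors the run is never faster than setup + cleanup, and the next two slides show why even this bound is optimistic.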
Effect of parallelization
• If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
• Simple answer is “yes”, but...
• Reality is no: data dependencies mean we must communicate results from one sub-computation to another
– Must spend the time transmitting data (throughput)
– Must wait for data to arrive (latency)
Effect of parallelization
• If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
• Computation cost scales as 1/P
• Communication cost scales in an algorithm-specific way
• Example: particle simulation
– 2-D grid; communication cost is O(1/sqrt(P)) per processor, so aggregate communication cost increases as we add processors and the performance increase is sublinear (see the worked model below)
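As a worked version of the grid example (assuming, as a concrete reading of the slide, an n x n grid split into P square blocks, each processor owning an (n/sqrt(P)) x (n/sqrt(P)) block and exchanging only its boundary):

```latex
T_{comp}(P) \propto \frac{n^2}{P},
\qquad
T_{comm}(P) \propto \frac{n}{\sqrt{P}} \text{ per processor},
\qquad
\text{aggregate comm} \propto P \cdot \frac{n}{\sqrt{P}} = n\sqrt{P}.
```

Computation shrinks as 1/P but per-processor communication shrinks only as 1/sqrt(P), so communication eventually dominates and the speedup is sublinear.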
Effect of parallelization (continued)
• Inter-processor communication is expensive
– Inter-proc communication costs (computation/communication ratio is only a 1st-order effect)
– Memory costs (locality)
– Redundant computation
• Trade off computation for communication (see the sketch below)
• Change memory layout (more cache misses on uni-processor, but fewer on multi-proc).
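A minimal C sketch (hypothetical, not from the slides) of trading computation for communication: each thread redundantly recomputes a small lookup table locally instead of having one processor compute it and communicate it to the others.

```c
#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define TABLE_SIZE 256

static void *worker(void *arg) {
    const double PI = 3.14159265358979323846;
    double table[TABLE_SIZE];
    double checksum = 0.0;
    /* Redundant computation: every thread builds its own private copy,
       avoiding both the message traffic and the coherence misses of a
       shared table, at the cost of repeating the arithmetic. */
    for (int i = 0; i < TABLE_SIZE; i++) {
        table[i] = sin(2.0 * PI * i / TABLE_SIZE);
        checksum += table[i];
    }
    printf("thread %ld: private table ready (checksum %g)\n",
           (long)(size_t)arg, checksum);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (long i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Whether this wins depends on the ratio of the recomputation cost to the communication cost, the 1st-order effect named above.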
Fundamental Issues
• 4 issues to characterize parallel machines/systems
1) Naming
2) Synchronization
3) Latency and Bandwidth
4) Consistency
Fundamental Issue #1: Naming
• Naming: how to solve large problem fast
– what data is shared
– how it is addressed
– what operations can access data
– how processes refer to each other
• Choice of naming affects the code produced by a compiler: with shared memory a load just needs an address; with message passing we must keep track of the processor number and a local virtual address
• Choice of naming affects replication of data: via loads in the cache memory hierarchy, or via SW replication and consistency
Fundamental Issue #1: Naming
• Global physical address space: any processor can generate an address and access it in a single operation
– memory can be anywhere: virtual addr. translation handles it
• Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program
Fundamental Issue #2: Synchronization
• To cooperate, processes must coordinate
• Message passing: coordination is implicit in the transmission or arrival of data
• Shared address space => additional operations to coordinate explicitly:
e.g., write a flag, awaken a thread, interrupt a processor, perform an atomic operation (see the sketch below)
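A minimal C sketch (hypothetical, not from the slides) of the "write a flag, awaken a thread" style of explicit shared-memory coordination, using a pthread mutex and condition variable:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;          /* the flag */
static int data;

static void *producer(void *arg) {
    pthread_mutex_lock(&lock);
    data = 99;
    data_ready = 1;                 /* write the flag... */
    pthread_cond_signal(&cond);     /* ...and awaken the waiting thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!data_ready)             /* sleep instead of spinning */
        pthread_cond_wait(&cond, &lock);
    printf("got %d\n", data);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

In the message-passing model none of this machinery is visible to the programmer: a blocking receive already combines the data transfer with the coordination.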
Fundamental Issue #3: Latency and Bandwidth
• Bandwidth
– Need high bandwidth in communication
– Cannot scale perfectly with the number of processors, but should stay close
– Match limits in network, memory, and processor
– Overhead to communicate is a problem in many machines
• Latency
– Affects performance, since the processor may have to wait
– Affects ease of programming, since it requires more thought to overlap communication and computation
• Latency Hiding
– How can a mechanism help hide latency?
– Examples: overlap message send with computation, prefetch data, switch to other tasks (see the sketch below)
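A minimal sketch (hypothetical, not from the slides) of the first latency-hiding example, overlapping a message send with computation, using MPI's non-blocking send (run with two ranks, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf[1024] = {0};
    double sum = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Start the send, but do not wait for it... */
        MPI_Isend(buf, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);

        /* ...overlap the message latency with independent computation. */
        for (int i = 0; i < 1000000; i++)
            sum += i * 0.5;

        /* Only now pay whatever latency remains. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 0 computed %f while the message was in flight\n", sum);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

The latency is hidden only if the independent computation takes at least as long as the transfer; otherwise the wait still stalls the processor.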
SMP Interconnect
• Processors to Memory AND to I/O
• Bus-based: all memory locations have equal access time, so SMP = "Symmetric MP"
– Processors and I/O share limited BW as more are added
– (see Chapter 1, Figs 1-18/19, page 42-43 of [CSG96])
• Crossbar: expensive to expand
• Multistage network (less expensive to expand than crossbar with more BW)
• “Dance Hall” designs: All processors on the left, all memories on the right