COMP 322 Lecture 17 27 October 2009
COMP 322: Principles of Parallel Programming
Lecture 17: Understanding Parallel Computers (Chapter 2)
Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322
Vivek Sarkar Department of Computer Science
Rice University [email protected]
COMP 322, Fall 2009 (V.Sarkar) 2
Acknowledgments for todayʼs lecture • Course text: “Principles of Parallel Programming”, Calvin Lin & Lawrence
Snyder — Includes resources available at
http://www.pearsonhighered.com/educator/academic/product/0,3110,0321487907,00.html
• “Parallel Architectures”, Calvin Lin — Lectures 5 & 6, CS380P, Spring 2009, UT Austin
— http://www.cs.utexas.edu/users/lin/cs380p/schedule.html
• “A Gentler, Kinder Guide to the Multi-core Galaxy” — ECE 4100/6100 guest lecture by Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering Georgia Tech
Design Space • How should we build parallel computers?
• Many dimensions to consider — Here we discuss just a few of them
– Interconnection network – Memory model – Number of processors
— Can we build parallel computers that scale to large sizes?
Interconnection Networks: Buses • Bus
— Simple design — Each processor typically has its own cache — These caches are typically kept coherent using some protocol — Single shared resource limits scalability
— Examples – Core Duo, Sun Enterprise 5500, SGI Power Challenge, IBM Power4
[Figure: four processors (P), each with its own cache, sharing a single bus to Memory]
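A back-of-the-envelope sketch (illustrative numbers, not from the slides) of why the single shared resource limits scalability: assuming the bus bandwidth is divided evenly, each processor's share shrinks linearly as P grows.

```python
def per_processor_bandwidth(bus_bw_gbs, num_procs):
    """Effective memory bandwidth each processor sees when one shared
    bus must serve all of them (idealized uniform sharing)."""
    return bus_bw_gbs / num_procs

# Doubling the processor count halves each processor's share of the bus,
# which is why bus-based designs stop scaling beyond a handful of cores.
share_4 = per_processor_bandwidth(8.0, 4)    # 2.0 GB/s each
share_8 = per_processor_bandwidth(8.0, 8)    # 1.0 GB/s each
```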
Interconnection Networks: MIN • MIN: Multi-stage interconnection networks
— O(logP) switches provide more concurrency than a bus, hence better scalability
— Longer latency since each transfer requires O(logP) hops
— Examples – IBM SP-2, IBM RP3, BBN Butterfly GP1000
[Figure: butterfly-style MIN with logP stages of switches connecting processors (P) on one side to memory modules (M) on the other]
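The stage and hop counts above can be sketched with a couple of hypothetical helpers for a butterfly-style MIN (assuming P is a power of two and 2×2 switches):

```python
import math

def min_stages(p):
    """Stages (and hence hops per transfer) in a butterfly MIN
    connecting p processors to p memories: log2(p)."""
    return math.ceil(math.log2(p))

def min_switches(p):
    """Total 2x2 switches: p/2 switches per stage, one column per stage.
    O(p log p) switches, vs. O(p^2) for a full crossbar."""
    return (p // 2) * min_stages(p)

# Latency grows only logarithmically: 1024 processors need just 10 hops.
```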
Interconnection Networks: Meshes • Meshes
— Fixed point-to-point communication arranged in a mesh or torus. Simpler to build than a MIN.
— Bisection bandwidth does not scale well with P. Bisection bandwidth: the amount of data that can be transferred between any two partitions of the processors.
— Latency is O(√P)
— Examples – Intel Paragon, Cray T3E, Meiko
[Figure: mesh of processor–memory (P M) nodes, with a cut through the middle marking the bisection bandwidth]
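The two scaling claims can be sketched with illustrative helpers (assuming a square √P × √P mesh): only √P links cross the bisection, and opposite corners are O(√P) hops apart.

```python
import math

def mesh_bisection_links(p):
    """Links cut when a sqrt(p) x sqrt(p) mesh is split down the middle:
    one link per row, i.e. sqrt(p). Grows only as the square root of p."""
    side = math.isqrt(p)
    assert side * side == p, "p must be a perfect square"
    return side

def mesh_max_hops(p):
    """Worst-case point-to-point distance (opposite corners):
    2 * (sqrt(p) - 1) hops, i.e. O(sqrt(p)) latency."""
    side = math.isqrt(p)
    return 2 * (side - 1)

# Going from 16 to 1024 processors multiplies P by 64,
# but bisection links grow only 8x -- bandwidth per processor falls.
```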
Hierarchical Interconnection Networks • Fat Trees
— Thinking Machines CM-5
• Ring of rings — Kendall Square Research KSR-2
• Clusters and hybrids
Modern Interconnection Networks • Commercially available
— Myrinet – Point-to-point routed network
— Quadrics – Fat tree topology
— InfiniBand – Designed for I/O systems
— . . .
Interconnection Networks • Is the topology important?
— Used to be the focus of portability studies – e.g. How do you embed a tree structure in a mesh?
— In reality, the topology does not matter too much to programmers — Instead, software overhead dominates
• What is important? — Latency — Bandwidth — As the number of processors grows, latency tends to grow
Topologies affect both of these characteristics
Impact of the Network • Network performance
— Two components: bandwidth and latency
• Effects on programming model — High cost of communication changes what you compute — Compare the Pony Express with Instant Messaging
Dealing with Slow Communication • Long latency, low bandwidth
— The Pony Express: Send as little mail as possible and don’t send time-sensitive material
— Expend effort to reduce communication
• Long latency, high bandwidth — Slow boat to China: Send as much material as you like, but
avoid time-sensitive material — Expend effort to reduce the number of communication operations
• Low latency, low bandwidth — Instant messaging: Send whatever you want whenever you want,
especially if Daddy is paying!
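The three regimes above follow from the standard linear communication-cost model T = latency + n/bandwidth. A sketch with made-up numbers (not measurements) shows why, under long latency, batching many small messages into one large one pays off:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """Linear communication-cost model: T = latency + n / bandwidth.
    Each message pays the latency term once, regardless of size."""
    return latency_s + n_bytes / bandwidth_bps

# Hypothetical "slow boat" network: 1 ms latency, 1 GB/s bandwidth.
many_small = 100 * transfer_time(1_000, 1e-3, 1e9)   # 100 small messages
one_big = transfer_time(100_000, 1e-3, 1e9)          # same data, one message

# many_small is dominated by 100 latency payments; one_big pays it once.
```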
Memory Models • Memory Model
— The view of memory that is presented to the programmer
• Shared Address Space — All processors share a single address space — Also known as shared memory (SM)
• Distributed Memory — Each processor sees a disjoint view of memory — Also known as non-shared memory (NSM)
Shared Address Space Machines • Goal
— Simplify the programmer’s task
• Disadvantage — Difficult to build really large SM machines
• Symmetric vs. Asymmetric — Uniform Memory Access (UMA) — Non-Uniform Memory Access (NUMA)
– Physically distributed memory – More scalable
— These are performance characteristics
• Cache Coherence — Cached values are transparently kept consistent by the hardware — This is a semantic issue — Some SM machines do not provide cache coherence (e.g., Cray T3E)
Intel Core Duo Single chip containing • Two 32-bit Pentium processors
(P0, P1) • A 32KB Level 1 Instr cache (L1-I) and a 32KB Level 1 Data cache (L1-D) per processor
• A shared unified 2MB or 4MB Level 2 cache (L2) — Supports fast on-chip
communication between P0 and P1 — Supports a sequential consistency
memory model for sharing between P0 and P1
• A memory bus controller which accesses off-chip memory via the Front Side Bus (FSB)
Intel Core 2 Duo • Homogeneous cores • Bus-based on-chip interconnect • Shared on-die cache • Traditional I/O
[Figure annotations: each core is a classic out-of-order design (reservation stations, issue ports, schedulers, etc.); the shared L2 is large, set-associative, with prefetching]
Source: Intel Corp.
Core 2 Duo Microarchitecture
Cache Coherence on Bus-Based Machines
• Snooping caches — Each cache snoops on the bus looking for accesses to data that it holds — On a read, the cache can return the value faster than main memory — On a write, the cache either invalidates or updates the value that it
holds. – Invalidates are the norm because they reduce network traffic
• Can we provide cache coherence with other types of interconnection networks? — Yes, use broadcast and multicast operations to support snooping
[Figure: four processors (P) with snooping caches on a shared bus to Memory]
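A toy write-invalidate protocol along these lines can be sketched as follows (greatly simplified and purely illustrative; real protocols such as MESI track per-line states and handle reads, writebacks, and races):

```python
class Bus:
    """Shared bus that every cache snoops."""
    def __init__(self):
        self.caches = []

    def write(self, writer, addr, value):
        # Every other cache snoops the write and invalidates its copy;
        # invalidation (rather than update) keeps bus traffic low.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)
        writer.lines[addr] = value

class SnoopyCache:
    """Minimal cache: just a dict of addr -> value, attached to the bus."""
    def __init__(self, bus):
        self.lines = {}
        bus.caches.append(self)

bus = Bus()
c0, c1 = SnoopyCache(bus), SnoopyCache(bus)
bus.write(c0, 0x40, 7)   # c0 now caches the line
bus.write(c1, 0x40, 8)   # c1's write invalidates c0's stale copy
```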
AMD Dual Core Opteron Single chip containing • Two AMD64 processors (P0, P1) • A 64KB Level 1 I-cache (L1-I) and a 64KB Level 1 D-cache (L1-D) per processor
• Separate 1MB Level 2 cache (L2) per processor
• System Request Interface (SRI) handles coherence between P0’s and P1’s caches, and also supports fast intra-chip communication between them
• A memory bus controller which accesses off-chip memory via the HyperTransport interface
Directory-Based Cache Coherency • Use indirection
— A directory manages access to each page of memory – Maintains the state of each page (e.g., shared, exclusive,
dirty) – Keeps track of the various cached copies
— All memory accesses go through the directory — The directory can be distributed to increase concurrency and
reduce contention — Added indirection increases latency
• Scalability? — Early studies (DASH) used 64 processors — Few studies on larger numbers of processors
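A minimal sketch of the directory idea, tracking a state and a sharer set per block (names and granularity are illustrative; as noted above, real designs track pages or lines and distribute the directory itself):

```python
class Directory:
    """Toy directory: per-block state ('shared' or 'exclusive') plus the
    set of nodes caching the block. All accesses go through it."""
    def __init__(self):
        self.state = {}   # block -> (state, set of caching nodes)

    def read(self, node, block):
        # A read adds the node to the sharer set.
        _, sharers = self.state.get(block, ('shared', set()))
        self.state[block] = ('shared', sharers | {node})

    def write(self, node, block):
        # A write (conceptually) invalidates every other cached copy,
        # then grants the writer exclusive access.
        self.state[block] = ('exclusive', {node})

d = Directory()
d.read(0, 'b'); d.read(1, 'b')   # two sharers of block 'b'
d.write(2, 'b')                  # node 2 becomes the sole, exclusive owner
```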
Case Study: The KSR-2 • Goal: the best of both worlds
— Provide the scalability of distributed physical memory — Provide the programmability of an SM machine
• COMA — Cache-Only Memory Architecture
– Instead of allocating each memory location to a fixed home, allow the data to move to where it’s used, as is done with caches
The KSR-2 (cont)
• Performance — Typically exhibits poor performance when more than one ring
is used, i.e., not very scalable
• Problems: the worst of both worlds? — Distributed physical memory implies large non-local access
times — The COMA protocol makes it impossible for the programmer
to control locality — Ping pong effect can kill performance
Case Study: The Tera MTA (aka Cray MTA)
• The logical extreme in SM computers: Provide the illusion of uniform access to memory even as P scales to large values
• The key idea — Use parallelism to hide latency — Each processor supports multiple threads. At each clock
cycle, the processor switches to another thread. Latency is hidden because by the time a thread executes its next cycle, any expensive memory access had already completed.
[Figure: multithreaded processor with 128 hardware thread contexts, each with its own register file]
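Under an idealized model where the processor round-robins one instruction per thread per cycle, utilization is min(1, T/L): a thread is revisited every T cycles, so an L-cycle memory access is fully hidden once T ≥ L. This sketch shows why 128 thread contexts suffice for, say, a 100-cycle memory (the 100-cycle figure is illustrative, not an MTA specification):

```python
def processor_utilization(num_threads, mem_latency_cycles):
    """Fraction of cycles doing useful work when every instruction is a
    memory access of the given latency and the processor round-robins
    one instruction per thread per cycle (idealized model)."""
    return min(1.0, num_threads / mem_latency_cycles)

# With 128 interleaved threads and a 100-cycle memory, the pipeline
# never stalls; with only 16 threads it sits idle 84% of the time.
full = processor_utilization(128, 100)   # 1.0
starved = processor_utilization(16, 100)
```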
The Tera MTA (cont) • Massive parallelism
— How do you get so much parallelism? — Exploit parallelism at many levels
– Instruction level – Within basic blocks – Across different processes – Between user code and OS code
• Advantage — Supports hard-to-parallelize applications
• Disadvantage — Everything is custom designed — GaAs instead of CMOS technology
The Tera MTA (cont)
• Interconnection Topology — Sparsely populated 3D Torus
— Why? — P processors with latency L to memory ⇒ network must hold P × L messages if each processor will be busy each cycle
– As L grows, we need to reduce P – This is why urban sprawl is bad
• Memory
– Randomized memory allocation to reduce contention
– No caches
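The P × L requirement is an instance of Little's law (outstanding work = throughput × latency); a one-line sketch:

```python
def messages_in_flight(num_procs, mem_latency_cycles):
    """Little's law: to sustain one memory reference per processor per
    cycle with latency L, the network must hold P * L outstanding
    messages. As L grows, P must shrink to keep this feasible."""
    return num_procs * mem_latency_cycles

# Illustrative numbers: 256 processors at 100-cycle latency means the
# network must buffer 25,600 in-flight messages.
```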
The Tera Computer—Epilogue • MTA-1
— Delivered in late 1990’s — Set record for integer sort in 1997
• MTA-2 — Follow-on to MTA-1 implemented in CMOS technology — Impressive speedups on hard problems [Anderson, et al SC2003]
• Lessons — With a good design, good performance can be delivered for a variety of
application domains
• Aside — Recognizes the importance of good tools — Large compiler effort with excellent personnel — In 2000, Tera Computer Co. bought Cray Research from SGI and renamed itself Cray Inc.
Distributed Memory • Goal
— Provide a scalable architecture — Processes communicate through messages
• Disadvantage — Often considered more difficult to program — The distributed memory model is often mistakenly used
synonymously with “message passing” – This is a short-sighted view, as we can imagine divorcing the
programming model from the hardware substrate
• Examples — Most of the larger machines are distributed memory machines
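The point that the programming model can be divorced from the hardware substrate can be illustrated with a toy message-passing pair that happens to run inside one address space (a sketch using Python's standard `queue` module, not a real message-passing library):

```python
from queue import Queue

# Each "process" owns private state and communicates only through
# explicit send (put) / receive (get) -- the distributed-memory model,
# even though both run on shared-memory hardware here.
channel = Queue()

def producer():
    partial_sum = sum(range(100))   # private, purely local computation
    channel.put(partial_sum)        # explicit communication

def consumer():
    return channel.get()            # blocks until a message arrives

producer()
result = consumer()
```

The same program structure would work unchanged if `put`/`get` were backed by a network between physically distributed memories.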
The Law of Nature • Big fish eat little fish
The Killer Micros • Economies of scale
— Sales of microprocessors took off in the 80’s — Supercomputers with custom-designed processors found it difficult
to compete against those with commodity processors
Networks of Workstations (NOW, COW…)
• Use a distributed system as a supercomputer — Don't just reuse the CPU, reuse the entire workstation, including
the CPU, memory, and I/O interface — Views parallel computing as an extension of distributed computing — Some claim that Networks of Workstations provide parallel
computing for free
• Problems — Software is still not a commodity part — Moreover, the simpler the hardware, the more the software
needs to do — Workstations typically not designed with NOW’s in mind, so some
components are not quite right – e.g., Need to redesign the network interface
Clusters • Basic idea
— Build distributed memory machines from commodity parts, perhaps with some new redesign – e.g., different form factors for rack-mounting
— Connect these workstations with high-speed commodity networks
• Advantages — Scalable price/performance — Can grow the system incrementally — Relatively low cost
• Disadvantages — Relatively high communication latency compared to CPU speed
The Landscape of Parallel Architectures
[Figure: machines plotted by Memory Model (vertical axis) vs. Number of Processors (horizontal axis, roughly 16 to 64,000): coherent shared memory (Core Duo, KSR-2), shared address space (Cray MTA-2), vector supercomputers (Cray Y-MP, Earth Simulator), NSM (Cell, BlueGene/L, Clusters, GPUs)]