COMP 322 Lecture 17 27 October 2009
COMP 322: Principles of Parallel Programming
Lecture 17: Understanding Parallel Computers (Chapter 2)
Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322
Vivek Sarkar Department of Computer Science
Rice University [email protected]
COMP 322, Fall 2009 (V.Sarkar) 2
Acknowledgments for todayʼs lecture • Course text: “Principles of Parallel Programming”, Calvin Lin & Lawrence
Snyder — Includes resources available at
http://www.pearsonhighered.com/educator/academic/product/0,3110,0321487907,00.html
• “Parallel Architectures”, Calvin Lin — Lectures 5 & 6, CS380P, Spring 2009, UT Austin
— http://www.cs.utexas.edu/users/lin/cs380p/schedule.html
• “A Gentler, Kinder Guide to the Multi-core Galaxy” — ECE 4100/6100 guest lecture by Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering Georgia Tech
Design Space • How should we build parallel computers?
• Many dimensions to consider — Here we discuss just a few of them
– Interconnection network – Memory model – Number of processors
— Can we build parallel computers that scale to large sizes?
Interconnection Networks: Buses • Bus
— Simple design — Each processor typically has its own cache — These caches are typically kept coherent using some protocol — Single shared resource limits scalability
— Examples – Core Duo, Sun Enterprise 5500, SGI Power Challenge, IBM Power4
[Figure: four processors (P), each with its own cache, sharing a single bus to Memory]
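A back-of-the-envelope sketch (illustrative numbers, not from the slides) of why the single shared resource limits scalability: assuming the bus bandwidth is divided evenly, each processor's share shrinks linearly as P grows.

```python
def per_processor_bandwidth(bus_bw_gbs, num_procs):
    """Effective memory bandwidth each processor sees when one shared
    bus must serve all of them (idealized uniform sharing)."""
    return bus_bw_gbs / num_procs

# Doubling the processor count halves each processor's share of the bus,
# which is why bus-based designs stop scaling beyond a handful of cores.
share_4 = per_processor_bandwidth(8.0, 4)    # 2.0 GB/s each
share_8 = per_processor_bandwidth(8.0, 8)    # 1.0 GB/s each
```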
Interconnection Networks: MIN • MIN: Multi-stage interconnection networks
— O(logP) switches provide more concurrency than a bus, hence better scalability
— Longer latency since each transfer requires O(logP) hops
— Examples – IBM SP-2, IBM RP3, BBN Butterfly GP1000
[Figure: butterfly-style MIN with logP stages of switches connecting processors (P) on one side to memory modules (M) on the other]
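The stage and hop counts above can be sketched with a couple of hypothetical helpers for a butterfly-style MIN (assuming P is a power of two and 2×2 switches):

```python
import math

def min_stages(p):
    """Stages (and hence hops per transfer) in a butterfly MIN
    connecting p processors to p memories: log2(p)."""
    return math.ceil(math.log2(p))

def min_switches(p):
    """Total 2x2 switches: p/2 switches per stage, one column per stage.
    O(p log p) switches, vs. O(p^2) for a full crossbar."""
    return (p // 2) * min_stages(p)

# Latency grows only logarithmically: 1024 processors need just 10 hops.
```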
Interconnection Networks: Meshes • Meshes
— Fixed point-to-point communication arranged in a mesh or torus. Simpler to build than a MIN.
— Bisection bandwidth does not scale well with P. Bisection bandwidth: the amount of data that can be transferred between any two partitions of the processors.
— Latency is O(√P)
— Examples – Intel Paragon, Cray T3E, Meiko
[Figure: mesh of processor–memory (P M) nodes, with a cut through the middle marking the bisection bandwidth]
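The two scaling claims can be sketched with illustrative helpers (assuming a square √P × √P mesh): only √P links cross the bisection, and opposite corners are O(√P) hops apart.

```python
import math

def mesh_bisection_links(p):
    """Links cut when a sqrt(p) x sqrt(p) mesh is split down the middle:
    one link per row, i.e. sqrt(p). Grows only as the square root of p."""
    side = math.isqrt(p)
    assert side * side == p, "p must be a perfect square"
    return side

def mesh_max_hops(p):
    """Worst-case point-to-point distance (opposite corners):
    2 * (sqrt(p) - 1) hops, i.e. O(sqrt(p)) latency."""
    side = math.isqrt(p)
    return 2 * (side - 1)

# Going from 16 to 1024 processors multiplies P by 64,
# but bisection links grow only 8x -- bandwidth per processor falls.
```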
Hierarchical Interconnection Networks • Fat Trees
— Thinking Machines CM-5
• Ring of rings — Kendall Square Research KSR-2
• Clusters and hybrids
Modern Interconnection Networks • Commercially available
— Myrinet – Point-to-point routed network
— Quadrics – Fat tree topology
— InfiniBand – Designed for I/O systems
— . . .
Interconnection Networks • Is the topology important?
— Used to be the focus of portability studies – e.g. How do you embed a tree structure in a mesh?
— In reality, the topology does not matter too much to programmers — Instead, software overhead dominates
• What is important? — Latency — Bandwidth — As the number of processors grows, latency tends to grow
Topologies affect both of these characteristics
Impact of the Network • Network performance
— Two components: bandwidth and latency
• Effects on programming model — High cost of communication changes what you compute — Compare the Pony Express with Instant Messaging
Dealing with Slow Communication • Long latency, low bandwidth
— The Pony Express: Send as little mail as possible and don’t send time-sensitive material
— Expend effort to reduce communication
• Long latency, high bandwidth — Slow boat to China: Send as much material as you like, but
avoid time-sensitive material — Expend effort to reduce the number of communication operations
• Low latency, low bandwidth — Instant messaging: Send whatever you want whenever you want,
especially if Daddy is paying!
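The three regimes above follow from the standard linear communication-cost model T = latency + n/bandwidth. A sketch with made-up numbers (not measurements) shows why, under long latency, batching many small messages into one large one pays off:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bps):
    """Linear communication-cost model: T = latency + n / bandwidth.
    Each message pays the latency term once, regardless of size."""
    return latency_s + n_bytes / bandwidth_bps

# Hypothetical "slow boat" network: 1 ms latency, 1 GB/s bandwidth.
many_small = 100 * transfer_time(1_000, 1e-3, 1e9)   # 100 small messages
one_big = transfer_time(100_000, 1e-3, 1e9)          # same data, one message

# many_small is dominated by 100 latency payments; one_big pays it once.
```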
Memory Models • Memory Model
— The view of memory that is presented to the programmer
• Shared Address Space — All processors share a single address space — Also known as shared memory (SM)
• Distributed Memory — Each processor sees a disjoint view of memory — Also known as non-shared memory (NSM)
Shared Address Space Machines • Goal
— Simplify the programmer’s task
• Disadvantage — Difficult to build really large SM machines
• Symmetric vs. Asymmetric — Uniform Memory Access (UMA) — Non-Uniform Memory Access (NUMA)
– Physically distributed memory – More scalable
— These are performance characteristics
• Cache Coherence — Cached values are transparently kept consistent by the hardware — This is a semantic issue — Some SM machines do not provide cache coherence (e.g., Cray T3E)
Intel Core Duo Single chip containing • Two 32-bit Pentium processors
(P0, P1) • A 32KB Level 1 Instr cache (L1-I) and a 32KB Level 1 Data cache (L1-D) per processor
• A shared unified 2MB or 4MB Level 2 cache (L2) — Supports fast on-chip
communication between P0 and P1 — Supports a sequential consistency
memory model for sharing between P0 and P1
• A memory bus controller which accesses off-chip memory via the Front Side Bus (FSB)
Intel Core 2 Duo • Homogeneous cores • Bus-based on-chip interconnect • Shared on-die cache • Traditional I/O
[Figure annotations: each core is a classic out-of-order design (reservation stations, issue ports, schedulers, etc.); the shared L2 is large, set-associative, with prefetching]
Source: Intel Corp.
Core 2 Duo Microarchitecture
Cache Coherence on Bus-Based Machines
• Snooping caches — Each cache snoops on the bus looking for accesses to data that it holds — On a read, the cache can return the value faster than main memory — On a write, the cache either invalidates or updates the value that it
holds. – Invalidates are the norm because they reduce network traffic
• Can we provide cache coherence with other types of interconnection networks? — Yes, use broadcast and multicast operations to support snooping
[Figure: four processors (P) with snooping caches on a shared bus to Memory]
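A toy write-invalidate protocol along these lines can be sketched as follows (greatly simplified and purely illustrative; real protocols such as MESI track per-line states and handle reads, writebacks, and races):

```python
class Bus:
    """Shared bus that every cache snoops."""
    def __init__(self):
        self.caches = []

    def write(self, writer, addr, value):
        # Every other cache snoops the write and invalidates its copy;
        # invalidation (rather than update) keeps bus traffic low.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)
        writer.lines[addr] = value

class SnoopyCache:
    """Minimal cache: just a dict of addr -> value, attached to the bus."""
    def __init__(self, bus):
        self.lines = {}
        bus.caches.append(self)

bus = Bus()
c0, c1 = SnoopyCache(bus), SnoopyCache(bus)
bus.write(c0, 0x40, 7)   # c0 now caches the line
bus.write(c1, 0x40, 8)   # c1's write invalidates c0's stale copy
```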
AMD Dual Core Opteron Single chip containing • Two AMD64 processors (P0, P1) • A 64KB Level 1 I-cache (L1-I) and a 64KB Level 1 D-cache (L1-D) per processor
• Separate 1MB Level 2 cache (L2) per processor
• System Request Interface (SRI) handles coherence between P0’s and P1’s caches, and also supports fast intra-chip communication between them
• A memory bus controller which accesses off-chip memory via the HyperTransport interface
Directory-Based Cache Coherency • Use indirection
— A directory manages access to each page of memory – Maintains the state of each page (e.g., shared, exclusive,
dirty) – Keeps track of the various cached copies
— All memory accesses go through the directory — The directory can be distributed to increase concurrency and
reduce contention — Added indirection increases latency
• Scalability? — Early studies (DASH) used 64 processors — Few studies on larger numbers of processors
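A minimal sketch of the directory idea, tracking a state and a sharer set per block (names and granularity are illustrative; as noted above, real designs track pages or lines and distribute the directory itself):

```python
class Directory:
    """Toy directory: per-block state ('shared' or 'exclusive') plus the
    set of nodes caching the block. All accesses go through it."""
    def __init__(self):
        self.state = {}   # block -> (state, set of caching nodes)

    def read(self, node, block):
        # A read adds the node to the sharer set.
        _, sharers = self.state.get(block, ('shared', set()))
        self.state[block] = ('shared', sharers | {node})

    def write(self, node, block):
        # A write (conceptually) invalidates every other cached copy,
        # then grants the writer exclusive access.
        self.state[block] = ('exclusive', {node})

d = Directory()
d.read(0, 'b'); d.read(1, 'b')   # two sharers of block 'b'
d.write(2, 'b')                  # node 2 becomes the sole, exclusive owner
```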
Case Study: The KSR-2 • Goal: the best of both worlds
— Provide the scalability of distributed physical memory — Provide the programmability of an SM machine
• COMA — Cache-Only Memory Architecture
– Instead of allocating each memory location to a fixed home, allow the data to move to where it’s used, as is done with caches
The KSR-2 (cont)
• Performance — Typically exhibits poor performance when more than one ring
is used, i.e., not very scalable
• Problems: the worst of both worlds? — Distributed physical memory implies large non-local access
times — The COMA protocol makes it impossible for the programmer
to control locality — Ping pong effect can kill performance
Case Study: The Tera MTA (aka Cray MTA)
• The logical extreme in SM computers: Provide the illusion of uniform access to memory even as P scales to large values
• The key idea — Use parallelism to hide latency — Each processor supports multiple threads. At each clock
cycle, the processor switches to another thread. Latency is hidden because by the time a thread executes its next cycle, any expensive memory access had already completed.
[Figure: multithreaded processor with 128 hardware thread contexts, each with its own register file]
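Under an idealized model where the processor round-robins one instruction per thread per cycle, utilization is min(1, T/L): a thread is revisited every T cycles, so an L-cycle memory access is fully hidden once T ≥ L. This sketch shows why 128 thread contexts suffice for, say, a 100-cycle memory (the 100-cycle figure is illustrative, not an MTA specification):

```python
def processor_utilization(num_threads, mem_latency_cycles):
    """Fraction of cycles doing useful work when every instruction is a
    memory access of the given latency and the processor round-robins
    one instruction per thread per cycle (idealized model)."""
    return min(1.0, num_threads / mem_latency_cycles)

# With 128 interleaved threads and a 100-cycle memory, the pipeline
# never stalls; with only 16 threads it sits idle 84% of the time.
full = processor_utilization(128, 100)   # 1.0
starved = processor_utilization(16, 100)
```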
The Tera MTA (cont) • Massive parallelism
— How do you get so much parallelism? — Exploit parallelism at many levels
– Instruction level – Within basic blocks – Across different processes – Between user code and OS code
• Advantage — Supports hard-to-parallelize applications
• Disadvantage — Everything is custom designed — GaAs instead of CMOS technology
The Tera MTA (cont)
• Interconnection Topology — Sparsely populated 3D Torus
— Why? — P processors with latency L to memory ⇒ network must hold P × L messages if each processor will be busy each cycle
– As L grows, we need to reduce P – This is why urban sprawl is bad
• Memory
– Randomized memory allocation to reduce contention
– No caches
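The P × L requirement is an instance of Little's law (outstanding work = throughput × latency); a one-line sketch:

```python
def messages_in_flight(num_procs, mem_latency_cycles):
    """Little's law: to sustain one memory reference per processor per
    cycle with latency L, the network must hold P * L outstanding
    messages. As L grows, P must shrink to keep this feasible."""
    return num_procs * mem_latency_cycles

# Illustrative numbers: 256 processors at 100-cycle latency means the
# network must buffer 25,600 in-flight messages.
```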
The Tera Computer—Epilogue • MTA-1
— Delivered in late 1990’s — Set record for integer sort in 1997
• MTA-2 — Follow-on to MTA-1 implemented in CMOS technology — Impressive speedups on hard problems [Anderson, et al SC2003]
• Lessons — With a good design, good performance can be delivered for a variety of
application domains
• Aside — Recognizes the importance of good tools — Large compiler effort with excellent personnel — In 2000, Tera Computer Co. bought Cray Research from SGI and renamed itself Cray Inc.
Distributed Memory • Goal
— Provide a scalable architecture — Processes communicate through messages
• Disadvantage — Often considered more difficult to program — The distributed memory model is often mistakenly used
synonymously with “message passing” – This is a short-sighted view, as we can imagine divorcing the
programming model from the hardware substrate
• Examples — Most of the larger machines are distributed memory machines
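The point that the programming model can be divorced from the hardware substrate can be illustrated with a toy message-passing pair that happens to run inside one address space (a sketch using Python's standard `queue` module, not a real message-passing library):

```python
from queue import Queue

# Each "process" owns private state and communicates only through
# explicit send (put) / receive (get) -- the distributed-memory model,
# even though both run on shared-memory hardware here.
channel = Queue()

def producer():
    partial_sum = sum(range(100))   # private, purely local computation
    channel.put(partial_sum)        # explicit communication

def consumer():
    return channel.get()            # blocks until a message arrives

producer()
result = consumer()
```

The same program structure would work unchanged if `put`/`get` were backed by a network between physically distributed memories.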
The Law of Nature • Big fish eat little fish
The Killer Micros • Economies of scale
— Sales of microprocessors took off in the 80’s — Supercomputers with custom-designed processors found it difficult
to compete against those with commodity processors
Networks of Workstations (NOW, COW…)
• Use a distributed system as a supercomputer — Don't just reuse the CPU, reuse the entire workstation, including
the CPU, memory, and I/O interface — Views parallel computing as an extension of distributed computing — Some claim that Networks of Workstations provide parallel
computing for free
• Problems — Software is still not a commodity part — Moreover, the simpler the hardware, the more the software
needs to do — Workstations typically not designed with NOW’s in mind, so some
components are not quite right – e.g., Need to redesign the network interface
Clusters • Basic idea
— Build distributed memory machines from commodity parts, perhaps with some new redesign – e.g., different form factors for rack-mounting
— Connect these workstations with high-speed commodity networks
• Advantages — Scalable price/performance — Can grow the system incrementally — Relatively low cost
• Disadvantages — Relatively high communication latency compared to CPU speed
The Landscape of Parallel Architectures
[Figure: machines plotted by Memory Model (vertical axis) vs. Number of Processors (horizontal axis, roughly 16 to 64,000): coherent shared memory (Core Duo, KSR-2), shared address space (Cray MTA-2), vector supercomputers (Cray Y-MP, Earth Simulator), NSM (Cell, BlueGene/L, Clusters, GPUs)]