01/28/2009 CS267 Lecture 3
CS 267: Introduction to Parallel Machines and Programming Models
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
Outline
• Overview of parallel machines (~hardware) and programming models (~software)
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Grid
• Parallel machine may or may not be tightly coupled to programming model
  • Historically, tight coupling
  • Today, portability is important
• Trends in real machines
A generic parallel architecture
• Where is the memory physically located?
• Is it connected directly to processors?
• What is the connectivity of the network?
[Figure: several processors (Proc) and memories (Memory) attached to an interconnection network]
Parallel Programming Models
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • What orderings exist between operations?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?
Simple Example
• Consider applying a function f to the elements of an array A and then computing its sum:
$$\sum_{i=0}^{n-1} f(A[i])$$
• Questions:
  • Where does A live? All in single memory? Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result, how?
A = array of all data
fA = f(A)
s = sum(fA)
[Figure: array A is mapped elementwise by f to array fA, which is reduced by sum to the scalar s]
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and reading shared variables.
  • Threads coordinate by synchronizing on shared variables
[Figure: threads on P0, P1, ..., Pn each hold their own i (i: 2, i: 5, i: 8) in private memory; all of them read and write the variable s in shared memory (s = ..., y = ..s ...)]
Simple Example
• Shared memory strategy:
  • small number p << n = size(A) of processors
  • attached to single memory
• Parallel Decomposition:
  • Each evaluation and each partial sum is a task.
• Assign n/p numbers to each of p procs
  • Each computes independent “private” results and partial sum.
  • Collect the p partial sums and compute a global sum.
Two Classes of Data:
• Logically Shared
  • The original n numbers, the global sum.
• Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?
$$\sum_{i=0}^{n-1} f(A[i])$$
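The decomposition above can be sketched in Python. This is an illustrative sketch rather than code from the lecture: the function `f`, the sample array `A`, the worker count `p`, and the `partial` list are all assumptions made for the example. CPython threads are used only to show the structure (private running sums, one published partial sum per worker, then a global combine); they are not expected to give a speedup here.

```python
import threading

def f(x):
    # Hypothetical stand-in for the function f on the slide.
    return x * x

A = list(range(1, 13))   # the n logically shared input numbers
n, p = len(A), 4         # p << n workers
partial = [0] * p        # one slot per worker, written only by that worker

def worker(k):
    # Worker k owns the contiguous block A[k*n//p : (k+1)*n//p].
    lo, hi = k * n // p, (k + 1) * n // p
    local = 0            # logically private running sum
    for i in range(lo, hi):
        local += f(A[i])
    partial[k] = local   # publish this worker's partial sum

threads = [threading.Thread(target=worker, args=(k,)) for k in range(p)]
for t in threads:
    t.start()
for t in threads:
    t.join()             # wait for every partial sum before combining

s = sum(partial)         # collect the p partial sums into the global sum
```

Giving each worker its own slot in `partial` is one possible answer to the question about partial sums: they are written privately and only read by the final combine step.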
Shared Memory “Code” for Computing a Sum
static int s = 0;

Thread 1
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2
  for i = n/2, n-1
    s = s + f(A[i])

• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write.
  - The accesses are concurrent (not synchronized) so they could happen simultaneously
Shared Memory “Code” for Computing a Sum
static int s = 0;

Thread 1
  …
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  …

Thread 2
  …
  compute f(A[i]) and put in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  …

• Assume A = [3,5], $f(x) = x^2$, and s = 0 initially
• For this program to work, s should be $3^2 + 5^2 = 34$ at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • Never see ½ of one number, but the += operation is not atomic
  • All computations happen in (private) registers
[Figure: A = [3, 5] and $f(x) = x^2$; each thread holds 9 or 25 in reg0, both may read s = 0 into reg1, and the final s is then 9 or 25 depending on which thread stores last]
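The three possible results can be checked by brute force. The sketch below is an illustration, not part of the lecture: it models each thread as the three register-level steps shown above (load s, add, store s) and runs every interleaving of the two threads. The `run` function and the schedule encoding are assumptions of this example.

```python
from itertools import permutations

def run(schedule):
    # schedule is a sequence of thread ids; each thread executes its three
    # register-level steps (load s, add, store s) in program order.
    s = 0
    reg0 = {1: 3 * 3, 2: 5 * 5}   # f(A[i]) already computed by each thread
    reg1 = {1: None, 2: None}     # private register of each thread
    step = {1: 0, 2: 0}           # next step each thread will execute
    for t in schedule:
        if step[t] == 0:
            reg1[t] = s                   # reg1 = s
        elif step[t] == 1:
            reg1[t] = reg1[t] + reg0[t]   # reg1 = reg1 + reg0
        else:
            s = reg1[t]                   # s = reg1
        step[t] += 1
    return s

# Every distinct ordering of three steps from thread 1 and three from thread 2.
schedules = set(permutations([1, 1, 1, 2, 2, 2]))
outcomes = sorted({run(sched) for sched in schedules})
print(outcomes)   # prints [9, 25, 34]
```

Only the schedules in which one thread stores s before the other loads it give 34; whenever both threads load s = 0, the last store wins and the other update is lost.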
Improved Code for Computing a Sum
static int s = 0;
static lock lk;

Thread 1
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);

• Since addition is associative, it’s OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
  - The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it)
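A runnable version of the locked pseudocode, sketched with Python's standard `threading` module. The function `f`, the sample array, and the `shared` dictionary standing in for the static variable s are assumptions of this example, not part of the slides.

```python
import threading

def f(x):
    # Hypothetical stand-in for the slide's f.
    return x * x

A = list(range(1000))
n = len(A)
shared = {"s": 0}          # plays the role of "static int s = 0"
lk = threading.Lock()      # plays the role of "static lock lk"

def thread_body(lo, hi):
    local_s = 0                          # private partial sum, no sharing
    for i in range(lo, hi):
        local_s += f(A[i])
    with lk:                             # lock(lk) ... unlock(lk)
        shared["s"] = shared["s"] + local_s

t1 = threading.Thread(target=thread_body, args=(0, n // 2))
t2 = threading.Thread(target=thread_body, args=(n // 2, n))
t1.start()
t2.start()
t1.join()
t2.join()
s = shared["s"]
```

Each thread touches the shared variable once, inside the lock, so the read-modify-write of s can no longer interleave with the other thread's update.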
Machine Model 1a: Shared Memory
• Processors all connected to a large shared memory.
  • Typically called Symmetric Multiprocessors (SMPs)
  • SGI, Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP)
  • Multicore chips, except that all caches are shared
• Difficulty scaling to large numbers of processors
  • <= 32 processors typical
• Advantage: uniform memory access (UMA)
• Cost: much cheaper to access data in cache than main memory.
[Figure: processors P1, P2, ..., Pn, each with its own cache ($), connected by a bus to a shared cache and memory. Note: $ = cache]
Problems Scaling Shared Memory Hardware
• Why not put more processors on (with larger memory)?
  • The memory bus becomes a bottleneck
  • Caches need to be kept coherent
• Example from a Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem
  • Experimental results (and slide) from Pat Worley at ORNL
  • This is an important kernel in atmospheric models
    • 99% of the floating point operations are multiplies or adds, which generally run well on all processors
    • But it does sweeps through memory with little reuse of operands, so uses bus and shared memory frequently
  • These experiments show serial performance, with one “copy” of the code running independently on varying numbers of procs
    • The best case for shared memory: no sharing
    • But the data doesn’t all fit in the registers/cache
From Pat Worley, ORNL
Example: Problem in Scaling Shared Memory
• Performance degradation is a “smooth” function of the number of processes.
• No shared data between them, so there should be perfect parallelism.
• (Code was run for 18 vertical levels with a range of horizontal sizes.)
Machine Model 1b: Multithreaded Processor
• Multiple thread “contexts” without full processors
• Memory and some other state is shared
• Sun Niagara processor (for servers)
  • Up to 64 threads all running simultaneously (8 threads x 8 cores)
  • In addition to sharing memory, they share floating point units
  • Why? Switch between threads for long-latency memory operations
• Cray MTA and Eldorado processors (for HPC)
[Figure: thread contexts T0, T1, ..., Tn above a shared cache ($), shared floating point units, etc., above Memory]
Eldorado Processor (logical view)
[Figure: programs running in parallel (serial code, Subproblem A, Subproblem B) are broken into concurrent threads of computation (loop iterations i = 0, 1, 2, 3, ..., n), which are mapped onto the 128 hardware streams (some unused); the streams feed an instruction ready pool and a pipeline of executing instructions]
Source: John Feo, Cray
Machine Model 1c: Distributed Shared Memory
• Memory is logically shared, but physically distributed
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around machine
• SGI Origin is canonical example (+ research machines)
  • Scales to 512 (SGI Altix (Columbia) at NASA/Ames)
  • Limitation is cache coherency protocols – how to keep cached copies of the same address consistent
[Figure: processors P1, P2, ..., Pn, each with a cache ($) and its own memory, connected by a network]
Cache lines (pages) must be large to amortize overhead, so locality is still critical to performance
Programming Model 2: Message Passing
• Program consists of a collection of named processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO shared data.
  • Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
  • Coordination is implicit in every communication event.
  • MPI (Message Passing Interface) is the most commonly used SW
[Figure: processes P0, P1, ..., Pn have only private memory, each with its own s and i (s: 12, i: 2; s: 14, i: 3; s: 11, i: 1); data moves across the Network only through matching calls such as "send P1,s" and "receive Pn,s"; a local statement such as y = ..s ... uses the local copy of s]
Computing s = A[1]+A[2] on each processor
° First possible solution – what could go wrong?

  Processor 1
    xlocal = A[1]
    send xlocal, proc2
    receive xremote, proc2
    s = xlocal + xremote

  Processor 2
    xlocal = A[2]
    receive xremote, proc1
    send xlocal, proc1
    s = xlocal + xremote

° Second possible solution

  Processor 1
    xlocal = A[1]
    send xlocal, proc2
    receive xremote, proc2
    s = xlocal + xremote

  Processor 2
    xlocal = A[2]
    send xlocal, proc1
    receive xremote, proc1
    s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
° What if there are more than 2 processors?
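The second solution can be simulated with two Python threads and one buffered mailbox per process. This is a sketch under stated assumptions, not MPI: the `send` and `receive` helpers, the `inbox` queues, and the sample values of A are hypothetical, and the unbounded `queue.Queue` gives "post office" semantics, where a send returns before the receiver has picked the message up.

```python
import queue
import threading

A = {1: 10, 2: 32}                             # A[1] lives on proc 1, A[2] on proc 2
inbox = {1: queue.Queue(), 2: queue.Queue()}   # one buffered mailbox per process
result = {}                                    # s as computed by each process

def send(value, dest):
    inbox[dest].put(value)     # buffered: returns without waiting for the receiver

def receive(me):
    return inbox[me].get()     # blocks until a message has arrived

def process(me, other):
    xlocal = A[me]
    send(xlocal, other)        # both processes send first ...
    xremote = receive(me)      # ... and only then receive (second solution)
    result[me] = xlocal + xremote

procs = [threading.Thread(target=process, args=(1, 2)),
         threading.Thread(target=process, args=(2, 1))]
for t in procs:
    t.start()
for t in procs:
    t.join()
```

With "telephone" semantics, where send does not return until the matching receive has started, the same ordering would leave both processes waiting in send forever. The first solution avoids that by pairing each send with a receive that is already posted on the other side.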
MPI – the de facto standard
MPI has become the de facto standard for parallel computing using message passing
Pros and Cons of standards
• MPI finally created a standard for applications development in the HPC community, which gives portability
• The MPI standard is a least common denominator building on mid-80s technology, so may discourage innovation
Programming Model reflects hardware!
“I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere” – HDS 2001
Machine Model 2a: Distributed Memory
• Cray T3E, IBM SP2
• PC Clusters (Berkeley NOW, Beowulf)
• IBM SP-3, Millennium, CITRIS are distributed memory machines, but the nodes are SMPs.
• Each processor has its own memory and cache but cannot directly access another processor’s memory.
• Each “node” has a Network Interface (NI) for all communication and synchronization.
[Figure: nodes P0, P1, ..., Pn, each with its own memory and NI, connected by an interconnect]
PC Clusters: Contributions of Beowulf
• An experiment in parallel computing systems
• Established vision of low cost, high end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Conveyed findings to broad community (great PR)
  • Tutorials and book
• Design standard to rally community!
  • Standards beget: books, trained people, software … virtuous cycle
Adapted from Gordon Bell, presentation at Salishan 2000
Tflop/s and Pflop/s Clusters
The following are examples of clusters configured out of separate networks and processor components
• 82% of Top 500 (Nov 2008, up from 72% in 2005)
  • 3 of top 10
• IBM Cell cluster at Los Alamos (Roadrunner) is #1
  • 12,960 Cell chips + 6,948 dual-core AMD Opterons
  • 129,600 cores altogether
  • 1.45 PFlops peak, 1.1 PFlops Linpack, 2.5 MWatts
  • Infiniband connection network
• In 2005 Walt Disney Feature Animation (The Hive) was #96
  • 1110 Intel Xeons @ 3 GHz
  • Gigabit Ethernet
• For more details use “database/sublist generator” at www.top500.org
Machine Model 2b: Internet/Grid Computing
• SETI@Home: Running on 500,000 PCs
  • ~1000 CPU Years per Day
  • 485,821 CPU Years so far
• Sophisticated Data & Signal Processing Analysis
• Distributes Datasets from Arecibo Radio Telescope
Next Step: Allen Telescope Array
Programming Model 2a: Global Address Space
• Program consists of a collection of named threads.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate point between message passing and shared memory
[Figure: threads on P0, P1, ..., Pn each hold a private i (i: 1, i: 5, i: 8) in private memory; the shared array s is partitioned across them (s[0]: 26, s[1]: 32, s[n]: 27); each thread writes its own element with s[myThread] = ... and may read any element with y = ..s[i] ...]
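The access pattern in the figure (write your own element of s, read anyone's) can be imitated with Python threads. This is only a sketch of the programming style: the names `body`, `n_threads`, and `neighbor` are assumptions, an ordinary Python list stands in for the partitioned shared array, and nothing here models the higher cost of a remote access that a real global address space language such as UPC would expose.

```python
import threading

n_threads = 4
s = [0] * n_threads                      # logically shared array, element k "owned" by thread k
y = [None] * n_threads                   # each thread records what it read
barrier = threading.Barrier(n_threads)

def body(my_thread):
    i = 3 * my_thread + 1                # private variable
    s[my_thread] = 10 * i                # s[myThread] = ... : write the locally owned element
    barrier.wait()                       # all writes complete before anyone reads
    neighbor = (my_thread + 1) % n_threads
    y[my_thread] = s[neighbor]           # y = ..s[i] ... : read a (possibly remote) element

threads = [threading.Thread(target=body, args=(k,)) for k in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier is the explicit synchronization step: unlike message passing, a one-sided read carries no implicit coordination with the thread that wrote the data.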
Machine Model 2c: Global Address Space
• Cray T3D, T3E, X1, and HP Alphaserver cluster
• Clusters built with Quadrics, Myrinet, or Infiniband
• The network interface supports RDMA (Remote Direct Memory Access)
  • NI can directly access memory without interrupting the CPU
  • One processor can read/write memory with one-sided operations (put/get)
  • Not just a load/store as on a shared memory machine
    • Continue computing while waiting for memory op to finish
  • Remote data is typically not cached locally
[Figure: nodes P0, P1, ..., Pn, each with its own memory and NI, connected by an interconnect]
Global address space may be supported in varying degrees
Programming Model 3: Data Parallel
• Single thread of control consisting of parallel operations.
• Parallel operations applied to all (or a defined subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit – statements executed synchronously
  • Similar to Matlab language for array operations
• Drawbacks:
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines
A = array of all data
fA = f(A)
s = sum(fA)
[Figure: array A is mapped elementwise by f to array fA, which is reduced by sum to the scalar s]
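The three-line program on this slide translates almost directly into whole-array operations. A minimal sketch in plain Python, assuming a squaring function for `f` and a small sample array (both hypothetical); Python's `map` and `sum` evaluate serially here, whereas a data parallel system would be free to apply f to all elements at once and to perform the reduction as a tree.

```python
def f(x):
    # Hypothetical elementwise function.
    return x * x

A = [3, 5, 7]          # A = array of all data
fA = list(map(f, A))   # fA = f(A): one parallel operation over the whole array
s = sum(fA)            # s = sum(fA): a reduction over the whole array
```

Note that the program has a single thread of control and no explicit loop order, locks, or messages: any communication is hidden inside the `map` and `sum` operators.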
Machine Model 3a: SIMD System
• A large number of (usually) small processors.
  • A single “control processor” issues each instruction.
  • Each processor executes the same instruction.
  • Some processors may be turned off on some instructions.
• Originally machines were specialized to scientific computing, few made (CM2, Maspar)
• Programming model can be implemented in the compiler
  • mapping n-fold parallelism to p processors, n >> p, but it’s hard (e.g., HPF)
[Figure: a control processor driving many processors, each with its own memory and NI, connected by an interconnect]
Machine Model 3b: Vector Machines
• Vector architectures are based on a single processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
• Historically important
  • Overtaken by MPPs in the 90s
• Re-emerging in recent years
  • At a large scale in the Earth Simulator (NEC SX6) and Cray X1
  • At a small scale in SIMD media extensions to microprocessors
    • SSE, SSE2 (Intel: Pentium/IA64)
    • Altivec (IBM/Motorola/Apple: PowerPC)
    • VIS (Sun: Sparc)
  • At a larger scale in GPUs
• Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
Vector Processors
• Vector instructions operate on a vector of elements
  • These are specified as operations on vector registers
• A supercomputer vector register holds ~32-64 elts
  • The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
• The hardware performs a full vector operation in
  • #elements-per-vector-register / #pipes
[Figure: a scalar add r3 = r1 + r2 next to a vector add vr3 = vr1 + vr2; logically the vector add performs #elts adds in parallel, actually it performs #pipes adds in parallel]
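The ratio #elements-per-vector-register / #pipes can be made concrete with a small simulation. This is an illustrative sketch, not a model of any real machine: `vector_add`, the 64-element registers, and the choice of 4 pipes are assumptions, and each pass of the outer loop stands for one group of adds that the pipes would perform together.

```python
def vector_add(vr1, vr2, pipes):
    """Add two vector registers elementwise, `pipes` elements per step."""
    assert len(vr1) == len(vr2)
    vr3 = [0] * len(vr1)
    steps = 0
    for start in range(0, len(vr1), pipes):
        # The adds in this inner loop are the ones done in parallel by the pipes.
        for j in range(start, min(start + pipes, len(vr1))):
            vr3[j] = vr1[j] + vr2[j]
        steps += 1
    return vr3, steps

vr1 = list(range(64))      # a 64-element vector register
vr2 = [1] * 64
vr3, steps = vector_add(vr1, vr2, pipes=4)
print(steps)               # prints 16, i.e. 64 elements / 4 pipes
```

A single vector instruction therefore names 64 independent adds, while the hardware needs only 4 adders and works through the register in 16 groups.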
Cray X1: Parallel Vector Architecture
Cray combines several technologies in the X1
• 12.8 Gflop/s Vector processors (MSP)
• Shared caches (unusual on earlier vector machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 Processors
• Remote put/get between nodes (faster than MPI)
Earth Simulator Architecture
Parallel Vector Architecture
• High speed (vector) processors
• High memory bandwidth (vector architecture)
• Fast network (new crossbar switch)
Rearranging commodity parts can’t match this performance
Programming Model 4: Hybrids
• These programming models can be mixed
  • Message passing (MPI) at the top level with shared memory within a node is common
  • New DARPA HPCS languages mix data parallel and threads in a global address space
  • Global address space models can (often) call message passing libraries or vice versa
  • Global address space models can be used in a hybrid mode
    • Shared memory when it exists in hardware
    • Communication (done by the runtime system) otherwise
• For better or worse….
Machine Model 4: Clusters of SMPs
• SMPs are the fastest commodity machine, so use them as a building block for a larger machine with a network
• Common names:
  • CLUMP = Cluster of SMPs
  • Hierarchical machines, constellations
• Many modern machines look like this:
  • Millennium, IBM SPs, ASCI machines
• What is an appropriate programming model #4 ???
  • Treat machine as “flat”, always use message passing, even within SMP (simple, but ignores an important part of memory hierarchy).
  • Shared memory within one SMP, but message passing outside of an SMP.
Outline
• Overview of parallel machines and programming models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
• Trends in real machines (www.top500.org)
TOP500
- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from Linpack (Ax=b, dense problem)
- Updated twice a year:
  - ISC‘xy in Germany, June xy
  - SC‘xy in USA, November xy
- All data available from www.top500.org
[Figure: TPP performance, Rate versus Size]
EXTRA SLIDES (TOP 500 FROM NOV 2007)
Slides from TOP500 (Meuer, Dongarra, Strohmaier, Simon) 2007
30th List: The TOP10

| Rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Cores |
|---|---|---|---|---|---|---|---|
| 1 | IBM | BlueGene/L, eServer Blue Gene | 478.2 | DOE/NNSA/LLNL | USA | 2007 | 212,992 |
| 2 | IBM | JUGENE, BlueGene/P Solution | 167.3 | Forschungszentrum Juelich | Germany | 2007 | 65,536 |
| 3 | SGI | SGI Altix ICE 8200 | 126.9 | New Mexico Computing Applications Center | USA | 2007 | 14,336 |
| 4 | HP | Cluster Platform 3000 BL460c | 117.9 | Computational Research Laboratories, TATA SONS | India | 2007 | 14,240 |
| 5 | HP | Cluster Platform 3000 BL460c | 102.8 | Swedish Government Agency | Sweden | 2007 | 13,728 |
| 6 | Sandia/Cray | Red Storm, Cray XT3 | 102.2 | DOE/NNSA/Sandia | USA | 2006 | 26,569 |
| 7 | Cray | Jaguar, Cray XT3/XT4 | 101.7 | DOE/ORNL | USA | 2007 | 23,016 |
| 8 | IBM | BGW, eServer Blue Gene | 91.29 | IBM Thomas Watson | USA | 2005 | 40,960 |
| 9 | Cray | Franklin, Cray XT4 | 85.37 | NERSC/LBNL | USA | 2007 | 19,320 |
| 10 | IBM | New York Blue, eServer Blue Gene | 82.16 | Stony Brook/BNL | USA | 2007 | 36,864 |
Top500 Status
• 1st Top500 List in June 1993 at ISC’93 in Mannheim
• ...
• 29th Top500 List on June 27, 2007 at ISC’07 in Dresden
• 30th Top500 List on November 13, 2007 at SC07 in Reno
• 31st Top500 List on June 18, 2008 at ISC’08 in Dresden
• 32nd Top500 List on November 18, 2008 at SC08 in Austin
• 33rd Top500 List on June 24, 2009 at ISC’09 in Hamburg
Acknowledged by HPC-users, manufacturers and media
Competition between Manufacturers, Countries and Sites
Competition between Manufacturers, Countries and Sites

1st List as of 06/1993

| Countries | Count | Share % |
|---|---|---|
| USA | 225 | 45.00 % |
| Japan | 111 | 22.20 % |
| Germany | 59 | 11.80 % |
| France | 26 | 5.20 % |
| United Kingdom | 25 | 5.00 % |
| Australia | 9 | 1.80 % |
| Italy | 6 | 1.20 % |
| Netherlands | 6 | 1.20 % |
| Switzerland | 4 | 0.80 % |
| Canada | 3 | 0.60 % |
| Denmark | 3 | 0.60 % |
| Korea | 3 | 0.60 % |
| Others | 20 | 4.00 % |
| Totals | 500 | 100 % |

30th List as of 11/2007

| Countries | Count | Share % |
|---|---|---|
| USA | 283 | 56.60 % |
| Japan | 20 | 4.00 % |
| Germany | 31 | 6.20 % |
| France | 17 | 3.40 % |
| United Kingdom | 48 | 9.60 % |
| Australia | 1 | 0.20 % |
| Italy | 6 | 1.20 % |
| Netherlands | 6 | 1.20 % |
| Switzerland | 7 | 1.40 % |
| Canada | 5 | 1.00 % |
| Denmark | 1 | 0.20 % |
| Korea | 1 | 0.20 % |
| China | 10 | 2.00 % |
| India | 9 | 1.80 % |
| Others | 55 | 11.00 % |
| Totals | 500 | 100 % |
Competition between Manufacturers, Countries and Sites
[Figure: Countries/Systems. Number of Top500 systems (0 to 500) per country for each year from 1993 to 2007; series: US, Japan, Germany, UK, France, Italy, Korea, China, Others]
Competition between Manufacturers, Countries and Sites

1st List as of 06/1993

| Manufacturers | Count | Share % |
|---|---|---|
| Cray Research | 205 | 41.00 % |
| Fujitsu | 69 | 13.80 % |
| Thinking Machines | 54 | 10.80 % |
| Intel | 44 | 8.80 % |
| Convex | 36 | 7.20 % |
| NEC | 32 | 6.40 % |
| Kendall Square Research | 21 | 4.20 % |
| MasPar | 18 | 3.60 % |
| Meiko | 9 | 1.80 % |
| Hitachi | 6 | 1.20 % |
| Parsytec | 3 | 0.60 % |
| nCube | 3 | 0.60 % |
| Totals | 500 | 100 % |

30th List as of 11/2007

| Manufacturers | Count | Share % |
|---|---|---|
| Cray Inc. | 14 | 2.80 % |
| Fujitsu | 3 | 0.60 % |
| Thinking Machines | — | — |
| Intel | 1 | 0.20 % |
| Hewlett-Packard | 166 | 33.20 % |
| NEC | 2 | 0.40 % |
| Kendall Square Research | — | — |
| MasPar | — | — |
| Meiko | — | — |
| Hitachi/Fujitsu | 1 | 0.20 % |
| Parsytec | — | — |
| nCube | — | — |
| IBM | 232 | 46.40 % |
| SGI | 22 | 4.40 % |
| Dell | 24 | 4.80 % |
| Others | 35 | 7.00 % |
| Totals | 500 | 100 % |
Competition between Manufacturers, Countries and Sites
[Figure: Manufacturer/Systems. Number of Top500 systems (0 to 500) per manufacturer for each year from 1993 to 2007; series: Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others]
Top Sites through 30 Lists
Competition between Manufacturers, Countries and Sites

| Rank | Site | Country | % over time |
|---|---|---|---|
| 1 | Lawrence Livermore National Laboratory | United States | 5.39 |
| 2 | Sandia National Laboratories | United States | 3.70 |
| 3 | Los Alamos National Laboratory | United States | 3.41 |
| 4 | Government | United States | 3.34 |
| 5 | The Earth Simulator Center | Japan | 1.99 |
| 6 | National Aerospace Laboratory of Japan | Japan | 1.70 |
| 7 | Oak Ridge National Laboratory | United States | 1.39 |
| 8 | NCSA | United States | 1.31 |
| 9 | NASA/Ames Research Center/NAS | United States | 1.25 |
| 10 | University of Tokyo | Japan | 1.21 |
| 11 | NERSC/LBNL | United States | 1.19 |
| 12 | Pittsburgh Supercomputing Center | United States | 1.15 |
| 13 | Semiconductor Company (C) | United States | 1.11 |
| 14 | Naval Oceanographic Office (NAVOCEANO) | United States | 1.08 |
| 15 | ECMWF | United Kingdom | 1.02 |
| 16 | ERDC MSRC | United States | 0.91 |
| 17 | IBM Thomas J. Watson Research Center | United States | 0.86 |
| 18 | Forschungszentrum Juelich (FZJ) | Germany | 0.84 |
| 19 | Japan Atomic Energy Research Institute | Japan | 0.83 |
| 20 | Minnesota Supercomputer Center | United States | 0.74 |
![Page 44: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/44.jpg)
My Supercomputer Favorite in the Top500 Lists
[Figure: Number of Teraflop/s Systems in the Top500. Count axis with ticks at 1, 10, 100, 500; June 1997 to June 2005.]
![Page 45: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/45.jpg)
The 30th List as of November 2007
[Figure: Processor Architecture/Systems. Number of systems (0 to 500), 1993 to 2007. Legend: Scalar, Vector, SIMD.]
![Page 46: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/46.jpg)
The 30th List as of November 2007
[Figure: Operating Systems/Systems. Number of systems (0 to 500), 1993 to 2007. Legend: Unix, Linux, BSD Based, Mixed, N/A, Mac OS, Windows.]
![Page 47: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/47.jpg)
The 30th List as of November 2007
Processor Generation/Systems (November 2007)
![Page 48: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/48.jpg)
The 30th List as of November 2007
[Figure: Interconnect Family/Systems. Number of systems (0 to 500), 1993 to 2007. Legend: N/A, Crossbar, Gigabit Ethernet, SP Switch, Myrinet, Cray Interconnect, Infiniband, Fat Tree, Proprietary, Quadrics, Others.]
![Page 49: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/49.jpg)
The 30th List as of November 2007
[Figure: Architectures / Systems. Number of systems (0 to 500), 1993 to 2007. Legend: MPP, Cluster, Constellations, SMP, Single Proc., SIMD.]
![Page 50: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/50.jpg)
Performance Development and Projection
[Figure: Performance Development, 1993 to 2007; performance axis from 100 Mflop/s to 1 Pflop/s. Series with first and last values: SUM, 1.167 TF/s to 6.97 PF/s; N=1, 59.7 GF/s to 478.2 TF/s; N=500, 0.4 GF/s to 5.9 TF/s. Labeled systems: Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), NEC Earth Simulator, IBM BlueGene/L. Reference label: Jack's Notebook.]
![Page 51: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/51.jpg)
Performance Development and Projection
[Figure: Performance Projection, 1993 to 2015; performance axis from 100 Mflop/s to 1 Eflop/s. Series: SUM, N=1, N=500. Annotations: "6-8 years", "8-10 years", "1 Pflop/s", "Jack's Notebook".]
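The endpoint values printed on the Performance Development chart allow a rough check of the growth rate behind such a projection. The sketch below assumes the values pair up as SUM 1.167 TF/s to 6.97 PF/s, N=1 59.7 GF/s to 478.2 TF/s, and N=500 0.4 GF/s to 5.9 TF/s, measured between the June 1993 and November 2007 lists. That pairing, the constant-growth model, and the helper names are assumptions made for illustration, not part of the slides.

```python
import math

# Endpoint values in flop/s. The pairing of values to series is an assumption.
SERIES = {
    "SUM":   (1.167e12, 6.97e15),
    "N=1":   (59.7e9,   478.2e12),
    "N=500": (0.4e9,    5.9e12),
}

# Elapsed time between the June 1993 list and the November 2007 list, in years.
YEARS = (2007 + 10 / 12) - (1993 + 5 / 12)


def annual_growth(start, end, years=YEARS):
    """Constant yearly growth factor that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years)


def years_to_reach(current, target, growth):
    """Years needed to grow from `current` to `target` at a yearly factor `growth`."""
    return math.log(target / current) / math.log(growth)


if __name__ == "__main__":
    for name, (start, end) in SERIES.items():
        print(f"{name:6s} grows about {annual_growth(start, end):.2f}x per year")

    g1 = annual_growth(*SERIES["N=1"])
    wait = years_to_reach(SERIES["N=1"][1], 1e15, g1)
    print(f"At that rate N=1 passes 1 Pflop/s about {wait:.1f} years after Nov 2007")
```

A straight line on a log-scale chart corresponds to a constant yearly factor, which is why two endpoints are enough for this kind of estimate.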
![Page 52: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/52.jpg)
Analysis of TOP500 Data
• Annual performance growth is about a factor of 1.82.
• Two factors contribute almost equally to the annual total performance growth:
  • The number of processors grows on average by a factor of 1.30 per year.
  • Processor performance grows by a factor of 1.40 per year, compared with 1.58 for Moore's Law.
Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp 1517-1544.
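The two factors multiply: 1.30 × 1.40 = 1.82. A minimal sketch of that arithmetic follows. The 18-month doubling period used for Moore's Law is an assumption (it gives roughly the 1.58 quoted on the slide), and the function names are illustrative.

```python
import math

PROCESSOR_COUNT_GROWTH = 1.30  # yearly factor, from the slide
PROCESSOR_PERF_GROWTH = 1.40   # yearly factor, from the slide
MOORE_ANNUAL = 2 ** (12 / 18)  # assumption: doubling every 18 months


def combined_growth(*factors):
    """Independent yearly growth factors multiply."""
    return math.prod(factors)


def log_share(factor, total):
    """Fraction of the total growth, on a log scale, due to one factor."""
    return math.log(factor) / math.log(total)


def doubling_time_years(growth):
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(growth)


if __name__ == "__main__":
    total = combined_growth(PROCESSOR_COUNT_GROWTH, PROCESSOR_PERF_GROWTH)
    print(f"total yearly growth: {total:.2f}")
    print(f"share from processor count: {log_share(PROCESSOR_COUNT_GROWTH, total):.0%}")
    print(f"share from processor speed: {log_share(PROCESSOR_PERF_GROWTH, total):.0%}")
    print(f"doubling time: {doubling_time_years(total):.2f} years")
    print(f"Moore's Law yearly factor under the 18-month assumption: {MOORE_ANNUAL:.2f}")
```

Shares are computed on a log scale because the factors combine by multiplication, so their logarithms add.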
![Page 53: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/53.jpg)
Bell's Law
Bell's Law of Computer Class formation was discovered about 1972. It states that technology advances in semiconductors, storage, user interfaces, and networking enable, roughly every decade, a new, usually lower-priced computing platform to form. Once formed, each class is maintained as a quite independent industry structure. This explains mainframes, minicomputers, workstations and personal computers, the web, emerging web services, palm and mobile devices, and ubiquitous interconnected networks. We can expect home and body area networks to follow this path.
From Gordon Bell (2007), http://research.microsoft.com/~GBell/Pubs.htm
![Page 54: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/54.jpg)
Bell's Law
Bell's Law states that:
• Important classes of computer architectures come in cycles of about 10 years.
• It takes about a decade for each phase:
  • Early research
  • Early adoption and maturation
  • Prime usage
  • Phase out past its prime
Can we use Bell's Law to classify computer architectures in the TOP500?
![Page 55: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/55.jpg)
Bell's Law
Gordon Bell (1972): 10-year cycles for computer classes
Computer classes in HPC based on the TOP500:
• Data Parallel Systems:
  • Vector (Cray Y-MP and X1, NEC SX, …)
  • SIMD (CM-2, …)
• Custom Scalar Systems:
  • MPP (Cray T3E and XT3, IBM SP, …)
  • Scalar SMPs and Constellations (clusters of big SMPs)
• Commodity Cluster: NOW, PC cluster, Blades, …
• Power-Efficient Systems (BG/L or BG/P as first example of low-power / embedded systems = potential new class?)
  • Tsubame with Clearspeed, Roadrunner with Cell?
![Page 56: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/56.jpg)
Bell's Law
[Figure: Computer Classes / Systems. Number of systems (0 to 500), 1993 to 2007. Legend: Vector/SIMD, Custom Scalar, Commodity Cluster, BG/L or BG/P.]
![Page 57: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/57.jpg)
Bell's Law
[Figure: Computer Classes - refined / Systems. Number of systems (0 to 500), 1993 to 2007. Legend: Vector, SIMD, MPP, SMP/Constellations, Commodity Cluster, BG/L or BG/P.]
![Page 58: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/58.jpg)
Bell's Law
HPC Computer Classes
Class                   Early Adoption starts   Prime Use starts   Past Prime Usage starts
Data Parallel Systems   Mid 70's                Mid 80's           Mid 90's
Custom Scalar Systems   Mid 80's                Mid 90's           Mid 2000's
Commodity Cluster       Mid 90's                Mid 2000's         Mid 2010's ???
BG/L or BG/P            Mid 2000's              Mid 2010's ???     Mid 2020's ???
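The class table can be read as a lookup from year to life-cycle phase. The sketch below is one hedged way to encode it. Reading "Mid 70's" as 1975 (and so on) is an assumption, the function name is hypothetical, and the entries the slide marks with "???" are speculative on the slide itself.

```python
# Phase start years per class. "Mid 70's" is read as 1975, etc. (an assumption).
# The 2015 and 2025 entries carry "???" on the slide and are speculative.
PHASE_STARTS = {
    "Data Parallel Systems": (1975, 1985, 1995),
    "Custom Scalar Systems": (1985, 1995, 2005),
    "Commodity Cluster":     (1995, 2005, 2015),
    "BG/L or BG/P":          (2005, 2015, 2025),
}
PHASES = ("Early Adoption", "Prime Use", "Past Prime Usage")


def phase_in(computer_class, year):
    """Return the phase a class is in for a given year, under the table above."""
    current = "Early Research"  # the phase that precedes early adoption
    for name, start in zip(PHASES, PHASE_STARTS[computer_class]):
        if year >= start:
            current = name
    return current


if __name__ == "__main__":
    for cls in PHASE_STARTS:
        print(f"{cls:22s} in 2007: {phase_in(cls, 2007)}")
```

Each row is shifted by one decade from the row above it, which is the 10-year cycle the slides attribute to Bell's Law.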
![Page 59: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/59.jpg)
Supercomputing, quo vadis?
Countries/Systems (November 2007)
![Page 60: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/60.jpg)
Supercomputing, quo vadis?
Countries/Performance (November 2007)
![Page 61: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/61.jpg)
Supercomputing, quo vadis?
TOP50 Countries/Systems (November 2007)
![Page 62: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/62.jpg)
Supercomputing, quo vadis?
[Figure: Continents/Systems. Number of systems (0 to 500), 1993 to 2007. Legend: Americas, Europe, Asia, Others.]
![Page 63: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/63.jpg)
Supercomputing, quo vadis?
[Figure: Asian Countries / Systems. Number of systems (0 to 120), 1993 to 2007. Legend: Japan, South Korea, China, India, Others.]
![Page 64: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/64.jpg)
Supercomputing, quo vadis?
[Figure: Producing Regions / Systems. Number of systems (0 to 500), 1993 to 2007. Legend: USA, Japan, Europe, Others.]
![Page 65: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/65.jpg)
www.top500.org
Top500, quo vadis?
![Page 66: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/66.jpg)
Top500, quo vadis?
Summary after Fifteen Years of Experience
• TOP500 proved itself by correcting the Mannheim Supercomputer Statistics.
• It is simplistic, but (or because of that) it gets trends right.
• It does not easily allow tracking of market size.
• It is inventory based, which smooths seasonal fluctuation.
• Turnover is very high, so it still reflects recent developments.
[Figure: Replacement Rate. Vertical axis 0 to 300; November 1993 to November 2007.]
![Page 67: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/67.jpg)
Top500, quo vadis?
Motivation for Additional Benchmarks
Linpack Benchmark
• Pros
  • One number
  • Simple to define and rank
  • Allows problem size to change with machine and over time
  • Allows competitions
• Cons
  • Emphasizes only "peak" CPU speed and number of CPUs
  • Does not stress local bandwidth
  • Does not stress the network
  • No single number can reflect overall performance
Clearly need something more than Linpack
• HPC Challenge Benchmark and others
![Page 68: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/68.jpg)
Top500, quo vadis?
Presented at ISC'06, June 27-30, 2006
![Page 69: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/69.jpg)
Top500, quo vadis?
![Page 70: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/70.jpg)
Top500, quo vadis?
Green500 (http://www.green500.org)
Rank  MFLOPS/W  Site                                                             Computer                    Total Power (kW)  TOP500 Rank
1     357.23    Science and Technology Facilities Council - Daresbury Laboratory  Blue Gene/P Solution        31.10             121
2     352.25    Max-Planck-Gesellschaft MPI/IPP                                   Blue Gene/P Solution        62.20             40
3     346.95    IBM - Rochester                                                   Blue Gene/P Solution        124.40            24
4     336.21    Forschungszentrum Juelich (FZJ)                                   Blue Gene/P Solution        497.60            2
5     310.93    Oak Ridge National Laboratory                                     Blue Gene/P Solution        70.47             41
6     210.56    Harvard University                                                eServer Blue Gene Solution  44.80             170
7     210.56    High Energy Accelerator Research Organization /KEK                eServer Blue Gene Solution  44.80             171
8     210.56    IBM - Almaden Research Center                                     eServer Blue Gene Solution  44.80             172
9     210.56    IBM Research                                                      eServer Blue Gene Solution  44.80             173
10    210.56    IBM Thomas J. Watson Research Center                              eServer Blue Gene Solution  44.80             174
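The efficiency and power columns together imply a sustained performance for each system. The sketch below assumes MFLOPS/W is computed as Linpack performance divided by total power, which is an assumption about how the list is built rather than something the slide states. The rows are copied from the table and the function name is illustrative.

```python
# (Green500 rank, MFLOPS/W, total power in kW, TOP500 rank), from the table.
ROWS = [
    (1, 357.23, 31.10, 121),
    (2, 352.25, 62.20, 40),
    (3, 346.95, 124.40, 24),
    (4, 336.21, 497.60, 2),
    (5, 310.93, 70.47, 41),
    (6, 210.56, 44.80, 170),
]


def implied_tflops(mflops_per_watt, power_kw):
    """Performance implied by efficiency times power, in Tflop/s.

    Assumes efficiency = performance / total power.
    MFLOPS/W * kW * 1000 W/kW gives MFLOPS; dividing by 1e6 gives Tflop/s.
    """
    return mflops_per_watt * power_kw * 1000.0 / 1e6


if __name__ == "__main__":
    for green_rank, eff, power_kw, top500_rank in ROWS:
        tflops = implied_tflops(eff, power_kw)
        print(f"Green500 #{green_rank} (TOP500 #{top500_rank}): about {tflops:.1f} Tflop/s")
```

The most efficient system in the table is also the smallest of the top five by power, which is why its TOP500 rank is far lower than its Green500 rank.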
![Page 71: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/71.jpg)
01/28/2009 CS267 Lecture 4 72
Summary
• Historically, each parallel machine was unique, along with its programming model and programming language.
• It was necessary to throw away software and start over with each new kind of machine.
• Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines.
• MPI is now the most portable option, but it can be tedious.
• Writing portably fast code requires tuning for the architecture.
• The algorithm design challenge is to make this process easy.
  • Example: picking a blocksize, not rewriting the whole algorithm.
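The blocksize example can be sketched with a blocked matrix multiply in which the block size is the only tuning parameter. This is an illustrative sketch, not code from the course: the function name and the default block size are assumptions, and pure Python is used only to keep the example self-contained.

```python
import time


def matmul_blocked(a, b, block=32):
    """Multiply matrices (lists of rows) tile by tile.

    `block` is the tuning knob: changing it changes the memory access
    pattern, not the algorithm or the result.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c


if __name__ == "__main__":
    size = 64
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    # Tuning for a machine means trying candidate block sizes and keeping the fastest.
    for block in (8, 16, 32, 64):
        start = time.perf_counter()
        matmul_blocked(a, b, block)
        print(f"block={block:3d}: {time.perf_counter() - start:.3f} s")
```

In an interpreted language the timings mostly reflect interpreter overhead rather than cache behavior, so the loop shows only the shape of the tuning process. In a compiled language the same single parameter is what would be adjusted per machine.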
![Page 72: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/72.jpg)
Reading Assignment
• Extra reading for today
  • Cray X1: http://www.sc-conference.org/sc2003/paperpdfs/pap183.pdf
  • Clusters: http://www.mirror.ac.uk/sites/www.beowulf.org/papers/ICPP95/
  • "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta, Chapter 1.
• Next week: Current high performance architectures
  • Shared memory (for Monday)
    • Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Gharachorloo et al., Proceedings of the International Symposium on Computer Architecture, 1990.
    • Or read about the Altix system on the web (www.sgi.com)
  • Blue Gene L (for Wednesday)
    • http://sc-2002.org/paperpdfs/pap.pap207.pdf
![Page 73: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/73.jpg)
Open Source Software Model for HPC
• Linus's law, named after Linus Torvalds, the creator of Linux, states that "given enough eyeballs, all bugs are shallow".
  • All source code is "open"
  • Everyone is a tester
  • Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered)
• Software is or should be free (Stallman)
  • Anyone can support and market the code for any price
  • Zero cost software attracts users!
  • Prevents community from losing HPC software (CM5, T3E)
![Page 74: 01/28/2009CS267 Lecture 31 CS 267: Introduction to Parallel Machines and Programming Models James Demmel demmel/cs267_Spr09.](https://reader035.fdocuments.net/reader035/viewer/2022070409/56649ea15503460f94ba4850/html5/thumbnails/74.jpg)
Cluster of SMP Approach
• A supercomputer is a stretched high-end server
• A parallel system is built by assembling nodes that are modest-size, commercial SMP servers; just put more of them together
Image from LLNL