Scalable Parallel Architectures and their Software

Transcript of "Scalable Parallel Architectures and their Software"
NPACI Parallel Computing Seminars, San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
Introduction
• Overview of RISC CPUs, Memory Hierarchy
• Parallel Systems - General Hardware Layout (SMP, Distributed, Hybrid)
• Communications Networks for Parallel Systems
• Parallel I/O
• Operating Systems Concepts
• Overview of Parallel Programming Methodologies
  – Distributed-Memory
  – Shared-Memory
• Hardware Specifics of NPACI Parallel Machines
  – IBM SP Blue Horizon
  – New CPU Architectures
    • IBM POWER4
    • Intel IA-64
What is Parallel Computing?
• Parallel computing: the use of multiple computers, processors, or processes working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information (data in local memory) with other processors

[Figure: a 2-D grid of the problem to be solved, divided into four areas; CPUs #1-#4 each work on one area and exchange data across the shared boundaries.]
Why Parallel Computing?
• Limits of single-CPU computing
  – Available memory
  – Performance - usually "time to solution"
• Limits of vector computers - the main HPC alternative
  – System cost, including maintenance
  – Cost/MFlop
• Parallel computing allows:
  – Solving problems that don't fit on a single CPU
  – Solving problems that can't be solved in a reasonable time on one CPU
• We can run…
  – Larger problems
  – Finer resolution
  – Faster
  – More cases
Scalable Parallel Computer Systems
Scalable [ (CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System
Scalable Parallel Computer Systems
Scalability: a parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality

• Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.)
• Application scalability: performance continues to improve as
  – machine size grows
  – problem size grows
Shared and Distributed Memory Systems
Multiprocessor (shared memory) - single address space; all processors have access to a pool of shared memory. Examples: SUN HPC, CRAY T90, NEC SX-6
Methods of memory access: bus, crossbar

Multicomputer (distributed memory) - each processor has its own local memory. Examples: CRAY T3E, IBM SP2, PC cluster

[Figure: shared-memory system - four CPUs connected to a single MEMORY through a bus/crossbar; distributed-memory system - four CPUs, each with its own memory M, connected by a NETWORK.]
Hybrid (SMP Clusters) Systems
[Figure: three SMP nodes, each with four CPUs sharing a MEMORY through an on-node interconnect, joined by an off-node network.]

Hybrid architecture - processes share memory on-node, may/must use message-passing off-node, and may share off-node memory. Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer
RISC-Based Computer Hardware Concepts
RISC CPUs are the most common CPUs in HPC - many design concepts were transferred from vector CPUs to RISC, and later to CISC
• Multiple Functional Units
• Pipelined Instructions
• Memory Hierarchy
• Instructions typically take 1-several CPU clock cycles
– Clock cycles provide time scale for measurement
• Data transfers – memory-to-CPU, network, I/O, etc.
Processor Related Terms
• RISC : Reduced Instruction Set Computers
• PIPELINE : Technique where multiple instructions are overlapped in execution
• SUPERSCALAR : Computer design feature - multiple instructions can be executed per clock period
Note (Laura C. Nett): the instruction set is just how each operation is processed, e.g., x = y + 1: load y, add 1, put the result in x.
‘Typical’ RISC CPU
[Figure: a 'typical' RISC CPU - registers r0 … r32 feed the functional units (FP Add, FP Multiply, FP Divide, FP Multiply & Add) on the CPU chip; loads & stores move data between the registers and memory/cache.]
Functional Unit
• Fully segmented multiply pipeline: A(I) = C(I)*D(I), with the operand streams C(I) and D(I) flowing through the pipeline stages

[Figure: chair-building analogy of a pipelined functional unit - Carpenters 1-5 each perform one stage of assembly, so once the pipeline is full a finished chair emerges every stage-time.]
Dual Hardware Pipes
A(I) = C(I)*D(I)

[Figure: dual hardware pipes - odd elements C(I), D(I) feed one pipe while even elements C(I+1), D(I+1) feed the other, producing the results A(I) and A(I+1) together.]
RISC Memory/Cache Related Terms
• ICACHE: instruction cache
• DCACHE (Level 1): data cache closest to registers
• SCACHE (Level 2): secondary data cache
  – Data from SCACHE must go through DCACHE to reach the registers
  – SCACHE is larger than DCACHE
  – Not all processors have an SCACHE
• CACHE LINE: minimum transfer unit (usually in bytes) for moving data between different levels of the memory hierarchy
• TLB: translation-look-aside buffer; keeps the addresses of recently accessed pages (blocks of memory) in main memory
• MEMORY BANDWIDTH: transfer rate (in MBytes/sec) between different levels of memory
• MEMORY ACCESS TIME: time required (often measured in clock cycles) to bring data items from one level in memory to another
• CACHE COHERENCY: mechanism for ensuring data consistency of shared variables across the memory hierarchy
RISC CPU, Cache, and Memory - Basic Layout

[Figure: memory hierarchy - CPU registers, Level 1 cache, Level 2 cache, and main memory; moving away from the CPU, speed and cost ($/bit) decrease while size increases.]
RISC Memory/Cache Related Terms (cont.)
Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.
[Figure: direct-mapped cache - each block of main memory maps to exactly one location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Fully associative cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.
[Figure: fully associative cache - a block of main memory may be placed in any location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Set associative cache: the middle range of designs between direct-mapped and fully associative caches is called set-associative. In an n-way set-associative cache, a block from main memory can go into any of n (n at least 2) locations in the cache.
[Figure: 2-way set-associative cache - each block of main memory maps to one set and may occupy either of that set's two locations.]
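To make the three mappings concrete, here is a minimal C sketch (added here, not from the original slides; the 64-byte line size and 256 sets are assumed example values) of how a cache splits an address into block, set, and tag. Direct-mapped is the 1-way special case, and fully associative is the single-set special case.

/* Hypothetical cache geometry (assumed values, for illustration). */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line (assumed) */
#define NUM_SETS  256  /* number of sets (assumed)       */

/* Split an address into the fields a set-associative cache uses. */
static void map_address(uintptr_t addr, unsigned ways) {
    uintptr_t block = addr / LINE_SIZE;   /* which memory block          */
    unsigned  set   = block % NUM_SETS;   /* the one set it must go to   */
    uintptr_t tag   = block / NUM_SETS;   /* identifies the block in set */
    printf("addr 0x%lx -> set %u (any of %u ways), tag 0x%lx\n",
           (unsigned long)addr, set, ways, (unsigned long)tag);
}

int main(void) {
    /* Two addresses exactly NUM_SETS*LINE_SIZE bytes apart map to the
     * same set: in a direct-mapped cache (ways = 1) they evict each
     * other; with 2 ways they can coexist. */
    map_address(0x10000, 2);
    map_address(0x10000 + NUM_SETS * LINE_SIZE, 2);
    return 0;
}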
RISC Memory/Cache Related Terms
• The data cache was designed to allow programmers to take advantage of common data access patterns:
  – Spatial locality: when an array element is referenced, its neighbors are likely to be referenced
    • Cache lines are fetched together
    • Work on consecutive data elements in the same cache line
  – Temporal locality: when an array element is referenced, it is likely to be referenced again soon
    • Arrange code so that data in cache is reused as often as possible
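A small C illustration (added here; the array size is an arbitrary example) of the two patterns. In C, 2-D arrays are stored row-major, so the inner loop should walk along rows:

#include <stdio.h>
#include <stddef.h>

#define N 1024

static double a[N][N];

/* Cache-friendly: the inner loop walks consecutive elements of a row,
 * so each fetched cache line is fully used (spatial locality). */
static double sum_rows(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-unfriendly: the inner loop strides down a column, touching a
 * new cache line on every access and discarding the rest of each line. */
static double sum_cols(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("rows: %.0f  cols: %.0f\n", sum_rows(), sum_cols());
    return 0;
}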
Typical RISC Floating-Point Operation Times
IBM POWER3-II
• CPU clock speed - 375 MHz (~3 ns per cycle)

Instruction          32-bit (cycles)   64-bit (cycles)
FP Multiply or Add        3-4               3-4
FP Multiply-Add           3-4               3-4
FP Square Root           14-23             22-31
FP Divide                14-21             18-25
Typical RISC Memory Access Times
IBM POWER3-II

Access                        Bandwidth (GB/s)   Time (cycles)
Load register from L1               3.2                1
Store register to L1                1.6                1
Load/store L1 from/to L2            6.4                9
Load/store L1 from/to RAM           1.6               35
Single CPU Optimization
Optimization of the serial (single-CPU) version is very important
• Want to parallelize the best serial version - where appropriate
New CPUs in HPC
New CPU designs with new features
• IBM POWER 4
– U Texas Regatta nodes – covered on Wednesday
• Intel IA-64
– SDSC DTF TeraGrid PC Linux Cluster
Parallel Networks
The function of a network is to transfer data from source to destination in support of the network transactions that realize the supported programming model(s).
Data transfer can be for message-passing and/or shared-memory operations.
• Network Terminology
• Common Parallel Networks
System Interconnect Topologies

Send information among CPUs through a network. The best choice would be a fully connected network, in which each processor has a direct link to every other processor. Such a network would be very expensive and difficult to scale (cost grows ~N*N). Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.

[Figure: 3-D hypercube, 2-D mesh, and 2-D torus topologies.]
Network Terminology
• Network latency: the time taken to begin sending a message. Units are microseconds, milliseconds, etc. Smaller is better.
• Network bandwidth: the rate at which data is transferred from one point to another. Units are bytes/sec, Mbytes/sec, etc. Larger is better.
  – May vary with data size

For the IBM Blue Horizon:

Switch type   Latency                      Bandwidth (MB/sec)
US            ~17 μs (~6000 CPU cycles)    ~350
Network Terminology

Bus
• Shared data path
• Data requests require exclusive access
• Complexity ~ O(N)
• Not scalable – Bandwidth ~ O(1)
Crossbar Switch
• Non-blocking switching grid among network elements
• Bandwidth ~ O(N)
• Complexity ~ O(N*N)
Multistage Interconnection Network (MIN)
• Hierarchy of switching networks – e.g., Omega network for N CPUs, N memory banks: complexity ~ O(ln(N))
Network Terminology (Continued)
• Diameter – maximum distance (in nodes) between any two processors
• Connectivity – number of distinct paths between any two processors
• Channel width – maximum number of bits that can be simultaneously sent over link connecting two processors = number of physical wires in each link
• Channel rate – peak rate at which data can be sent over single physical wire
• Channel bandwidth – peak rate at which data can be sent over link = (channel rate) * (channel width)
• Bisection width – minimum number of links that have to be removed to partition network into two equal halves
• Bisection bandwidth – maximum rate at which data can be transferred between any two halves of the network connecting equal numbers of CPUs = (bisection width) * (channel bandwidth)
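As a worked example with assumed numbers: a channel width of 8 wires at a channel rate of 1 Gbit/sec per wire gives a channel bandwidth of 8 Gbit/sec; if the bisection width is 4 links, the bisection bandwidth is 4 * 8 = 32 Gbit/sec.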
Communication Overhead
Time to send a message of M bytes - simple form:

    Tcomm = TL + M*Td + Tcontention

TL = message latency
Td = time to transfer one byte = 1 byte/bandwidth
Tcontention = accounts for other network traffic
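To make the model concrete, here is a small C sketch (added, not from the slides) that evaluates Tcomm using the earlier Blue Horizon figures (~17 μs latency, ~350 MB/sec) as example inputs; contention is ignored:

#include <stdio.h>

/* Estimated time (seconds) to send msg_bytes with the given
 * latency (seconds) and bandwidth (bytes/second): TL + M*Td. */
static double t_comm(double latency, double bandwidth, double msg_bytes) {
    return latency + msg_bytes / bandwidth;
}

int main(void) {
    double lat = 17e-6, bw = 350e6;
    /* For small messages latency dominates; for large ones, bandwidth. */
    printf("1 KB : %.1f us\n", 1e6 * t_comm(lat, bw, 1024.0));
    printf("1 MB : %.1f us\n", 1e6 * t_comm(lat, bw, 1024.0 * 1024.0));
    return 0;
}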
Parallel I/O
• I/O can be a limiting factor in a parallel application
• I/O system properties - capacity, bandwidth, access time
• Need support for parallel I/O in the programming system
• Need underlying hardware and system support for parallel I/O
  – IBM GPFS - low-level API for developing high-level parallel I/O functionality (MPI-IO, HDF5, etc.)
Unix OS Concepts for Parallel Programming
Most operating systems used by parallel computers are Unix-based
• Unix process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles
• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads
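To make the process/thread contrast concrete, here is a minimal C sketch (added, not from the slides; POSIX fork and Pthreads, compile with -pthread) showing that a forked child writes to a private copy of the address space while a thread writes to the shared one:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

int shared = 0;

static void *thread_fn(void *arg) {
    shared = 42;              /* visible to the creator: same address space */
    return NULL;
}

int main(void) {
    pid_t pid = fork();       /* child gets a *copy* of the address space */
    if (pid == 0) {
        shared = 7;           /* modifies only the child's private copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   shared = %d\n", shared);   /* still 0 */

    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    printf("after thread: shared = %d\n", shared);   /* now 42 */
    return 0;
}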
Parallel Computer Architectures (Flynn Taxonomy)
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, hybrid (SMP cluster)
Hardware Architecture Models for Design of Parallel Programs
Sequential computers - the von Neumann model (RAM) is the universal computational model
Parallel computers - no single universal model exists
• Model must be sufficiently general to encapsulate hardware features of parallel systems
• Programs designed from model must execute efficiently on real parallel systems
Designing and Building Parallel Applications
Donald [email protected]
San Diego Supercomputing Center
What is Parallel Computing?
• Parallel computing: the use of multiple computers, processors, or processes concurrently working together on a common task.
  – Each processor/process works on its section of the problem
  – Processors/processes are allowed to exchange information (data in local memory) with other processors/processes

[Figure: a 2-D grid of the problem to be solved, divided into four areas; CPUs #1-#4 each work on one area and exchange data across the shared boundaries.]
Shared and Distributed Memory Systems
Multiprocessor (shared memory) - single address space; processes have access to a pool of shared memory; single OS.

Multicomputer (distributed memory) - each processor has its own local memory; processes usually do message passing to exchange data among processors; usually multiple copies of the OS.

[Figure: shared-memory system - four CPUs connected to a single MEMORY through an interconnect; distributed-memory system - four CPUs, each with its own memory M, connected by a NETWORK.]
Hybrid (SMP Clusters) System
[Figure: three SMP nodes, each with four CPUs sharing a MEMORY through an on-node interconnect, joined by an off-node network.]

• Must/may use message-passing
• Single or multiple OS copies
• Node-local operations are less costly than off-node operations
Unix OS Concepts for Parallel Programming
Most operating systems used are Unix-based
• Unix process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles
• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads
Generic Parallel Programming Models
Single Program Multiple Data Stream (SPMD)
• Each CPU accesses the same object code
• Same application run on different data
  – Data exchange may be handled explicitly/implicitly
• "Natural" model for SIMD machines
• Most commonly used generic parallel programming model
  – Message-passing
  – Shared-memory
• Usually uses the process/task ID to differentiate work (see the sketch after this list)
• Focus of the remainder of this section

Multiple Program Multiple Data Stream (MPMD)
• Each CPU accesses different object code
• Each CPU has only the data/instructions it needs
• "Natural" model for MIMD machines
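A minimal SPMD sketch in C with MPI (added, not from the slides): every CPU runs this same executable, and the process rank (task ID) differentiates the work:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's task ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of tasks  */

    /* Same program, different data: each rank works on its own slice. */
    printf("task %d of %d working on its section\n", rank, size);

    MPI_Finalize();
    return 0;
}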
Parallel “Architectures” – Mapping Hardware Models to Programming Models
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, hybrid (SMP cluster)
• Programming model: SPMD, MPMD
Methods of Problem Decomposition for Parallel Programming
Want to map (Problem + Algorithms + Data) onto the Architecture
• Conceptualize the mapping via, e.g., pseudocode
• Realize the mapping via a programming language

• Data decomposition - data-parallel program
  – Each processor performs the same task on different data
  – Example - grid problems
• Task (functional) decomposition - task-parallel program
  – Each processor performs a different task
  – Example - signal processing: adding/subtracting frequencies from a spectrum
• Other decomposition methods
Designing and Building Parallel Applications
• Generic Problem Architectures
• Design and Construction Principles
• Incorporate Computer Science Algorithms
• Use Parallel Numerical Libraries Where Possible
Designing and Building Parallel Applications
• Knowing when (not) to parallelize is very important
• Cherri Pancake's "Rules," summarized:
  – Frequency of use
  – Execution time
  – Resolution needs
  – Problem size
Categories of Parallel Problems
Generic parallel problem "architectures" (after G. Fox)
• Ideally parallel (embarrassingly parallel, "job-level parallel")
  – Same application run on different data
  – Could be run on separate machines
  – Example: parameter studies
• Almost ideally parallel
  – Similar to the ideal case, but with "minimum" coordination required
  – Example: linear Monte Carlo calculations, integrals
• Pipeline parallelism
  – Problem divided into tasks that have to be completed sequentially
  – Can be transformed into partially sequential tasks
  – Example: DSP filtering
• Synchronous parallelism
  – Each operation performed on all/most of the data
  – Operations depend on results of prior operations
  – All processes must be synchronized at regular points
  – Example: modeling atmospheric dynamics
• Loosely synchronous parallelism
  – Similar to the synchronous case, but with "minimum" intermittent data sharing
  – Example: modeling diffusion of contaminants through groundwater
Designing and Building Parallel Applications
Attributes of parallel algorithms
• Concurrency - many actions performed "simultaneously"
• Modularity - decomposition of complex entities into simpler components
• Locality - want a high ratio of local memory access to remote memory access
  – Usually want to minimize the communication/computation ratio
• Performance - measures of algorithmic "efficiency"
  – Execution time
  – Complexity - usually ~ execution time
  – Scalability
Designing and Building Parallel Applications
• Partitioning - break the main task down into smaller ones, either identical or "disjoint"
• Communication phase - determine the communication patterns for task coordination, communication algorithms
• Agglomeration - evaluate task and/or communication structures with respect to performance and implementation costs; tasks may be combined to improve performance or reduce communication costs
• Mapping - tasks assigned to processors; maximize processor utilization, minimize communication costs; mapping may be either static or dynamic

May have to iterate the whole process until satisfied with the expected performance
• Consider writing the application in parallel, using either SPMD message-passing or shared-memory
• Implementation (software & hardware) may require revisiting, additional refinement, or re-design
Designing and Building Parallel Applications
Partitioning
– Geometric or Physical decomposition (Domain Decomposition) - partition data associated with problem
– Functional (task) decomposition – partition into disjoint tasks associated with problem
– Divide and Conquer – partition problem into two simpler problems of approximately equivalent “size” – iterate to produce set of indivisible sub-problems
Generic Parallel Programming Software Systems
Message-Passing
• Local tasks, each encapsulating local data
• Explicit data exchange
• Supports both SPMD and MPMD
• Supports both task and data decomposition
• Most commonly used
• Process-based, but for performance, processes should run on separate CPUs
• Example APIs: MPI, PVM message-passing libraries
• MP systems, in particular MPI, will be the focus of the remainder of the workshop

Data Parallel
• Usually SPMD
• Supports data decomposition
• Data mapping to CPUs may be either implicit or explicit
• Example: HPF compiler

Shared-Memory
• Tasks share a common address space
• No explicit transfer of data - supports both task and data decomposition
• Can be SPMD or MPMD
• Thread-based, but for performance, threads should run on separate CPUs
• Example APIs: OpenMP, Pthreads (see the sketch after this list)

Hybrid
• Combination of message-passing and shared-memory - supports both task and data decomposition
• Example: OpenMP + MPI
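As a minimal shared-memory sketch (added here, not from the slides), the following C/OpenMP loop distributes its iterations among threads that share the arrays, with no explicit data transfer; compile with an OpenMP flag such as -fopenmp:

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = i;

    /* The work of the loop is divided among the available threads;
     * a and b live in the single shared address space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("threads available: %d, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}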
Programming Methodologies - Practical Aspects
• The bulk of parallel programs are written in Fortran, C, or C++
  – Generally the best compiler and tool support for parallel program development
• The bulk of parallel programs use message-passing with MPI
  – Performance, portability, mature compilers, libraries for parallel program development
• Data and/or tasks are split up onto different processors by:
  – Distributing the data/tasks onto different CPUs, each with local memory (MPPs, MPI)
  – Distributing the work of each loop to different CPUs (SMPs, OpenMP, Pthreads)
  – Hybrid - distribute data onto SMPs and then, within each SMP, distribute the work of each loop (or task) to different CPUs within the box (SMP cluster, MPI & OpenMP)
Typical Data Decomposition for Parallelism
Example: solve the 2-D wave equation.

Original partial differential equation:

    ∂²f/∂t² = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation (superscript n = time step; subscripts i,j = grid point):

    (f(i,j,n+1) - 2 f(i,j,n) + f(i,j,n-1)) / Δt²
        = D (f(i+1,j,n) - 2 f(i,j,n) + f(i-1,j,n)) / Δx²
        + B (f(i,j+1,n) - 2 f(i,j,n) + f(i,j-1,n)) / Δy²

[Figure: the 2-D (x,y) grid is decomposed into eight blocks, assigned to PE #0 through PE #7.]
Sending Data Between CPUs
Finite difference approximation: as on the previous slide.

[Figure: a 50x50 grid split among four PEs - PE #0 (i=1-25, j=1-25), PE #1 (i=1-25, j=26-50), PE #3 (i=26-50, j=1-25), PE #4 (i=26-50, j=26-50); each PE sends its boundary rows/columns (i=25, i=26, j=25, j=26) to the neighboring PEs.]

Sample pseudocode:

if (taskid == 0) then
  li = 1
  ui = 25
  lj = 1
  uj = 25
  send(1:25) = f(25, 1:25)
elseif (taskid == 1) then
  ...
elseif (taskid == 2) then
  ...
elseif (taskid == 3) then
  ...
end if

do j = lj, uj
  do i = li, ui
    work on f(i,j)
  end do
end do
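A hedged C/MPI sketch of the boundary exchange the pseudocode implies, assuming just 2 tasks side by side, each owning a 25-element boundary column; the calls are standard MPI, while the names and layout are illustrative:

#include <stdio.h>
#include <mpi.h>

#define NROWS 25

int main(int argc, char **argv) {
    int rank, size;
    double boundary[NROWS], ghost[NROWS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* assumed 2-task layout */

    /* Each task fills the column it must share with its neighbor. */
    for (int i = 0; i < NROWS; i++) boundary[i] = rank;

    int neighbor = (rank == 0) ? 1 : 0;
    /* Exchange boundary columns in one call; the received values
     * become ghost cells for the stencil update. */
    MPI_Sendrecv(boundary, NROWS, MPI_DOUBLE, neighbor, 0,
                 ghost,    NROWS, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("task %d received ghost column from task %d\n", rank, neighbor);
    MPI_Finalize();
    return 0;
}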
Typical Task Parallel Decomposition
• Signal processing
• Use one processor for each independent task
• Can use more processors if one is overloaded

[Figure: pipeline - SPECTRUM IN → Subtract Frequency f1 (Process 0) → Subtract Frequency f2 (Process 1) → Subtract Frequency f3 (Process 2) → SPECTRUM OUT.]
Basics of Task Parallel Decomposition - SPMD

The same program runs on 2 different CPUs; task decomposition analysis has defined 2 tasks (a and b) to be done by 2 CPUs.

program.f:
  ...
  initialize
  ...
  if (TaskID == A) then
    do task a
  elseif (TaskID == B) then
    do task b
  end if
  ...
  end program

This single source yields two execution streams: Task A's stream (initialize, do task a) and Task B's stream (initialize, do task b).
Multi-Level Task Parallelism

Implementation: MPI and OpenMP. Four processor sets each run the same program; within a set, threads execute the parallel loop blocks, while MPI carries communication among the sets over the network.

Program tskpar
  Implicit none
  (declarations)
  Do loop #1: parallel block (threads)
  End task #1
  (serial work)
  Do loop #2: parallel block (threads)
  End task #2
  (serial work)

[Figure: four copies of the program, one per processor set (Proc set #1 - #4), exchanging MPI messages across the network.]
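A minimal hybrid sketch of the same structure (added, not from the slides) in C with MPI + OpenMP; the reduction loop stands in for the slide's parallel loop blocks and serial work. Compile with an MPI wrapper and an OpenMP flag:

#include <stdio.h>
#include <mpi.h>

#define N 100000

int main(int argc, char **argv) {
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Loop #1: parallel block executed by this rank's OpenMP threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0;                 /* stand-in for real work */

    /* (serial work, then MPI communication among the processor sets) */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}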
Parallel Application Performance Concepts
• Parallel Speedup
• Parallel Efficiency
• Parallel Overhead
• Limits on Parallel Performance
Parallel Application Performance Concepts
• Parallel Speedup - ratio of best sequential time to parallel execution time
– S(n) = ts/tp
• Parallel Efficiency - fraction of time processors in use
– E(n) = ts/(tp*n) = S(n)/n
• Parallel Overhead
– Communication time (Message-Passing)
– Process creation/synchronization (MP)
– Extra code to support parallelism, such as Load Balancing
– Thread creation/coordination time (SMP)
• Limits on Parallel Performance
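Applying the definitions above to assumed example numbers: if the best serial time is ts = 100 s and the parallel time on n = 16 CPUs is tp = 8 s, then S(16) = 100/8 = 12.5 and E(16) = 12.5/16 ≈ 0.78; the shortfall from perfect speedup is the parallel overhead listed above.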
Limits of Parallel Computing
• Theoretical upper limits
– Amdahl’s Law
– Gustafson’s Law
• Practical limits
– Communication overhead
– Synchronization overhead
– Extra operations necessary for parallel version
• Other Considerations
– Time used to re-write (existing) code
Parallel Computing - Theoretical Performance Upper Limits
• All parallel programs contain:
– Parallel sections
– Serial sections
Serial sections limit the parallel performance
Amdahl’s Law provides a theoretical upper limit on parallel performance for size-constant problems
Amdahl's Law

• Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors
  – Effect of multiple processors on run time for size-constant problems:

      tN = (fp/N + fs) * t1

  – Effect of multiple processors on parallel speedup S:

      S = 1 / (fs + fp/N)

  – Where
    • fs = serial fraction of code
    • fp = parallel fraction of code
    • N = number of processors
    • t1 = sequential execution time
Amdahl's Law

[Figure: ideal-case Amdahl speedup vs. number of CPUs (10-40) for serial fractions f = 0.0, 0.01, 0.05, and 0.1; the curves flatten well below the ideal f = 0.0 line as f grows.]
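The numbers behind such plots can be reproduced with a few lines of C (added here as a sketch), tabulating S = 1/(f + (1-f)/N) for the same serial fractions:

#include <stdio.h>

int main(void) {
    const double f[] = {0.0, 0.01, 0.05, 0.1};  /* serial fractions */
    for (int n = 10; n <= 40; n += 10) {
        printf("N = %2d:", n);
        for (int k = 0; k < 4; k++)
            printf("  f=%.2f -> S=%5.2f",
                   f[k], 1.0 / (f[k] + (1.0 - f[k]) / n));
        printf("\n");
    }
    return 0;
}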
Amdahl's Law (Continued)

[Figure: actual measured speedup vs. number of CPUs (10-40), plotted alongside the ideal curves for f = 0.0 and f = 0.01.]
Gustafson’s Law
Consider scaling the problem size as the processor count is increased.

Ts = serial execution time
Tp(N,W) = parallel execution time for the same problem, size W, on N CPUs
S(N,W) = speedup on problem size W with N CPUs:

    S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))

Consider the case where Tp(N,W) ~ W*W/N:

    S(N,W) -> (N*Ts + N*W*W) / (N*Ts + W*W) -> N as W grows

Gustafson's Law provides some hope for parallel applications to deliver on the promise.
Parallel Programming Analysis - Example
Consider solving the 2-D Poisson equation by an iterative method on a regular grid with M points.
Parallel Programming Concepts
A program must be correct and terminate for some input data set(s)

• Race condition - result(s) depend upon the order in which processes/threads finish calculation(s). May or may not be a problem, depending upon the results.
• Deadlock - a process/thread requests a resource it will never get. To be avoided - a common problem in message-passing parallel programs.
Other Considerations
• Writing efficient parallel applications is usually more difficult than writing serial applications
  – The serial version may (or may not) provide a good starting point for the parallel version
  – Communication, synchronization, etc., can limit parallel performance
• Usually want to overlap communication and computation to minimize the ratio of communication time to computation time
  – Serial time can dominate
  – CPU computational load balance is important
• Is it worth your time to rewrite an existing application? Or create a new one? Recall Cherri Pancake's rules (simplified version):
  – Do the CPU and/or memory requirements justify parallelization?
  – Will the code be used "enough" times to justify parallelization?
Parallel Programming - Real Life
• These are the main models in use today (circa 2002)
• New approaches - languages, hardware, etc. - are likely to arise as technology advances
• Other combinations of these models are possible
• Large applications will probably use more than one model
• The shared-memory model is closest to the mathematical model of the application
  – Scaling to large numbers of CPUs is the major issue
Parallel Computing
References
• NPACI PCOMP web page - www.npaci.edu/PCOMP
  – Selected HPC link collection - categorized, updated
• Online tutorials and books
  – Designing and Building Parallel Programs, Ian Foster: http://www-unix.mcs.anl.gov/dbpp/
  – NCSA Intro to MPI tutorial: http://pacont.ncsa.uiuc.edu:8900/public/MPI/index.html
  – HLRS Parallel Programming Workshop: http://www.hlrs.de/organization/par/par_prog_ws/
• Books
  – Parallel Programming, B. Wilkinson and M. Allen
  – Computer Organization and Design, D. Patterson and J. L. Hennessy
  – Scalable Parallel Computing, K. Hwang and Z. Xu