Scalable Parallel Architectures and their Software


Description

Scalable Parallel Architectures and their Software. Introduction: overview of RISC CPUs and the memory hierarchy; parallel systems - general hardware layout (SMP, distributed, hybrid); communications networks for parallel systems; parallel I/O; operating systems concepts.

Transcript of Scalable Parallel Architectures and their Software

Page 1: Scalable Parallel Architectures and their  Software

SAN DIEGO SUPERCOMPUTER CENTER
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

NPACI Parallel Computing Seminars
San Diego Supercomputer Center

Scalable Parallel Architectures and their Software

Page 2: Scalable Parallel Architectures and their  Software


Introduction

• Overview of RISC CPUs, Memory Hierarchy
• Parallel Systems - General Hardware Layout (SMP, Distributed, Hybrid)
• Communications Networks for Parallel Systems
• Parallel I/O
• Operating Systems Concepts
• Overview of Parallel Programming Methodologies
  – Distributed-Memory
  – Shared-Memory
• Hardware Specifics of NPACI Parallel Machines
  – IBM SP Blue Horizon
  – New CPU Architectures
    • IBM POWER4
    • Intel IA-64

Page 3: Scalable Parallel Architectures and their  Software


What is Parallel Computing?

• Parallel computing: the use of multiple computers, processors, or processes working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information (data in local memory) with other processors

Figure: a 2-D grid of the problem to be solved (x-y plane) divided into four areas; CPUs #1-#4 each work on their own area of the problem and exchange data with neighboring CPUs.

Page 4: Scalable Parallel Architectures and their  Software


Why Parallel Computing?

• Limits of single-CPU computing
  – Available memory
  – Performance - usually "time to solution"
• Limits of Vector Computers - the main HPC alternative
  – System cost, including maintenance
  – Cost/MFlop
• Parallel computing allows:
  – Solving problems that don't fit on a single CPU
  – Solving problems that can't be solved in a reasonable time on one CPU
• We can run…
  – Larger problems
  – Finer resolution
  – Faster
  – More cases

Page 5: Scalable Parallel Architectures and their  Software


Scalable Parallel Computer Systems

(Scalable) [ (CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System

Page 6: Scalable Parallel Architectures and their  Software


Scalable Parallel Computer Systems

Scalability: A parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality.

• Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.)

• Application scalability: the application continues to perform well as the machine size and/or the problem size is increased

Page 7: Scalable Parallel Architectures and their  Software


Shared and Distributed Memory Systems

Multiprocessor (shared memory) - Single address space. All processors have access to a pool of shared memory. Examples: SUN HPC, CRAY T90, NEC SX-6

Methods of memory access:
  – Bus
  – Crossbar

Multicomputer (distributed memory) - Each processor has its own local memory. Examples: CRAY T3E, IBM SP2, PC Cluster

Figures: shared memory - several CPUs connected to one memory through a bus/crossbar; distributed memory - CPUs, each with its own local memory (M), connected by a network.

Page 8: Scalable Parallel Architectures and their  Software


Hybrid (SMP Clusters) Systems

Figure: three SMP nodes, each with several CPUs sharing a memory through a local interconnect; the nodes are connected to one another by a network.

Hybrid Architecture - Processes share memory on-node, may/must use message-passing off-node, and may share off-node memory. Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer

Page 9: Scalable Parallel Architectures and their  Software


RISC-Based Computer Hardware Concepts

RISC CPUs are the most common CPUs in HPC - many design concepts were transferred from vector CPUs to RISC, and later to CISC.

• Multiple Functional Units

• Pipelined Instructions

• Memory Hierarchy

• Instructions typically take 1-several CPU clock cycles

– Clock cycles provide time scale for measurement

• Data transfers – memory-to-CPU, network, I/O, etc.

Page 10: Scalable Parallel Architectures and their  Software


Processor Related Terms

• RISC : Reduced Instruction Set Computers

• PIPELINE : Technique where multiple instructions are overlapped in execution

• SUPERSCALAR : Computer design feature - multiple instructions can be executed per clock period

Laura C. Nett: The instruction set is just how each operation is processed. For example, x = y + a:
  load y and a
  add y and a
  put the result in x


Page 11: Scalable Parallel Architectures and their  Software


‘Typical’ RISC CPU

Figure: a 'typical' RISC CPU chip - a register file (r0 ... r32) feeding floating-point functional units (FP Add, FP Multiply, FP Divide, FP Multiply & Add), with loads and stores moving data between the registers and memory/cache.

Page 12: Scalable Parallel Architectures and their  Software


Functional Unit

Figure: a fully segmented (pipelined) multiply functional unit computing A(I) = C(I)*D(I); pairs of operands C(I), D(I) move through the stages of the multiply pipeline. Analogy: a chair-building "functional unit" staffed by Carpenters 1-5, each performing one stage of the assembly.

Page 13: Scalable Parallel Architectures and their  Software


Dual Hardware Pipes

Figure: dual hardware pipes computing A(I) = C(I)*D(I); odd-indexed operands C(I), D(I) flow through one pipe and even-indexed operands C(I+1), D(I+1) through the other, so results A(I) and A(I+1) are produced together.

Page 14: Scalable Parallel Architectures and their  Software


RISC Memory/Cache Related Terms

ICACHE : Instruction cache

DCACHE (Level 1) : Data cache closest to registers

SCACHE (Level 2) : Secondary data cache

Data from SCACHE has to go through DCACHE to registers

SCACHE is larger than DCACHE

Not all processors have SCACHE

CACHE LINE: Minimum transfer unit (usually in bytes) for moving data between different levels of memory hierarchy

TLB : Translation-look-aside buffer keeps addresses of pages ( block of memory) in main memory that have been recently accessed

• MEMORY BANDWIDTH: Transfer rate (in MBytes/sec) between different levels of memory
• MEMORY ACCESS TIME: Time required (often measured in clock cycles) to bring data items from one level in memory to another

• CACHE COHERENCY: Mechanism for ensuring data consistency of shared variables across memory hierarchy

Page 15: Scalable Parallel Architectures and their  Software


Figure: RISC CPU, cache, and memory basic layout - registers, Level 1 cache, Level 2 cache, and main memory. Moving down the hierarchy, speed decreases while size increases and cost per bit decreases.


Page 17: Scalable Parallel Architectures and their  Software


RISC Memory/Cache Related Terms (cont.)

Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.

Figure: direct mapped cache - each block of main memory maps to exactly one location in the cache.
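For example (illustrative numbers, not from the slide): in a direct-mapped cache with 256 lines, the memory block with block address 1000 can only be placed in cache line 1000 mod 256 = 232.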

Page 18: Scalable Parallel Architectures and their  Software


RISC Memory/Cache Related Terms (cont.)

Fully associative cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

Figure: fully associative cache - a block of main memory may be placed in any location in the cache.

Page 19: Scalable Parallel Architectures and their  Software


RISC Memory/Cache Related Terms (cont.)

Set associative cache : The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In a n-way set-associative cache a block from main memory can go into n (n at least 2) locations in the cache.

Figure: 2-way set-associative cache - each block of main memory may be placed in either of the two cache locations of its set.

Page 20: Scalable Parallel Architectures and their  Software


RISC Memory/Cache Related Terms

• The data cache was designed to allow programmers to take advantage of common data access patterns:

Spatial Locality
  – When an array element is referenced, its neighbors are likely to be referenced
  – Cache lines are fetched together
  – Work on consecutive data elements in the same cache line

Temporal Locality
  – When an array element is referenced, it is likely to be referenced again soon
  – Arrange code so that data in cache is reused as often as possible
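A minimal Fortran sketch (not from the slides; array names and sizes are illustrative) of how loop ordering exploits spatial locality - Fortran stores arrays column-major, so the inner loop should run over the first index:

    program locality
      implicit none
      integer, parameter :: n = 500
      real :: a(n,n), b(n,n), c(n,n)
      integer :: i, j

      b = 1.0
      c = 2.0

      ! Cache-friendly: the inner loop over the first index walks consecutive
      ! elements, so each fetched cache line is fully used (spatial locality).
      do j = 1, n
         do i = 1, n
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do

      ! Cache-unfriendly: the inner loop over the second index strides through
      ! memory, touching a different cache line on nearly every reference.
      do i = 1, n
         do j = 1, n
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do

      print *, a(1,1), a(n,n)
    end program locality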

Page 21: Scalable Parallel Architectures and their  Software


Typical RISC Floating-Point Operation Times

IBM POWER3-II
• CPU Clock Speed - 375 MHz (~3 ns per cycle)

    Instruction           32-bit (cycles)    64-bit (cycles)
    FP Multiply or Add         3-4                3-4
    FP Multiply-Add            3-4                3-4
    FP Square Root            14-23              22-31
    FP Divide                 14-21              18-25

Page 22: Scalable Parallel Architectures and their  Software


Typical RISC Memory Access Times

IBM POWER3-II

    Access                         Bandwidth (GB/s)    Time (cycles)
    Load register from L1                3.2                 1
    Store register to L1                 1.6                 1
    Load/store L1 from/to L2             6.4                 9
    Load/store L1 from/to RAM            1.6                35

Page 23: Scalable Parallel Architectures and their  Software


Single CPU Optimization

Optimization of the serial (single-CPU) version is very important

• Want to parallelize the best serial version - where appropriate

Page 24: Scalable Parallel Architectures and their  Software


New CPUs in HPC

New CPU designs with new features

• IBM POWER 4

– U Texas Regatta nodes – covered on Wednesday

• Intel IA-64

– SDSC DTF TeraGrid PC Linux Cluster

Page 25: Scalable Parallel Architectures and their  Software


Parallel Networks

Network function is to transfer data from source to destination in support of network transactions used to realize supported programming model(s).

Data transfer can be for message-passing and/or shared-memory operations.

• Network Terminology
• Common Parallel Networks

Page 26: Scalable Parallel Architectures and their  Software


System Interconnect Topologies

Information is sent among CPUs through a network. The best choice would be a fully connected network, in which each processor has a direct link to every other processor. Such a network would be very expensive and difficult to scale (the number of links grows as ~N*N). Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.

Figures: 3-D hypercube, 2-D mesh, 2-D torus.

Page 27: Scalable Parallel Architectures and their  Software


Network Terminology

• Network Latency : Time taken to begin sending a message. Unit is microsecond, millisecond etc. Smaller is better.

• Network Bandwidth : Rate at which data is transferred from one point to another. Unit is bytes/sec, Mbytes/sec etc. Larger is better.

– May vary with data size

For the IBM Blue Horizon:

    Switch type    Latency (microseconds)      Bandwidth (MB/sec)
    US             ~17 (~6000 CPU cycles)      ~350

Page 28: Scalable Parallel Architectures and their  Software


Network Terminology

Bus

• Shared data path

• Data requests require exclusive access

• Complexity ~ O(N)

• Not scalable – Bandwidth ~ O(1)

Crossbar Switch

• Non-blocking switching grid among network elements

• Bandwidth ~ O(N)

• Complexity ~ O(N*N)

Multistage Interconnection Network (MIN)

• Hierarchy of switching networks – e.g., Omega network for N CPUs, N memory banks: complexity ~ O(ln(N))

Page 29: Scalable Parallel Architectures and their  Software


Network Terminology (Continued)

• Diameter – maximum distance (in nodes) between any two processors

• Connectivity – number of distinct paths between any two processors

• Channel width – maximum number of bits that can be simultaneously sent over link connecting two processors = number of physical wires in each link

• Channel rate – peak rate at which data can be sent over single physical wire

• Channel bandwidth – peak rate at which data can be sent over link = (channel rate) * (channel width)

• Bisection width – minimum number of links that have to be removed to partition network into two equal halves

• Bisection bandwidth – maximum amount of data between any two halves of network connecting equal numbers of CPUs = (bisection width) * (channel bandwidth)
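As a concrete illustration of these terms (illustrative example, not from the slides): in a 2-D mesh of p x p processors without wraparound links, the diameter is 2(p-1) hops and the bisection width is p links; adding torus wraparound links roughly halves the diameter and doubles the bisection width to 2p.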

Page 30: Scalable Parallel Architectures and their  Software


Communication Overhead

Time to send a message of M bytes - simple form:

    Tcomm = TL + M*Td + Tcontention

where
    TL = message latency
    Td = time per byte = 1/bandwidth
    Tcontention = accounts for other network traffic
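As a rough worked example (using the Blue Horizon figures quoted earlier and ignoring contention), sending M = 1 MByte:

    Tcomm ~ 17 microseconds + (1,000,000 bytes)/(350 MB/sec)
          ~ 17 microseconds + 2860 microseconds ~ 2.9 ms

so large messages are bandwidth-dominated, while for short messages the fixed latency dominates.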


Page 32: Scalable Parallel Architectures and their  Software


Parallel I/O

I/O can be limiting factor in parallel application

• I/O system properties – capacity, bandwidth, access time

• Need support for Parallel I/O in programming system

• Need underlying HW and system support for parallel I/O

– IBM GPFS - low-level API for developing high-level parallel I/O functionality (MPI-IO, HDF5, etc.)

Page 33: Scalable Parallel Architectures and their  Software


Unix OS Concepts for Parallel Programming

Most operating systems used by parallel computers are Unix-based.

• Unix Process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles

• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads

Page 34: Scalable Parallel Architectures and their  Software


Parallel Computer Architectures (Flynn Taxonomy)

Figure: control mechanism - SIMD vs. MIMD; memory model - shared-memory, hybrid (SMP cluster), or distributed-memory.

Page 35: Scalable Parallel Architectures and their  Software


Hardware Architecture Models for Design of Parallel Programs

Sequential computers - von Neumann model (RAM) is universal computational model

Parallel computers - no one model exists

• Model must be sufficiently general to encapsulate hardware features of parallel systems

• Programs designed from model must execute efficiently on real parallel systems

Page 36: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

Donald [email protected]

San Diego Supercomputer Center

Page 37: Scalable Parallel Architectures and their  Software


What is Parallel Computing?

• Parallel computing: the use of multiple computers, processors, or processes concurrently working together on a common task.
  – Each processor/process works on its section of the problem
  – Processors/processes are allowed to exchange information (data in local memory) with other processors/processes

Figure: a 2-D grid of the problem to be solved (x-y plane) divided into four areas; CPUs #1-#4 each work on their own area of the problem and exchange data with neighboring CPUs.

Page 38: Scalable Parallel Architectures and their  Software


Shared and Distributed Memory Systems

Multiprocessor (shared memory) - Single address space. Processes have access to a pool of shared memory. Single OS.

Multicomputer (distributed memory) - Each processor has its own local memory. Processes usually do message passing to exchange data among processors. Usually multiple copies of the OS.

Figures: shared memory - several CPUs connected to one memory through an interconnect; distributed memory - CPUs, each with its own local memory (M), connected by a network.

Page 39: Scalable Parallel Architectures and their  Software


Hybrid (SMP Clusters) System

Figure: three SMP nodes (CPUs sharing memory through a local interconnect) connected by a network.

• Must/may use message-passing
• Single or multiple OS copies
• Node-local operations less costly than off-node

Page 40: Scalable Parallel Architectures and their  Software


Unix OS Concepts for Parallel Programming

Most operating systems used are Unix-based.

• Unix Process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles

• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads

Page 41: Scalable Parallel Architectures and their  Software


Generic Parallel Programming Models

Single Program Multiple Data Stream (SPMD)
  – Each CPU accesses the same object code
  – Same application run on different data
  – Data exchange may be handled explicitly/implicitly
  – "Natural" model for SIMD machines
  – Most commonly used generic parallel programming model
    • Message-passing
    • Shared-memory
  – Usually uses process/task ID to differentiate (see the MPI sketch after this list)
  – Focus of the remainder of this section

Multiple Program Multiple Data Stream (MPMD)
  – Each CPU accesses different object code
  – Each CPU has only the data/instructions it needs
  – "Natural" model for MIMD machines
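A minimal SPMD message-passing sketch (not from the slides) in Fortran with MPI: every process runs the same object code and uses its task ID (rank) to select its portion of the work:

    program spmd
      use mpi
      implicit none
      integer :: ierr, rank, nprocs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! this process's task ID
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! total number of processes

      if (rank == 0) then
         print *, 'task 0 of', nprocs, ': e.g., read input and distribute it'
      else
         print *, 'task', rank, 'of', nprocs, ': e.g., work on its own data section'
      end if

      call MPI_Finalize(ierr)
    end program spmd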

Page 42: Scalable Parallel Architectures and their  Software


Parallel “Architectures” – Mapping Hardware Models to Programming Models

Figure: control mechanism - SIMD vs. MIMD; memory model - shared-memory, hybrid (SMP cluster), or distributed-memory; programming model - SPMD vs. MPMD.

Page 43: Scalable Parallel Architectures and their  Software


Methods of Problem Decomposition for Parallel Programming

Want to map (Problem + Algorithms + Data) onto the Architecture

Conceptualize mapping via e.g., pseudocode

Realize mapping via programming language

• Data Decomposition - data parallel program

– Each processor performs the same task on different data

– Example - grid problems

• Task (Functional ) Decomposition - task parallel program

– Each processor performs a different task

– Example - signal processing – adding/subtracting frequencies from spectrum

• Other Decomposition methods

Page 44: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

• Generic Problem Architectures
• Design and Construction Principles
• Incorporate Computer Science Algorithms
• Use Parallel Numerical Libraries Where Possible

Page 45: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

• Knowing when (not) to parallelize is very important
• Cherri Pancake's "Rules" summarized:
  – Frequency of Use
  – Execution Time
  – Resolution Needs
  – Problem Size

Page 46: Scalable Parallel Architectures and their  Software


Categories of Parallel Problems

Generic Parallel Problem "Architectures" (after G. Fox)

• Ideally Parallel (Embarrassingly Parallel, "Job-Level Parallel")
  – Same application run on different data
  – Could be run on separate machines
  – Example: parameter studies
• Almost Ideally Parallel
  – Similar to the ideal case, but with "minimum" coordination required
  – Example: linear Monte Carlo calculations, integrals
• Pipeline Parallelism
  – Problem divided into tasks that have to be completed sequentially
  – Can be transformed into partially sequential tasks
  – Example: DSP filtering
• Synchronous Parallelism
  – Each operation performed on all/most of the data
  – Operations depend on results of prior operations
  – All processes must be synchronized at regular points
  – Example: modeling atmospheric dynamics
• Loosely Synchronous Parallelism
  – Similar to the synchronous case, but with "minimum" intermittent data sharing
  – Example: modeling diffusion of contaminants through groundwater

Page 47: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

Attributes of Parallel Algorithms

– Concurrency - many actions performed "simultaneously"
– Modularity - decomposition of complex entities into simpler components
– Locality - want a high ratio of local memory access to remote memory access
– Usually want to minimize the communication/computation ratio
– Performance
  • Measures of algorithmic "efficiency"
    – Execution time
    – Complexity, usually ~ execution time
    – Scalability

Page 48: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

Partitioning - Break down main task into smaller ones – either identical or “disjoint”.

Communication phase - Determine communication patterns for task coordination, communication algorithms.

Agglomeration - Evaluate task and/or communication structures wrt performance and implementation costs. Tasks may be combined to improve performance or reduce communication costs.

Mapping - Tasks assigned to processors; maximize processor utilization, minimize communication costs. Mapping may be either static or dynamic.

May have to iterate the whole process until satisfied with the expected performance
  – Consider writing the application in parallel, using either SPMD message-passing or shared-memory
  – Implementation (software & hardware) may require a revisit, additional refinement, or re-design

Page 49: Scalable Parallel Architectures and their  Software


Designing and Building Parallel Applications

Partitioning

– Geometric or Physical decomposition (Domain Decomposition) - partition data associated with problem

– Functional (task) decomposition – partition into disjoint tasks associated with problem

– Divide and Conquer – partition problem into two simpler problems of approximately equivalent “size” – iterate to produce set of indivisible sub-problems

Page 50: Scalable Parallel Architectures and their  Software


Generic Parallel Programming Software Systems

Message-Passing
  – Local tasks, each encapsulating local data
  – Explicit data exchange
  – Supports both SPMD and MPMD
  – Supports both task and data decomposition
  – Most commonly used
  – Process-based, but for performance, processes should be running on separate CPUs
  – Example APIs: MPI, PVM message-passing libraries
  – MP systems, in particular MPI, will be the focus of the remainder of the workshop

Data Parallel
  – Usually SPMD
  – Supports data decomposition
  – Data mapping to CPUs may be either implicit or explicit
  – Example: HPF compiler

Shared-Memory
  – Tasks share a common address space
  – No explicit transfer of data - supports both task and data decomposition
  – Can be SPMD or MPMD
  – Thread-based, but for performance, threads should be running on separate CPUs
  – Example APIs: OpenMP, Pthreads (a short OpenMP sketch follows this list)

Hybrid - combination of Message-Passing and Shared-Memory; supports both task and data decomposition
  – Example: OpenMP + MPI
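A minimal shared-memory sketch (not from the slides; names and sizes are illustrative) using OpenMP, one of the example APIs named above - the loop iterations are divided among threads that all see the same arrays in a single address space:

    program shmem
      implicit none
      integer, parameter :: n = 100000
      real :: a(n), b(n), c(n)
      integer :: i

      b = 1.0
      c = 2.0

      ! the iterations of this loop are divided among the OpenMP threads;
      ! a, b and c live in the shared address space, i is private per thread
      !$omp parallel do
      do i = 1, n
         a(i) = b(i) + c(i)
      end do
      !$omp end parallel do

      print *, a(1), a(n)
    end program shmem

(Compile with an OpenMP flag such as -fopenmp or -qsmp=omp, depending on the compiler.)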

Page 51: Scalable Parallel Architectures and their  Software


Programming Methodologies - Practical Aspects

The bulk of parallel programs are written in Fortran, C, or C++
• Generally the best compiler and tool support for parallel program development

The bulk of parallel programs use Message-Passing with MPI
• Performance, portability, mature compilers, and libraries for parallel program development

Data and/or tasks are split up onto different processors by:

• Distributing the data/tasks onto different CPUs, each with local memory (MPPs, MPI)

• Distributing the work of each loop to different CPUs (SMPs, OpenMP, Pthreads)

• Hybrid - distributing data onto SMPs and then, within each SMP, distributing the work of each loop (or task) to different CPUs within the box (SMP cluster, MPI & OpenMP)

Page 52: Scalable Parallel Architectures and their  Software


Typical Data Decomposition for Parallelism

Example: Solve 2-D Wave Equation:

Original partial differential equation (2-D wave equation):

    ∂²f/∂t² = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation (f_{i,j}^n is the value at grid point (i,j) and time step n):

    (f_{i,j}^{n+1} - 2 f_{i,j}^n + f_{i,j}^{n-1}) / Δt² =
        D (f_{i+1,j}^n - 2 f_{i,j}^n + f_{i-1,j}^n) / Δx²
      + B (f_{i,j+1}^n - 2 f_{i,j}^n + f_{i,j-1}^n) / Δy²

Figure: the x-y grid is decomposed into blocks, one per processor (PE #0 through PE #7).
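A serial Fortran sketch (not from the slides; array and variable names are assumptions) of one time-step update with this finite difference approximation, solving for f at step n+1:

    program wave_step
      implicit none
      integer, parameter :: nx = 50, ny = 50
      real, parameter :: D = 1.0, B = 1.0             ! assumed coefficients
      real, parameter :: dx = 1.0, dy = 1.0, dt = 0.1
      real :: f(nx,ny), fold(nx,ny), fnew(nx,ny)
      integer :: i, j

      f = 0.0; fold = 0.0; fnew = 0.0

      ! interior update: f^{n+1} = 2 f^n - f^{n-1} + dt^2 * RHS
      do j = 2, ny-1
         do i = 2, nx-1
            fnew(i,j) = 2.0*f(i,j) - fold(i,j) + dt*dt * (              &
                        D*(f(i+1,j) - 2.0*f(i,j) + f(i-1,j))/(dx*dx)    &
                      + B*(f(i,j+1) - 2.0*f(i,j) + f(i,j-1))/(dy*dy) )
         end do
      end do
    end program wave_step

In the data decomposition, each PE would run this same loop over its own block of the grid after obtaining the neighboring boundary values (next slide).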

Page 53: Scalable Parallel Architectures and their  Software


Sending Data Between CPUs

Finite difference approximation: same stencil as on the previous slide.

Figure: the 50 x 50 grid split among four processors - PE #0 (i=1-25, j=1-25), PE #1 (i=1-25, j=26-50), PE #3 (i=26-50, j=1-25), PE #4 (i=26-50, j=26-50). To apply the stencil at its block boundary, each processor must receive the adjacent row/column of values (e.g., i=25/26 or j=25/26) from its neighbors, and send its own boundary values in return.

Sample pseudo code:

    if (taskid == 0) then
       li = 1; ui = 25
       lj = 1; uj = 25
       send(1:25) = f(25,1:25)
    else if (taskid == 1) then
       ...
    else if (taskid == 2) then
       ...
    else if (taskid == 3) then
       ...
    end if

    do j = lj, uj
       do i = li, ui
          ! work on f(i,j)
       end do
    end do
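A minimal sketch (not from the slides; names and the 1-D decomposition are assumptions) of how such a boundary ("halo") exchange could be written with MPI in Fortran, using one ghost column on each side of the local block:

    program halo
      use mpi
      implicit none
      integer, parameter :: nx = 50, nloc = 25       ! local block: nx rows, nloc columns
      real :: f(nx, 0:nloc+1)                        ! columns 0 and nloc+1 are ghost columns
      integer :: ierr, rank, nprocs, left, right

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! neighbors in a 1-D decomposition over j (no neighbor at the ends)
      left  = rank - 1
      right = rank + 1
      if (rank == 0)          left  = MPI_PROC_NULL
      if (rank == nprocs - 1) right = MPI_PROC_NULL

      f = real(rank)

      ! send the last owned column to the right neighbor and receive the left
      ! neighbor's last column into the left ghost column; then the mirror exchange
      call MPI_Sendrecv(f(:,nloc), nx, MPI_REAL, right, 0,            &
                        f(:,0),    nx, MPI_REAL, left,  0,            &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call MPI_Sendrecv(f(:,1),      nx, MPI_REAL, left,  1,          &
                        f(:,nloc+1), nx, MPI_REAL, right, 1,          &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

      ! ... update the interior points with the finite difference stencil ...

      call MPI_Finalize(ierr)
    end program halo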

Page 54: Scalable Parallel Architectures and their  Software


Typical Task Parallel Decomposition

• Signal processing
• Use one processor for each independent task
• Can use more processors if one is overloaded

Figure: a spectrum flows through a pipeline of tasks - Process 0 subtracts frequency f1, Process 1 subtracts frequency f2, Process 2 subtracts frequency f3 - from SPECTRUM IN to SPECTRUM OUT.

Page 55: Scalable Parallel Architectures and their  Software


Basics of Task Parallel Decomposition - SPMD

The same program will run on 2 different CPUs. Task decomposition analysis has defined 2 tasks (a and b) to be done by 2 CPUs.

program.f:
    ...
    initialize
    ...
    if (TaskID == A) then
       do task a
    else if (TaskID == B) then
       do task b
    end if
    ...
    end program

Task A execution stream (program.f): ... initialize ... do task a ... end program
Task B execution stream (program.f): ... initialize ... do task b ... end program

Page 56: Scalable Parallel Architectures and their  Software


Multi-Level Task Parallelism

Figure: four processor sets connected by a network; each set runs the same program skeleton, exchanging data between sets with MPI and dividing each parallel loop among threads within the set.

Program skeleton (one copy per processor set):

    program tskpar
    implicit none
    ! (declarations)

    ! do loop #1 - parallel block (task #1)
    ! (serial work)
    ! do loop #2 - parallel block (task #2)
    ! (serial work)

    end program tskpar

Implementation: MPI and OpenMP
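A minimal hybrid sketch (not from the slides; names and sizes are illustrative) of the pattern in the figure - MPI between processor sets, OpenMP threads sharing the loop work within each set:

    program hybrid
      use mpi
      implicit none
      integer, parameter :: n = 100000
      real :: a(n), local_sum, global_sum
      integer :: ierr, rank, nprocs, i

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      a = real(rank + 1)

      ! parallel block: OpenMP threads divide the local loop
      local_sum = 0.0
      !$omp parallel do reduction(+:local_sum)
      do i = 1, n
         local_sum = local_sum + a(i)
      end do
      !$omp end parallel do

      ! message-passing: MPI combines the per-set results across the network
      call MPI_Reduce(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, 0,  &
                      MPI_COMM_WORLD, ierr)

      if (rank == 0) print *, 'global sum =', global_sum
      call MPI_Finalize(ierr)
    end program hybrid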

Page 57: Scalable Parallel Architectures and their  Software


Parallel Application Performance Concepts

• Parallel Speedup

• Parallel Efficiency

• Parallel Overhead

• Limits on Parallel Performance

Page 58: Scalable Parallel Architectures and their  Software


Parallel Application Performance Concepts

• Parallel Speedup - ratio of best sequential time to parallel execution time

– S(n) = ts/tp

• Parallel Efficiency - fraction of time processors in use

– E(n) = ts/(tp*n) = S(n)/n

• Parallel Overhead

– Communication time (Message-Passing)

– Process creation/synchronization (MP)

– Extra code to support parallelism, such as Load Balancing

– Thread creation/coordination time (SMP)

• Limits on Parallel Performance
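As a quick worked illustration (numbers are illustrative, not from the slides): if the best sequential time is ts = 100 s and the parallel time on n = 8 processors is tp = 20 s, then

    S(8) = 100/20 = 5      E(8) = 5/8 = 0.625

and the "missing" 37.5% of processor time is parallel overhead and idle time.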


Page 60: Scalable Parallel Architectures and their  Software


Limits of Parallel Computing

• Theoretical upper limits

– Amdahl’s Law

– Gustafson’s Law

• Practical limits

– Communication overhead

– Synchronization overhead

– Extra operations necessary for parallel version

• Other Considerations

– Time used to re-write (existing) code

Page 61: Scalable Parallel Architectures and their  Software


Parallel Computing - Theoretical Performance Upper Limits

• All parallel programs contain:

– Parallel sections

– Serial sections

Serial sections limit the parallel performance

Amdahl’s Law provides a theoretical upper limit on parallel performance for size-constant problems

Page 62: Scalable Parallel Architectures and their  Software


Amdahl's Law

• Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors
  – Effect of multiple processors on run time for size-constant problems:

        t_N = (fs + fp/N) * t_1

  – Effect of multiple processors on parallel speedup S:

        S = 1 / (fs + fp/N)

  – Where
    • fs = serial fraction of code
    • fp = parallel fraction of code
    • N = number of processors
    • t_1 = sequential execution time
    • t_N = execution time on N processors
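A quick worked example: with a serial fraction fs = 0.05 (so fp = 0.95), on N = 40 processors

    S = 1 / (0.05 + 0.95/40) ~ 13.6

and even as N grows without bound the speedup is limited to 1/fs = 20.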

Page 63: Scalable Parallel Architectures and their  Software


Amdahl's Law

Figure: "Amdahl's Law (Ideal Case)" - speedup vs. number of CPUs (10-40) for serial fractions f = 0.0, 0.01, 0.05, and 0.1.

Page 64: Scalable Parallel Architectures and their  Software


Amdahl's Law (Continued)

Figure: "Amdahl's Law (Actual)" - measured speedup vs. number of CPUs (10-40) compared with the ideal curves for f = 0.0 and f = 0.01.

Page 65: Scalable Parallel Architectures and their  Software


Gustafson’s Law

Consider scaling the problem size as the processor count is increased.

    Ts      = serial execution time
    Tp(N,W) = parallel execution time for the same problem, of size W, on N CPUs
    S(N,W)  = speedup on problem size W with N CPUs
            = (Ts + Tp(1,W)) / (Ts + Tp(N,W))

Consider the case where Tp(N,W) ~ W*W/N; then

    S(N,W) = (N*Ts + N*W*W) / (N*Ts + W*W) -> N    as the problem size W grows

Gustafson's Law provides some hope for parallel applications to deliver on the promise.

Page 66: Scalable Parallel Architectures and their  Software


Parallel Programming Analysis - Example

Consider solving the 2-D Poisson equation by an iterative method on a regular grid with M points –

Page 67: Scalable Parallel Architectures and their  Software


Parallel Programming Concepts

Program must be correct and terminate for some input data set(s)

• Race condition – the result(s) depend upon the order in which processes/threads finish their calculation(s). May or may not be a problem, depending upon the results

• Deadlock – Process/thread requests resource it will never get. To be avoided – common problem in message-passing parallel programs

Page 68: Scalable Parallel Architectures and their  Software


Other Considerations

• Writing efficient parallel applications is usually more difficult than writing serial applications
  – The serial version may (or may not) provide a good starting point for the parallel version
  – Communication, synchronization, etc., can limit parallel performance

• Usually want to overlap communication and computation to minimize the ratio of communication time to computation time
  – Serial time can dominate
  – CPU computational load balance is important

• Is it worth your time to rewrite an existing application? Or create a new one? Recall Cherri Pancake's rules (simplified version):
  – Do the CPU and/or memory requirements justify parallelization?
  – Will the code be used "enough" times to justify parallelization?

Page 69: Scalable Parallel Architectures and their  Software


Parallel Programming - Real Life

• These are the main models in use today (circa 2002)

• New approaches – languages, hardware, etc., are likely to arise as

technology advances

• Other combinations of these models are possible

• Large applications will probably use more than one model

• Shared memory model is closest to mathematical model of application

– Scaling to large numbers of cpus is major issue

Page 70: Scalable Parallel Architectures and their  Software


Parallel Computing

References

• NPACI PCOMP web page - www.npaci.edu/PCOMP
  – Selected HPC link collection - categorized, updated

• Online Tutorials, Books
  – Designing and Building Parallel Programs, Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
  – NCSA Intro to MPI Tutorial. http://pacont.ncsa.uiuc.edu:8900/public/MPI/index.html
  – HLRS Parallel Programming Workshop. http://www.hlrs.de/organization/par/par_prog_ws/

• Books
  – Parallel Programming, B. Wilkinson and M. Allen
  – Computer Organization and Design, D. Patterson and J. L. Hennessy
  – Scalable Parallel Computing, K. Hwang and Z. Xu