Scalable Parallel Architectures and their Software

Transcript of "Scalable Parallel Architectures and their Software"
NPACI Parallel Computing Seminars, San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
Introduction
• Overview of RISC CPUs, Memory Hierarchy
• Parallel Systems - General Hardware Layout (SMP, Distributed, Hybrid)
• Communications Networks for Parallel Systems
• Parallel I/O
• Operating Systems Concepts
• Overview of Parallel Programming Methodologies
  – Distributed-Memory
  – Shared-Memory
• Hardware Specifics of NPACI Parallel Machines
  – IBM SP Blue Horizon
  – New CPU Architectures
    • IBM POWER4
    • Intel IA-64
What is Parallel Computing?
• Parallel computing: the use of multiple computers, processors, or processes working together on a common task.
  – Each processor works on its section of the problem
  – Processors are allowed to exchange information (data in local memory) with other processors

[Figure: a 2-D grid of the problem to be solved, divided into four areas; CPUs #1-#4 each work on one area and exchange data across the shared boundaries.]
Why Parallel Computing?
• Limits of single-CPU computing
  – Available memory
  – Performance - usually "time to solution"
• Limits of vector computers - the main HPC alternative
  – System cost, including maintenance
  – Cost/MFlop
• Parallel computing allows:
  – Solving problems that don't fit on a single CPU
  – Solving problems that can't be solved in a reasonable time on one CPU
• We can run…
  – Larger problems
  – Finer resolution
  – Faster
  – More cases
Scalable Parallel Computer Systems
Scalable [ (CPUs) + (Memory) + (I/O) + (Interconnect) + (OS) ] = Scalable Parallel Computer System
Scalable Parallel Computer Systems
Scalability: a parallel system is scalable if it is capable of providing enhanced resources to accommodate increasing performance and/or functionality

• Resource scalability: scalability achieved by increasing machine size (# CPUs, memory, I/O, network, etc.)
• Application scalability: performance continues to improve as
  – machine size grows
  – problem size grows
Shared and Distributed Memory Systems
Multiprocessor (shared memory) - single address space; all processors have access to a pool of shared memory. Examples: SUN HPC, CRAY T90, NEC SX-6
Methods of memory access: bus, crossbar

Multicomputer (distributed memory) - each processor has its own local memory. Examples: CRAY T3E, IBM SP2, PC cluster

[Figure: shared-memory system - four CPUs connected to a single MEMORY through a bus/crossbar; distributed-memory system - four CPUs, each with its own memory M, connected by a NETWORK.]
Hybrid (SMP Clusters) Systems
[Figure: three SMP nodes, each with four CPUs sharing a MEMORY through an on-node interconnect, joined by an off-node network.]

Hybrid architecture - processes share memory on-node, may/must use message-passing off-node, and may share off-node memory. Examples: IBM SP Blue Horizon, SGI Origin, Compaq AlphaServer
RISC-Based Computer Hardware Concepts
RISC CPUs are the most common CPUs in HPC - many design concepts were transferred from vector CPUs to RISC, and later to CISC
• Multiple Functional Units
• Pipelined Instructions
• Memory Hierarchy
• Instructions typically take 1-several CPU clock cycles
– Clock cycles provide time scale for measurement
• Data transfers – memory-to-CPU, network, I/O, etc.
Processor Related Terms
• RISC : Reduced Instruction Set Computers
• PIPELINE : Technique where multiple instructions are overlapped in execution
• SUPERSCALAR : Computer design feature - multiple instructions can be executed per clock period
Note (Laura C. Nett): the instruction set is just how each operation is processed, e.g., x = y + 1: load y, add 1, put the result in x.
‘Typical’ RISC CPU
[Figure: a 'typical' RISC CPU - registers r0 … r32 feed the functional units (FP Add, FP Multiply, FP Divide, FP Multiply & Add) on the CPU chip; loads & stores move data between the registers and memory/cache.]
Functional Unit
• Fully segmented multiply pipeline: A(I) = C(I)*D(I), with the operand streams C(I) and D(I) flowing through the pipeline stages

[Figure: chair-building analogy of a pipelined functional unit - Carpenters 1-5 each perform one stage of assembly, so once the pipeline is full a finished chair emerges every stage-time.]
Dual Hardware Pipes
A(I) = C(I)*D(I)

[Figure: dual hardware pipes - odd elements C(I), D(I) feed one pipe while even elements C(I+1), D(I+1) feed the other, producing the results A(I) and A(I+1) together.]
RISC Memory/Cache Related Terms
• ICACHE: instruction cache
• DCACHE (Level 1): data cache closest to registers
• SCACHE (Level 2): secondary data cache
  – Data from SCACHE must go through DCACHE to reach the registers
  – SCACHE is larger than DCACHE
  – Not all processors have an SCACHE
• CACHE LINE: minimum transfer unit (usually in bytes) for moving data between different levels of the memory hierarchy
• TLB: translation-look-aside buffer; keeps the addresses of recently accessed pages (blocks of memory) in main memory
• MEMORY BANDWIDTH: transfer rate (in MBytes/sec) between different levels of memory
• MEMORY ACCESS TIME: time required (often measured in clock cycles) to bring data items from one level in memory to another
• CACHE COHERENCY: mechanism for ensuring data consistency of shared variables across the memory hierarchy
RISC CPU, Cache, and Memory - Basic Layout

[Figure: memory hierarchy - CPU registers, Level 1 cache, Level 2 cache, and main memory; moving away from the CPU, speed and cost ($/bit) decrease while size increases.]
RISC Memory/Cache Related Terms (cont.)
Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.
[Figure: direct-mapped cache - each block of main memory maps to exactly one location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Fully associative cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.
[Figure: fully associative cache - a block of main memory may be placed in any location in the cache.]
RISC Memory/Cache Related Terms (cont.)
Set associative cache: the middle range of designs between direct-mapped and fully associative caches is called set-associative. In an n-way set-associative cache, a block from main memory can go into any of n (n at least 2) locations in the cache.
[Figure: 2-way set-associative cache - each block of main memory maps to one set and may occupy either of that set's two locations.]
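To make the three mappings concrete, here is a minimal C sketch (added here, not from the original slides; the 64-byte line size and 256 sets are assumed example values) of how a cache splits an address into block, set, and tag. Direct-mapped is the 1-way special case, and fully associative is the single-set special case.

/* Hypothetical cache geometry (assumed values, for illustration). */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line (assumed) */
#define NUM_SETS  256  /* number of sets (assumed)       */

/* Split an address into the fields a set-associative cache uses. */
static void map_address(uintptr_t addr, unsigned ways) {
    uintptr_t block = addr / LINE_SIZE;   /* which memory block          */
    unsigned  set   = block % NUM_SETS;   /* the one set it must go to   */
    uintptr_t tag   = block / NUM_SETS;   /* identifies the block in set */
    printf("addr 0x%lx -> set %u (any of %u ways), tag 0x%lx\n",
           (unsigned long)addr, set, ways, (unsigned long)tag);
}

int main(void) {
    /* Two addresses exactly NUM_SETS*LINE_SIZE bytes apart map to the
     * same set: in a direct-mapped cache (ways = 1) they evict each
     * other; with 2 ways they can coexist. */
    map_address(0x10000, 2);
    map_address(0x10000 + NUM_SETS * LINE_SIZE, 2);
    return 0;
}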
RISC Memory/Cache Related Terms
• The data cache was designed to allow programmers to take advantage of common data access patterns:
  – Spatial locality: when an array element is referenced, its neighbors are likely to be referenced
    • Cache lines are fetched together
    • Work on consecutive data elements in the same cache line
  – Temporal locality: when an array element is referenced, it is likely to be referenced again soon
    • Arrange code so that data in cache is reused as often as possible
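A small C illustration (added here; the array size is an arbitrary example) of the two patterns. In C, 2-D arrays are stored row-major, so the inner loop should walk along rows:

#include <stdio.h>
#include <stddef.h>

#define N 1024

static double a[N][N];

/* Cache-friendly: the inner loop walks consecutive elements of a row,
 * so each fetched cache line is fully used (spatial locality). */
static double sum_rows(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Cache-unfriendly: the inner loop strides down a column, touching a
 * new cache line on every access and discarding the rest of each line. */
static double sum_cols(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("rows: %.0f  cols: %.0f\n", sum_rows(), sum_cols());
    return 0;
}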
Typical RISC Floating-Point Operation Times
IBM POWER3-II
• CPU clock speed - 375 MHz (~3 ns per cycle)

Instruction          32-bit (cycles)   64-bit (cycles)
FP Multiply or Add        3-4               3-4
FP Multiply-Add           3-4               3-4
FP Square Root           14-23             22-31
FP Divide                14-21             18-25
Typical RISC Memory Access Times
IBM POWER3-II

Access                        Bandwidth (GB/s)   Time (cycles)
Load register from L1               3.2                1
Store register to L1                1.6                1
Load/store L1 from/to L2            6.4                9
Load/store L1 from/to RAM           1.6               35
Single CPU Optimization
Optimization of the serial (single-CPU) version is very important
• Want to parallelize the best serial version - where appropriate
New CPUs in HPC
New CPU designs with new features
• IBM POWER 4
– U Texas Regatta nodes – covered on Wednesday
• Intel IA-64
– SDSC DTF TeraGrid PC Linux Cluster
Parallel Networks
The function of a network is to transfer data from source to destination in support of the network transactions that realize the supported programming model(s).
Data transfer can be for message-passing and/or shared-memory operations.
• Network Terminology
• Common Parallel Networks
System Interconnect Topologies

Send information among CPUs through a network. The best choice would be a fully connected network, in which each processor has a direct link to every other processor. Such a network would be very expensive and difficult to scale (cost grows ~N*N). Instead, processors are arranged in some variation of a mesh, torus, hypercube, etc.

[Figure: 3-D hypercube, 2-D mesh, and 2-D torus topologies.]
Network Terminology
• Network latency: the time taken to begin sending a message. Units are microseconds, milliseconds, etc. Smaller is better.
• Network bandwidth: the rate at which data is transferred from one point to another. Units are bytes/sec, Mbytes/sec, etc. Larger is better.
  – May vary with data size

For the IBM Blue Horizon:

Switch type   Latency                      Bandwidth (MB/sec)
US            ~17 μs (~6000 CPU cycles)    ~350
Network Terminology

Bus
• Shared data path
• Data requests require exclusive access
• Complexity ~ O(N)
• Not scalable – Bandwidth ~ O(1)
Crossbar Switch
• Non-blocking switching grid among network elements
• Bandwidth ~ O(N)
• Complexity ~ O(N*N)
Multistage Interconnection Network (MIN)
• Hierarchy of switching networks – e.g., Omega network for N CPUs, N memory banks: complexity ~ O(ln(N))
Network Terminology (Continued)
• Diameter – maximum distance (in nodes) between any two processors
• Connectivity – number of distinct paths between any two processors
• Channel width – maximum number of bits that can be simultaneously sent over link connecting two processors = number of physical wires in each link
• Channel rate – peak rate at which data can be sent over single physical wire
• Channel bandwidth – peak rate at which data can be sent over link = (channel rate) * (channel width)
• Bisection width – minimum number of links that have to be removed to partition network into two equal halves
• Bisection bandwidth – maximum rate at which data can be transferred between any two halves of the network connecting equal numbers of CPUs = (bisection width) * (channel bandwidth)
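As a worked example with assumed numbers: a channel width of 8 wires at a channel rate of 1 Gbit/sec per wire gives a channel bandwidth of 8 Gbit/sec; if the bisection width is 4 links, the bisection bandwidth is 4 * 8 = 32 Gbit/sec.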
Communication Overhead
Time to send a message of M bytes - simple form:

    Tcomm = TL + M*Td + Tcontention

TL = message latency
Td = time to transfer one byte = 1 byte/bandwidth
Tcontention = accounts for other network traffic
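To make the model concrete, here is a small C sketch (added, not from the slides) that evaluates Tcomm using the earlier Blue Horizon figures (~17 μs latency, ~350 MB/sec) as example inputs; contention is ignored:

#include <stdio.h>

/* Estimated time (seconds) to send msg_bytes with the given
 * latency (seconds) and bandwidth (bytes/second): TL + M*Td. */
static double t_comm(double latency, double bandwidth, double msg_bytes) {
    return latency + msg_bytes / bandwidth;
}

int main(void) {
    double lat = 17e-6, bw = 350e6;
    /* For small messages latency dominates; for large ones, bandwidth. */
    printf("1 KB : %.1f us\n", 1e6 * t_comm(lat, bw, 1024.0));
    printf("1 MB : %.1f us\n", 1e6 * t_comm(lat, bw, 1024.0 * 1024.0));
    return 0;
}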
Parallel I/O
• I/O can be a limiting factor in a parallel application
• I/O system properties - capacity, bandwidth, access time
• Need support for parallel I/O in the programming system
• Need underlying hardware and system support for parallel I/O
  – IBM GPFS - low-level API for developing high-level parallel I/O functionality (MPI-IO, HDF5, etc.)
Unix OS Concepts for Parallel Programming
Most operating systems used by parallel computers are Unix-based
• Unix process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles
• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads
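To make the process/thread contrast concrete, here is a minimal C sketch (added, not from the slides; POSIX fork and Pthreads, compile with -pthread) showing that a forked child writes to a private copy of the address space while a thread writes to the shared one:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

int shared = 0;

static void *thread_fn(void *arg) {
    shared = 42;              /* visible to the creator: same address space */
    return NULL;
}

int main(void) {
    pid_t pid = fork();       /* child gets a *copy* of the address space */
    if (pid == 0) {
        shared = 7;           /* modifies only the child's private copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   shared = %d\n", shared);   /* still 0 */

    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    printf("after thread: shared = %d\n", shared);   /* now 42 */
    return 0;
}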
Parallel Computer Architectures (Flynn Taxonomy)
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, hybrid (SMP cluster)
Hardware Architecture Models for Design of Parallel Programs
Sequential computers - the von Neumann model (RAM) is the universal computational model
Parallel computers - no single universal model exists
• Model must be sufficiently general to encapsulate hardware features of parallel systems
• Programs designed from model must execute efficiently on real parallel systems
Designing and Building Parallel Applications
Donald [email protected]
San Diego Supercomputing Center
What is Parallel Computing?
• Parallel computing: the use of multiple computers, processors, or processes concurrently working together on a common task.
  – Each processor/process works on its section of the problem
  – Processors/processes are allowed to exchange information (data in local memory) with other processors/processes

[Figure: a 2-D grid of the problem to be solved, divided into four areas; CPUs #1-#4 each work on one area and exchange data across the shared boundaries.]
Shared and Distributed Memory Systems
Multiprocessor (shared memory) - single address space; processes have access to a pool of shared memory; single OS.

Multicomputer (distributed memory) - each processor has its own local memory; processes usually do message passing to exchange data among processors; usually multiple copies of the OS.

[Figure: shared-memory system - four CPUs connected to a single MEMORY through an interconnect; distributed-memory system - four CPUs, each with its own memory M, connected by a NETWORK.]
Hybrid (SMP Clusters) System
[Figure: three SMP nodes, each with four CPUs sharing a MEMORY through an on-node interconnect, joined by an off-node network.]

• Must/may use message-passing
• Single or multiple OS copies
• Node-local operations are less costly than off-node operations
Unix OS Concepts for Parallel Programming
Most operating systems used are Unix-based
• Unix process (task)
  – Executable code
  – Instruction pointer
  – Stack
  – Logical registers
  – Heap
  – Private address space
  – Task forking to create dependent processes - thousands of clock cycles
• Thread - "lightweight process"
  – Logical registers
  – Stack
  – Shared address space
  – Hundreds of clock cycles to create/destroy/synchronize threads
Generic Parallel Programming Models
Single Program Multiple Data Stream (SPMD)
• Each CPU accesses the same object code
• Same application run on different data
  – Data exchange may be handled explicitly/implicitly
• "Natural" model for SIMD machines
• Most commonly used generic parallel programming model
  – Message-passing
  – Shared-memory
• Usually uses the process/task ID to differentiate work (see the sketch after this list)
• Focus of the remainder of this section

Multiple Program Multiple Data Stream (MPMD)
• Each CPU accesses different object code
• Each CPU has only the data/instructions it needs
• "Natural" model for MIMD machines
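A minimal SPMD sketch in C with MPI (added, not from the slides): every CPU runs this same executable, and the process rank (task ID) differentiates the work:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's task ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of tasks  */

    /* Same program, different data: each rank works on its own slice. */
    printf("task %d of %d working on its section\n", rank, size);

    MPI_Finalize();
    return 0;
}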
Parallel “Architectures” – Mapping Hardware Models to Programming Models
• Control mechanism: SIMD, MIMD
• Memory model: shared-memory, distributed-memory, hybrid (SMP cluster)
• Programming model: SPMD, MPMD
Methods of Problem Decomposition for Parallel Programming
Want to map (Problem + Algorithms + Data) onto the Architecture
• Conceptualize the mapping via, e.g., pseudocode
• Realize the mapping via a programming language

• Data decomposition - data-parallel program
  – Each processor performs the same task on different data
  – Example - grid problems
• Task (functional) decomposition - task-parallel program
  – Each processor performs a different task
  – Example - signal processing: adding/subtracting frequencies from a spectrum
• Other decomposition methods
Designing and Building Parallel Applications
• Generic Problem Architectures
• Design and Construction Principles
• Incorporate Computer Science Algorithms
• Use Parallel Numerical Libraries Where Possible
Designing and Building Parallel Applications
• Knowing when (not) to parallelize is very important
• Cherri Pancake's "Rules," summarized:
  – Frequency of use
  – Execution time
  – Resolution needs
  – Problem size
Categories of Parallel Problems
Generic parallel problem "architectures" (after G. Fox)
• Ideally parallel (embarrassingly parallel, "job-level parallel")
  – Same application run on different data
  – Could be run on separate machines
  – Example: parameter studies
• Almost ideally parallel
  – Similar to the ideal case, but with "minimum" coordination required
  – Example: linear Monte Carlo calculations, integrals
• Pipeline parallelism
  – Problem divided into tasks that have to be completed sequentially
  – Can be transformed into partially sequential tasks
  – Example: DSP filtering
• Synchronous parallelism
  – Each operation performed on all/most of the data
  – Operations depend on results of prior operations
  – All processes must be synchronized at regular points
  – Example: modeling atmospheric dynamics
• Loosely synchronous parallelism
  – Similar to the synchronous case, but with "minimum" intermittent data sharing
  – Example: modeling diffusion of contaminants through groundwater
Designing and Building Parallel Applications
Attributes of parallel algorithms
• Concurrency - many actions performed "simultaneously"
• Modularity - decomposition of complex entities into simpler components
• Locality - want a high ratio of local memory access to remote memory access
  – Usually want to minimize the communication/computation ratio
• Performance - measures of algorithmic "efficiency"
  – Execution time
  – Complexity - usually ~ execution time
  – Scalability
Designing and Building Parallel Applications
• Partitioning - break the main task down into smaller ones, either identical or "disjoint"
• Communication phase - determine the communication patterns for task coordination, communication algorithms
• Agglomeration - evaluate task and/or communication structures with respect to performance and implementation costs; tasks may be combined to improve performance or reduce communication costs
• Mapping - tasks assigned to processors; maximize processor utilization, minimize communication costs; mapping may be either static or dynamic

May have to iterate the whole process until satisfied with the expected performance
• Consider writing the application in parallel, using either SPMD message-passing or shared-memory
• Implementation (software & hardware) may require revisiting, additional refinement, or re-design
Designing and Building Parallel Applications
Partitioning
– Geometric or Physical decomposition (Domain Decomposition) - partition data associated with problem
– Functional (task) decomposition – partition into disjoint tasks associated with problem
– Divide and Conquer – partition problem into two simpler problems of approximately equivalent “size” – iterate to produce set of indivisible sub-problems
Generic Parallel Programming Software Systems
Message-Passing
• Local tasks, each encapsulating local data
• Explicit data exchange
• Supports both SPMD and MPMD
• Supports both task and data decomposition
• Most commonly used
• Process-based, but for performance, processes should run on separate CPUs
• Example APIs: MPI, PVM message-passing libraries
• MP systems, in particular MPI, will be the focus of the remainder of the workshop

Data Parallel
• Usually SPMD
• Supports data decomposition
• Data mapping to CPUs may be either implicit or explicit
• Example: HPF compiler

Shared-Memory
• Tasks share a common address space
• No explicit transfer of data - supports both task and data decomposition
• Can be SPMD or MPMD
• Thread-based, but for performance, threads should run on separate CPUs
• Example APIs: OpenMP, Pthreads (see the sketch after this list)

Hybrid
• Combination of message-passing and shared-memory - supports both task and data decomposition
• Example: OpenMP + MPI
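As a minimal shared-memory sketch (added here, not from the slides), the following C/OpenMP loop distributes its iterations among threads that share the arrays, with no explicit data transfer; compile with an OpenMP flag such as -fopenmp:

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++) b[i] = i;

    /* The work of the loop is divided among the available threads;
     * a and b live in the single shared address space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("threads available: %d, a[N-1] = %f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}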
Programming Methodologies - Practical Aspects
• The bulk of parallel programs are written in Fortran, C, or C++
  – Generally the best compiler and tool support for parallel program development
• The bulk of parallel programs use message-passing with MPI
  – Performance, portability, mature compilers, libraries for parallel program development
• Data and/or tasks are split up onto different processors by:
  – Distributing the data/tasks onto different CPUs, each with local memory (MPPs, MPI)
  – Distributing the work of each loop to different CPUs (SMPs, OpenMP, Pthreads)
  – Hybrid - distribute data onto SMPs and then, within each SMP, distribute the work of each loop (or task) to different CPUs within the box (SMP cluster, MPI & OpenMP)
Typical Data Decomposition for Parallelism
Example: solve the 2-D wave equation.

Original partial differential equation:

    ∂²f/∂t² = D ∂²f/∂x² + B ∂²f/∂y²

Finite difference approximation (superscript n = time step; subscripts i,j = grid point):

    (f(i,j,n+1) - 2 f(i,j,n) + f(i,j,n-1)) / Δt²
        = D (f(i+1,j,n) - 2 f(i,j,n) + f(i-1,j,n)) / Δx²
        + B (f(i,j+1,n) - 2 f(i,j,n) + f(i,j-1,n)) / Δy²

[Figure: the 2-D (x,y) grid is decomposed into eight blocks, assigned to PE #0 through PE #7.]
Sending Data Between CPUs
Finite difference approximation: as on the previous slide.

[Figure: a 50x50 grid split among four PEs - PE #0 (i=1-25, j=1-25), PE #1 (i=1-25, j=26-50), PE #3 (i=26-50, j=1-25), PE #4 (i=26-50, j=26-50); each PE sends its boundary rows/columns (i=25, i=26, j=25, j=26) to the neighboring PEs.]

Sample pseudocode:

if (taskid == 0) then
  li = 1
  ui = 25
  lj = 1
  uj = 25
  send(1:25) = f(25, 1:25)
elseif (taskid == 1) then
  ...
elseif (taskid == 2) then
  ...
elseif (taskid == 3) then
  ...
end if

do j = lj, uj
  do i = li, ui
    work on f(i,j)
  end do
end do
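A hedged C/MPI sketch of the boundary exchange the pseudocode implies, assuming just 2 tasks side by side, each owning a 25-element boundary column; the calls are standard MPI, while the names and layout are illustrative:

#include <stdio.h>
#include <mpi.h>

#define NROWS 25

int main(int argc, char **argv) {
    int rank, size;
    double boundary[NROWS], ghost[NROWS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* assumed 2-task layout */

    /* Each task fills the column it must share with its neighbor. */
    for (int i = 0; i < NROWS; i++) boundary[i] = rank;

    int neighbor = (rank == 0) ? 1 : 0;
    /* Exchange boundary columns in one call; the received values
     * become ghost cells for the stencil update. */
    MPI_Sendrecv(boundary, NROWS, MPI_DOUBLE, neighbor, 0,
                 ghost,    NROWS, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("task %d received ghost column from task %d\n", rank, neighbor);
    MPI_Finalize();
    return 0;
}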
Typical Task Parallel Decomposition
• Signal processing
• Use one processor for each independent task
• Can use more processors if one is overloaded

[Figure: pipeline - SPECTRUM IN → Subtract Frequency f1 (Process 0) → Subtract Frequency f2 (Process 1) → Subtract Frequency f3 (Process 2) → SPECTRUM OUT.]
Basics of Task Parallel Decomposition - SPMD

The same program runs on 2 different CPUs; task decomposition analysis has defined 2 tasks (a and b) to be done by 2 CPUs.

program.f:
  ...
  initialize
  ...
  if (TaskID == A) then
    do task a
  elseif (TaskID == B) then
    do task b
  end if
  ...
  end program

This single source yields two execution streams: Task A's stream (initialize, do task a) and Task B's stream (initialize, do task b).
Multi-Level Task Parallelism

Implementation: MPI and OpenMP. Four processor sets each run the same program; within a set, threads execute the parallel loop blocks, while MPI carries communication among the sets over the network.

Program tskpar
  Implicit none
  (declarations)
  Do loop #1: parallel block (threads)
  End task #1
  (serial work)
  Do loop #2: parallel block (threads)
  End task #2
  (serial work)

[Figure: four copies of the program, one per processor set (Proc set #1 - #4), exchanging MPI messages across the network.]
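A minimal hybrid sketch of the same structure (added, not from the slides) in C with MPI + OpenMP; the reduction loop stands in for the slide's parallel loop blocks and serial work. Compile with an MPI wrapper and an OpenMP flag:

#include <stdio.h>
#include <mpi.h>

#define N 100000

int main(int argc, char **argv) {
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Loop #1: parallel block executed by this rank's OpenMP threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += 1.0;                 /* stand-in for real work */

    /* (serial work, then MPI communication among the processor sets) */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}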
Parallel Application Performance Concepts
• Parallel Speedup
• Parallel Efficiency
• Parallel Overhead
• Limits on Parallel Performance
Parallel Application Performance Concepts
• Parallel Speedup - ratio of best sequential time to parallel execution time
– S(n) = ts/tp
• Parallel Efficiency - fraction of time processors in use
– E(n) = ts/(tp*n) = S(n)/n
• Parallel Overhead
– Communication time (Message-Passing)
– Process creation/synchronization (MP)
– Extra code to support parallelism, such as Load Balancing
– Thread creation/coordination time (SMP)
• Limits on Parallel Performance
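Applying the definitions above to assumed example numbers: if the best serial time is ts = 100 s and the parallel time on n = 16 CPUs is tp = 8 s, then S(16) = 100/8 = 12.5 and E(16) = 12.5/16 ≈ 0.78; the shortfall from perfect speedup is the parallel overhead listed above.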
Limits of Parallel Computing
• Theoretical upper limits
– Amdahl’s Law
– Gustafson’s Law
• Practical limits
– Communication overhead
– Synchronization overhead
– Extra operations necessary for parallel version
• Other Considerations
– Time used to re-write (existing) code
Parallel Computing - Theoretical Performance Upper Limits
• All parallel programs contain:
– Parallel sections
– Serial sections
Serial sections limit the parallel performance
Amdahl’s Law provides a theoretical upper limit on parallel performance for size-constant problems
Amdahl's Law

• Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors
  – Effect of multiple processors on run time for size-constant problems:

      tN = (fp/N + fs) * t1

  – Effect of multiple processors on parallel speedup S:

      S = 1 / (fs + fp/N)

  – Where
    • fs = serial fraction of code
    • fp = parallel fraction of code
    • N = number of processors
    • t1 = sequential execution time
Amdahl's Law

[Figure: ideal-case Amdahl speedup vs. number of CPUs (10-40) for serial fractions f = 0.0, 0.01, 0.05, and 0.1; the curves flatten well below the ideal f = 0.0 line as f grows.]
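The numbers behind such plots can be reproduced with a few lines of C (added here as a sketch), tabulating S = 1/(f + (1-f)/N) for the same serial fractions:

#include <stdio.h>

int main(void) {
    const double f[] = {0.0, 0.01, 0.05, 0.1};  /* serial fractions */
    for (int n = 10; n <= 40; n += 10) {
        printf("N = %2d:", n);
        for (int k = 0; k < 4; k++)
            printf("  f=%.2f -> S=%5.2f",
                   f[k], 1.0 / (f[k] + (1.0 - f[k]) / n));
        printf("\n");
    }
    return 0;
}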
Amdahl's Law (Continued)

[Figure: actual measured speedup vs. number of CPUs (10-40), plotted alongside the ideal curves for f = 0.0 and f = 0.01.]
Gustafson’s Law
Consider scaling the problem size as the processor count is increased.

Ts = serial execution time
Tp(N,W) = parallel execution time for the same problem, size W, on N CPUs
S(N,W) = speedup on problem size W with N CPUs:

    S(N,W) = (Ts + Tp(1,W)) / (Ts + Tp(N,W))

Consider the case where Tp(N,W) ~ W*W/N:

    S(N,W) -> (N*Ts + N*W*W) / (N*Ts + W*W) -> N as W grows

Gustafson's Law provides some hope for parallel applications to deliver on the promise.
Parallel Programming Analysis - Example
Consider solving the 2-D Poisson equation by an iterative method on a regular grid with M points.
Parallel Programming Concepts
A program must be correct and terminate for some input data set(s)

• Race condition - result(s) depend upon the order in which processes/threads finish calculation(s). May or may not be a problem, depending upon the results.
• Deadlock - a process/thread requests a resource it will never get. To be avoided - a common problem in message-passing parallel programs.
Other Considerations
• Writing efficient parallel applications is usually more difficult than writing serial applications
  – The serial version may (or may not) provide a good starting point for the parallel version
  – Communication, synchronization, etc., can limit parallel performance
• Usually want to overlap communication and computation to minimize the ratio of communication time to computation time
  – Serial time can dominate
  – CPU computational load balance is important
• Is it worth your time to rewrite an existing application? Or create a new one? Recall Cherri Pancake's rules (simplified version):
  – Do the CPU and/or memory requirements justify parallelization?
  – Will the code be used "enough" times to justify parallelization?
Parallel Programming - Real Life
• These are the main models in use today (circa 2002)
• New approaches - languages, hardware, etc. - are likely to arise as technology advances
• Other combinations of these models are possible
• Large applications will probably use more than one model
• The shared-memory model is closest to the mathematical model of the application
  – Scaling to large numbers of CPUs is the major issue
Parallel Computing
References
• NPACI PCOMP web page - www.npaci.edu/PCOMP
  – Selected HPC link collection - categorized, updated
• Online tutorials and books
  – Designing and Building Parallel Programs, Ian Foster: http://www-unix.mcs.anl.gov/dbpp/
  – NCSA Intro to MPI tutorial: http://pacont.ncsa.uiuc.edu:8900/public/MPI/index.html
  – HLRS Parallel Programming Workshop: http://www.hlrs.de/organization/par/par_prog_ws/
• Books
  – Parallel Programming, B. Wilkinson and M. Allen
  – Computer Organization and Design, D. Patterson and J. L. Hennessy
  – Scalable Parallel Computing, K. Hwang and Z. Xu