Transcript of "Programming the IBM Power3 SP", Eric Aubanel, Advanced Computational Research Laboratory, Faculty of Computer Science, UNB (70 slides)

Page 1:

Programming the IBM Power3 SP

Eric Aubanel, Advanced Computational Research Laboratory

Faculty of Computer Science, UNB

Page 2:

Advanced Computational Research Laboratory

• High Performance Computational Problem-Solving and Visualization Environment

• Computational Experiments in multiple disciplines: CS, Science and Eng.

• 16-Processor IBM SP3

• Member of C3.ca Association, Inc. (http://www.c3.ca)

Page 3:

Advanced Computational Research Laboratory

www.cs.unb.ca/acrl

• Virendra Bhavsar, Director

• Eric Aubanel, Research Associate & Scientific Computing Support

• Sean Seeley, System Administrator

Page 4:
Page 5:
Page 6:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 7:

POWER chip: 1990 to 2003

1990: POWER
– Performance Optimized with Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit which allowed peak MFLOPS rate = 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB data cache

Page 8:

POWER chip: 1990 to 2003

1991: SP1
– IBM’s first SP (scalable power parallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software

Page 9:

POWER chip: 1990 to 2003

1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems

Page 10:

POWER chip: 1990 to 2003

1993: POWERPC
– SMP support

1996: P2SC
– POWER2 super chip: clock speeds up to 160 MHz

Page 11:

POWER chip: 1990 to 2003

Feb. ‘99: POWER3
– Combined P2SC & POWERPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including L2 cache of 1-16 MB
– Instruction & data prefetch

Page 12:

POWER3+ chip: Feb. 2000

Winterhawk II - 375 MHz
• 4-way SMP
• 2 MULT/ADD - 1500 MFLOPS
• 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
• 1.6 GB/s memory bandwidth
• 6 GFLOPS/node

Nighthawk II - 375 MHz
• 16-way SMP
• 2 MULT/ADD - 1500 MFLOPS
• 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
• 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
• 14 GB/s memory bandwidth
• 24 GFLOPS/node

Page 13:

The Clustered SMP

ACRL’s SP: four 4-way SMPs

Each node has its own copy of the O/S.

Processors on the same node are closer to one another than processors on different nodes.

Page 14:

Power3 Architecture

Page 15:

Power4 - 32 way

• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 GFLOPS total peak

[Diagram: sixteen 2-processor units, each with private L1 and L2 caches, linked by GX buses]

Page 16:

Going to NUMA

• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS/high node
• Multiple Federation adapters for scalable inter-node bandwidth
• NUMA up to 256 processors - 1.1 Teraflops

[Diagram: SP GP nodes (processors and intra-node interconnect, memory, AIX, Federation adapters), each connected by up to 16 links to a Federation switch]

Page 17:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 18:

Uni-processor Optimization

• Compiler options: start with -O3 -qstrict, then try -O3 and -qarch=pwr3
• Cache re-use
• Take advantage of the superscalar architecture - give enough operations per load/store (see the sketch below)
• Use ESSL - the optimization has already been done for you
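For illustration, a minimal sketch (not from the original slides) of what "enough operations per load/store" can mean in practice: unrolling the outer loop of a matrix-vector product so that several FMAs are performed for each load and store of y(i). The routine and array names are hypothetical.

      subroutine matvec ( n, a, x, y )
      implicit none
      integer n, i, j
      real*8 a(n,n), x(n), y(n)
c     n is assumed to be a multiple of 4 for simplicity
      do i = 1, n
         y(i) = 0.0d0
      end do
c     unroll the outer loop by 4: eight flops (four FMAs) are done for
c     each load/store of y(i), instead of two
      do j = 1, n, 4
         do i = 1, n
            y(i) = y(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
            y(i) = y(i) + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3)
         end do
      end do
      end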

Page 19:

Memory Access Times

Memory to L2 or L1: width 16 bytes / 2 cycles; rate 1.6 GB/s; latency approximately 35 cycles
L2 to L1: width 32 bytes / cycle; rate 6.4 GB/s; latency approximately 6 to 7 cycles
L1 to registers: width 2 x 8 bytes / cycle; rate 3.2 GB/s; latency 1 cycle

Page 20:

Cache

• 128-byte cache line
• L1 cache: 128-way set-associative, 64 KB
• L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)

Page 21:

How to Monitor Performance?

• IBM’s hardware monitor: HPMCOUNT
  – Uses hardware counters on the chip
  – Cache & TLB misses, floating-point ops, load/stores, …
  – Beta version
  – Available soon on ACRL’s SP

Page 22:

HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware float point rate : 2.749 Mflop/sec

Page 23:

HPMCOUNT sample output

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
c     leading dimension padded to 257: the three arrays are no longer
c     separated by an exact power-of-two stride, so corresponding pages
c     of a, b and c stop competing for the same TLB sets
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware float point rate : 3.525 Mflop/sec

Page 24:

ESSL

• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
• Fast! For a 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – ESSL dgemm: 1.2 GFlops (see the sketch below)
• Parallel (threaded and distributed) versions
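As an illustration of the dgemm comparison above, a minimal sketch (not from the slides); ESSL provides the standard BLAS dgemm interface, and the matrix size simply matches the 560x560 example.

      program matmul560
      implicit none
      integer n
      parameter ( n = 560 )
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
c     C = 1.0*A*B + 0.0*C, computed by ESSL's dgemm
      call dgemm ( 'N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      end

Link with -lessl (or -lesslsmp for the threaded version).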

Page 25:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 26:

ACRL’s IBM SP

• 4 Winterhawk II nodes (16 processors)
• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – Switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with six 18.2 GB disks

[Diagram: the four nodes connected to the switch, the disk tower, and Gigabit Ethernet]

Page 27:
Page 28:

IBM Power3 SP Switch

• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bidirectional bandwidth
• 1.2 µsec latency

Page 29:

General Parallel File System

[Diagram: Nodes 2, 3 and 4 each run Application / GPFS Client / RVSD/VSD; Node 1 runs Application / GPFS Server / RVSD/VSD; all four nodes are connected by the SP Switch]

Page 30:

ACRL Software

• Operating System: AIX 4.3.3
• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
  – VisualAge C for AIX, Version 5.0.1.0
  – VisualAge C++ Professional for AIX, Version 5.0.0.0
  – IBM VisualAge Java (not yet installed)
• Job scheduler: LoadLeveler 2.2
• Parallel programming tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
• Visualization: OpenDX (not yet installed)
• E-commerce software (not yet installed)

Page 31:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 32:

Why Parallel Computing?

• Solve large problems in reasonable time
• Many algorithms are inherently parallel
  – image processing, Monte Carlo
  – simulations (e.g. CFD)
• High performance computers have parallel architectures
  – Commercial off-the-shelf (COTS) components
    • Beowulf clusters
    • SMP nodes
  – Improvements in network technology

Page 33:

NRL Layered Ocean Model at Naval Research Laboratory

IBM Winterhawk II SP

Page 34:

Parallel Computational Models

• Data Parallelism
  – Parallel program looks like a serial program: the parallelism is in the data
  – Vector processors
  – HPF

Page 35:

Parallel Computational Models

• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – The communication network is not part of the computational model (hypercube, torus, …)

[Diagram: two processes exchanging a message via Send and Receive]

Page 36:

Parallel Computational Models

• Shared Memory (threads)
  – P(osix)threads
  – OpenMP: higher-level standard

[Diagram: several processes sharing one address space]

Page 37:

Parallel Computational Models

• Remote Memory Operations
  – “One-sided” communication: MPI-2, IBM’s LAPI
  – One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory (see the sketch below)

[Diagram: Put and Get operations between two processes]
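For illustration, a generic MPI-2 sketch of a one-sided put (not from the slides, and not necessarily supported by the Parallel Environment release described here): rank 0 writes a value directly into a window of memory exposed by rank 1, which makes no matching call.

      program onesided
      implicit none
      include 'mpif.h'
      integer ierr, rank, win
      integer (kind=MPI_ADDRESS_KIND) winsize, disp
      real*8 buf, val
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, rank, ierr )
      buf = 0.0d0
      val = 42.0d0
c     every rank exposes one real*8 (buf) as a window
      winsize = 8
      call MPI_WIN_CREATE ( buf, winsize, 8, MPI_INFO_NULL,
     &     MPI_COMM_WORLD, win, ierr )
      call MPI_WIN_FENCE ( 0, win, ierr )
      if ( rank .eq. 0 ) then
c        rank 0 writes val into buf on rank 1; rank 1 does not participate
         disp = 0
         call MPI_PUT ( val, 1, MPI_DOUBLE_PRECISION, 1, disp,
     &        1, MPI_DOUBLE_PRECISION, win, ierr )
      end if
      call MPI_WIN_FENCE ( 0, win, ierr )
      call MPI_WIN_FREE ( win, ierr )
      call MPI_FINALIZE ( ierr )
      end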

Page 38:

Parallel Computational Models

• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!

[Diagram: several nodes, each with multiple processes sharing an address space, connected by a network]

Page 39:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 40:

Message Passing Interface

• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - supported by IBM
• MPI 2.0 in 1997
  – Includes 1.1 but adds new features:
    • MPI-IO
    • One-sided communication
    • Dynamic processes

Page 41:

Advantages of MPI

• Universality
• Expressivity
  – Well suited to formulating a parallel algorithm
• Ease of debugging
  – Memory is local
• Performance
  – Explicit association of data with a process allows good use of cache

Page 42:

MPI Functionality

• Several modes of point-to-point message passing
  – blocking (e.g. MPI_SEND)
  – non-blocking (e.g. MPI_ISEND; see the sketch below)
  – synchronous (e.g. MPI_SSEND)
  – buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
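For illustration, a minimal sketch (not from the slides) of the non-blocking mode listed above: each process posts its receive, starts its send, and then waits for both to complete, so neither process blocks while the transfers are in progress. The array size is arbitrary.

      program nonblock
      implicit none
      include 'mpif.h'
      integer nx
      parameter ( nx = 100 )
      real a(nx), b(nx)
      integer my_id, other_id, ierr
      integer req(2), stats(MPI_STATUS_SIZE,2)
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = my_id
c     post the receive, then start the send; both calls return immediately
      call MPI_IRECV ( b, nx, MPI_REAL, other_id, other_id,
     &     MPI_COMM_WORLD, req(1), ierr )
      call MPI_ISEND ( a, nx, MPI_REAL, other_id, my_id,
     &     MPI_COMM_WORLD, req(2), ierr )
c     wait for both the receive and the send to complete
      call MPI_WAITALL ( 2, req, stats, ierr )
      call MPI_FINALIZE ( ierr )
      end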

Page 43:

Simple MPI Example

[Diagram: two processes, My_Id 0 and 1; process 0 prints "This is from MPI process number 0" and the other prints "This is from MPI processes other than 0"]

Page 44:

Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"    ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, Ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( Ierr )    ! bad things happen if you forget ierr
      stop
      end

Page 45:

MPI Example with send/recv

[Diagram: processes 0 and 1 each send an array to, and receive an array from, the other]

Page 46:

MPI Example with send/recv

      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status(MPI_STATUS_SIZE)   ! status argument required by MPI_RECV
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &     MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &     MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end

Page 47:

What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

Page 48:

MPI Message Passing Modes

Each user-level mode maps onto an internal protocol:
• Ready → ready
• Standard → eager (message size <= eager limit) or rendezvous (message size > eager limit)
• Synchronous → rendezvous
• Buffered → buffered

Default eager limit on the SP is 4 KB (can be raised up to 64 KB)
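This also answers the "What Will Happen?" example: it works only while the messages fit under the eager limit; for larger messages both standard sends fall back to the rendezvous protocol and the two processes deadlock, each waiting for the other's receive. One common fix, sketched here (buffer names and size are illustrative), is to let MPI pair the send and receive with MPI_SENDRECV.

      program exchange
      implicit none
      include 'mpif.h'
      integer n
      parameter ( n = 100000 )
      real sendbuf(n), recvbuf(n)
      integer my_id, other_id, ierr, status(MPI_STATUS_SIZE)
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      sendbuf = my_id
c     MPI pairs the send and the receive internally, so the exchange
c     cannot deadlock even when the message exceeds the eager limit
      call MPI_SENDRECV ( sendbuf, n, MPI_REAL, other_id, my_id,
     &     recvbuf, n, MPI_REAL, other_id, other_id,
     &     MPI_COMM_WORLD, status, ierr )
      call MPI_FINALIZE ( ierr )
      end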

Page 49:

MPI Performance Visualization

• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs

Page 50:
Page 51:
Page 52:

Message Passing on SMP

[Diagram: MPI_SEND copies the data to send into a buffer, the buffer crosses the memory crossbar or switch, and MPI_RECEIVE copies it out as received data]

export MP_SHARED_MEMORY=yes|no

Page 53:

Shared Memory MPI

MP_SHARED_MEMORY=<yes|no>

Latency (µsec) and bandwidth (MB/sec):
– between 2 nodes: 24 µsec, 133 MB/sec
– same node, MP_SHARED_MEMORY=no: 30 µsec, 80 MB/sec
– same node, MP_SHARED_MEMORY=yes: 10 µsec, 270 MB/sec

Page 54:

Message Passing off Node

• MPI across all the processors
• Many more messages going through the fabric

Page 55:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 56:

OpenMP

• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.

Page 57:

OpenMP

• All processors can access all the memory in the parallel system
• Parallel execution is achieved by generating threads which execute in parallel
• Overhead for SMP parallelization is large (100-200 µsec) - the parallel work construct must be significant enough to overcome the overhead

Page 58:

OpenMP

1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
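A minimal sketch (not from the slides) of this fork/join behaviour: a parallel region is forked, each thread in the team prints its thread number, and the threads join back into the master thread. It would typically be compiled with the thread-safe compiler invocation and -qsmp=omp (e.g. xlf_r -qsmp=omp).

      program forkjoin
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
      print *, 'before the parallel region: master thread only'
c     FORK: the block below runs once per thread in the team
!$OMP PARALLEL
      print *, 'hello from thread', omp_get_thread_num(),
     &     ' of', omp_get_num_threads()
!$OMP END PARALLEL
c     JOIN: back to the master thread only
      print *, 'after the parallel region: master thread only'
      end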

Page 59:

OpenMP

How is OpenMP typically used?

• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.
• Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

Page 60:

OpenMP Loop Parallelization

!$OMP PARALLEL DO
      do i = 0, ilong
         do k = 1, kshort
            ...
         end do
      end do

#pragma omp parallel for
for (i = 0; i <= ilong; i++)
    for (k = 1; k <= kshort; k++) {
       ...
    }

Page 61:

Variable Scoping

• Most difficult part of shared memory parallelization:
  – What memory is shared
  – What memory is private - each thread has its own copy
• Compare MPI: all variables are private
• Variables are shared by default, except:
  – loop indices
  – scalars that are set and then used in the loop (see the sketch below)
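Scoping can also be declared explicitly on the directive rather than left to the defaults. A minimal sketch (not from the slides; the routine and variable names are hypothetical): the temporary scalar t is set and then used inside the loop, so it must be PRIVATE or the threads would overwrite each other's copy.

      subroutine transform ( n, a, b )
      implicit none
      integer n, i
      real*8 a(n), b(n), t
c     t is set and then used inside the loop, so each thread needs
c     its own private copy; a, b and n are safely shared
!$OMP PARALLEL DO PRIVATE(i, t) SHARED(a, b, n)
      do i = 1, n
         t = 2.0d0 * b(i)
         a(i) = t * t + 1.0d0
      end do
      end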

Page 62:

How Does Sharing Work?

Shared x, initially 0. Threads 1 and 2 each execute increment(x) { x = x + 1; }, which compiles to:

10 LOAD  A, (x address)
20 ADD   A, 1
30 STORE A, (x address)

Because the two threads' loads and stores can interleave, the result could be 1 or 2: synchronization is needed.
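One way to provide that synchronization, sketched in OpenMP Fortran (not from the slides): an ATOMIC directive makes each update of the shared variable indivisible, so the final count is always correct. A CRITICAL section or a REDUCTION clause would serve the same purpose.

      program atomic_inc
      implicit none
      integer x, i
      x = 0
c     each iteration increments the shared counter x
!$OMP PARALLEL DO SHARED(x) PRIVATE(i)
      do i = 1, 1000
c        the ATOMIC directive makes the read-modify-write indivisible
!$OMP ATOMIC
         x = x + 1
      end do
      print *, 'x = ', x   ! always 1000
      end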

Page 63:

False Sharing

!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do

[Diagram: a cache line holding eight array elements, with its address tag, shared between Processor 1 and Processor 2]

Say A(1-5), assigned to the first thread, starts on a cache-line boundary; then some of A(6-10), assigned to the second thread, falls on that same cache line, so the two threads keep invalidating each other's copy of the line even though they never touch the same element.

Page 64:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 65:

Why Hybrid MPI-OpenMP?

• To optimize performance on “mixed-mode” hardware like the SP
• MPI is used for “inter-node” communication, and OpenMP is used for “intra-node” communication
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation

Page 66:

Hybrid MPI-OpenMP?

• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program
• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true for applications with a single level of parallelism

Page 67:

Hybrid Scenario

• Thread the computational portions of the code that exist between MPI calls (see the sketch below)
• MPI calls are “single-threaded” and therefore use only a single CPU
• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, the resulting threads need little or no communication between them
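A minimal sketch of this scenario (not from the slides; the decomposition and neighbour ranks are assumed to be set up elsewhere): the MPI halo exchange is done by the single master thread, and the computation between the MPI calls is spread over the node's CPUs with OpenMP.

      subroutine hybrid_step ( n, unew, uold, left, right )
      implicit none
      include 'mpif.h'
      integer n, i, left, right, ierr
      integer status(MPI_STATUS_SIZE)
      real*8 unew(0:n+1), uold(0:n+1)
c     MPI part (single-threaded): exchange halo points with the
c     neighbouring processes
      call MPI_SENDRECV ( uold(n), 1, MPI_DOUBLE_PRECISION, right, 0,
     &     uold(0), 1, MPI_DOUBLE_PRECISION, left, 0,
     &     MPI_COMM_WORLD, status, ierr )
      call MPI_SENDRECV ( uold(1), 1, MPI_DOUBLE_PRECISION, left, 1,
     &     uold(n+1), 1, MPI_DOUBLE_PRECISION, right, 1,
     &     MPI_COMM_WORLD, status, ierr )
c     OpenMP part: the computation between the MPI calls uses all
c     the CPUs of the node
!$OMP PARALLEL DO PRIVATE(i)
      do i = 1, n
         unew(i) = 0.5d0 * ( uold(i-1) + uold(i+1) )
      end do
      end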

Page 68:

Programming the IBM Power3 SP

• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)

Page 69:

MPI-IO

• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See “Using MPI-2”, by Gropp et al. (MIT Press)

[Diagram: several processes, each with data in memory, writing to different parts of a single shared file]
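A minimal sketch of the idea in the diagram (not from the slides; the file name and block size are arbitrary): each process writes its own block of a single shared file at an offset computed from its rank.

      program mpiio
      implicit none
      include 'mpif.h'
      integer nx
      parameter ( nx = 1000 )
      real*8 a(nx)
      integer my_id, fh, ierr, status(MPI_STATUS_SIZE)
      integer (kind=MPI_OFFSET_KIND) offset
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = my_id
c     all processes open the same file collectively
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'datafile',
     &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )
c     each process writes its block at an offset based on its rank
      offset = my_id * nx * 8
      call MPI_FILE_WRITE_AT ( fh, offset, a, nx,
     &     MPI_DOUBLE_PRECISION, status, ierr )
      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end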

Page 70:

Conclusion

• Don’t forget uni-processor optimization
• If you choose one parallel programming API, choose MPI
• Mixed MPI-OpenMP may be appropriate in certain cases
  – more work is needed here
• The remote memory access model may be the answer