Programming the IBM Power3 SP
Eric Aubanel
Advanced Computational Research Laboratory
Faculty of Computer Science, UNB
Advanced Computational Research Laboratory
• High Performance Computational Problem-Solving and Visualization Environment
• Computational Experiments in multiple disciplines: CS, Science and Eng.
• 16-Processor IBM SP3
• Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory
www.cs.unb.ca/acrl
• Virendra Bhavsar, Director
• Eric Aubanel, Research Associate & Scientific Computing Support
• Sean Seeley, System Administrator
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
POWER chip: 1990 to 2003
1990:
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 x MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB data cache
POWER chip: 1990 to 2003
1991: SP1
– IBM’s first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software
POWER chip: 1990 to 2003
1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems
POWER chip: 1990 to 2003
1993: PowerPC
– Support for SMP
1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz
POWER chip: 1990 to 2003
Feb. ‘99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including L2 cache of 1-16 MB
– Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II - 375 MHz
  – 4-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 1.6 GB/s memory bandwidth
  – 6 GFLOPS/node
• Nighthawk II - 375 MHz
  – 16-way SMP
  – 2 MULT/ADD - 1500 MFLOPS per processor
  – 64 KB Level 1 cache - 5 nsec / 3.2 GB/sec
  – 8 MB Level 2 cache - 45 nsec / 6.4 GB/sec
  – 14 GB/s memory bandwidth
  – 24 GFLOPS/node
The Clustered SMP
ACRL’s SP: Four 4-way SMPs
Each node has its own copy of the O/S
Processors within a node are "closer" to each other than processors on different nodes
Power3 Architecture
Power4 - 32 way
• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 Gflops total peak
[Diagram: sixteen 2-processor units, each pair with private L1 and L2 caches, grouped in fours and connected by GX buses.]
Going to NUMA
• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS/high node
• Multiple Federation Adapters for scalable inter-node BW
• NUMA up to 256 processors
[Diagram: SP GP nodes, each with processors, memory, AIX and an intra-node interconnect, attached through Federation adapters to the Federation switch by up to 16 links per node.]
NUMA up to 256 processors - 1.1 Teraflops
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Uni-processor Optimization
• Compiler options: – start with -O3 -qstrict, then -O3, -qarch=pwr3
• Cache re-use
• Take advantage of superscalar architecture – give enough operations per load/store
• Use ESSL - already highly tuned for the architecture
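As a hedged illustration of the "enough operations per load/store" point (hypothetical arrays and sizes, not from the slides), unrolling a reduction into independent partial sums gives the two FMA units independent work every iteration:

      program fma_sketch
      implicit none
      integer, parameter :: n = 1024
      real*8 a(n), b(n), s1, s2, s3, s4, s
      integer i
      a = 1.d0
      b = 2.d0
      s1 = 0.d0
      s2 = 0.d0
      s3 = 0.d0
      s4 = 0.d0
! unroll by 4: four independent partial sums keep the FMA pipes busy
      do i = 1, n, 4
         s1 = s1 + a(i)*b(i)
         s2 = s2 + a(i+1)*b(i+1)
         s3 = s3 + a(i+2)*b(i+2)
         s4 = s4 + a(i+3)*b(i+3)
      end do
      s = s1 + s2 + s3 + s4
      print *, 'dot product = ', s
      end

With a single running sum, each multiply-add would have to wait for the previous one's result before it could issue.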
Memory Access Times

          Memory to L2 or L1     L2 to L1            L1 to registers
Width     16 bytes / 2 cycles    32 bytes / cycle    2 x 8 bytes / cycle
Rate      1.6 GB/s               6.4 GB/s            3.2 GB/s
Latency   ~35 cycles             ~6-7 cycles         1 cycle
Cache
• 128-byte cache line
• L1 cache: 128-way set-associative, 64 KB
• L2 cache: 4-way set-associative, 8 MB total (4 x 2 MB)
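As a hedged sketch of cache re-use (hypothetical array, not from the slides): Fortran stores arrays column by column, so keeping the first index in the innermost loop walks each 128-byte cache line in order instead of jumping between lines:

      program stride_sketch
      implicit none
      integer, parameter :: n = 512
      real*8 a(n,n), s
      integer i, j
      a = 1.d0
      s = 0.d0
! j outer, i inner: stride-1 access, roughly one L1 miss per
! 16 consecutive real*8 elements instead of one per element
      do j = 1, n
         do i = 1, n
            s = s + a(i,j)
         end do
      end do
      print *, 's = ', s
      end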
How to Monitor Performance?
• IBM’s hardware monitor: HPMCOUNT
  – Uses hardware counters on chip
  – Cache & TLB misses, fp ops, load-stores, …
  – Beta version
  – Available soon on ACRL’s SP
HPMCOUNT sample output
real*8 a(256,256),b(256,256),c(256,256) ! leading dimension 256 (a power of two): note the TLB miss count below
common a,b,c
do j=1,256
do i=1,256
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware float point rate : 2.749 Mflop/sec
HPMCOUNT sample output
real*8 a(257,256),b(257,256),c(257,256) ! leading dimension padded to 257: far fewer TLB misses below
common a,b,c
do j=1,256
do i=1,257
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware float point rate : 3.525 Mflop/sec
ESSL
• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
• Fast! 560x560 real*8 matrix multiply:
  – Hand coding: 19 Mflops
  – dgemm: 1.2 GFlops
• Parallel (threaded and distributed) versions
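A minimal sketch of the dgemm call behind the number above (array size from the slide, everything else hypothetical); ESSL follows the standard BLAS calling sequence, and the program is linked with -lessl:

      program essl_sketch
      implicit none
      integer, parameter :: n = 560
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.d0
      b = 2.d0
      c = 0.d0
! C = 1.0*A*B + 0.0*C  (BLAS-style argument list)
      call dgemm ( 'N', 'N', n, n, n, 1.d0, a, n, b, n, 0.d0, c, n )
      print *, 'c(1,1) = ', c(1,1)
      end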
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
ACRL’s IBM SP
• 4 Winterhawk II nodes (16 processors)
• Each node has:
  – 1 GB RAM
  – 9 GB (mirrored) disk
  – Switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with six 18.2 GB disks
IBM Power3 SP Switch
• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bi-directional
• 1.2 µsec latency
General Parallel File System
[Diagram: Nodes 2-4 each run an Application over a GPFS client and the RVSD/VSD layer; Node 1 runs the GPFS server over RVSD/VSD; all nodes are connected by the SP Switch.]
ACRL Software
• Operating System: AIX 4.3.3
• Compilers
  – IBM XL Fortran 7.1 (HPF not yet installed)
  – VisualAge C for AIX, Version 5.0.1.0
  – VisualAge C++ Professional for AIX, Version 5.0.0.0
  – IBM VisualAge Java - not yet installed
• Job Scheduler: LoadLeveler 2.2
• Parallel Programming Tools
  – IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 )
• Visualization: OpenDX (not yet installed)
• E-Commerce software (not yet installed)
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Parallel Computing?
• Solve large problems in reasonable time
• Many algorithms are inherently parallel
– image processing, Monte Carlo
– Simulations (e.g. CFD)
• High-performance computers have parallel architectures
  – Commercial off-the-shelf (COTS) components
• Beowulf clusters
• SMP nodes
– Improvements in network technology
[Performance plots: NRL Layered Ocean Model at the Naval Research Laboratory; IBM Winterhawk II SP]
Parallel Computational Models
• Data Parallelism
  – Parallel program looks like a serial program: the parallelism is in the data
  – Vector processors
  – HPF
Parallel Computational Models
• Message Passing (MPI)
  – Processes have only local memory but can communicate with other processes by sending & receiving messages
  – Data transfer between processes requires operations to be performed by both processes
  – Communication network not part of the computational model (hypercube, torus, …)
Parallel Computational Models
• Shared Memory (threads)
  – P(osix) threads
  – OpenMP: higher-level standard

[Diagram: several processes (threads) sharing a single address space.]
Parallel Computational Models
• Remote Memory Operations
  – "One-sided" communication
    • MPI-2, IBM's LAPI
  – One process can access the memory of another (Put/Get) without the other's participation, but does so explicitly, not the same way it accesses local memory
Parallel Computational Models
• Combined: Message Passing & Threads
  – Driven by clusters of SMPs
  – Leads to software complexity!

[Diagram: several SMP nodes, each with processes sharing an address space, connected by a network.]
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Message Passing Interface
• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - IBM support
• MPI 2.0 in 1997
  – Includes 1.1 but adds new features:
• MPI-IO
• One-sided communication
• Dynamic processes
Advantages of MPI
• Universality
• Expressivity
  – Well suited to formulating a parallel algorithm
• Ease of debugging
  – Memory is local
• Performance
  – Explicit association of data with process allows good use of cache
MPI Functionality
• Several modes of point-to-point message passing:
– blocking (e.g. MPI_SEND)
– non-blocking (e.g. MPI_ISEND)
– synchronous (e.g. MPI_SSEND)
– buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
  – e.g. MPI_REDUCE, MPI_BARRIER
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
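As a hedged sketch of the collective style (hypothetical values, using the same Fortran binding as the examples that follow), each process contributes a partial sum and rank 0 receives the total through MPI_REDUCE:

      program reduce_sketch
      implicit none
      include "mpif.h"
      integer my_id, numb_of_procs, ierr
      real*8 partial, total
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, numb_of_procs, ierr )
      partial = dble(my_id) + 1.d0   ! this rank's local contribution
      call MPI_REDUCE ( partial, total, 1, MPI_DOUBLE_PRECISION,
     &   MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' sum over ranks = ', total
      call MPI_FINALIZE ( ierr )
      end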
Simple MPI Example
[Diagram: two MPI processes, My_Id = 0 and 1; process 0 prints "This is from MPI process number 0", the other prints "This is from MPI processes other than 0".]
Simple MPI Example
Program Trivial
implicit none
include "mpif.h" ! MPI header file
integer My_Id, Numb_of_Procs, Ierr
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
if ( My_Id .eq. 0 ) then
print *, ' This is from MPI process number ',My_Id
else
print *, ' This is from MPI processes other than 0 ', My_Id
end if
call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
stop
end
MPI Example with send/recv
[Diagram: processes 0 and 1 each send an array to, and receive an array from, the other.]
MPI Example with send/recv
Program Simple
implicit none
Include "mpif.h"
Integer My_Id, Other_Id, Nx, Ierr, Status(MPI_STATUS_SIZE)
Parameter ( Nx = 100 )
Real A ( Nx ), B ( Nx )
call MPI_INIT ( Ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
Other_Id = Mod ( My_Id + 1, 2 )
A = My_Id
call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr ) ! both ranks send first: safe only while the message fits the eager limit
call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
call MPI_FINALIZE ( Ierr )
stop
end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);
MPI Message Passing Modes

Send mode -> protocol used:
• Ready -> ready protocol
• Standard -> eager protocol if the message is <= the eager limit, rendezvous protocol if it is larger
• Synchronous -> rendezvous protocol
• Buffered -> buffered protocol

Default eager limit on the SP is 4 KB (can be set up to 64 KB)
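One hedged way to make the earlier exchange safe regardless of the eager limit (a sketch in the Fortran style of the previous examples, hypothetical sizes, assuming two processes) is MPI_SENDRECV, which lets the library pair the send and the receive so neither rank blocks the other:

      program exchange_sketch
      implicit none
      include "mpif.h"
      integer, parameter :: nx = 100
      integer my_id, other_id, ierr
      integer status(MPI_STATUS_SIZE)
      real a(nx), b(nx)
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      other_id = mod ( my_id + 1, 2 )
      a = real(my_id)
! combined send+receive cannot deadlock, whatever the message size
      call MPI_SENDRECV ( a, nx, MPI_REAL, other_id, 0,
     &   b, nx, MPI_REAL, other_id, 0,
     &   MPI_COMM_WORLD, status, ierr )
      call MPI_FINALIZE ( ierr )
      end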
MPI Performance Visualization
• ParaGraph
  – Developed by the University of Illinois
  – Graphical display system for visualizing the behaviour and performance of MPI programs
Message Passing on SMP
[Diagram: within an SMP node, MPI_SEND copies the data to send into a buffer and MPI_RECV copies the received data out, the transfer passing through the memory crossbar or the switch.]
export MP_SHARED_MEMORY=yes|no
Shared Memory MPI
MP_SHARED_MEMORY=<yes|no>

                                     Latency (µsec)   Bandwidth (MB/sec)
between 2 nodes:                          24               133
same node (MP_SHARED_MEMORY=no):          30                80
same node (MP_SHARED_MEMORY=yes):         10               270
Message Passing off Node
• MPI across all the processors
• Many more messages going through the fabric
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
OpenMP
• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP
• All processors can access all the memory in the parallel system
• Parallel execution is achieved by generating threads which execute in parallel
• Overhead for SMP parallelization is large (100-200 µsec) - the parallel work construct must be large enough to overcome this overhead
OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops
  – Split them up between threads
• Better scaling can be obtained using OpenMP parallel regions, but can be tricky!
OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i=0,ilong
do k=1,kshort
...
end do
end do
#pragma omp parallel for
for(i=0; i <= ilong; i++)
for(k=1; k <= kshort; k++) {
...
}
Variable Scoping
• Most difficult part of shared-memory parallelization:
  – What memory is shared
  – What memory is private - each thread has its own copy
• Compare MPI: all variables are private
• Variables are shared by default, except (see the sketch below):
  – loop indices
  – scalars that are set and then used in the loop
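A hedged sketch of explicit scoping (hypothetical array and size, not from the slides): the loop index and the scratch scalar are private, the arrays stay shared, and a REDUCTION clause handles the accumulated sum safely:

      program scope_sketch
      implicit none
      integer, parameter :: n = 1000
      real*8 a(n), b(n), t, s
      integer i
      a = 1.d0
      b = 2.d0
      s = 0.d0
!$OMP PARALLEL DO PRIVATE(i,t) SHARED(a,b) REDUCTION(+:s)
      do i = 1, n
         t = a(i) * b(i)   ! set then used inside the loop: must be private
         s = s + t         ! combined across threads by the reduction
      end do
!$OMP END PARALLEL DO
      print *, 's = ', s
      end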
How Does Sharing Work?

THREAD 1:
increment(x)
{
    x = x + 1;
}

THREAD 2:
increment(x)
{
    x = x + 1;
}

Compiled, each thread executes:
10 LOAD A, (x address)
20 ADD A, 1
30 STORE A, (x address)

With x shared and initially 0, the two load/add/store sequences can interleave, so the result could be 1 or 2. Synchronization is needed.
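One hedged way to supply that synchronization in OpenMP (a sketch, not from the slides) is an ATOMIC (or CRITICAL) directive around the update, which makes each thread's read-modify-write indivisible:

      program atomic_sketch
      implicit none
      integer x
      x = 0
!$OMP PARALLEL
!$OMP ATOMIC
      x = x + 1
!$OMP END PARALLEL
      print *, 'x = ', x   ! now reliably equals the number of threads
      end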
False Sharing

[Diagram: a cache line holding eight array elements (0-7); Processor 1 and Processor 2 each hold a copy of the same cache line (block, address tag) in their caches even though they update different elements of it.]
Say A(1:5) starts on a cache line; then some of A(6:10) will also lie on that first cache line, so the thread writing A(6:10) cannot use the line until the first thread is finished with it.
!$OMP PARALLEL DO
      do I = 1, 20
         A(I) = ...
      end do
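A hedged way to keep false sharing off the critical path (a sketch with hypothetical sizes): give each thread chunks much longer than a cache line, so at most the chunk boundaries can land on a shared line:

      program chunk_sketch
      implicit none
      integer, parameter :: n = 10000
      real*8 a(n)
      integer i
! 64 real*8 elements = 512 bytes = four 128-byte cache lines per chunk
!$OMP PARALLEL DO SCHEDULE(STATIC, 64)
      do i = 1, n
         a(i) = dble(i)
      end do
!$OMP END PARALLEL DO
      print *, a(1), a(n)
      end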
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Hybrid MPI-OpenMP?
• To optimize performance on “mixed-mode” hardware like the SP
• MPI is used for "inter-node" communication, and OpenMP is used for "intra-node" communication
  – threads have lower latency
  – threads can alleviate the network contention of a pure MPI implementation
Hybrid MPI-OpenMP?
• Unless you are forced against your will, for the hybrid model to be worthwhile:
  – There has to be obvious parallelism to exploit
  – The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
  – It has to promise to perform at least as well as the equivalent all-MPI program
• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
  – especially true with applications having a single level of parallelism
Hybrid Scenario
• Thread the computational portions of the code that exist between MPI calls
• MPI calls are "single-threaded" and therefore use only a single CPU
• Assumes:
  – the application has two natural levels of parallelism
  – or that, in breaking up an MPI code with one level of parallelism, communication between the resulting threads is little/none
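A hedged skeleton of this scenario (hypothetical work loop, following the Fortran style of the earlier examples): the loop between MPI calls is threaded with OpenMP, and the MPI collective is made by the master thread alone:

      program hybrid_sketch
      implicit none
      include "mpif.h"
      integer, parameter :: n = 1000
      real*8 a(n), local_sum, global_sum
      integer my_id, i, ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      a = dble(my_id + 1)
      local_sum = 0.d0
! computational portion between MPI calls is threaded with OpenMP
!$OMP PARALLEL DO REDUCTION(+:local_sum)
      do i = 1, n
         local_sum = local_sum + a(i)
      end do
!$OMP END PARALLEL DO
! single-threaded MPI call combines the per-process results
      call MPI_REDUCE ( local_sum, global_sum, 1,
     &   MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr )
      if ( my_id .eq. 0 ) print *, ' total = ', global_sum
      call MPI_FINALIZE ( ierr )
      end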
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
  – MPI
  – OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
MPI-IO
• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See "Using MPI-2", by Gropp et al. (MIT Press)
[Diagram: each process's memory is mapped to its own region of a single shared file.]
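A hedged sketch of that picture (hypothetical file name and sizes, Fortran binding): each process writes its own block of one shared file at an explicit offset:

      program mpiio_sketch
      implicit none
      include "mpif.h"
      integer, parameter :: nx = 100
      integer my_id, fh, ierr
      integer status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) offset
      real*8 buf(nx)
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
      buf = dble(my_id)
      call MPI_FILE_OPEN ( MPI_COMM_WORLD, 'data.out',
     &   MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )
! each rank writes its nx-element block at its own byte offset
      offset = my_id * nx * 8
      call MPI_FILE_WRITE_AT ( fh, offset, buf, nx,
     &   MPI_DOUBLE_PRECISION, status, ierr )
      call MPI_FILE_CLOSE ( fh, ierr )
      call MPI_FINALIZE ( ierr )
      end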
Conclusion
• Don’t forget uni-processor optimization
• If you choose one parallel programming API, choose MPI
• Mixed MPI-OpenMP may be appropriate in certain cases
  – More work needed here
• Remote memory access model may be the answer