Optimal coding practices for IBM POWER4 processors
Optimal coding practices for IBM POWER4 processors
Steve Behling
IBM Corporation
Getting the most out of AIX, xlf, and xlc
Outline
• Some hardware details
• Some software discussions
• My favorite hints
• Questions
Memory Hierarchy
[Figure: memory hierarchy, with speed decreasing and size increasing from top to bottom]
Register: 1 cycle
Cache: cache miss 8-200 cycles; TLB miss tens to hundreds of cycles
Main memory
Disk: ~100,000 cycles
Massive tape storage: don't want to know
POWER4 processor chip layout
[Figure: POWER4 chip layout — two >1 GHz CPU cores, shared L2 cache, L3 controller and L3 directory on chip, fabric controller/distributed switch, processor local bus to off-chip L3 cache, memory, and I/O bus]
• Contains two 64-bit processors (PowerPC architecture)
• POWER4 has 1.4 MB (1440 KB) L2 cache; POWER4+ has 1.5 MB L2 cache
• L3 cache directory on chip
• All chip frequencies scale with processor frequency
POWER4 Processor Features
• High-frequency, speculative-execution, superscalar processor with out-of-order instruction execution capabilities
• Eight independent execution units (capable of executing instructions in parallel) = superscalar
  – Two identical floating-point execution units, each with 2 floating-point operations per cycle
  – Two load/store execution units
  – Two fixed-point execution units
  – One branch execution unit
  – One condition register unit to perform logical operations on the condition register
  – Only one of the FPUs does divides
POWER4 Instruction Issue Block Diagram
[Block diagram: instructions flow from the I-cache/IFAR through the instruction buffer, decode/crack/group formation, and the global completion table (GCT); branch scan and branch predict feed the BR/CR issue queue; FX/LD issue queues and FP issue queues feed the two fixed-point, two load/store, two floating-point, one condition register, and one branch execution unit; loads and stores go through the store queue (StQ) to the D-cache]
Multi-Chip Module (MCM)

[Figure: four POWER4 chips on one module — each chip has two >1 GHz cores sharing an L2 cache; chip-to-chip communication crosses the multi-chip module boundary; L3 directory, L3/memory controllers, and GX buses connect the chips to memory]
p690 Multi-Chip Module (MCM)
[Figure: four MCMs (MCM 0-3), each carrying four POWER4 chips (two processors sharing an L2 per chip), surrounded by L3 modules, memory slots, and GX I/O slots]
IBM 32 processor pSeries 690
Cache Organization and Size
Cache                 Organization                                   Capacity
L1 instruction cache  Direct mapped, 128-byte line                   64 KB per processor
L1 data cache         Two-way set associative, 128-byte cache line   32 KB per processor
Shared L2 cache       POWER4 mostly eight-way, some four-way;        1.4 MB per chip (POWER4);
                      POWER4+ all eight-way                          1.5 MB per chip (POWER4+)
L3 cache              Eight-way; two boot modes: 1 cache line or     128 MB per MCM
                      4 cache lines per transfer
Virtual Memory Manager
Virtual storage is the addressable memory space used by the AIX operating system.

This linear, contiguous address space is mapped, by a combination of hardware and software, onto the hardware memory of the computer and onto disk paging space(s).

Pages are 4096 bytes on POWER3 and earlier hardware. Pages on POWER4 can be 4096 bytes, 16 MB, or 256 MB (requires AIX 5.1.0.25).
Translation Lookaside Buffer (TLB)
The TLB holds the information needed to translate virtual memory addresses to physical memory addresses. If the page is already in the TLB, the translation costs nothing.

The cost of a TLB miss varies from ~25 cycles to possibly hundreds of cycles in unfavorable cases.

TLB misses are likely when using indirect addressing, as in:

L = left_neighbor[i];
R = right_neighbor[i];
a[i] += b[i]*a[L] + c[i]*a[R];
Hardware data prefetch
• IBM POWER4 has 8 hardware prefetch streams.
• Two sequential cache-line accesses (forward or backward) establish a prefetch stream.
• Prefetch streams stop when they reach a page boundary.
• Prefetching can be encouraged using compiler directives or code changes.
• Prefetch streams only get established for loads.
  – Can use the PREFETCH_BY_LOAD() directive for stores.
      do 10 i=1,NCELL
!IBM$ PREFETCH_BY_LOAD(i+33)
         a(i)=0.0
   10 continue
Coding for prefetch performance
Example: dot product, 2 prefetch streams.

double s;
double *a, *b;
....
s = 0.0;
for (i = 0; i < N; i++)
    s = s + a[i]*b[i];
Example: interleaved dot product, 6 prefetch streams.

double s, s1, s2;
double *a, *b;
int onethird, twothird;
....
s = s1 = s2 = 0.0;
onethird = N/3;
twothird = 2*onethird;
for (i = 0; i < onethird; i++) {
    s  = s  + a[i]*b[i];
    s1 = s1 + a[i+onethird]*b[i+onethird];
    s2 = s2 + a[i+twothird]*b[i+twothird];
}
for (i = 3*onethird; i < N; i++)  /* remainder when N is not a multiple of 3 */
    s = s + a[i]*b[i];
s = s + s1 + s2;
AIX Large pages
• 16 MB large pages help HPC application performance by:
  – Eliminating TLB misses
  – Enhancing prefetch, since prefetch streams get reset at page boundaries
• Typically 5 to 15% improvement
• Some start-up overhead, since each task gets a full 256 MB segment (16 pages)
  – Deadly for scripts; may be bad for fork(), execlp()
• If large pages are exhausted, jobs silently fall back to small pages
  – Watch with "vmstat -l"
AIX Large Page Administration
• AIX can set aside memory to be backed by large pages (typically 50%):
  – vmtune -g nnn -L mmm
  – bosboot -a; reboot
• The application must be large-page enabled:
  – ldedit -b lpdata a.out
• The user must be enabled:
  – chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE userid
  – Or set the default in /etc/security/users
TLB coverage
POWER3: TLB contained 256 entries; TLB coverage is 1 MB (smaller than the L2 cache).

POWER4: TLB contains 1024 entries; TLB coverage is 4 MB for small pages and 16 GB for large pages.
TLB example (xlf -WF,-DHPM …)
      program stand
#ifdef HPM
#include "f_hpm.h"
#endif
      parameter (NCELL=400)
      common /mystuff/ a1,a2,a3
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      real(8) time1,time2,rtc,etime,sc
      a1 = 1.0d0
      a2 = 2.0d0
#ifdef HPM
      call f_hpminit(0,"Job")
      call f_hpmstart(1,"Total_routine")
#else
      time1=rtc()
#endif
      call sub1(a1,a2,a3,NCELL)
#ifdef HPM
      call f_hpmstop(1)
      call f_hpmterminate(0)
#else
      time2=rtc()
      etime=time2-time1
      print *,'Subroutine took ', etime,' seconds'
#endif
      end
TLB subroutines and performance
      subroutine sub1(a1,a2,a3,n)
      parameter (NCELL=400)
      real(8) a1(NCELL,NCELL,NCELL)
      real(8) a2(NCELL,NCELL,NCELL)
      real(8) a3(NCELL,NCELL,NCELL)
      integer(4) n
      integer(4) i,j,k
      real(8) s
!
      s=1.1d0
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end
TLB: performance, 375 MHz POWER3, 4 MB L2 cache
! Loop order k,j,i (stride-1 innermost): Time = 21.5 s, 329.7 loads/TLB miss
      do 10 k=1,NCELL
      do 10 j=1,NCELL
      do 10 i=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

! Loop order i,j,k (largest stride innermost): Time = 980.6 s, 0.667 loads/TLB miss
      do 10 i=1,NCELL
      do 10 j=1,NCELL
      do 10 k=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end

! Loop order k,i,j: Time = 178.0 s, 0.853 loads/TLB miss
      do 10 k=1,NCELL
      do 10 i=1,NCELL
      do 10 j=1,NCELL
         a3(i,j,k)=a2(i,j,k) + s*a1(i,j,k)
   10 continue
      end
Favorite hints
• Put "export AIXTHREAD_SCOPE=S" in your .profile
• -g does not decrease optimization
• First compile: -O2 -qarch=pwr4 -qtune=pwr4 -qmaxmem=-1
  – C: use -qlibansi
  – Fortran: use xlf90 -qfixed
• Most likely to get within 5% of optimal performance using -O3
  – May need to use -qstrict
• Use -lmass if you use any intrinsics (sqrt, exp, **, etc.)
• Try -O4, -qhot, -qalias=allptrs (C), etc. on individual routines
• OpenMP: use guided scheduling; -qsmp=omp,noauto
Favorite hints (cont.)
• MPI codes run very well on SMP systems
  – MP_SHARED_MEMORY=yes
  – MP_WAIT_MODE=poll
• MPICH ch_shmem is pretty good, too, if you build it with -O3 -qarch=pwr4 -qtune=pwr4 (at least through 8 processors)
• If you do lots of 64-bit integer arithmetic, use -q64 so you can exploit the PowerPC 64-bit integer hardware
• Use "nmon" for a low-overhead, curses-based system monitoring program
• dbx a.out core is OK, but Totalview is awesome
• Don't use -bmaxdata with -q64
• Use -bmaxdata:0x80000000/dsa with -q32
L3 Cache (POWER4 only)
Four POWER4 chips are combined into a multi-chip module (MCM), each of which has a 128 MB level 3 cache.

The L3 cache is eight-way set associative.

The L3 cache may be bypassed if busy. Consequence: data may not be where you think it is.

On the p690, the L3 cache is shared system-wide.
Tuning Recommendation

POWER4: For optimal performance, it is recommended to block data for the L2 cache and to structure the data access for the L1 data cache.
Use FMA for best performance
A multiply/add counts as two floating-point operations, so, for example, a program doing only additions might run at half the MFlops rate of one doing alternating multiplies and adds.
/* bad code */
for (i = 0; i < N; i++)
    a[i] = s*a[i];
printf("I did the multiply loop.\n");
for (i = 0; i < N; i++)
    a[i] = b[i] + a[i];

/* good code */
for (i = 0; i < N; i++)
    a[i] = b[i] + s*a[i];
Note: C++ operator overloading can produce the "bad code" pattern; it requires careful examination.
How to get the most MFlops
• Operate within L1 and L2 cache via blocking
• Avoid TLB misses (stride 1 as much as possible)
• Multiplies must be paired with adds or subtracts so that each FMA is two flops
• FMAs must be independent (and at least eight in number to keep two pipes of depth four going)
Peak Mflops example

! Matrix multiply kernel
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    do k=kk,min(n,kk+nb-1)
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo
! Same code but scalar explicitly stated
! Good, but load/store bound
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    s = d(i,j)
    do k=kk,min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j)=s
  enddo
enddo
Peak Mflops (cont.)

do i=ii,min(n,ii+nb-1),5
  do j=jj,min(n,jj+nb-1),4
    s00 = d(i+0,j+0)
    s10 = d(i+1,j+0)
    s20 = d(i+2,j+0)
    s30 = d(i+3,j+0)
    s40 = d(i+4,j+0)
    s01 = d(i+0,j+1)
    s11 = d(i+1,j+1)
    s21 = d(i+2,j+1)
    s31 = d(i+3,j+1)
    s41 = d(i+4,j+1)
    s02 = d(i+0,j+2)
    s12 = d(i+1,j+2)
    s22 = d(i+2,j+2)
    s32 = d(i+3,j+2)
    s42 = d(i+4,j+2)
    s03 = d(i+0,j+3)
    s13 = d(i+1,j+3)
    s23 = d(i+2,j+3)
    s33 = d(i+3,j+3)
    s43 = d(i+4,j+3)
    do k=kk,min(n,kk+nb-1)
      s00 = s00 + a(j+0,k)*b(k,i+0)
      s10 = s10 + a(j+0,k)*b(k,i+1)
      s20 = s20 + a(j+0,k)*b(k,i+2)
      s30 = s30 + a(j+0,k)*b(k,i+3)
      s40 = s40 + a(j+0,k)*b(k,i+4)
      s01 = s01 + a(j+1,k)*b(k,i+0)
      s11 = s11 + a(j+1,k)*b(k,i+1)
      s21 = s21 + a(j+1,k)*b(k,i+2)
      s31 = s31 + a(j+1,k)*b(k,i+3)
      s41 = s41 + a(j+1,k)*b(k,i+4)
      s02 = s02 + a(j+2,k)*b(k,i+0)
      s12 = s12 + a(j+2,k)*b(k,i+1)
      s22 = s22 + a(j+2,k)*b(k,i+2)
      s32 = s32 + a(j+2,k)*b(k,i+3)
      s42 = s42 + a(j+2,k)*b(k,i+4)
      s03 = s03 + a(j+3,k)*b(k,i+0)
      s13 = s13 + a(j+3,k)*b(k,i+1)
      s23 = s23 + a(j+3,k)*b(k,i+2)
      s33 = s33 + a(j+3,k)*b(k,i+3)
      s43 = s43 + a(j+3,k)*b(k,i+4)
    enddo
    d(i+0,j+0)=s00
    d(i+1,j+0)=s10
    d(i+2,j+0)=s20
    d(i+3,j+0)=s30
    d(i+4,j+0)=s40
    d(i+0,j+1)=s01
    d(i+1,j+1)=s11
    d(i+2,j+1)=s21
    d(i+3,j+1)=s31
    d(i+4,j+1)=s41
    d(i+0,j+2)=s02
    d(i+1,j+2)=s12
    d(i+2,j+2)=s22
    d(i+3,j+2)=s32
    d(i+4,j+2)=s42
    d(i+0,j+3)=s03
    d(i+1,j+3)=s13
    d(i+2,j+3)=s23
    d(i+3,j+3)=s33
    d(i+4,j+3)=s43
  enddo
enddo
5x4 hand unrolling to maximize FMA and register usage
Avoid divides: only one FPU on POWER4 does divides!
Untuned:
DO I=1,N
   A(I)=B(I)/C(I)
   P(I)=Q(I)/C(I)
ENDDO

Tuned:
DO I=1,N
   OC=1.0/C(I)
   A(I)=B(I)*OC
   P(I)=Q(I)*OC
ENDDO
Untuned:
DO I=1,N
   A(I)=B(I)/C(I)
   P(I)=Q(I)/D(I)
ENDDO

Tuned:
DO I=1,N
   OCD=1.0/(C(I)*D(I))
   A(I)=B(I)*D(I)*OCD
   P(I)=Q(I)*C(I)*OCD
ENDDO
For simple cases, the compiler does this for you.

Clever method to replace two divides by one divide and five multiplies, which also keeps both FPUs busy.
Minimize expensive intrinsic calls
Untuned:
DO I=1,N
   DO J=1,N
      A(J,I)=B(J,I)*SIN(X(J))
   ENDDO
ENDDO

Tuned:
DIMENSION SINX(N)
...
DO J=1,N
   SINX(J)=SIN(X(J))
ENDDO
DO I=1,N
   DO J=1,N
      A(J,I)=B(J,I)*SINX(J)
   ENDDO
ENDDO