Joachim Hein, EPCC, The University of Edinburgh
[email protected], +44 131 651 3390
Experiences on the Edinburgh
Blue Gene System
06/10/2005 Experiences on the Edinburgh Blue Gene System
Outline
• What can be learned from “STREAMS”
• Fourier transforms
• Application performance
– CASTEP
– H2MOL
– DL_POLY
– NAMD
– MDCASK
– PCHAN
– LUDWIG
Systems
• BlueSky: Single e-Server Blue Gene frame
– 1024 dual-core chips
– 2048 PowerPC 440 processors, 700 MHz
– 512 MB of RAM per chip (distributed memory system), shared between the two cores
– 4.713 TFLOP/s Linpack, joint No 58 on top500
• HPCx: 50 IBM e-Server p690+ frames
– SMP cluster
– 32 Power4 processors per frame, 1700 MHz
– 32 GB of RAM per frame
– Federation interconnect, 2 links (4 wires) per frame
– 6.188 TFLOP/s Linpack, No 45 on top500
                   BG                     HPCx                   Performance ratio
Clock:             700 MHz                1700 MHz               17/7 = 2.43
L1 cache:          32kB, 32 byte line     32kB, 128 byte line
L2 cache:          mem-buffer             1.5 MB/2 proc, 128 b
L3 cache:          4 MB/chip, 128 b       512 MB/frame, 512 b
Memory:            512 MB/chip            32 GB/frame
Functional units:  1 double word FPU      2 single word FPU
                   1 double word LSU      2 single word LSU

Naïve perf-ratio: 1 (use double unit), 2 (use single unit)
Relative performance expectation: 2.43 (double unit), 4.86 (single unit)
Alignment
• FPU units
– 16 byte registers, holding two 8 byte floating point words
– Instructions can operate simultaneously on both 8 byte parts
• 16 byte alignment
– FPU must load two 8 byte words from L1 in a single instruction
– Words must be consecutive and occupy either the first or the second half of a L1 cache line (32 bytes)
(Figure: examples of word pairs within a 32-byte L1 line that are OK / not OK for a paired load)
Streams tests:

Test:            Operation:           Streams (store + load):
copy             a(i) = b(i)          1 + 1
scale            a(i) = s*b(i)        1 + 1
add              a(i) = b(i)+c(i)     1 + 2
triad            a(i) = b(i)+s*c(i)   1 + 2
vsum (own test)  r = r + a(i)         0 + 1
Example Routine (copy)

subroutine copy_st(a, b, n)
  integer :: n
  real(kind(1.0d0)), dimension(n) :: a, b
  integer :: i
  call alignx(16, a(1))
  call alignx(16, b(1))
  do i = 1, n
    a(i) = b(i)
  end do
end subroutine copy_st
• Assumes a(1) and b(1) are 16 byte aligned
Alignment
Size, start indices   440d (MB/s)   440 (MB/s)
16MB, a(1), b(1)      1330          1329
16MB, a(1), b(2)        34          1329
16MB, a(2), b(1)        31          1330
16MB, a(2), b(2)        16          1329
1MB, a(1), b(1)       3603          2030
1MB, a(1), b(2)         34          2030
1MB, a(2), b(1)         31          2030
1MB, a(2), b(2)         16          2030
– Performance benefit for the L3 problem size (1 MB), if the alignment assertion is correct
– Incorrect alignment results in a substantial performance loss, caused by trapping of alignment exceptions
Streams on L1 cache
• Hardware limit:
– 11200 MB/s (double FPU: 16 bytes/cycle at 700 MHz)
– 5600 MB/s (single FPU: 8 bytes/cycle)
• Copy: very satisfactory
• Vsum: not quite
– short of the hardware limit
– -qhot with 440d performs very badly
Streams on L3 cache
• About 1/3 of the L1 performance
• 440d almost doubles performance
• vsum still not happy
Streams on main memory
• No difference between 440 and 440d
• Vsum: finally OK with 440d and no -qhot
dcbz
• ‘Hidden’ loads reduce performance
– Data is loaded into cache prior to a write (L1 does write allocate)
– The dcbz instruction can eliminate these loads by zeroing a cache line
– Need to know the 32-byte alignment
subroutine copy_st(a, b, n)
  ..........
  do i = 1, n, 4
!IBM* CACHE_ZERO(a(i))
    a(i)   = b(i)
    a(i+1) = b(i+1)
    a(i+2) = b(i+2)
    a(i+3) = b(i+3)
  end do
end subroutine copy_st
• Stream copy a(j) = b(j)
• 16 MB, 440d, double precision:
– without dcbz: 1316 MB/s
– with dcbz: 2299 MB/s
• Use with care!
• Need to control the 32-byte alignment
Compare to gnu-compiler
• C-version of Streams (out of the box), no alignment declared
• Main memory
• xlC clearly outperforms the gnu compiler
Streams CO vs VN mode
Comparison to p690+
• L2 helps for intermediate sizes
• L1 and memory: no factor of 2.43
• Memory: stable for single FPU
• Single task per frame
• Copy and add underperform
Strided Access
Cache lines:
• BG: L1: 4 words; L3: 16 words
• p690+: L1: 16 words; L2: 16 words; L3: 4×16 words; TLB: 512 words (small pages)

Graph: Memory Copy
Summary of Streams
• L1 performance (compiler) at the hardware limit (11 GB/s)
• L3 can sustain two cores (VN-mode), L1 obviously can
• Counting properly: about 2.2 GB/s from/to main memory
• Memory bandwidth almost halved in VN-mode
• L1 and Memory better balanced than Power4
• Intermediate problem size: Power4 is faster due to L2
• Large strides: the absence of TLB misses pays off for BG
Fast Fourier Transform
• Measured benefits for large problems
• L1 sizes: HPCx faster by 2.0x (Vienna version) and 2.8x (compiled version)
• Large problem: full HPCx slower than BG
Application comparison
• Study the performance of a variety of scientific applications on the BG system and compare to the performance on HPCx (p690+ cluster)
• For some applications:
– Reduced optimisation level for some/all routines, to make them work
– Take the results as a snapshot of the present status
CASTEP
• Density functional theory application, Payne et al., 2002, Segall et al., 2002
• Widely used in the UK (largest user of HPCx)
• Web site: http://www.tcm.phy.cam.ac.uk/castep/
• Compiled on BG with -O3, apart from one routine
• Looked at a couple of benchmark configurations, here:
– Titanium Nitride: 33 atoms, cell volume of 654 Å³
– Al2O3: 120 atoms, cell volume of 2160 Å³
• Application benefits from MASS lib and FFTW
CASTEP: Titanium Nitride
• Good Scaling on both machines
• VN benefits
• HPCx about 5 times faster
CASTEP: Al2O3
• Benchmark too large for VN
• Again: similar scaling
• HPCx about 2.6 times faster
• Previous benchmark better helped by cache?
H2MOL
• Written by Ken Taylor and Daniel Dundas at Queen's University Belfast
• Solves time dependent Schrödinger equation
• Laser driven dissociation of H2-molecules
• Refines grid when increasing processor count, hence constant work/proc
H2MOL
• Writing of intermediate results hurts performance
• HPCx about 1.7x faster
DL_POLY3
DL_POLY – A general purpose molecular dynamics package– Smith and Forester, 1996
• DL_POLY3 uses a distributed domain decomposition model
• Utilises its own FFT that maps directly to the decomposition used in DL_POLY
– A smaller number of larger messages is sent than a traditional parallel FFT would use
• No problems porting this application
• The benchmark: Gramicidin– a system of eight Gramicidin-A species (792,960 atoms)
DL_POLY3
• VN beneficial
• BG scales slightly better
• HPCx about 2.7x faster
• Benefits from MASS
NAMD
• NAMD: parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems– Kalé et al. 1999
• Relies on a series of packages
– FFTW single precision – no double FPU
– charm++
• charm++
– Required quite a few changes to 5.8 (e.g. library locations, compilers), however the release of 5.9 fixed many of these problems
• NAMD 2.6b1 compiled with only minor problems– 2.5 was more complicated
• ApoA1 benchmark– 92224 atoms
NAMD 2.6b1
• xlc7 boosts performance on HPCx
• VN mode beneficial
• HPCx 4.2x faster (NAMD doesn’t use memory!)
MDCASK
• MDCASK: first developed as a molecular dynamics code to study radiation damage in metals
• Part of the ASCI Purple benchmark suite
• Benchmark used: 1372000 atoms in Ti lattice
• Needs n³ processors
MDCASK
• HPCx about 4x faster
• Code speeds up to 512 processors
• BG better scaling
PCHAN
• Finite difference code for Turbulent Flow: shock/boundary layer interaction (SBLI)
• Developed at the University of Southampton
• PCHAN: simple turbulent channel flow benchmark using the SBLI code
• Communications: halo-exchanges between adjacent computational sub-domains
PCHAN
• Very good scaling on all systems – HPCx superscales
• Benefits from VN
• HPCx faster: 2.1x (32 proc), 2.8x (256 proc)
• T2 benchmark (240×240×240)
LUDWIG
• LUDWIG: studying complex fluids (mixtures of fluids, solids/fluids)
– Jean Christophe Desplat, Dublin Institute for Advanced Studies
– Kevin Stratford, Mike Cates, The University of Edinburgh
– Applications include personal care products, e.g. shampoo
• Blue Gene
– Well suited to parallelisation
– Being used to investigate the dynamics of mixtures involving colloidal particles and systems under shear
LUDWIG - performance
• Ludwig (model size 384³)
– VN beneficial
– HPCx 1.4x faster
– IBM p575: Power5 processor, 1.9 GHz
(Graph: execution time in hours, 0–40, at 64 and 256 BG-chips/Power-procs, for HPCx, BG (CO), BG (VN) and IBM p575)
LUDWIG
• Blue Gene gives better performance than the p690+ system (using VN mode, on a per chip (resource) basis)
• Key routines of LUDWIG are memory-access bound: many strided accesses to main memory
– Blue Gene memory access latency relatively smaller than on the p690+ system
– Similar bandwidths
• p575 gives excellent performance
Acknowledgements
• Lorna Smith, Mark Bull, Alan Gray: Editing material
• Bartosz Dobrzelecki, Alan Gray, Ian Bush (DL), Lorna Smith,Owain Kenway, Mike Ashworth (DL), Fiona Reid, Kevin Stratford, J-C Desplat (now Dublin): various Benchmark results
Summary
• One can learn a lot from Streams (even if it is only to convince you that you can “double hum”)
• If 256 MB of memory per task is sufficient, VN mode is beneficial
• We observe good scaling for a wide variety of codes; however, BG does not obviously “out-scale” HPCx
• In the light of the naïve expectation of 4.86, serial performance on BG is good in comparison to Power4 – we are doing better than I thought