Benchmarking Clusters with High Core-Count Nodes
9th LCI International Conference on High-Performance Clustered Computing, 29 April 2008
Tom Elken, Manager, Performance Engineering, QLogic Corporation
Motivation
- The most common interconnect benchmarks exercise one core on each of two nodes
- Nodes are becoming larger in terms of core count
  • The interconnect needs to keep pace
  • What benchmarks are appropriate in this environment?
Overview
- QLogic Background
- InfiniBand Host Channel Adapters
- MPI Benchmarks for high core-count nodes & clusters
QLogic Acquired Two Leading InfiniBand Companies in 2006
Complete InfiniBand Solution
- HPC-oriented Host Channel Adapters
- OFED Plus software stacks
- High port-count Director Switches
- Edge Switches
- Multi-protocol gateways
- Cables
- InfiniBand Fabric Management
- InfiniBand Silicon
- Worldwide Support
QLogic commitment to HPC
- Application-performance-steered development
  • SPEC HPG member and contributor to MPI2007
  • InfiniBand DDR fabric for bandwidth-intensive apps
  • QLogic InfiniPath HCAs for latency- and message-rate-sensitive apps
- Two paths to the interconnect:
  1. OpenFabrics driver and libraries interoperate with the InfiniBand ecosystem and high-performance storage
  2. QLogic-specific libraries evolve as a top-down software stack, leveraging ASIC capabilities specifically for HPC apps and parallel programming models
QLogic InfiniBand HCAs
- In this presentation:
  • QLogic DDR (20 Gbit/s) HCAs: QLE7240 (PCIe x8) and QLE7280 (PCIe x16)
  • QLogic SDR (10 Gbit/s) HCA: QLE7140 (PCIe x8)
OFED+ Provides OpenFabrics Benefits Plus Optional Value-Added Capabilities

[Stack diagram: MPIs (MPICH2, QLogic MPI, Open MPI, HP-MPI, Scali MPI, MVAPICH) run over either the PSM user-mode driver extension (accelerated performance) or the standard OFED ULPs and utilities (uDAPL, IPoIB, SDP, VNIC, SRP, iSER, RDS, OpenSM). QLogic layers additional robust ULPs and tools (QLogic SRP, QLogic VNIC) and the InfiniBand Fabric Suite above the OFED driver and Host Channel Adapter, alongside community and 3rd-party middleware and applications, split across kernel and user space.]
What is the spectrum of MPI benchmarks?
- Microbenchmarks include (there are more):
  • OSU MPI Benchmarks (OMB)
  • Intel MPI Benchmarks (IMB), formerly Pallas
- Mid-level benchmarks:
  • HPC Challenge
  • Linpack, e.g., HPL
  • NAS Parallel Benchmarks (NPB)
- Application benchmark suites:
  • SPEC MPI2007
  • TI-0n (TI-06, TI-07, TI-08) DoD benchmarks
- Your MPI application
OSU MPI Benchmarks applicable to high-core-count nodes

- The MVAPICH group publishes results on one node
  • Intra-node MPI performance is growing in importance as nodes grow their core counts
  • Pure MPI applications face some competition from more complex hybrid OpenMP/MPI styles of development
- OMB includes the Multiple Bandwidth, Message Rate test
- OMB v3.1 has added a benchmark: the Multiple Latency test (osu_multi_lat.c)
Intra-node MPI Bandwidth measurement
- Most current MPIs use shared-memory copies for intra-node communications, so one might expect that they all do equally well
- QLogic PSM 2.2 provides a large improvement in intra-node bandwidth; all MPIs it supports (MVAPICH, Open MPI, HP-MPI, Scali, QLogic) get the benefit (a sketch of the measurement pattern follows the chart below)
  • This results in a 2% improvement in average application performance (SPEC MPI2007 on 8x 4-core nodes)
[Chart: MPI intra-node bandwidth (osu_bw), MB/s vs. message size from 1 byte to 4 MB, for shared-cache, intra-socket, and inter-socket core pairs. Measured on one node with Xeon E5410 2.33 GHz quad-core CPUs (8 cores); QLogic InfiniPath 2.2 software.]
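To make the pattern concrete, here is a minimal sketch of an osu_bw-style streaming-bandwidth test: the sender keeps a window of non-blocking sends in flight while the receiver drains them and acknowledges each window, and bandwidth is bytes moved divided by elapsed time. This is not the OSU source; the window size, message size, and iteration count are illustrative assumptions. Pinning the two ranks to a shared-cache, intra-socket, or inter-socket core pair (via the MPI launcher's affinity options) reproduces the three curves above.

```c
/* Minimal sketch of an osu_bw-style bandwidth test (illustrative,
 * not the OSU source). Build and run with two ranks on one node:
 *   mpicc bw.c -o bw && mpirun -np 2 ./bw
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                /* sends in flight per iteration (assumed) */
#define ITERS  100

int main(int argc, char **argv) {
    int rank, bytes = 1 << 20;   /* one message size; loop over sizes in practice */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(bytes);
    MPI_Request req[WINDOW];
    char ack;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {         /* sender: post a whole window, then wait */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                 /* receiver: drain the window, then acknowledge */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double secs = MPI_Wtime() - t0;
    if (rank == 0)               /* bandwidth = bytes moved / elapsed time */
        printf("%.1f MB/s\n", (double)bytes * WINDOW * ITERS / secs / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```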
Intra-node MPI Latency measurement
- Most current MPIs use shared-memory copies for intra-node communications, so one might expect that they all do equally well
- QLogic PSM 2.2 provides a substantial improvement in intra-node latency; all MPIs it supports (MVAPICH, Open MPI, HP-MPI, Scali, QLogic) get the benefit (a ping-pong sketch follows the chart below)
[Chart: MPI intra-node latency (microseconds) vs. message size from 0 bytes to 1 KB, for shared-cache, intra-socket, and inter-socket core pairs; lower is better. Measured on one node with Xeon E5472 3.0 GHz quad-core CPUs (8 cores); QLogic InfiniPath 2.2 software.]
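The underlying measurement is a simple ping-pong, reported as half the average round-trip time. A minimal sketch (illustrative sizes and counts, not the OSU source), run with two ranks pinned to the core pair of interest:

```c
/* Minimal sketch of a ping-pong latency test, the osu_latency-style
 * pattern (illustrative, not the OSU source).
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv) {
    int rank;
    char buf[1024];                     /* up to 1 KB, as in the chart above */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int bytes = 8;                      /* one message size; loop over sizes in practice */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                /* ping */
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                        /* pong */
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    /* latency = half the average round-trip time, in microseconds */
    double half_rtt_us = (MPI_Wtime() - t0) / ITERS / 2.0 * 1e6;
    if (rank == 0)
        printf("%d bytes: %.2f us\n", bytes, half_rtt_us);
    MPI_Finalize();
    return 0;
}
```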
Relationship of applications to micro-benchmarks

- As the number of processors is increased:
  • Message size goes down (→ small-message latency)
  • Number of messages goes up (→ message rate)

[Figure: DL Poly (molecular dynamics) example illustrating this trend.]
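A hypothetical strong-scaling example makes the trend concrete: for a fixed-size 3D domain split over P ranks, each rank's halo-exchange messages shrink roughly as (N/P)^(2/3) while the total message count per step grows with P. Doubling P therefore cuts the typical message size by about 37% (since (1/2)^(2/3) ≈ 0.63) and roughly doubles the number of messages, shifting the load from bandwidth toward latency and message rate.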
New OSU Multiple Latency Test
- Measures average latency as you add active cores running the latency benchmark in parallel
- Interesting to measure on large core-count nodes, and at multiple small message sizes (a sketch of the pattern follows the charts below)
[Charts: average latency (microseconds) vs. number of active CPU cores (0-8) at 1-, 128-, and 1024-byte message sizes, for QLogic IB DDR and Other IB DDR; lower is better. Measured on 2x Xeon E5410 2.33 GHz quad-core CPUs, 8 cores per node; both PCIe x8 DDR InfiniBand adapters; benchmark is osu_multi_lat.c.]
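The pattern behind osu_multi_lat.c is many simultaneous ping-pong pairs. A minimal sketch (illustrative, not the OSU source) pairs rank r with rank r + P/2, runs all pairs concurrently, and averages the per-pair latencies:

```c
/* Minimal sketch of the multi-pair latency pattern behind
 * osu_multi_lat.c (illustrative, not the OSU source). Launch an even
 * number of ranks split across two nodes; rank r pairs with r + P/2.
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv) {
    int rank, nprocs, bytes = 1;        /* 1-byte messages, as in the charts */
    char buf[1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int half = nprocs / 2;
    int peer = (rank < half) ? rank + half : rank - half;
    MPI_Barrier(MPI_COMM_WORLD);        /* all pairs start together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < half) {
            MPI_Send(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }
    double my_lat = (MPI_Wtime() - t0) / ITERS / 2.0 * 1e6, avg;
    /* average the per-rank latencies to get one number per core count */
    MPI_Reduce(&my_lat, &avg, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d pairs: avg latency %.2f us\n", half, avg / nprocs);
    MPI_Finalize();
    return 0;
}
```

Sweeping the number of pairs from 1 to 8 per node reproduces the x-axis of the charts above.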
MPI Message Rate
- Using osu_mbw_mr to measure message rate:
  • Measure at several processes-per-node counts
  • See if results scale with additional processes per node
[Chart: MPI message rate (million messages/sec) vs. MPI processes per node (1-8), for QLogic IB DDR and Other IB DDR; higher is better. Measured on 2x Xeon E5410 2.33 GHz quad-core CPUs, 8 cores per node; both PCIe x8 DDR InfiniBand adapters.]
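Message rate and small-message bandwidth are two views of the same measurement (osu_mbw_mr reports both): rate = bytes per second ÷ message size. As a hypothetical example, sustaining 18 million 8-byte messages per second moves only 144 MB/s of payload, far below the DDR link rate, which is why message rate rather than link bandwidth is the binding constraint for small messages.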
HPC Challenge Overview
- HPC Challenge component benchmarks are intended to test very different memory access patterns

Source: "HPC Challenge Benchmark," Piotr Luszczek, University of Tennessee Knoxville, SC2004, November 6-12, 2004, Pittsburgh, PA

[Figure: map of the HPCC components, including RandomRing Latency, across the memory-access-pattern space.]
Relationship of HPC Challenge to point-to-point benchmarks

- Are there benchmarks in HPCC that focus on latency, bandwidth, and message rate but involve more of the cluster than 2 cores on 2 nodes, or 16 cores on 2 nodes? Yes (a ring sketch follows this list):
  • Latency: RandomRing Latency
  • Bandwidth: PTRANS and RandomRing Bandwidth
  • Message rate: MPI RandomAccess
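In the spirit of HPCC's RandomRing Latency, here is a minimal sketch of a random-ring measurement; the real benchmark's ring construction, message sizes, and timing rules differ in detail, so treat this as illustrative only. Every rank exchanges small messages with neighbors in a randomly permuted ring, so most traffic crosses node boundaries and every core is active at once:

```c
/* Minimal sketch of a random-ring latency measurement (illustrative;
 * HPCC's RandomRing differs in detail). Rank 0 shuffles the ranks
 * into a random ring; each rank then exchanges 8-byte messages with
 * its ring neighbors while all ranks run concurrently.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* rank 0 builds a random permutation of 0..nprocs-1 and broadcasts it */
    int *ring = malloc(nprocs * sizeof(int));
    if (rank == 0) {
        for (int i = 0; i < nprocs; i++) ring[i] = i;
        srand(12345);
        for (int i = nprocs - 1; i > 0; i--) {
            int j = rand() % (i + 1), t = ring[i];
            ring[i] = ring[j]; ring[j] = t;
        }
    }
    MPI_Bcast(ring, nprocs, MPI_INT, 0, MPI_COMM_WORLD);

    int me = 0;
    while (ring[me] != rank) me++;                 /* my position in the ring */
    int next = ring[(me + 1) % nprocs];
    int prev = ring[(me + nprocs - 1) % nprocs];

    char sbuf[8] = {0}, rbuf[8];
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)                /* everyone sends around the ring at once */
        MPI_Sendrecv(sbuf, 8, MPI_CHAR, next, 0,
                     rbuf, 8, MPI_CHAR, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double us = (MPI_Wtime() - t0) / ITERS * 1e6, max_us;
    MPI_Reduce(&us, &max_us, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("random-ring latency ~ %.2f us\n", max_us);
    free(ring);
    MPI_Finalize();
    return 0;
}
```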
Scalable Latency: Number of cores

- QLogic HCAs' MPI latency scales better with larger clusters
- The QLogic InfiniBand DDR advantage over Other InfiniBand DDR is 70% at 8 nodes, 32 cores

[Chart: comparison of HPCC RandomRing latency (microseconds) vs. number of MPI processes (5-35), for Other IB DDR, QLogic IB SDR, and QLogic IB DDR; lower is better.]

From a QLogic white paper, "Understanding Low and Scalable Message Passing Interface Latency," available at http://www.qlogic.com/EducationAndResources/WhitePapersResourcelibraryHpc.aspx
Scalable Message Rate: GUPS
- QLogic QLE7200-series HCAs' message rate scales better
- 8x faster than the Other InfiniBand DDR HCA at 32 cores (and the advantage is climbing)
- GUPS = Giga (billion) Updates Per Second, a measure of the rate at which messages can be sent to/from random cluster nodes/cores

* QLogic measurements: one QLogic SilverStorm 9024 DDR switch; 8 systems, each with 2x 2.6 GHz Opteron 2218 CPUs, 8 GB DDR2-667 memory, and an NVIDIA MCP55 PCIe chipset. Base MPI RandomAccess source code used; no source tweaks to reduce the number of messages sent.
[Chart: HPCC MPI_RandomAccess GUPS vs. number of cores (5-35), for QLogic IB DDR and Other IB DDR; higher is better.]
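To put the GUPS metric in scale (hypothetical round figures): 0.128 GUPS is 128 million 8-byte random remote updates per second cluster-wide, about 4 million per core on 32 cores, yet only roughly 1 GB/s of aggregate payload. Sustaining it therefore depends on small-message rate far more than on link bandwidth.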
Application Benchmarks
Application Benchmarks: SPEC MPI2007
- A suite that measures CPU, memory, interconnect, MPI, compiler, and file-system performance
- SPEC institutes discipline and fairness in benchmarking:
  • Rigorous run rules
  • All use the same source code, or performance-neutral alternate sources
  • Disclosure rules: system, adapter, switch, firmware, driver, compiler optimizations, etc.
  • Peer review of submissions
  • Therefore, more difficult to game
SPEC MPI2007 Benchmarks 1-6

| Benchmark | Language | Application Area | Brief Description |
|---|---|---|---|
| 104.milc | C | Quantum Chromodynamics | A gauge-field generating program for lattice gauge theory programs with dynamical quarks |
| 107.leslie3d | Fortran | Computational Fluid Dynamics | CFD using large-eddy simulations with a linear-eddy mixing model in 3D |
| 113.GemsFDTD | Fortran | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method |
| 115.fds4 | Fortran | CFD: Fire Dynamics Simulator | A CFD model of fire-driven fluid flow, with an emphasis on smoke and heat transport from fires |
| 121.pop2 | Fortran/C | Climate Modeling | The Parallel Ocean Program (POP), developed at LANL |
| 122.tachyon | C | Graphics: Ray Tracing | A nearly embarrassingly parallel ray-tracing program with low MPI usage |
SPEC MPI2007 Benchmarks 7-13

| Benchmark | Language | Application Area | Brief Description |
|---|---|---|---|
| 126.lammps | C++ | Molecular Dynamics | A classical molecular dynamics simulation code designed for parallel computers |
| 127.wrf2 | C/Fortran | Weather Forecasting | Code is based on the Weather Research and Forecasting (WRF) Model |
| 128.GAPgeofem | C/Fortran | Heat Transfer using FEM | A parallel finite element method (FEM) code for transient thermal conduction with gap radiation |
| 129.tera_tf | Fortran | 3D Eulerian Hydrodynamics | Code uses a 2nd-order Godunov scheme and a 3rd-order remapping |
| 130.socorro | C/Fortran | Molecular Dynamics | Molecular dynamics using density-functional theory (DFT) |
| 132.zeusmp2 | Fortran | Computational Astrophysics | Performs various hydrodynamic simulations on 1D, 2D, and 3D grids |
| 137.lu | Fortran | Implicit CFD | Solves a regular sparse block lower- and upper-triangular system using SSOR |
SPEC MPI2007 on the web
- The result score is an average of per-code ratios across the 13 codes: each ratio is the run time on the reference platform (the first-listed system) divided by the run time on your system, and the reported metric is the geometric mean of the 13 ratios (a sketch below illustrates the arithmetic)
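A short sketch of how a SPEC-style score aggregates per-code ratios; the run times below are hypothetical and only 3 of the 13 codes are shown for brevity (this is not SPEC's official tooling):

```c
/* Illustrative SPEC-style scoring: ratio = reference time / measured
 * time per code, combined with a geometric mean.
 * Compile with: cc score.c -o score -lm
 */
#include <math.h>
#include <stdio.h>

int main(void) {
    /* hypothetical run times in seconds for 3 of the 13 codes */
    double ref[]      = { 1500.0, 2100.0, 1800.0 };
    double measured[] = {  300.0,  700.0,  450.0 };
    int n = 3;
    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double ratio = ref[i] / measured[i];   /* higher is better */
        printf("code %d: ratio %.2f\n", i, ratio);
        log_sum += log(ratio);                 /* geometric mean via logs */
    }
    printf("score (geometric mean): %.2f\n", exp(log_sum / n));
    return 0;
}
```

With ratios of 5.0, 3.0, and 4.0, the geometric mean is (5.0 × 3.0 × 4.0)^(1/3) ≈ 3.91; the geometric mean keeps one outlier code from dominating the score.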
Scaling with SPEC MPI2007
[Chart: relative scaling of each of the 13 SPEC MPI2007 codes (104.milc, 107.leslie3d, 113.GemsFDTD, 115.fds4, 121.pop2, 122.tachyon, 126.lammps, 127.wrf2, 128.GAPgeofem, 129.tera_tf, 130.socorro, 132.zeusmp2, 137.lu) at 32, 64, 128, 256, and 512 cores.]
In Summary
- The best benchmark is "your application," particularly since applications are usually run on all cores of the nodes used for the job
- There is a range of MPI benchmarks because they all have their place:
  • Microbenchmarks are easier and quicker to run, and are evolving to test multi-core nodes
  • Application benchmarks are a bit more difficult to run, but are a better predictor of performance across a range of applications
- Benchmarks are evolving to serve the needs of ever-expanding multi-core systems