CSC 7600 Lecture 4: Benchmarking, Spring 2011

Slide 1

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

BENCHMARKING

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 27, 2011

Slide 2

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 3

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 4

Basic Performance Metrics

• Time related:
  – Execution time [seconds] (see the sketch after this list)
    • wall clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
    • sustained performance / peak performance
  – Memory consumption [bytes]
  – Productivity [utility/($·second)]
• Performance measures:
  – Sustained performance
  – Peak performance
  – Benchmark sustained performance
    • HPL Rmax
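To make the time- and rate-related metrics above concrete, here is a minimal C sketch (not from the lecture materials): it times a simple kernel with the POSIX wall clock, derives a sustained rate of computation, and computes an efficiency figure against an assumed theoretical peak. The kernel, the problem size and the 12 Gflops peak are illustrative values only.

/* Minimal sketch: wall-clock timing, sustained rate, efficiency.
   Kernel, problem size and the 12 Gflops "peak" are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    long n = 5 * 1000 * 1000;
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    for (long i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);           /* wall clock, not CPU time */
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i] * b[i];                          /* 2 flops per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double rate = 2.0 * n / secs;                  /* rate of computation [flops] */
    double peak = 12.0e9;                          /* assumed theoretical peak */
    printf("time %.4f s, %.1f Mflops, efficiency %.1f%% (s=%g)\n",
           secs, rate / 1e6, 100.0 * rate / peak, s);
    free(a); free(b);
    return 0;
}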

Slide 5

What Is a Benchmark?

Benchmark: "a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance)" [Merriam-Webster]

• The term "benchmark" also commonly applies to specially designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially created programs that impose the load on a specific component in the system
  – application: derived from a real-world application program

Slide 6

Purpose of Benchmarking

• Provide a tool enabling quantitative comparisons
  – comparison of variations within the same system
  – comparison of distinct systems
• Drive progress
  – enable better engineering by defining measurable and repeatable objectives
• Establish a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)

Slide 7

Properties of a Good Benchmark

• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architectures
• Coverage: does not over-constrain the typical environment (does not require any special conditions)
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation

Adapted from: "Standard Benchmarks for Database Systems" by Charles Levine, SIGMOD '97

Slide 8

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 9

Early Benchmarks

• Whetstone
  – floating point intensive
• Dhrystone
  – integer and character string oriented
• Livermore Fortran Kernels
  – "Livermore Loops"
  – collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation

The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark

Slide 10

Whetstone

• Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK)
• Named after the Whetstone Algol translator-interpreter on the KDF9 computer
• Measures primarily floating point performance in WIPS: Whetstone Instructions Per Second
• Also raised the issue of the efficiency of different programming languages
• The original Algol code was translated to C and Fortran (single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

Slide 11

Dhrystone

• Synthetic benchmark developed in 1984 by Reinhold Weicker
• The name is a pun on "Whetstone"
• Measures integer and string operation performance, expressed in number of iterations (Dhrystones) per second
• Alternative unit: D-MIPS, normalized to VAX 11/780 performance (see the sketch below)
• Latest version released: 2.1; includes implementations in C, Ada and Pascal
• Superseded by the SPECint suite

[Photo: Gordon Bell and the VAX 11/780]
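The D-MIPS normalization can be shown in one line; a sketch assuming the commonly quoted VAX 11/780 reference score of 1757 Dhrystones per second and a made-up measured rate:

#include <stdio.h>

int main(void) {
    double dhrystones_per_sec = 10.0e6;   /* hypothetical measured rate */
    double vax_780 = 1757.0;              /* VAX 11/780 Dhrystones/s ("1 MIPS") */
    printf("%.0f D-MIPS\n", dhrystones_per_sec / vax_780);
    return 0;
}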

Slide 12

Livermore Fortran Kernels (LFK)

• Developed at Lawrence Livermore National Laboratory in 1970
  – also known as Livermore Loops
• Consists of 24 separate kernels:
  – hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others
  – include careful and careless coding practices
• Produces 72 timing results, using 3 different DO-loop lengths for each kernel
• Produces Megaflops values for each kernel and range statistics of the results
• Can be used as a performance, compiler accuracy (checksums stored in code) or hardware endurance test

Slide 13

NAS Kernel

• Developed at the Numerical Aerodynamic Simulation Projects Office at NASA Ames
• Focuses on vector floating point performance
• Consists of 7 test kernels in Fortran (approx. 1000 lines of code):
  – matrix multiply
  – complex 2-D FFT
  – Cholesky decomposition
  – block tri-diagonal matrix solver
  – vortex method setup with Gaussian elimination
  – vortex creation with boundary conditions
  – parallel inverse of three matrix pentadiagonals
• Reports performance in Mflops (64-bit precision)

Slide 14

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 15

Linpack Overview

• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and G.W. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers
• Provides an estimate of the system's effective floating-point performance
• Does not reflect the overall performance of the machine!

Slide 16

Linpack Benchmark Variants

• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP, best effort
• Linpack's Highly Parallel Computing benchmark (HPL)
• Java Linpack

Slide 17

Fortran Linpack (I)

N=100 case
• Provides results listed in Table 1 of the "Linpack Benchmark Report"
• Absolutely no changes to the code can be made (not even in comments!)
• The matrix generated by the program must be used to run this case
• An external timing function (SECOND) has to be supplied
• Only compiler-induced optimizations are allowed
• Measures the performance of two routines (sketched in C below):
  – DGEFA: LU decomposition with partial pivoting
  – DGESL: solves a system of linear equations using the result from DGEFA
• Complexity: O(n²) for DGESL, O(n³) for DGEFA
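For reference, a compact C rendition of what the two timed routines compute; this is a sketch of the textbook algorithm, not the benchmark's Fortran source. dgefa performs the O(n³) LU factorization with partial pivoting, dgesl the O(n²) forward and backward triangular solves (no singularity check is made here). Column-major storage and the BLAS-style names are kept for flavor only.

#include <math.h>
#include <stdio.h>

/* Factor A (n x n, column-major, leading dimension lda) in place;
   ipvt records the pivot row chosen at each step. O(n^3). */
void dgefa(double *a, int lda, int n, int *ipvt) {
    for (int k = 0; k < n - 1; k++) {
        int p = k;                               /* find pivot in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i + k*lda]) > fabs(a[p + k*lda])) p = i;
        ipvt[k] = p;
        double t = a[p + k*lda]; a[p + k*lda] = a[k + k*lda]; a[k + k*lda] = t;
        for (int i = k + 1; i < n; i++)          /* multipliers (column of L) */
            a[i + k*lda] /= a[k + k*lda];
        for (int j = k + 1; j < n; j++) {        /* rank-1 update, O(n^2) per step */
            t = a[p + j*lda]; a[p + j*lda] = a[k + j*lda]; a[k + j*lda] = t;
            for (int i = k + 1; i < n; i++)
                a[i + j*lda] -= a[i + k*lda] * a[k + j*lda];
        }
    }
    ipvt[n - 1] = n - 1;
}

/* Solve A x = b using the factors from dgefa; b is overwritten with x. O(n^2). */
void dgesl(const double *a, int lda, int n, const int *ipvt, double *b) {
    for (int k = 0; k < n - 1; k++) {            /* forward: L y = P b */
        double t = b[ipvt[k]]; b[ipvt[k]] = b[k]; b[k] = t;
        for (int i = k + 1; i < n; i++) b[i] -= a[i + k*lda] * b[k];
    }
    for (int k = n - 1; k >= 0; k--) {           /* backward: U x = y */
        b[k] /= a[k + k*lda];
        for (int i = 0; i < k; i++) b[i] -= a[i + k*lda] * b[k];
    }
}

int main(void) {
    /* Small test system; expected solution x = (1, 2, 3). */
    double a[9] = { 2, 1, 0,    /* column 0 */
                    1, 3, 1,    /* column 1 */
                    0, 1, 2 };  /* column 2 (column-major, lda = 3) */
    double b[3] = { 4, 10, 8 };
    int ipvt[3];
    dgefa(a, 3, 3, ipvt);
    dgesl(a, 3, 3, ipvt, b);
    printf("x = %g %g %g\n", b[0], b[1], b[2]);
    return 0;
}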

Slide 18

Fortran Linpack (II)

N=1000 case, Toward Peak Performance (TPP), best effort
• Provides results listed in Table 1 of the "Linpack Benchmark Report"
• The user can choose any method for solving the linear system
• Allows a complete replacement of the factorization/solver code by the user
• No restriction on the implementation language for the solver
• The solution must conform to prescribed accuracy, and the matrix used must be the same as the matrix used by the netlib driver

Slide 19

Linpack Fortran Performance on Different Platforms

Computer                                                   N=100      N=1000, TPP   Theoretical peak
                                                           [MFlops]   [MFlops]      [MFlops]
Intel Pentium Woodcrest (1 core, 3 GHz)                    3018       6542          12000
NEC SX-8/8 (8 proc., 2 GHz)                                -          75140         128000
NEC SX-8/8 (1 proc., 2 GHz)                                2177       14960         16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon)         -          8185          14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon)          1852       4851          7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz)               -          34570         60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz)               1776       5872          7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz)             1765       5953          6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz)       -          12860         22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz)        1717       4191          5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns)                       1156       8784          9600
Cray T932 (32 proc., 2.2 ns)                               1129 (1 proc.)  29360    57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz)      -          14260         20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz)      1122       2132          2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz)              -          14150         32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz)              843        2905          4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps

Slide 20

Fortran Linpack Demo

> ./linpack
 Please send the results of this run to:

 Jack J. Dongarra
 Computer Science Department
 University of Tennessee
 Knoxville, Tennessee 37996-1300

 Fax: 865-974-8296
 Internet: [email protected]

 This is version 29.5.04.

     norm. resid      resid           machep         x(1)          x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order   100
      dgefa      dgesl      total     mflops       unit      ratio       b(1)
 times for array with leading dimension of 201
  4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03 -9.159E-15
  4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03  1.000E+00
  4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03  1.000E+00
  4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03  5.298E+02
 times for array with leading dimension of 200
  4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03  1.000E+00
  4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03  5.298E+02
 end of tests -- this version dated 05/29/04

Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html

Annotations:
• dgefa: time spent in the matrix factorization routine
• dgesl: time spent in the solver
• total: total time (dgefa + dgesl)
• mflops: sustained floating point rate
• unit: "timing" unit (obsolete)
• ratio: fraction of Cray-1S execution time (obsolete)
• b(1): first element of the right hand side vector
• Two different leading dimensions (201 and 200) are used to test the effect of array placement in memory

Slide 21

Linpack’s Highly Parallel Computing Benchmark (HPL)

• Measures the performance of distributed memory machines
• Used in the "Linpack Benchmark Report" (Table 3) and to determine the order of machines on the Top500 list
• The portable version is written in C
• External dependencies for HPL installation:
  – MPI-1.1 functionality for inter-node communication
  – BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and inner dot product (DDOT: a = Σ xᵢyᵢ), both written out in the sketch below
• Ground rules:
  – allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound)
  – same matrix as in the driver program
  – no restrictions on problem size
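The two vector operations named above are simple enough to write out; a unit-stride C sketch of their semantics (the real BLAS routines additionally take stride arguments):

#include <stdio.h>

void daxpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];       /* y <- alpha*x + y */
}

double ddot(int n, const double *x, const double *y) {
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += x[i] * y[i];                 /* a <- sum_i x_i * y_i */
    return a;
}

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {1, 1, 1};
    daxpy(3, 2.0, x, y);                  /* y becomes (3, 5, 7) */
    printf("ddot = %g\n", ddot(3, x, y)); /* 1*3 + 2*5 + 3*7 = 34 */
    return 0;
}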

Slide 22

HPL Algorithm

• Data distribution: 2-D block-cyclic (see the index-mapping sketch below)
• Algorithm elements:
  – right-looking variant of LU factorization with row partial pivoting featuring multiple look-ahead depths
  – recursive panel factorization with pivot search and column broadcast combined
  – various virtual panel broadcast topologies
  – bandwidth-reducing swap-broadcast algorithm
  – backward substitution with look-ahead depth of one
• Floating point operation count: 2/3·n³ + n²
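To illustrate the 2-D block-cyclic distribution: the sketch below maps a global column index to its owning process column and local index for one dimension; HPL applies the same mapping in both dimensions over the P×Q process grid. The NB and Q values are arbitrary examples.

#include <stdio.h>

int main(void) {
    int NB = 32, Q = 4;                              /* block size, process columns */
    for (int j = 0; j < 256; j += 40) {
        int block = j / NB;
        int proc  = block % Q;                       /* owning process column */
        int local = (block / Q) * NB + j % NB;       /* local column index there */
        printf("global col %3d -> process col %d, local col %3d\n", j, proc, local);
    }
    return 0;
}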

Slide 23

HPL Algorithm Elements

http://www.netlib.org/benchmark/hpl/algorithm.html

[Figure: execution flow for a single parameter set — Matrix Generation → Panel Factorization → Panel Broadcast → Look-ahead → Update → (all columns of A processed? N: loop back; Y: continue) → Backward Substitution → Solution Check. Six broadcast algorithms are available. The matrix is distributed over a P×Q grid of processors; a right-looking variant of LU factorization is used: in each iteration of the loop a panel of NB columns is factorized and the trailing submatrix is updated.]

Slide 24

HPL Linpack Metrics

• The HPL implementation of the benchmark is run for different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax (see the arithmetic check below)
• Another metric obtainable from the benchmark is N1/2, the problem size at which half of the maximum performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers on the Top500 list; listed along with this number are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2
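As an arithmetic cross-check, plugging the N = 5000 run shown on slide 28 into the operation count quoted on slide 22 (2/3·n³ + n²) recovers the reported rate; the tiny discrepancy comes from the rounded 7.14 s.

#include <stdio.h>

int main(void) {
    double n = 5000.0, time_s = 7.14;               /* N and Time from slide 28 */
    double ops = (2.0 / 3.0) * n * n * n + n * n;   /* op count from slide 22 */
    printf("%.2f Gflops\n", ops / time_s / 1e9);    /* 11.67, vs 1.168e+01 reported */
    return 0;
}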

Slide 25

Machine Parameters Influencing Linpack Performance

Parameter                   Linpack Fortran, N=100   Linpack Fortran, N=1000, TPP   HPL
Processor speed             Yes                      Yes                            Yes
Memory capacity             No                       No (modern systems)            Yes (for Rmax)
Network latency/bandwidth   No                       No                             Yes
Compiler flags              Yes                      Yes                            Yes

Slide 26

Ten Fastest Supercomputers On Current Top500 List

[Table omitted in transcript.]

Source: http://www.top500.org/sublist

Slide 27

Java Linpack

• Intended mostly to measure the efficiency of a Java implementation rather than hardware floating point performance
• Solves a dense 500×500 system of linear equations with one right-hand side, Ax = b
• Matrix A is generated randomly
• Vector b is constructed so that all components of the solution x are one (see the sketch below)
• Uses Gaussian elimination with partial pivoting
• Reports: Mflops, time to solution, Norm Res (solution accuracy), relative machine precision
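The right-hand-side construction described above amounts to summing the rows of A; a small C sketch of the idea (the benchmark itself is written in Java):

#include <stdio.h>

/* Build b = A * (1,...,1)^T, so the exact solution of Ax = b is all ones. */
void make_rhs(int n, const double *A, double *b) {  /* A: row-major n x n */
    for (int i = 0; i < n; i++) {
        b[i] = 0.0;
        for (int j = 0; j < n; j++)
            b[i] += A[i*n + j];                     /* sum of row i */
    }
}

int main(void) {
    double A[4] = { 1, 2,
                    3, 4 }, b[2];
    make_rhs(2, A, b);
    printf("b = %g %g\n", b[0], b[1]);              /* 3 and 7 */
    return 0;
}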

Slide 28

HPL Output Example

> mpirun -np 4 xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 5000
NB     : 32
PMAP   : Row-major process mapping
P      : 2 1 4
Q      : 2 4 1
PFACT  : Left
NBMIN  : 2
NDIV   : 2
RFACT  : Left
BCAST  : 1ringM
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
  1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
  2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
  3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     2     2               7.14          1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N )        = 0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1 )  = 0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0051580 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     1     4               7.00          1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N )        = 0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1 )  = 0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0043224 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     4     1               7.00          1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N )        = 0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1 )  = 0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0054928 ...... PASSED
============================================================================

Finished 3 tests with the following results:
  3 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html

Slide 29

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 30

Other Parallel Benchmarks

• High Performance Computing Challenge (HPCC) benchmarks
  – devised and sponsored to enrich the benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
  – powerful set of metrics
  – reflect computational fluid dynamics
• NPBIO-MPI
  – stresses the external I/O system

Slide 31

HPC Challenge Benchmark

Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations
• DGEMM: floating point rate of execution of double precision matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for a simple vector kernel (triad sketch below)
• PTRANS (parallel matrix transpose): total capacity of the network, using pairwise communicating processes
• RandomAccess: the rate of integer random updates of memory (in GUPS: Giga-Updates Per Second)
• FFT: floating point rate of execution of a double precision complex 1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
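As an example of the "simple vector kernel" behind the STREAM figure, here is a C sketch of the triad operation a(i) = b(i) + q·c(i); the array length is an illustrative choice (it should comfortably exceed the cache size). Each iteration moves 24 bytes (two reads and one write of a double), so bandwidth ≈ 24·n / time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    long n = 20 * 1000 * 1000;                       /* much larger than cache */
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b),
           *c = malloc(n * sizeof *c), q = 3.0;
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];                      /* triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad: %.2f GB/s\n", 24.0 * n / s / 1e9);
    free(a); free(b); free(c);
    return 0;
}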

Slide 32

Comparison of HPCC Results on Selected Supercomputers

Notes:
• all metrics shown are "higher is better", except for Random Ring Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type

[Bar chart: percentage of the maximum value achieved by each system on G-HPL (max = 91 Tflops), G-PTRANS (max = 4666 GB/s), G-RandomAccess (max = 7.69 GUP/s), G-FFTE (max = 1763 Gflops), EP-STREAM system (max = 62890 GB/s), EP-DGEMM system (max = 161885 Gflops), Random Ring Bandwidth (max = 0.829 GB/s), and Random Ring Latency (max = 118.6 μs). Systems compared: "Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3D mesh); IBM p5-575, LLNL (Power5/IBM HPS); IBM Blue Gene/L, NNSA (PowerPC 440/IBM custom 3D torus & tree); Cray X1E, ORNL (X1E/Cray modified 2D torus); HP XC, Government (Itanium2/Quadrics Elan4); "Columbia" SGI, NASA (Itanium2/SGI NUMALINK); NEC SX-8, HLRS (SX-8/IXS crossbar); "Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband).]

Slide 33

NAS Parallel Benchmarks

• Derived from computational fluid dynamics (CFD) applications
• Consist of five kernels and three pseudo-applications
• Exist in several flavors:
  – NPB 1: original "paper-and-pencil" specification
    • generally proprietary implementations by hardware vendors
  – NPB 2: MPI-based sources distributed by NAS
    • supplements NPB 1
    • can be run with little or no tuning
  – NPB 3: implementations in OpenMP, HPF and Java
    • derived from the NPB-serial version with improved serial code
    • a set of multi-zone benchmarks was added
    • tests implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI)
  – GridNPB 3: new suite of benchmarks, designed to rate the performance of computational grids
    • includes only four benchmarks, derived from the original NPB
    • written in Fortran and Java
    • Globus as grid middleware

Slide 34

NPB 2 Overview

• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
  – BT (block tri-diagonal solver with 5x5 block size)
  – CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix)
  – EP ("embarrassingly parallel"; evaluates an integral by means of pseudorandom trials — pattern sketched below)
  – FT (3-D PDE solver using Fast Fourier Transforms)
  – IS (large integer sort; tests both integer computation speed and network performance)
  – LU (a regular-sparse, 5x5 block lower and upper triangular system solver)
  – MG (simplified multigrid kernel; tests both short and long distance data communication)
  – SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations)
• Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html
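To show the pattern that EP stresses: independent pseudorandom trials with a single final reduction. This is only an illustration of the shape of the kernel, not the NPB EP benchmark itself, which prescribes a specific linear congruential generator and counts Gaussian pairs.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long trials = 10 * 1000 * 1000, hits = 0;
    unsigned int seed = 42;                      /* each process/thread would own its seed */
    for (long i = 0; i < trials; i++) {
        double x = rand_r(&seed) / (double)RAND_MAX;
        double y = rand_r(&seed) / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;        /* trials are fully independent */
    }
    printf("pi ~ %.5f\n", 4.0 * hits / trials);  /* single reduction at the end */
    return 0;
}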

Slide 35

NPBIO-MPI

• Attempts to address the lack of I/O tests in NPB, focusing primarily on file output
• Based on the BTIO (Block Tri-diagonal Input Output) effort, which extended the BT benchmark with routines writing five double precision numbers for every mesh point to storage
  – runs for 200 iterations, writing every five iterations
  – after all time steps are finished, all data belonging to a single time step must be stored in the same file, sorted by vector components
  – timing must include all required data rearrangements to achieve the specified data layout
• Supported access scenarios (compare the MPI-IO sketch below):
  – simple: MPI-IO without collective buffering
  – full: MPI-IO with collective buffering
  – fortran: Fortran 77 file operations
  – epio: each process writes its part of the computational domain continuously to a separate file
• Number of processes must be a square
• Problem sizes: class A (64³), class B (102³), class C (162³)
• Several possible results, depending on the benchmarking goal: effective flops, effective output bandwidth or output overhead
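To contrast the "simple" and "full" scenarios, here is a hedged MPI-IO sketch: the same contiguous write issued independently versus through a collective call (which lets the library apply collective buffering). The real BTIO code writes strided mesh data via derived datatypes; the file name and sizes here are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, n = 1 << 20;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(n * sizeof *buf);
    for (int i = 0; i < n; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "btio.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);

#ifdef FULL /* "full": collective buffering may aggregate requests across ranks */
    MPI_File_write_at_all(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
#else       /* "simple": every process writes on its own */
    MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
#endif

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}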

Slide 36

Sample NPB 2 Results

[Results table omitted in transcript.]

Reference: "The NAS Parallel Benchmarks 2.1 Results" by W. Saphir, A. Woo, and M. Yarrow, http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas-96-010.pdf

Slide 37

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 38

Benchmarking Organizations

• SPEC (Standard Performance Evaluation Corporation)
  – created to satisfy the need for realistic, fair and standardized performance tests
  – motto: "An ounce of honest data is worth more than a pound of marketing hype"
• TPC (Transaction Processing Performance Council)
  – formed primarily due to the lack of reliable database benchmarks

Slide 39

SPEC Benchmark Suite Overview

• The Standard Performance Evaluation Corporation (SPEC) is a non-profit organization (financed by its members: over 60 leading computer and software manufacturers) founded in 1988
• SPEC benchmarks are written in a platform-neutral language (typically C or Fortran)
• The code may be compiled using arbitrary compilers, but the sources may not be modified
  – many manufacturers are known to optimize their compilers and/or systems to improve SPEC results
• Benchmarks may be obtained by purchasing a license from SPEC; the results are published on the SPEC website
• Website: http://www.spec.org

Slide 40

SPEC Suite Components

• SPEC CPU2006: combined performance of CPU, memory and compiler
  – CINT2006 (aka SPECint): integer arithmetic test using compilers, interpreters, word processors, chess programs, etc.
  – CFP2006 (aka SPECfp): floating point test using physical simulations, 3D graphics, image processing, computational chemistry, etc.
• SPECweb2005: PHP/JSP performance
• SPECviewperf: OpenGL 3D graphics system performance
• SPECapc: several popular 3D-intensive applications
• SPEC HPC2002: high-end parallel computing tests using a quantum chemistry application, weather modeling, and an industrial oil deposit locator
• SPEC OMP2001: OpenMP application performance
• SPECjvm98: performance of a Java client on a Java VM
• SPECjAppServer2004: multi-tier benchmark measuring the performance of J2EE application servers
• SPECjbb2005: server-side Java performance
• SPEC MAIL2001: mail server performance (SMTP and POP)
• SPEC SFS97_R1: NFS server throughput and response time
• Planned: SPEC MPI2006, SPECimap, SPECpower, virtualization

Slide 41

Sample Results: SPEC CPU2006

System                                                      CINT2006 speed   CFP2006 speed   CINT2006 rate   CFP2006 rate
                                                            (base/peak)      (base/peak)     (base/peak)     (base/peak)
Dell Precision 380 (Pentium EE 965 3.73 GHz, 2 cores)       11.6             12.4            23.1            21.7
HP ProLiant DL380 G4 (Xeon 3.8 GHz, 2 cores)                11.4             11.7            20.9            18.8
HP ProLiant DL585 (Opteron 854 2.8 GHz, 2 cores)            11.2 / 12.7      12.1 / 13.0     22.3 / 25.2     24.1 / 25.9
Sun Blade 2500 (1 UltraSPARC IIIi, 1280 MHz)                4.04             4.04            -               -
Sun Fire E25K (UltraSPARC IV+ 1500 MHz, 144 cores)          -                -               759             904
HP Integrity rx6600 (Itanium2 1.6 GHz/24 MB, 2 cores)       14.5             15.7            17.3            18.1
HP Integrity rx6600 (Itanium2 1.6 GHz/24 MB, 8 cores)       -                -               94.7 / 102      69.1 / 71.4
HP Integrity Superdome (Itanium2 1.6 GHz/24 MB, 128 cores)  -                -               1534 / 1648     1422 / 1479

Notes:
• the base metric requires that the same flags are used when compiling all instances of the benchmark (peak is less strict)
• the speed metric measures how fast a computer executes a single task, while rate determines throughput with multiple tasks
• single values are base results; cells marked "-" were not reported in this excerpt

Slide 42

TPC

• Governed by the Transaction Processing Performance Council (http://www.tpc.org), founded in 1985
  – members include leading system and microprocessor manufacturers and commercial database developers
  – the council appoints professional affiliates and auditors outside the member group to help fulfill the TPC's mission and validate benchmark results
• Current benchmark flavors:
  – TPC-C for transaction processing (de facto standard for On-Line Transaction Processing)
  – TPC-H for decision support systems
  – TPC-App for web services
• Obsolete benchmarks:
  – TPC-A (performance of update-intensive databases)
  – TPC-B (throughput of a system in transactions per second)
  – TPC-D (decision support applications with long running queries against complex data structures)
  – TPC-R (business reporting, decision support)
  – TPC-W (transactional web e-Commerce benchmark)

Slide 43

Top Ten TPC-C Results

[Table omitted in transcript.]

Slide 44

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 45

Presentation of the Results

• Tables
• Graphs
  – bar graphs (a)
  – scatter plots (b)
  – line plots (c)
  – pie charts (d)
  – Gantt charts (e)
  – Kiviat graphs (f)
• Enhancements
  – error bars, boxes or confidence intervals
  – broken or offset scales (be careful!)
  – multiple curves per graph (but avoid overloading)
  – data labels, colors, etc.

[Figure: examples (a)-(f) of the graph types listed above; the scatter plot uses HPCC metrics (G-HPL, G-PTRANS, G-FFTE, G-RanAcc, G-Stream, EP-Stream).]

Slide 46

Kiviat Graph Example

[Kiviat graphs omitted in transcript.]

Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml

Slide 47

Mixed Graph Example

[Figure: characterization of NSF/CCT parallel applications on the POWER5 architecture (using data collected by IPM). For each application — WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D — the chart combines computation and communication fractions with a breakdown into floating point, load/store and other operations.]

Slide 48

Graph Do’s and Don’ts

• Good graphs:
  – require minimum effort from the reader
  – maximize information
  – maximize the information-to-ink ratio
  – use commonly accepted practices
  – avoid ambiguity
• Poor graphs:
  – have too many alternatives on a single chart
  – display too many y-variables on a single chart
  – use vague symbols in place of text
  – show extraneous information
  – select scale ranges improperly
  – use a line chart instead of a bar graph

Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10

Slide 49

Common Mistakes in Benchmarking

From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain:

• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis

Slide 50

Misrepresentation of Performance Results on Parallel Computers

1. Quote only 32-bit performance results, not 64-bit results
2. Present performance for an inner kernel, representing it as the performance of the entire application
3. Quietly employ assembly code and other low-level constructs
4. Scale problem size with the number of processors, but omit any mention of this fact
5. Quote performance results projected to the full system
6. Compare your results with scalar, unoptimized code run on another platform
7. When direct run time comparisons are required, compare with an old code on an obsolete system
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
10. Mutilate the algorithm used in the parallel implementation to match the architecture
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance

Reference: David Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf

Slide 51

Topics

• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Slide 52

Material For Test

• Basic performance metrics (slide 4)
• Definition of a benchmark in your own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL (slides 21 and 24)
• Linpack compare and contrast (slide 25)
• General knowledge about the HPCC, SPEC and NPB suites (slides 30, 31, 34, 39)
• Kiviat graphs (slide 46)
• Benchmark result interpretation (slides 49, 50)
