Blue Waters Training – Managing HPC Centers Benchmarking

Bill Kramer, Gregory Bauer, Aaron Saxton, Blue Waters Project
[email protected], [email protected], [email protected]
UNCLASSIFIED

Transcript of Blue Waters Training – Managing HPC Centers Benchmarking

Page 1: Blue Waters Training – Managing HPC Centers Benchmarking

Blue Waters Training – Managing HPC Centers Benchmarking
Bill Kramer, Gregory Bauer, Aaron Saxton, Blue Waters Project
[email protected], [email protected], [email protected]

UNCLASSIFIED

Page 2: Blue Waters Training – Managing HPC Centers Benchmarking

Best Practice: Informed Benchmarks

• Full system evaluations
  • PERCU
• Best practices for comparing performance across systems
  • SSP/SPP
• Evaluating applications and preparing benchmarks
  • Workload studies
• ML/AI benchmarking


Page 3: Blue Waters Training – Managing HPC Centers Benchmarking

Challenges Evaluating High Performance Computing and Data Analysis (HPC&D) Systems

• HPC&D systems typically have multiple application targets
  • Communities and applications are getting broader
  • Applications and implementations change over time
• HPC&D systems are complex, multi-faceted ecosystems
  • Single measures cannot address the current and future complexity of such systems
• Parallel requirements and performance are more tightly coupled than in many other systems
  • But even with single-node/core applications, contention may influence performance
• HPC&D systems have multiple stakeholders, possibly with different expectations
• Point evaluations do not address the life cycle of a living system
  • On-going usage
  • On-going system management
• HPC&D systems are not stagnant
  • Software changes
  • New components - additional capability or repair
  • Workload changes
• So evaluation methods that can be on-going are needed


Page 4: Blue Waters Training – Managing HPC Centers Benchmarking

The PERCU Method Is Based on What Users Want

• Performance
  • How fast will a system process work if everything is working really well?
  • This establishes a system's potential to do productive work for a workload
• Effectiveness
  • The likelihood that users can get the system to do their work when they need it done - resource management
• Reliability
  • The likelihood that the system is available to do the work when the work is needed
• Consistency
  • How often will the system process the same or similar work correctly and in the same length of time?
• Usability
  • How easy is it for users to get their work ready for the system to process?

Cost and other "business factors" are also part of the "best value" decision making that can be part of the process.


Page 5: Blue Waters Training – Managing HPC Centers Benchmarking

A Key Factor in Selecting a System

• The "if I wait, the technology will get better and I will get more for my money" syndrome
• The fundamental decision process behind the question of "what and when to buy" requires an evaluator to assess:
  1. How well the technology will really meet the needs it is intended to serve
  2. The availability of the technology
  3. The cost (in some form of resources) of the technology
• These assessments establish the Potential of each alternative system and the Value of each alternative

[Figure: Anonymized SSP Evaluation - SSP (GFlop/s) versus month from initial system, for Systems 1 through 5]


Page 6: Blue Waters Training – Managing HPC Centers Benchmarking

Characteristics of the PERCU Method

• The methodology is flexible, so it applies to:
  • Different use cases
  • Workloads and usages
  • Scales
    • System size, cost, scale, etc.
  • Levels of evaluation effort - from quick and dirty to highly formalized
  • Quality of service goals
    • Duty cycles, reliability
  • Communities
• The methodology should efficiently serve four purposes:
  • Differentiate (select) a system from among its competitors
  • Validate that the system works as expected once it is built and/or arrives
  • Assure that the system performs as expected throughout its lifetime
    • e.g. after upgrades, changes, and in regular use
  • Guide future system designs and implementations
• The method is used by many organizations
  • DOE labs, DoD Mod, NCSA, ...


Page 7: Blue Waters Training – Managing HPC Centers Benchmarking


Benchmark and Test Hierarchy

[Figure: Benchmark and test hierarchy. Understanding increases toward the simpler tests while integration (reality) increases toward the fuller ones: system component tests, kernels, stripped-down apps, full applications, composite tests, and the full workload. Workflow: analyze the application workload, select representative applications and tests, determine the test cases (e.g. input, concurrency), then package and verify the tests for the target system.]

Page 8: Blue Waters Training – Managing HPC Centers Benchmarking

Sustained System Performance (SSP) Method

• Establish a set of performance tests that reflect the intended work the system will do
  • Can be any number of tests, as long as they have a common measure of the amount of work
  • A test consists of a code and a problem set
• Establish the reference amount of work (ops, atoms, years simulated, etc.) the test needs to do for a fixed concurrency
• Time how long each test takes to execute on each system
  • Concurrency and/or optimization can be fixed and/or varied as desired
• Determine the amount of work done for a given "schedulable unit" (node, socket, core, task, thread, interface, etc.)
  • Work rate = total operations / total time for the test
  • Work per unit = work rate / number of schedulable units used for the test
• Composite the work per schedulable unit across all tests
  • The composite function is chosen based on circumstances and test selection criteria
  • Tests can be weighted or not, as desired
• Determine the SSP of a system at any time period by multiplying the composite work per schedulable unit by the number of schedulable units in the system
• Determine the Sustained System Performance for each phase of each system being evaluated

For each system s and time period k:

SSP_{s,k} = Σ_{a=1}^{A_{s,k}} F(W, P_{s,k,a}) · N_{s,k,a}

where F is the composite function applied to the reference work W and the measured performance P_{s,k,a}, and N_{s,k,a} is the number of schedulable units of kind a in system s during period k.
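To make the recipe above concrete, here is a minimal sketch (not from the slides) of how a per-phase SSP could be computed in Python; the test names, reference work values, times, unit counts, and the choice of an unweighted geometric mean as the composite function F are all illustrative assumptions.

```python
import math

# Hypothetical per-test data: reference work (Tflop) and measured time (s)
# on a fixed number of schedulable units (e.g. nodes). Values are illustrative.
tests = {
    "app_A": {"ref_work_tflop": 1200.0, "time_s": 350.0, "units_used": 64},
    "app_B": {"ref_work_tflop": 800.0,  "time_s": 410.0, "units_used": 64},
    "app_C": {"ref_work_tflop": 1500.0, "time_s": 290.0, "units_used": 128},
}

def per_unit_rate(t):
    """Work rate per schedulable unit: (reference work / time) / units used."""
    return (t["ref_work_tflop"] / t["time_s"]) / t["units_used"]

def composite(rates):
    """Composite function F: unweighted geometric mean of the per-unit rates."""
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

def ssp(tests, units_in_system):
    """SSP = composite per-unit rate x number of schedulable units in the system."""
    rates = [per_unit_rate(t) for t in tests.values()]
    return composite(rates) * units_in_system

print(f"SSP ~ {ssp(tests, units_in_system=4000):.1f} Tflop/s")  # for a 4000-unit system
```

A weighted composite, or a different mean, drops in simply by replacing composite().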


Page 9: Blue Waters Training – Managing HPC Centers Benchmarking

Value and Price Performance for Alternatives

1. Determine the Potency of the system - how well the system will perform the expected work over some time period
   • Potency is the sum, over the specified time, of the product of a system's SSP and the duration of the period to which that SSP applies
   • Different SSPs for different periods
   • Different SSPs for different types of computational units (heterogeneous systems)
2. Determine the Cost of the systems
   • Cost can be in any resource units ($, Watts, space, ...) and computed with any complexity (initial cost, TCO, ...)
3. Determine the Value of the system
   • Value is the Potency divided by a cost function
4. If needed, compare the value of different system alternatives

Potency_s = Σ_{k=1}^{K_s} SSP_{s,k} · [ min(t_{s,k+1}, t_max) − min(t_{s,k}, t_max) ],  ∀ t_{s,k} ≤ t_max

Cost_s = Σ_{k=1}^{K_s} Σ_{l=1}^{L_{s,k}} c_{s,k,l}

Value_s = Potency_s / Cost_s
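As a minimal sketch of the Potency, Cost, and Value formulas above (the phase SSPs, times, and costs below are invented for illustration):

```python
# Each phase k of system s: (SSP_k in Pflop/s, start time t_k in years from t0).
# t_max is the evaluation horizon. All numbers are illustrative assumptions.
phases = [
    (0.0, 0.0),   # nothing delivered before initial installation
    (1.0, 0.5),   # phase 1: 1.0 Pflop/s sustained, starting at 0.5 yr
    (2.5, 1.5),   # phase 2: upgrade to 2.5 Pflop/s at 1.5 yr
]
t_max = 5.0  # years

def potency(phases, t_max):
    """Sum of SSP_k times the duration of phase k, clipped to the horizon t_max."""
    total = 0.0
    for k, (ssp_k, t_k) in enumerate(phases):
        t_next = phases[k + 1][1] if k + 1 < len(phases) else t_max
        total += ssp_k * (min(t_next, t_max) - min(t_k, t_max))
    return total  # Pflop/s * years ("Petaflop-years")

cost = 200.0 + 5 * 30.0   # e.g. initial cost plus 5 years of operations, in $M; illustrative
pot = potency(phases, t_max)
print(f"Potency = {pot:.2f} Pflop-years, Value = {pot / cost:.4f} Pflop-years per $M")
```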


Page 10: Blue Waters Training – Managing HPC Centers Benchmarking

Simplified SPP Comparison

[Figure: the proposed deployment time and SSP of two systems, and the SSP performance chart after the periods are aligned. For clarity, τ'_{2,k} replaces τ_{2,k}. The green area represents the integrated performance potential for System 2 - think Petaflop-years.]
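As a hypothetical worked example of the Petaflop-years idea (the numbers here are invented for illustration): a system that sustains an SSP of 2 Pflop/s for its first year and, after an upgrade, 3 Pflop/s for the next two years accumulates a Potency of 2·1 + 3·2 = 8 Petaflop-years over that three-year window.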

Page 11: Blue Waters Training – Managing HPC Centers Benchmarking

Workload Analysis and NSF Sustained Petascale Performance (SPP) Benchmarks

• What are the main/most important aspects of a workload?
• Substantial effort and expertise are needed to craft a good, broad benchmark suite
• Understand the current and, if possible, future workloads
  • https://arxiv.org/abs/1703.00924
• That understanding drives the benchmarks and metrics
  • https://bluewaters.ncsa.illinois.edu/benchmarks
  • https://bluewaters.ncsa.illinois.edu/spp-methodology


Page 12: Blue Waters Training – Managing HPC Centers Benchmarking

Updating the SPP Applications

• The composition of applications is re-evaluated every several years, based on workload studies and knowledge of what could be coming
• Technology providers typically make specific improvements aimed at the benchmarks, but these may not be generally applicable
• One application was removed and four applications were added
• The full suite with input files requires over 200 GB

Greg Bauer, Victor Anisimov, Galen Arnold, Brett Bode, Robert Brunner, Tom Cortese, Roland Haas, Andriy Kot, William Kramer, JaeHyuk Kwack, Jing Li, Celso Mendes, Ryan Mokos, Craig Steffen (2017): Updating the SPP Benchmark Suite for Extreme-Scale Systems, presented at CUG 2017, Redmond, Washington, U.S.A.


Page 13: Blue Waters Training – Managing HPC Centers Benchmarking

Real World Example - NGA Application Performance Evaluating with the Roofline Analysis

• Real world use with an NGA Application Earth Gravity Model solver code

• Know your performance limiters;• Determine what the upside potential exists.• 3 implementation variations shown as

performance improvements were made.• Best performing 2D2DmanyRhs version is

limited by L1 cache rate and/or realized DP flop rate.

• Increasing an algorithm's Computational Intensity (CI) to determine the improvement upside possibilities is a good indicator of whether to invest effort to improve an application performance


[Figure: Rooflines for a Blue Waters XE node]
Roofline model: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
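The roofline bound used above can be sketched numerically. In the snippet below the peak flop rate and memory bandwidth are placeholder numbers, not measured Blue Waters XE values, but the min(peak compute, CI × peak bandwidth) structure is the standard roofline model.

```python
# Roofline bound: attainable Gflop/s = min(peak_flops, CI * peak_bandwidth).
# Peak numbers below are placeholders, NOT measured Blue Waters XE figures.
PEAK_DP_GFLOPS = 300.0     # hypothetical peak double-precision rate per node (Gflop/s)
PEAK_BW_GBYTES = 50.0      # hypothetical peak memory bandwidth per node (GB/s)

def roofline_bound(ci_flops_per_byte):
    """Upper bound on performance for a kernel with the given computational intensity."""
    return min(PEAK_DP_GFLOPS, ci_flops_per_byte * PEAK_BW_GBYTES)

for ci in (0.5, 2.0, 8.0):
    bound = roofline_bound(ci)
    regime = "bandwidth-bound" if bound < PEAK_DP_GFLOPS else "compute-bound"
    print(f"CI = {ci:4.1f} flop/byte -> bound = {bound:6.1f} Gflop/s ({regime})")
```

Raising a kernel's CI moves it to the right on the roofline, which is exactly the "improvement upside" check described in the bullets above.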

Page 14: Blue Waters Training – Managing HPC Centers Benchmarking

What to Avoid Doing in Performance Comparisons…

• Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers - David H. Bailey, Supercomputing Review, August 1991, pp. 54-55. https://www.davidhbailey.com//dhbpapers/twelve-ways.pdf

• Misleading Performance Claims in Parallel Computations - David H. Bailey, July 6, 2009. https://www.davidhbailey.com/dhbpapers/inv3220-bailey.pdf

• Fooling the Masses with Performance Results: Old Classics & Some New Ideas - Georg Hager. https://blogs.fau.de/hager/archives/5260 and https://blogs.fau.de/hager/archives/category/fooling-the-masses


Page 15: Blue Waters Training – Managing HPC Centers Benchmarking

Machine Learning Benchmarking

• Training vs. inference
  • Inference on a single example
    • Serial computation
    • No inter-process communication
  • Training with Stochastic Gradient Descent (SGD)
    • Mini-batches for parallelization
    • Iterative process
• Training benchmarks
  • Samples per second
  • Validation loss - time to solution


[Figure: example classification task]

Page 16: Blue Waters Training – Managing HPC Centers Benchmarking

Machine Learning Benchmarking – Quick SGD Overview


• Loss
  • L = Σ_B l(a, m)
  • B: batch of examples
• The sum in the loss function is where we exploit parallelism
  • L = Σ_{B_1} l(a, m) + Σ_{B_2} l(a, m) + ...
  • B_1, B_2, ... are what we call "mini-batches"
• Update rule: θ̂_j = θ_j − γ · ∇_θ L
  • γ: learning rate
  • θ_j, θ̂_j: model parameters and updated model parameters

[Figure: loss plotted over model parameters j and k]
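The mini-batch parallelism described above can be sketched with NumPy; the linear least-squares loss, data sizes, and worker count below are stand-ins chosen only for illustration (a real run would use a deep-learning framework and an allreduce across nodes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for l(a, m): squared error of a linear model with parameters theta.
X = rng.normal(size=(1024, 8))                       # batch B of 1024 examples, 8 features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1024)

def grad(theta, Xb, yb):
    """Gradient of the summed squared-error loss over one mini-batch."""
    return 2.0 * Xb.T @ (Xb @ theta - yb)

theta = np.zeros(8)
gamma = 1e-4                                         # learning rate
n_workers = 4                                        # mini-batches B_1 .. B_4, one per worker

# Exploit the sum structure of L: each "worker" computes the gradient of its
# mini-batch independently; the partial gradients are then summed (in a real
# distributed run this would be an allreduce) before the single SGD update.
partials = [grad(theta, Xb, yb)
            for Xb, yb in zip(np.array_split(X, n_workers),
                              np.array_split(y, n_workers))]
theta = theta - gamma * sum(partials)                # theta_hat = theta - gamma * grad L
print("updated parameters:", np.round(theta, 3))
```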

Page 17: Blue Waters Training – Managing HPC Centers Benchmarking

Machine Learning Benchmarking – Samples Per Sec


• Mini-batch
  • The number of examples that will fit into memory (GPU or CPU) for one iteration of SGD
• Epoch
  • One pass through all of the training examples
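A minimal, framework-agnostic sketch of the samples-per-second metric; train_step below is a placeholder for a real SGD iteration, and the dataset size and batch size are illustrative.

```python
import time

# Hypothetical benchmark loop: measure samples/sec over one epoch of mini-batch
# steps. train_step() stands in for one real forward/backward/update iteration.
def train_step(batch):
    time.sleep(0.001)  # placeholder for real work on one mini-batch

def samples_per_sec(n_examples, batch_size):
    """Run one epoch (one pass over all examples) and report throughput."""
    n_batches = n_examples // batch_size
    start = time.perf_counter()
    for _ in range(n_batches):           # iterating over all mini-batches = one epoch
        train_step(batch=None)
    elapsed = time.perf_counter() - start
    return (n_batches * batch_size) / elapsed

print(f"{samples_per_sec(n_examples=60000, batch_size=256):.0f} samples/sec")
```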

Page 18: Blue Waters Training – Managing HPC Centers Benchmarking

Machine Learning Benchmarking – Samples Per Sec


• Comparison of Blue Waters against a newer Cray XC
• Indeed! Linear scaling with respect to examples/sec
• The new system is faster, but still scales the same

Page 19: Blue Waters Training – Managing HPC Centers Benchmarking

Machine Learning Benchmarking – Time to Solution


• Example of hyperparameter tuning
• Training on the MNIST (handwriting) dataset
• The model is a neural network with 2 fully connected (FC) layers
• Orange: batch size 64; Blue: batch size 256; Purple: batch size 1024

[Figure: training curves, plotted against epochs, for the three batch sizes]
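A minimal sketch of the experiment described on this slide, assuming PyTorch and torchvision (the slides do not say which framework was actually used); the learning rate, epoch count, and hidden-layer width are illustrative choices.

```python
# Batch-size comparison for a 2-fully-connected-layer network on MNIST,
# recording validation loss per epoch (the "time to solution" signal).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                         nn.Linear(128, 10))

def run(batch_size, epochs=5, lr=0.01):
    tfm = transforms.ToTensor()
    train = datasets.MNIST("data", train=True, download=True, transform=tfm)
    val = datasets.MNIST("data", train=False, download=True, transform=tfm)
    train_dl = DataLoader(train, batch_size=batch_size, shuffle=True)
    val_dl = DataLoader(val, batch_size=1024)

    model, loss_fn = make_model(), nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for x, y in train_dl:                      # one pass over the data = one epoch
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():                      # validation loss drives "time to solution"
            vloss = sum(loss_fn(model(x), y).item() for x, y in val_dl) / len(val_dl)
        print(f"batch={batch_size} epoch={epoch + 1} val_loss={vloss:.4f}")

for bs in (64, 256, 1024):                         # the three batch sizes on the slide
    run(bs)
```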