The Operational Impact of GPUs on ORNL's Cray XK7 Titan · April 9, 2014


Transcript of The Operational Impact of GPUs on ORNL's Cray XK7 Titan

  • Office of Science

The Operational Impact of GPUs on ORNL's Cray XK7 Titan

    Jim Rogers

    Director of Operations, National Center for Computational Sciences

    Oak Ridge National Laboratory

     

  • 2

Session Description, Session ID S4670

    The Operational Impact of GPUs on ORNL's Cray XK7 Titan

    With a peak computational capacity of more than 27 PF, Oak Ridge National Lab's Cray XK7, Titan, is currently the largest computing resource available to the US Department of Energy. Titan contains 18,688 individual compute nodes, where each node pairs one commodity x86 processor with a single NVIDIA Kepler GPU. When compared to a typical multicore solution, the ability to offload substantive amounts of work to the GPUs provides benefits with significant operational impacts. Case studies show time-to-solution and energy-to-solution that are frequently more than 5 times more efficient than the non-GPU-enabled case. The need to understand how effectively the Kepler GPUs are being used by these applications is augmented by changes to the Kepler device driver and the Cray Resource Utilization software, which now provide a mechanism for reporting valuable GPU usage metrics for scheduled work and memory use, on a per-job basis.

  • 3

    Presenter  Overview  

Jim Rogers is the Director of Operations for the National Center for Computational Sciences at Oak Ridge National Laboratory. The NCCS provides full facility and operations support for three petaFLOP-scale systems including Titan, a 27 PF Cray XK7. Jim has a BS in Computer Engineering, and has worked in high-performance computing systems acquisition, integration, and operation for more than 25 years.

  • 4

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 5

ORNL's Cray XK7 Titan - A Hybrid System with 1:1 AMD Opteron CPU and NVIDIA Kepler GPU

    SYSTEM SPECIFICATIONS:
    • Peak performance of 27 PF
    • Sustained performance of 17.59 PF
    • 18,688 compute nodes, each with:
      – 16-core AMD Opteron CPU
      – NVIDIA K20X (Kepler) GPU
      – 32 + 6 GB memory
    • 512 service and I/O nodes
    • 200 cabinets
    • 710 TB total system memory
    • Cray Gemini 3D torus interconnect
    • 8.9 MW peak energy measurement
    • 4,352 ft² (404 m²) footprint

    Electrical distribution:
    • (4) transformers
    • (200) 480V/100A circuits
    • (48) 480V/20A circuits

  • 6

Cray XK7 Compute Node

    [Diagram: compute node showing HT3 links between the Opteron processor and the Gemini interconnect (3D torus directions X, Y, Z) and a PCIe Gen2 link to the GPU.]

    XK7 Compute Node Characteristics:
    • AMD Opteron 6274 16-core processor – 141 GF
    • Tesla K20X – 1,311 GF
    • Host memory: 32 GB 1600 MHz DDR3
    • Tesla K20X memory: 6 GB GDDR5
    • Gemini high-speed interconnect

    Slide courtesy of Cray, Inc.

  • 7

Science challenges for the OLCF in the next decade

    • Combustion Science: Increase efficiency by 25%-50% and lower emissions from internal combustion engines using advanced fuels and new, low-temperature combustion concepts.
    • Biomass to Biofuels: Enhance the understanding and production of biofuels for transportation and other bio-products from biomass.
    • Fusion Energy/ITER: Develop predictive understanding of plasma properties, dynamics, and interactions with surrounding materials.
    • Climate Change Science: Understand the dynamic ecological and chemical evolution of the climate system with uncertainty quantification of impacts on regional and decadal scales.
    • Solar Energy: Improve photovoltaic efficiency and lower cost for organic and inorganic materials.
    • Globally Optimized Accelerator Designs: Optimize designs as the next generations of accelerators are planned; detailed models will be needed to provide proof of principle and efficient designs of new light sources.

    ASCR Mission: "…discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to DOE."

  • 8

Innovative and Novel Computational Impact on Theory and Experiment (INCITE)

    Call for Proposals: April 16 – June 27, 2014

    The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of leadership-class systems. Allocations will be for calendar year 2015.

    Contact information: Julia C. White, INCITE Manager, [email protected]

    INCITE is an annual, peer-review allocation program that provides unprecedented computational and data science resources:
    • 5.8 billion core-hours awarded for 2014 on the 27-petaflop Cray XK7 "Titan" and the 10-petaflop IBM BG/Q "Mira"
    • Average award: 78 million core-hours on Titan and 88 million core-hours on Mira in 2014
    • INCITE is open to any science domain
    • INCITE seeks computationally intensive, large-scale research campaigns

  • 9

Diversity of INCITE Science

    • Determining protein structures and designing proteins that block influenza virus infection. – David Baker, University of Washington
    • Simulating a flow of healthy (red) and diseased (blue) blood cells. – George Karniadakis, Brown University
    • Calculating an improved probabilistic seismic hazard forecast for California. – Thomas Jordan, University of Southern California
    • Providing new insights into the dynamics of turbulent combustion processes in internal-combustion engines. – Jacqueline Chen and Joseph Oefelein, Sandia National Laboratories
    • Modeling of ubiquitous weak intermolecular bonds using Quantum Monte Carlo to develop benchmark energies. – Dario Alfé, University College London, UK
    • High-fidelity simulation of complex suspension flow for practical rheometry. – William George, National Institute of Standards and Technology

    Other INCITE research topics: glimpse into dark matter, supernovae ignition, protein structure, creation of biofuels, replicating enzyme functions, global climate, accelerator design, carbon sequestration, turbulent flow, propulsor systems, membrane channels, protein folding, chemical catalyst design, plasma physics, algorithm development, nano-devices, batteries, solar cells, reactor design, and nuclear structure.


  • 10

Allocation Programs for the OLCF's Cray XK7 Titan

  • 11

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 12

Monitoring GPU Usage on Titan - The Early Years

    • Requirement: Detect, on a per-job basis, if/when jobs use accelerator-equipped nodes.
    • Initial Solution
      – Leverage ORNL's Automatic Library Tracking Database (ALTD)
        • At link time, a list of the libraries linked against is stored in a database
        • When the resulting program is executed via aprun, a new ALTD record is written that contains the specific executable to be run, the batch job id, and other info
      – Batch jobs are compared against ALTD to see if they were linked against an accelerator-specific library
        • libacc*, libOpenCL*, libmagma*, libhmpp*, libcuda*, libcupti*, libcula*, libcublas*
        • Jobs whose executables are linked against one of the above are deemed to have used the accelerator (a sketch of this matching rule follows the link statement below)
    • Outliers
      – Jobs run outside of the batch system
        • ALTD knows about them, but we can't tie them to usage because there's no job record
      – ALTD is enabled by default, but if it's disabled we won't capture link/run info

    Making sense of an example link statement:
    % lsms /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtbegin.o libLSMS.aSystemParameters.o libLSMS.aread_input.o libLSMS.aPotentialIO.o libLSMS.abuildLIZandCommLists.o libLSMS.aenergyContourIntegration.o libLSMS.asolveSingleScatterers.o libLSMS.acalculateDensities.o libLSMS.acalculateChemPot.o /lustre/widow0/scratch/larkin/lsms3-trunk/lua/lib/liblua.a … -lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so -lcupti /opt/nvidia/cudatoolkit/5.0.28.101/extras/CUPTI/lib64/libcupti.so -lcudart /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so -lcuda /opt/cray/nvidia/default/lib64/libcuda.so /opt/cray/atp/1.4.4/lib//libAtpSigHCommData.a -lAtpSigHandler /opt/cray/atp/1.4.4/lib//libAtpSigHandler.so -lgfortran /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgfortran.so -lhdf5_hl_cpp_gnu ... /opt/cray/pmi/3.0.1-1.0000.9101.2.26.gem/lib64/libpmi.so -lalpslli /usr/lib/alps/libalpslli.so -lalpsutil /usr/lib/alps/libalpsutil.so /lib64/libpthread.so.0 -lstdc++ /lib64/ld-linux-x86-64.so.2 -lgcc_s /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgcc_s.so /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtend.o /usr/lib/../lib64/crtn.o
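    As an illustration of the matching rule above, here is a minimal Python sketch of how a link line like the one shown could be tested against the accelerator-library prefixes. The function and its token handling are hypothetical; this is not ALTD's actual implementation or schema.

```python
# Hypothetical sketch of the ALTD matching rule: flag an executable as
# "accelerator-linked" if any token on its recorded link line names an
# accelerator-specific library.
import fnmatch

ACCEL_LIB_PATTERNS = [
    "libacc*", "libOpenCL*", "libmagma*", "libhmpp*",
    "libcuda*", "libcupti*", "libcula*", "libcublas*",
]

def linked_against_accelerator(link_line: str) -> bool:
    """Return True if any token on the link line names an accelerator library."""
    for token in link_line.split():
        if token.startswith("-l"):
            name = "lib" + token[2:]          # '-lcublas' -> 'libcublas'
        else:
            name = token.rsplit("/", 1)[-1]   # '/opt/.../libcudart.so' -> 'libcudart.so'
        if any(fnmatch.fnmatchcase(name, pat) for pat in ACCEL_LIB_PATTERNS):
            return True
    return False

# The lsms link statement above would be flagged via libcublas, libcupti, libcudart, and libcuda.
print(linked_against_accelerator("-lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so"))
```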

  • 13

Assessing GPU Usage with ALTD

    [Chart: Titan core-hours delivered daily (millions), May 31, 2013 – February 25, 2014, broken out into Core-Hours (Unknown), Core-Hours (CPU), and Core-Hours (CPU+GPU).]

    • Unknowns are 14% of total delivered hours since May 31, 2013.
    • Great apparent use of the GPU by the workflow, but no way to quantify it.
    • Rocky start using ALTD… lots of edge cases escaped.

  • 14

NVIDIA's Role - Δ to the Kepler Driver, API, and NVML

    The previous NVML is cool. You can spot check:
      – Driver version
      – pstate
      – Memory use
      – Compute mode
      – GPU utilization
      – Temperature
      – Power
      – Clock

    But we needed…
    • GPU utility (not point-in-time utilization) for the life of a process (see the sampling sketch below for the distinction)
    • Persistent state of that GPU and memory data
    • Ability to retrieve that data, by apid, using a predefined API

    And we conceded…
    • If there is work on any of the 14 SMs, we are accumulating GPU utility.
    • NVIDIA products containing these new features:
      – Kepler (GK110) or better
      – Kepler driver 319.82 or later
      – NVML API 5.319.43 or later
      – The CUDA 5.5 release cadence
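    With the older interface, the point-in-time utilization can only be approximated over a window by polling. The sketch below is just that: a rough, sampling-based approximation of the per-process GPU utility that the revised Kepler driver accumulates natively. It assumes the nvidia-ml-py ("pynvml") bindings to NVML; it is not the accounting mechanism the talk describes.

```python
# Approximate "GPU utility over a window" by polling the point-in-time
# utilization exposed by the older NVML interface. This is a sampling
# approximation only; the revised Kepler driver accumulates per-process
# utility natively. Assumes the nvidia-ml-py ("pynvml") bindings are installed.
import time
import pynvml

def sample_gpu_seconds(duration_s: float, interval_s: float = 1.0, device_index: int = 0) -> float:
    """Accumulate approximate GPU-seconds (utilization x wall time) on one device."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        gpu_seconds = 0.0
        elapsed = 0.0
        while elapsed < duration_s:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # point-in-time sample, no temporal quality
            gpu_seconds += (util.gpu / 100.0) * interval_s
            time.sleep(interval_s)
            elapsed += interval_s
        return gpu_seconds
    finally:
        pynvml.nvmlShutdown()

# Example: sample_gpu_seconds(60.0) returns approximate GPU-seconds over a one-minute window.
```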

  • 15

nvidia-smi Output (truncated) from a Single Titan Kepler GPU

    ==============NVSMI LOG==============

    Timestamp                       : Mon Mar 18 16:51:15 2013 Driver Version                  : 304.47.13 Attached GPUs                   : 1 GPU 0000:02:00.0     Product Name                : Tesla K20X     Display Mode                : Disabled     Persistence Mode            : Enabled     Performance State           : P8     Clocks Throttle Reasons         Idle                    : Active         User Defined Clocks     : Not Active         SW Power Cap            : Not Active         HW Slowdown             : Not Active         Unknown                 : Not Active     Memory Usage         Total                   : 5759 MB         Used                    : 37 MB         Free                    : 5722 MB     Compute Mode                : Exclusive_Process         Gpu                     : 0 %         Memory                  : 0 %     Ecc Mode         Current                 : Enabled         Pending                 : Enable

        ECC Errors             Temperature         Gpu                     : 25 C     Power Readings         Power Management        : Supported         Power Draw              : 18.08 W         Power Limit             : 225.00 W         Default Power Limit     : 225.00 W         Min Power Limit         : 55.00 W         Max Power Limit         : 300.00 W     Clocks         Graphics                : 324 MHz         SM                      : 324 MHz         Memory                  : 324 MHz     Applications Clocks         Graphics                : 732 MHz         Memory                  : 2600 MHz     Max Clocks         Graphics                : 784 MHz         SM                      : 784 MHz         Memory                  : 2600 MHz     Compute Processes           : None

Annotations on the output above:
    • Driver version for XK is no less than 304.47.13
    • Kepler – the K20X
    • 6 GB GDDR5
    • Instantaneous GPU utilization: this is a point-in-time sample, and has no temporal quality.
    • Kepler has either a p-state of 0 (busy) or 8 (idle)
    • NVML is a C-based API for monitoring and managing various states of the NVIDIA GPU devices. nvidia-smi is an existing application that uses the NVML API.

  • 16

Caution: Default Mode versus Exclusive Process

    • The default GPU compute mode on Titan is EXCLUSIVE_PROCESS. However, we do not preclude users from using DEFAULT compute mode, and some applications demonstrate slightly better performance in DEFAULT compute mode.
    • In EXCLUSIVE_PROCESS compute mode, the current release of the Kepler device driver acts exactly like you would expect.
    • However, in DEFAULT mode, the aggregation of GPU seconds across multiple contexts can be misinterpreted by third-party software using the new API.
      – Look for updates to the way that GPU seconds are accumulated across multiple contexts in DEFAULT mode as the CUDA 6.5 cadence nears.
    • Kepler Compute Modes:
      – NVML_COMPUTEMODE_DEFAULT: Default compute mode – multiple contexts per device.
      – NVML_COMPUTEMODE_EXCLUSIVE_THREAD: Compute-exclusive-thread mode – only one context per device, usable from one thread at a time.
      – NVML_COMPUTEMODE_PROHIBITED: Compute-prohibited mode – no contexts per device.
      – NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: Compute-exclusive-process mode – only one context per device, usable from multiple threads at a time.
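    A short, hedged illustration of how a site might check which of the four modes a device is in before trusting accumulated GPU-seconds: the sketch below queries the compute mode through the pynvml bindings. The constant names follow NVML's enumeration; the exact bindings are an assumption, not part of the talk.

```python
# Report which Kepler compute mode a device is in, using the NVML compute-mode
# enumeration listed above. Assumes the nvidia-ml-py ("pynvml") bindings.
import pynvml

MODE_NAMES = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "DEFAULT (multiple contexts per device)",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "EXCLUSIVE_THREAD (one context, one thread at a time)",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "PROHIBITED (no contexts per device)",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "EXCLUSIVE_PROCESS (one context, multiple threads)",
}

def report_compute_mode(device_index: int = 0) -> str:
    """Return a human-readable description of the device's current compute mode."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        return MODE_NAMES.get(mode, f"unknown mode {mode}")
    finally:
        pynvml.nvmlShutdown()

print(report_compute_mode())
```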

  • 17

Cray RUR, and the NVIDIA API

    • At the conclusion of every job, Cray uses the revised NVIDIA API to query every compute node associated with a job, extracting the accumulated GPU usage and memory usage statistics on each individual node.
    • By aggregating that information with data from the job scheduler, statistics can then be generated that describe GPU usage on a per-job basis. (A roll-up sketch follows below.)

    [Diagram: each compute node collects GPU utility, per node, for a specific apid; GPU-seconds (utility × runtime) are aggregated across all apids for an entire aprun; run time and other data are referenced from the scheduler; together this provides the specific job information that allows determination of GPU usage, per job.]
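    A minimal sketch of the roll-up just described: sum the GPU-seconds reported per node for a job, then normalize by node count and wall-clock time from the scheduler to obtain the aggregate GPU-usage fraction plotted on the following slides. The data structures and node names are hypothetical; RUR's actual record format is not reproduced here.

```python
# Per-job roll-up: aggregate GPU usage = total GPU-seconds across nodes
# divided by (node count x wall-clock seconds from the scheduler).
from typing import Dict

def aggregate_gpu_usage(gpu_seconds_by_node: Dict[str, float], wall_seconds: float) -> float:
    """Fraction of available GPU time actually used by one job."""
    node_count = len(gpu_seconds_by_node)
    total_gpu_seconds = sum(gpu_seconds_by_node.values())
    return total_gpu_seconds / (node_count * wall_seconds)

# Hypothetical 3-node job that ran for one hour:
usage = aggregate_gpu_usage({"nid00010": 1800.0, "nid00011": 2100.0, "nid00012": 1500.0}, 3600.0)
print(f"Aggregate GPU usage via RUR-style roll-up: {usage:.1%}")  # 50.0%
```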

  • 18

OLCF Acquisition and Operational Costs - Where Cycles are Free through a Competitive Process

    • 2014 Allocation Model: 125M Cray XK7 node hours
      – INCITE: 75M node hours among 40 projects
      – ALCC: 37.5M node hours among 10-20 projects
      – DD: 12.5M node hours
    • Titan acquisition and operational costs (5-year life):
      – Facilities
      – Power
      – Cooling
      – Asset (purchase, taxes, maintenance, lease)
      – Staff
    • Total asset cost: ~$1/node-hour.
    • The cost of the computer time dominates everything else.
    • How can we effectively use the XK7 architecture to minimize time to solution?

  • 19

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 20

GPU Usage by Lattice QCD on OLCF's Cray XK7 Titan - NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 800 Cray XK7 compute nodes.]

    • Lattice QCD calculations aim to understand the physical phenomena encompassed by quantum chromodynamics (QCD), the fundamental theory of the strong forces of subatomic physics.

    Lattice QCD GPU Usage – Average: 52.50%, StdDev: 0.0406
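    The summary figures on these usage slides (average and standard deviation of the aggregate GPU-usage fraction across the Q1CY14 samples) reduce to simple descriptive statistics over the per-job RUR values; a sketch with made-up sample values follows. Whether the reported StdDev is a population or sample statistic is not stated on the slides.

```python
# Descriptive statistics over per-job aggregate GPU-usage fractions, mirroring
# the "Average / StdDev" summaries on these slides. Sample values are made up.
import statistics

samples = [0.51, 0.55, 0.49, 0.53, 0.54, 0.50]   # hypothetical per-job fractions from RUR

mean = statistics.mean(samples)
stdev = statistics.pstdev(samples)                # population standard deviation over the samples
print(f"Average GPU usage: {mean:.2%}, StdDev: {stdev:.4f}")
```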

  • 21

GPU Usage by LAMMPS on OLCF's Cray XK7 Titan - Mixed Mode (OpenMP + MPI), NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 64 Cray XK7 compute nodes.]

    • LAMMPS – classical molecular dynamics software used in simulations for biology, materials science, granular and mesoscale systems, etc.

    [Image: coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT/PCBM mixture into donor (white; P3HT, electron donor) and acceptor (blue; PCBM, electron acceptor) domains.]

    This series: a sample of all Mixed Mode (OpenMP + MPI) LAMMPS runs in Q1CY14. Average GPU Usage: 49.28%

  • 22

GPU Usage by NAMD on OLCF's Cray XK7 Titan - NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 768 Cray XK7 compute nodes.]

    • NAMD is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems.
    • The availability of systems like Titan has rapidly expanded the domain of bio-molecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 10,000,000 atoms.

    NAMD GPU Usage – Average: 26.89%, StdDev: 0.062

  • 23

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Other Domains
      – Edge Case: HPL
    • Takeaways…

  • 24

Application Power Efficiency on the Cray XK7 - The Behavior of Magnetic Systems with WL-LSMS

    [Chart: WL-LSMS v3.0 CPU-only vs. GPU-enabled energy consumption on 18,561 Cray XK7 (Titan) compute nodes; instantaneous kW (3,000-8,000) versus elapsed time (0:00:00-4:10:34).]

    CPU-only power (energy) consumption trace for a WL-LSMS run that simulates 1024 Fe atoms as they reach their Curie temperature. Run size: 18,561 Titan nodes (99% of Titan). Run signature: initialization, followed by 20 Monte Carlo steps; for each step the density of states (DOS) is updated, and the computation is dominated by double-complex matrix-matrix multiplication (ZGEMM).

    • Application: WL-LSMS
    • Runtime (hh:mm:ss): 04:11:44
    • Avg. Inst. Power: 6,160 kW
    • Energy Consumed: 25,700 kW-hr
    • Mech. (1.23 PUE): 5,900 kW-hr
    • Total Energy: 31,600 kW-hr
    • Energy/Cooling Cost: $3,500
    • Single Run Opportunity Cost (Runtime × Asset Cost): $77,800

  • 25

Application Power Efficiency on the Cray XK7 - Comparing CPU-Only and GPU-Enabled WL-LSMS

    [Chart: the same WL-LSMS v3.0 energy-consumption comparison on 18,561 Cray XK7 (Titan) compute nodes, overlaying the (CPU-only) and (GPU-enabled) kW-instantaneous traces versus elapsed time.]

    The identical WL-LSMS run (1024 Fe atoms on 18,561 Titan nodes), comparing the runtime and power consumption of the GPU-enabled version versus the CPU-only version.
    • Runtime is 9X faster for the accelerated code -> 9X less opportunity cost. Same science output.
    • Total energy consumed is 7.3X less.

    • App: GPU-enabled WL-LSMS
    • Runtime (hh:mm:ss): 00:27:43
    • Avg. Inst. Power: 7,070 kW
    • Energy Consumed: 3,500 kW-hr
    • Mech. (1.23 PUE): 800 kW-hr
    • Total Energy: 4,300 kW-hr
    • Energy/Cooling Cost: $475
    • Single Run Opportunity Cost (Runtime × Asset Cost): $8,575
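    The cost and energy figures on these two slides follow from a handful of multiplications; the sketch below roughly reconstructs them from the quoted runtimes and average powers. The 1.23 PUE and $1/node-hour asset cost come from the talk; the electricity rate is an assumption chosen only to land near the quoted dollar figures, and small differences from the slide's numbers are expected because the slide integrates the measured power trace rather than using mean power.

```python
# Reconstruct the WL-LSMS energy/cost arithmetic from the quoted runtimes and
# average instantaneous powers. PUE and asset cost are from the talk; the
# electricity rate is assumed for illustration.
PUE = 1.23
ASSET_COST_PER_NODE_HOUR = 1.00   # dollars per node-hour, from the OLCF cost slide
ASSUMED_RATE_PER_KWH = 0.11       # dollars per kW-hr, assumption
NODES = 18561                     # 99% of Titan, as used in both runs

def run_economics(runtime_hours: float, avg_power_kw: float):
    it_energy_kwh = avg_power_kw * runtime_hours            # compute-system energy
    total_energy_kwh = it_energy_kwh * PUE                  # add mechanical/cooling overhead
    energy_cost = total_energy_kwh * ASSUMED_RATE_PER_KWH
    opportunity_cost = NODES * runtime_hours * ASSET_COST_PER_NODE_HOUR
    return total_energy_kwh, energy_cost, opportunity_cost

cpu_only = run_economics(4 + 11 / 60 + 44 / 3600, 6160.0)   # 04:11:44 at 6,160 kW
gpu_run  = run_economics(27 / 60 + 43 / 3600, 7070.0)       # 00:27:43 at 7,070 kW

print(f"CPU-only:    {cpu_only[0]:,.0f} kW-hr total, energy cost ${cpu_only[1]:,.0f}, opportunity cost ${cpu_only[2]:,.0f}")
print(f"GPU-enabled: {gpu_run[0]:,.0f} kW-hr total, energy cost ${gpu_run[1]:,.0f}, opportunity cost ${gpu_run[2]:,.0f}")
print(f"Energy ratio: {cpu_only[0] / gpu_run[0]:.1f}x, opportunity-cost (runtime) ratio: {cpu_only[2] / gpu_run[2]:.1f}x")
```

    With these inputs the ratios land near the ~9X runtime and ~7X energy improvements quoted above, and the CPU-only opportunity cost comes out close to the $77,800 figure on the previous slide.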

  • 26

Operational Impact

    • From Titan's introduction to production (May 31, 2013) through the end of the year:
      – Titan delivered 2.6B core-hours to science
      – Maintained 90% utilization
      – Was stable: 98.7% scheduled availability and 468-hour MTTF
    • Titan provides significant time-to-solution benefits for applications that can take advantage of the 1:1 architecture.
    • Many applications have been restructured to expose additional parallelism. This positively impacts not only the XK7 but other large parallel compute systems as well.
    • Considerable performance gains on the XK7 continue to outpace its traditional competitors.
    • Faster solution produces additional opportunity.

    Cray XK7 vs. Cray XE6 performance ratio*, by application:
      – LAMMPS (molecular dynamics): 7.4
      – S3D (turbulent combustion): 2.2
      – Denovo (3D neutron transport for nuclear reactors): 3.8
      – WL-LSMS (statistical mechanics of magnetic materials): 3.8

  • 27

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Other Domains
      – Edge Case: HPL
    • Takeaways…

  • 28

April 15, 2012 Top 500 Submission: Cray XK6 (Jaguar)

    Cray XK6 HPL Run Statistics

    System Configuration: Cray XK blade upgrade complete (no Fermi or Kepler accelerators); 18,688 AMD Opteron 6274 processors (Interlagos); Cray Gemini interconnect.

    • 1.941 PF sustained (73.6% of 2.634 PF peak)
    • Run duration: 24.6 hours
    • Power (consumption only): System idle 2,935 kW; Max 5,275 kW; Mean 5,142 kW; Total 126,281 kW-h

  • 29

ORNL's Cray XK7 Titan | HPL Consumption

    [Chart: instantaneous MW and cumulative kW-hours (thousands) versus run-time duration (0:00:53-0:54:53) for the Titan HPL run; RmaxPower = 8,296.53 kW; cumulative energy = 7,545.56 kW-hr.]

    Custom NVIDIA HPL binary - NVML driver reported 99% GPU usage.

    Instantaneous measurements:
    • 8.93 MW
    • 21.42 PF
    • 2,397.14 MF/Watt

    Cray XK7 HPL Run Statistics

    System Configuration (November 2012): Titan upgrade complete. 18,688 AMD Opteron 6274 processors, 1:1 with NVIDIA GK110 (Kepler) GPUs in SXM form factor.

    • 17.59 PF sustained (vs. 1.941 PF on Opteron-only)
    • Run duration: 0:54:53 (vs. more than 24.6 hours)
    • Power (consumption only): Max 8,930 kW; Mean 8,296 kW; Total 7,545 kW-h (vs. 126,281 kW-h)
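    The efficiency and energy-to-solution comparisons on this slide are straightforward ratios of the quoted measurements; the short sketch below reproduces that arithmetic using only the numbers already shown on this slide and the previous one.

```python
# Reproduce the HPL efficiency and energy-to-solution arithmetic from the
# quoted measurements (values taken from these two slides).
instantaneous_pf = 21.42    # PF, instantaneous during the Titan (XK7) HPL run
instantaneous_mw = 8.93     # MW, measured at the same instant
mflops_per_watt = (instantaneous_pf * 1e9) / (instantaneous_mw * 1e6)   # PF -> MFLOPS, MW -> W
print(f"{mflops_per_watt:,.0f} MF/Watt")   # ~2,399, vs. the 2,397.14 reported on the slide

xk6_energy_kwh = 126281.0   # Opteron-only XK6 HPL, 24.6-hour run
xk7_energy_kwh = 7545.0     # GPU-accelerated XK7 HPL, ~55-minute run
print(f"Energy-to-solution improvement: {xk6_energy_kwh / xk7_energy_kwh:.1f}x")   # ~16.7x
```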

  • 30

Takeaways

    • Revisions by NVIDIA to the Kepler device driver now expose an ability for system accounting methods to capture, per node and per job, important usage and high-water statistics concerning the K20X SM and memory subsystems.
    • GPU-enabled applications on the Cray XK7 can frequently realize time-to-solution savings of 5X or more versus traditional CPU-only applications.
    • Invitation: the OLCF welcomes R&D applications focused on maximizing scientific application efficiency
      – Application performance benchmarking, analysis, modeling, and scaling studies. End-to-end workflow, visualization, data analytics.
      – https://www.olcf.ornl.gov/support/getting-started/olcf-director-discretion-project-application/

    #GTC14

  • 31

Questions? [email protected]


The activities described herein were performed using resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.