The Operational Impact of GPUs on ORNL's Cray XK7 Titan · April 9, 2014


Transcript of The Operational Impact of GPUs on ORNL's Cray XK7 Titan

  • Office of Science

The Operational Impact of GPUs on ORNL's Cray XK7 Titan

    Jim Rogers

    Director of Operations, National Center for Computational Sciences

    Oak Ridge National Laboratory

     

  • 2

Session Description, Session ID S4670

    The Operational Impact of GPUs on ORNL's Cray XK7 Titan

    With a peak computational capacity of more than 27 PF, Oak Ridge National Lab's Cray XK7, Titan, is currently the largest computing resource available to the US Department of Energy. Titan contains 18,688 individual compute nodes, where each node pairs one commodity x86 processor with a single NVIDIA Kepler GPU. When compared to a typical multicore solution, the ability to offload substantive amounts of work to the GPUs provides benefits with significant operational impacts. Case studies show time-to-solution and energy-to-solution that are frequently more than 5 times more efficient than the non-GPU-enabled case. The need to understand how effectively the Kepler GPUs are being used by these applications is augmented by changes to the Kepler device driver and the Cray Resource Utilization software, which now provide a mechanism for reporting valuable GPU usage metrics for scheduled work and memory use, on a per-job basis.

  • 3

    Presenter  Overview  

Jim Rogers is the Director of Operations for the National Center for Computational Sciences at Oak Ridge National Laboratory. The NCCS provides full facility and operations support for three petaFLOP-scale systems including Titan, a 27 PF Cray XK7. Jim has a BS in Computer Engineering, and has worked in high-performance computing systems acquisition, integration, and operation for more than 25 years.

  • 4

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 5

ORNL's Cray XK7 Titan - A Hybrid System with 1:1 AMD Opteron CPU and NVIDIA Kepler GPU

    SYSTEM SPECIFICATIONS:
    • Peak performance of 27 PF
    • Sustained performance of 17.59 PF
    • 18,688 compute nodes, each with:
      – 16-core AMD Opteron CPU
      – NVIDIA K20X (Kepler) GPU
      – 32 + 6 GB memory
    • 512 service and I/O nodes
    • 200 cabinets
    • 710 TB total system memory
    • Cray Gemini 3D torus interconnect
    • 8.9 MW peak energy measurement
    • 4,352 ft² (404 m²) footprint

    Electrical distribution:
    • (4) transformers
    • (200) 480V/100A circuits
    • (48) 480V/20A circuits

  • 6

Cray XK7 Compute Node

    [Diagram: compute node showing HT3 links between the Opteron processor and the Gemini interconnect (3D torus directions X, Y, Z) and a PCIe Gen2 link to the GPU.]

    XK7 Compute Node Characteristics:
    • AMD Opteron 6274 16-core processor – 141 GF
    • Tesla K20X – 1,311 GF
    • Host memory: 32 GB 1600 MHz DDR3
    • Tesla K20X memory: 6 GB GDDR5
    • Gemini high-speed interconnect

    Slide courtesy of Cray, Inc.

  • 7

Science challenges for the OLCF in the next decade

    • Combustion Science: Increase efficiency by 25%-50% and lower emissions from internal combustion engines using advanced fuels and new, low-temperature combustion concepts.
    • Biomass to Biofuels: Enhance the understanding and production of biofuels for transportation and other bio-products from biomass.
    • Fusion Energy/ITER: Develop predictive understanding of plasma properties, dynamics, and interactions with surrounding materials.
    • Climate Change Science: Understand the dynamic ecological and chemical evolution of the climate system with uncertainty quantification of impacts on regional and decadal scales.
    • Solar Energy: Improve photovoltaic efficiency and lower cost for organic and inorganic materials.
    • Globally Optimized Accelerator Designs: Optimize designs as the next generations of accelerators are planned; detailed models will be needed to provide proof of principle and efficient designs of new light sources.

    ASCR Mission: "…discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to DOE."

  • 8

Innovative and Novel Computational Impact on Theory and Experiment (INCITE)

    Call for Proposals: April 16 – June 27, 2014

    The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of leadership-class systems. Allocations will be for calendar year 2015.

    Contact information: Julia C. White, INCITE Manager, [email protected]

    INCITE is an annual, peer-review allocation program that provides unprecedented computational and data science resources:
    • 5.8 billion core-hours awarded for 2014 on the 27-petaflop Cray XK7 "Titan" and the 10-petaflop IBM BG/Q "Mira"
    • Average award: 78 million core-hours on Titan and 88 million core-hours on Mira in 2014
    • INCITE is open to any science domain
    • INCITE seeks computationally intensive, large-scale research campaigns

  • 9

Diversity of INCITE Science

    • Determining protein structures and designing proteins that block influenza virus infection. – David Baker, University of Washington
    • Simulating a flow of healthy (red) and diseased (blue) blood cells. – George Karniadakis, Brown University
    • Calculating an improved probabilistic seismic hazard forecast for California. – Thomas Jordan, University of Southern California
    • Providing new insights into the dynamics of turbulent combustion processes in internal-combustion engines. – Jacqueline Chen and Joseph Oefelein, Sandia National Laboratories
    • Modeling of ubiquitous weak intermolecular bonds using Quantum Monte Carlo to develop benchmark energies. – Dario Alfé, University College London, UK
    • High-fidelity simulation of complex suspension flow for practical rheometry. – William George, National Institute of Standards and Technology

    Other INCITE research topics: glimpse into dark matter, supernovae ignition, protein structure, creation of biofuels, replicating enzyme functions, global climate, accelerator design, carbon sequestration, turbulent flow, propulsor systems, membrane channels, protein folding, chemical catalyst design, plasma physics, algorithm development, nano-devices, batteries, solar cells, reactor design, and nuclear structure.


  • 10

Allocation Programs for the OLCF's Cray XK7 Titan

  • 11

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 12

Monitoring GPU Usage on Titan - The Early Years

    • Requirement: Detect, on a per-job basis, if/when jobs use accelerator-equipped nodes.
    • Initial Solution
      – Leverage ORNL's Automatic Library Tracking Database (ALTD)
        • At link time, a list of the libraries linked against is stored in a database
        • When the resulting program is executed via aprun, a new ALTD record is written that contains the specific executable to be run, the batch job id, and other info
      – Batch jobs are compared against ALTD to see if they were linked against an accelerator-specific library
        • libacc*, libOpenCL*, libmagma*, libhmpp*, libcuda*, libcupti*, libcula*, libcublas*
        • Jobs whose executables are linked against one of the above are deemed to have used the accelerator (a sketch of this matching rule follows the link statement below)
    • Outliers
      – Jobs run outside of the batch system
        • ALTD knows about them, but we can't tie them to usage because there's no job record
      – ALTD is enabled by default, but if it's disabled we won't capture link/run info

    Making sense of an example link statement:
    % lsms /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtbegin.o libLSMS.aSystemParameters.o libLSMS.aread_input.o libLSMS.aPotentialIO.o libLSMS.abuildLIZandCommLists.o libLSMS.aenergyContourIntegration.o libLSMS.asolveSingleScatterers.o libLSMS.acalculateDensities.o libLSMS.acalculateChemPot.o /lustre/widow0/scratch/larkin/lsms3-trunk/lua/lib/liblua.a … -lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so -lcupti /opt/nvidia/cudatoolkit/5.0.28.101/extras/CUPTI/lib64/libcupti.so -lcudart /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so -lcuda /opt/cray/nvidia/default/lib64/libcuda.so /opt/cray/atp/1.4.4/lib//libAtpSigHCommData.a -lAtpSigHandler /opt/cray/atp/1.4.4/lib//libAtpSigHandler.so -lgfortran /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgfortran.so -lhdf5_hl_cpp_gnu ... /opt/cray/pmi/3.0.1-1.0000.9101.2.26.gem/lib64/libpmi.so -lalpslli /usr/lib/alps/libalpslli.so -lalpsutil /usr/lib/alps/libalpsutil.so /lib64/libpthread.so.0 -lstdc++ /lib64/ld-linux-x86-64.so.2 -lgcc_s /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgcc_s.so /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtend.o /usr/lib/../lib64/crtn.o
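    As an illustration of the matching rule above, here is a minimal Python sketch of how a link line like the one shown could be tested against the accelerator-library prefixes. The function and its token handling are hypothetical; this is not ALTD's actual implementation or schema.

```python
# Hypothetical sketch of the ALTD matching rule: flag an executable as
# "accelerator-linked" if any token on its recorded link line names an
# accelerator-specific library.
import fnmatch

ACCEL_LIB_PATTERNS = [
    "libacc*", "libOpenCL*", "libmagma*", "libhmpp*",
    "libcuda*", "libcupti*", "libcula*", "libcublas*",
]

def linked_against_accelerator(link_line: str) -> bool:
    """Return True if any token on the link line names an accelerator library."""
    for token in link_line.split():
        if token.startswith("-l"):
            name = "lib" + token[2:]          # '-lcublas' -> 'libcublas'
        else:
            name = token.rsplit("/", 1)[-1]   # '/opt/.../libcudart.so' -> 'libcudart.so'
        if any(fnmatch.fnmatchcase(name, pat) for pat in ACCEL_LIB_PATTERNS):
            return True
    return False

# The lsms link statement above would be flagged via libcublas, libcupti, libcudart, and libcuda.
print(linked_against_accelerator("-lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so"))
```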

  • 13

Assessing GPU Usage with ALTD

    [Chart: Titan core-hours delivered daily (millions), May 31, 2013 – February 25, 2014, broken out into Core-Hours (Unknown), Core-Hours (CPU), and Core-Hours (CPU+GPU).]

    • Unknowns are 14% of total delivered hours since May 31, 2013.
    • Great apparent use of the GPU by the workflow, but no way to quantify it.
    • Rocky start using ALTD… lots of edge cases escaped.

  • 14

NVIDIA's Role - Δ to the Kepler Driver, API, and NVML

    The previous NVML is cool. You can spot check:
      – Driver version
      – pstate
      – Memory use
      – Compute mode
      – GPU utilization
      – Temperature
      – Power
      – Clock

    But we needed…
    • GPU utility (not point-in-time utilization) for the life of a process (see the sampling sketch below for the distinction)
    • Persistent state of that GPU and memory data
    • Ability to retrieve that data, by apid, using a predefined API

    And we conceded…
    • If there is work on any of the 14 SMs, we are accumulating GPU utility.
    • NVIDIA products containing these new features:
      – Kepler (GK110) or better
      – Kepler driver 319.82 or later
      – NVML API 5.319.43 or later
      – The CUDA 5.5 release cadence
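    With the older interface, the point-in-time utilization can only be approximated over a window by polling. The sketch below is just that: a rough, sampling-based approximation of the per-process GPU utility that the revised Kepler driver accumulates natively. It assumes the nvidia-ml-py ("pynvml") bindings to NVML; it is not the accounting mechanism the talk describes.

```python
# Approximate "GPU utility over a window" by polling the point-in-time
# utilization exposed by the older NVML interface. This is a sampling
# approximation only; the revised Kepler driver accumulates per-process
# utility natively. Assumes the nvidia-ml-py ("pynvml") bindings are installed.
import time
import pynvml

def sample_gpu_seconds(duration_s: float, interval_s: float = 1.0, device_index: int = 0) -> float:
    """Accumulate approximate GPU-seconds (utilization x wall time) on one device."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        gpu_seconds = 0.0
        elapsed = 0.0
        while elapsed < duration_s:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # point-in-time sample, no temporal quality
            gpu_seconds += (util.gpu / 100.0) * interval_s
            time.sleep(interval_s)
            elapsed += interval_s
        return gpu_seconds
    finally:
        pynvml.nvmlShutdown()

# Example: sample_gpu_seconds(60.0) returns approximate GPU-seconds over a one-minute window.
```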

  • 15

nvidia-smi Output (truncated) from a Single Titan Kepler GPU

    ==============NVSMI LOG==============

    Timestamp                       : Mon Mar 18 16:51:15 2013 Driver Version                  : 304.47.13 Attached GPUs                   : 1 GPU 0000:02:00.0     Product Name                : Tesla K20X     Display Mode                : Disabled     Persistence Mode            : Enabled     Performance State           : P8     Clocks Throttle Reasons         Idle                    : Active         User Defined Clocks     : Not Active         SW Power Cap            : Not Active         HW Slowdown             : Not Active         Unknown                 : Not Active     Memory Usage         Total                   : 5759 MB         Used                    : 37 MB         Free                    : 5722 MB     Compute Mode                : Exclusive_Process         Gpu                     : 0 %         Memory                  : 0 %     Ecc Mode         Current                 : Enabled         Pending                 : Enable

        ECC Errors             Temperature         Gpu                     : 25 C     Power Readings         Power Management        : Supported         Power Draw              : 18.08 W         Power Limit             : 225.00 W         Default Power Limit     : 225.00 W         Min Power Limit         : 55.00 W         Max Power Limit         : 300.00 W     Clocks         Graphics                : 324 MHz         SM                      : 324 MHz         Memory                  : 324 MHz     Applications Clocks         Graphics                : 732 MHz         Memory                  : 2600 MHz     Max Clocks         Graphics                : 784 MHz         SM                      : 784 MHz         Memory                  : 2600 MHz     Compute Processes           : None

Annotations on the output above:
    • Driver version for XK is no less than 304.47.13
    • Kepler – the K20X
    • 6 GB GDDR5
    • Instantaneous GPU utilization: this is a point-in-time sample, and has no temporal quality.
    • Kepler has either a p-state of 0 (busy) or 8 (idle)
    • NVML is a C-based API for monitoring and managing various states of the NVIDIA GPU devices. nvidia-smi is an existing application that uses the NVML API.

  • 16

Caution: Default Mode versus Exclusive Process

    • The default GPU compute mode on Titan is EXCLUSIVE_PROCESS. However, we do not preclude users from using DEFAULT compute mode, and some applications demonstrate slightly better performance in DEFAULT compute mode.
    • In EXCLUSIVE_PROCESS compute mode, the current release of the Kepler device driver acts exactly like you would expect.
    • However, in DEFAULT mode, the aggregation of GPU seconds across multiple contexts can be misinterpreted by third-party software using the new API.
      – Look for updates to the way that GPU seconds are accumulated across multiple contexts in DEFAULT mode as the CUDA 6.5 cadence nears.
    • Kepler Compute Modes:
      – NVML_COMPUTEMODE_DEFAULT: Default compute mode – multiple contexts per device.
      – NVML_COMPUTEMODE_EXCLUSIVE_THREAD: Compute-exclusive-thread mode – only one context per device, usable from one thread at a time.
      – NVML_COMPUTEMODE_PROHIBITED: Compute-prohibited mode – no contexts per device.
      – NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: Compute-exclusive-process mode – only one context per device, usable from multiple threads at a time.
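    A short, hedged illustration of how a site might check which of the four modes a device is in before trusting accumulated GPU-seconds: the sketch below queries the compute mode through the pynvml bindings. The constant names follow NVML's enumeration; the exact bindings are an assumption, not part of the talk.

```python
# Report which Kepler compute mode a device is in, using the NVML compute-mode
# enumeration listed above. Assumes the nvidia-ml-py ("pynvml") bindings.
import pynvml

MODE_NAMES = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "DEFAULT (multiple contexts per device)",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "EXCLUSIVE_THREAD (one context, one thread at a time)",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "PROHIBITED (no contexts per device)",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "EXCLUSIVE_PROCESS (one context, multiple threads)",
}

def report_compute_mode(device_index: int = 0) -> str:
    """Return a human-readable description of the device's current compute mode."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        return MODE_NAMES.get(mode, f"unknown mode {mode}")
    finally:
        pynvml.nvmlShutdown()

print(report_compute_mode())
```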

  • 17

Cray RUR, and the NVIDIA API

    • At the conclusion of every job, Cray uses the revised NVIDIA API to query every compute node associated with a job, extracting the accumulated GPU usage and memory usage statistics on each individual node.
    • By aggregating that information with data from the job scheduler, statistics can then be generated that describe GPU usage on a per-job basis. (A roll-up sketch follows below.)

    [Diagram: each compute node collects GPU utility, per node, for a specific apid; GPU-seconds (utility × runtime) are aggregated across all apids for an entire aprun; run time and other data are referenced from the scheduler; together this provides the specific job information that allows determination of GPU usage, per job.]
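    A minimal sketch of the roll-up just described: sum the GPU-seconds reported per node for a job, then normalize by node count and wall-clock time from the scheduler to obtain the aggregate GPU-usage fraction plotted on the following slides. The data structures and node names are hypothetical; RUR's actual record format is not reproduced here.

```python
# Per-job roll-up: aggregate GPU usage = total GPU-seconds across nodes
# divided by (node count x wall-clock seconds from the scheduler).
from typing import Dict

def aggregate_gpu_usage(gpu_seconds_by_node: Dict[str, float], wall_seconds: float) -> float:
    """Fraction of available GPU time actually used by one job."""
    node_count = len(gpu_seconds_by_node)
    total_gpu_seconds = sum(gpu_seconds_by_node.values())
    return total_gpu_seconds / (node_count * wall_seconds)

# Hypothetical 3-node job that ran for one hour:
usage = aggregate_gpu_usage({"nid00010": 1800.0, "nid00011": 2100.0, "nid00012": 1500.0}, 3600.0)
print(f"Aggregate GPU usage via RUR-style roll-up: {usage:.1%}")  # 50.0%
```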

  • 18

OLCF Acquisition and Operational Costs - Where Cycles are Free through a Competitive Process

    • 2014 Allocation Model: 125M Cray XK7 node hours
      – INCITE: 75M node hours among 40 projects
      – ALCC: 37.5M node hours among 10-20 projects
      – DD: 12.5M node hours
    • Titan acquisition and operational costs (5-year life):
      – Facilities
      – Power
      – Cooling
      – Asset (purchase, taxes, maintenance, lease)
      – Staff
    • Total asset cost: ~$1/node-hour.
    • The cost of the computer time dominates everything else.
    • How can we effectively use the XK7 architecture to minimize time to solution?

  • 19

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Domains
      – Edge Case: HPL
    • Takeaways…

  • 20

GPU Usage by Lattice QCD on OLCF's Cray XK7 Titan - NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 800 Cray XK7 compute nodes.]

    • Lattice QCD calculations aim to understand the physical phenomena encompassed by quantum chromodynamics (QCD), the fundamental theory of the strong forces of subatomic physics.

    Lattice QCD GPU Usage – Average: 52.50%, StdDev: 0.0406
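    The summary figures on these usage slides (average and standard deviation of the aggregate GPU-usage fraction across the Q1CY14 samples) reduce to simple descriptive statistics over the per-job RUR values; a sketch with made-up sample values follows. Whether the reported StdDev is a population or sample statistic is not stated on the slides.

```python
# Descriptive statistics over per-job aggregate GPU-usage fractions, mirroring
# the "Average / StdDev" summaries on these slides. Sample values are made up.
import statistics

samples = [0.51, 0.55, 0.49, 0.53, 0.54, 0.50]   # hypothetical per-job fractions from RUR

mean = statistics.mean(samples)
stdev = statistics.pstdev(samples)                # population standard deviation over the samples
print(f"Average GPU usage: {mean:.2%}, StdDev: {stdev:.4f}")
```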

  • 21

GPU Usage by LAMMPS on OLCF's Cray XK7 Titan - Mixed Mode (OpenMP + MPI), NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 64 Cray XK7 compute nodes.]

    • LAMMPS – classical molecular dynamics software used in simulations for biology, materials science, granular and mesoscale systems, etc.

    [Image: coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT/PCBM mixture into donor (white; P3HT, electron donor) and acceptor (blue; PCBM, electron acceptor) domains.]

    This series: a sample of all Mixed Mode (OpenMP + MPI) LAMMPS runs in Q1CY14. Average GPU Usage: 49.28%

  • 22

GPU Usage by NAMD on OLCF's Cray XK7 Titan - NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

    [Chart: aggregate GPU usage via RUR (0-100%) versus Q1CY14 sample number; each sample is a run on 768 Cray XK7 compute nodes.]

    • NAMD is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems.
    • The availability of systems like Titan has rapidly expanded the domain of bio-molecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 10,000,000 atoms.

    NAMD GPU Usage – Average: 26.89%, StdDev: 0.062

  • 23

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Other Domains
      – Edge Case: HPL
    • Takeaways…

  • 24

Application Power Efficiency on the Cray XK7 - The Behavior of Magnetic Systems with WL-LSMS

    [Chart: WL-LSMS v3.0 CPU-only vs. GPU-enabled energy consumption on 18,561 Cray XK7 (Titan) compute nodes; instantaneous kW (3,000-8,000) versus elapsed time (0:00:00-4:10:34).]

    CPU-only power (energy) consumption trace for a WL-LSMS run that simulates 1024 Fe atoms as they reach their Curie temperature. Run size: 18,561 Titan nodes (99% of Titan). Run signature: initialization, followed by 20 Monte Carlo steps; for each step the density of states (DOS) is updated, and the computation is dominated by double-complex matrix-matrix multiplication (ZGEMM).

    • Application: WL-LSMS
    • Runtime (hh:mm:ss): 04:11:44
    • Avg. Inst. Power: 6,160 kW
    • Energy Consumed: 25,700 kW-hr
    • Mech. (1.23 PUE): 5,900 kW-hr
    • Total Energy: 31,600 kW-hr
    • Energy/Cooling Cost: $3,500
    • Single Run Opportunity Cost (Runtime × Asset Cost): $77,800

  • 25

Application Power Efficiency on the Cray XK7 - Comparing CPU-Only and GPU-Enabled WL-LSMS

    [Chart: the same WL-LSMS v3.0 energy-consumption comparison on 18,561 Cray XK7 (Titan) compute nodes, overlaying the (CPU-only) and (GPU-enabled) kW-instantaneous traces versus elapsed time.]

    The identical WL-LSMS run (1024 Fe atoms on 18,561 Titan nodes), comparing the runtime and power consumption of the GPU-enabled version versus the CPU-only version.
    • Runtime is 9X faster for the accelerated code -> 9X less opportunity cost. Same science output.
    • Total energy consumed is 7.3X less.

    • App: GPU-enabled WL-LSMS
    • Runtime (hh:mm:ss): 00:27:43
    • Avg. Inst. Power: 7,070 kW
    • Energy Consumed: 3,500 kW-hr
    • Mech. (1.23 PUE): 800 kW-hr
    • Total Energy: 4,300 kW-hr
    • Energy/Cooling Cost: $475
    • Single Run Opportunity Cost (Runtime × Asset Cost): $8,575
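    The cost and energy figures on these two slides follow from a handful of multiplications; the sketch below roughly reconstructs them from the quoted runtimes and average powers. The 1.23 PUE and $1/node-hour asset cost come from the talk; the electricity rate is an assumption chosen only to land near the quoted dollar figures, and small differences from the slide's numbers are expected because the slide integrates the measured power trace rather than using mean power.

```python
# Reconstruct the WL-LSMS energy/cost arithmetic from the quoted runtimes and
# average instantaneous powers. PUE and asset cost are from the talk; the
# electricity rate is assumed for illustration.
PUE = 1.23
ASSET_COST_PER_NODE_HOUR = 1.00   # dollars per node-hour, from the OLCF cost slide
ASSUMED_RATE_PER_KWH = 0.11       # dollars per kW-hr, assumption
NODES = 18561                     # 99% of Titan, as used in both runs

def run_economics(runtime_hours: float, avg_power_kw: float):
    it_energy_kwh = avg_power_kw * runtime_hours            # compute-system energy
    total_energy_kwh = it_energy_kwh * PUE                  # add mechanical/cooling overhead
    energy_cost = total_energy_kwh * ASSUMED_RATE_PER_KWH
    opportunity_cost = NODES * runtime_hours * ASSET_COST_PER_NODE_HOUR
    return total_energy_kwh, energy_cost, opportunity_cost

cpu_only = run_economics(4 + 11 / 60 + 44 / 3600, 6160.0)   # 04:11:44 at 6,160 kW
gpu_run  = run_economics(27 / 60 + 43 / 3600, 7070.0)       # 00:27:43 at 7,070 kW

print(f"CPU-only:    {cpu_only[0]:,.0f} kW-hr total, energy cost ${cpu_only[1]:,.0f}, opportunity cost ${cpu_only[2]:,.0f}")
print(f"GPU-enabled: {gpu_run[0]:,.0f} kW-hr total, energy cost ${gpu_run[1]:,.0f}, opportunity cost ${gpu_run[2]:,.0f}")
print(f"Energy ratio: {cpu_only[0] / gpu_run[0]:.1f}x, opportunity-cost (runtime) ratio: {cpu_only[2] / gpu_run[2]:.1f}x")
```

    With these inputs the ratios land near the ~9X runtime and ~7X energy improvements quoted above, and the CPU-only opportunity cost comes out close to the $77,800 figure on the previous slide.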

  • 26

Operational Impact

    • From Titan's introduction to production (May 31, 2013) through the end of the year:
      – Titan delivered 2.6B core-hours to science
      – Maintained 90% utilization
      – Was stable: 98.7% scheduled availability and 468-hour MTTF
    • Titan provides significant time-to-solution benefits for applications that can take advantage of the 1:1 architecture.
    • Many applications have been restructured to expose additional parallelism. This positively impacts not only the XK7 but other large parallel compute systems as well.
    • Considerable performance gains on the XK7 continue to outpace its traditional competitors.
    • Faster solution produces additional opportunity.

    Cray XK7 vs. Cray XE6 performance ratio*, by application:
      – LAMMPS (molecular dynamics): 7.4
      – S3D (turbulent combustion): 2.2
      – Denovo (3D neutron transport for nuclear reactors): 3.8
      – WL-LSMS (statistical mechanics of magnetic materials): 3.8

  • 27

Content
    • The OLCF's Cray XK7 Titan
      – Hardware Description
      – Mission Need
      – INCITE Allocation Program
        • Competitive Allocations
        • Computationally Ready Science
    • The Operational Need to Understand Usage
      – ALTD (the early years)
      – NVIDIA's Role
        • Δ to the Kepler Driver, API, and NVML
      – Cray's Resource Utilization (RUR)
      – Assessing the Operational Impact to Delivered Science
        • Time- and Energy-to-Solution
    • Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
      – Lattice QCD
      – LAMMPS
      – NAMD
    • Operational Impact of GPUs on Titan
      – Case Study: WL-LSMS
      – Examples among Other Domains
      – Edge Case: HPL
    • Takeaways…

  • 28

April 15, 2012 Top 500 Submission: Cray XK6 (Jaguar)

    Cray XK6 HPL Run Statistics

    System Configuration: Cray XK blade upgrade complete (no Fermi or Kepler accelerators); 18,688 AMD Opteron 6274 processors (Interlagos); Cray Gemini interconnect.

    • 1.941 PF sustained (73.6% of 2.634 PF peak)
    • Run duration: 24.6 hours
    • Power (consumption only): System idle 2,935 kW; Max 5,275 kW; Mean 5,142 kW; Total 126,281 kW-h

  • 29

ORNL's Cray XK7 Titan | HPL Consumption

    [Chart: instantaneous MW and cumulative kW-hours (thousands) versus run-time duration (0:00:53-0:54:53) for the Titan HPL run; RmaxPower = 8,296.53 kW; cumulative energy = 7,545.56 kW-hr.]

    Custom NVIDIA HPL binary - NVML driver reported 99% GPU usage.

    Instantaneous measurements:
    • 8.93 MW
    • 21.42 PF
    • 2,397.14 MF/Watt

    Cray XK7 HPL Run Statistics

    System Configuration (November 2012): Titan upgrade complete. 18,688 AMD Opteron 6274 processors, 1:1 with NVIDIA GK110 (Kepler) GPUs in SXM form factor.

    • 17.59 PF sustained (vs. 1.941 PF on Opteron-only)
    • Run duration: 0:54:53 (vs. more than 24.6 hours)
    • Power (consumption only): Max 8,930 kW; Mean 8,296 kW; Total 7,545 kW-h (vs. 126,281 kW-h)
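    The efficiency and energy-to-solution comparisons on this slide are straightforward ratios of the quoted measurements; the short sketch below reproduces that arithmetic using only the numbers already shown on this slide and the previous one.

```python
# Reproduce the HPL efficiency and energy-to-solution arithmetic from the
# quoted measurements (values taken from these two slides).
instantaneous_pf = 21.42    # PF, instantaneous during the Titan (XK7) HPL run
instantaneous_mw = 8.93     # MW, measured at the same instant
mflops_per_watt = (instantaneous_pf * 1e9) / (instantaneous_mw * 1e6)   # PF -> MFLOPS, MW -> W
print(f"{mflops_per_watt:,.0f} MF/Watt")   # ~2,399, vs. the 2,397.14 reported on the slide

xk6_energy_kwh = 126281.0   # Opteron-only XK6 HPL, 24.6-hour run
xk7_energy_kwh = 7545.0     # GPU-accelerated XK7 HPL, ~55-minute run
print(f"Energy-to-solution improvement: {xk6_energy_kwh / xk7_energy_kwh:.1f}x")   # ~16.7x
```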

  • 30

Takeaways

    • Revisions by NVIDIA to the Kepler device driver now expose an ability for system accounting methods to capture, per node and per job, important usage and high-water statistics concerning the K20X SM and memory subsystems.
    • GPU-enabled applications on the Cray XK7 can frequently realize time-to-solution savings of 5X or more versus traditional CPU-only applications.
    • Invitation: the OLCF welcomes R&D applications focused on maximizing scientific application efficiency
      – Application performance benchmarking, analysis, modeling, and scaling studies. End-to-end workflow, visualization, data analytics.
      – https://www.olcf.ornl.gov/support/getting-started/olcf-director-discretion-project-application/

    #GTC14

  • 31

Questions? [email protected]


The activities described herein were performed using resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.