Performance Results from Multi-core Platforms



Page 1

Performance Results from Multi-core Platforms

Nicholas J. Wright, Advanced Technology Group, NERSC/LBNL

Programming weather, climate, and earth-system models on heterogeneous multi-core platforms

19-20 September 2013, NCAR

Page 2

Rich @ SC12 – “Show some Data!”


Page 3

NERSC users: I want

• Robust code changes
  – I don’t want to add things in only to take them out again two years later

• Performance portability
  – Changes made today for one platform should help on all

• Given hardware trends, what should I do?
  – Understand what is limiting my application’s performance
    • Roofline Model
  – Identify and exploit parallelism
    • OpenMP
    • Vectors
    • Tasks

Page 4

Understanding Performance on Today’s Machines – Per socket comparison

• Edison Cray XC30 – Intel Ivy Bridge – 2.4 GHz – 12 cores, 212 gflops, 50 GB/s* per socket

• Hopper Cray XE6 – AMD Magny-Cours – 2.1 GHz – 12 cores, 95 gflops, 35 GB/s* per socket

• Mira BG/Q – IBM PowerPC – 1.6 GHz – 16 cores, 172 gflops, 29 GB/s* per socket

• Intel Xeon Phi (KNC) – 1.238 GHz – 61 cores, 1.06 TF, 174 GB/s* per socket

• NVIDIA Kepler K20X – XXX cores, 1.22 TF, 171 GB/s per socket

*DGEMM & STREAM triad

Page 5

Performance Per Node

[Figure: two bar charts of per-node performance; values as labeled on the bars]

STREAM (GB/s): XC30 102, XE6 50, BG/Q 29, Xeon Phi 174, NVIDIA K20X 175

DGEMM (gflops): XC30 423, XE6 189, BG/Q 172, Xeon Phi 1,060, NVIDIA K20X 1,220

Page 6

Roofline for Test Systems

[Figure: roofline plot for Hopper, Edison, and Mira. x-axis: Operational Intensity (Flops/Byte); y-axis: Gflops/s/core; both axes logarithmic (base 2). Applications plotted: GTC, MILC, miniDFT, miniGhost, miniFE, SNAP, AMG.]
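The roofline bound behind a plot like this can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the slide: it assumes the DGEMM and STREAM triad figures from the per-socket comparison serve as the compute and bandwidth ceilings, and the function names (`attainable_gflops`, `ridge_point`, `per_core`) are hypothetical.

```python
# Per-socket figures copied from the per-socket comparison slide:
# DGEMM (gflops), STREAM triad (GB/s), and core count.
SYSTEMS = {
    "Edison": {"gflops": 212.0, "gbytes_per_s": 50.0, "cores": 12},
    "Hopper": {"gflops": 95.0, "gbytes_per_s": 35.0, "cores": 12},
    "Mira": {"gflops": 172.0, "gbytes_per_s": 29.0, "cores": 16},
}


def attainable_gflops(system, intensity):
    """Roofline bound per socket: min(compute ceiling, intensity * bandwidth)."""
    s = SYSTEMS[system]
    return min(s["gflops"], intensity * s["gbytes_per_s"])


def ridge_point(system):
    """Operational intensity (flops/byte) at which the two ceilings meet."""
    s = SYSTEMS[system]
    return s["gflops"] / s["gbytes_per_s"]


def per_core(system, intensity):
    """Same bound divided evenly across cores (an assumption of this sketch)."""
    return attainable_gflops(system, intensity) / SYSTEMS[system]["cores"]


if __name__ == "__main__":
    for name in SYSTEMS:
        print(name, "ridge point:", round(ridge_point(name), 2), "flops/byte")
```

A kernel whose operational intensity lies below the ridge point is bandwidth-bound under this model, so additional flops capability does not help it until its data movement is reduced.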

Page 7

Flash Code on Edison, Hopper, Mira

• ~4x more parallelism needed for equivalent performance on BG/Q compared to Cray XC30

• Energy
  – XC30 280 W/node
  – BG/Q 80 W/node
  – Factor of 3.5x

[Figure: two panels plotted against node count (512 to 32768). Top: Runtime (Seconds), log scale from 10^1 to 10^3. Bottom: Parallel Efficiency, 0.00 to 1.20. Series: BG/Q (16xMPI, 4xOpenMP); BG/Q (1xMPI, 64xOpenMP); Hopper (24xMPI); Hopper (4xMPI, 6xOpenMP); Edison (24xMPI); Edison (2xMPI, 12xOpenMP).]

Same performance on BG/Q and XC30 achievable. Need to work harder on BG/Q.
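The energy comparison on this slide is simple arithmetic, sketched below. The per-node power figures come from the slide; everything else is an assumption made for illustration. In particular, reading "~4x more parallelism" as "4x as many BG/Q nodes at equal runtime" is a guess of this sketch, not something the slide states, and the function names are hypothetical.

```python
XC30_WATTS_PER_NODE = 280.0  # from the slide
BGQ_WATTS_PER_NODE = 80.0    # from the slide


def power_ratio():
    """Per-node power of XC30 relative to BG/Q (the slide's factor of 3.5x)."""
    return XC30_WATTS_PER_NODE / BGQ_WATTS_PER_NODE


def energy_joules(nodes, watts_per_node, runtime_s):
    """Energy to solution, assuming constant power draw per node."""
    return nodes * watts_per_node * runtime_s


def bgq_to_xc30_energy_ratio(node_factor=4.0):
    """Energy ratio at equal runtime if BG/Q uses node_factor times the nodes.

    ASSUMPTION: node_factor=4.0 interprets "~4x more parallelism" as 4x nodes.
    Runtime and the XC30 node count cancel out of the ratio.
    """
    return node_factor * BGQ_WATTS_PER_NODE / XC30_WATTS_PER_NODE


if __name__ == "__main__":
    print("per-node power ratio:", power_ratio())
    print("energy ratio under the 4x-node assumption:",
          round(bgq_to_xc30_energy_ratio(), 2))
```

Under that assumption the lower per-node power of BG/Q is largely offset by the larger node count, which is one way to read the two bullets side by side.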

Page 8

Performance Tuning of NWChem Texas Integrals

• Two-electron repulsion integrals and construction of the Fock matrix are key NWChem components (PMBS 13, submitted)

• Node-level performance normalized to Hopper reference

• SMT and vectorization are key for MIC & BG/Q

• Code does not lend itself well to vectorization; a new algorithmic approach is likely required

• Optimizations include:
  - Dynamic load balancing
  - Improved data locality
  - Loop transformations
  - Multi-threading
  - Compiler-directed vectorization

• Overall performance gain up to 2.5x

Page 9

Optimization of Geometric Multigrid

See: S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker. "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Supercomputing (SC), November 2012.

Page 10

GTC on Homogeneous and Heterogeneous Platforms

See: Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker. "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Supercomputing (SC), November 2013.

Page 11

Vectorization (Hopper vs Edison)

[Figure: bar chart of runtime change with vectorization on Hopper and Edison; axis from -120% to 60%.]

Vectorization doesn’t help most NERSC benchmark codes as written today.

Page 12

NWCHEM Vectorization on Intel MIC and BG/Q

• Top ten subroutines accounted for 73% of total running time
• Erintsp and ssssm benefit from vectorized function (inverse square root)
• Obassi, wt2wt2, trac12, amshf benefit from vectorized data access
• Assem, xwpq, pre4n suffer from indirect data access
• Destbul cannot be automatically vectorized by the compiler due to serialization
• Both platforms show similar effect

[Figure panels: Intel MIC, BG/Q]
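The loop shapes named in these bullets can be illustrated with three small Python functions. Python does not auto-vectorize, so this only shows the data-access and dependence patterns a vectorizing compiler examines; it is not NWChem code, the function names are hypothetical, and using a loop-carried dependence as the example of "serialization" is an assumption, since the slide does not say what serializes Destbul.

```python
import math


def inverse_sqrt_all(values):
    """Independent iterations over contiguous data: the vector-friendly shape.

    A compiler can map this onto a vectorized inverse-square-root routine.
    """
    return [1.0 / math.sqrt(v) for v in values]


def gather_sum(table, index):
    """Indirect access: each iteration reads table[index[i]] (a gather).

    The addresses are only known at run time, so loads cannot be issued
    as one contiguous vector load.
    """
    total = 0.0
    for i in index:
        total += table[i]
    return total


def running_recurrence(values, decay):
    """Loop-carried dependence: iteration i needs the result of iteration i-1.

    This forces the iterations to execute in order (one form of serialization).
    """
    out = []
    prev = 0.0
    for v in values:
        prev = decay * prev + v
        out.append(prev)
    return out
```

Restructuring data so that the second shape becomes the first (for example, by sorting or packing the gathered elements ahead of the loop) is the kind of change that "vectorized data access" usually implies.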

Page 13

The BSP execution model wastes resources packing buffers

Shifter routine ~30% faster; GTC overall ~5% faster

[Figure: stacked bars of relative time (0 to 1.2) for the old and new versions (before and after), broken down into serial, openmp, and mpi.]

• Cost of repacking data is a significant fraction of the execution time
• Waste of resources as well as detrimental to programmer productivity
• Example: by using OpenMP tasking we can use spare resources to repack buffers while messages are being sent
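The overlap idea can be sketched as a two-stage pipeline: pack the next buffer while the previous one is still in flight. The sketch below uses a Python worker thread as a stand-in for an OpenMP task and a list append as a stand-in for a message send; it is an analogy under those assumptions, not the GTC implementation, and `pack`, `send`, and `exchange_overlapped` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def pack(field, indices):
    """Copy the scattered elements destined for one neighbor into a buffer."""
    return [field[i] for i in indices]


def send(buffer, outbox):
    """Stand-in for a message send: record the buffer that was 'sent'."""
    outbox.append(list(buffer))


def exchange_overlapped(field, neighbor_indices):
    """Pack buffer k+1 on the calling thread while buffer k is being sent."""
    outbox = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for indices in neighbor_indices:
            buffer = pack(field, indices)      # overlaps with the pending send
            if pending is not None:
                pending.result()               # previous send must finish first
            pending = pool.submit(send, buffer, outbox)
        if pending is not None:
            pending.result()                   # drain the last send
    return outbox
```

For example, `exchange_overlapped([10, 20, 30, 40], [[0, 1], [2, 3]])` returns the two packed buffers in send order. In a bulk-synchronous version, all packing would complete before any send starts, leaving cores idle during communication.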

Page 14

Summary

• Disruptive technology changes are coming
  – Understand how they will affect you!

• Modify your code with a mind to the future
  – Make sure you understand what the limiting factors are
  – OpenMP
  – Vectorization
  – Tasking

• Early results seem to indicate that this approach will be beneficial on today’s machines and tomorrow’s!

Page 15

Acknowledgements

• US Department of Energy Contract No. DE-AC02-05CH11231

• Matthew Cordery, Chris Daley, Brian Austin – NERSC ATG Group

• Lenny Oliker, Sam Williams, Khaled Ibrahim – LBNL FTG Group
