Performance Results from Multi-core Platforms

Nicholas J. Wright Advanced Technology Group, NERSC/LBNL

Programming weather, climate, and earth-system models on heterogeneous multi-core platforms

19-20 September 2013, NCAR

Rich @ SC12 – “Show some Data!”

NERSC users want

•  Robust code changes
   –  I don’t want to add things in only to take them out again two years later

•  Performance portability
   –  Changes made today for one platform should help on all

•  Given hardware trends, what should I do?
   –  Understand what is limiting my application's performance
      •  Roofline Model
   –  Identify and exploit parallelism
      •  OpenMP
      •  Vectors
      •  Tasks

Understanding Performance on Today’s Machines – Per socket comparison

•  Edison Cray XC30 – Intel Ivy Bridge – 2.4 GHz
   –  12 cores, 212 gflops, 50 GB/s* per socket

•  Hopper Cray XE6 – AMD Magny-Cours – 2.1 GHz
   –  12 cores, 95 gflops, 35 GB/s* per socket

•  Mira BG/Q – IBM PowerPC – 1.6 GHz
   –  16 cores, 172 gflops, 29 GB/s* per socket

•  Intel Xeon Phi (KNC) – 1.238 GHz
   –  61 cores, 1.06 TF, 174 GB/s* per socket

•  NVIDIA Kepler K20X
   –  XXX cores, 1.22 TF, 171 GB/s per socket

*Measured with DGEMM and STREAM triad
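The per-socket figures above can be fed into the roofline bound mentioned earlier. The following is a minimal sketch, not taken from the talk: the function names and the example operational intensity of 0.25 flops/byte are assumptions chosen for illustration, and only the CPU and Xeon Phi sockets are included.

```python
# Per-socket DGEMM rate (gflops) and STREAM triad bandwidth (GB/s) from the slide.
SOCKETS = {
    "Edison (XC30)": (212.0, 50.0),
    "Hopper (XE6)": (95.0, 35.0),
    "Mira (BG/Q)": (172.0, 29.0),
    "Xeon Phi (KNC)": (1060.0, 174.0),
}


def machine_balance(peak_gflops, bandwidth_gbs):
    """Operational intensity (flops/byte) above which the compute peak is the limit."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(compute peak, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


if __name__ == "__main__":
    intensity = 0.25  # hypothetical kernel: 1 flop per 4 bytes moved
    for name, (peak, bw) in SOCKETS.items():
        balance = machine_balance(peak, bw)
        bound = attainable_gflops(peak, bw, intensity)
        print(f"{name:16s} balance = {balance:5.2f} flops/byte, "
              f"bound at OI {intensity} = {bound:6.1f} gflops")
```

A kernel whose operational intensity sits below the machine balance is limited by memory bandwidth on that socket, which is the first question the roofline model is meant to answer.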

Performance Per Node

System          STREAM (GB/s)   DGEMM (gflops)
XC30            102             423
XE6             50              189
BG/Q            29              172
Xeon Phi        174             1,060
NVIDIA K20X     175             1,220

Roofline for Test Systems

[Figure: roofline plot for Hopper, Edison, and Mira. Axes: Gflops/s/core versus operational intensity (Flops/Byte). Applications plotted: GTC, MILC, miniDFT, miniGhost, miniFE, SNAP, AMG.]

Flash Code on Edison, Hopper, Mira

•  ~4x more parallelism needed for equivalent performance on BG/Q compared to Cray XC30

•  Energy
   –  XC30 280 W/node
   –  BG/Q 80 W/node
   –  Factor of 3.5x
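The arithmetic behind these two bullets can be sketched as follows. This is an illustration only: it assumes that "~4x more parallelism" means four times as many BG/Q nodes at equal runtime, which the slide does not state explicitly, and the function names are hypothetical.

```python
def per_node_power_ratio(xc30_watts=280.0, bgq_watts=80.0):
    """Ratio of XC30 to BG/Q power per node (the slide's factor of 3.5x)."""
    return xc30_watts / bgq_watts


def power_at_equal_performance(bgq_node_multiplier=4.0,
                               xc30_watts=280.0, bgq_watts=80.0):
    """Power of the BG/Q nodes replacing one XC30 node, relative to that XC30 node.

    ASSUMPTION: bgq_node_multiplier BG/Q nodes match one XC30 node in runtime.
    """
    return bgq_node_multiplier * bgq_watts / xc30_watts


if __name__ == "__main__":
    print(f"per-node power ratio: {per_node_power_ratio():.2f}x")
    print(f"BG/Q power relative to XC30 at equal performance: "
          f"{power_at_equal_performance():.2f}x")
```

Under that assumption the per-node power advantage of BG/Q is roughly offset by the extra nodes needed, so the two systems would draw comparable power for the same time to solution.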

[Figure: two panels plotted against node count (512 to 32768): runtime in seconds (log scale, 10^1 to 10^3) and parallel efficiency (0.00 to 1.20). Series: BG/Q (16xMPI, 4xOpenMP); BG/Q (1xMPI, 64xOpenMP); Hopper (24xMPI); Hopper (4xMPI, 6xOpenMP); Edison (24xMPI); Edison (2xMPI, 12xOpenMP).]

•  Same performance on BG/Q and XC30 is achievable
•  Need to work harder on BG/Q

Performance Tuning of NWChem Texas Integrals

•  Two-electron repulsion integrals and construction of the Fock matrix are key NWChem components (PMBS 13 submitted)

•  Node-level performance normalized to Hopper reference
•  SMT and vectorization are key for MIC & BG/Q
•  Code does not lend itself well to vectorization; a new algorithmic approach is likely required

•  Optimizations include:
   -  Dynamic load balancing
   -  Improved data locality
   -  Loop transformations
   -  Multi-threading
   -  Compiler-directed vectorization

•  Overall performance gain up to 2.5x

Optimization of Geometric Multigrid

See: S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker. "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Supercomputing (SC), November 2012.

GTC on Homogeneous and Heterogeneous Platforms

See: Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker. "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Supercomputing (SC), November 2013.

Vectorization (Hopper vs Edison)

[Figure: bar chart of runtime change (axis from -120% to 60%), with series for Hopper and Edison.]

Vectorization doesn’t help most NERSC benchmark codes as written today.

NWCHEM Vectorization on Intel MIC and BG/Q

•  Top ten subroutines accounted for 73% of total running time
•  Erintsp and ssssm benefit from vectorized function (inverse square root)
•  Obassi, wt2wt2, trac12, amshf benefit from vectorized data access
•  Assem, xwpq, pre4n suffer from indirect data access
•  Destbul cannot be automatically vectorized by the compiler due to serialization
•  Both platforms show similar effect

[Figure: two panels, Intel MIC and BG/Q.]

The BSP execution model wastes resources packing buffers

[Figure: bar chart of relative time (0 to 1.2) for the old (before) and new (after) versions, broken down into serial, openmp, and mpi.]

•  Cost of repacking data is a significant fraction of the execution time
•  Waste of resources as well as detrimental to programmer productivity
•  Example: by using OpenMP tasking we can use spare resources to repack buffers while messages are being sent
•  Shifter routine ~30% faster; GTC overall ~5% faster
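The idea of repacking buffers while messages are in flight can be sketched as below. This is an analogy only, assuming a Python worker thread in place of OpenMP tasks and a sleeping stub in place of a real message send; it is not the GTC code, and `pack`, `send`, and both driver functions are hypothetical names introduced for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def pack(items):
    """Copy outgoing items into a contiguous send buffer (stand-in for repacking)."""
    return bytes(items)


def send(buffer, latency=0.01):
    """Hypothetical blocking send; sleeps to mimic time spent on the network."""
    time.sleep(latency)
    return len(buffer)


def send_all_serial(batches):
    """Pack, then send, one batch at a time with no overlap."""
    return [send(pack(batch)) for batch in batches]


def send_all_overlapped(batches):
    """Pack the next batch on the calling thread while the previous send is in flight."""
    sent = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        in_flight = None
        for batch in batches:
            buffer = pack(batch)  # runs while the previous send is still pending
            if in_flight is not None:
                sent.append(in_flight.result())  # wait for the previous send
            in_flight = pool.submit(send, buffer)
        if in_flight is not None:
            sent.append(in_flight.result())
    return sent


if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 5], [6]]
    print(send_all_serial(batches))
    print(send_all_overlapped(batches))
```

Both drivers send the same buffers in the same order; the overlapped version only changes when the packing work happens, which is the property that lets otherwise idle resources absorb the repacking cost.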

Summary

•  Disruptive technology changes are coming
   –  Understand how they will affect you!

•  Modify your code with a mind to the future
   –  Make sure you understand what the limiting factors are
   –  OpenMP
   –  Vectorization
   –  Tasking

•  Early results seem to indicate that this approach will be beneficial on today’s machines and tomorrow’s!

Acknowledgements

•  US Department of Energy Contract No. DE-AC02-05CH11231

•  Matthew Cordery, Chris Daley, Brian Austin – NERSC ATG Group

•  Lenny Oliker, Sam Williams, Khaled Ibrahim – LBNL FTG Group
