
Improving the Performance of the MILC Code on Intel Knights Landing, An Overview

IXPUG 2017 Fall Meeting
September 26th–28th, 2017, Texas Advanced Computing Center (TACC)
Austin, TX

MILC on KNL Working Group

1. Indiana University: S. Gottlieb, R. Li
2. University of Utah: C. DeTar
3. University of Arizona: D. Toussaint
4. Intel: K. Raman, A. Jha, D. Kalamkar, M. Tolubaeva, T. Phung, R. Malladi
5. Tata Consultancy Services (TCS): G. Bhaskar, P. Gaurav, J. Bhat
6. Jefferson Lab: B. Joó
7. LBL/NERSC: D. Doerfler

• This effort is supported by the Intel Parallel Computing Center at Indiana University
• And is a NERSC/NESAP Tier 1 code in the Cori Phase 2 (KNL) project

Impact of MILC QCD Simulations

• Measuring the fundamental parameters of the Standard Model of particle physics
• And looking for deviations which suggest physics NOT accounted for, i.e. New Physics!
• Method is to use Monte Carlo evaluation of the quantum mechanical path integral
• Plot shows results achieved over the last several years, using resources from multiple facilities
• As the physical grid (lattice) spacing decreases, the computational complexity increases
• Cori is helping with the calculations associated with a = 0.043 femtometers

[Plot: ratio of decay constants of the K meson to the pion]

Benchmarking Platforms

• Endeavor - Intel
  – Intel's internal development cluster
  – Knights Landing (KNL) and Skylake (SKX) nodes
  – Intel OPA high-speed interconnect
• Cori - NERSC
  – Cray XC-40 architecture
  – Intel Haswell & KNL processors
• Edison - NERSC
  – Cray XC-30 architecture
  – Intel Ivy Bridge processors
• Theta - Argonne National Lab
  – Cray XC-40 with KNL nodes
• Stampede - TACC
  – Dell, Intel and Seagate
  – Intel KNL with OPA interconnect

MILC Computational Phases: Optimizations & Performance

Where does MILC spend its time?

• Representative runtime breakdown (single node)
• su3_rhmd_hisq application

[Chart: runtime in seconds (0-600) on a 16x16x16x16 lattice, broken down into Staggered CG, FF, GF, FL, and others, for Haswell (2 MPI + 16 OMP; MPI-only, 32 ranks; OMP-only, 32 threads) and KNL (MPI-only, 64 ranks; OMP-only, 64 threads)]

Profiling Tools and Methods

• Intel VTune and Advisor
• Good ol' Linux prof and gprof
• Roofline analysis
  – e.g. used to look at how close we are to the KNL MCDRAM BW roofline
• Integrated Performance Monitoring (IPM) tool
• And MILC has extensive timing support, in particular for functions known to be time consuming (time in sec. and GF/s reports)…
  – However, we found some timings didn't add up
• Needed additional timing for code not represented
• And this led to routines getting OpenMP that were assumed to be trivial
  – E.g. Update_u() calls from the "trajectory update"
• Function speedup after threading was 42.5x
• Resulting in an overall 12% improvement to the trajectory update

OpenMP Enhancements in Baseline MILC

Directory    | # of files total | # of files with candidate loops | # of loops | # of loops modified | % of loops rewritten
generic      | 102              | 43                               | 229        | 71                  | 31%
generic_ks   | 185              | 42                               | 529        | 17                  | 3.2%
ks_imp_rhmc  | 20               | 8                                | 36         | 4                   | 11%

• MILC abstracts loops with FORALLSITES and FORALLFIELDSITES.
  – Convert to FORALLxxx_OMP macros
  – Candidate OpenMP loops need to be examined for OMP private variables and reductions (a minimal sketch of such a conversion follows below)
• Examination of all loops would be tedious, so we used the tools mentioned on the previous slide to help us identify the loops with potentially the most impact
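To make the manual step concrete, here is a minimal, self-contained sketch of the kind of conversion involved. The types, the field, and the FORALLSITES stand-in below are toy definitions for illustration, not the actual MILC macros; the point is only that the candidate loop's reduction variable has to be spotted and named before the OpenMP form is correct.

    #include <omp.h>
    #include <stdio.h>

    #define NSITES 4096                      /* toy local lattice size */

    typedef struct { double re, im; } cplx;  /* toy field element      */
    static cplx phi[NSITES];                 /* toy per-site field     */

    /* Toy stand-in for a MILC-style site-loop macro */
    #define FORALLSITES(i) for ((i) = 0; (i) < NSITES; (i)++)

    int main(void) {
        int i;
        FORALLSITES(i) { phi[i].re = 1.0; phi[i].im = 0.0; }

        /* Threaded form of the same loop: 'norm' is a reduction, and any
         * per-iteration scratch variables would need to be private.     */
        double norm = 0.0;
    #pragma omp parallel for reduction(+:norm)
        for (i = 0; i < NSITES; i++)
            norm += phi[i].re * phi[i].re + phi[i].im * phi[i].im;

        printf("|phi|^2 = %g\n", norm);
        return 0;
    }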


Integrating the QPhiX Solver for Staggered Fermions (aka The Big Win)

• QPhiX, Staggered Fermion version, developed to improve vectorization and threading performance
• Staggered fermions vs. Wilson/Clover:
  – Looking at a different "action" than Wilson/Clover
  – Multi-mass CG solver implementation
  – Uses a single right-hand side -> limits reuse and decreases arithmetic intensity
  – 3 complex values per grid point vs. 12
  – Primarily used with double-precision variables
• X-Y blocking for SIMD vectorization (a toy layout sketch follows below):
  – Data stored as arrays of structures of arrays
  – SoA in the X dimension
  – SIMD_width/SOA_length in the Y dimension
  – Enables efficient cache blocking of X-Z

[Diagram: SIMD width (y) by SoA length (x) tiles]

https://github.com/JeffersonLab/qphix
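As an illustration of the X-Y blocking idea, the toy layout below packs SOA consecutive x-sites of a 3-component complex field into one block so that each color component can be loaded as one contiguous, SIMD-friendly vector. This is a sketch of the general array-of-structures-of-arrays pattern, not the actual QPhiX data structures, and the SOA value is an assumed example.

    #include <stdlib.h>

    #define SOA    8   /* SoA length: x-sites packed together (example value) */
    #define NCOLOR 3   /* staggered fields carry 3 complex values per site    */

    /* One AoSoA block: each color component is contiguous across SOA x-sites,
     * so a load of re[c][0..SOA-1] fills (part of) a SIMD register.          */
    typedef struct {
        double re[NCOLOR][SOA];
        double im[NCOLOR][SOA];
    } spinor_block;

    /* A field of nsites sites becomes nsites/SOA blocks, 64-byte aligned. */
    static spinor_block *alloc_field(size_t nsites) {
        return aligned_alloc(64, (nsites / SOA) * sizeof(spinor_block));
    }

With this layout, SIMD_width/SOA_length blocks from neighboring y values can be interleaved into one SIMD register, which is the X-Y blocking described in the bullets above.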

Multi-mass CG Solver (Single Node)

• The QPhiX solver provides a 1.5x speedup for the L=32 lattice
• For small lattice sizes, performance is limited by data remapping time

[Charts: "MILC CG Performance KNL" (baseline) and "MILC CG (w. QPhiX) Performance KNL"; GFLOPs/sec (0-120) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for L=16 and L=32]
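For reference, "multi-mass" in these CG results means solving a set of shifted linear systems that share one Krylov space built from a single right-hand side; schematically (notation assumed here, and normalization conventions for the staggered operator vary):

    \[
      \bigl(D^{\dagger} D + m_i^{2}\bigr)\, x_i \;=\; b ,
      \qquad i = 1, \dots, N_{\mathrm{mass}} .
    \]

All shifts reuse the same matrix-vector products each iteration, which is consistent with the single right-hand side and the low arithmetic intensity noted on the previous slide.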

Gauge Force (Single Node)

• Gauge Force performance improvements are up to 4x for the L=32 lattice size

[Charts: "MILC GF Performance KNL" (baseline) and "MILC GF (w. QPhiX) Performance KNL"; GFLOPs/sec (0-300) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for L=16 and L=32]

SkyLake vs. KNL

• Improvements translate to the latest Xeon processor with AVX-512 as well
• Architectures behave differently w.r.t. rank/thread tradeoffs

[Charts: "MILC CG (w. QPhiX) L=32" (0-120 GFLOPs/sec) and "MILC GF (w. QPhiX) L=32" (0-300 GFLOPs/sec) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for SKX base, SKX, and KNL]

CG Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• CG solver performance is limited by memory bandwidth on Xeons; hence, no improvement with QPhiX
• Parallel efficiency (definition sketched after the charts):
  – ~99% @ 64 nodes for a 32^4 lattice volume per node

[Charts: "Multi-mass CG (w/ QPhiX)" and "Multi-mass CG (Baseline MILC)", lattice volume per node: 32^4, Intel Xeon® Gold 6148 + Intel® OPA; GFLOPS/sec/node (0-70) vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]
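Parallel efficiency is read off these per-node curves in the usual weak-scaling sense (definition assumed, consistent with the GFLOPS/sec/node plots):

    \[
      E_{\mathrm{weak}}(N) \;=\; \frac{P_{\mathrm{node}}(N)}{P_{\mathrm{node}}(1)} ,
    \]

where P_node(N) is the per-node performance on N nodes; the quoted ~99% at 64 nodes means the 64-node per-node rate is within about 1% of the single-node rate.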

GF Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~85% @ 64 nodes for a 32^4 lattice volume per node

[Charts: "Gauge Force (w/ QPhiX)" (0-300 GFLOPS/sec/node) and "Gauge Force (Baseline MILC)" (0-35 GFLOPS/sec/node), lattice volume per node: 32^4, Intel Xeon® Gold 6148 + Intel® OPA, vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

MPI Messaging Characteristics, 16 KNL Nodes

• Message sizes vary with the number of ranks per node (RPN)
• As RPN decreases:
  – Lattice size (volume) per rank increases -> message sizes increase
  – Surface-to-volume ratio decreases -> total amount of data decreases (proportional to 1/N^4); see the arithmetic sketch after the charts
  – The same reasoning applies to increasing lattice size (volume) per node

[Charts: "MILC Multi-mass CG: Total Time, 16 Nodes, L=16", time in seconds (0-90) vs. ranks per node (64 down to 1), broken down into Compute, Irecv, Isend, Allreduce, and Wait; and "MILC Multi-mass CG, 16 Nodes, L=16", number of messages (1e6, 1-100, log scale) vs. message size in bytes (6K-256K), for 1 to 64 ranks per node]
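A small, self-contained sketch of the surface-to-volume arithmetic behind these trends (toy code; the local sub-lattice sizes are illustrative, not taken from these runs):

    #include <stdio.h>

    /* For a 4D local sub-lattice of nx*ny*nz*nt sites per rank, the data sent
     * per halo exchange scales with the surface sites while the work scales
     * with the volume, so growing the local volume grows message sizes but
     * shrinks the surface-to-volume ratio.                                   */
    int main(void) {
        int dims[][4] = { {8, 8, 8, 8}, {16, 16, 16, 16} };  /* smaller vs. larger local volume */
        for (int k = 0; k < 2; k++) {
            int nx = dims[k][0], ny = dims[k][1], nz = dims[k][2], nt = dims[k][3];
            long vol  = (long)nx * ny * nz * nt;
            long surf = 2L * (ny*nz*nt + nx*nz*nt + nx*ny*nt + nx*ny*nz);  /* two faces per dimension */
            printf("local %dx%dx%dx%d: volume %ld, surface %ld, ratio %.3f\n",
                   nx, ny, nz, nt, vol, surf, (double)surf / vol);
        }
        return 0;
    }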

MPI Characteristics, 512 Nodes

• Caveat 1: at this scale, the "full" IPM instrumentation impacts absolute time, but relative times "should" be representative (I need more confidence that this is true)
• Caveat 2: results are from a single trial; there could be up to 10% variability

[Charts: "QPhiX Multi-mass CG: Total Time, 512 Nodes, L=16x32x32x32" and "MILC Multi-mass CG: Total Time, 512 Nodes, L=16x32x32x32"; total time vs. ranks per node (64 down to 1), broken down into Compute, Irecv, Isend, Allreduce, and Wait]

Cori (Cray Aries) Huge Pages Optimization

• An MPI message-rate micro-benchmark identified an issue where BW drops significantly when transitioning from the Eager to the Rendezvous protocol
• Two solutions tried:
  – Move the Rendezvous transition to 64 KB
  – Per Cray advice, use huge pages (2 MB pages in this case)
• This has a significant impact on performance when using a large number of MPI ranks per node
• Recommendation -> use huge pages in communication-intensive codes with moderate message sizes

[Charts: "SMB Msg rate Bandwidth: Cori/KNL", GB/s (0-12) vs. message size in bytes (8 B to 1 MB), for r = 1 to 64 ranks and for r = 8 with huge 2M pages and with eager=64K, with the 1/2 BW line and the Eager/Rendezvous protocol regions marked; and "MILC CG Solve Time: 432 Nodes, 72x72x72x144 lattice", seconds (0-120) vs. ranks per node / threads per rank (64/1 down to 1/64), for eager=8KB, eager=64KB, and w/huge 2M]

Roofline Analysis

• Helps to focus on areas of code to target for optimization (the roofline bound itself is written out after the charts)
• MILC is known to be memory-BW bound, and the QPhiX version is near the roofline model, but the baseline version has higher AI and lower performance?

[Charts: "Roofline Model for KNL" and "MILC Performance vs. Roofline" — "MILC Staggered CG Solve Roofline for KNL, 28x28x28x28 lattice"; GFLOP/s (1-1000) vs. arithmetic intensity (FLOP/byte, 0.1-10), showing the empirical MCDRAM roofline, the empirical DDR roofline, Baseline w/ MCDRAM, QPhiX w/ MCDRAM, and QPhiX w/ DDR]
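The roofline bound referred to above is the standard one (generic form, not fitted to these runs):

    \[
      P_{\mathrm{attainable}} \;=\; \min\bigl(P_{\mathrm{peak}},\; \mathrm{AI} \times \mathrm{BW}\bigr),
      \qquad \mathrm{AI} = \mathrm{FLOPs}/\mathrm{bytes} ,
    \]

so a kernel sitting on the sloped, bandwidth-limited part of the roof can only gain by raising its arithmetic intensity or by moving its working set to the faster memory (MCDRAM rather than DDR on KNL).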

Impact on Production Computing

The Rubber Hitting the Road

• Current Cori (KNL) production runs:
  – 96x96x96x192 lattice on 128 nodes
  – ks_spectrum_hisq: calculation of meson and baryon spectra from a wide variety of sources and sinks
  – su3_clov_hisq: generates clover propagators and contracts meson and baryon two-point functions
• The QPhiX version of the multi-mass staggered fermion solver is being used

                       Staggered CG   Clover BiCG
  Standard MILC [1]    40 GF/s        21-69 GF/s [2]
  QPhiX [3]            52 GF/s        69 GF/s
  QPhiX improvement    1.3x           1x to 3.3x

  [1] 64 ranks per node, 1 thread per rank; includes the OpenMP improvements
  [2] 256 nodes running a different problem
  [3] 32 ranks per node, 8 threads per rank

Advantage of Using Cori's Burst Buffer for I/O

• I/O overhead is being significantly reduced by using Cori's Burst Buffer (BB) subsystem
• Typical wall clock for a 128-node run is ~5 hours
• Using the nominal Lustre /scratch, 53 minutes are spent writing 61 temporary files
• Using the BB, I/O was reduced to 26 minutes
  – A 2x improvement

Current and Future Efforts

• Fermion Force calculation optimizations
• Fat Links calculation optimizations
• High-speed interconnect explorations
  – Multiple-endpoints implementation of OPA MPI
• Continue improvements to OpenMP
  – Roofline analysis shows that baseline MILC is perhaps not BW bound
• Investigating the Grid solver
• Continue working on the SciDAC/QOP version of the code

Discussion & Questions

Backup

CG Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• The smaller volume (16^4 per node) becomes communication sensitive at larger node counts
• Parallel efficiency:
  – ~80% @ 64 nodes for a 16^4 lattice volume per node

[Charts: "Multi-mass CG (w/ QPhiX)" and "Multi-mass CG (Baseline MILC)", lattice volume per node: 16^4, Intel Xeon® Gold 6148 + Intel® OPA; GFLOPS/sec/node (0-70) vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

GF Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
  – Not seeing LLC effects with QPhiX as seen with Baseline MILC (need to investigate)
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~92% @ 64 nodes for a 16^4 lattice volume per node

[Charts: "Gauge Force (w/ QPhiX)" (0-300 GFLOPS/sec/node) and "Gauge Force (Baseline MILC)" (0-80 GFLOPS/sec/node), lattice volume per node: 16^4, Intel Xeon® Gold 6148 + Intel® OPA, vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

Communication Profile

• Profile collected using the Intel® MPI Performance Snapshot tool (part of ITAC)
• The smaller lattice volume (16^4) spends a high % of time in MPI, as expected

[Charts: "Compute vs. MPI (CG Solver)", % of time vs. log2(nodes), 8 MPI ranks/node.
 Lattice volume per node 32^4 (at 1, 4, 8, 16, 32, 64 nodes): computation 94.59, 91.93, 91.81, 90.45, 89.80, 87.49% vs. MPI 5.41, 8.07, 8.19, 9.55, 10.20, 12.51%.
 Lattice volume per node 16^4 (at 1, 4, 8, 16, 32 nodes): computation 81.47, 76.69, 76.52, 75.13, 70.71% vs. MPI 18.53, 23.31, 23.48, 24.87, 29.29%.]

MPI Function Summary

• Most of the MPI time is spent in MPI_Wait (i.e. P2P Send/Recv completion)
• Collective ops time (i.e. % MPI_Allreduce) increases with node count (a sketch of where this collective arises in the CG iteration follows the chart)
  – MPI_Allreduce is ~40% of MPI time at 64 nodes
  – A potential bottleneck at very large node counts

[Chart: "MPI Function Summary (CG Solver)", lattice volume: 16^4; % of MPI time (normalized) vs. log2(nodes) (0, 2, 3, 4, 5): %MPI_Wait 80.98, 73.03, 61.39, 50.53, 44.18; %MPI_Allreduce 13.55, 20.05, 27.5, 29.97, 40.05; plus a third, unlabeled remainder 3.68, 5.79, 9.82, 18.03, 14.49]
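The collective in question is the global sum each CG iteration performs for its inner products; a minimal sketch (not MILC's actual source) of where that MPI_Allreduce sits:

    #include <mpi.h>

    /* Global squared norm of the residual: every rank reduces its local dot
     * product, so each CG iteration pays for at least one MPI_Allreduce.
     * As node counts grow, this collective's share of MPI time grows too.  */
    double global_norm_sq(const double *r, int nlocal) {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < nlocal; i++)
            local += r[i] * r[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }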

Staggered Multi-mass CG: Lattice Scaling Study

MILC baseline:
• 64 MPI ranks per node

Staggered QPhiX:
• 1 MPI rank on 1 node
• Up to 16 ranks on multiple nodes

[Chart: QPhiX vs. MILC baseline, various MPI rank / OMP thread combinations, L=24]

Multi-mass CG: Weak Scaling on Cori

Benchmarks: Symanzik Gauge Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

QPhiX:
• 1 MPI rank with 1 node
• 16 ranks/node otherwise

[Chart: GF performance in GFLOPS/node for various lattice sizes per node (L), 1 thread/core]

Benchmarks: HISQ Fermion Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

Modified algorithm vs. MILC FF:
• Speedup is due to reducing the number of calculations (FLOPs) to 17% of the baseline

[Chart: "Speedup of QOPQDP" (1.00-9.00) at 1, 16, and 64, for L=16, L=24, and L=32]

Cori and Stampede

• Both use the same 68-core KNL SKUs
• Cori uses the Aries high-speed interconnect, Stampede uses Intel OmniPath (OPA)
  – Cray and Intel MPI, respectively
• Cori uses Cray CNL, Stampede is ??? Linux
• Standard (non-QPhiX) version of MILC
  – Primary analysis is high-speed interconnect performance, using a relatively small lattice size per node
• Intel C for both, although 17.0.2.x vs. 17.0.0.x
• Limited to a maximum job size of 80 nodes on Stampede -> limited the scaling study to 64 nodes

OSU Single-Rank Pt-2-Pt

• Single core, point-to-point between 2 nodes
• Latency is comparable at ~3.1 μs (but about 2x that of Haswell)
• Ping-pong is exactly that: a single message ping-ponged between 2 nodes (a minimal kernel is sketched after the charts)
• Uni-directional is a "streaming" exchange with a window of size 64
• Bi-directional is also "streaming"
• Cori shows better small-message BW (message rate) and Stampede higher peak BW

[Charts: "Ping-Pong Bandwidth", "Uni-directional Bandwidth", and "Bi-directional Bandwidth"; bandwidth in GB/s (0-16) vs. message size in bytes (1K-4M), Stampede vs. Cori; message rate increases toward small messages]
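For readers unfamiliar with the benchmark, a minimal ping-pong kernel in the spirit of the OSU point-to-point tests (illustrative only, not the OSU code; the message size and iteration count are arbitrary choices):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int size = 1 << 20, iters = 100;   /* 1 MB messages, 100 round trips */
        char *buf = malloc(size);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                      /* send, then wait for the echo  */
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {               /* echo the message back         */
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)                            /* bytes moved both ways / time  */
            printf("ping-pong bandwidth: %.2f GB/s\n",
                   2.0 * size * iters / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The uni- and bi-directional tests replace the echo with a window of outstanding sends (window size 64 in the runs above), which is why they expose peak bandwidth rather than latency.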

OSU Multi-Rank Pt-2-Pt

• As ranks per node (RPN) increase in this uni-directional test, Stampede's performance improves.
• Cori has a BW "ceiling" below 32 RPN that limits large-message BW and that is not observed on Stampede
  – Cray attributes this to a PCI latency issue between KNL and Aries; it can be mitigated by moving the BTE "put" protocol transition to a smaller message size (the default is 4 MB)
• It looks as though Stampede could use some tuning in its transition to a large-message protocol to take advantage of its higher peak BW

[Charts: "Stampede Multi-Rank Bandwidth" and "Cori Multi-Rank Bandwidth"; bandwidth in GB/s (0-14) vs. message size in bytes (64-4M), for 1, 2, 4, 8, 16, 32, and 64 ranks per node, with the 1/2 BW line marked; message rate increases toward small messages]

SMB Multi-Node, Multi-Rank Stencil

• 16 nodes, 6 neighbors per rank (emulates a 3D stencil communication pattern)
• Measures bi-directional message rate (converted to BW)
• Here we see Stampede is able to achieve close to its peak bi-directional BW (25 GB/s)
  – However, something happens at the 1 MB message size, to be investigated
• Stampede could also use some tuning in its protocol transition to large messages (>64 KB)

[Charts: "Stampede SMB" and "Cori SMB"; bandwidth in GB/s (0-30) vs. message size in bytes (64-1M), for 1, 2, 4, 8, 16, 32, and 64 ranks per node; message rate increases toward small messages]

Weak Scaling: Number of Nodes

• Weak scaled: the lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32, and 64 nodes
  – Number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64
  – All results use 1 thread per core

[Charts: "Cori: Multi-mass CG L=16" and "Stampede: Multi-mass CG L=16"; GFLOP/s (0-70) vs. number of nodes (2^N, N = 0-6), for 1, 2, 4, 8, 16, 32, and 64 rpn]

Weak Scaling: Ranks per Node

• Weak scaled: the lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32, and 64 nodes
  – Number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64

[Charts: "Cori: Multi-mass CG L=16" and "Stampede: Multi-mass CG L=16"; GFLOP/s (0-70) vs. MPI ranks per node (1-100, log scale), for 1, 8, 16, 32, and 64 nodes]

Weak Scaling: Select MPI/OMP

• CG time and CG scaling for selected MPI/OpenMP combinations
  – Ideal scaling would be a flat horizontal line
• Stampede scaling is better with higher RPN
• There is a scaling improvement on Cori going from 32 to 64 RPN, but I wouldn't read too much into it at such a small scale

[Charts: "Cori CG Performance, Weak Scaling, L=16 per node", time in seconds (0-70) vs. number of nodes (1, 8, 16, 32, 64); and "Cori: CG Weak Scaling, L=16", scaling efficiency (0.80-1.00) vs. number of nodes (16, 32, 64); both for Cori 64/1, Cori 8/8, Cori 2/32, Stampede 64/1, Stampede 8/8, and Stampede 2/32 (ranks/threads)]