
Improving the Performance of the MILC Code on Intel Knights Landing, An Overview

IXPUG 2017 Fall Meeting
September 26th–28th, 2017, Texas Advanced Computing Center (TACC)
Austin, TX

MILC on KNL Working Group

1. Indiana University: S. Gottlieb, R. Li
2. University of Utah: C. DeTar
3. University of Arizona: D. Toussaint
4. Intel: K. Raman, A. Jha, D. Kalamkar, M. Tolubaeva, T. Phung, R. Malladi
5. Tata Consultancy Services (TCS): G. Bhaskar, P. Gaurav, J. Bhat
6. Jefferson Lab: B. Joó
7. LBL/NERSC: D. Doerfler

• This effort is supported by the Intel Parallel Computing Center at Indiana University
• And is a NERSC/NESAP Tier 1 code in the Cori Phase 2 (KNL) project

Impact of MILC QCD Simulations

• Measuring the fundamental parameters of the Standard Model of particle physics
• And looking for deviations which suggest physics NOT accounted for, i.e. New Physics!
• Method is to use Monte Carlo evaluation of the quantum mechanical path integral
• Plot shows results achieved over the last several years, using resources from multiple facilities
• As the physical grid (lattice) spacing decreases, the computational complexity increases
• Cori is helping with the calculations associated with a = 0.043 femtometers

[Plot: ratio of decay constants of the K meson to the pion]

Benchmarking Platforms

• Endeavor - Intel
  – Intel's internal development cluster
  – Knights Landing (KNL) and Skylake (SKX) nodes
  – Intel OPA high-speed interconnect
• Cori - NERSC
  – Cray XC-40 architecture
  – Intel Haswell & KNL processors
• Edison - NERSC
  – Cray XC-30 architecture
  – Intel Ivy Bridge processors
• Theta - Argonne National Lab
  – Cray XC-40 with KNL nodes
• Stampede - TACC
  – Dell, Intel and Seagate
  – Intel KNL with OPA interconnect

MILC Computational Phases: Optimizations & Performance

Where does MILC spend its time?

• Representative runtime breakdown (single node)
• su3_rhmd_hisq application

[Chart: runtime in seconds (0-600) on a 16x16x16x16 lattice, broken down into Staggered CG, FF, GF, FL, and others, for Haswell (2 MPI + 16 OMP; MPI-only, 32 ranks; OMP-only, 32 threads) and KNL (MPI-only, 64 ranks; OMP-only, 64 threads)]

Profiling Tools and Methods

• Intel VTune and Advisor
• Good ol' Linux prof and gprof
• Roofline analysis
  – e.g. used to look at how close we are to the KNL MCDRAM BW roofline
• Integrated Performance Monitoring (IPM) tool
• And MILC has extensive timing support, in particular for functions known to be time consuming (time in sec. and GF/s reports)…
  – However, we found some timings didn't add up
• Needed additional timing for code not represented
• And this led to routines getting OpenMP that were assumed to be trivial
  – E.g. Update_u() calls from the "trajectory update"
• Function speedup after threading was 42.5x
• Resulting in an overall 12% improvement to the trajectory update

OpenMP Enhancements in Baseline MILC

Directory    | # of files total | # of files with candidate loops | # of loops | # of loops modified | % of loops rewritten
generic      | 102              | 43                               | 229        | 71                  | 31%
generic_ks   | 185              | 42                               | 529        | 17                  | 3.2%
ks_imp_rhmc  | 20               | 8                                | 36         | 4                   | 11%

• MILC abstracts loops with FORALLSITES and FORALLFIELDSITES.
  – Convert to FORALLxxx_OMP macros
  – Candidate OpenMP loops need to be examined for OMP private variables and reductions (a minimal sketch of such a conversion follows below)
• Examination of all loops would be tedious, so we used the tools mentioned on the previous slide to help us identify the loops with potentially the most impact
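To make the manual step concrete, here is a minimal, self-contained sketch of the kind of conversion involved. The types, the field, and the FORALLSITES stand-in below are toy definitions for illustration, not the actual MILC macros; the point is only that the candidate loop's reduction variable has to be spotted and named before the OpenMP form is correct.

    #include <omp.h>
    #include <stdio.h>

    #define NSITES 4096                      /* toy local lattice size */

    typedef struct { double re, im; } cplx;  /* toy field element      */
    static cplx phi[NSITES];                 /* toy per-site field     */

    /* Toy stand-in for a MILC-style site-loop macro */
    #define FORALLSITES(i) for ((i) = 0; (i) < NSITES; (i)++)

    int main(void) {
        int i;
        FORALLSITES(i) { phi[i].re = 1.0; phi[i].im = 0.0; }

        /* Threaded form of the same loop: 'norm' is a reduction, and any
         * per-iteration scratch variables would need to be private.     */
        double norm = 0.0;
    #pragma omp parallel for reduction(+:norm)
        for (i = 0; i < NSITES; i++)
            norm += phi[i].re * phi[i].re + phi[i].im * phi[i].im;

        printf("|phi|^2 = %g\n", norm);
        return 0;
    }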


Integrating the QPhiX Solver for Staggered Fermions (aka The Big Win)

• QPhiX, Staggered Fermion version, developed to improve vectorization and threading performance
• Staggered fermions vs. Wilson/Clover:
  – Looking at a different "action" than Wilson/Clover
  – Multi-mass CG solver implementation
  – Uses a single right-hand side -> limits reuse and decreases arithmetic intensity
  – 3 complex values per grid point vs. 12
  – Primarily used with double-precision variables
• X-Y blocking for SIMD vectorization (a toy layout sketch follows below):
  – Data stored as arrays of structures of arrays
  – SoA in the X dimension
  – SIMD_width/SOA_length in the Y dimension
  – Enables efficient cache blocking of X-Z

[Diagram: SIMD width (y) by SoA length (x) tiles]

https://github.com/JeffersonLab/qphix
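As an illustration of the X-Y blocking idea, the toy layout below packs SOA consecutive x-sites of a 3-component complex field into one block so that each color component can be loaded as one contiguous, SIMD-friendly vector. This is a sketch of the general array-of-structures-of-arrays pattern, not the actual QPhiX data structures, and the SOA value is an assumed example.

    #include <stdlib.h>

    #define SOA    8   /* SoA length: x-sites packed together (example value) */
    #define NCOLOR 3   /* staggered fields carry 3 complex values per site    */

    /* One AoSoA block: each color component is contiguous across SOA x-sites,
     * so a load of re[c][0..SOA-1] fills (part of) a SIMD register.          */
    typedef struct {
        double re[NCOLOR][SOA];
        double im[NCOLOR][SOA];
    } spinor_block;

    /* A field of nsites sites becomes nsites/SOA blocks, 64-byte aligned. */
    static spinor_block *alloc_field(size_t nsites) {
        return aligned_alloc(64, (nsites / SOA) * sizeof(spinor_block));
    }

With this layout, SIMD_width/SOA_length blocks from neighboring y values can be interleaved into one SIMD register, which is the X-Y blocking described in the bullets above.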

Multi-mass CG Solver (Single Node)

• The QPhiX solver provides a 1.5x speedup for the L=32 lattice
• For small lattice sizes, performance is limited by data remapping time

[Charts: "MILC CG Performance KNL" (baseline) and "MILC CG (w. QPhiX) Performance KNL"; GFLOPs/sec (0-120) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for L=16 and L=32]
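For reference, "multi-mass" in these CG results means solving a set of shifted linear systems that share one Krylov space built from a single right-hand side; schematically (notation assumed here, and normalization conventions for the staggered operator vary):

    \[
      \bigl(D^{\dagger} D + m_i^{2}\bigr)\, x_i \;=\; b ,
      \qquad i = 1, \dots, N_{\mathrm{mass}} .
    \]

All shifts reuse the same matrix-vector products each iteration, which is consistent with the single right-hand side and the low arithmetic intensity noted on the previous slide.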

Gauge Force (Single Node)

• Gauge Force performance improvements are up to 4x for the L=32 lattice size

[Charts: "MILC GF Performance KNL" (baseline) and "MILC GF (w. QPhiX) Performance KNL"; GFLOPs/sec (0-300) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for L=16 and L=32]

SkyLake vs. KNL

• Improvements translate to the latest Xeon processor with AVX-512 as well
• Architectures behave differently w.r.t. rank/thread tradeoffs

[Charts: "MILC CG (w. QPhiX) L=32" (0-120 GFLOPs/sec) and "MILC GF (w. QPhiX) L=32" (0-300 GFLOPs/sec) vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16), for SKX base, SKX, and KNL]

CG Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• CG solver performance is limited by memory bandwidth on Xeons; hence, no improvement with QPhiX
• Parallel efficiency (definition sketched after the charts):
  – ~99% @ 64 nodes for a 32^4 lattice volume per node

[Charts: "Multi-mass CG (w/ QPhiX)" and "Multi-mass CG (Baseline MILC)", lattice volume per node: 32^4, Intel Xeon® Gold 6148 + Intel® OPA; GFLOPS/sec/node (0-70) vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]
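Parallel efficiency is read off these per-node curves in the usual weak-scaling sense (definition assumed, consistent with the GFLOPS/sec/node plots):

    \[
      E_{\mathrm{weak}}(N) \;=\; \frac{P_{\mathrm{node}}(N)}{P_{\mathrm{node}}(1)} ,
    \]

where P_node(N) is the per-node performance on N nodes; the quoted ~99% at 64 nodes means the 64-node per-node rate is within about 1% of the single-node rate.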

GF Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~85% @ 64 nodes for a 32^4 lattice volume per node

[Charts: "Gauge Force (w/ QPhiX)" (0-300 GFLOPS/sec/node) and "Gauge Force (Baseline MILC)" (0-35 GFLOPS/sec/node), lattice volume per node: 32^4, Intel Xeon® Gold 6148 + Intel® OPA, vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

MPI Messaging Characteristics, 16 KNL Nodes

• Message sizes vary with the number of ranks per node (RPN)
• As RPN decreases:
  – Lattice size (volume) per rank increases -> message sizes increase
  – Surface-to-volume ratio decreases -> total amount of data decreases (proportional to 1/N^4); see the arithmetic sketch after the charts
  – The same reasoning applies to increasing lattice size (volume) per node

[Charts: "MILC Multi-mass CG: Total Time, 16 Nodes, L=16", time in seconds (0-90) vs. ranks per node (64 down to 1), broken down into Compute, Irecv, Isend, Allreduce, and Wait; and "MILC Multi-mass CG, 16 Nodes, L=16", number of messages (1e6, 1-100, log scale) vs. message size in bytes (6K-256K), for 1 to 64 ranks per node]
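A small, self-contained sketch of the surface-to-volume arithmetic behind these trends (toy code; the local sub-lattice sizes are illustrative, not taken from these runs):

    #include <stdio.h>

    /* For a 4D local sub-lattice of nx*ny*nz*nt sites per rank, the data sent
     * per halo exchange scales with the surface sites while the work scales
     * with the volume, so growing the local volume grows message sizes but
     * shrinks the surface-to-volume ratio.                                   */
    int main(void) {
        int dims[][4] = { {8, 8, 8, 8}, {16, 16, 16, 16} };  /* smaller vs. larger local volume */
        for (int k = 0; k < 2; k++) {
            int nx = dims[k][0], ny = dims[k][1], nz = dims[k][2], nt = dims[k][3];
            long vol  = (long)nx * ny * nz * nt;
            long surf = 2L * (ny*nz*nt + nx*nz*nt + nx*ny*nt + nx*ny*nz);  /* two faces per dimension */
            printf("local %dx%dx%dx%d: volume %ld, surface %ld, ratio %.3f\n",
                   nx, ny, nz, nt, vol, surf, (double)surf / vol);
        }
        return 0;
    }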

MPI Characteristics, 512 Nodes

• Caveat 1: at this scale, the "full" IPM instrumentation impacts absolute time, but relative times "should" be representative (I need more confidence that this is true)
• Caveat 2: results are from a single trial; there could be up to 10% variability

[Charts: "QPhiX Multi-mass CG: Total Time, 512 Nodes, L=16x32x32x32" and "MILC Multi-mass CG: Total Time, 512 Nodes, L=16x32x32x32"; total time vs. ranks per node (64 down to 1), broken down into Compute, Irecv, Isend, Allreduce, and Wait]

Cori (Cray Aries) Huge Pages Optimization

• An MPI message-rate micro-benchmark identified an issue where BW drops significantly when transitioning from the Eager to the Rendezvous protocol
• Two solutions tried:
  – Move the Rendezvous transition to 64 KB
  – Per Cray advice, use huge pages (2 MB pages in this case)
• This has a significant impact on performance when using a large number of MPI ranks per node
• Recommendation -> use huge pages in communication-intensive codes with moderate message sizes

[Charts: "SMB Msg rate Bandwidth: Cori/KNL", GB/s (0-12) vs. message size in bytes (8 B to 1 MB), for r = 1 to 64 ranks and for r = 8 with huge 2M pages and with eager=64K, with the 1/2 BW line and the Eager/Rendezvous protocol regions marked; and "MILC CG Solve Time: 432 Nodes, 72x72x72x144 lattice", seconds (0-120) vs. ranks per node / threads per rank (64/1 down to 1/64), for eager=8KB, eager=64KB, and w/huge 2M]

Roofline Analysis

• Helps to focus on areas of code to target for optimization (the roofline bound itself is written out after the charts)
• MILC is known to be memory-BW bound, and the QPhiX version is near the roofline model, but the baseline version has higher AI and lower performance?

[Charts: "Roofline Model for KNL" and "MILC Performance vs. Roofline" — "MILC Staggered CG Solve Roofline for KNL, 28x28x28x28 lattice"; GFLOP/s (1-1000) vs. arithmetic intensity (FLOP/byte, 0.1-10), showing the empirical MCDRAM roofline, the empirical DDR roofline, Baseline w/ MCDRAM, QPhiX w/ MCDRAM, and QPhiX w/ DDR]
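The roofline bound referred to above is the standard one (generic form, not fitted to these runs):

    \[
      P_{\mathrm{attainable}} \;=\; \min\bigl(P_{\mathrm{peak}},\; \mathrm{AI} \times \mathrm{BW}\bigr),
      \qquad \mathrm{AI} = \mathrm{FLOPs}/\mathrm{bytes} ,
    \]

so a kernel sitting on the sloped, bandwidth-limited part of the roof can only gain by raising its arithmetic intensity or by moving its working set to the faster memory (MCDRAM rather than DDR on KNL).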

Impact on Production Computing

The Rubber Hitting the Road

• Current Cori (KNL) production runs:
  – 96x96x96x192 lattice on 128 nodes
  – ks_spectrum_hisq: calculation of meson and baryon spectra from a wide variety of sources and sinks
  – su3_clov_hisq: generates clover propagators and contracts meson and baryon two-point functions
• The QPhiX version of the multi-mass staggered fermion solver is being used

                       Staggered CG   Clover BiCG
  Standard MILC [1]    40 GF/s        21-69 GF/s [2]
  QPhiX [3]            52 GF/s        69 GF/s
  QPhiX improvement    1.3x           1x to 3.3x

  [1] 64 ranks per node, 1 thread per rank; includes the OpenMP improvements
  [2] 256 nodes running a different problem
  [3] 32 ranks per node, 8 threads per rank

Advantage of Using Cori's Burst Buffer for I/O

• I/O overhead is being significantly reduced by using Cori's Burst Buffer (BB) subsystem
• Typical wall clock for a 128-node run is ~5 hours
• Using the nominal Lustre /scratch, 53 minutes are spent writing 61 temporary files
• Using the BB, I/O was reduced to 26 minutes
  – A 2x improvement

Current and Future Efforts

• Fermion Force calculation optimizations
• Fat Links calculation optimizations
• High-speed interconnect explorations
  – Multiple-endpoints implementation of OPA MPI
• Continue improvements to OpenMP
  – Roofline analysis shows that baseline MILC is perhaps not BW bound
• Investigating the Grid solver
• Continue working on the SciDAC/QOP version of the code

Discussion & Questions

Backup

CG Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• The smaller volume (16^4 per node) becomes communication sensitive at larger node counts
• Parallel efficiency:
  – ~80% @ 64 nodes for a 16^4 lattice volume per node

[Charts: "Multi-mass CG (w/ QPhiX)" and "Multi-mass CG (Baseline MILC)", lattice volume per node: 16^4, Intel Xeon® Gold 6148 + Intel® OPA; GFLOPS/sec/node (0-70) vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

GF Performance on SKX + OPA (Multi-Node)

• Weak scaling runs on Intel Xeon® Gold 6148 + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
  – Not seeing LLC effects with QPhiX as seen with Baseline MILC (need to investigate)
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~92% @ 64 nodes for a 16^4 lattice volume per node

[Charts: "Gauge Force (w/ QPhiX)" (0-300 GFLOPS/sec/node) and "Gauge Force (Baseline MILC)" (0-80 GFLOPS/sec/node), lattice volume per node: 16^4, Intel Xeon® Gold 6148 + Intel® OPA, vs. log2(nodes) (0-7), for 1, 2, 4, 8, 16, and 32 ranks/node]

Communication Profile

• Profile collected using the Intel® MPI Performance Snapshot tool (part of ITAC)
• The smaller lattice volume (16^4) spends a high % of time in MPI, as expected

[Charts: "Compute vs. MPI (CG Solver)", % of time vs. log2(nodes), 8 MPI ranks/node.
 Lattice volume per node 32^4 (at 1, 4, 8, 16, 32, 64 nodes): computation 94.59, 91.93, 91.81, 90.45, 89.80, 87.49% vs. MPI 5.41, 8.07, 8.19, 9.55, 10.20, 12.51%.
 Lattice volume per node 16^4 (at 1, 4, 8, 16, 32 nodes): computation 81.47, 76.69, 76.52, 75.13, 70.71% vs. MPI 18.53, 23.31, 23.48, 24.87, 29.29%.]

MPI Function Summary

• Most of the MPI time is spent in MPI_Wait (i.e. P2P Send/Recv completion)
• Collective ops time (i.e. % MPI_Allreduce) increases with node count (a sketch of where this collective arises in the CG iteration follows the chart)
  – MPI_Allreduce is ~40% of MPI time at 64 nodes
  – A potential bottleneck at very large node counts

[Chart: "MPI Function Summary (CG Solver)", lattice volume: 16^4; % of MPI time (normalized) vs. log2(nodes) (0, 2, 3, 4, 5): %MPI_Wait 80.98, 73.03, 61.39, 50.53, 44.18; %MPI_Allreduce 13.55, 20.05, 27.5, 29.97, 40.05; plus a third, unlabeled remainder 3.68, 5.79, 9.82, 18.03, 14.49]
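The collective in question is the global sum each CG iteration performs for its inner products; a minimal sketch (not MILC's actual source) of where that MPI_Allreduce sits:

    #include <mpi.h>

    /* Global squared norm of the residual: every rank reduces its local dot
     * product, so each CG iteration pays for at least one MPI_Allreduce.
     * As node counts grow, this collective's share of MPI time grows too.  */
    double global_norm_sq(const double *r, int nlocal) {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < nlocal; i++)
            local += r[i] * r[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }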

Staggered Multi-mass CG: Lattice Scaling Study

MILC baseline:
• 64 MPI ranks per node

Staggered QPhiX:
• 1 MPI rank on 1 node
• Up to 16 ranks on multiple nodes

[Chart: QPhiX vs. MILC baseline, various MPI rank / OMP thread combinations, L=24]

Multi-mass CG: Weak Scaling on Cori

Benchmarks: Symanzik Gauge Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

QPhiX:
• 1 MPI rank with 1 node
• 16 ranks/node otherwise

[Chart: GF performance in GFLOPS/node for various lattice sizes per node (L), 1 thread/core]

Benchmarks: HISQ Fermion Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

Modified algorithm vs. MILC FF:
• Speedup is due to reducing the number of calculations (FLOPs) to 17% of the baseline

[Chart: "Speedup of QOPQDP" (1.00-9.00) at 1, 16, and 64, for L=16, L=24, and L=32]

Cori and Stampede

• Both use the same 68-core KNL SKUs
• Cori uses the Aries high-speed interconnect, Stampede uses Intel OmniPath (OPA)
  – Cray and Intel MPI, respectively
• Cori uses Cray CNL, Stampede is ??? Linux
• Standard (non-QPhiX) version of MILC
  – Primary analysis is high-speed interconnect performance, using a relatively small lattice size per node
• Intel C for both, although 17.0.2.x vs. 17.0.0.x
• Limited to a maximum job size of 80 nodes on Stampede -> limited the scaling study to 64 nodes

OSU Single-Rank Pt-2-Pt

• Single core, point-to-point between 2 nodes
• Latency is comparable at ~3.1 μs (but about 2x that of Haswell)
• Ping-pong is exactly that: a single message ping-ponged between 2 nodes (a minimal kernel is sketched after the charts)
• Uni-directional is a "streaming" exchange with a window of size 64
• Bi-directional is also "streaming"
• Cori shows better small-message BW (message rate) and Stampede higher peak BW

[Charts: "Ping-Pong Bandwidth", "Uni-directional Bandwidth", and "Bi-directional Bandwidth"; bandwidth in GB/s (0-16) vs. message size in bytes (1K-4M), Stampede vs. Cori; message rate increases toward small messages]
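For readers unfamiliar with the benchmark, a minimal ping-pong kernel in the spirit of the OSU point-to-point tests (illustrative only, not the OSU code; the message size and iteration count are arbitrary choices):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int size = 1 << 20, iters = 100;   /* 1 MB messages, 100 round trips */
        char *buf = malloc(size);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                      /* send, then wait for the echo  */
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {               /* echo the message back         */
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)                            /* bytes moved both ways / time  */
            printf("ping-pong bandwidth: %.2f GB/s\n",
                   2.0 * size * iters / (t1 - t0) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }

The uni- and bi-directional tests replace the echo with a window of outstanding sends (window size 64 in the runs above), which is why they expose peak bandwidth rather than latency.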

OSU Multi-Rank Pt-2-Pt

• As ranks per node (RPN) increase in this uni-directional test, Stampede's performance improves.
• Cori has a BW "ceiling" below 32 RPN that limits large-message BW and that is not observed on Stampede
  – Cray attributes this to a PCI latency issue between KNL and Aries; it can be mitigated by moving the BTE "put" protocol transition to a smaller message size (the default is 4 MB)
• It looks as though Stampede could use some tuning in its transition to a large-message protocol to take advantage of its higher peak BW

[Charts: "Stampede Multi-Rank Bandwidth" and "Cori Multi-Rank Bandwidth"; bandwidth in GB/s (0-14) vs. message size in bytes (64-4M), for 1, 2, 4, 8, 16, 32, and 64 ranks per node, with the 1/2 BW line marked; message rate increases toward small messages]

SMB Multi-Node, Multi-Rank Stencil

• 16 nodes, 6 neighbors per rank (emulates a 3D stencil communication pattern)
• Measures bi-directional message rate (converted to BW)
• Here we see Stampede is able to achieve close to its peak bi-directional BW (25 GB/s)
  – However, something happens at the 1 MB message size, to be investigated
• Stampede could also use some tuning in its protocol transition to large messages (>64 KB)

[Charts: "Stampede SMB" and "Cori SMB"; bandwidth in GB/s (0-30) vs. message size in bytes (64-1M), for 1, 2, 4, 8, 16, 32, and 64 ranks per node; message rate increases toward small messages]

Weak Scaling: Number of Nodes

• Weak scaled: the lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32, and 64 nodes
  – Number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64
  – All results use 1 thread per core

[Charts: "Cori: Multi-mass CG L=16" and "Stampede: Multi-mass CG L=16"; GFLOP/s (0-70) vs. number of nodes (2^N, N = 0-6), for 1, 2, 4, 8, 16, 32, and 64 rpn]

Weak Scaling: Ranks per Node

• Weak scaled: the lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32, and 64 nodes
  – Number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64

[Charts: "Cori: Multi-mass CG L=16" and "Stampede: Multi-mass CG L=16"; GFLOP/s (0-70) vs. MPI ranks per node (1-100, log scale), for 1, 8, 16, 32, and 64 nodes]

Weak Scaling: Select MPI/OMP

• CG time and CG scaling for selected MPI/OpenMP combinations
  – Ideal scaling would be a flat horizontal line
• Stampede scaling is better with higher RPN
• There is a scaling improvement on Cori going from 32 to 64 RPN, but I wouldn't read too much into it at such a small scale

[Charts: "Cori CG Performance, Weak Scaling, L=16 per node", time in seconds (0-70) vs. number of nodes (1, 8, 16, 32, 64); and "Cori: CG Weak Scaling, L=16", scaling efficiency (0.80-1.00) vs. number of nodes (16, 32, 64); both for Cori 64/1, Cori 8/8, Cori 2/32, Stampede 64/1, Stampede 8/8, and Stampede 2/32 (ranks/threads)]