Steve Leak and Zhengji Zhao
NESAP Hack-a-thon
November 29, 2016, Berkeley CA
Using Cori
How to Compile & MCDRAM
Steve Leak
Building for Cori KNL nodes
• What's different?
• How to compile
  – ... to use the new wide vector instructions
• What to link
• Making use of MCDRAM
  – High Bandwidth Memory
Building for Cori KNL nodes
Don't Panic (much)
KNL can run Haswell executables
But ...
Haswell executables can't fully use KNL hardware
  – AVX2 (haswell): operation on 4 DP words
  – AVX-512 (knl): hardware can compute 8 DP words per instruction
And ...
KNL relies more on vectorization for performance
[Figure: GFLOPS versus arithmetic intensity]
And ...
KNL memory hierarchy is more complicated
How to compile
Best: Use compiler options to build for KNL
    module swap craype-haswell craype-mic-knl
• The loaded craype-* module sets the target that the compiler wrappers (cc, CC, ftn) build for
  – E.g. -mknl (GNU compiler), -hmic-knl (Cray compiler)
• craype-haswell is the default on login nodes
• craype-mic-knl is for KNL nodes
How to compile
Best: Compiler settings to target KNL
Alternate:
    CC -axMIC-AVX512,CORE-AVX2 <more-options> mycode.c++
• Only valid when using Intel compilers (cc, CC or ftn)
• -ax<arch> adds "alternate execution paths" optimized for different architectures
  – Makes 2 (or more) versions of the code in the same object file
• NOT AS GOOD as the craype-mic-knl module
  – (the module causes versions of libraries built for that architecture to be used, e.g. MKL)
How to compile
Recommendations:
• For best performance, use the craype-mic-knl module
    module swap craype-haswell craype-mic-knl
    CC -O3 -c myfile.c++
• If the same executable must run on KNL and Haswell nodes, use craype-haswell but add a KNL-optimized execution path
    CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++
What to link
Utility libraries
• Not performance-critical (by definition)
  – KNL can run Xeon binaries, so you can use Haswell-targeted versions
• I/O libraries (HDF5, NetCDF, etc.) should fit in this category too
  – (for Cray-provided libraries, the compiler wrapper will use craype-* to select the best build anyway)
What to link
Performance-critical libraries
• MKL: has KNL-targeted optimizations
  – Note: need to link with -lmemkind (more soon)
• PETSc, SLEPc, Caffe, Metis, etc.:
  – (soon) will have KNL-targeted builds
• Modulefiles will use craype-{haswell,mic-knl} to find the appropriate library
• Key points:
  – Someone else has already prepared libraries for KNL
  – No need to do-it-yourself
  – Load the right craype- module
What to link
• NERSC convention:
    /usr/common/software/<name>/<version>/<arch>/[<PrgEnv>]
• E.g.:
    /usr/common/software/petsc/3.7.2/hsw/intel
    /usr/common/software/petsc/3.7.2/knl/intel
• The KNL subfolder may be a symlink to hsw
  – Libraries compiled with -axMIC-AVX512,CORE-AVX2
• Modulefiles should do the right thing (TM)
  – Using CRAY_CPU_TARGET, set by craype-{haswell,mic-knl}
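The path convention above can be sketched as a small shell helper. This is only an illustration: the function name nersc_lib_dir is hypothetical, and the mapping from CRAY_CPU_TARGET values ("haswell", "mic-knl") to the hsw/knl directory names is an assumption based on the example paths on this slide, not a NERSC-provided tool.

```shell
# Hypothetical helper: build a library prefix following the convention
#   /usr/common/software/<name>/<version>/<arch>/<PrgEnv>
# Assumption: CRAY_CPU_TARGET is "mic-knl" on a KNL build and anything
# else is treated as Haswell.
nersc_lib_dir() {
    local name="$1"
    local version="$2"
    local target="${3:-${CRAY_CPU_TARGET:-haswell}}"
    local prgenv="${4:-intel}"
    local arch
    case "$target" in
        mic-knl) arch=knl ;;
        *)       arch=hsw ;;
    esac
    echo "/usr/common/software/$name/$version/$arch/$prgenv"
}

# Example: the two PETSc directories from the slide.
nersc_lib_dir petsc 3.7.2 haswell intel
nersc_lib_dir petsc 3.7.2 mic-knl intel
```

In practice the modulefiles do this lookup for you; the sketch only shows why loading the right craype-* module is enough to pick the right build.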
Where to build
• Mostly: on the login nodes
  – KNL is designed for scalable, vectorized workloads
  – Compiling is neither!
    • Will probably be much slower on a KNL node than on a Xeon node
• Cross-compiling
  – You are compiling for a Xeon Phi (KNL) target, on a Xeon host
    • Tools like autoconf (./configure) may try to build-and-run small executables to test availability of libraries, etc., which might not work
  – Compile on a KNL compute node?
    • Slow (and currently not working)
  – craype-haswell + CFLAGS=-axMIC-AVX512,CORE-AVX2
Don’t Panic!
In Summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here's the new bit:
  • module swap craype-haswell craype-mic-knl
    – For KNL-specific executables, or
  • CC -axMIC-AVX512,CORE-AVX2 ...
    – For Haswell/KNL portability
What about MCDRAM?
• What's different?
• How to compile
  – ... to use the new wide vector instructions
• What to link
• Making use of MCDRAM
  – High Bandwidth Memory
MCDRAM in a nutshell
• 16 GB on-chip memory
  – cf. 96 GB off-chip DDR (Cori)
• Not (exactly) a cache
  – Latency similar to DDR
• But very high bandwidth
  – ~5x DDR
• 2 ways to use it:
  – "Cache" mode: invisible to the OS, memory pages are cached in MCDRAM (cache-line granularity)
  – "Flat" mode: appears to the OS as a separate NUMA node, with no local CPUs. Accessible via numactl, libnuma (page granularity)
MCDRAM in a nutshell - cache mode
MCDRAM in a nutshell - flat mode
How to use MCDRAM
• Option 1: Let the system figure it out
  – Cache mode, no changes to code, build procedure or run procedure
  – Most of the benefit, free, most of the time
How to use MCDRAM
• Option 2: Run-time settings only
  – Flat mode, no changes to code or build procedure
  – Does the whole job fit within 16 GB/node?
    • srun <options> numactl -m 1 ./myexec.exe
  – Too big?
    • srun <options> numactl -p 1 ./myexec.exe
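The choice between numactl -m and numactl -p in Option 2 can be sketched as a shell function. This is a minimal illustration under stated assumptions: the function name numactl_policy is hypothetical, the footprint is a whole number of GB that you estimate yourself, and MCDRAM is assumed to be NUMA node 1 with 16 GB (as in quad,flat mode). Slightly less than 16 GB is actually free, so treat the threshold as approximate.

```shell
# Hypothetical helper: pick a numactl policy for flat mode.
#   numactl_policy FOOTPRINT_GB [MCDRAM_GB] [MCDRAM_NUMA_NODE]
numactl_policy() {
    local footprint_gb="$1"
    local mcdram_gb="${2:-16}"
    local mcdram_node="${3:-1}"
    if [ "$footprint_gb" -le "$mcdram_gb" ]; then
        echo "-m $mcdram_node"   # bind: every allocation must come from MCDRAM
    else
        echo "-p $mcdram_node"   # prefer: fill MCDRAM first, then spill to DDR
    fi
}

# Example: print the launch line for a job that needs about 12 GB per node.
echo "srun <options> numactl $(numactl_policy 12) ./myexec.exe"
```

With -m, an allocation that does not fit in MCDRAM fails rather than falling back to DDR, which is why -p is the safer choice when the footprint is uncertain.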
How to use MCDRAM
• Option 3: Make your application NUMA-aware
  – Flat mode
  – Use libmemkind to explicitly allocate selected arrays in MCDRAM
[Figure: memkind software stack]
  – memkind: NUMA-aware extensible heap manager, built on:
    • jemalloc: malloc implementation emphasizing fragmentation avoidance and concurrency
    • libnuma: API for NUMA allocation policy in the Linux kernel
    #include <hbwmalloc.h>
    malloc(size) -> hbw_malloc(size)
Using libmemkind in code
• C/C++: hbw_malloc() replaces malloc()
    #include <hbwmalloc.h>
    // malloc(size) -> hbw_malloc(size)
• Fortran
    !DIR$ MEMORY(bandwidth) a,b,c           ! cray
    real, allocatable :: a(:,:), b(:,:), c(:)
    !DIR$ ATTRIBUTES FASTMEM :: a,b,c       ! intel
• Caveat: only for dynamically-allocated arrays
  – Not local (stack) variables
  – Or Fortran pointers
Using libmemkind in code
• Which arrays to put in MCDRAM?
  – VTune memory-access measurements:
  – amplxe-cl -collect memory-access …
Building with libmemkind
• module load memkind
• (or module load cray-memkind)
• Compiler wrappers will add
    -lmemkind -ljemalloc -lnuma
• Fortran note: Not all compilers support the FASTMEM directive
  – Currently Intel and maybe Cray
AutoHBW: Automatic memkind
• Uses array size to determine whether an array should be allocated to MCDRAM
• No code changes necessary!
• module load autohbw
• Link with -lautohbw
Runtime environment variables:
    export AUTO_HBW_SIZE=4K      # any allocation >4KB will be placed in MCDRAM
    export AUTO_HBW_SIZE=4K:8K   # allocations between 4KB and 8KB will be placed in MCDRAM
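The size rule behind AUTO_HBW_SIZE can be illustrated with a short shell sketch. This is not AutoHBW's own code: the function names to_bytes and goes_to_mcdram are hypothetical, and the boundary handling (lower bound inclusive, upper bound exclusive) is an assumption, since the slide only says ">4KB" and "between 4KB and 8KB".

```shell
# to_bytes: convert "4K", "2M", "1G" or a plain number to bytes.
to_bytes() {
    local v="$1"
    local n="${1%[KMG]}"
    case "$v" in
        *K) echo $(( n * 1024 )) ;;
        *M) echo $(( n * 1024 * 1024 )) ;;
        *G) echo $(( n * 1024 * 1024 * 1024 )) ;;
        *)  echo "$v" ;;
    esac
}

# goes_to_mcdram SIZE_BYTES SPEC
#   SPEC is "low" or "low:high", in the same style as AUTO_HBW_SIZE.
#   Prints "MCDRAM" or "DDR" for an allocation of SIZE_BYTES.
goes_to_mcdram() {
    local size="$1"
    local spec="$2"
    local low
    local high=""
    low=$(to_bytes "${spec%%:*}")
    case "$spec" in
        *:*) high=$(to_bytes "${spec##*:}") ;;
    esac
    # Assumed boundaries: low <= size, and size < high when high is given.
    if [ "$size" -ge "$low" ] && { [ -z "$high" ] || [ "$size" -lt "$high" ]; }; then
        echo "MCDRAM"
    else
        echo "DDR"
    fi
}

# Example: a 1 MB array under the two settings from the slide.
goes_to_mcdram 1048576 4K
goes_to_mcdram 1048576 4K:8K
```

The second setting shows why an upper bound matters: it keeps very large arrays out of the 16 GB MCDRAM so that smaller, hotter arrays can use it.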
Don’t Panic!
In Summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here's the new bit:
  • module swap craype-haswell craype-mic-knl
    – For KNL-specific executables, or
  • CC -axMIC-AVX512,CORE-AVX2 ...
    – For Haswell/KNL portability
And:
• Think about MCDRAM
  – numactl, memkind, autohbw
A few final notes
• Edison executables (probably) won't work without a recompile
  – ISA-compatible, but…
  – Cori has a newer OS version, updated libraries
  – So: recompile for Cori
• KNL-optimized MKL uses libmemkind
  – Will need to link with -lmemkind -ljemalloc
  – Should be invisibly integrated in a future version
Running jobs on Cori KNL nodes
Zhengji Zhao
Agenda
• What's new on KNL nodes
• Process/thread/memory affinity
• Sample job scripts
• Summary
KNL overview and legend

Knights Landing Overview
Chip: 36 tiles interconnected by 2D mesh
Tile: 2 cores + 2 VPU/core + 1 MB L2
Memory: MCDRAM: 16 GB on-package, high BW; DDR4: 6 channels @ 2400, up to 384 GB
IO: 36 lanes PCIe Gen3; 4 lanes of DMI for chipset
Node: 1-socket only
Fabric: Omni-Path on-package (not shown)
Vector peak perf: 3+ TF DP and 6+ TF SP flops
Scalar perf: ~3x over Knights Corner
Streams Triad (GB/s): MCDRAM: 400+; DDR: 90+

[Figure: KNL package diagram. 36 tiles connected by a 2D mesh interconnect; 8 MCDRAM devices attached through EDCs; 2 DDR memory controllers with 3 DDR4 channels each; PCIe Gen 3 (2 x16, 1 x4) and x4 DMI. Each tile holds 2 cores with 2 VPUs each, 1 MB L2 and a CHA.]

Source: Intel. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. KNL data are preliminary based on current expectations and are subject to change without notice. (1) Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX). (2) Bandwidth numbers are based on a STREAM-like memory access pattern when MCDRAM is used as flat memory. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Omni-Path not shown.

Legend (Cori KNL node):
• 1 socket/node
• 68 cores (272 CPUs)/node
• 36 tiles/node (34 active)
• 2 cores/tile; 4 CPUs/core
• 1.4 GB/core DDR memory
• 235 MB/core MCDRAM
• A Cori KNL node has 68 cores/272 CPUs running at 1.4 GHz, 96 GB DDR memory, and 16 GB of high-bandwidth (~5x DDR) on-package memory (MCDRAM)
• Three cluster modes (all-to-all, quadrant, sub-NUMA clustering) are available at boot time to configure the KNL mesh interconnect.
These slides use Slurm's terminology of cores and CPUs (hardware threads).
KNL overview – MCDRAM modes
Memory Modes: three modes, selected at boot

Cache Mode (16 GB MCDRAM in front of 96 GB DDR)
• SW-transparent, memory-side cache
• Direct mapped, 64 B lines
• Tags part of line
• Covers whole DDR range
• No source code changes needed
• Misses are expensive

Flat Mode (16 GB MCDRAM and 96 GB DDR in one physical address space)
• MCDRAM as regular memory
• SW-managed
• Same address space
• Code changes required
• Exposed as a NUMA node
• Access via memkind library, job launchers, and/or numactl

Hybrid Mode (4 or 8 GB MCDRAM as cache, 8 or 12 GB MCDRAM as memory, 96 GB DDR)
• Part cache, part memory
• 25% or 50% cache
• Benefits of both
• Combination of the cache and flat modes

MCDRAM can be configured in three different modes at boot time: cache, flat, and hybrid.
What’s new on KNL nodes (in comparison with Cori Haswell nodes from the perspective of running jobs)
1. A lot more (slower) cores on the node
2. Much reduced per-core memory
3. Dynamically configurable NUMA and MCDRAM modes
A proper process/thread/memory affinity is the basis for optimal performance
• Process affinity (or CPU pinning): bind an (MPI) process to a CPU or a range of CPUs on the node, so that the process executes within the designated CPUs instead of drifting around to other CPUs on the node.
• Thread affinity: fine-pin each thread of a process to a CPU or CPUs within the CPUs that are designated to the process.
  – Threads live in the process that owns them, so the process and thread affinity are not separable.
• Memory affinity: restrict processes to allocate memory from the designated NUMA nodes only, rather than from any NUMA node.
The minimum goal of process/thread/memory affinity is to achieve best resource utilization and to avoid NUMA performance penalty
• Spread MPI tasks and threads onto the cores and CPUs on the nodes as evenly as possible so that no cores and CPUs are oversubscribed while others stay idle. This ensures that the resources available on the node, such as cores, CPUs, NUMA nodes, memory and network bandwidths, etc., can be best utilized.
• Avoid accessing remote NUMA nodes as much as possible to avoid the performance penalty.
• In the context of KNL, enable and control the MCDRAM access.
Using srun’s --cpu_bind option and OpenMP environment variables to achieve desired process/thread affinity
• Use srun --cpu_bind to bind tasks to CPUs
  – Often needs to work with the -c option of srun to evenly spread MPI tasks on the CPUs on the nodes
  – srun -c <n> (or --cpus-per-task=n) allocates (reserves) n CPUs per task (process)
  – --cpu_bind=[{verbose,quiet},]type, where type is one of: cores, threads, map_cpu:<list of CPUs>, mask_cpu:<list of masks>, none, …
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to fine-pin each thread to a subset of the CPUs allocated to the host task
  – Different compilers may have different default values for them. The following are recommended, which yield a more compatible thread affinity among Intel, GNU and Cray compilers:
      OMP_PROC_BIND=true    # threads may not be moved between CPUs
      OMP_PLACES=threads    # each thread should be placed on a single CPU
  – Use OMP_DISPLAY_ENV=true to display the OpenMP environment variables set (useful when checking the default compiler behavior)
Using srun’s --mem_bind option and/or numactl to achieve desired memory affinity
• Use srun --mem_bind for memory affinity
  – --mem_bind=[{verbose,quiet},]type, where type is one of: local, map_mem:<NUMA id list>, mask_mem:<NUMA mask list>, none, …
  – E.g., --mem_bind=map_mem:<MCDRAM NUMA id> when allocations fit into MCDRAM in flat mode
• Use numactl -p <NUMA id>
  – srun does not have this functionality currently (16.05.6); it will be supported in Slurm 17.02.
  – E.g., numactl -p <MCDRAM NUMA id> ./a.out so that allocations that don't fit into MCDRAM spill over to DDR memory
Default Slurm behavior with respect to process/thread/memory binding
• By Slurm default, a decent CPU binding is set only when (MPI tasks per node) x (CPUs per task) = the total number of CPUs allocated per node, e.g., 68 x 4 = 272
• Otherwise, Slurm does not do anything with CPU binding. srun's --cpu_bind and -c options must be used explicitly to achieve optimal process/thread affinity.
• No default memory binding is set by Slurm. Processes can allocate memory from all NUMA nodes. The --mem_bind option (or numactl) should be used explicitly to set memory bindings.
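Since -c has to be set explicitly, it helps to compute it from the number of tasks per node. The sketch below assumes the pattern used by the sample scripts in these slides (64 of the 68 cores in use, 4 CPUs per core); the function name cpus_per_task is hypothetical.

```shell
# Hypothetical helper: value for srun -c given the MPI tasks per node.
#   cpus_per_task TASKS_PER_NODE [CORES_USED] [CPUS_PER_CORE]
# Each task gets a whole number of cores, and every core contributes
# all of its hardware threads (CPUs).
cpus_per_task() {
    local tasks="$1"
    local cores="${2:-64}"
    local cpus_per_core="${3:-4}"
    if [ "$tasks" -lt 1 ] || [ "$tasks" -gt "$cores" ]; then
        echo "tasks per node must be between 1 and $cores" >&2
        return 1
    fi
    echo $(( (cores / tasks) * cpus_per_core ))
}

# Example: launch lines for 64, 32 and 16 tasks per node.
for n in 64 32 16; do
    echo "srun -n $n -c $(cpus_per_task "$n") --cpu_bind=cores ./a.out"
done
```

Passing 68 as the second argument reproduces the 68 x 4 = 272 case mentioned above.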
Default Slurm behavior with respect to process/thread/memory binding (continued)
• The default distribution, the -m option of srun, is block:cyclic on Cori.
  – The cyclic distribution method distributes allocated CPUs for binding to a given task consecutively from the same socket, and from the next consecutive socket for the next task, in a round-robin fashion across sockets.
• -m block:block also works. You are encouraged to experiment with -m block:block, as some applications perform better with the block distribution.
  – The block distribution method distributes allocated CPUs consecutively from the same socket for binding to tasks, before using the next consecutive socket.
• The -m option is relevant to the KNL nodes when they are configured in the sub-NUMA cluster modes, e.g., SNC2, SNC4, etc. Slurm treats "NUMA nodes with CPUs" as "sockets", although KNL is a single-socket node.
Available partitions and NUMA/MCDRAM modes on Cori KNL nodes (not finalized view yet)
• Same partitions as Haswell
  – #SBATCH -p regular
  – #SBATCH -p debug
  – Type sinfo -s for more info about partitions and nodes
• Use the -C knl,<NUMA>,<MCDRAM> option of sbatch to request KNL nodes with the desired features
  – #SBATCH -C knl,quad,flat
• Supports combinations of the following NUMA/MCDRAM modes:
  – AllowNUMA=a2a,snc2,snc4,hemi,quad
  – AllowMCDRAM=cache,split,equal,flat
  – quad,flat is the default for now (not finalized)
• Nodes can be rebooted automatically
  – Frequent reboots are not encouraged, as they currently take a long time
  – We are testing various memory modes in order to set a proper default mode
Example of running interactive batch job with KNL nodes in the quad,cache mode
zz217@gert01:~> salloc -N 1 -p debug -t 30:00 -C knl,quad,cache
salloc: Granted job allocation 5545
salloc: Waiting for resource configuration
salloc: Nodes nid00044 are ready for job
zz217@nid00044:~> numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 … 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 96757 MB
node 0 free: 94207 MB
node distances:
node   0
  0:  10

• Run the numactl -H command to check whether the actual NUMA configuration matches the requested NUMA,MCDRAM mode
• The quad,cache mode has only 1 NUMA node, with all CPUs on NUMA node 0 (DDR memory)
• The MCDRAM is hidden from the numactl -H command (it is a cache).
Sample job script to run under the quad,cache mode
Sample job script (pure MPI)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=1   # optional*
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode. The srun command launches 64 MPI tasks on the node, allocating 4 CPUs per task, and binds processes to cores. Rank 0 will be pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. Each MPI task may move within the 4 CPUs in its core.

[Figure: process affinity outcome. Each 2x2 box is a core with 4 CPUs (hardware threads), and the number shown in each CPU box is the CPU id; for example Core 0 holds CPUs 0, 68, 136, 204 and Core 63 holds CPUs 63, 131, 199, 267. Ranks 0-63 sit on Cores 0-63. The last 4 cores (Cores 64-67) are not used in this example. Cores 4-59 are not shown.]

*) The use of "export OMP_NUM_THREADS=1" is optional but recommended even for pure MPI codes. This is to avoid unexpected thread forking (compiler wrappers may link your code to the multi-threaded system-provided libraries by default).
Sample job script to run under the quad,cache mode
Sample job script (pure MPI)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=1   # optional*
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode. The srun command launches 16 MPI tasks on the node, allocating 16 CPUs per task, and binds each process to 4 cores/16 CPUs. Rank 0 is pinned to Cores 0-3, Rank 1 to Cores 4-7, …, Rank 15 to Cores 60-63. Each MPI task may move within the 16 CPUs in its 4 cores.

[Figure: process affinity outcome. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63; Cores 64-67 are not used.]
Sample job script to run under the quad,cache mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode to run 64 MPI tasks on the node, allocating 4 CPUs per task, and binds each task to the 4 CPUs allocated within the cores. Each MPI task runs 4 OpenMP threads. Rank 0 will be pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. The 4 threads of each task are pinned within the core. Depending on the compilers used to compile the code, the 4 threads in each core may or may not move between the 4 CPUs.

[Figure: process affinity outcome. Ranks 0-63 on Cores 0-63, with Threads 0-3 of each rank inside its core; Cores 64-67 are not used.]
Sample job script to run under the quad,cache mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

With the above two OpenMP envs, each thread is pinned to a single CPU within each core. E.g., for Rank 0, Thread 0 is pinned to CPU 0, Thread 1 to CPU 68, Thread 2 to CPU 136, and Thread 3 to CPU 204.

[Figure: process/thread affinity outcome. Ranks 0-63 on Cores 0-63, with Threads 0-3 of each rank each fixed to one of the core's 4 CPUs; Cores 64-67 are not used.]
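The CPU numbering used in these figures follows a simple pattern: the 4 hardware threads of core k have ids k, k+68, k+136 and k+204. A minimal shell sketch of that rule follows; the function name core_cpus is hypothetical, and the pattern is read off the figures for a 68-core node with 4 CPUs per core.

```shell
# Hypothetical helper: list the CPU ids (hardware threads) of one core.
#   core_cpus CORE_ID [CORES_PER_NODE] [CPUS_PER_CORE]
core_cpus() {
    local core="$1"
    local ncores="${2:-68}"
    local nthreads="${3:-4}"
    local ids=""
    local t=0
    while [ "$t" -lt "$nthreads" ]; do
        # hardware thread t of a core is offset by t * (cores per node)
        ids="$ids${ids:+ }$(( core + t * ncores ))"
        t=$(( t + 1 ))
    done
    echo "$ids"
}

core_cpus 0     # CPUs of Core 0
core_cpus 63    # CPUs of Core 63
```

This is the list you would expect to see for a rank bound with --cpu_bind=cores and -c 4, and the four entries are where OMP_PLACES=threads puts Threads 0-3.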
Sample job script to run under the quad,cache mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

Depending on the compiler implementations, the 8 threads in each task may or may not move between the 4 cores/16 CPUs allocated to the host task.

[Figure: process affinity outcome. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63, each running Threads 0-7; Cores 64-67 are not used.]
Sample job script to run under the quad,cache mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

With the above two OpenMP envs, each thread is pinned to a single CPU on the cores allocated to the task. E.g., for Rank 0, Thread 0 is pinned to CPU 0 (on Core 0), Thread 1 to CPU 1 (on Core 1), Thread 2 to CPU 2 (on Core 2), and so on.

[Figure: process/thread affinity outcome. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63, with Threads 0-7 of each rank each fixed to one CPU; Cores 64-67 are not used.]
Example of running under the quad,flat mode interactively
zz217@gert01:~> salloc -p debug -t 30:00 -C knl,quad,flat
zz217@nid00037:~> numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 … 262 263 264 265 266 267 268 269 270 271
node 0 size: 96759 MB
node 0 free: 94208 MB
node 1 cpus:
node 1 size: 16157 MB
node 1 free: 16091 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

zz217@cori10:~> scontrol show node nid10388
NodeName=nid10388 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
AvailableFeatures=knl,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,cache,quad
…
State=IDLE ThreadsPerCore=4 …
BootTime=2016-10-31T13:43:12 …
Sample job script to run under the quad,flat mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

[Figure: process affinity outcome for the two srun commands. First panel: Ranks 0-63 on Cores 0-63, one rank per core. Second panel: Ranks 0-15, each spanning 4 cores (Rank 0 on Cores 0-3, Rank 15 on Cores 60-63).]
Sample job script to run under the quad,flat mode
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

[Figure: process/thread affinity outcome for the two srun commands. First panel: Ranks 0-63 on Cores 0-63, with Threads 0-3 each fixed to one CPU of the core. Second panel: Ranks 0-15, each spanning 4 cores (Rank 0 on Cores 0-3, Rank 15 on Cores 60-63), with Threads 0-7 each fixed to one CPU.]
Sample job script to run under the quad,flat mode using MCDRAM
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    # When the memory footprint fits in 16 GB of MCDRAM (NUMA node 1), runs out of MCDRAM
    export OMP_NUM_THREADS=4
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    srun -n 64 -c 4 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out
    # or using numactl -m
    srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./a.out

Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    # Prefers running on MCDRAM (NUMA node 1); if the memory footprint does not fit on MCDRAM, spills to DDR
    export OMP_NUM_THREADS=8
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    srun -n 16 -c 16 --cpu_bind=cores numactl -p 1 ./a.out
How to check the process/thread affinity
• Use the srun flag --cpu_bind=verbose
  – Need to read the CPU masks in hexadecimal format
• Use the Cray-provided code xthi.c (see backup slides).
• Use --mem_bind=verbose,<type> to check memory affinity
• Use the numastat -p <PID> command to confirm while a job is running
• Use environment variables (Slurm, compiler specific)
  – SLURM_CPU_BIND_VERBOSE: --cpu_bind verbosity (quiet, verbose).
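Reading a 272-bit CPU mask by eye is error-prone, so a small decoder helps. The sketch below is an illustration only: the function name mask_to_cpus is hypothetical, and it assumes the mask is a plain hexadecimal string (with or without a 0x prefix) in which bit i stands for CPU i. It walks the mask one hex digit at a time, so the long KNL masks do not overflow shell integers.

```shell
# Hypothetical helper: turn a hexadecimal CPU mask into a list of CPU ids.
#   mask_to_cpus HEXMASK
mask_to_cpus() {
    local mask="${1#0x}"
    local cpus=""
    local base=0
    local digit value bit
    while [ -n "$mask" ]; do
        digit="${mask#"${mask%?}"}"   # last (least significant) hex digit
        mask="${mask%?}"              # drop it from the mask
        value=$(( 0x$digit ))
        bit=0
        while [ "$bit" -lt 4 ]; do
            if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
                cpus="$cpus${cpus:+ }$(( base + bit ))"
            fi
            bit=$(( bit + 1 ))
        done
        base=$(( base + 4 ))          # each hex digit covers 4 CPUs
    done
    echo "$cpus"
}

# Example: a mask with bits 0, 68, 136 and 204 set (the 4 CPUs of Core 0).
z=0000000000000000                    # 16 hex zeros = 64 cleared bits
mask_to_cpus "0x1${z}1${z}1${z}1"
```

A rank bound to a whole core should decode to the 4 CPU ids of that core; a rank bound with --cpu_bind=threads should decode to exactly the CPUs it was given.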
A few useful commands
• sinfo --format="%F %b" for available features of nodes, or sinfo --format="%C %b"
  – A/I/O/T (allocated/idle/other/total)
• scontrol show node <nid>
• scontrol show job <jobid> -ddd
• sinfo -s to see available partitions and nodes
• sbatch, srun, squeue, sinfo and other Slurm command man pages
  – Need to distinguish the job allocation time (#SBATCH) and the job step creation time (srun within a job script)
  – Some options are only available at job allocation time, such as --ntasks-per-core; some only work when certain plugins are enabled
Summary
• Use -C knl,<NUMA>,<MCDRAM> to request KNL nodes with the same partitions as Haswell nodes (debug or regular)
• Always explicitly use srun's --cpu_bind and -c options to spread the MPI tasks evenly over the cores/CPUs on the nodes
• Use the OpenMP envs OMP_PROC_BIND and OMP_PLACES to fine-pin threads to the CPUs allocated to the tasks
• Use srun's --mem_bind and numactl -p to control memory affinity and access MCDRAM
  – The memkind/autoHBW libraries can be used to allocate only selected arrays/memory allocations to MCDRAM (Steve's talk)
Summary (2)
• Consider using 64 cores out of 68 in most cases
• More sample job scripts can be found on our website
  – http://www.nersc.gov/users/computational-systems/cori/running-jobs/running-jobs-on-cori-knl-nodes/
• We have provided a job script generator to help you generate batch job scripts for KNL (and Haswell, Edison)
  – URL for the job script generator: https://my.nersc.gov/script_generator.php
• Slurm KNL features are in continuous development and some instructions are subject to change
Backup slides
Cray-provided code to check process/thread affinity: xthi.c (http://portal.nersc.gov/project/training/KNLUserTraing20161103/UsingCori/xthi.c/)
• To compile:
    cc -qopenmp -o xthi.intel xthi.c   # Intel compilers
    cc -fopenmp -o xthi.gnu xthi.c     # GNU compilers
    cc -o xthi.gnu xthi.c              # Cray compilers
• To run:
    salloc -N 1 -p debug -C knl,quad,flat   # start an interactive 1-node job
    …
    export OMP_DISPLAY_ENV=true   # to display envs used/set by the OpenMP runtime
    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=verbose,cores xthi.intel    # run 64 tasks with 4 threads each
    srun -n 16 -c 16 --cpu_bind=verbose,cores xthi.intel   # run 16 tasks with 4 threads each
sinfo --format="%F %b" to show the number of nodes with active features
zz217@cori01:~> date
Mon Nov 28 15:10:36 PST 2016
zz217@cori01:~> sinfo --format="%F %b"
NODES(A/I/O/T)       ACTIVE_FEATURES
1805/157/42/2004     haswell
0/107/1/108          knl,flat,a2a
0/0/1/1              knl
0/225/6/231          knl,flat,quad
2500/1942/60/4502    knl,cache,quad
2676/995/6/3677      quad,cache,knl
768/0/0/768          snc4,flat,knl
0/8/0/8              quad,flat,knl
0/8/1/9              snc2,flat,knl

This command shows that there are 8179 KNL nodes in quad,cache mode; 239 nodes in quad,flat mode; 768 nodes in snc4,flat mode; and 9 in snc2,flat mode.
The order of the features does not make a difference.
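The per-mode totals quoted above (8179 and 239) come from adding rows whose feature lists differ only in order. A minimal sketch of that bookkeeping in shell and awk follows; the function name total_by_mode is hypothetical and the sample rows are copied from the sinfo output above.

```shell
# Four rows from the sinfo output above: NODES(A/I/O/T) ACTIVE_FEATURES
sample='2500/1942/60/4502 knl,cache,quad
2676/995/6/3677 quad,cache,knl
0/225/6/231 knl,flat,quad
0/8/0/8 quad,flat,knl'

# Hypothetical helper: sum the T (total) count per feature set,
# treating "quad,cache,knl" and "knl,cache,quad" as the same set.
total_by_mode() {
    awk '{
        split($1, counts, "/")
        n = split($2, feats, ",")
        # sort the feature names so their order no longer matters
        for (i = 1; i <= n; i++)
            for (j = i + 1; j <= n; j++)
                if (feats[j] < feats[i]) {
                    tmp = feats[i]; feats[i] = feats[j]; feats[j] = tmp
                }
        key = feats[1]
        for (i = 2; i <= n; i++) key = key "," feats[i]
        total[key] += counts[4]
    }
    END { for (k in total) print total[k], k }' | sort -k2
}

printf '%s\n' "$sample" | total_by_mode
```

On a live system the same function could be fed from sinfo with the header line removed, though the exact output layout there should be checked first.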
The --cpu_bind option of srun enables CPU bindings
    salloc -N 1 -p debug -C knl,quad,flat
    …
    srun -n 4 ./a.out   # no CPU bindings. Tasks can move around within 68 cores/272 CPUs
    srun -n 4 --cpu_bind=cores ./a.out
    srun -n 4 --cpu_bind=threads ./a.out
    srun -n 4 --cpu_bind=map_cpu:0,204,67,271 ./a.out

[Figure: placement of Ranks 0-3 on Cores 0-67 (4 CPUs per core) for each of the srun commands above.]
The --cpu_bind option: the -c option spreads tasks (evenly) on the CPUs on the node
    salloc -N 1 -p debug -C knl,quad,flat
    …
    srun -n 16 -c 16 --cpu_bind=cores ./a.out
    srun -n 4 -c 8 --cpu_bind=cores ./a.out

[Figure: placement for the two srun commands above. With -n 4 -c 8, Ranks 0-3 occupy Cores 0-7: the first 8 cores/32 CPUs are used and the remaining 60 cores/240 CPUs stay idle. With -n 16 -c 16, Ranks 0-15 occupy Cores 0-63: the last 4 cores/16 CPUs are idle.]
The --cpu_bind option (continued): the -c option spreads tasks (evenly) on the CPUs on the node
    salloc -N 1 -p debug -C knl,quad,flat
    …
    srun -n 16 -c 16 --cpu_bind=threads ./a.out
    srun -n 4 -c 8 --cpu_bind=threads ./a.out

[Figure: placement for the two srun commands above. With -n 4 -c 8, Ranks 0-3 occupy Cores 0-7: the first 8 cores/32 CPUs are used and the remaining 60 cores/240 CPUs stay idle. With -n 16 -c 16, Ranks 0-15 occupy Cores 0-63: the last 4 cores/16 CPUs are idle.]
The -c option: --cpu_bind=cores vs --cpu_bind=threads
    salloc -N 1 -p debug -C knl,quad,flat
    …
    srun -n 4 -c 6 --cpu_bind=threads ./a.out
    srun -n 4 -c 6 --cpu_bind=cores ./a.out

[Figure: placement of Ranks 0-3 for the two srun commands above. In one panel the first 8 cores/32 CPUs are used and the remaining 60 cores/240 CPUs stay idle; in the other the first 6 cores/24 CPUs are used and the remaining 62 cores/248 CPUs stay idle.]
snc2,flat (salloc -N 1 -p regular -C knl,snc2,flat)
zz217@nid11512:~> numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
node 0 size: 48293 MB
node 0 free: 46047 MB
node 1 cpus: 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 1 size: 48466 MB
node 1 free: 44949 MB
node 2 cpus:
node 2 size: 8079 MB
node 2 free: 7983 MB
node 3 cpus:
node 3 size: 8077 MB
node 3 free: 7980 MB
node distances:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  41  31
  2:  31  41  10  41
  3:  41  31  41  10

[Figure: NUMA node 0 holds Cores 0-33 and NUMA node 1 holds Cores 34-67; NUMA nodes 2 and 3 are the MCDRAM nodes and have no CPUs.]
The default distribution on Cori is -m block:cyclic
    salloc -N 2 -p debug -C knl,snc2,flat
    …
    srun -n 11 -c 4 --cpu_bind=cores ./a.out
    # or
    srun -n 11 -c 4 --cpu_bind=cores -m block:cyclic ./a.out

[Figure: placement of Ranks 0-5 on KNL node 1 and Ranks 6-10 on KNL node 2. On each node, consecutive ranks alternate between NUMA node 0 (Cores 0-33) and NUMA node 1 (Cores 34-67), one core per rank.]
The default distribution on Cori is -m block:cyclic (continued)
    salloc -N 2 -p debug -C knl,snc2,flat
    …
    srun -n 11 -c 8 --cpu_bind=cores ./a.out
    # or
    srun -n 11 -c 8 --cpu_bind=cores -m block:cyclic ./a.out

[Figure: placement of Ranks 0-5 on KNL node 1 and Ranks 6-10 on KNL node 2. On each node, consecutive ranks alternate between NUMA node 0 (Cores 0-33) and NUMA node 1 (Cores 34-67), two cores per rank.]
Block distribution: -m block:block
    salloc -N 2 -p debug -C knl,snc2,flat
    …
    srun -n 11 -c 4 --cpu_bind=cores -m block:block ./a.out

[Figure: placement of Ranks 0-5 on KNL node 1 and Ranks 6-10 on KNL node 2, one core per rank, split between NUMA node 0 (Cores 0-33) and NUMA node 1 (Cores 34-67) according to the block distribution.]
Block distribution: -m block:block (continued)
    salloc -N 2 -p debug -C knl,snc2,flat
    …
    srun -n 11 -c 8 --cpu_bind=cores -m block:block ./a.out

[Figure: placement of Ranks 0-5 on KNL node 1 and Ranks 6-10 on KNL node 2, two cores per rank, split between NUMA node 0 (Cores 0-33) and NUMA node 1 (Cores 34-67) according to the block distribution.]
Sample job script to run MPI+OpenMP under the snc2,flat mode
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -L SCRATCH
    #SBATCH -C knl,snc2,flat

    export OMP_NUM_THREADS=8
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    # using 2 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
    srun -n 16 -c 16 --cpu_bind=cores --mem_bind=local ./a.out

    # using 2 CPUs (hardware threads) per core using MCDRAM
    srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:2,3 ./a.out
    # or
    srun -n 16 -c 16 --cpu_bind=cores numactl -m 2,3 ./a.out

    # using 2 CPUs (hardware threads) per core with MCDRAM preferred
    srun -n 16 -c 16 --cpu_bind=cores numactl -p 2,3 ./a.out

    # using 4 CPUs (hardware threads) per core with MCDRAM preferred
    srun -n 32 -c 8 --cpu_bind=cores numactl -p 2,3 ./a.out
Sample job script to run large jobs (> 1500 MPI tasks) (quad,flat)
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 100
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -L SCRATCH
    #SBATCH -C knl,quad,flat

    export OMP_NUM_THREADS=4
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    # using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
    srun --bcast=/tmp/a.out -n 6400 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

    # or
    # using 4 CPUs (hardware threads) per core with MCDRAM preferred
    sbcast ./a.out /tmp/a.out   # copy a.out to the /tmp of each allocated compute node first
    srun -n 6400 -c 4 --cpu_bind=cores numactl -p 1 /tmp/a.out
Sample job script to use core specialization (quad,flat)
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -L SCRATCH
    #SBATCH -C knl,quad,flat
    #SBATCH -S 1

    export OMP_NUM_THREADS=4
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    # using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
    srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

    # or
    # using 4 CPUs (hardware threads) per core with MCDRAM preferred
    srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out
Sample job script to run with Intel MPI (quad,flat)
Sample job script (MPI+OpenMP)
    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    export OMP_NUM_THREADS=4
    export OMP_PROC_BIND=true
    export OMP_PLACES=threads

    module load impi
    export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
    # export I_MPI_FABRICS=shm:tcp

    # using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
    srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

    # or
    # using 4 CPUs (hardware threads) per core with MCDRAM preferred
    srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out
Thank you!