Using Cori

Steve Leak and Zhengji Zhao
NESAP Hack-a-thon, November 29, 2016, Berkeley CA

How to Compile & MCDRAM

Steve Leak

Building for Cori KNL nodes

• What's different?
• How to compile
  – ... to use the new wide vector instructions
• What to link
• Making use of MCDRAM
  – High Bandwidth Memory

Building for Cori KNL nodes

Don't Panic (much)


KNL can run Haswell executables

But ...

Haswell executables can't fully use KNL hardware

AVX2 (haswell): operation on 4 DP words
AVX-512 (knl): hardware can compute 8 DP words per instruction

And ...

KNL relies more on vectorization for performance

[Figure: plot of GFLOPS against arithmetic intensity]

And ...

KNL memory hierarchy is more complicated

How to compile

Best: use compiler options to build for KNL

    module swap craype-haswell craype-mic-knl

• The loaded craype-* module sets the target that the compiler wrappers (cc, CC, ftn) build for
  – E.g. -mknl (GNU compiler), -hmic-knl (Cray compiler)
• craype-haswell is the default on login nodes
• craype-mic-knl is for KNL nodes

How to compile

Best: compiler settings to target KNL
Alternate:

    CC -axMIC-AVX512,CORE-AVX2 <more-options> mycode.c++

• Only valid when using Intel compilers (cc, CC or ftn)
• -ax<arch> adds "alternate execution paths" optimized for different architectures
  – Makes 2 (or more) versions of the code in the same object file
• NOT AS GOOD as the craype-mic-knl module
  – (the module causes versions of libraries built for that architecture to be used, e.g. MKL)

How to compile

Recommendations:
• For best performance, use the craype-mic-knl module

    module swap craype-haswell craype-mic-knl
    CC -O3 -c myfile.c++

• If the same executable must run on KNL and Haswell nodes, use craype-haswell but add a KNL-optimized execution path

    CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++

What to link

Utility libraries
• Not performance-critical (by definition)
  – KNL can run Xeon binaries, so Haswell-targeted versions can be used
• I/O libraries (HDF5, NetCDF, etc.) should fit in this category too
  – (for Cray-provided libraries, the compiler wrapper will use craype-* to select the best build anyway)

What to link

Performance-critical libraries
• MKL: has KNL-targeted optimizations
  – Note: need to link with -lmemkind (more soon)
• PETSc, SLEPc, Caffe, METIS, etc.:
  – (soon) have KNL-targeted builds
• Modulefiles will use craype-{haswell,mic-knl} to find the appropriate library
• Key points:
  – Someone else has already prepared libraries for KNL
  – No need to do it yourself
  – Load the right craype- module

What to link

• NERSC convention:
    /usr/common/software/<name>/<version>/<arch>/[<PrgEnv>]
• E.g.:
    /usr/common/software/petsc/3.7.2/hsw/intel
    /usr/common/software/petsc/3.7.2/knl/intel
• The KNL subfolder may be a symlink to hsw
  – Libraries compiled with -axMIC-AVX512,CORE-AVX2
• Modulefiles should do the right thing (TM)
  – Using CRAY_CPU_TARGET, set by craype-{haswell,mic-knl}

Where to build

• Mostly: on the login nodes
  – KNL is designed for scalable, vectorized workloads
  – Compiling is neither!
    • Will probably be much slower on a KNL node than on a Xeon node
• Cross-compiling
  – You are compiling for a Xeon Phi (KNL) target on a Xeon host
    • Tools like autoconf (./configure) may try to build and run small executables to test the availability of libraries, etc., which might not work
  – Compile on a KNL compute node?
    • Slow (and currently not working)
  – craype-haswell + CFLAGS=-axMIC-AVX512,CORE-AVX2

Don't Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here's the new bit:
    module swap craype-haswell craype-mic-knl
  – For KNL-specific executables, or
    CC -axMIC-AVX512,CORE-AVX2 ...
  – For Haswell/KNL portability

What about MCDRAM?

• What's different?
• How to compile
  – ... to use the new wide vector instructions
• What to link
• Making use of MCDRAM
  – High Bandwidth Memory

MCDRAM in a nutshell

• 16 GB on-chip memory
  – cf. 96 GB off-chip DDR (Cori)
• Not (exactly) a cache
  – Latency similar to DDR
• But very high bandwidth
  – ~5x DDR
• 2 ways to use it:
  – "Cache" mode: invisible to the OS, memory pages are cached in MCDRAM (cache-line granularity)
  – "Flat" mode: appears to the OS as a separate NUMA node with no local CPUs. Accessible via numactl, libnuma (page granularity)


MCDRAM in a nutshell - cache mode


MCDRAM in a nutshell - flat mode

How to use MCDRAM

• Option 1: Let the system figure it out
  – Cache mode, no changes to code, build procedure or run procedure
  – Most of the benefit, free, most of the time

How to use MCDRAM

• Option 2: Run-time settings only
  – Flat mode, no changes to code or build procedure
  – Does the whole job fit within 16 GB/node?
    • srun <options> numactl -m 1 ./myexec.exe
  – Too big?
    • srun <options> numactl -p 1 ./myexec.exe
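The choice between the two numactl flags in Option 2 can be scripted. The following is a minimal sketch, assuming the per-node memory footprint of the job is known in advance (in MB) and that MCDRAM is NUMA node 1, as in the commands on this slide. The function name `mcdram_numactl_args` and the 16000 MB default are illustrative assumptions, not part of any NERSC tool.

```shell
#!/bin/bash
# Pick numactl arguments for flat mode from an estimated per-node footprint.
# Hypothetical helper: the name and the default capacity are assumptions.
mcdram_numactl_args() {
    local footprint_mb=$1
    local mcdram_mb=${2:-16000}   # assumed usable MCDRAM per node, in MB
    local mcdram_node=${3:-1}     # NUMA node id of MCDRAM
    if [ "$footprint_mb" -le "$mcdram_mb" ]; then
        echo "-m $mcdram_node"    # membind: every allocation must come from MCDRAM
    else
        echo "-p $mcdram_node"    # preferred: fill MCDRAM first, then spill to DDR
    fi
}

# Example: a 12 GB job binds to MCDRAM, a 40 GB job only prefers it.
echo "srun -n 64 -c 4 numactl $(mcdram_numactl_args 12288) ./myexec.exe"
echo "srun -n 64 -c 4 numactl $(mcdram_numactl_args 40960) ./myexec.exe"
```

With `-m`, an allocation that does not fit in MCDRAM fails rather than falling back to DDR, so the threshold is best set with some headroom below the MCDRAM size that the node reports.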

How to use MCDRAM

• Option 3: Make your application NUMA-aware
  – Flat mode
  – Use libmemkind to explicitly allocate selected arrays in MCDRAM

[Figure: memkind software stack]
• memkind: NUMA-aware extensible heap manager
• jemalloc: malloc implementation emphasizing fragmentation avoidance and concurrency
• libnuma: API for NUMA allocation policy in the Linux kernel

    #include <hbwmalloc.h>
    malloc(size) -> hbw_malloc(size)

Using libmemkind in code

• C/C++: hbw_malloc() replaces malloc()

    #include <hbwmalloc.h>
    // malloc(size) -> hbw_malloc(size)

• Fortran

    !DIR$ MEMORY(bandwidth) a,b,c            ! cray
    real, allocatable :: a(:,:), b(:,:), c(:)
    !DIR$ ATTRIBUTES FASTMEM :: a,b,c        ! intel

• Caveat: only for dynamically-allocated arrays
  – Not local (stack) variables
  – Or Fortran pointers

Using libmemkind in code

• Which arrays to put in MCDRAM?
  – VTune memory-access measurements:
  – amplxe-cl -collect memory-access …

Building with libmemkind

• module load memkind
• (or module load cray-memkind)
• Compiler wrappers will add -lmemkind -ljemalloc -lnuma
• Fortran note: not all compilers support the FASTMEM directive
  – Currently Intel and maybe Cray

AutoHBW: Automatic memkind

• Uses array size to determine whether an array should be allocated to MCDRAM
• No code changes necessary!
• module load autohbw
• Link with -lautohbw

Run-time environment variables:

    export AUTO_HBW_SIZE=4K       # any allocation >4KB will be placed in MCDRAM
    export AUTO_HBW_SIZE=4K:8K    # allocations between 4KB and 8KB will be placed in MCDRAM
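To make the size rule concrete, here is a small sketch that mimics the decision described above for a given AUTO_HBW_SIZE value. It is only an illustration of the rule as stated on the slide, not the library's code: the function names are hypothetical, K/M/G are assumed to be powers of 1024, and sizes exactly at a limit are treated as outside the range, which the slide does not specify.

```shell
#!/bin/bash
# Convert a size such as 4K, 2M or 1G to bytes (plain numbers pass through).
# Assumption: K, M and G are binary multiples (1024-based).
to_bytes() {
    local value=$1
    case "$value" in
        *K) echo $(( ${value%K} * 1024 )) ;;
        *M) echo $(( ${value%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${value%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$value" ;;
    esac
}

# Print "MCDRAM" or "DDR" for an allocation of size_bytes under a
# spec of the form "low" or "low:high" (hypothetical helper).
autohbw_placement() {
    local size_bytes=$1
    local spec=$2
    local low high
    low=$(to_bytes "${spec%%:*}")
    if [ "$size_bytes" -le "$low" ]; then
        echo "DDR"
        return
    fi
    case "$spec" in
        *:*)
            high=$(to_bytes "${spec#*:}")
            if [ "$size_bytes" -ge "$high" ]; then
                echo "DDR"
                return
            fi
            ;;
    esac
    echo "MCDRAM"
}

autohbw_placement 6000 4K:8K      # inside the 4KB-8KB window
autohbw_placement 1000000 4K:8K   # above the upper limit
```

A window with an upper limit is useful when the largest arrays would not fit in 16 GB anyway and only mid-sized, bandwidth-sensitive arrays should go to MCDRAM.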

Don't Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here's the new bit:
    module swap craype-haswell craype-mic-knl
  – For KNL-specific executables, or
    CC -axMIC-AVX512,CORE-AVX2 ...
  – For Haswell/KNL portability
And:
• Think about MCDRAM
  – numactl, memkind, autohbw

A few final notes

• Edison executables (probably) won't work without a recompile
  – ISA-compatible, but …
  – Cori has a newer OS version and updated libraries
  – So: recompile for Cori
• KNL-optimized MKL uses libmemkind
  – Will need to link with -lmemkind -ljemalloc
  – Should be invisibly integrated in a future version

Running jobs on Cori KNL nodes

Zhengji Zhao

Agenda

• What's new on KNL nodes
• Process/thread/memory affinity
• Sample job scripts
• Summary

KNL overview and legend

Knights Landing Overview

Chip: 36 tiles interconnected by a 2D mesh
Tile: 2 cores + 2 VPU/core + 1 MB L2
Memory: MCDRAM: 16 GB on-package, high BW; DDR4: 6 channels @ 2400, up to 384 GB
IO: 36 lanes PCIe Gen3; 4 lanes of DMI for chipset
Node: 1-socket only
Fabric: Omni-Path on-package (not shown)
Vector peak perf: 3+ TF DP and 6+ TF SP flops
Scalar perf: ~3x over Knights Corner
STREAM Triad (GB/s): MCDRAM: 400+; DDR: 90+

[Figure: KNL package diagram. 36 tiles connected by a 2D mesh interconnect; each tile has 2 cores, 2 VPUs per core, 1 MB L2 and a CHA. Eight MCDRAM devices attach through EDCs, two DDR memory controllers drive 3 DDR4 channels each, and I/O is PCIe Gen 3 (2 x16, 1 x4) plus x4 DMI. Legend: a tile holds 2 cores, and each core holds 4 CPUs.]

Source: Intel. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. KNL data are preliminary based on current expectations and are subject to change without notice. (1) Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX). (2) Bandwidth numbers are based on a STREAM-like memory access pattern when MCDRAM is used as flat memory. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Omni-Path not shown.

1 socket/node
68 cores (272 CPUs)/node
36 tiles/node (34 active)
2 cores/tile; 4 CPUs/core
1.4 GB/core DDR memory
235 MB/core MCDRAM

• A Cori KNL node has 68 cores/272 CPUs running at 1.4 GHz, 96 GB DDR memory, and 16 GB of high-bandwidth (~5x DDR) on-package memory (MCDRAM)
• Three cluster modes (all-to-all, quadrant, sub-NUMA clustering) are available at boot time to configure the KNL mesh interconnect.

Uses Slurm's terminology of cores and CPUs (hardware threads).

KNL overview – MCDRAM modes

Memory Modes: three modes, selected at boot

Cache Mode (16 GB MCDRAM as a cache in front of 96 GB DDR)
• SW-transparent, memory-side cache
• Direct mapped, 64 B lines
• Tags part of line
• Covers whole DDR range
• No source code changes needed
• Misses are expensive

Flat Mode (16 GB MCDRAM and 96 GB DDR in one physical address space)
• MCDRAM as regular memory
• SW-managed
• Same address space
• Code changes required
• Exposed as a NUMA node
• Access via memkind library, job launchers, and/or numactl

Hybrid Mode (4 or 8 GB MCDRAM as cache, 8 or 12 GB MCDRAM as memory, 96 GB DDR)
• Part cache, part memory
• 25% or 50% cache
• Benefits of both
• Combination of the cache and flat modes

MCDRAM can be configured in three different modes at boot time: cache, flat, and hybrid.

What's new on KNL nodes (in comparison with Cori Haswell nodes, from the perspective of running jobs)

1. A lot more (slower) cores on the node
2. Much reduced per-core memory
3. Dynamically configurable NUMA and MCDRAM modes …

Proper process/thread/memory affinity is the basis for optimal performance

• Process affinity (or CPU pinning): bind an (MPI) process to a CPU or a range of CPUs on the node, so that the process executes within the designated CPUs instead of drifting to other CPUs on the node.
• Thread affinity: pin each thread of a process to a CPU or CPUs within the CPUs that are designated to the process.
  – Threads live in the process that owns them, so process and thread affinity are not separable.
• Memory affinity: restrict processes to allocating memory from the designated NUMA nodes only, rather than from any NUMA node.

The minimum goal of process/thread/memory affinity is to achieve the best resource utilization and to avoid NUMA performance penalties

• Spread MPI tasks and threads onto the cores and CPUs on the nodes as evenly as possible, so that no cores or CPUs are oversubscribed while others stay idle. This ensures that the resources available on the node, such as cores, CPUs, NUMA nodes, and memory and network bandwidth, are best utilized.
• Avoid accessing remote NUMA nodes as much as possible, to avoid the performance penalty.
• In the context of KNL, enable and control MCDRAM access.

Using srun's --cpu_bind option and OpenMP environment variables to achieve desired process/thread affinity

• Use srun --cpu_bind to bind tasks to CPUs
  – Often needs to work with the -c option of srun to evenly spread MPI tasks over the CPUs on the nodes
  – srun -c <n> (or --cpus-per-task=n) allocates (reserves) n CPUs per task (process)
  – --cpu_bind=[{verbose,quiet},]type, type: cores, threads, map_cpu:<list of CPUs>, mask_cpu:<list of masks>, none, …
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to pin each thread to a subset of the CPUs allocated to the host task
  – Different compilers may have different default values for them. The following are recommended, as they yield a more compatible thread affinity among the Intel, GNU and Cray compilers:
      OMP_PROC_BIND=true     # threads may not be moved between CPUs
      OMP_PLACES=threads     # each thread is placed on a single CPU
  – Use OMP_DISPLAY_ENV=true to display the OpenMP environment variables that are set (useful when checking the default compiler behavior)
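The -c values used by the sample job scripts in this deck (-n 64 -c 4 and -n 16 -c 16 on a 68-core node) are consistent with giving every task a whole number of cores and all 4 hardware threads of each core. The sketch below reproduces that arithmetic. It is an observation about the examples rather than a rule stated on the slides, and the function name `cpus_per_task` is illustrative.

```shell
#!/bin/bash
# Compute a value for srun -c from the number of MPI tasks per node.
# Assumptions: 68 cores per KNL node, 4 CPUs (hardware threads) per core,
# and no more tasks per node than cores.
cpus_per_task() {
    local tasks_per_node=$1
    local cores_per_node=${2:-68}
    local cpus_per_core=${3:-4}
    # Whole cores per task; integer division leaves any leftover cores idle.
    local cores_per_task=$(( cores_per_node / tasks_per_node ))
    echo $(( cores_per_task * cpus_per_core ))
}

echo "srun -n 64 -c $(cpus_per_task 64) --cpu_bind=cores ./a.out"
echo "srun -n 16 -c $(cpus_per_task 16) --cpu_bind=cores ./a.out"
```

Rounding down to whole cores keeps two tasks from sharing a core, at the cost of leaving a few cores unused when the task count does not divide 68.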

Using srun's --mem_bind option and/or numactl to achieve desired memory affinity

• Use srun --mem_bind for memory affinity
  – --mem_bind=[{verbose,quiet},]type: local, map_mem:<NUMA id list>, mask_mem:<NUMA mask list>, none, …
  – E.g., --mem_bind=map_mem:<MCDRAM NUMA id> when allocations fit into MCDRAM in flat mode
• Use numactl -p <NUMA id>
  – srun does not have this functionality currently (16.05.6); it will be supported in Slurm 17.02.
  – E.g., numactl -p <MCDRAM NUMA id> ./a.out, so that allocations that don't fit into MCDRAM spill over to DDR memory

Default Slurm behavior with respect to process/thread/memory binding

• By default, Slurm sets a decent CPU binding only when (MPI tasks per node) x (CPUs per task) = the total number of CPUs allocated per node, e.g., 68 x 4 = 272
• Otherwise, Slurm does not do anything with CPU binding. srun's --cpu_bind and -c options must be used explicitly to achieve optimal process/thread affinity.
• No default memory binding is set by Slurm. Processes can allocate memory from all NUMA nodes. --mem_bind (or numactl) should be used explicitly to set memory bindings.

Default Slurm behavior with respect to process/thread/memory binding (continued)

• The default distribution, the -m option of srun, is block:cyclic on Cori.
  – The cyclic distribution method distributes allocated CPUs for binding to a given task consecutively from the same socket, and from the next consecutive socket for the next task, in a round-robin fashion across sockets.
• -m block:block also works. You are encouraged to experiment with -m block:block, as some applications perform better with the block distribution.
  – The block distribution method distributes allocated CPUs consecutively from the same socket for binding to tasks, before using the next consecutive socket.
• The -m option is relevant to KNL nodes when they are configured in the sub-NUMA cluster modes, e.g., SNC2, SNC4, etc. Slurm treats "NUMA nodes with CPUs" as "sockets", although KNL is a single-socket node.

Available partitions and NUMA/MCDRAM modes on Cori KNL nodes (not finalized view yet)

• Same partitions as Haswell
  – #SBATCH -p regular
  – #SBATCH -p debug
  – Type sinfo -s for more info about partitions and nodes
• Use the -C knl,<NUMA>,<MCDRAM> option of sbatch to request KNL nodes with the desired features
  – #SBATCH -C knl,quad,flat
• Supports combinations of the following NUMA/MCDRAM modes:
  – AllowNUMA=a2a,snc2,snc4,hemi,quad
  – AllowMCDRAM=cache,split,equal,flat
  – quad,flat is the default for now (not finalized)
• Nodes can be rebooted automatically
  – Frequent reboots are not encouraged, as they currently take a long time
  – We are testing various memory modes in order to set a proper default mode

Example of running an interactive batch job with KNL nodes in the quad,cache mode

    zz217@gert01:~> salloc -N 1 -p debug -t 30:00 -C knl,quad,cache
    salloc: Granted job allocation 5545
    salloc: Waiting for resource configuration
    salloc: Nodes nid00044 are ready for job
    zz217@nid00044:~> numactl -H
    available: 1 nodes (0)
    node 0 cpus: 0 1 2 3 4 5 6 7 … 264 265 266 267 268 269 270 271
    node 0 size: 96757 MB
    node 0 free: 94207 MB
    node distances:
    node   0
      0:  10

• Run the numactl -H command to check whether the actual NUMA configuration matches the requested NUMA,MCDRAM mode
• The quad,cache mode has only 1 NUMA node, with all CPUs on NUMA node 0 (DDR memory)
• MCDRAM is hidden from the numactl -H command (it is a cache).

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=1   # optional*
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode. The srun command launches 64 MPI tasks on the node, allocating 4 CPUs per task, and binds processes to cores. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. Each MPI task may move within the 4 CPUs of its core.

[Figure: process affinity outcome. Each 2x2 box is a core with 4 CPUs (hardware threads), and the number shown in each CPU box is the CPU id: Core 0 holds CPUs 0, 68, 136 and 204, and Core 63 holds CPUs 63, 131, 199 and 267. Ranks 0-63 occupy Cores 0-63. The last 4 cores (Cores 64-67) are not used in this example. Cores 4-59 are not shown.]

*) The use of "export OMP_NUM_THREADS=1" is optional but recommended even for pure MPI codes. This avoids unexpected thread forking (compiler wrappers may link your code to the multi-threaded system-provided libraries by default).
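The CPU ids in the figure follow a regular pattern: on a 68-core node, core c holds CPUs c, c+68, c+136 and c+204 (for example, Core 0 holds 0, 68, 136, 204). A short sketch of that mapping, with illustrative function names and the 68-core, 4-thread layout as default assumptions:

```shell
#!/bin/bash
# List the CPU ids (hardware threads) that belong to one core,
# following the numbering shown in the figure.
core_cpus() {
    local core=$1
    local cores_per_node=${2:-68}
    local threads_per_core=${3:-4}
    local t=0
    local out=""
    while [ "$t" -lt "$threads_per_core" ]; do
        out="$out$(( core + t * cores_per_node )) "
        t=$(( t + 1 ))
    done
    echo "${out% }"
}

# Inverse mapping: the core that owns a given CPU id.
cpu_core() {
    echo $(( $1 % ${2:-68} ))
}

core_cpus 0     # CPUs of Core 0
core_cpus 63    # CPUs of Core 63
cpu_core 204    # core that owns CPU 204
```

This is handy when reading binding reports, since consecutive CPU ids belong to different cores, not to the same one.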

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=1   # optional*
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode. The srun command launches 16 MPI tasks on the node, allocating 16 CPUs per task, and binds each process to 4 cores/16 CPUs. Rank 0 is pinned to Cores 0-3, Rank 1 to Cores 4-7, …, Rank 15 to Cores 60-63. Each MPI task may move within the 16 CPUs of its 4 cores.

[Figure: process affinity outcome, using the same core and CPU-id layout as the previous example. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63. Cores 64-67 are unused.]

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

This job script requests 1 KNL node in the quad,cache mode to run 64 MPI tasks on the node, allocating 4 CPUs per task, and binds each task to the 4 CPUs allocated within its core. Each MPI task runs 4 OpenMP threads. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. The 4 threads of each task are pinned within the core. Depending on the compiler used to build the code, the 4 threads in each core may or may not move between the 4 CPUs.

[Figure: process affinity outcome. Ranks 0-63 occupy Cores 0-63, each with Threads 0-3 inside its core. Cores 64-67 are unused.]

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

With the above two OpenMP environment variables, each thread is pinned to a single CPU within each core. E.g., for Rank 0, Thread 0 is pinned to CPU 0, Thread 1 to CPU 68, Thread 2 to CPU 136, and Thread 3 to CPU 204.

[Figure: process/thread affinity outcome. Ranks 0-63 occupy Cores 0-63, with Threads 0-3 of each rank on the 4 CPUs of its core. Cores 64-67 are unused.]

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

Depending on the compiler implementation, the 8 threads in each task may or may not move between the 4 cores/16 CPUs allocated to the host task.

[Figure: process affinity outcome. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63, each running Threads 0-7. Cores 64-67 are unused.]

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,cache

    export OMP_PROC_BIND=true
    export OMP_PLACES=threads
    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

With the above two OpenMP environment variables, each thread is pinned to a single CPU on the cores allocated to the task. E.g., for Rank 0, Thread 0 is pinned to CPU 0 (on Core 0), Thread 1 to CPU 1 (on Core 1), Thread 2 to CPU 2 (on Core 2), and so on.

[Figure: process/thread affinity outcome. Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63, with Threads 0-7 of each rank pinned to individual CPUs. Cores 64-67 are unused.]

Example of running under the quad,flat mode interactively

    zz217@gert01:~> salloc -p debug -t 30:00 -C knl,quad,flat
    zz217@nid00037:~> numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 … 264 265 266 267 268 269 270 271
    node 0 size: 96759 MB
    node 0 free: 94208 MB
    node 1 cpus:
    node 1 size: 16157 MB
    node 1 free: 16091 MB
    node distances:
    node   0   1
      0:  10  31
      1:  31  10

    zz217@cori10:~> scontrol show node nid10388
    NodeName=nid10388 Arch=x86_64 CoresPerSocket=68
    CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
    AvailableFeatures=knl,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
    ActiveFeatures=knl,cache,quad
    …
    State=IDLE ThreadsPerCore=4 …
    BootTime=2016-10-31T13:43:12 …
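In flat mode the MCDRAM shows up in the numactl -H listing as the NUMA node with an empty CPU list (node 1 above). A small sketch that extracts that node id from numactl -H style output, so it does not have to be hard-coded in numactl -m or -p commands. The function name `mcdram_nodes` is illustrative, and the sample input is an abbreviated copy of the listing above.

```shell
#!/bin/bash
# Read `numactl -H` style output on stdin and print the id of every
# NUMA node that has no CPUs (in flat mode, that is the MCDRAM node).
mcdram_nodes() {
    # A CPU-less node is reported as "node <id> cpus:" with nothing after it.
    awk '$1 == "node" && $3 == "cpus:" && NF == 3 { print $2 }'
}

# On a compute node one would run:  numactl -H | mcdram_nodes
# Here the function is fed an abbreviated sample instead.
mcdram_nodes <<'EOF'
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 96759 MB
node 1 cpus:
node 1 size: 16157 MB
EOF
```

In cache mode the function prints nothing, because the only NUMA node listed has CPUs, which is itself a quick way to tell the two modes apart.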

Sample job script to run under the quad,flat mode

Sample job script (MPI+OpenMP)

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH -p regular
    #SBATCH -t 1:00:00
    #SBATCH -C knl,quad,flat

    export OMP_NUM_THREADS=4
    srun -n 64 -c 4 --cpu_bind=cores ./a.out

    export OMP_NUM_THREADS=8
    srun -n 16 -c 16 --cpu_bind=cores ./a.out

[Figure: process affinity outcome for the two srun commands. In the first, Ranks 0-63 occupy Cores 0-63, one core each. In the second, Rank 0 spans Cores 0-3 and Rank 15 spans Cores 60-63.]

Page 51: Using Cori - National Energy Research Scientific … · Using Cori. How to Compile & MCDRAM ... jemalloc libnuma API for NUMA allocaon policy in Linux kernel Malloc implementaon emphasizing

Sample job script to run under the quad,flat mode

SampleJobscript(MPI+OpenMP)#!/bin/bash-l#SBATCH–N1#SBATCH–pregular#SBATCH–t1:00:00#SBATCH-Cknl,quad,flatexportOMP_PROC_BIND=trueexportOMP_PLACES=threadsexportOMP_NUM_THREADS=4srun-n64-c4--cpu_bind=cores./a.outexportOMP_NUM_THREADS=8srun–n16–c16--cpu_bind=cores./a.out

Process/thread affinity outcome

[Figure: the same core/CPU layout as on the previous slide (core N holds CPUs N, N+68, N+136, N+204). For the 16-rank run, the figure marks threads 0-7 of a rank on individual CPUs within the cores given to that rank. For the 64-rank run, ranks 0-63 each occupy one core.]
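The CPU numbers in the figure follow a regular pattern: the hardware threads of core N are N, N+68, N+136 and N+204. A small sketch of that mapping, assuming a 68-core node with 4 hardware threads per core (core_cpus is a hypothetical name):

```shell
# Hypothetical helper: list the CPU (hardware thread) ids that belong to one core.
# Assumption: 68 cores per node, 4 hardware threads per core, numbered as in the figure.
core_cpus() {
  core=$1
  ncores=${2:-68}
  hw_threads=${3:-4}
  t=0
  out=""
  while [ "$t" -lt "$hw_threads" ]; do
    # Hardware thread t of a core sits ncores ids above thread t-1.
    out="$out${out:+ }$(( core + t * ncores ))"
    t=$(( t + 1 ))
  done
  echo "$out"
}
```

This can help when reading --cpu_bind=verbose output or when building a map_cpu list by hand.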


Sample job script to run under the quad,flat mode using MCDRAM

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# When the memory footprint fits in the 16 GB of MCDRAM (NUMA node 1), run entirely from MCDRAM
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 64 -c 4 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out
# or using numactl -m
srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./a.out

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# Prefer MCDRAM (NUMA node 1); if the memory footprint does not fit in MCDRAM, allocations spill to DDR
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 16 -c 16 --cpu_bind=cores numactl -p 1 ./a.out
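The choice between numactl -m 1 and numactl -p 1 on this slide depends on whether the job's per-node memory footprint fits in MCDRAM. A rough sketch of that decision follows; the helper name mcdram_policy is hypothetical, and the 16157 MB default is an assumption taken from the node 1 size in the earlier numactl -H output.

```shell
# Hypothetical helper: choose a numactl policy for quad,flat mode from an
# estimated per-node memory footprint in MB.
mcdram_policy() {
  footprint_mb=$1
  mcdram_mb=${2:-16157}   # assumption: size of NUMA node 1 reported by numactl -H
  if [ "$footprint_mb" -le "$mcdram_mb" ]; then
    echo "numactl -m 1"   # bind: all allocations come from MCDRAM
  else
    echo "numactl -p 1"   # prefer: fill MCDRAM first, then spill to DDR
  fi
}
```

Some MCDRAM is already in use on an idle node (the output above shows less free than total), so leaving headroom below the default would be prudent.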


How to check the process/thread affinity

•  Use the srun flag --cpu_bind=verbose
–  Need to read the CPU masks in hexadecimal format

•  Use the Cray-provided code xthi.c (see backup slides)

•  Use --mem_bind=verbose,<type> to check memory affinity

•  Use the numastat -p <PID> command to confirm while a job is running

•  Use environment variables (Slurm, compiler specific)
–  SLURM_CPU_BIND_VERBOSE: --cpu_bind verbosity (quiet, verbose)
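Reading the hexadecimal CPU masks by eye is error-prone on a node with 272 CPUs. The sketch below turns a mask into a list of CPU ids; mask_to_cpus is a hypothetical helper, and it assumes the usual convention that bit i of the mask corresponds to CPU i.

```shell
# Hypothetical helper: convert a hexadecimal CPU mask (with or without 0x) into CPU ids.
# Assumption: bit i of the mask stands for CPU i. Works digit by digit, so masks
# wider than 64 bits (272 CPUs on KNL) are handled.
mask_to_cpus() {
  printf '%s\n' "$1" | awk '
    {
      hex = tolower($0)
      sub(/^0x/, "", hex)
      n = length(hex)
      out = ""
      # Walk the digits from the rightmost (least significant); digit i covers CPUs 4i..4i+3.
      for (i = 0; i < n; i++) {
        d = index("0123456789abcdef", substr(hex, n - i, 1)) - 1
        for (b = 0; b < 4; b++) {
          if (int(d / 2^b) % 2 == 1) {
            sep = (out == "") ? "" : ","
            out = out sep (4 * i + b)
          }
        }
      }
      print out
    }'
}
```

For instance, mask_to_cpus 0xF prints 0,1,2,3.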


A few useful commands

•  sinfo --format="%F %b" for available features of nodes, or sinfo --format="%C %b"
–  A/I/O/T (allocated/idle/other/total)

•  scontrol show node <nid>
•  scontrol show job <jobid> -ddd
•  sinfo -s to see available partitions and nodes
•  sbatch, srun, squeue, sinfo and other Slurm command man pages
–  Need to distinguish job allocation time (#SBATCH) from job step creation time (srun within a job script)
–  Some options are only available at job allocation time, such as --ntasks-per-core; some only work when certain plugins are enabled


Summary

•  Use -C knl,<NUMA>,<MCDRAM> to request KNL nodes with the same partitions as Haswell nodes (debug or regular)

•  Always explicitly use srun's --cpu_bind and -c options to spread the MPI tasks evenly over the cores/CPUs on the nodes

•  Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to fine-pin threads to the CPUs allocated to the tasks

•  Use srun's --mem_bind and numactl -p to control memory affinity and access MCDRAM
–  The memkind/autoHBW libraries can be used to allocate only selected arrays/memory allocations to MCDRAM (Steve's talk)


Summary (2)

•  Consider using 64 cores out of 68 in most cases
•  More sample job scripts can be found on our website
–  http://www.nersc.gov/users/computational-systems/cori/running-jobs/running-jobs-on-cori-knl-nodes/

•  We have provided a job script generator to help you generate batch job scripts for KNL (and Haswell, Edison)
–  https://my.nersc.gov/script_generator.php

•  Slurm KNL features are in continuous development and some instructions are subject to change


Backup slides


Cray provided a code to check process/thread affinity: xthi.c (http://portal.nersc.gov/project/training/KNLUserTraing20161103/UsingCori/xthi.c/)

•  To compile:
cc -qopenmp -o xthi.intel xthi.c    # Intel compilers
cc -fopenmp -o xthi.gnu xthi.c      # GNU compilers
cc -o xthi.cray xthi.c              # Cray compilers

•  To run:
salloc -N 1 -p debug -C knl,quad,flat    # start an interactive 1-node job
…
export OMP_DISPLAY_ENV=true    # display the environment variables used/set by the OpenMP runtime
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=verbose,cores xthi.intel     # run 64 tasks with 4 threads each
srun -n 16 -c 16 --cpu_bind=verbose,cores xthi.intel    # run 16 tasks with 4 threads each
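When xthi.c is not at hand, a quick process-level check is possible from a shell script on Linux. This is only a sketch and not a replacement for xthi.c (it reports the process affinity, not per-thread placement); allowed_cpus is a hypothetical name, and it relies on the Cpus_allowed_list field of /proc/self/status.

```shell
# Hypothetical helper: print the CPU list the current process is allowed to run on.
# Assumption: Linux, where /proc/self/status contains a "Cpus_allowed_list" line.
allowed_cpus() {
  while IFS=: read -r key value; do
    if [ "$key" = "Cpus_allowed_list" ]; then
      # Unquoted on purpose: word splitting drops the tab that follows the colon.
      echo $value
      return 0
    fi
  done < /proc/self/status
  return 1
}
```

Saved in a small script that ends with echo "rank $SLURM_PROCID: $(allowed_cpus)", it could be launched with the same srun options as the application to see which CPUs each task received.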


sinfo --format="%F %b" to show the number of nodes with active features

zz217@cori01:~> date
Mon Nov 28 15:10:36 PST 2016
zz217@cori01:~> sinfo --format="%F %b"
NODES(A/I/O/T) ACTIVE_FEATURES
1805/157/42/2004 haswell
0/107/1/108 knl,flat,a2a
0/0/1/1 knl
0/225/6/231 knl,flat,quad
2500/1942/60/4502 knl,cache,quad
2676/995/6/3677 quad,cache,knl
768/0/0/768 snc4,flat,knl
0/8/0/8 quad,flat,knl
0/8/1/9 snc2,flat,knl

This command shows that there are 8179 KNL nodes in quad,cache mode; 239 nodes in quad,flat mode; 768 nodes in snc4,flat mode; and 9 in snc2,flat mode.

The order of the features does not make a difference.
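Because the same feature set can appear in different orders (knl,cache,quad and quad,cache,knl), the per-mode totals quoted above have to be added up across rows. A sketch of doing that automatically follows; nodes_by_mode is a hypothetical helper that assumes the two-column output shown on this slide, with one header line.

```shell
# Hypothetical helper: sum the total node count (the T in A/I/O/T) per feature set,
# treating feature lists that differ only in order as the same set.
# Assumption: input is `sinfo --format="%F %b"` output with a single header line.
nodes_by_mode() {
  awk 'NR > 1 {
         split($1, counts, "/")        # counts[4] is the total
         n = split($2, feats, ",")
         # Insertion sort of the feature names so that order does not matter.
         for (i = 2; i <= n; i++) {
           f = feats[i]
           for (j = i - 1; j >= 1 && feats[j] > f; j--) feats[j + 1] = feats[j]
           feats[j + 1] = f
         }
         key = feats[1]
         for (i = 2; i <= n; i++) key = key "," feats[i]
         total[key] += counts[4]
       }
       END { for (k in total) print k, total[k] }' | sort
}
```

Usage would be sinfo --format="%F %b" | nodes_by_mode; each output line is a sorted feature list followed by its node total.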


The --cpu_bind option of srun enables CPU bindings

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 ./a.out    # no CPU bindings; tasks can move around within the 68 cores/272 CPUs
srun -n 4 --cpu_bind=cores ./a.out
srun -n 4 --cpu_bind=threads ./a.out
srun -n 4 --cpu_bind=map_cpu:0,204,67,271 ./a.out

[Figure: one panel per srun command, each showing cores 0-67 (4 CPUs per core) and where ranks 0-3 are placed.]


The --cpu_bind option: the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 16 -c 16 --cpu_bind=cores ./a.out
srun -n 4 -c 8 --cpu_bind=cores ./a.out

[Figure: rank placement on cores (4 CPUs per core) for the two srun commands.]
-n 4 -c 8: ranks 0-3 on cores 0-7. The first 8 cores/32 CPUs are used; the remaining 60 cores/240 CPUs stay idle.
-n 16 -c 16: ranks 0-15 on cores 0-63. The last 4 cores/16 CPUs are idle.


The --cpu_bind option (continued): the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 16 -c 16 --cpu_bind=threads ./a.out
srun -n 4 -c 8 --cpu_bind=threads ./a.out

[Figure: rank placement on cores (4 CPUs per core) for the two srun commands.]
-n 4 -c 8: ranks 0-3 on cores 0-7. The first 8 cores/32 CPUs are used; the remaining 60 cores/240 CPUs stay idle.
-n 16 -c 16: ranks 0-15 on cores 0-63. The last 4 cores/16 CPUs are idle.


The -c option: --cpu_bind=cores vs --cpu_bind=threads

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 6 --cpu_bind=threads ./a.out
srun -n 4 -c 6 --cpu_bind=cores ./a.out

[Figure: two panels showing ranks 0-3 on cores 0-7 (4 CPUs per core).]
Panel with 32 CPUs in use (4 tasks x 2 whole cores, consistent with --cpu_bind=cores): the first 8 cores/32 CPUs are used; the remaining 60 cores/240 CPUs stay idle.
Panel with 24 CPUs in use (4 tasks x 6 CPUs, consistent with --cpu_bind=threads): the first 6 cores/24 CPUs are used; the remaining 62 cores/248 CPUs stay idle.
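The CPU counts on this slide and the two before it follow from simple arithmetic: binding to cores rounds each task's -c value up to whole cores, while binding to threads uses exactly -c CPUs per task. A sketch of that arithmetic, as a simplified model rather than a description of Slurm internals (cpus_claimed is a hypothetical name, 4 hardware threads per core assumed):

```shell
# Hypothetical helper: number of CPUs claimed on a node for ntasks tasks with -c c.
# Simplified model: "cores" rounds c up to a multiple of the hardware threads per
# core; "threads" uses c as given.
cpus_claimed() {
  ntasks=$1
  c=$2
  mode=$3
  hw_threads=${4:-4}
  case "$mode" in
    cores)   per_task=$(( (c + hw_threads - 1) / hw_threads * hw_threads )) ;;
    threads) per_task=$c ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo $(( ntasks * per_task ))
}
```

With -n 4 -c 6 this gives 32 CPUs for cores and 24 for threads, the two totals shown in the figure captions.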


Snc2,flat (salloc -N 1 -p regular -C knl,snc2,flat)

zz217@nid11512:~> numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
node 0 size: 48293 MB
node 0 free: 46047 MB
node 1 cpus: 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 1 size: 48466 MB
node 1 free: 44949 MB
node 2 cpus:
node 2 size: 8079 MB
node 2 free: 7983 MB
node 3 cpus:
node 3 size: 8077 MB
node 3 free: 7980 MB
node distances:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  41  31
  2:  31  41  10  41
  3:  41  31  41  10

[Figure: NUMA node 0 holds cores 0-33 and NUMA node 1 holds cores 34-67; NUMA nodes 2 and 3 have memory but no CPUs.]


The default distribution on Cori is -m block:cyclic

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 4 --cpu_bind=cores -m block:cyclic ./a.out

[Figure: two KNL nodes, each with NUMA node 0 (cores 0-33) and NUMA node 1 (cores 34-67), 4 CPUs per core. Ranks 0-5 are placed on KNL node 1 and ranks 6-10 on KNL node 2.]
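A toy model of what -m block:cyclic means for this example: consecutive ranks fill one node before the next (block), and within a node consecutive ranks alternate between NUMA domains (cyclic). The sketch below is an assumption-laden illustration, not Slurm's actual algorithm; place_rank is a hypothetical name, nodes and NUMA domains are numbered from 0, and uneven task counts are assumed to give the extra tasks to the first nodes.

```shell
# Hypothetical helper: print "<node> <numa>" for a rank under a simplified
# block (across nodes) : cyclic (across NUMA domains) model.
# Assumptions: 0-based numbering; when ntasks is not a multiple of nnodes,
# the first (ntasks % nnodes) nodes each take one extra task.
place_rank() {
  rank=$1
  ntasks=$2
  nnodes=$3
  numa_per_node=${4:-2}
  base=$(( ntasks / nnodes ))
  extra=$(( ntasks % nnodes ))
  cut=$(( extra * (base + 1) ))      # ranks below this live on the larger nodes
  if [ "$rank" -lt "$cut" ]; then
    node=$(( rank / (base + 1) ))
    lrank=$(( rank % (base + 1) ))
  else
    node=$(( extra + (rank - cut) / base ))
    lrank=$(( (rank - cut) % base ))
  fi
  echo "$node $(( lrank % numa_per_node ))"
}
```

For the 11-task, 2-node example this puts ranks 0-5 on the first node and ranks 6-10 on the second, alternating NUMA domains within each node.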


The default distribution on Cori is -m block:cyclic (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 8 --cpu_bind=cores -m block:cyclic ./a.out

[Figure: two KNL nodes, each with NUMA node 0 (cores 0-33) and NUMA node 1 (cores 34-67), 4 CPUs per core. Ranks 0-5 are placed on KNL node 1 and ranks 6-10 on KNL node 2, each rank spanning two cores.]


Block distribution: -m block:block

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores -m block:block ./a.out

[Figure: two KNL nodes, each with NUMA node 0 (cores 0-33) and NUMA node 1 (cores 34-67), 4 CPUs per core. Ranks 0-5 are placed on KNL node 1 and ranks 6-10 on KNL node 2.]


Block distribution: -m block:block (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores -m block:block ./a.out

[Figure: two KNL nodes, each with NUMA node 0 (cores 0-33) and NUMA node 1 (cores 34-67), 4 CPUs per core, showing the placement of ranks 0-10 across the two nodes.]


Sample job script to run MPI+OpenMP under the snc2,flat mode

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,snc2,flat

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 2 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=local ./a.out

# using 2 CPUs (hardware threads) per core using MCDRAM
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:2,3 ./a.out
# or
srun -n 16 -c 16 --cpu_bind=cores numactl -m 2,3 ./a.out

# using 2 CPUs (hardware threads) per core with MCDRAM preferred
srun -n 16 -c 16 --cpu_bind=cores numactl -p 2,3 ./a.out

# using 4 CPUs (hardware threads) per core with MCDRAM preferred
srun -n 32 -c 8 --cpu_bind=cores numactl -p 2,3 ./a.out


Sample job script to run large jobs (> 1500 MPI tasks) (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 100
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads
# using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
srun --bcast=/tmp/a.out -n 6400 -c 4 --cpu_bind=cores --mem_bind=local ./a.out
# or
# using 4 CPUs (hardware threads) per core with MCDRAM preferred
sbcast ./a.out /tmp/a.out    # first copy a.out to /tmp on each allocated compute node
srun -n 6400 -c 4 --cpu_bind=cores numactl -p 1 /tmp/a.out
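The -n 6400 in this script is 100 nodes times 64 tasks per node, and the slide title puts the "large job" threshold at 1500 MPI tasks. A small sketch that encodes both, with hypothetical helper names and the 1500 threshold taken from the slide title as an assumption:

```shell
# Hypothetical helpers for sizing a large job.
# total_tasks: value for srun -n given node count and tasks per node.
total_tasks() {
  echo $(( $1 * $2 ))
}

# wants_sbcast: succeeds when the task count exceeds the threshold at which
# copying the executable to node-local /tmp is suggested (assumed 1500).
wants_sbcast() {
  [ "$1" -gt "${2:-1500}" ]
}
```

In a job script this could guard the copy step, for example: if wants_sbcast "$(total_tasks 100 64)"; then sbcast ./a.out /tmp/a.out; fi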


Sample job script to use core specialization (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat
#SBATCH -S 1

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads
# using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out
# or
# using 4 CPUs (hardware threads) per core with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out


Sample job script to run with Intel MPI (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads
module load impi
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
#export I_MPI_FABRICS=shm:tcp

# using 4 CPUs (hardware threads) per core using DDR memory (or using MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

# or
# using 4 CPUs (hardware threads) per core with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out


Thank you!
