
Steve Leak and Zhengji Zhao
NESAP Hack-a-thon
November 29, 2016, Berkeley CA

Using Cori

How to Compile & MCDRAM

Steve Leak

Building for Cori KNL nodes

• What’s different?
• How to compile
– .. to use the new wide vector instructions
• What to link
• Making use of MCDRAM
– High Bandwidth Memory

Building for Cori KNL nodes

Don’t Panic (much)

KNL can run Haswell executables

But ...

Haswell executables can’t fully use KNL hardware:
• AVX2 (Haswell): operation on 4 DP words per instruction
• AVX-512 (KNL): hardware can compute 8 DP words per instruction

And ...

KNL relies more on vectorization for performance
[Figure: roofline plot of GFLOPS vs. arithmetic intensity]

And ...

KNL memory hierarchy is more complicated

How to compile

Best: use compiler options to build for KNL
module swap craype-haswell craype-mic-knl
• The loaded craype-* module sets the target that the compiler wrappers (cc, CC, ftn) build for
– E.g. -mknl (GNU compiler), -hmic-knl (Cray compiler)
• craype-haswell is the default on login nodes
• craype-mic-knl is for KNL nodes

How to compile

Best: compiler settings to target KNL
Alternate: CC -axMIC-AVX512,CORE-AVX2 <more-options> mycode.c++
• Only valid when using Intel compilers (cc, CC or ftn)
• -ax<arch> adds “alternate execution paths” optimized for different architectures
– Makes 2 (or more) versions of code in the same object file
• NOT AS GOOD as the craype-mic-knl module
– (the module causes versions of libraries built for that architecture to be used, e.g. MKL)

How to compile

Recommendations:
• For best performance, use the craype-mic-knl module:
module swap craype-haswell craype-mic-knl
CC -O3 -c myfile.c++
• If the same executable must run on KNL and Haswell nodes, use craype-haswell but add a KNL-optimized execution path:
CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++
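Put together, the two recommended workflows might look like the following sketch (file and executable names are placeholders; the Cray compiler wrappers and craype modules are assumed):

```shell
# (1) KNL-only executable: retarget the compiler wrappers, then build.
module swap craype-haswell craype-mic-knl
CC -O3 -c myfile.c++
CC -O3 -o myapp.knl myfile.o      # wrappers also pick KNL builds of Cray-provided libraries

# (2) One executable for both Haswell and KNL: stay on craype-haswell,
#     add the KNL execution path (Intel compilers only).
CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++
CC -axMIC-AVX512,CORE-AVX2 -O3 -o myapp.both myfile.o
```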

What to link

Utility libraries
• Not performance-critical (by definition)
– KNL can run Xeon binaries .. can use Haswell-targeted versions
• I/O libraries (HDF5, NetCDF, etc) should fit in this category too
– (for Cray-provided libraries, the compiler wrapper will use craype-* to select the best build anyway)

What to link

Performance-critical libraries
• MKL: has KNL-targeted optimizations
– Note: need to link with -lmemkind (more soon)
• PETSc, SLEPc, Caffe, Metis, etc:
– (soon) have KNL-targeted builds
• Module files will use craype-{haswell,mic-knl} to find the appropriate library
• Key points:
– Someone else has already prepared libraries for KNL
– No need to do it yourself
– Load the right craype- module

What to link

• NERSC convention: /usr/common/software/<name>/<version>/<arch>/[<PrgEnv>]
• E.g.:
/usr/common/software/petsc/3.7.2/hsw/intel
/usr/common/software/petsc/3.7.2/knl/intel
• The knl subfolder may be a symlink to hsw
– Libraries compiled with -axMIC-AVX512,CORE-AVX2
• Module files should do the right thing™
– Using CRAY_CPU_TARGET, set by craype-{haswell,mic-knl}

Where to build

• Mostly: on the login nodes
– KNL is designed for scalable, vectorized workloads
– Compiling is neither!
• Will probably be much slower on a KNL node than on a Xeon node
• Cross-compiling
– You are compiling for a Xeon Phi (KNL) target, on a Xeon host
• Tools like autoconf (./configure) may try to build-and-run small executables to test availability of libraries, etc. .. which might not work
– Compile on a KNL compute node?
• Slow (and currently not working)
– craype-haswell + CFLAGS=-axMIC-AVX512,CORE-AVX2
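For autoconf packages, one way around the build-and-run problem is to configure on the Haswell login node while injecting the KNL code path through CFLAGS; a sketch, assuming the package honors CC/CFLAGS (the package name and install prefix are placeholders):

```shell
# configure's test programs run on the Haswell login node
# (craype-haswell is the default there); -ax adds the KNL path.
export CC=cc CXX=CC FC=ftn
export CFLAGS="-axMIC-AVX512,CORE-AVX2 -O3"
./configure --prefix=$HOME/software/mypkg
make -j 8
make install
```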

Don’t Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here’s the new bit:
• module swap craype-haswell craype-mic-knl
– For KNL-specific executables, or
• CC -axMIC-AVX512,CORE-AVX2 ...
– For Haswell/KNL portability

What about MCDRAM?

• What’s different?
• How to compile
– .. to use the new wide vector instructions
• What to link
• Making use of MCDRAM
– High Bandwidth Memory

MCDRAM in a nutshell

• 16 GB on-chip memory
– cf. 96 GB off-chip DDR (Cori)
• Not (exactly) a cache
– Latency similar to DDR
• But very high bandwidth
– ~5x DDR
• 2 ways to use it:
– “Cache” mode: invisible to the OS, memory pages are cached in MCDRAM (cache-line granularity)
– “Flat” mode: appears to the OS as a separate NUMA node, with no local CPUs. Accessible via numactl, libnuma (page granularity)

MCDRAM in a nutshell - cache mode

MCDRAM in a nutshell - flat mode

How to use MCDRAM

• Option 1: Let the system figure it out
– Cache mode, no changes to code, build procedure or run procedure
– Most of the benefit, free, most of the time

How to use MCDRAM

• Option 2: Run-time settings only
– Flat mode, no changes to code or build procedure
– Does the whole job fit within 16 GB/node?
• srun <options> numactl -m 1 ./myexec.exe
– Too big?
• srun <options> numactl -p 1 ./myexec.exe
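Spelled out for a quad,flat node, where MCDRAM appears as NUMA node 1 (a sketch; ./myexec.exe and the srun geometry are placeholders):

```shell
# Whole job fits in 16 GB/node: bind all allocations to MCDRAM
# (allocations fail if MCDRAM overflows).
srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./myexec.exe

# Too big: prefer MCDRAM, let the rest spill over to DDR.
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./myexec.exe
```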

How to use MCDRAM

• Option 3: Make your application NUMA-aware
– Flat mode
– Use libmemkind to explicitly allocate selected arrays in MCDRAM

memkind software stack:
• memkind: NUMA-aware extensible heap manager
• jemalloc: malloc implementation emphasizing fragmentation avoidance and concurrency
• libnuma: API for the NUMA allocation policy in the Linux kernel

#include <hbwmalloc.h>
malloc(size) -> hbw_malloc(size)

Using libmemkind in code

• C/C++: hbw_malloc() replaces malloc()
#include <hbwmalloc.h> // malloc(size) -> hbw_malloc(size)
• Fortran:
!DIR$ MEMORY(bandwidth) a,b,c ! cray
real, allocatable :: a(:,:), b(:,:), c(:)
!DIR$ ATTRIBUTES FASTMEM :: a,b,c ! intel
• Caveat: only for dynamically-allocated arrays
– Not local (stack) variables
– Or Fortran pointers

Using libmemkind in code

• Which arrays to put in MCDRAM?
– Use VTune memory-access measurements:
– amplxe-cl -collect memory-access …

Building with libmemkind

• module load memkind
• (or module load cray-memkind)
• Compiler wrappers will add -lmemkind -ljemalloc -lnuma
• Fortran note: not all compilers support the FASTMEM directive
– Currently Intel, and maybe Cray

AutoHBW: Automatic memkind

• Uses array size to determine whether an array should be allocated in MCDRAM
• No code changes necessary!
• module load autohbw
• Link with -lautohbw
Runtime environment variables:
export AUTO_HBW_SIZE=4K    # any allocation >4KB will be placed in MCDRAM
export AUTO_HBW_SIZE=4K:8K # allocations between 4KB and 8KB will be placed in MCDRAM
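An AutoHBW session might look like this sketch (myfile.o and myapp are placeholder names; only a relink is needed):

```shell
module load autohbw
CC -O3 -o myapp myfile.o -lautohbw   # relink only, no source changes

export AUTO_HBW_SIZE=4K              # any allocation >4KB goes to MCDRAM
srun -n 64 -c 4 --cpu_bind=cores ./myapp
```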

Don’t Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here’s the new bit:
• module swap craype-haswell craype-mic-knl
– For KNL-specific executables, or
• CC -axMIC-AVX512,CORE-AVX2 ...
– For Haswell/KNL portability
And:
• Think about MCDRAM
– numactl, memkind, autohbw

A few final notes

• Edison executables (probably) won’t work without a recompile
– ISA-compatible, but …
– Cori has a newer OS version and updated libraries
– So: recompile for Cori
• KNL-optimized MKL uses libmemkind
– Will need to link with -lmemkind -ljemalloc
– Should be invisibly integrated in a future version

Running jobs on Cori KNL nodes

Zhengji Zhao

Agenda

• What’s new on KNL nodes
• Process/thread/memory affinity
• Sample job scripts
• Summary

KNL overview and legend: Knights Landing overview

Chip: 36 tiles interconnected by a 2D mesh
Tile: 2 cores + 2 VPU/core + 1 MB L2
Memory: MCDRAM: 16 GB on-package, high BW; DDR4: 6 channels @ 2400, up to 384 GB
IO: 36 lanes PCIe Gen3; 4 lanes of DMI for the chipset
Node: 1 socket only
Fabric: Omni-Path on-package (not shown)
Vector peak perf: 3+ TF DP and 6+ TF SP flops
Scalar perf: ~3x over Knights Corner
STREAM Triad (GB/s): MCDRAM: 400+; DDR: 90+

[Figure: KNL tile: 2 cores, each with 2 VPUs, sharing 1 MB L2 and a CHA, on the package.]
[Figure: KNL die: 36 tiles connected by a 2D mesh interconnect; 8 MCDRAM devices (EDCs) on package; 2 x 3 DDR4 channels; PCIe Gen3 (2 x16, 1 x4) and x4 DMI.]
Source Intel: all products, computer systems, dates and figures specified are preliminary, based on current expectations, and subject to change without notice. Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX). Bandwidth numbers are based on a STREAM-like memory access pattern with MCDRAM used as flat memory, estimated from internal Intel analysis and provided for informational purposes only.

[Figure: legend: each tile holds 2 cores; each core has 4 CPUs (hardware threads).]
1 socket/node; 68 cores (272 CPUs)/node; 36 tiles/node (34 active); 2 cores/tile; 4 CPUs/core
1.4 GB/core DDR memory; 235 MB/core MCDRAM

• A Cori KNL node has 68 cores / 272 CPUs running at 1.4 GHz, 96 GB DDR memory, and 16 GB of high-bandwidth (~5x DDR) on-package memory (MCDRAM)
• Three cluster modes (all-to-all, quadrant, sub-NUMA clustering) are available at boot time to configure the KNL mesh interconnect.
We use Slurm’s terminology of cores and CPUs (hardware threads).

KNL overview – MCDRAM modes

Memory modes: three modes, selected at boot

• Cache mode: all 16 GB of MCDRAM sit in front of the DDR
– SW-transparent, memory-side cache
– Direct mapped, 64B lines; tags part of line
– Covers the whole DDR range
– No source code changes needed
– Misses are expensive
• Flat mode: all 16 GB of MCDRAM sit beside the DDR in the physical address space
– MCDRAM as regular memory, SW-managed, same address space
– Exposed as a NUMA node
– Code changes required: access via the memkind library, job launchers, and/or numactl
• Hybrid mode: part cache (4 or 8 GB of MCDRAM), part memory (the remaining 8 or 12 GB)
– 25% or 50% cache; benefits of both
– Combination of the cache and flat modes

MCDRAM can be configured in three different modes at boot time: cache, flat, and hybrid.

What’s new on KNL nodes (in comparison with Cori Haswell nodes from the perspective of running jobs)

1. A lot more (slower) cores on the node
2. Much reduced per-core memory
3. Dynamically configurable NUMA and MCDRAM modes …

A proper process/thread/memory affinity is the basis for optimal performance

• Process affinity (or CPU pinning): bind an (MPI) process to a CPU or a range of CPUs on the node, so that the process executes within the designated CPUs instead of drifting to other CPUs on the node.
• Thread affinity: further pin each thread of a process to a CPU or CPUs within those designated to the process.
– Threads live in the process that owns them, so process and thread affinity are not separable.
• Memory affinity: restrict processes to allocate memory from the designated NUMA nodes only, rather than from any NUMA node.

The minimum goal of process/thread/memory affinity is to achieve best resource utilization and to avoid NUMA performance penalty

• Spread MPI tasks and threads over the cores and CPUs of the nodes as evenly as possible, so that no cores or CPUs are oversubscribed while others stay idle. This ensures the resources available on the node (cores, CPUs, NUMA nodes, memory and network bandwidth) are best utilized.
• Avoid accessing remote NUMA nodes as much as possible, to avoid the performance penalty.
• In the context of KNL, enable and control MCDRAM access.

Using srun’s --cpu_bind option and OpenMP environment variables to achieve desired process/thread affinity

• Use srun --cpu_bind to bind tasks to CPUs
– Often needs to work with the -c option of srun to evenly spread MPI tasks over the CPUs on the nodes
– srun -c <n> (or --cpus-per-task=n) allocates (reserves) n CPUs per task (process)
– --cpu_bind=[{verbose,quiet},]type, where type is one of cores, threads, map_cpu:<list of CPUs>, mask_cpu:<list of masks>, none, …
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to further pin each thread to a subset of the CPUs allocated to the host task
– Different compilers may have different default values for them. The following are recommended, and yield a more compatible thread affinity among the Intel, GNU and Cray compilers:
OMP_PROC_BIND=true # threads may not be moved between CPUs
OMP_PLACES=threads # each thread should be placed on a single CPU
– Use OMP_DISPLAY_ENV=true to display the OpenMP environment variables set (useful when checking the default compiler behavior)
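The even-spread rule behind the -c values in the sample scripts reduces to a small calculation: give each task floor(68/ntasks) whole cores, i.e. 4 times that many CPUs. A sketch (cpus_per_task is a hypothetical helper; a 68-core, 4-thread-per-core KNL node is assumed):

```shell
# Compute the srun -c value that spreads n MPI tasks evenly
# over the 68 cores (4 hardware threads each) of a KNL node.
cpus_per_task() {
  tasks=$1
  cores_per_task=$(( 68 / tasks ))   # whole cores per task (integer division)
  echo $(( cores_per_task * 4 ))     # 4 CPUs per core
}

cpus_per_task 64   # prints 4:  srun -n 64 -c 4
cpus_per_task 16   # prints 16: srun -n 16 -c 16
```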


Using srun’s --mem_bind option and/or numactl to achieve desired memory affinity

• Use srun --mem_bind for memory affinity
– --mem_bind=[{verbose,quiet},]type: local, map_mem:<NUMA id list>, mask_mem:<NUMA mask list>, none, …
– E.g., --mem_bind=<MCDRAM NUMA id> when allocations fit into MCDRAM in flat mode
• Use numactl -p <NUMA id>
– srun does not have this functionality currently (16.05.6); it will be supported in Slurm 17.02.
– E.g., numactl -p <MCDRAM NUMA id> ./a.out, so that allocations that don’t fit into MCDRAM spill over to DDR memory

Default Slurm behavior with respect to process/thread/memory binding

• By Slurm default, a decent CPU binding is set only when MPI tasks per node x CPUs per task = the total number of CPUs allocated per node, e.g., 68 x 4 = 272
• Otherwise, Slurm does not do anything with CPU binding. srun’s --cpu_bind and -c options must be used explicitly to achieve optimal process/thread affinity.
• No default memory binding is set by Slurm; processes can allocate memory from all NUMA nodes. --mem_bind (or numactl) should be used explicitly to set memory bindings.

Default Slurm behavior with respect to process/thread/memory binding (continued)

• The default distribution (the -m option of srun) is block:cyclic on Cori.
– The cyclic distribution method distributes allocated CPUs for binding to a given task consecutively from the same socket, and from the next consecutive socket for the next task, in a round-robin fashion across sockets.
• -m block:block also works. You are encouraged to experiment with -m block:block, as some applications perform better with the block distribution.
– The block distribution method distributes allocated CPUs consecutively from the same socket for binding to tasks, before using the next consecutive socket.
• The -m option is relevant to the KNL nodes when they are configured in the sub-NUMA cluster modes, e.g., SNC2 or SNC4. Slurm treats “NUMA nodes with CPUs” as “sockets”, although KNL is a single-socket node.

Available partitions and NUMA/MCDRAM modes on Cori KNL nodes (not finalized view yet)

• Same partitions as Haswell
– #SBATCH -p regular
– #SBATCH -p debug
– Type sinfo -s for more info about partitions and nodes
• Use the -C knl,<NUMA>,<MCDRAM> option of sbatch to request KNL nodes with the desired features
– #SBATCH -C knl,quad,flat
• Supported combinations of the following NUMA/MCDRAM modes:
– AllowNUMA=a2a,snc2,snc4,hemi,quad
– AllowMCDRAM=cache,split,equal,flat
– quad,flat is the default for now (not finalized)
• Nodes can be rebooted automatically
– Frequent reboots are not encouraged, as they currently take a long time
– We are testing various memory modes so as to set a proper default mode

Example of running interactive batch job with KNL nodes in the quad,cache mode

zz217@gert01:~> salloc -N 1 -p debug -t 30:00 -C knl,quad,cache
salloc: Granted job allocation 5545
salloc: Waiting for resource configuration
salloc: Nodes nid00044 are ready for job
zz217@nid00044:~> numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 … 270 271
node 0 size: 96757 MB
node 0 free: 94207 MB
node distances:
node 0
0: 10

• Run the numactl -H command to check whether the actual NUMA configuration matches the requested NUMA/MCDRAM mode
• The quad,cache mode has only 1 NUMA node, with all CPUs on NUMA node 0 (DDR memory)
• The MCDRAM is hidden from the numactl -H command (it is a cache).

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=1 # optional*
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: task placement. Each 2x2 box is a core with 4 CPUs (hardware threads); the numbers in the CPU boxes are CPU ids. Cores 4–59 not shown; the last 4 cores (64–67) are unused.]
This job script requests 1 KNL node in the quad,cache mode. The srun command launches 64 MPI tasks on the node, allocating 4 CPUs per task, and binds processes to cores. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. Each MPI task may move within the 4 CPUs of its core.
*) The use of “export OMP_NUM_THREADS=1” is optional but recommended even for pure MPI codes, to avoid unexpected thread forking (compiler wrappers may link your code against the multi-threaded system-provided libraries by default).

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=1 # optional*
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: Rank 0 pinned to Cores 0–3, Rank 1 to Cores 4–7, …, Rank 15 to Cores 60–63; cores 64–67 unused.]
This job script requests 1 KNL node in the quad,cache mode. The srun command launches 16 MPI tasks on the node, allocating 16 CPUs per task, and binds each process to 4 cores / 16 CPUs. Each MPI task may move within the 16 CPUs of its 4 cores.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: 64 ranks pinned one per core (Cores 0–63), with the 4 OpenMP threads of each rank inside that core; cores 64–67 unused.]
This job script requests 1 KNL node in the quad,cache mode to run 64 MPI tasks, allocating 4 CPUs per task, and binds each task to the 4 CPUs allocated within a core. Each MPI task runs 4 OpenMP threads. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63, and the 4 threads of each task are pinned within its core. Depending on the compiler used to build the code, the 4 threads in each core may or may not move between the 4 CPUs.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: one rank per core as before, but now each thread sits on its own CPU.]
With the above two OpenMP environment variables, each thread is pinned to a single CPU within its core. E.g., for Rank 0, Thread 0 is pinned to CPU 0, Thread 1 to CPU 68, Thread 2 to CPU 136, and Thread 3 to CPU 204.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: Rank 0 on Cores 0–3, …, Rank 15 on Cores 60–63, with 8 threads per rank spread over those cores’ 16 CPUs.]
Depending on the compiler implementation, the 8 threads of each task may or may not move between the 4 cores / 16 CPUs allocated to the host task.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: each of a rank’s 8 threads on its own CPU across the rank’s 4 cores.]
With the above two OpenMP environment variables, each thread is pinned to a single CPU on the cores allocated to the task. E.g., for Rank 0, Thread 0 is pinned to CPU 0 (on Core 0), Thread 1 to CPU 1 (on Core 1), Thread 2 to CPU 2 (on Core 2), and so on.

Example of running under the quad,flat mode interactively

zz217@gert01:~> salloc -p debug -t 30:00 -C knl,quad,flat
zz217@nid00037:~> numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 … 270 271
node 0 size: 96759 MB
node 0 free: 94208 MB
node 1 cpus:
node 1 size: 16157 MB
node 1 free: 16091 MB
node distances:
node 0 1
0: 10 31
1: 31 10

zz217@cori10:~> scontrol show node nid10388
NodeName=nid10388 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
AvailableFeatures=knl,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,cache,quad
…
State=IDLE ThreadsPerCore=4
…
BootTime=2016-10-31T13:43:12
…

Sample job script to run under the quad,flat mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: same placements as in the quad,cache examples: 64 ranks one per core for the first srun, and 16 ranks over 4 cores each for the second.]

Sample job script to run under the quad,flat mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: thread-per-CPU pinning, as in the corresponding quad,cache examples.]

Sample job script to run under the quad,flat mode using MCDRAM

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# When the memory footprint fits in the 16 GB of MCDRAM (NUMA node 1), run out of MCDRAM
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 64 -c 4 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out
# or using numactl -m
srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./a.out

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# Prefer MCDRAM (NUMA node 1); if the memory footprint does not fit in MCDRAM, spill to DDR
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 16 -c 16 --cpu_bind=cores numactl -p 1 ./a.out

How to check the process/thread affinity

• Use the srun flag --cpu_bind=verbose
– You need to read the CPU masks in hexadecimal format
• Use the Cray-provided code xthi.c (see backup slides).
• Use --mem_bind=verbose,<type> to check memory affinity
• Use the numastat -p <PID> command to confirm while a job is running
• Use environment variables (Slurm, compiler-specific)
– SLURM_CPU_BIND_VERBOSE: --cpu_bind verbosity (quiet, verbose).
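Reading those hexadecimal masks by hand is error-prone; a small helper can expand one into the CPU ids it selects. A sketch (decode_mask is a hypothetical name; bash-style arithmetic is assumed, so the mask must fit in 63 bits):

```shell
# Expand a hex CPU mask (as printed by --cpu_bind=verbose)
# into the list of CPU ids whose bits are set.
decode_mask() {
  m=$(( 0x${1#0x} )); i=0; out=""
  while [ "$m" -gt 0 ]; do
    if [ $(( m & 1 )) -eq 1 ]; then out="$out $i"; fi
    m=$(( m >> 1 )); i=$(( i + 1 ))
  done
  echo "${out# }"
}

decode_mask 0x11   # prints: 0 4
decode_mask 0xf0   # prints: 4 5 6 7
```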

A few useful commands
• sinfo --format="%F %b" for available features of nodes, or sinfo --format="%C %b"
– A/I/O/T (allocated/idle/other/total)
• scontrol show node <nid>
• scontrol show job <jobid> -ddd
• sinfo -s to see available partitions and nodes
• sbatch, srun, squeue, sinfo and other Slurm command man pages
– Need to distinguish job allocation time (#SBATCH) from job step creation time (srun within a job script)
– Some options are only available at job allocation time, such as --ntasks-per-core; some only work when certain plugins are enabled


Summary

• Use -C knl,<NUMA>,<MCDRAM> to request KNL nodes, with the same partitions as Haswell nodes (debug or regular)
• Always explicitly use srun’s --cpu_bind and -c options to spread the MPI tasks evenly over the cores/CPUs on the nodes
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to further pin threads to the CPUs allocated to the tasks
• Use srun’s --mem_bind and numactl -p to control memory affinity and access MCDRAM
– The memkind/AutoHBW libraries can be used to place only selected arrays/allocations in MCDRAM (Steve’s talk)


Summary (2)

• Consider using 64 cores out of 68 in most cases
• More sample job scripts can be found on our website:
– http://www.nersc.gov/users/computational-systems/cori/running-jobs/running-jobs-on-cori-knl-nodes/
• We have provided a job script generator to help you generate batch job scripts for KNL (and Haswell, Edison)
• Slurm KNL features are in continuous development, and some instructions are subject to change
URL for the job script generator:
https://my.nersc.gov/script_generator.php


Backup slides

Cray provided a code to check process/thread affinity: xthi.c (http://portal.nersc.gov/project/training/KNLUserTraing20161103/UsingCori/xthi.c/)

• To compile:
cc -qopenmp -o xthi.intel xthi.c # Intel compilers
cc -fopenmp -o xthi.gnu xthi.c # GNU compilers
cc -o xthi.cray xthi.c # Cray compilers
• To run:
salloc -N 1 -p debug -C knl,quad,flat # start an interactive 1-node job
…
export OMP_DISPLAY_ENV=true # to display envs used/set by the OpenMP runtime
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=verbose,cores xthi.intel # run 64 tasks with 4 threads each
srun -n 16 -c 16 --cpu_bind=verbose,cores xthi.intel # run 16 tasks with 4 threads each

sinfo --format="%F %b" to show the number of nodes with active features

zz217@cori01:~> date
Mon Nov 28 15:10:36 PST 2016
zz217@cori01:~> sinfo --format="%F %b"
NODES(A/I/O/T) ACTIVE_FEATURES
1805/157/42/2004 haswell
0/107/1/108 knl,flat,a2a
0/0/1/1 knl
0/225/6/231 knl,flat,quad
2500/1942/60/4502 knl,cache,quad
2676/995/6/3677 quad,cache,knl
768/0/0/768 snc4,flat,knl
0/8/0/8 quad,flat,knl
0/8/1/9 snc2,flat,knl

This command shows that there are 8179 KNL nodes in quad,cache mode, 239 nodes in quad,flat mode, 768 nodes in snc4,flat mode, and 9 in snc2,flat mode.
The order of the features does not make a difference.

The --cpu_bind option of srun enables CPU bindings

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 ./a.out # no CPU bindings; tasks can move around within the 68 cores / 272 CPUs
srun -n 4 --cpu_bind=cores ./a.out
srun -n 4 --cpu_bind=threads ./a.out
srun -n 4 --cpu_bind=map_cpu:0,204,67,271 ./a.out
[Figure: with no binding, Ranks 0–3 drift over all cores; with --cpu_bind=cores, Ranks 0–3 are bound to Cores 0–3, one core each; with --cpu_bind=threads, Ranks 0–3 are bound to individual CPUs; with map_cpu, the ranks are pinned to the listed CPUs 0, 204, 67 and 271.]

The --cpu_bind option: the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 8 --cpu_bind=cores ./a.out
[Figure: Ranks 0–3 each bound to 2 cores (Cores 0–7); the first 8 cores / 32 CPUs are used, the remaining 60 cores / 240 CPUs stay idle.]
srun -n 16 -c 16 --cpu_bind=cores ./a.out
[Figure: Ranks 0–15 each bound to 4 cores (Cores 0–63); the last 4 cores / 16 CPUs are idle.]

The --cpu_bind option (continued): the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 8 --cpu_bind=threads ./a.out
[Figure: the first 8 cores / 32 CPUs are used; the remaining 60 cores / 240 CPUs stay idle.]
srun -n 16 -c 16 --cpu_bind=threads ./a.out
[Figure: the last 4 cores / 16 CPUs are idle.]

The -c option: --cpu_bind=cores vs --cpu_bind=threads

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 6 --cpu_bind=cores ./a.out
[Figure: with --cpu_bind=cores, each rank is bound to whole cores; the first 8 cores / 32 CPUs are used, the remaining 60 cores / 240 CPUs stay idle.]
srun -n 4 -c 6 --cpu_bind=threads ./a.out
[Figure: with --cpu_bind=threads, ranks are bound by CPU; only the first 6 cores / 24 CPUs are used, the remaining 62 cores / 248 CPUs stay idle.]

snc2,flat (salloc -N 1 -p regular -C knl,snc2,flat)

zz217@nid11512:~> numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0-33 68-101 136-169 204-237
node 0 size: 48293 MB
node 0 free: 46047 MB
node 1 cpus: 34-67 102-135 170-203 238-271
node 1 size: 48466 MB
node 1 free: 44949 MB
node 2 cpus:
node 2 size: 8079 MB
node 2 free: 7983 MB
node 3 cpus:
node 3 size: 8077 MB
node 3 free: 7980 MB
node distances:
node 0 1 2 3
0: 10 21 31 41
1: 21 10 41 31
2: 31 41 10 41
3: 41 31 41 10
[Figure: NUMA node 0 (Cores 0–33) paired with MCDRAM NUMA node 2; NUMA node 1 (Cores 34–67) paired with MCDRAM NUMA node 3.]

The default distribution on Cori is -m block:cyclic

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 4 --cpu_bind=cores -m block:cyclic ./a.out
[Figure: block:cyclic alternates tasks between the two NUMA “sockets”. KNL node 1: Ranks 0, 2, 4 on NUMA node 0 and Ranks 1, 3, 5 on NUMA node 1; KNL node 2: Ranks 6, 8, 10 on NUMA node 0 and Ranks 7, 9 on NUMA node 1.]

The default distribution on Cori is -m block:cyclic (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 8 --cpu_bind=cores -m block:cyclic ./a.out
[Figure: as above, with each rank bound to 2 cores; Ranks 0–5 on KNL node 1 and Ranks 6–10 on KNL node 2, alternating between the two NUMA nodes.]

Block distribution: -m block:block

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores -m block:block ./a.out
[Figure: block:block fills one NUMA “socket” before the next. KNL node 1: Ranks 0–2 on NUMA node 0, Ranks 3–5 on NUMA node 1; KNL node 2: Ranks 6–8 on NUMA node 0, Ranks 9–10 on NUMA node 1.]

Block distribution: -m block:block (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores -m block:block ./a.out
[Figure: as above, with each rank bound to 2 cores. KNL node 1: Ranks 0–2 on NUMA node 0, Ranks 3–5 on NUMA node 1; KNL node 2: Ranks 6–8 on NUMA node 0, Ranks 9–10 on NUMA node 1.]

Sample job script to run MPI+OpenMP under the snc2,flat mode

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,snc2,flat

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 2 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=local ./a.out

# using 2 CPUs (hardware threads) per core, with MCDRAM
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:2,3 ./a.out
# or
srun -n 16 -c 16 --cpu_bind=cores numactl -m 2,3 ./a.out

# using 2 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 16 -c 16 --cpu_bind=cores numactl -p 2,3 ./a.out

# using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 32 -c 8 --cpu_bind=cores numactl -p 2,3 ./a.out

Sample job script to run large jobs (> 1500 MPI tasks) (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 100
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun --bcast=/tmp/a.out -n 6400 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
sbcast ./a.out /tmp/a.out # copy a.out to /tmp on each allocated compute node first
srun -n 6400 -c 4 --cpu_bind=cores numactl -p 1 /tmp/a.out

Sample job script to use core specialization (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat
#SBATCH -S 1

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out
# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out

Sample job script to run with Intel MPI (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads
module load impi
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
#export I_MPI_FABRICS=shm:tcp

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out

Thank you!