
Steve Leak and Zhengji Zhao
NESAP Hack-a-thon
November 29, 2016, Berkeley CA

Using Cori

How to Compile & MCDRAM

Steve Leak

Building for Cori KNL nodes

• What’s different?
• How to compile
– .. to use the new wide vector instructions
• What to link
• Making use of MCDRAM
– High Bandwidth Memory

Building for Cori KNL nodes

Don’t Panic (much)

KNL can run Haswell executables

But ...

Haswell executables can’t fully use KNL hardware:
• AVX2 (Haswell): operation on 4 DP words per instruction
• AVX-512 (KNL): hardware can compute 8 DP words per instruction

And ...

KNL relies more on vectorization for performance
[Figure: roofline plot of GFLOPS vs. arithmetic intensity]

And ...

KNL memory hierarchy is more complicated

How to compile

Best: use compiler options to build for KNL
module swap craype-haswell craype-mic-knl
• The loaded craype-* module sets the target that the compiler wrappers (cc, CC, ftn) build for
– E.g. -mknl (GNU compiler), -hmic-knl (Cray compiler)
• craype-haswell is the default on login nodes
• craype-mic-knl is for KNL nodes

How to compile

Best: compiler settings to target KNL
Alternate: CC -axMIC-AVX512,CORE-AVX2 <more-options> mycode.c++
• Only valid when using Intel compilers (cc, CC or ftn)
• -ax<arch> adds “alternate execution paths” optimized for different architectures
– Makes 2 (or more) versions of code in the same object file
• NOT AS GOOD as the craype-mic-knl module
– (the module causes versions of libraries built for that architecture to be used, e.g. MKL)

How to compile

Recommendations:
• For best performance, use the craype-mic-knl module:
module swap craype-haswell craype-mic-knl
CC -O3 -c myfile.c++
• If the same executable must run on KNL and Haswell nodes, use craype-haswell but add a KNL-optimized execution path:
CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++
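Put together, the two recommended workflows might look like the following sketch (file and executable names are placeholders; the Cray compiler wrappers and craype modules are assumed):

```shell
# (1) KNL-only executable: retarget the compiler wrappers, then build.
module swap craype-haswell craype-mic-knl
CC -O3 -c myfile.c++
CC -O3 -o myapp.knl myfile.o      # wrappers also pick KNL builds of Cray-provided libraries

# (2) One executable for both Haswell and KNL: stay on craype-haswell,
#     add the KNL execution path (Intel compilers only).
CC -axMIC-AVX512,CORE-AVX2 -O3 -c myfile.c++
CC -axMIC-AVX512,CORE-AVX2 -O3 -o myapp.both myfile.o
```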

What to link

Utility libraries
• Not performance-critical (by definition)
– KNL can run Xeon binaries .. can use Haswell-targeted versions
• I/O libraries (HDF5, NetCDF, etc) should fit in this category too
– (for Cray-provided libraries, the compiler wrapper will use craype-* to select the best build anyway)

What to link

Performance-critical libraries
• MKL: has KNL-targeted optimizations
– Note: need to link with -lmemkind (more soon)
• PETSc, SLEPc, Caffe, Metis, etc:
– (soon) have KNL-targeted builds
• Module files will use craype-{haswell,mic-knl} to find the appropriate library
• Key points:
– Someone else has already prepared libraries for KNL
– No need to do it yourself
– Load the right craype- module

What to link

• NERSC convention: /usr/common/software/<name>/<version>/<arch>/[<PrgEnv>]
• E.g.:
/usr/common/software/petsc/3.7.2/hsw/intel
/usr/common/software/petsc/3.7.2/knl/intel
• The knl subfolder may be a symlink to hsw
– Libraries compiled with -axMIC-AVX512,CORE-AVX2
• Module files should do the right thing™
– Using CRAY_CPU_TARGET, set by craype-{haswell,mic-knl}

Where to build

• Mostly: on the login nodes
– KNL is designed for scalable, vectorized workloads
– Compiling is neither!
• Will probably be much slower on a KNL node than on a Xeon node
• Cross-compiling
– You are compiling for a Xeon Phi (KNL) target, on a Xeon host
• Tools like autoconf (./configure) may try to build-and-run small executables to test availability of libraries, etc. .. which might not work
– Compile on a KNL compute node?
• Slow (and currently not working)
– craype-haswell + CFLAGS=-axMIC-AVX512,CORE-AVX2
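For autoconf packages, one way around the build-and-run problem is to configure on the Haswell login node while injecting the KNL code path through CFLAGS; a sketch, assuming the package honors CC/CFLAGS (the package name and install prefix are placeholders):

```shell
# configure's test programs run on the Haswell login node
# (craype-haswell is the default there); -ax adds the KNL path.
export CC=cc CXX=CC FC=ftn
export CFLAGS="-axMIC-AVX512,CORE-AVX2 -O3"
./configure --prefix=$HOME/software/mypkg
make -j 8
make install
```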

Don’t Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here’s the new bit:
• module swap craype-haswell craype-mic-knl
– For KNL-specific executables, or
• CC -axMIC-AVX512,CORE-AVX2 ...
– For Haswell/KNL portability

What about MCDRAM?

• What’s different?
• How to compile
– .. to use the new wide vector instructions
• What to link
• Making use of MCDRAM
– High Bandwidth Memory

MCDRAM in a nutshell

• 16 GB on-chip memory
– cf. 96 GB off-chip DDR (Cori)
• Not (exactly) a cache
– Latency similar to DDR
• But very high bandwidth
– ~5x DDR
• 2 ways to use it:
– “Cache” mode: invisible to the OS, memory pages are cached in MCDRAM (cache-line granularity)
– “Flat” mode: appears to the OS as a separate NUMA node, with no local CPUs. Accessible via numactl, libnuma (page granularity)

MCDRAM in a nutshell - cache mode

MCDRAM in a nutshell - flat mode

How to use MCDRAM

• Option 1: Let the system figure it out
– Cache mode, no changes to code, build procedure or run procedure
– Most of the benefit, free, most of the time

How to use MCDRAM

• Option 2: Run-time settings only
– Flat mode, no changes to code or build procedure
– Does the whole job fit within 16 GB/node?
• srun <options> numactl -m 1 ./myexec.exe
– Too big?
• srun <options> numactl -p 1 ./myexec.exe
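Spelled out for a quad,flat node, where MCDRAM appears as NUMA node 1 (a sketch; ./myexec.exe and the srun geometry are placeholders):

```shell
# Whole job fits in 16 GB/node: bind all allocations to MCDRAM
# (allocations fail if MCDRAM overflows).
srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./myexec.exe

# Too big: prefer MCDRAM, let the rest spill over to DDR.
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./myexec.exe
```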

How to use MCDRAM

• Option 3: Make your application NUMA-aware
– Flat mode
– Use libmemkind to explicitly allocate selected arrays in MCDRAM

memkind software stack:
• memkind: NUMA-aware extensible heap manager
• jemalloc: malloc implementation emphasizing fragmentation avoidance and concurrency
• libnuma: API for the NUMA allocation policy in the Linux kernel

#include <hbwmalloc.h>
malloc(size) -> hbw_malloc(size)

Using libmemkind in code

• C/C++: hbw_malloc() replaces malloc()
#include <hbwmalloc.h> // malloc(size) -> hbw_malloc(size)
• Fortran:
!DIR$ MEMORY(bandwidth) a,b,c ! cray
real, allocatable :: a(:,:), b(:,:), c(:)
!DIR$ ATTRIBUTES FASTMEM :: a,b,c ! intel
• Caveat: only for dynamically-allocated arrays
– Not local (stack) variables
– Or Fortran pointers

Using libmemkind in code

• Which arrays to put in MCDRAM?
– Use VTune memory-access measurements:
– amplxe-cl -collect memory-access …

Building with libmemkind

• module load memkind
• (or module load cray-memkind)
• Compiler wrappers will add -lmemkind -ljemalloc -lnuma
• Fortran note: not all compilers support the FASTMEM directive
– Currently Intel, and maybe Cray

AutoHBW: Automatic memkind

• Uses array size to determine whether an array should be allocated in MCDRAM
• No code changes necessary!
• module load autohbw
• Link with -lautohbw
Runtime environment variables:
export AUTO_HBW_SIZE=4K    # any allocation >4KB will be placed in MCDRAM
export AUTO_HBW_SIZE=4K:8K # allocations between 4KB and 8KB will be placed in MCDRAM
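An AutoHBW session might look like this sketch (myfile.o and myapp are placeholder names; only a relink is needed):

```shell
module load autohbw
CC -O3 -o myapp myfile.o -lautohbw   # relink only, no source changes

export AUTO_HBW_SIZE=4K              # any allocation >4KB goes to MCDRAM
srun -n 64 -c 4 --cpu_bind=cores ./myapp
```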

Don’t Panic!

In summary:
• Build on login nodes (like you do now)
• Use provided libraries (like you probably do now)
• Here’s the new bit:
• module swap craype-haswell craype-mic-knl
– For KNL-specific executables, or
• CC -axMIC-AVX512,CORE-AVX2 ...
– For Haswell/KNL portability
And:
• Think about MCDRAM
– numactl, memkind, autohbw

A few final notes

• Edison executables (probably) won’t work without a recompile
– ISA-compatible, but …
– Cori has a newer OS version and updated libraries
– So: recompile for Cori
• KNL-optimized MKL uses libmemkind
– Will need to link with -lmemkind -ljemalloc
– Should be invisibly integrated in a future version

Running jobs on Cori KNL nodes

Zhengji Zhao

Agenda

• What’s new on KNL nodes
• Process/thread/memory affinity
• Sample job scripts
• Summary

KNL overview and legend: Knights Landing overview

Chip: 36 tiles interconnected by a 2D mesh
Tile: 2 cores + 2 VPU/core + 1 MB L2
Memory: MCDRAM: 16 GB on-package, high BW; DDR4: 6 channels @ 2400, up to 384 GB
IO: 36 lanes PCIe Gen3; 4 lanes of DMI for the chipset
Node: 1 socket only
Fabric: Omni-Path on-package (not shown)
Vector peak perf: 3+ TF DP and 6+ TF SP flops
Scalar perf: ~3x over Knights Corner
STREAM Triad (GB/s): MCDRAM: 400+; DDR: 90+

[Figure: KNL tile: 2 cores, each with 2 VPUs, sharing 1 MB L2 and a CHA, on the package.]
[Figure: KNL die: 36 tiles connected by a 2D mesh interconnect; 8 MCDRAM devices (EDCs) on package; 2 x 3 DDR4 channels; PCIe Gen3 (2 x16, 1 x4) and x4 DMI.]
Source Intel: all products, computer systems, dates and figures specified are preliminary, based on current expectations, and subject to change without notice. Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX). Bandwidth numbers are based on a STREAM-like memory access pattern with MCDRAM used as flat memory, estimated from internal Intel analysis and provided for informational purposes only.

[Figure: legend: each tile holds 2 cores; each core has 4 CPUs (hardware threads).]
1 socket/node; 68 cores (272 CPUs)/node; 36 tiles/node (34 active); 2 cores/tile; 4 CPUs/core
1.4 GB/core DDR memory; 235 MB/core MCDRAM

• A Cori KNL node has 68 cores / 272 CPUs running at 1.4 GHz, 96 GB DDR memory, and 16 GB of high-bandwidth (~5x DDR) on-package memory (MCDRAM)
• Three cluster modes (all-to-all, quadrant, sub-NUMA clustering) are available at boot time to configure the KNL mesh interconnect.
We use Slurm’s terminology of cores and CPUs (hardware threads).

KNL overview – MCDRAM modes

Memory modes: three modes, selected at boot

• Cache mode: all 16 GB of MCDRAM sit in front of the DDR
– SW-transparent, memory-side cache
– Direct mapped, 64B lines; tags part of line
– Covers the whole DDR range
– No source code changes needed
– Misses are expensive
• Flat mode: all 16 GB of MCDRAM sit beside the DDR in the physical address space
– MCDRAM as regular memory, SW-managed, same address space
– Exposed as a NUMA node
– Code changes required: access via the memkind library, job launchers, and/or numactl
• Hybrid mode: part cache (4 or 8 GB of MCDRAM), part memory (the remaining 8 or 12 GB)
– 25% or 50% cache; benefits of both
– Combination of the cache and flat modes

MCDRAM can be configured in three different modes at boot time: cache, flat, and hybrid.

What’s new on KNL nodes (in comparison with Cori Haswell nodes from the perspective of running jobs)

1. A lot more (slower) cores on the node
2. Much reduced per-core memory
3. Dynamically configurable NUMA and MCDRAM modes …

A proper process/thread/memory affinity is the basis for optimal performance

• Process affinity (or CPU pinning): bind an (MPI) process to a CPU or a range of CPUs on the node, so that the process executes within the designated CPUs instead of drifting to other CPUs on the node.
• Thread affinity: further pin each thread of a process to a CPU or CPUs within those designated to the process.
– Threads live in the process that owns them, so process and thread affinity are not separable.
• Memory affinity: restrict processes to allocate memory from the designated NUMA nodes only, rather than from any NUMA node.

The minimum goal of process/thread/memory affinity is to achieve best resource utilization and to avoid NUMA performance penalty

• Spread MPI tasks and threads over the cores and CPUs of the nodes as evenly as possible, so that no cores or CPUs are oversubscribed while others stay idle. This ensures the resources available on the node (cores, CPUs, NUMA nodes, memory and network bandwidth) are best utilized.
• Avoid accessing remote NUMA nodes as much as possible, to avoid the performance penalty.
• In the context of KNL, enable and control MCDRAM access.

Using srun’s --cpu_bind option and OpenMP environment variables to achieve desired process/thread affinity

• Use srun --cpu_bind to bind tasks to CPUs
– Often needs to work with the -c option of srun to evenly spread MPI tasks over the CPUs on the nodes
– srun -c <n> (or --cpus-per-task=n) allocates (reserves) n CPUs per task (process)
– --cpu_bind=[{verbose,quiet},]type, where type is one of cores, threads, map_cpu:<list of CPUs>, mask_cpu:<list of masks>, none, …
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to further pin each thread to a subset of the CPUs allocated to the host task
– Different compilers may have different default values for them. The following are recommended, and yield a more compatible thread affinity among the Intel, GNU and Cray compilers:
OMP_PROC_BIND=true # threads may not be moved between CPUs
OMP_PLACES=threads # each thread should be placed on a single CPU
– Use OMP_DISPLAY_ENV=true to display the OpenMP environment variables set (useful when checking the default compiler behavior)
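The even-spread rule behind the -c values in the sample scripts reduces to a small calculation: give each task floor(68/ntasks) whole cores, i.e. 4 times that many CPUs. A sketch (cpus_per_task is a hypothetical helper; a 68-core, 4-thread-per-core KNL node is assumed):

```shell
# Compute the srun -c value that spreads n MPI tasks evenly
# over the 68 cores (4 hardware threads each) of a KNL node.
cpus_per_task() {
  tasks=$1
  cores_per_task=$(( 68 / tasks ))   # whole cores per task (integer division)
  echo $(( cores_per_task * 4 ))     # 4 CPUs per core
}

cpus_per_task 64   # prints 4:  srun -n 64 -c 4
cpus_per_task 16   # prints 16: srun -n 16 -c 16
```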


Using srun’s --mem_bind option and/or numactl to achieve desired memory affinity

• Use srun --mem_bind for memory affinity
– --mem_bind=[{verbose,quiet},]type: local, map_mem:<NUMA id list>, mask_mem:<NUMA mask list>, none, …
– E.g., --mem_bind=<MCDRAM NUMA id> when allocations fit into MCDRAM in flat mode
• Use numactl -p <NUMA id>
– srun does not have this functionality currently (16.05.6); it will be supported in Slurm 17.02.
– E.g., numactl -p <MCDRAM NUMA id> ./a.out, so that allocations that don’t fit into MCDRAM spill over to DDR memory

Default Slurm behavior with respect to process/thread/memory binding

• By Slurm default, a decent CPU binding is set only when MPI tasks per node x CPUs per task = the total number of CPUs allocated per node, e.g., 68 x 4 = 272
• Otherwise, Slurm does not do anything with CPU binding. srun’s --cpu_bind and -c options must be used explicitly to achieve optimal process/thread affinity.
• No default memory binding is set by Slurm; processes can allocate memory from all NUMA nodes. --mem_bind (or numactl) should be used explicitly to set memory bindings.

Default Slurm behavior with respect to process/thread/memory binding (continued)

• The default distribution (the -m option of srun) is block:cyclic on Cori.
– The cyclic distribution method distributes allocated CPUs for binding to a given task consecutively from the same socket, and from the next consecutive socket for the next task, in a round-robin fashion across sockets.
• -m block:block also works. You are encouraged to experiment with -m block:block, as some applications perform better with the block distribution.
– The block distribution method distributes allocated CPUs consecutively from the same socket for binding to tasks, before using the next consecutive socket.
• The -m option is relevant to the KNL nodes when they are configured in the sub-NUMA cluster modes, e.g., SNC2 or SNC4. Slurm treats “NUMA nodes with CPUs” as “sockets”, although KNL is a single-socket node.

Available partitions and NUMA/MCDRAM modes on Cori KNL nodes (not finalized view yet)

• Same partitions as Haswell
– #SBATCH -p regular
– #SBATCH -p debug
– Type sinfo -s for more info about partitions and nodes
• Use the -C knl,<NUMA>,<MCDRAM> option of sbatch to request KNL nodes with the desired features
– #SBATCH -C knl,quad,flat
• Supported combinations of the following NUMA/MCDRAM modes:
– AllowNUMA=a2a,snc2,snc4,hemi,quad
– AllowMCDRAM=cache,split,equal,flat
– quad,flat is the default for now (not finalized)
• Nodes can be rebooted automatically
– Frequent reboots are not encouraged, as they currently take a long time
– We are testing various memory modes so as to set a proper default mode

Example of running interactive batch job with KNL nodes in the quad,cache mode

zz217@gert01:~> salloc -N 1 -p debug -t 30:00 -C knl,quad,cache
salloc: Granted job allocation 5545
salloc: Waiting for resource configuration
salloc: Nodes nid00044 are ready for job
zz217@nid00044:~> numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 … 270 271
node 0 size: 96757 MB
node 0 free: 94207 MB
node distances:
node 0
0: 10

• Run the numactl -H command to check whether the actual NUMA configuration matches the requested NUMA/MCDRAM mode
• The quad,cache mode has only 1 NUMA node, with all CPUs on NUMA node 0 (DDR memory)
• The MCDRAM is hidden from the numactl -H command (it is a cache).

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=1 # optional*
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: task placement. Each 2x2 box is a core with 4 CPUs (hardware threads); the numbers in the CPU boxes are CPU ids. Cores 4–59 not shown; the last 4 cores (64–67) are unused.]
This job script requests 1 KNL node in the quad,cache mode. The srun command launches 64 MPI tasks on the node, allocating 4 CPUs per task, and binds processes to cores. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63. Each MPI task may move within the 4 CPUs of its core.
*) The use of “export OMP_NUM_THREADS=1” is optional but recommended even for pure MPI codes, to avoid unexpected thread forking (compiler wrappers may link your code against the multi-threaded system-provided libraries by default).

Sample job script to run under the quad,cache mode

Sample job script (pure MPI)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=1 # optional*
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: Rank 0 pinned to Cores 0–3, Rank 1 to Cores 4–7, …, Rank 15 to Cores 60–63; cores 64–67 unused.]
This job script requests 1 KNL node in the quad,cache mode. The srun command launches 16 MPI tasks on the node, allocating 16 CPUs per task, and binds each process to 4 cores / 16 CPUs. Each MPI task may move within the 16 CPUs of its 4 cores.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: 64 ranks pinned one per core (Cores 0–63), with the 4 OpenMP threads of each rank inside that core; cores 64–67 unused.]
This job script requests 1 KNL node in the quad,cache mode to run 64 MPI tasks, allocating 4 CPUs per task, and binds each task to the 4 CPUs allocated within a core. Each MPI task runs 4 OpenMP threads. Rank 0 is pinned to Core 0, Rank 1 to Core 1, …, Rank 63 to Core 63, and the 4 threads of each task are pinned within its core. Depending on the compiler used to build the code, the 4 threads in each core may or may not move between the 4 CPUs.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: one rank per core as before, but now each thread sits on its own CPU.]
With the above two OpenMP environment variables, each thread is pinned to a single CPU within its core. E.g., for Rank 0, Thread 0 is pinned to CPU 0, Thread 1 to CPU 68, Thread 2 to CPU 136, and Thread 3 to CPU 204.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: Rank 0 on Cores 0–3, …, Rank 15 on Cores 60–63, with 8 threads per rank spread over those cores’ 16 CPUs.]
Depending on the compiler implementation, the 8 threads of each task may or may not move between the 4 cores / 16 CPUs allocated to the host task.

Sample job script to run under the quad,cache mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,cache

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: each of a rank’s 8 threads on its own CPU across the rank’s 4 cores.]
With the above two OpenMP environment variables, each thread is pinned to a single CPU on the cores allocated to the task. E.g., for Rank 0, Thread 0 is pinned to CPU 0 (on Core 0), Thread 1 to CPU 1 (on Core 1), Thread 2 to CPU 2 (on Core 2), and so on.

Example of running under the quad,flat mode interactively

zz217@gert01:~> salloc -p debug -t 30:00 -C knl,quad,flat
zz217@nid00037:~> numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 … 270 271
node 0 size: 96759 MB
node 0 free: 94208 MB
node 1 cpus:
node 1 size: 16157 MB
node 1 free: 16091 MB
node distances:
node 0 1
0: 10 31
1: 31 10

zz217@cori10:~> scontrol show node nid10388
NodeName=nid10388 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
AvailableFeatures=knl,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,cache,quad
…
State=IDLE ThreadsPerCore=4
…
BootTime=2016-10-31T13:43:12
…

Sample job script to run under the quad,flat mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process affinity outcome:
[Figure: same placements as in the quad,cache examples: 64 ranks one per core for the first srun, and 16 ranks over 4 cores each for the second.]

Sample job script to run under the quad,flat mode

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=cores ./a.out

export OMP_NUM_THREADS=8
srun -n 16 -c 16 --cpu_bind=cores ./a.out

Process/thread affinity outcome:
[Figure: thread-per-CPU pinning, as in the corresponding quad,cache examples.]

Sample job script to run under the quad,flat mode using MCDRAM

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# When the memory footprint fits in the 16 GB of MCDRAM (NUMA node 1), run out of MCDRAM
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 64 -c 4 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out
# or using numactl -m
srun -n 64 -c 4 --cpu_bind=cores numactl -m 1 ./a.out

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

# Prefer MCDRAM (NUMA node 1); if the memory footprint does not fit in MCDRAM, spill to DDR
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

srun -n 16 -c 16 --cpu_bind=cores numactl -p 1 ./a.out

How to check the process/thread affinity

• Use the srun flag --cpu_bind=verbose
– You need to read the CPU masks in hexadecimal format
• Use the Cray-provided code xthi.c (see backup slides).
• Use --mem_bind=verbose,<type> to check memory affinity
• Use the numastat -p <PID> command to confirm while a job is running
• Use environment variables (Slurm, compiler-specific)
– SLURM_CPU_BIND_VERBOSE: --cpu_bind verbosity (quiet, verbose).
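Reading those hexadecimal masks by hand is error-prone; a small helper can expand one into the CPU ids it selects. A sketch (decode_mask is a hypothetical name; bash-style arithmetic is assumed, so the mask must fit in 63 bits):

```shell
# Expand a hex CPU mask (as printed by --cpu_bind=verbose)
# into the list of CPU ids whose bits are set.
decode_mask() {
  m=$(( 0x${1#0x} )); i=0; out=""
  while [ "$m" -gt 0 ]; do
    if [ $(( m & 1 )) -eq 1 ]; then out="$out $i"; fi
    m=$(( m >> 1 )); i=$(( i + 1 ))
  done
  echo "${out# }"
}

decode_mask 0x11   # prints: 0 4
decode_mask 0xf0   # prints: 4 5 6 7
```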

A few useful commands
• sinfo --format="%F %b" for available features of nodes, or sinfo --format="%C %b"
– A/I/O/T (allocated/idle/other/total)
• scontrol show node <nid>
• scontrol show job <jobid> -ddd
• sinfo -s to see available partitions and nodes
• sbatch, srun, squeue, sinfo and other Slurm command man pages
– Need to distinguish job allocation time (#SBATCH) from job step creation time (srun within a job script)
– Some options are only available at job allocation time, such as --ntasks-per-core; some only work when certain plugins are enabled


Summary

• Use -C knl,<NUMA>,<MCDRAM> to request KNL nodes, with the same partitions as Haswell nodes (debug or regular)
• Always explicitly use srun’s --cpu_bind and -c options to spread the MPI tasks evenly over the cores/CPUs on the nodes
• Use the OpenMP environment variables OMP_PROC_BIND and OMP_PLACES to further pin threads to the CPUs allocated to the tasks
• Use srun’s --mem_bind and numactl -p to control memory affinity and access MCDRAM
– The memkind/AutoHBW libraries can be used to place only selected arrays/allocations in MCDRAM (Steve’s talk)


Summary (2)

• Consider using 64 cores out of 68 in most cases
• More sample job scripts can be found on our website:
– http://www.nersc.gov/users/computational-systems/cori/running-jobs/running-jobs-on-cori-knl-nodes/
• We have provided a job script generator to help you generate batch job scripts for KNL (and Haswell, Edison)
• Slurm KNL features are in continuous development, and some instructions are subject to change
URL for the job script generator:
https://my.nersc.gov/script_generator.php


Backup slides

Cray provided a code to check process/thread affinity: xthi.c (http://portal.nersc.gov/project/training/KNLUserTraing20161103/UsingCori/xthi.c/)

• To compile:
cc -qopenmp -o xthi.intel xthi.c # Intel compilers
cc -fopenmp -o xthi.gnu xthi.c # GNU compilers
cc -o xthi.cray xthi.c # Cray compilers
• To run:
salloc -N 1 -p debug -C knl,quad,flat # start an interactive 1-node job
…
export OMP_DISPLAY_ENV=true # to display envs used/set by the OpenMP runtime
export OMP_NUM_THREADS=4
srun -n 64 -c 4 --cpu_bind=verbose,cores xthi.intel # run 64 tasks with 4 threads each
srun -n 16 -c 16 --cpu_bind=verbose,cores xthi.intel # run 16 tasks with 4 threads each

sinfo --format="%F %b" to show the number of nodes with active features

zz217@cori01:~> date
Mon Nov 28 15:10:36 PST 2016
zz217@cori01:~> sinfo --format="%F %b"
NODES(A/I/O/T) ACTIVE_FEATURES
1805/157/42/2004 haswell
0/107/1/108 knl,flat,a2a
0/0/1/1 knl
0/225/6/231 knl,flat,quad
2500/1942/60/4502 knl,cache,quad
2676/995/6/3677 quad,cache,knl
768/0/0/768 snc4,flat,knl
0/8/0/8 quad,flat,knl
0/8/1/9 snc2,flat,knl

This command shows that there are 8179 KNL nodes in quad,cache mode, 239 nodes in quad,flat mode, 768 nodes in snc4,flat mode, and 9 in snc2,flat mode.
The order of the features does not make a difference.

The --cpu_bind option of srun enables CPU bindings

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 ./a.out # no CPU bindings; tasks can move around within the 68 cores / 272 CPUs
srun -n 4 --cpu_bind=cores ./a.out
srun -n 4 --cpu_bind=threads ./a.out
srun -n 4 --cpu_bind=map_cpu:0,204,67,271 ./a.out
[Figure: with no binding, Ranks 0–3 drift over all cores; with --cpu_bind=cores, Ranks 0–3 are bound to Cores 0–3, one core each; with --cpu_bind=threads, Ranks 0–3 are bound to individual CPUs; with map_cpu, the ranks are pinned to the listed CPUs 0, 204, 67 and 271.]

The --cpu_bind option: the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 8 --cpu_bind=cores ./a.out
[Figure: Ranks 0–3 each bound to 2 cores (Cores 0–7); the first 8 cores / 32 CPUs are used, the remaining 60 cores / 240 CPUs stay idle.]
srun -n 16 -c 16 --cpu_bind=cores ./a.out
[Figure: Ranks 0–15 each bound to 4 cores (Cores 0–63); the last 4 cores / 16 CPUs are idle.]

The --cpu_bind option (continued): the -c option spreads tasks (evenly) on the CPUs on the node

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 8 --cpu_bind=threads ./a.out
[Figure: the first 8 cores / 32 CPUs are used; the remaining 60 cores / 240 CPUs stay idle.]
srun -n 16 -c 16 --cpu_bind=threads ./a.out
[Figure: the last 4 cores / 16 CPUs are idle.]

The -c option: --cpu_bind=cores vs --cpu_bind=threads

salloc -N 1 -p debug -C knl,quad,flat
…
srun -n 4 -c 6 --cpu_bind=cores ./a.out
[Figure: with --cpu_bind=cores, each rank is bound to whole cores; the first 8 cores / 32 CPUs are used, the remaining 60 cores / 240 CPUs stay idle.]
srun -n 4 -c 6 --cpu_bind=threads ./a.out
[Figure: with --cpu_bind=threads, ranks are bound by CPU; only the first 6 cores / 24 CPUs are used, the remaining 62 cores / 248 CPUs stay idle.]

snc2,flat (salloc -N 1 -p regular -C knl,snc2,flat)

zz217@nid11512:~> numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0-33 68-101 136-169 204-237
node 0 size: 48293 MB
node 0 free: 46047 MB
node 1 cpus: 34-67 102-135 170-203 238-271
node 1 size: 48466 MB
node 1 free: 44949 MB
node 2 cpus:
node 2 size: 8079 MB
node 2 free: 7983 MB
node 3 cpus:
node 3 size: 8077 MB
node 3 free: 7980 MB
node distances:
node 0 1 2 3
0: 10 21 31 41
1: 21 10 41 31
2: 31 41 10 41
3: 41 31 41 10
[Figure: NUMA node 0 (Cores 0–33) paired with MCDRAM NUMA node 2; NUMA node 1 (Cores 34–67) paired with MCDRAM NUMA node 3.]

The default distribution on Cori is -m block:cyclic

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 4 --cpu_bind=cores -m block:cyclic ./a.out
[Figure: block:cyclic alternates tasks between the two NUMA “sockets”. KNL node 1: Ranks 0, 2, 4 on NUMA node 0 and Ranks 1, 3, 5 on NUMA node 1; KNL node 2: Ranks 6, 8, 10 on NUMA node 0 and Ranks 7, 9 on NUMA node 1.]

The default distribution on Cori is -m block:cyclic (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores ./a.out
# or
srun -n 11 -c 8 --cpu_bind=cores -m block:cyclic ./a.out
[Figure: as above, with each rank bound to 2 cores; Ranks 0–5 on KNL node 1 and Ranks 6–10 on KNL node 2, alternating between the two NUMA nodes.]

Block distribution: -m block:block

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 4 --cpu_bind=cores -m block:block ./a.out
[Figure: block:block fills one NUMA “socket” before the next. KNL node 1: Ranks 0–2 on NUMA node 0, Ranks 3–5 on NUMA node 1; KNL node 2: Ranks 6–8 on NUMA node 0, Ranks 9–10 on NUMA node 1.]

Block distribution: -m block:block (continued)

salloc -N 2 -p debug -C knl,snc2,flat
…
srun -n 11 -c 8 --cpu_bind=cores -m block:block ./a.out
[Figure: as above, with each rank bound to 2 cores. KNL node 1: Ranks 0–2 on NUMA node 0, Ranks 3–5 on NUMA node 1; KNL node 2: Ranks 6–8 on NUMA node 0, Ranks 9–10 on NUMA node 1.]

Sample job script to run MPI+OpenMP under the snc2,flat mode

#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,snc2,flat

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 2 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=local ./a.out

# using 2 CPUs (hardware threads) per core, with MCDRAM
srun -n 16 -c 16 --cpu_bind=cores --mem_bind=map_mem:2,3 ./a.out
# or
srun -n 16 -c 16 --cpu_bind=cores numactl -m 2,3 ./a.out

# using 2 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 16 -c 16 --cpu_bind=cores numactl -p 2,3 ./a.out

# using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 32 -c 8 --cpu_bind=cores numactl -p 2,3 ./a.out

Sample job script to run large jobs (> 1500 MPI tasks) (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 100
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun --bcast=/tmp/a.out -n 6400 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
sbcast ./a.out /tmp/a.out # copy a.out to /tmp on each allocated compute node first
srun -n 6400 -c 4 --cpu_bind=cores numactl -p 1 /tmp/a.out

Sample job script to use core specialization (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat
#SBATCH -S 1

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out
# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out

Sample job script to run with Intel MPI (quad,flat)

Sample job script (MPI+OpenMP)
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -p regular
#SBATCH -t 1:00:00
#SBATCH -C knl,quad,flat

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=threads
module load impi
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so
#export I_MPI_FABRICS=shm:tcp

# using 4 CPUs (hardware threads) per core, with DDR memory (or MCDRAM via libmemkind)
srun -n 64 -c 4 --cpu_bind=cores --mem_bind=local ./a.out

# or: using 4 CPUs (hardware threads) per core, with MCDRAM preferred
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./a.out

Thank you!