Blue Gene/Q and Knights Landing Many Core Architectures
Argonne Training Program on Extreme Scale Computing
Scott Parker, Argonne Leadership Computing Facility
8/01/2016
§ 2004:
– Blue Gene/L introduced
– LLNL 90-600 TF system, #1 on the Top500 for 3.5 years
§ 2005:
– Argonne accepts 1 rack (1,024 nodes) of Blue Gene/L (5.6 TF)
§ 2006:
– Argonne Leadership Computing Facility (ALCF) created
– ANL working with IBM on next-generation Blue Gene
§ 2008:
– ALCF accepts 40 racks (160k cores) of Blue Gene/P (557 TF)
§ 2009:
– ALCF approved for a 10 petaflop system to be delivered in 2012
– ANL working with IBM on next-generation Blue Gene
§ 2012:
– 48 racks of Mira, a Blue Gene/Q (10 PF), in production at ALCF
§ 2014:
– ALCF CORAL contract awarded to Intel/Cray
– Development partnership for Theta and Aurora begins
§ 2016:
– ALCF accepts Theta (8.5 PF), a Cray XC40 with Xeon Phi (KNL)
§ 2018:
– Aurora (180+ PF), a Cray/Intel Xeon Phi (KNH) system, to be delivered
Argonne HPC Timeline
§ Mira – BG/Q system
– 49,152 nodes / 786,432 cores
– 768 TB of memory
– Peak flop rate: 10 PF
– Linpack flop rate: 8.1 PF (#6 on the Top500)
§ Cetus & Vesta (T&D) – BG/Q systems
– 4k & 2k nodes / 64k & 32k cores
– 64 TB & 32 TB of memory
– 820 TF & 410 TF peak flop rate
§ Cooley – x86 & NVIDIA system
– 126 nodes / 1,512 x86 cores / 126 NVIDIA Tesla K80 GPUs
– 47 TB x86 memory / 3 TB GPU memory
– Peak flop rate: 220 TF
§ Storage
– Over 30 PB capacity, 240 GB/s bandwidth (GPFS)
Current ALCF Systems
§ Leadership computing power
– Leading architecture since introduction, #1 on half of the Top500 lists over the last 10 years
– On average over the last 12 years, 3 of the top 10 machines on the Top500 have been Blue Genes
§ Low speed, low power
– Embedded PowerPC core with custom SIMD floating point extensions
– Low frequency (L – 700 MHz, P – 850 MHz, Q – 1.6 GHz)
§ Massive parallelism
– Multi/many core (L – 2, P – 4, Q – 16)
– Many aggregate cores (L – 208k, P – 288k, Q – 1.5M)
§ Fast communication network(s)
– Low latency, high bandwidth torus network (L & P – 3D, Q – 5D)
§ Balance
– Processor, network, and memory speeds are well balanced
§ Minimal system overhead
– Simple lightweight OS (CNK) minimizes noise
§ Standard programming models
– Fortran, C, C++, and Python languages supported
– Provides MPI, OpenMP, and Pthreads parallel programming models
§ System on a Chip (SoC) & custom-designed Application Specific Integrated Circuit (ASIC)
– All node components on one chip, except for memory
– Reduces system complexity and power, improves price/performance
§ High reliability
– Sophisticated RAS (reliability, availability, and serviceability)
§ Dense packaging
– 1,024 nodes per rack
Blue Gene DNA And The Evolution of Many Core
BlueGene/Q Compute Chip
It's big:
§ 360 mm², Cu-45 technology (SOI)
§ 1.5B transistors
§ 205 GF per node
18 cores:
§ 16 compute cores
§ 17th core for system functions (OS, RAS)
§ Plus 1 redundant processor
§ L1 I/D cache = 16 kB / 16 kB
Crossbar switch:
§ Each core connected to shared L2
§ Aggregate read rate of 409.6 GB/s
Central shared L2 cache:
§ 32 MB eDRAM
§ 16 slices
Dual memory controller:
§ 16 GB external DDR3 memory
§ 42.6 GB/s bandwidth
On-chip networking:
§ Router logic integrated into the BQC chip
§ DMA, collective operations
§ 11 network ports
BG/Q Chip, Another View
[Diagram: the 18 cores, each with an L1 cache and L1 prefetcher, connect through the crossbar to the L2 cache slices; two memory controllers with their memory banks and the NIC/router also attach to the crossbar]
BG/Q Core:
§ Fully PowerPC-compliant 64-bit CPU, Power ISA v2.06
§ Plus QPX floating point vector instructions
§ Runs at 1.6 GHz
§ In-order execution
§ 4-way Simultaneous Multi-Threading
§ Registers: 32 64-bit integer, 32 256-bit floating point
Functional units:
§ IU – instruction fetch and decode
§ XU – branch, integer, and load/store instructions
§ AXU – floating point instructions
– Standard PowerPC instructions
– QPX 4-wide SIMD
§ MMU – memory management (TLB)
Instruction issue:
§ 2-way concurrent issue: 1 XU + 1 AXU
§ A given thread may only issue 1 instruction per cycle
§ Two threads may simultaneously issue 1 instruction each per cycle
QPX Overview
§ Unique 4-wide double precision SIMD instructions extending the standard Power ISA with:
– Full set of arithmetic functions
– Load/store instructions
– Permute instructions to reorganize data
§ 4-wide FMA instructions allow 8 flops/instruction
§ The FPU operates on:
– Standard scalar PowerPC FP instructions
– 4-wide SIMD instructions
– 2-wide complex arithmetic SIMD instructions
§ Standard 64-bit floating point registers are extended to 256 bits
§ Attached to the AXU port of the A2 core
§ A2 issues one instruction/cycle to the AXU
§ 6-stage pipeline
§ The compiler can generate QPX instructions
§ Intrinsic functions mapping to QPX instructions allow easy QPX programming
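The FMA pattern QPX accelerates looks like this in plain C. On BG/Q the XL compiler maps four iterations of the loop body to a single 4-wide QPX fused multiply-add, giving the 8 flops/instruction cited above (intrinsics such as `vec_madd` express the same thing explicitly). The sketch itself is portable C:

```c
#include <stddef.h>

/* a[i] = b[i]*c[i] + a[i]: one fused multiply-add per element.
   Auto-vectorized to QPX by XL on BG/Q; plain scalar code elsewhere. */
void fma_triad(size_t n, double *a, const double *b, const double *c) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i] + a[i];
}
```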
BG/Q Memory Hierarchy
[Diagram: cores 0 through 16, each with its L1 and L1P, connect through the crossbar (X-bar) to L2 slices 0 through 15; the two DRAM controllers, the DMA engine, and the network interface also attach to the crossbar]
Crossbar switch connects:
• L1Ps, L2 slices, network, and PCIe interface
• Aggregate bandwidth across slices: read 409.6 GB/s, write 204.8 GB/s
L2 cache:
• Shared by all cores
• Serves as the point of coherency, generates L1 invalidations
• Divided into 16 slices connected via the crossbar switch to each core
• 32 MB total, 2 MB per slice
• 16-way set associative, write-back, LRU replacement, 82-cycle latency
• Supports memory speculation and atomic memory operations
Memory:
• Two on-chip memory controllers
• Each connects to 8 L2 slices via 2 ring buses
• Each controller drives a 16+2 byte DDR3 channel at 1.33 Gb/s
• Peak bandwidth is 42.67 GB/s (excluding ECC)
• Latency > 350 cycles
L1 cache:
• Data: 16 KB, 8-way associative, 64-byte line, 6-cycle latency
• Instruction: 16 KB, 4-way associative, 3-cycle latency
L1 prefetcher (L1P):
• 32-entry prefetch buffer, entries are 128 bytes
• 24-cycle latency
• Operates in list or stream prefetch modes
• Operates as a write-back buffer
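The L2's atomic memory operations serve contended counters in the cache itself, avoiding the usual read-modify-write ping-pong between cores. As a portable stand-in for that pattern (C11 atomics, not the BG/Q SPI interface):

```c
#include <stdatomic.h>

/* Shared counter incremented by many threads. On BG/Q the L2 performs
   such fetch-and-add operations at the shared cache, which is why they
   scale well across all 16 cores. */
static atomic_long hits = 0;

long record_hit(void) {
    /* atomic_fetch_add returns the prior value; +1 gives the new one */
    return atomic_fetch_add(&hits, 1) + 1;
}
```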
The Blue Gene/Q Network
• 5D torus network:
– Achieves high nearest-neighbor bandwidth while increasing bisection bandwidth and reducing hops vs. a 3D torus
– Allows the machine to be partitioned into independent sub-machines
– No impact from concurrently running codes
– Hardware assists for collective & barrier functions over COMM_WORLD and rectangular sub-communicators
– Half rack (midplane) is a 4x4x4x4x2 torus (last dimension always 2)
• Nodes have 10 links with 2 GB/s raw bandwidth each
– Bi-directional: send + receive gives 4 GB/s
– 90% of bandwidth (1.8 GB/s) available to the user
• Hardware latency
– ~40 ns per hop through network logic
– Nearest: 80 ns
– Farthest: 3 µs (96-rack 20 PF system, 31 hops)
• Network performance
– Nearest-neighbor: 98% of peak
– Bisection: >93% of peak
– All-to-all: 97% of peak
– Collective: FP reductions at 94.6% of peak
– Allreduce hardware latency on 96k nodes: ~6.5 µs
– Barrier hardware latency on 96k nodes: ~6.3 µs
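On a torus partition, a node's neighbors wrap periodically in each dimension. A small sketch (a hypothetical helper, not a BG/Q API) of locating a one-hop neighbor on the 4x4x4x4x2 midplane torus mentioned above, with ranks laid out row-major:

```c
/* Rank of the one-hop neighbor along dimension d (dir = +1 or -1) in a
   5-D torus with extents dims[5]. */
int torus_neighbor(const int coords[5], const int dims[5], int d, int dir) {
    int c[5], rank = 0;
    for (int i = 0; i < 5; i++) c[i] = coords[i];
    c[d] = (c[d] + dir + dims[d]) % dims[d];   /* periodic wrap-around */
    for (int i = 0; i < 5; i++) rank = rank * dims[i] + c[i];
    return rank;
}
```

Every node thus has exactly 10 torus neighbors (+/-1 in each of 5 dimensions), matching the 10 torus links per node listed above.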
§ Facilitate extreme scalability
– Low noise on compute nodes
– File I/O offloaded to I/O nodes running full Linux
– GLIBC environment with a few restrictions for scaling
§ Familiar programming models such as MPI and OpenMP
– Scalable MPICH2 providing MPI 2.2 with extreme message rate
– Efficient intermediate (PAMI) and low-level (SPI) message libraries
– PAMI layer allows easy porting of runtimes like GA/ARMCI, Berkeley UPC, etc.
§ Standards-based when possible
– Linux development environment: familiar GNU toolchain with glibc, pthreads
– XL compilers: C, C++, Fortran with OpenMP 3.1
– Debuggers: TotalView
– Tools: HPCToolkit, TAU, PAPI, Valgrind
§ Open source where possible
§ Facilitate high performance for unique hardware:
– Quad FPU, DMA unit, list-based prefetcher
– TM (Transactional Memory), SE (Speculative Execution)
– Wakeup unit, scalable atomic operations
§ Flexible and fast job control – with high availability
– Noise-free partitioned networks
– Integrated HPC, HTC, MPMD, and sub-block jobs
§ Facilitate new programming models
Blue Gene/Q Software High-Level Goals & Philosophy
§ Theta
– Arriving in 2016
– Intel Xeon Phi, 2nd generation (Knights Landing)
– 3,240 compute nodes / 207,360 cores
– 660 TB of memory
– 8.5 petaflops peak performance
– Cray Aries interconnect
– Dragonfly network topology
– 10 PB Lustre file system
§ Aurora
– Arriving in 2018
– Intel Xeon Phi, 3rd generation (Knights Hill)
– Over 50,000 compute nodes
– Greater than 7 PB of persistent memory
– 180+ petaflops peak performance
– Intel Omni-Path interconnect
– Dragonfly network topology
– Over 150 PB Lustre file system
Future ALCF Systems
Knights Landing Processor
It's bigger:
§ 683 mm²
§ 14 nm process
§ 8 billion transistors
Up to 72 cores:
§ 36 tiles
§ 2 cores per tile
§ 2.4 TF per node
2D mesh interconnect:
§ Tiles connected by 2D mesh
On-package memory:
§ 16 GB MCDRAM
§ 8 stacks
§ ~450 GB/s bandwidth
6 DDR4 memory channels:
§ 2 controllers
§ Up to 384 GB external DDR4
§ 90 GB/s bandwidth
On-socket networking:
§ Omni-Path NIC on package
§ Connected by PCIe
KNL Tile and Core
Tile:
• Two CPUs
• 2 VPUs per core
• Shared 1 MB L2 cache (not global)
• Caching/home agent (CHA)
– Distributed directory, coherence
Core:
• Based on Silvermont (Atom)
• Functional units:
– 2 integer ALUs
– 2 memory units
– 2 VPUs with AVX-512
• Instruction issue & execution:
– 2-wide decode
– 6-wide execute
– Out of order
• 4 hardware threads per core
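AVX-512 gives each VPU 8-wide double precision vectors. A portable sketch of a loop the compiler can map onto them; the `omp simd` hint is ignored when OpenMP is off, so the code is correct regardless (illustrative, not from the talk):

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i]: with AVX-512 each vector FMA retires 8 doubles,
   so the two VPUs per core can sustain 32 flops/cycle on this loop.
   The pragma only invites vectorization; semantics are unchanged. */
void daxpy(size_t n, double a, const double *x, double *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```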
Programming with IPM and DDR
[Diagram: CPU with in-package memory (IPM, ~480 GB/s) and off-package DDR]
• Two memory types
– In-Package Memory (IPM)
• 16 GB MCDRAM
• ~480 GB/s bandwidth
– Off-Package Memory (DDR)
• Up to 384 GB
• ~90 GB/s bandwidth
• One address space
– Possibly multiple NUMA domains
• Memory configurations
– Cached: DDR fully cached by IPM
– Direct mapped: user managed
– Hybrid: ¼ or ½ of IPM used as cache
• Managing memory:
– jemalloc & numa libraries
– Pragmas for static memory allocations
[Diagrams: the fully cached, hybrid, and direct mapped configurations; IPM serves the CPU at ~480 GB/s, DDR at ~90 GB/s]
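In the direct mapped (flat) configuration, IPM shows up as a separate NUMA domain, and the memkind library (built on the jemalloc and numa libraries noted above) exposes `hbw_malloc` for it. A hedged sketch that falls back to plain `malloc` when memkind is absent, so it builds anywhere; `USE_MEMKIND` is an assumed build flag, not part of the library:

```c
#include <stdlib.h>

/* hbw_malloc()/hbw_free() come from memkind's <hbwmalloc.h> and allocate
   from MCDRAM; the fallback keeps the sketch buildable without memkind. */
#ifdef USE_MEMKIND
#include <hbwmalloc.h>
#define HOT_ALLOC(sz) hbw_malloc(sz)
#define HOT_FREE(p)   hbw_free(p)
#else
#define HOT_ALLOC(sz) malloc(sz)   /* DDR stand-in when memkind is absent */
#define HOT_FREE(p)   free(p)
#endif

/* Place only the bandwidth-critical array in the 16 GB of IPM;
   everything else stays in the larger, slower DDR pool. */
double *alloc_hot(size_t n) {
    return (double *)HOT_ALLOC(n * sizeof(double));
}
```

The same selective-placement effect can come from `numactl` bindings at job launch rather than code changes, which is the usual first step when porting.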
Clustering Modes
Memory access flow:
1. Tile has a cache miss
2. CHA selected
3. Memory controller accessed
4. Memory received
[Diagrams: Quadrant mode and SNC-4 mode]
Aries Dragonfly Network
Aries router:
• 4 NICs connected via PCIe
• 40 network tiles/links
• 4.7-5.25 GB/s per direction per link
Dragonfly topology:
• 4 nodes connected to an Aries
• 2 local all-to-all dimensions
– 16 all-to-all horizontal
– 6 all-to-all vertical
• 384 nodes in a local group
• All-to-all connections between groups
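Using the slide's figures (4 nodes per Aries, 384 nodes per group, hence 96 routers per group), a node's position in the Dragonfly follows arithmetically from its flat index. This helper is illustrative, not a Cray API:

```c
typedef struct { int group, router, node; } dfly_pos;

/* Decompose a flat node index: 4 nodes per Aries router,
   96 routers (384 nodes) per Dragonfly group. */
dfly_pos dfly_locate(int node_id) {
    dfly_pos p;
    p.node   = node_id % 4;
    p.router = (node_id / 4) % 96;
    p.group  = node_id / 384;
    return p;
}
```

Two nodes in the same group reach each other through the local all-to-all dimensions; crossing a group boundary uses the global all-to-all links.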
§ More local parallelism
– 64-72 cores (KNL) vs 16 (BG/Q)
– 4 hardware threads on both
§ No increase in the number of nodes
§ Increased vector length
– 8-wide vectors (KNL) vs 4-wide vectors (BG/Q)
§ Increased node performance
– 2.4 TF (KNL) vs 0.2 TF (BG/Q)
§ Instruction issue
– Out-of-order (KNL) vs in-order (BG/Q)
– 2-wide instruction issue on both
– 2 floating point instructions per cycle (KNL) vs 1 per cycle (BG/Q)
§ Memory hierarchy
– MCDRAM & DDR (KNL) vs uniform 16 GB DDR (BG/Q)
§ Different network topology
– 5D torus (BG/Q) vs Dragonfly (KNL)
§ NIC connectivity
– PCIe (Aries, Omni-Path) vs direct crossbar connection (BG/Q)
Considerations In Moving From BG/Q to Xeon Phi (KNL)
Questions?