Blue Gene/Q and Knights Landing Many Core Architectures
Argonne Training Program on Extreme Scale Computing
Scott Parker, Argonne Leadership Computing Facility
8/01/2016
§ 2004:
– Blue Gene/L introduced
– LLNL 90-600 TF system, #1 on the Top500 for 3.5 years
§ 2005:
– Argonne accepts 1 rack (1,024 nodes) of Blue Gene/L (5.6 TF)
§ 2006:
– Argonne Leadership Computing Facility (ALCF) created
– ANL working with IBM on next-generation Blue Gene
§ 2008:
– ALCF accepts 40 racks (160k cores) of Blue Gene/P (557 TF)
§ 2009:
– ALCF approved for a 10 petaflop system to be delivered in 2012
– ANL working with IBM on next-generation Blue Gene
§ 2012:
– 48 racks of Mira, a Blue Gene/Q (10 PF), in production at ALCF
§ 2014:
– ALCF CORAL contract awarded to Intel/Cray
– Development partnership for Theta and Aurora begins
§ 2016:
– ALCF accepts Theta (8.5 PF), a Cray XC40 with Xeon Phi (KNL)
§ 2018:
– Aurora (180+ PF), a Cray/Intel Xeon Phi (KNH) system, to be delivered
Argonne HPC Timeline
§ Mira – BG/Q system
– 49,152 nodes / 786,432 cores
– 768 TB of memory
– Peak flop rate: 10 PF
– Linpack flop rate: 8.1 PF (#6 on the Top500)
§ Cetus & Vesta (T&D) – BG/Q systems
– 4k & 2k nodes / 64k & 32k cores
– 64 TB & 32 TB of memory
– 820 TF & 410 TF peak flop rate
§ Cooley – x86 & NVIDIA system
– 126 nodes / 1,512 x86 cores / 126 NVIDIA Tesla K80 GPUs
– 47 TB x86 memory / 3 TB GPU memory
– Peak flop rate: 220 TF
§ Storage
– Over 30 PB capacity, 240 GB/s bandwidth (GPFS)
Current ALCF Systems
§ Leadership computing power
– Leading architecture since introduction, #1 on half of the Top500 lists over the last 10 years
– On average over the last 12 years, 3 of the top 10 machines on the Top500 have been Blue Genes
§ Low speed, low power
– Embedded PowerPC core with custom SIMD floating point extensions
– Low frequency (L – 700 MHz, P – 850 MHz, Q – 1.6 GHz)
§ Massive parallelism
– Multi/many core (L – 2, P – 4, Q – 16)
– Many aggregate cores (L – 208k, P – 288k, Q – 1.5M)
§ Fast communication network(s)
– Low latency, high bandwidth torus network (L & P – 3D, Q – 5D)
§ Balance
– Processor, network, and memory speeds are well balanced
§ Minimal system overhead
– Simple lightweight OS (CNK) minimizes noise
§ Standard programming models
– Fortran, C, C++, and Python languages supported
– Provides MPI, OpenMP, and Pthreads parallel programming models
§ System on a Chip (SoC) & custom-designed Application Specific Integrated Circuit (ASIC)
– All node components on one chip, except for memory
– Reduces system complexity and power, improves price/performance
§ High reliability
– Sophisticated RAS (reliability, availability, and serviceability)
§ Dense packaging
– 1,024 nodes per rack
Blue Gene DNA And The Evolution of Many Core
BlueGene/Q Compute Chip
It's big:
§ 360 mm², Cu-45 technology (SOI)
§ 1.5B transistors
§ 205 GF per node
18 cores:
§ 16 compute cores
§ 17th core for system functions (OS, RAS)
§ Plus 1 redundant processor
§ L1 I/D cache = 16 kB / 16 kB
Crossbar switch:
§ Each core connected to shared L2
§ Aggregate read rate of 409.6 GB/s
Central shared L2 cache:
§ 32 MB eDRAM
§ 16 slices
Dual memory controller:
§ 16 GB external DDR3 memory
§ 42.6 GB/s bandwidth
On-chip networking:
§ Router logic integrated into the BQC chip
§ DMA, collective operations
§ 11 network ports
BG/Q Chip, Another View
[Diagram: the 18 cores, each with an L1 cache and L1 prefetcher, connect through the crossbar to the L2 cache slices; two memory controllers with their memory banks and the NIC/router also attach to the crossbar]
BG/Q Core:
§ Fully PowerPC-compliant 64-bit CPU, Power ISA v2.06
§ Plus QPX floating point vector instructions
§ Runs at 1.6 GHz
§ In-order execution
§ 4-way Simultaneous Multi-Threading
§ Registers: 32 64-bit integer, 32 256-bit floating point
Functional units:
§ IU – instruction fetch and decode
§ XU – branch, integer, and load/store instructions
§ AXU – floating point instructions
– Standard PowerPC instructions
– QPX 4-wide SIMD
§ MMU – memory management (TLB)
Instruction issue:
§ 2-way concurrent issue: 1 XU + 1 AXU
§ A given thread may only issue 1 instruction per cycle
§ Two threads may simultaneously issue 1 instruction each per cycle
QPX Overview
§ Unique 4-wide double precision SIMD instructions extending the standard Power ISA with:
– Full set of arithmetic functions
– Load/store instructions
– Permute instructions to reorganize data
§ 4-wide FMA instructions allow 8 flops/instruction
§ The FPU operates on:
– Standard scalar PowerPC FP instructions
– 4-wide SIMD instructions
– 2-wide complex arithmetic SIMD instructions
§ Standard 64-bit floating point registers are extended to 256 bits
§ Attached to the AXU port of the A2 core
§ A2 issues one instruction/cycle to the AXU
§ 6-stage pipeline
§ The compiler can generate QPX instructions
§ Intrinsic functions mapping to QPX instructions allow easy QPX programming
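The FMA pattern QPX accelerates looks like this in plain C. On BG/Q the XL compiler maps four iterations of the loop body to a single 4-wide QPX fused multiply-add, giving the 8 flops/instruction cited above (intrinsics such as `vec_madd` express the same thing explicitly). The sketch itself is portable C:

```c
#include <stddef.h>

/* a[i] = b[i]*c[i] + a[i]: one fused multiply-add per element.
   Auto-vectorized to QPX by XL on BG/Q; plain scalar code elsewhere. */
void fma_triad(size_t n, double *a, const double *b, const double *c) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i] + a[i];
}
```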
BG/Q Memory Hierarchy
[Diagram: cores 0 through 16, each with its L1 and L1P, connect through the crossbar (X-bar) to L2 slices 0 through 15; the two DRAM controllers, the DMA engine, and the network interface also attach to the crossbar]
Crossbar switch connects:
• L1Ps, L2 slices, network, and PCIe interface
• Aggregate bandwidth across slices: read 409.6 GB/s, write 204.8 GB/s
L2 cache:
• Shared by all cores
• Serves as the point of coherency, generates L1 invalidations
• Divided into 16 slices connected via the crossbar switch to each core
• 32 MB total, 2 MB per slice
• 16-way set associative, write-back, LRU replacement, 82-cycle latency
• Supports memory speculation and atomic memory operations
Memory:
• Two on-chip memory controllers
• Each connects to 8 L2 slices via 2 ring buses
• Each controller drives a 16+2 byte DDR3 channel at 1.33 Gb/s
• Peak bandwidth is 42.67 GB/s (excluding ECC)
• Latency > 350 cycles
L1 cache:
• Data: 16 KB, 8-way associative, 64-byte line, 6-cycle latency
• Instruction: 16 KB, 4-way associative, 3-cycle latency
L1 prefetcher (L1P):
• 32-entry prefetch buffer, entries are 128 bytes
• 24-cycle latency
• Operates in list or stream prefetch modes
• Operates as a write-back buffer
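The L2's atomic memory operations serve contended counters in the cache itself, avoiding the usual read-modify-write ping-pong between cores. As a portable stand-in for that pattern (C11 atomics, not the BG/Q SPI interface):

```c
#include <stdatomic.h>

/* Shared counter incremented by many threads. On BG/Q the L2 performs
   such fetch-and-add operations at the shared cache, which is why they
   scale well across all 16 cores. */
static atomic_long hits = 0;

long record_hit(void) {
    /* atomic_fetch_add returns the prior value; +1 gives the new one */
    return atomic_fetch_add(&hits, 1) + 1;
}
```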
The Blue Gene/Q Network
• 5D torus network:
– Achieves high nearest-neighbor bandwidth while increasing bisection bandwidth and reducing hops vs. a 3D torus
– Allows the machine to be partitioned into independent sub-machines
– No impact from concurrently running codes
– Hardware assists for collective & barrier functions over COMM_WORLD and rectangular sub-communicators
– Half rack (midplane) is a 4x4x4x4x2 torus (last dimension always 2)
• Nodes have 10 links with 2 GB/s raw bandwidth each
– Bi-directional: send + receive gives 4 GB/s
– 90% of bandwidth (1.8 GB/s) available to the user
• Hardware latency
– ~40 ns per hop through network logic
– Nearest: 80 ns
– Farthest: 3 µs (96-rack 20 PF system, 31 hops)
• Network performance
– Nearest-neighbor: 98% of peak
– Bisection: >93% of peak
– All-to-all: 97% of peak
– Collective: FP reductions at 94.6% of peak
– Allreduce hardware latency on 96k nodes: ~6.5 µs
– Barrier hardware latency on 96k nodes: ~6.3 µs
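On a torus partition, a node's neighbors wrap periodically in each dimension. A small sketch (a hypothetical helper, not a BG/Q API) of locating a one-hop neighbor on the 4x4x4x4x2 midplane torus mentioned above, with ranks laid out row-major:

```c
/* Rank of the one-hop neighbor along dimension d (dir = +1 or -1) in a
   5-D torus with extents dims[5]. */
int torus_neighbor(const int coords[5], const int dims[5], int d, int dir) {
    int c[5], rank = 0;
    for (int i = 0; i < 5; i++) c[i] = coords[i];
    c[d] = (c[d] + dir + dims[d]) % dims[d];   /* periodic wrap-around */
    for (int i = 0; i < 5; i++) rank = rank * dims[i] + c[i];
    return rank;
}
```

Every node thus has exactly 10 torus neighbors (+/-1 in each of 5 dimensions), matching the 10 torus links per node listed above.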
§ Facilitate extreme scalability
– Low noise on compute nodes
– File I/O offloaded to I/O nodes running full Linux
– GLIBC environment with a few restrictions for scaling
§ Familiar programming models such as MPI and OpenMP
– Scalable MPICH2 providing MPI 2.2 with extreme message rate
– Efficient intermediate (PAMI) and low-level (SPI) message libraries
– PAMI layer allows easy porting of runtimes like GA/ARMCI, Berkeley UPC, etc.
§ Standards-based when possible
– Linux development environment: familiar GNU toolchain with glibc, pthreads
– XL compilers: C, C++, Fortran with OpenMP 3.1
– Debuggers: TotalView
– Tools: HPCToolkit, TAU, PAPI, Valgrind
§ Open source where possible
§ Facilitate high performance for unique hardware:
– Quad FPU, DMA unit, list-based prefetcher
– TM (Transactional Memory), SE (Speculative Execution)
– Wakeup unit, scalable atomic operations
§ Flexible and fast job control – with high availability
– Noise-free partitioned networks
– Integrated HPC, HTC, MPMD, and sub-block jobs
§ Facilitate new programming models
Blue Gene/Q Software High-Level Goals & Philosophy
§ Theta
– Arriving in 2016
– Intel Xeon Phi, 2nd generation (Knights Landing)
– 3,240 compute nodes / 207,360 cores
– 660 TB of memory
– 8.5 petaflops peak performance
– Cray Aries interconnect
– Dragonfly network topology
– 10 PB Lustre file system
§ Aurora
– Arriving in 2018
– Intel Xeon Phi, 3rd generation (Knights Hill)
– Over 50,000 compute nodes
– Greater than 7 PB of persistent memory
– 180+ petaflops peak performance
– Intel Omni-Path interconnect
– Dragonfly network topology
– Over 150 PB Lustre file system
Future ALCF Systems
Knights Landing Processor
It's bigger:
§ 683 mm²
§ 14 nm process
§ 8 billion transistors
Up to 72 cores:
§ 36 tiles
§ 2 cores per tile
§ 2.4 TF per node
2D mesh interconnect:
§ Tiles connected by 2D mesh
On-package memory:
§ 16 GB MCDRAM
§ 8 stacks
§ ~450 GB/s bandwidth
6 DDR4 memory channels:
§ 2 controllers
§ Up to 384 GB external DDR4
§ 90 GB/s bandwidth
On-socket networking:
§ Omni-Path NIC on package
§ Connected by PCIe
KNL Tile and Core
Tile:
• Two CPUs
• 2 VPUs per core
• Shared 1 MB L2 cache (not global)
• Caching/home agent (CHA)
– Distributed directory, coherence
Core:
• Based on Silvermont (Atom)
• Functional units:
– 2 integer ALUs
– 2 memory units
– 2 VPUs with AVX-512
• Instruction issue & execution:
– 2-wide decode
– 6-wide execute
– Out of order
• 4 hardware threads per core
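AVX-512 gives each VPU 8-wide double precision vectors. A portable sketch of a loop the compiler can map onto them; the `omp simd` hint is ignored when OpenMP is off, so the code is correct regardless (illustrative, not from the talk):

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i]: with AVX-512 each vector FMA retires 8 doubles,
   so the two VPUs per core can sustain 32 flops/cycle on this loop.
   The pragma only invites vectorization; semantics are unchanged. */
void daxpy(size_t n, double a, const double *x, double *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```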
Programming with IPM and DDR
[Diagram: CPU with in-package memory (IPM, ~480 GB/s) and off-package DDR]
• Two memory types
– In-Package Memory (IPM)
• 16 GB MCDRAM
• ~480 GB/s bandwidth
– Off-Package Memory (DDR)
• Up to 384 GB
• ~90 GB/s bandwidth
• One address space
– Possibly multiple NUMA domains
• Memory configurations
– Cached: DDR fully cached by IPM
– Direct mapped: user managed
– Hybrid: ¼ or ½ of IPM used as cache
• Managing memory:
– jemalloc & numa libraries
– Pragmas for static memory allocations
[Diagrams: the fully cached, hybrid, and direct mapped configurations; IPM serves the CPU at ~480 GB/s, DDR at ~90 GB/s]
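In the direct mapped (flat) configuration, IPM shows up as a separate NUMA domain, and the memkind library (built on the jemalloc and numa libraries noted above) exposes `hbw_malloc` for it. A hedged sketch that falls back to plain `malloc` when memkind is absent, so it builds anywhere; `USE_MEMKIND` is an assumed build flag, not part of the library:

```c
#include <stdlib.h>

/* hbw_malloc()/hbw_free() come from memkind's <hbwmalloc.h> and allocate
   from MCDRAM; the fallback keeps the sketch buildable without memkind. */
#ifdef USE_MEMKIND
#include <hbwmalloc.h>
#define HOT_ALLOC(sz) hbw_malloc(sz)
#define HOT_FREE(p)   hbw_free(p)
#else
#define HOT_ALLOC(sz) malloc(sz)   /* DDR stand-in when memkind is absent */
#define HOT_FREE(p)   free(p)
#endif

/* Place only the bandwidth-critical array in the 16 GB of IPM;
   everything else stays in the larger, slower DDR pool. */
double *alloc_hot(size_t n) {
    return (double *)HOT_ALLOC(n * sizeof(double));
}
```

The same selective-placement effect can come from `numactl` bindings at job launch rather than code changes, which is the usual first step when porting.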
Clustering Modes
Memory access flow:
1. Tile has a cache miss
2. CHA selected
3. Memory controller accessed
4. Memory received
[Diagrams: Quadrant mode and SNC-4 mode]
Aries Dragonfly Network
Aries router:
• 4 NICs connected via PCIe
• 40 network tiles/links
• 4.7-5.25 GB/s per direction per link
Dragonfly topology:
• 4 nodes connected to an Aries
• 2 local all-to-all dimensions
– 16 all-to-all horizontal
– 6 all-to-all vertical
• 384 nodes in a local group
• All-to-all connections between groups
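Using the slide's figures (4 nodes per Aries, 384 nodes per group, hence 96 routers per group), a node's position in the Dragonfly follows arithmetically from its flat index. This helper is illustrative, not a Cray API:

```c
typedef struct { int group, router, node; } dfly_pos;

/* Decompose a flat node index: 4 nodes per Aries router,
   96 routers (384 nodes) per Dragonfly group. */
dfly_pos dfly_locate(int node_id) {
    dfly_pos p;
    p.node   = node_id % 4;
    p.router = (node_id / 4) % 96;
    p.group  = node_id / 384;
    return p;
}
```

Two nodes in the same group reach each other through the local all-to-all dimensions; crossing a group boundary uses the global all-to-all links.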
§ More local parallelism
– 64-72 cores (KNL) vs 16 (BG/Q)
– 4 hardware threads on both
§ No increase in the number of nodes
§ Increased vector length
– 8-wide vectors (KNL) vs 4-wide vectors (BG/Q)
§ Increased node performance
– 2.4 TF (KNL) vs 0.2 TF (BG/Q)
§ Instruction issue
– Out-of-order (KNL) vs in-order (BG/Q)
– 2-wide instruction issue on both
– 2 floating point instructions per cycle (KNL) vs 1 per cycle (BG/Q)
§ Memory hierarchy
– MCDRAM & DDR (KNL) vs uniform 16 GB DDR (BG/Q)
§ Different network topology
– 5D torus (BG/Q) vs Dragonfly (KNL)
§ NIC connectivity
– PCIe (Aries, Omni-Path) vs direct crossbar connection (BG/Q)
Considerations In Moving From BG/Q to Xeon Phi (KNL)
Questions?