Introduction to the Oakforest-PACS Supercomputer in Japan
Taisuke Boku, Center for Computational Sciences, University of Tsukuba / Joint Center for Advanced HPC (JCAHPC)

Introduction to Oakforest-PACS (OFP)

2016/11/14, DDN UG@SC16, Salt Lake City
Japan's New Fastest Supercomputer
Deployment plan of 9 supercomputing centers (Oct. 2015, latest)

[Roadmap chart spanning FY2014-2025 for the nine national supercomputing centers: Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu. Each row shows the center's current systems and planned upgrades with peak performance and power budget, e.g. TSUBAME 2.5 / 3.0 / 4.0 at Tokyo Tech., HA-PACS, COMA, and PACS-X at Tsukuba, and Reedbush plus the 25 PF Post Open Supercomputer at Tokyo. Power consumption indicates the maximum of the power supply, including the cooling facility.]
JCAHPC

• Joint Center for Advanced High Performance Computing (http://jcahpc.jp)
• Very tight collaboration for "post-T2K" between two universities, the Univ. of Tsukuba and the Univ. of Tokyo
• For the main supercomputer resources, a uniform specification for a single shared system
• Each university is financially responsible for introducing the machine and its operation -> unified procurement toward a single system with the largest scale in Japan
• To manage everything smoothly, a joint organization was established -> JCAHPC, by the Center for Computational Sciences, U. Tsukuba and the Information Technology Center, U. Tokyo
History of JCAHPC

• March 2013: U. Tsukuba and U. Tokyo exchanged an agreement on "JCAHPC establishment and operation"
  • between the Center for Computational Sciences, University of Tsukuba and the Information Technology Center, University of Tokyo
• April 2013: JCAHPC started
  • 1st period director: Mitsuhisa Sato (Tsukuba); vice director: Yutaka Ishikawa (Tokyo)
  • 2nd period (2016~) director: Hiroshi Nakamura (Tokyo); vice director: Masayuki Umemura (Tsukuba)
• July 2013: RFI for procurement
  • at this time the joint procurement style was not yet fixed -> a single-system procurement was then decided
  • to give enough time for very advanced processor, network, and memory technology, more than a year was taken to fix the specification
• It is the first time in Japan that multiple national universities have introduced a shared single supercomputer system!
Procurement Policies of the JCAHPC System

• based on the spirit of T2K, introducing open advanced technology
  • massively parallel PC cluster
  • advanced processor for HPC
  • easy-to-use and efficient interconnect
  • large-scale shared file system, flatly shared by all nodes
• joint procurement by the two universities
  • the largest class of budget among national universities' supercomputers in Japan
  • the largest system scale of any PC cluster in Japan
  • no accelerators, to support a wide variety of users and application fields
• benefits of a single system
  • scale merit by merging budgets -> largest in Japan
  • ultra-large-scale single-job execution on special occasions such as a "Gordon Bell Prize Challenge"
Oakforest-PACS (OFP)
Oakforest-PACS: Japan's fastest supercomputer today

Photo of computation node & chassis
Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps)

Chassis with 8 nodes, 2U size
Specification of Oakforest-PACS system
Total peak performance: 25 PFLOPS
Linpack performance: 13.55 PFLOPS (with 8,178 nodes, 556,104 cores)
Total number of compute nodes: 8,208

Compute node
  Product: Fujitsu PRIMERGY CX1640 M1
  Processor: Intel Xeon Phi 7250 (code name: Knights Landing), 68 cores, 3 TFLOPS
  Memory, high BW: 16 GB, >400 GB/sec (MCDRAM, effective rate)
  Memory, low BW: 96 GB, 115.2 GB/sec (DDR4-2400 x 6ch, peak rate)

Interconnect
  Product: Intel Omni-Path Architecture
  Link speed: 100 Gbps
  Topology: fat-tree with (completely) full-bisection bandwidth (102.6 TB/s)

Login node
  Product: Fujitsu PRIMERGY RX2530 M2 server
  # of servers: 20
  Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4ch x 2 sockets)
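The two-tier memory (16 GB of fast MCDRAM plus 96 GB of DDR4) is usually exploited on KNL by binding allocations to the MCDRAM NUMA node when the chip runs in flat mode. A minimal sketch with `numactl`, assuming flat mode and that MCDRAM appears as NUMA node 1 (the node number varies by configuration; `./my_app` is a placeholder binary):

```shell
# Run entirely out of MCDRAM (allocation fails if the 16 GB is exceeded):
numactl --membind=1 ./my_app

# Prefer MCDRAM, falling back to DDR4 once it fills up:
numactl --preferred=1 ./my_app
```

In cache mode the MCDRAM instead acts as a transparent last-level cache and no binding is needed.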
Specification of Oakforest-PACS system (I/O)
Parallel file system
  Type: Lustre file system
  Total capacity: 26.2 PB
  Metadata
    Product: DataDirect Networks MDS server + SFA7700X
    # of MDS: 4 servers x 3 sets
    MDT: 7.7 TB (SAS SSD) x 3 sets
  Object storage
    Product: DataDirect Networks SFA14KE
    # of OSS (nodes): 10 (20)
    Aggregate BW: 500 GB/sec

Fast file cache system
  Type: burst buffer, Infinite Memory Engine (by DDN)
  Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  Product: DataDirect Networks IME14K
  # of servers (nodes): 25 (50)
  Aggregate BW: 1,560 GB/sec
Full-bisection-bandwidth fat-tree by Intel Omni-Path Architecture
12 of 768-port Director Switches (source: Intel)
362 of 48-port Edge Switches (each with 24 uplinks and 24 downlinks)

[Topology diagram: two-level fat-tree connecting the edge switches to the director layer, with endpoints 1-24, 25-48, 49-72, ... grouped under successive edge switches.]

Endpoints:
  Compute nodes: 8,208
  Login nodes: 20
  Parallel FS: 64
  IME: 200
  Mgmt, etc.: 8
  Total: 8,500

Firstly, to reduce switches and cables, we considered grouping all nodes into subgroups connected internally with FBB fat-tree, with subgroups connected to each other at >20% of FBB. But the hardware quantity is not so different from a globally FBB network, and global FBB is preferred for flexible job management.
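The quoted 102.6 TB/s bisection bandwidth and the switch counts above are internally consistent, which a few lines of arithmetic confirm:

```python
# Sanity-check the full-bisection bandwidth and port counts of the
# Omni-Path fat-tree, using only the numbers from the slide above.

compute_nodes = 8208
link_gbps = 100  # Omni-Path link speed per node

# Full bisection: every compute node can drive its full link rate across
# the bisection. 100 Gbps = 12.5 GB/s per link.
bisection_tb_s = compute_nodes * link_gbps / 8 / 1000
print(bisection_tb_s)  # 102.6, matching the slide

# Edge layer: 362 switches with 24 downlinks / 24 uplinks each.
edge_downlinks = 362 * 24
endpoints = 8208 + 20 + 64 + 200 + 8   # compute, login, FS, IME, mgmt
assert edge_downlinks >= endpoints     # 8,688 ports cover 8,500 endpoints

# Core layer: 12 director switches with 768 ports each absorb all uplinks.
assert 12 * 768 >= edge_downlinks
```

Equal uplink and downlink counts per edge switch are exactly what makes the tree full-bisection rather than oversubscribed.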
Schematics of user's view
Parallel file system: 26.2 PB (Lustre, DDN SFA14KE x10), 500 GB/s
File cache system: 940 TB (DDN IME14K x25), 1,560 GB/s
Interconnect: Omni-Path Architecture (100 Gbps), full-bisection BW fat-tree

Compute nodes: 25 PFlops
  CPU: Intel Xeon Phi 7250 (KNL, 68 cores, 1.4 GHz)
  Mem: 16 GB (MCDRAM, 490 GB/sec effective) + 96 GB (DDR4-2400, 115.2 GB/sec)
  x 8,208: Fujitsu PRIMERGY CX1640 M1, 8 nodes inside a CX600 M1 chassis (2U)

Login nodes: x20, serving both U. Tsukuba users and U. Tokyo users
Facility of Oakforest-PACS system
Power consumption: 4.2 MW (including cooling)
# of racks: 102

Cooling system
  Compute nodes
    Type: warm-water cooling (direct cooling for the CPU, rear-door cooling except the CPU)
    Facility: cooling tower & chiller
  Others
    Type: air cooling
    Facility: PAC
Software of Oakforest-PACS system
OS: CentOS 7, McKernel (compute node); Red Hat Enterprise Linux 7 (login node)
Compiler: gcc, Intel compiler (C, C++, Fortran)
MPI: Intel MPI, MVAPICH2
Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads
Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby
Distributed FS (login node): Globus Toolkit, Gfarm
Job scheduler: Fujitsu Technical Computing Suite
Debugger: Allinea DDT
Profiler: Intel VTune Amplifier, Trace Analyzer & Collector
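With the listed Intel toolchain, a typical build line targeting the KNL nodes might look like the following sketch; `-xMIC-AVX512` is the standard Intel compiler option for Knights Landing's AVX-512 instruction set, and `app.c` is a placeholder source file (the exact flags used on Oakforest-PACS are not stated in the slides):

```shell
# Compile an MPI+OpenMP code for Knights Landing with Intel MPI / Intel C:
mpiicc -O3 -qopenmp -xMIC-AVX512 app.c -o app
```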
File I/O on Oakforest-PACS

• Basic shared file system via Lustre
  • 500 GB/s high-bandwidth shared file system, supported by the fat-tree network
  • freedom of data allocation and flexibility in node scheduling
• File staging on the file cache (IME)
  • 1.5 TB/s high-bandwidth burst buffer with NVMe to support highly I/O-intensive jobs
  • supporting astrophysics, climate, etc.
• Job scheduler integration
  • pre-staging is under the control of the job scheduler for smooth job launch, with background IME-Lustre data transfer
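The pre-staging idea above (moving input data from the shared Lustre file system into the fast IME tier before computation starts) can be sketched generically. The paths and the `stage_in` helper below are hypothetical illustrations, not the Oakforest-PACS mechanism: on the real system the scheduler drives the IME-Lustre transfer in the background.

```python
import shutil
from pathlib import Path

def stage_in(lustre_dir: Path, ime_dir: Path) -> list[Path]:
    """Copy job input files from the shared FS into the fast cache tier."""
    ime_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(lustre_dir.glob("*.dat")):
        dst = ime_dir / src.name
        shutil.copy2(src, dst)  # the job then reads at the fast tier's BW
        staged.append(dst)
    return staged

# Hypothetical mount points; on a real system these would be the Lustre
# scratch area and the burst-buffer namespace assigned to the job:
# staged = stage_in(Path("/lustre/job123/in"), Path("/ime/job123/in"))
```

The benefit is the one named on the slide: the bulk transfer overlaps with scheduling, so the job itself only ever touches the 1.5 TB/s tier.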
Ultra Compact File System
System          IO Nodes  Rack units  45U racks  Max Write (GB/s)  Max Read (GB/s)  IO Test
K Computer      2592      10368       ~231       965               1486             FPP
ORNL Spider 2   2288      2088        ~47        1100              1100             FPP
Blue Waters     432       1080        ~25        1296              1296             FPP
Oakforest-PACS  50        100         ~2.5       (peak 1560)       (peak 1560)      SSF & FPP

On Oakforest-PACS, single-shared-file access is quite fast as well as file-per-process.
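The "ultra compact" claim becomes concrete when the table is reduced to bandwidth per rack unit:

```python
# Peak file-system read bandwidth per rack unit (GB/s per U), computed
# directly from the comparison table above.

systems = {
    "K Computer":     (1486, 10368),  # (max read GB/s, rack units)
    "ORNL Spider 2":  (1100, 2088),
    "Blue Waters":    (1296, 1080),
    "Oakforest-PACS": (1560, 100),    # IME peak
}

for name, (read_gbs, units) in systems.items():
    print(f"{name:15s} {read_gbs / units:8.2f} GB/s per rack unit")
# Oakforest-PACS delivers ~15.6 GB/s per rack unit, roughly 13x Blue
# Waters' ~1.2 and about two orders of magnitude above the K Computer.
```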
Machine location: Kashiwa Campus of U. Tokyo
[Map showing the Hongo Campus of U. Tokyo, U. Tsukuba, and the Kashiwa Campus of U. Tokyo, where the machine is installed.]
System operation outline

• Regular operation
  • both universities share the CPU time based on their budget ratio
  • the system hardware is not split; instead the "CPU time" is split, for flexible operation (except several specially dedicated partitions)
  • single system entry for the HPCI program; each university's own programs also run under "CPU time" sharing
• Special operation
  • massively large-scale operation (limited period) -> effectively using the largest-class resource in Japan for special occasions (e.g. the Gordon Bell Challenge)
• Power-saving operation
  • a power-capping feature for energy saving; the scheduling feature reacts to power-saving requirements (e.g. summer time)
Summary

• Oakforest-PACS, Japan's new fastest supercomputer, has been launched under JCAHPC operation by U. Tsukuba and U. Tokyo
  • 25 PFLOPS peak, 13.55 PFLOPS Linpack
  • 8,208 Intel KNL nodes with Intel OPA
  • manufactured by Fujitsu with high-density water-cooled packaging
• File I/O
  • equipped with DDN Lustre as well as a world-largest-class IME
  • 500 GB/s for Lustre, 1.5 TB/s for IME (theoretical peak)
• Open for nationwide service (Apr. 2017)
  • major resource in the HPCI program
  • both universities operate their own programs for a wide area of computational science & engineering