Introduction to the Oakforest-PACS Supercomputer in Japan
Taisuke Boku, Center for Computational Sciences, University of Tsukuba / Joint Center for Advanced HPC (JCAHPC)

Introduction to Oakforest-PACS (OFP)

2016/11/14, DDN UG@SC16, Salt Lake City
Japan's New Fastest Supercomputer
Deployment plan of 9 supercomputing centers (Oct. 2015, latest)

[Roadmap chart spanning FY2014-2025 for the nine national supercomputing centers: Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu. Each row shows the center's current systems and planned upgrades with peak performance and power budget, e.g. TSUBAME 2.5 / 3.0 / 4.0 at Tokyo Tech., HA-PACS, COMA, and PACS-X at Tsukuba, and Reedbush plus the 25 PF Post Open Supercomputer at Tokyo. Power consumption indicates the maximum of the power supply, including the cooling facility.]
JCAHPC

• Joint Center for Advanced High Performance Computing (http://jcahpc.jp)
• Very tight collaboration for "post-T2K" between two universities, the Univ. of Tsukuba and the Univ. of Tokyo
• For the main supercomputer resources, a uniform specification for a single shared system
• Each university is financially responsible for introducing the machine and its operation -> unified procurement toward a single system with the largest scale in Japan
• To manage everything smoothly, a joint organization was established -> JCAHPC, by the Center for Computational Sciences, U. Tsukuba and the Information Technology Center, U. Tokyo
History of JCAHPC

• March 2013: U. Tsukuba and U. Tokyo exchanged an agreement on "JCAHPC establishment and operation"
  • between the Center for Computational Sciences, University of Tsukuba and the Information Technology Center, University of Tokyo
• April 2013: JCAHPC started
  • 1st period director: Mitsuhisa Sato (Tsukuba); vice director: Yutaka Ishikawa (Tokyo)
  • 2nd period (2016~) director: Hiroshi Nakamura (Tokyo); vice director: Masayuki Umemura (Tsukuba)
• July 2013: RFI for procurement
  • at this time the joint procurement style was not yet fixed -> a single-system procurement was then decided
  • to give enough time for very advanced processor, network, and memory technology, more than a year was taken to fix the specification
• It is the first time in Japan that multiple national universities have introduced a shared single supercomputer system!
Procurement Policies of the JCAHPC System

• based on the spirit of T2K, introducing open advanced technology
  • massively parallel PC cluster
  • advanced processor for HPC
  • easy-to-use and efficient interconnect
  • large-scale shared file system, flatly shared by all nodes
• joint procurement by the two universities
  • the largest class of budget among national universities' supercomputers in Japan
  • the largest system scale of any PC cluster in Japan
  • no accelerators, to support a wide variety of users and application fields
• benefits of a single system
  • scale merit by merging budgets -> largest in Japan
  • ultra-large-scale single-job execution on special occasions such as a "Gordon Bell Prize Challenge"
Oakforest-PACS (OFP)
Oakforest-PACS: Japan's fastest supercomputer today

Photo of computation node & chassis
Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps)

Chassis with 8 nodes, 2U size
Specification of Oakforest-PACS system
Total peak performance: 25 PFLOPS
Linpack performance: 13.55 PFLOPS (with 8,178 nodes, 556,104 cores)
Total number of compute nodes: 8,208

Compute node
  Product: Fujitsu PRIMERGY CX1640 M1
  Processor: Intel Xeon Phi 7250 (code name: Knights Landing), 68 cores, 3 TFLOPS
  Memory, high BW: 16 GB, >400 GB/sec (MCDRAM, effective rate)
  Memory, low BW: 96 GB, 115.2 GB/sec (DDR4-2400 x 6ch, peak rate)

Interconnect
  Product: Intel Omni-Path Architecture
  Link speed: 100 Gbps
  Topology: fat-tree with (completely) full-bisection bandwidth (102.6 TB/s)

Login node
  Product: Fujitsu PRIMERGY RX2530 M2 server
  # of servers: 20
  Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4ch x 2 sockets)
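The two-tier memory (16 GB of fast MCDRAM plus 96 GB of DDR4) is usually exploited on KNL by binding allocations to the MCDRAM NUMA node when the chip runs in flat mode. A minimal sketch with `numactl`, assuming flat mode and that MCDRAM appears as NUMA node 1 (the node number varies by configuration; `./my_app` is a placeholder binary):

```shell
# Run entirely out of MCDRAM (allocation fails if the 16 GB is exceeded):
numactl --membind=1 ./my_app

# Prefer MCDRAM, falling back to DDR4 once it fills up:
numactl --preferred=1 ./my_app
```

In cache mode the MCDRAM instead acts as a transparent last-level cache and no binding is needed.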
Specification of Oakforest-PACS system (I/O)
Parallel file system
  Type: Lustre file system
  Total capacity: 26.2 PB
  Metadata
    Product: DataDirect Networks MDS server + SFA7700X
    # of MDS: 4 servers x 3 sets
    MDT: 7.7 TB (SAS SSD) x 3 sets
  Object storage
    Product: DataDirect Networks SFA14KE
    # of OSS (nodes): 10 (20)
    Aggregate BW: 500 GB/sec

Fast file cache system
  Type: burst buffer, Infinite Memory Engine (by DDN)
  Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  Product: DataDirect Networks IME14K
  # of servers (nodes): 25 (50)
  Aggregate BW: 1,560 GB/sec
Full-bisection-bandwidth fat-tree by Intel Omni-Path Architecture
12 of 768-port Director Switches (source: Intel)
362 of 48-port Edge Switches (each with 24 uplinks and 24 downlinks)

[Topology diagram: two-level fat-tree connecting the edge switches to the director layer, with endpoints 1-24, 25-48, 49-72, ... grouped under successive edge switches.]

Endpoints:
  Compute nodes: 8,208
  Login nodes: 20
  Parallel FS: 64
  IME: 200
  Mgmt, etc.: 8
  Total: 8,500

Firstly, to reduce switches and cables, we considered grouping all nodes into subgroups connected internally with FBB fat-tree, with subgroups connected to each other at >20% of FBB. But the hardware quantity is not so different from a globally FBB network, and global FBB is preferred for flexible job management.
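The quoted 102.6 TB/s bisection bandwidth and the switch counts above are internally consistent, which a few lines of arithmetic confirm:

```python
# Sanity-check the full-bisection bandwidth and port counts of the
# Omni-Path fat-tree, using only the numbers from the slide above.

compute_nodes = 8208
link_gbps = 100  # Omni-Path link speed per node

# Full bisection: every compute node can drive its full link rate across
# the bisection. 100 Gbps = 12.5 GB/s per link.
bisection_tb_s = compute_nodes * link_gbps / 8 / 1000
print(bisection_tb_s)  # 102.6, matching the slide

# Edge layer: 362 switches with 24 downlinks / 24 uplinks each.
edge_downlinks = 362 * 24
endpoints = 8208 + 20 + 64 + 200 + 8   # compute, login, FS, IME, mgmt
assert edge_downlinks >= endpoints     # 8,688 ports cover 8,500 endpoints

# Core layer: 12 director switches with 768 ports each absorb all uplinks.
assert 12 * 768 >= edge_downlinks
```

Equal uplink and downlink counts per edge switch are exactly what makes the tree full-bisection rather than oversubscribed.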
Schematics of user's view
Parallel file system: 26.2 PB (Lustre, DDN SFA14KE x10), 500 GB/s
File cache system: 940 TB (DDN IME14K x25), 1,560 GB/s
Interconnect: Omni-Path Architecture (100 Gbps), full-bisection BW fat-tree

Compute nodes: 25 PFlops
  CPU: Intel Xeon Phi 7250 (KNL, 68 cores, 1.4 GHz)
  Mem: 16 GB (MCDRAM, 490 GB/sec effective) + 96 GB (DDR4-2400, 115.2 GB/sec)
  x 8,208: Fujitsu PRIMERGY CX1640 M1, 8 nodes inside a CX600 M1 chassis (2U)

Login nodes: x20, serving both U. Tsukuba users and U. Tokyo users
Facility of Oakforest-PACS system
Power consumption: 4.2 MW (including cooling)
# of racks: 102

Cooling system
  Compute nodes
    Type: warm-water cooling (direct cooling for the CPU, rear-door cooling except the CPU)
    Facility: cooling tower & chiller
  Others
    Type: air cooling
    Facility: PAC
Software of Oakforest-PACS system
OS: CentOS 7, McKernel (compute node); Red Hat Enterprise Linux 7 (login node)
Compiler: gcc, Intel compiler (C, C++, Fortran)
MPI: Intel MPI, MVAPICH2
Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads
Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby
Distributed FS (login node): Globus Toolkit, Gfarm
Job scheduler: Fujitsu Technical Computing Suite
Debugger: Allinea DDT
Profiler: Intel VTune Amplifier, Trace Analyzer & Collector
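With the listed Intel toolchain, a typical build line targeting the KNL nodes might look like the following sketch; `-xMIC-AVX512` is the standard Intel compiler option for Knights Landing's AVX-512 instruction set, and `app.c` is a placeholder source file (the exact flags used on Oakforest-PACS are not stated in the slides):

```shell
# Compile an MPI+OpenMP code for Knights Landing with Intel MPI / Intel C:
mpiicc -O3 -qopenmp -xMIC-AVX512 app.c -o app
```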
File I/O on Oakforest-PACS

• Basic shared file system via Lustre
  • 500 GB/s high-bandwidth shared file system, supported by the fat-tree network
  • freedom of data allocation and flexibility in node scheduling
• File staging on the file cache (IME)
  • 1.5 TB/s high-bandwidth burst buffer with NVMe to support highly I/O-intensive jobs
  • supporting astrophysics, climate, etc.
• Job scheduler integration
  • pre-staging is under the control of the job scheduler for smooth job launch, with background IME-Lustre data transfer
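The pre-staging idea above (moving input data from the shared Lustre file system into the fast IME tier before computation starts) can be sketched generically. The paths and the `stage_in` helper below are hypothetical illustrations, not the Oakforest-PACS mechanism: on the real system the scheduler drives the IME-Lustre transfer in the background.

```python
import shutil
from pathlib import Path

def stage_in(lustre_dir: Path, ime_dir: Path) -> list[Path]:
    """Copy job input files from the shared FS into the fast cache tier."""
    ime_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(lustre_dir.glob("*.dat")):
        dst = ime_dir / src.name
        shutil.copy2(src, dst)  # the job then reads at the fast tier's BW
        staged.append(dst)
    return staged

# Hypothetical mount points; on a real system these would be the Lustre
# scratch area and the burst-buffer namespace assigned to the job:
# staged = stage_in(Path("/lustre/job123/in"), Path("/ime/job123/in"))
```

The benefit is the one named on the slide: the bulk transfer overlaps with scheduling, so the job itself only ever touches the 1.5 TB/s tier.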
Ultra Compact File System
System          IO Nodes  Rack units  45U racks  Max Write (GB/s)  Max Read (GB/s)  IO Test
K Computer      2592      10368       ~231       965               1486             FPP
ORNL Spider 2   2288      2088        ~47        1100              1100             FPP
Blue Waters     432       1080        ~25        1296              1296             FPP
Oakforest-PACS  50        100         ~2.5       (peak 1560)       (peak 1560)      SSF & FPP

On Oakforest-PACS, single-shared-file access is quite fast as well as file-per-process.
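The "ultra compact" claim becomes concrete when the table is reduced to bandwidth per rack unit:

```python
# Peak file-system read bandwidth per rack unit (GB/s per U), computed
# directly from the comparison table above.

systems = {
    "K Computer":     (1486, 10368),  # (max read GB/s, rack units)
    "ORNL Spider 2":  (1100, 2088),
    "Blue Waters":    (1296, 1080),
    "Oakforest-PACS": (1560, 100),    # IME peak
}

for name, (read_gbs, units) in systems.items():
    print(f"{name:15s} {read_gbs / units:8.2f} GB/s per rack unit")
# Oakforest-PACS delivers ~15.6 GB/s per rack unit, roughly 13x Blue
# Waters' ~1.2 and about two orders of magnitude above the K Computer.
```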
Machine location: Kashiwa Campus of U. Tokyo
[Map showing the Hongo Campus of U. Tokyo, U. Tsukuba, and the Kashiwa Campus of U. Tokyo, where the machine is installed.]
System operation outline

• Regular operation
  • both universities share the CPU time based on their budget ratio
  • the system hardware is not split; instead the "CPU time" is split, for flexible operation (except several specially dedicated partitions)
  • single system entry for the HPCI program; each university's own programs also run under "CPU time" sharing
• Special operation
  • massively large-scale operation (limited period) -> effectively using the largest-class resource in Japan for special occasions (e.g. the Gordon Bell Challenge)
• Power-saving operation
  • a power-capping feature for energy saving; the scheduling feature reacts to power-saving requirements (e.g. summer time)
Summary

• Oakforest-PACS, Japan's new fastest supercomputer, has been launched under JCAHPC operation by U. Tsukuba and U. Tokyo
  • 25 PFLOPS peak, 13.55 PFLOPS Linpack
  • 8,208 Intel KNL nodes with Intel OPA
  • manufactured by Fujitsu with high-density water-cooled packaging
• File I/O
  • equipped with DDN Lustre as well as a world-largest-class IME
  • 500 GB/s for Lustre, 1.5 TB/s for IME (theoretical peak)
• Open for nationwide service (Apr. 2017)
  • major resource in the HPCI program
  • both universities operate their own programs for a wide area of computational science & engineering