Post-K Computer Development, Updates for SC 18 · Title: Post-K Computer Development, Updates for...
Transcript of Post-K Computer Development, Updates for SC 18 · Title: Post-K Computer Development, Updates for...
Post-K Computer Development, Updates for SC’18
Copyright 2018 FUJITSU LIMITED0
Fujitsu’s High-end HPC Development
Fujitsu has been leading HPC technologies for over 40 years
Copyright 2018 FUJITSU LIMITED
© RIKEN
Ranked Top500 No.1 in 2011Competitive in various fields
K computer
HPCGNo.3(2018)
Graph500No.1(2018)
Gordon Bell Prize Finalist
(2016)
PRIMEHPC FX10
PRIMEHPCFX100
Next Japanese flagship supercomputerunder development
Post-K computer
1
RIKEN and Fujitsu are currently developing Post-K computer, the most advanced general-purpose supercomputer, in the world
Copyright 2018 FUJITSU LIMITED
Japanese National Project
2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
K ComputerDevelopment
OperationInstall /Tuning
MassProduct
Prototype・Detailed DesignBasic
DesignPost-K
K ComputerOperation
Calendar Year
Post-K computer is optimized to achieve superior performance in real applications as next Japanese flagship system
2
Copyright 2018 FUJITSU LIMITED
Post-K Computer Goals and Approaches
Low power consumption
User convenience
Ability to produce ground-breaking
results
Application performanceApplication
performanceApplication
performance
Approach
1. A64FX CPU
2. Compiler and Runtime
3. LLIO (Lightweight Layered IO-Accelerator)
3
A64FX CPU
Achieves high performance in HPC and AI applications
Arm Scalable vector extension (SVE), high-bandwidth caches and memory
(D|S|H)GEMM and INT (16b/8b) GEMM > 90%, STREAM Triad > 80%
Copyright 2018 FUJITSU LIMITED
A64FX(Post-K computer)
SPARC64 VIIIfx(K computer)
ISA (Base + Extension) Armv8.2-A + SVE SPARC-V9 + HPC-ACE
Process Technology 7 nm 45 nm
Peak Performance > 2.7 TFLOPS 128 GFLOPS
SIMD 512-bit 128-bit
# of Cores 48+4 8
Memory Peak B/W 1024 GB/s 64 GB/s
CMG: Core Memory GroupHBM2 8GiBHBM2 8GiBHBM2 8GiBHBM2 8GiB
Performance> 2.7 TFLOPS
CMG
L1 Cache > 11.0 TB/s (BF ratio = 4)
L2 Cache > 3.6 TB/s (BF ratio = 1.3)
512-bit wide SIMD 2x FMAs
CoreCore
12x Computing Cores + 1x Assistant Core
Memory 1024 GB/s (BF ratio =~0.37)
L2 Cache 8MiB, 16way
256GB/s
>115GB/s
>57GB/s
Core Core
L1D 64KiB, 4way
>230GB/s
>115GB/s
4
Compiler and Runtime
Fujitsu’s compiler and runtime libraries exploit the hardware capabilities along three dimensions
Support Fortran, C/C++, Python software development environment
Copyright 2018 FUJITSU LIMITED
Memory accessperformance
Computational performance
• Software prefetch• Loop-blocking
• Out-of-order• 512-bit SVE
• Software pipelining with Loop fission
• Auto-vectorization with SVE
Thread-parallel performance
• CMG & SVE optimized math library
• OpenMP 5.0 API & fast barrierCompiler & Runtime
Hardware capabilities• 48 cores in 4 CMG• Inter-core barrier
• Hardware prefetch• Stacked memory; HBM2
5
Preliminary Performance Evaluation Results
A64FX’s instruction set and compiler achieve high performance on loop of math function
Step 1. Armv8 coding
Step 2. + SVE + accel.Instruction coding
Step 3. + Inlined by compiler
Step 4. + Applied software pipelining by compiler
Copyright 2018 FUJITSU LIMITED
7.6 5.4 7.6 3.624.0
14.6
34.1
14.7
38.3
22.8
66.8
28.9
FLOAT EXP DOUBLE EXP FLOAT SIN DOUBLE SIN
Rel
ativ
e to
Ste
p.1
Step.2 Step.3 Step.4
6
Copyright 2018 FUJITSU LIMITED
LLIO (Lightweight Layered IO-Accelerator)
Boosts I/O performancew/o modifying Apps
Exploits SSD as a shared cache of persistent filesystem (PFS)
Provides two kinds of temporary filesystems for I/O optimization
Shared/Local temporary filesystem
Persistent filesystem (FEFS)
IO Network
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
SSD
CN
CN
CN
CN
SSD
CN
CN
CN
CN
SSD
CN
CN
CN
CN
SSD
CN
CN
CN
CN
Compute node (CN) cluster
LLIO
data
POSI
X A
PI /
MPI
-IO
Application
Au
tom
ated
cac
hin
g
data
acce
ss t
o P
FS
Tran
spar
ent
7
Copyright 2018 FUJITSU LIMITED
Post-K Computer Current Status
CPU powered-on, OS running
System design verification and testing are underway
Preliminary performance evaluation started
Development Proceeding on Schedule8
Post-K computer’s high-density mounting achieves over 1 PFlops per rack
Post-K Computer Hardware Features
Copyright 2018 FUJITSU LIMITED
Shelf24 CMUs/Shelf
(48 Nodes)
Rack8 Shelves/Rack
(384 Nodes)
1CPU/Node CMU (CPU Memory Unit)2 Nodes/CMU
9
(Cont.) Post-K Computer Hardware Features
Copyright 2018 FUJITSU LIMITED
• High-efficiency water cooling uniton the CPU memory unit (CMU)provides 100% water-cooling
Conventional connection
• Single action connection of electric connectors and water couplers achieve compact CMU
CMU
High-density mounting, shortened transmission distance between CPUs
High-efficiency water cooling unit
Electric connectors and water couplers
• The back-to-back layout and cable box shorten the cables length
Post-K computerconnection
Cable box
10
11