Post on 03-Feb-2022
All Rights Reserved,Copyright© FUJITSU LIMITED 2009
SPARC64™ VIIIfx: Fujitsu’s New Generation Octo Core Processor
for PETA Scale computing
August 25, 2009
Takumi Maruyama
LSI Development DivisionNext Generation Technical Computing UnitFujitsu Limited
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20092
2004~2007
2000~2003
1998~1999
1996~1997
~1995
SPARC64II
SPARC64™Processor
SPARC64SPARC64™™ProcessorProcessor
Tr=190MCMOS Cu
130nm
Tr=30MCMOS Cu
180nm / 150nm
Tr=190MCMOS Cu
130nmSPARC64
V
SPARC64GP
Tr=30MCMOS Al
250nm / 220nm
Tr=46MCMOS Cu180nm
GS8900
MainframeMainframeMainframe
GS21600
GS8800B
Tr=400MCMOS Cu
90nm
Tr=540MCMOS Cu
90nm
SPARC64VI
GS8800
Tr=500MCMOS Cu
90nm
Tr=10MCMOS Al
350nm
Store AheadBranch HistoryPrefetch
Single-chip CPU
Non-Blocking $O-O-O ExecutionSuper-Scalar
L2$ on DieMulti-core / Multi-thread
Fujitsu Processor DevelopmentFujitsu Processor Development
High Perform
anceT
echnologyH
igh Reliability
Technology GS8600
GS21900
Tr=600MCMOS Cu
65nm
SPARC64V +
GS21
SPARC64
UNIXUNIXUNIX
HPC-ACE System On Chip
Hardware Barrier
SPARC64VIIIfx
SPARC64VII
Tr=760MCMOS Cu / 45nm
2008~
MainframeMainframeMainframe
HPCHPCHPCFeatures
$ ECCRegister/ALU ParityInstruction Retry$ Dynamic DegradationRC/RT/History
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20093
SPARC64™ VIIIfx Design TargetProcessor for Fujitsu’s supercomputer for the peta-scale
computing age, which realizes both high performance and low powerUnachievable goal with conventional processor design
• More GF (Giga Flops)• More Efficiency• No more high frequency
Large ISA (Instruction Set Architecture)extension is required
High Integration: SoC (System On Chip)High Reliability
Reuse SPARC64TM VII design when applicable
Focus of this presentation
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20094
HPC-ACE (High Performance Computing - Arithmetic Computational Extensions)
ISA (Instruction Set Architecture) of SPARC64VIIIfx is: Complies with
• The SPARC-V9 standard• JPS (Joint Programmer’s Specification): Extension to SPARC-V9
HPC-ACE: Fujitsu’s unique ISA extension for HPC• Large register sets• SIMD (single instruction multiple data) instructions• etc
You can download SPARC64TM VIIIfx ISA documents from http://jp.fujitsu.com/solutions/hpc/brochures/ The SPARC® Architecture Manual Version 9 SPARC® Joint Programming Specification (JPS1): Commonality SPARC64 VIIIfx Extensions
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20095
Large register sets 1/2Register enhancements from V9
160192 INT registers 32256 FP registers (DP)
• Lower 32 registers are the same with SPARC-V9’s.
• FP registers are “flat”. Extended FP registers can be accessed with non-SIMD instruction as well.
Why needed To extract more parallelism
currently limited by the number of Arch registers.
Reduce spill/fill overhead
Ext.
V9
Ext.
V9V9
Ext.
INT Reg FP RegRegisterWindow
32
224
32
32160
SIMD(basic)
SIMD(extended)
Loop0
Loop1
Loop0Loop1
LoopN
Limited parallelismw/ O-O-O
Large parallelismw/ loop unrolling
…
32
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20096
Large register sets 2/2 Instruction format for 256 FP registers
8 bit x 4 (3 read +1 write) register number fields are necessary for FMA (Floating-point Multiply and Add) instruction.
But SPARC-V9 instruction length is limited (32bits – fixed) Defined a new prefix instruction (SXAR) to specify upper-3bit of
register numbers of the following two instructions.
SXAR inst1 inst2
Lower-5bit x4Upper-3bit x4
SXAR (Set XAR) instruction XAR: Extended Arithmetic Register
• Set by the SXAR instruction• Valid bit is cleared once the corresponding subsequent instruction
gets executed.
fv furd furs1 furs2 furs3 sv surd surs1 surs2 surs331 0
fsimd ssimd1615
First Upper Register Source-1 bits
SXAR1: set XAR for subsequent one instruction. SXAR2: set XAR for subsequent two instructions.
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20097
SIMD SIMD FP operation
1 SIMD FP instruction executes two SP or DP floating-point operations
SIMD Load/Store 1 SIMD load/store instruction accesses two
contiguous SP or DP data in memory Data alignment requirement for SIMD-load
(DP) is 8 byte rather than usual 16byte. Combine two independent FP operation into
one with the Flat FP registers
SXAR instruction Specifies SIMD operation of the
subsequent two instructions. No new opcode is required for SIMD
operation
More SW optimization possibility with SIMD
16byte Load x2
basic extended
SIMD
SIMD
Conventional SIMD usageSIMD-Load (Src1)SIMD-Load (Src2)SIMD-FPSIMD-Store (Dst)
Combine two FP ops with SIMDLoad (Src1-A)Load (Src1-B)Load (Src2-A)Load (Src2-B)SIMD-FPStore (Dst-A)Store (Dst-B)
B-pipe D-pipe
A-pipe C-pipe
#128-255#32-127
#0-31
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20098
FP Trigonometric Functions
Taylor series approximation of sin (x)~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15
= x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1)
New instruction - ftrimaddd Function: rs1 × abs(rs2) + T[index] → rd Usage:
Instructions for fast Trigonometric Function calculation
# S = 0ftrimaddd S, X2, 7, S # S * X2 – 1/15! Sftrimaddd S, X2, 6, S # S * X2 + 1/13! Sftrimaddd S, X2, 5, S # S * X2 – 1/11! Sftrimaddd S, X2 , 4, S # S * X2 + 1/9! Sftrimaddd S, X2 , 3, S # S * X2 – 1/7! Sftrimaddd S, X2 , 2, S # S * X2 + 1/5! Sftrimaddd S, X2 , 1, S # S * X2 – 1/3! Sftrimaddd S, X2 , 0, S # S * X2 + 1/1! SFmuld S, X, S # S * X S
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20099
1
Software controlled Cache
HW Implementation Keep sector info of each cache line. Select the way to be replaced to meet the
sector ratio on cache misses
SW specifies a sector depending on temporal and spatial locality of data
L2 Cache structure
111111000
Way0 Way9Ex) Sector0:Sector1=30%:70%
1011110110index
Give SW ability to control cache to optimize performance while keeping cache coherency
Divide Cache into 2 groups (sectors) SXAR instruction specifies the sector
• sector 0: Instruction fetch / normal operand access (default)• sector 1: operand access explicitly specified by SXAR
Sector cache configuration registerSpecifies the ratio of sector 0 and 1 at the same index
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200910
Other HPC-ACE featuresConditional operation
For efficient execution of Loop with ‘if’ Conditional branch can be removed with the
following conditional instructions. FP Conditional Compare to Register Move Selected FP-register on FP-register’s condition Store FP-register on FP-register’s condition
Compiler can optimize the loop with SW pipeline
FP Reciprocal Approximation of Divide/Square-root Calculate reciprocal approximation with
rounding error < 1/256 To achieve higher divide/square-root
performance with pipelined operation
FP Minimum and Maximum
SIMD
i=0 i=1i=2 i=3i=4 i=5
time
for (i=0; i<00; i++) {if (A(i)>0)
FPop X(i)else
FPop Y(i)}
basic extended
Usage:# %f2 / %f10 %f0frcpad %f10, %f6fmuld %f2, %f6, %f2fnmsubd %f6, %f10, 1.0, %f6fmuld %f6, %f6, %f0fmaddd %f6, %f6, %f6, %f4fmaddd %f0, %f0, %f6, %f0fmaddd %f4, %f2, %f2, %f4fmaddd %f0, %f4, %f2, %f0
i=6 i=7
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200911
DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)
ENDEND
DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)
ENDEND
DO J=1,NDO I=1,M
A(J)=A(J)+A(I,J+1)*B(I,J)END
END
Integrated Multicore Parallel Architecture
Parallel
Time
PPP
Vector
Time
VVV
P
P
Vector Conventional Scalar
SPARC64TM VIIIfx
CORE0 CORE2CORE1 CORE3
I= CORE0 CORE1 CORE2
SIMD basic extendedbasic extended
SPARC64TM VIIIfx HW HPC-ACE Shared L2 cache to avoid false sharing Hardware barrier for fast inter-core
synchronization Fujitsu’s compiler technology
Automatic parallelization
More possibility of optimization:Parallelize the innermost loop
J 1234・・・
J 1234・・・
I 1234・・・
J = 1 2 3 4 5I= 1 2 3 4 5
Time
Parallel
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200912
SPARC64™ VIIIfx Chip Overview• Architecture Features
• 8 cores• Shared 5 MB L2$• Embedded Memory Controller• 2 GHz
• Fujitsu 45nm CMOS• 22.7mm x 22.6mm• 760M transistors• 1271 signal pins
• Performance (peak)• 128GFlops• 64GB/s memory throughput
• Power• 58W (TYP, 30℃)• Water Cooling – Low leakage
power and High reliability
Core5
Core4
Core1
Core0
Core7
Core6
Core3
Core2
DD
R3
inte
rfac
e
DD
R3
inte
rfac
e
L2$ Data
L2$ Data
HSIO
L2$ ControlMACMAC
MACMAC
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200913
<floating-point><integer>Execution units
256 (FPR)48 x2 (FUB)
188 (GPR)32 (GUB)
#Registers
FMA x4 (2SIMD)COMPARE x2
DIVIDE x2VIS x1
ALU x2SHIFT x2MULT x1DIVIDE x1AGEN x2
8 (= 4 Floating Multiply and Add)#FP-operations per clock
L1I$ 32KB/2wayL1D$ 32KB/2way
L1$
SPARC-V9/JPS + HPC-ACE
Instruction Set Architecture
Execution UnitL1D$
L1I$
Instruction controlL1$
control
SPARC64™ VIIIfx Core spec
FPR + FUB
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200914
L1 I$64KB2Way
BranchTarget
Address8Kentry4Way
Decode& Issue
RSE8x2Entry
RSA10Entry
RSF8x2Entry
RSBR10Entry
GUB48Registers
GPR156Registers
x2
EXA
EXB
EAGA
EAGB
FPR64Registers
x2
FUB48Registers
FLA
FLB
FetchPort
16Entry
Store Port
8x2Entry
Write Buffer
5x2Entry
L1 D$64KB2Way
System BusInterface
Fetch(4stages)
Issue(2stages)
Dispatch Reg.-Read(4stages)
Execute Memory(L1$: 3(int)/4(fp)stages)
CSE64Entry
Commit(2stages)
PCx2
ControlRegisters
x2
SPARC64TM VII Pipeline
L2$6MB
12Way
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200915
L1 I$32KB2Way
BranchTarget
Address1Kentry2Way
Decode& Issue
RSE10Entry
RSA10Entry
RSF8x2Entry
RSBR6Entry
GUB32Registers
GPR188Registers
EXA
EXB
EAGA
EAGB
FPR256Registers
FUB48x2Registers
FLA
FLB
FetchPort
20Entry
Store Port
8Entry
L1 D$32KB2Way
MemoryController
Fetch(4stages)
Issue(3stages)
Dispatch Reg.-Read(4(int)/5(fp) stages)
Execute Memory(L1$: 3(int)/4(fp)stages)
CSE48Entry
Commit(2stages)
PC
ControlRegisters
SPARC64TM VIIIfx Pipeline
L2$5MB
10WayFLC
FLD
DIMM
Write Buffer5Entry
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200916
Changes in the pipeline
Issue stage Added “PD” stage to handle SXAR 6-instructions packed into 4 at “PD” An CSE entry is assigned at “D” The packed instructions have
Extended register fields SIMD operation attributes
4 packed (=6 unpacked) instructions can be committed at the same time.
Execution Stage “B” stage has been split into two for flat
Floating-point Register access
SXAR
E
PD
D
A B SXAR C D
A’ B’ C’ D’
Decode
Read IB
P
X
B FPR FUBFPR FUB
B2 Bypass
basic
FLA/B FLC/D
extended
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200917
0
0.2
0.4
0.6
0.8
1
1.2
1.4
VII VIIIfx(noSIMD)
VIIIfx VII VIIIfx(noSIMD)
VIIIfx VII VIIIfx(noSIMD)
VIIIfx
Loop6: y(i) = x1(i) / x2(i) y(i) = sin(x1(i)) Loop14: 9th Degr Polyn
Performance= x 1.4
Performance Results – Core
Benchmark:Euroben + in-house
SPARC64TM VIIIfx@2.0GHzEx
ecut
ion
Tim
eR
elat
ive
to S
PAR
C64
TMVI
I@2.
5GH
z
Performance= x 4.3
Performance= x 6.8
SPARC64TM VIIIfx realizes much higher core performance than SPARC64TM VII thanks to HPC-ACE despite its lower frequency
Hardware measured results
0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit
0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200918
Performance Results – ChipSimulated results
SPARC64TM VIIIfx@2.0GHz
Performance= x 2.4
Performance= x 2.6
SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII Stall time due to the execution unit busy has been reduced dramatically.
Expect x 3 performance with further compiler optimization
Exec
utio
n Ti
me
Rel
ativ
e to
SPA
RC
64TM
VII@
2.5G
Hz Benchmark:
Scientific Application- Molecular Dynamics- Fluid Dynamics
0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit
0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit
0
0.2
0.4
0.6
0.8
1
1.2
1.4
SPARC64VII SPARC64VIIIfx SPARC64VII SPARC64VIIIfx
Molecular Dynamics Fluid Dynamics
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200919
SPARC64™ History and FutureEvolution rather than revolution
SPARC64TM V (1 core)• RAS• Single thread performance
SPARC64TM VI (2 core x 2 VMT)• Throughput
SPARC64TM VII (4 core x 2 SMT)• More Throughput• High Performance Computing
SPARC64TM VIIIfx (8 core)• High Performance Computing• Low Power• SoC
Peak Performance & Power
0
5
10
15
20
25
V V + VI VII VIIIfx
SPARC64
SPA
RC
64 V
= 1
.0
Performance (GF) = x 3
Power (W) = x 0.5
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200920
SPARC64TM VIIIfx SummarySPARC64TM VIIIfx has been designed to be used for
Fujitsu’s supercomputer for the PETA-scale computing age.
HPC-ACE instruction sets overcomes the limitation of conventional SPARC-V9 architecture.
SPARC64TM VIIIfx has combined high performance and low power.
SPARC64TM VIIIfx chip is up and running in the lab.
Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era.
SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200921
Abbreviations• SPARC64TM VIIIfx
– IB: Instruction Buffer – RSA: Reservation Station for Address generation– RSE: Reservation Station for Execution– RSF: Reservation Station for Floating-point– RSBR: Reservation Station for Branch– GUB: General Update Buffer– FUB: Floating point Update Buffer– GPR: General Purpose Register– FPR: Floating Point Register– CSE: Commit Stack Entry