SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ...

SPARC64™ VIIIfx: Fujitsu’s New Generation Octo Core Processor

for PETA Scale computing

August 25, 2009

Takumi Maruyama

LSI Development DivisionNext Generation Technical Computing UnitFujitsu Limited

2004～2007

2000～2003

1998～1999

1996～1997

～1995

SPARC64II

SPARC64™Processor

SPARC64SPARC64™™ProcessorProcessor

Tr=190MCMOS Cu

Tr=30MCMOS Cu

180nm / 150nm

Tr=190MCMOS Cu

130nmSPARC64

SPARC64GP

Tr=30MCMOS Al

250nm / 220nm

Tr=46MCMOS Cu180nm

GS8900

MainframeMainframeMainframe

GS21600

GS8800B

Tr=400MCMOS Cu

Tr=540MCMOS Cu

SPARC64VI

GS8800

Tr=500MCMOS Cu

Tr=10MCMOS Al

Store AheadBranch HistoryPrefetch

Single-chip CPU

Non-Blocking $O-O-O ExecutionSuper-Scalar

L2$ on DieMulti-core / Multi-thread

Fujitsu Processor DevelopmentFujitsu Processor Development

High Perform

echnologyH

igh Reliability

Technology GS8600

GS21900

Tr=600MCMOS Cu

SPARC64V +

SPARC64

UNIXUNIXUNIX

HPC-ACE System On Chip

Hardware Barrier

SPARC64VIIIfx

SPARC64VII

Tr=760MCMOS Cu / 45nm

2008～

MainframeMainframeMainframe

HPCHPCHPCFeatures

$ ECCRegister/ALU ParityInstruction Retry$ Dynamic DegradationRC/RT/History

SPARC64™ VIIIfx Design TargetProcessor for Fujitsu’s supercomputer for the peta-scale

computing age, which realizes both high performance and low powerUnachievable goal with conventional processor design

• More GF (Giga Flops)• More Efficiency• No more high frequency

Large ISA (Instruction Set Architecture)extension is required

High Integration: SoC (System On Chip)High Reliability

Reuse SPARC64TM VII design when applicable

Focus of this presentation

HPC-ACE (High Performance Computing - Arithmetic Computational Extensions)

ISA (Instruction Set Architecture) of SPARC64VIIIfx is: Complies with

• The SPARC-V9 standard• JPS (Joint Programmer’s Specification): Extension to SPARC-V9

HPC-ACE: Fujitsu’s unique ISA extension for HPC• Large register sets• SIMD (single instruction multiple data) instructions• etc

You can download SPARC64TM VIIIfx ISA documents from http://jp.fujitsu.com/solutions/hpc/brochures/ The SPARC® Architecture Manual Version 9 SPARC® Joint Programming Specification (JPS1): Commonality SPARC64 VIIIfx Extensions

Large register sets 1/2Register enhancements from V9

160192 INT registers 32256 FP registers (DP)

• Lower 32 registers are the same with SPARC-V9’s.

• FP registers are “flat”. Extended FP registers can be accessed with non-SIMD instruction as well.

Why needed To extract more parallelism

currently limited by the number of Arch registers.

Reduce spill/fill overhead

INT Reg FP RegRegisterWindow

SIMD(basic)

SIMD(extended)

Loop0Loop1

Limited parallelismw/ O-O-O

Large parallelismw/ loop unrolling

Large register sets 2/2 Instruction format for 256 FP registers

8 bit x 4 (3 read +1 write) register number fields are necessary for FMA (Floating-point Multiply and Add) instruction.

But SPARC-V9 instruction length is limited (32bits – fixed) Defined a new prefix instruction (SXAR) to specify upper-3bit of

register numbers of the following two instructions.

SXAR inst1 inst2

Lower-5bit x4Upper-3bit x4

SXAR (Set XAR) instruction XAR: Extended Arithmetic Register

• Set by the SXAR instruction• Valid bit is cleared once the corresponding subsequent instruction

gets executed.

fv furd furs1 furs2 furs3 sv surd surs1 surs2 surs331 0

fsimd ssimd1615

First Upper Register Source-1 bits

SXAR1: set XAR for subsequent one instruction. SXAR2: set XAR for subsequent two instructions.

SIMD SIMD FP operation

1 SIMD FP instruction executes two SP or DP floating-point operations

SIMD Load/Store 1 SIMD load/store instruction accesses two

contiguous SP or DP data in memory Data alignment requirement for SIMD-load

(DP) is 8 byte rather than usual 16byte. Combine two independent FP operation into

one with the Flat FP registers

SXAR instruction Specifies SIMD operation of the

subsequent two instructions. No new opcode is required for SIMD

operation

More SW optimization possibility with SIMD

16byte Load x2

basic extended

Conventional SIMD usageSIMD-Load (Src1)SIMD-Load (Src2)SIMD-FPSIMD-Store (Dst)

Combine two FP ops with SIMDLoad (Src1-A)Load (Src1-B)Load (Src2-A)Load (Src2-B)SIMD-FPStore (Dst-A)Store (Dst-B)

B-pipe D-pipe

A-pipe C-pipe

#128-255#32-127

FP Trigonometric Functions

Taylor series approximation of sin (x)~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15

= x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1)

New instruction - ftrimaddd Function: rs1 × abs(rs2) + T[index] → rd Usage:

Instructions for fast Trigonometric Function calculation

# S = 0ftrimaddd S, X2, 7, S # S * X2 – 1/15! Sftrimaddd S, X2, 6, S # S * X2 + 1/13! Sftrimaddd S, X2, 5, S # S * X2 – 1/11! Sftrimaddd S, X2 , 4, S # S * X2 + 1/9! Sftrimaddd S, X2 , 3, S # S * X2 – 1/7! Sftrimaddd S, X2 , 2, S # S * X2 + 1/5! Sftrimaddd S, X2 , 1, S # S * X2 – 1/3! Sftrimaddd S, X2 , 0, S # S * X2 + 1/1! SFmuld S, X, S # S * X S

Software controlled Cache

HW Implementation Keep sector info of each cache line. Select the way to be replaced to meet the

sector ratio on cache misses

SW specifies a sector depending on temporal and spatial locality of data

L2 Cache structure

111111000

Way0 Way9Ex) Sector0:Sector1=30%:70%

1011110110index

Give SW ability to control cache to optimize performance while keeping cache coherency

Divide Cache into 2 groups (sectors) SXAR instruction specifies the sector

• sector 0: Instruction fetch / normal operand access (default)• sector 1: operand access explicitly specified by SXAR

Sector cache configuration registerSpecifies the ratio of sector 0 and 1 at the same index

Other HPC-ACE featuresConditional operation

For efficient execution of Loop with ‘if’ Conditional branch can be removed with the

following conditional instructions. FP Conditional Compare to Register Move Selected FP-register on FP-register’s condition Store FP-register on FP-register’s condition

Compiler can optimize the loop with SW pipeline

FP Reciprocal Approximation of Divide/Square-root Calculate reciprocal approximation with

rounding error < 1/256 To achieve higher divide/square-root

performance with pipelined operation

FP Minimum and Maximum

i=0 i=1i=2 i=3i=4 i=5

for (i=0; i<00; i++) {if (A(i)>0)

FPop X(i)else

FPop Y(i)}

basic extended

Usage:# %f2 / %f10 %f0frcpad %f10, %f6fmuld %f2, %f6, %f2fnmsubd %f6, %f10, 1.0, %f6fmuld %f6, %f6, %f0fmaddd %f6, %f6, %f6, %f4fmaddd %f0, %f0, %f6, %f0fmaddd %f4, %f2, %f2, %f4fmaddd %f0, %f4, %f2, %f0

i=6 i=7

DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)

ENDEND

DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)

ENDEND

DO J=1,NDO I=1,M

A(J)=A(J)+A(I,J+1)*B(I,J)END

Integrated Multicore Parallel Architecture

Parallel

Vector

Vector Conventional Scalar

SPARC64TM VIIIfx

CORE0 CORE2CORE1 CORE3

I= CORE0 CORE1 CORE2

SIMD basic extendedbasic extended

SPARC64TM VIIIfx HW HPC-ACE Shared L2 cache to avoid false sharing Hardware barrier for fast inter-core

synchronization Fujitsu’s compiler technology

Automatic parallelization

More possibility of optimization:Parallelize the innermost loop

J 1234・・・

I 1234・・・

J = 1 2 3 4 5I= 1 2 3 4 5

Parallel

SPARC64™ VIIIfx Chip Overview• Architecture Features

• 8 cores• Shared 5 MB L2$• Embedded Memory Controller• 2 GHz

• Fujitsu 45nm CMOS• 22.7mm x 22.6mm• 760M transistors• 1271 signal pins

• Performance (peak)• 128GFlops• 64GB/s memory throughput

• Power• 58W (TYP, 30℃)• Water Cooling – Low leakage

power and High reliability

L2$ Data

L2$ ControlMACMAC

MACMAC

<floating-point><integer>Execution units

256 (FPR)48 x2 (FUB)

188 (GPR)32 (GUB)

#Registers

FMA x4 (2SIMD)COMPARE x2

DIVIDE x2VIS x1

ALU x2SHIFT x2MULT x1DIVIDE x1AGEN x2

8 (= 4 Floating Multiply and Add)#FP-operations per clock

L1I$ 32KB/2wayL1D$ 32KB/2way

SPARC-V9/JPS + HPC-ACE

Instruction Set Architecture

Execution UnitL1D$

Instruction controlL1$

control

SPARC64™ VIIIfx Core spec

FPR + FUB

L1 I$64KB2Way

BranchTarget

Address8Kentry4Way

Decode& Issue

RSE8x2Entry

RSA10Entry

RSF8x2Entry

RSBR10Entry

GUB48Registers

GPR156Registers

FPR64Registers

FUB48Registers

FetchPort

16Entry

Store Port

8x2Entry

Write Buffer

5x2Entry

L1 D$64KB2Way

System BusInterface

Fetch(4stages)

Issue(2stages)

Dispatch Reg.-Read(4stages)

Execute Memory(L1$: 3(int)/4(fp)stages)

CSE64Entry

Commit(2stages)

ControlRegisters

SPARC64TM VII Pipeline

L2$6MB

L1 I$32KB2Way

BranchTarget

Address1Kentry2Way

Decode& Issue

RSE10Entry

RSA10Entry

RSF8x2Entry

RSBR6Entry

GUB32Registers

GPR188Registers

FPR256Registers

FUB48x2Registers

FetchPort

20Entry

Store Port

8Entry

L1 D$32KB2Way

MemoryController

Fetch(4stages)

Issue(3stages)

Dispatch Reg.-Read(4(int)/5(fp) stages)

Execute Memory(L1$: 3(int)/4(fp)stages)

CSE48Entry

Commit(2stages)

ControlRegisters

SPARC64TM VIIIfx Pipeline

L2$5MB

10WayFLC

Write Buffer5Entry

Changes in the pipeline

Issue stage Added “PD” stage to handle SXAR 6-instructions packed into 4 at “PD” An CSE entry is assigned at “D” The packed instructions have

Extended register fields SIMD operation attributes

4 packed (=6 unpacked) instructions can be committed at the same time.

Execution Stage “B” stage has been split into two for flat

Floating-point Register access

A B SXAR C D

A’ B’ C’ D’

Decode

Read IB

B FPR FUBFPR FUB

B2 Bypass

FLA/B FLC/D

extended

VII VIIIfx(noSIMD)

VIIIfx VII VIIIfx(noSIMD)

VIIIfx

Loop6: y(i) = x1(i) / x2(i) y(i) = sin(x1(i)) Loop14: 9th Degr Polyn

Performance= x 1.4

Performance Results – Core

Benchmark:Euroben + in-house

SPARC64TM VIIIfx@2.0GHzEx

Performance= x 4.3

Performance= x 6.8

SPARC64TM VIIIfx realizes much higher core performance than SPARC64TM VII thanks to HPC-ACE despite its lower frequency

Hardware measured results

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

Performance Results – ChipSimulated results

SPARC64TM VIIIfx@2.0GHz

Performance= x 2.4

Performance= x 2.6

SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII Stall time due to the execution unit busy has been reduced dramatically.

Expect x 3 performance with further compiler optimization

Hz Benchmark:

Scientific Application- Molecular Dynamics- Fluid Dynamics

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

SPARC64VII SPARC64VIIIfx SPARC64VII SPARC64VIIIfx

Molecular Dynamics Fluid Dynamics

SPARC64™ History and FutureEvolution rather than revolution

SPARC64TM V (1 core)• RAS• Single thread performance

SPARC64TM VI (2 core x 2 VMT)• Throughput

SPARC64TM VII (4 core x 2 SMT)• More Throughput• High Performance Computing

SPARC64TM VIIIfx (8 core)• High Performance Computing• Low Power• SoC

Peak Performance & Power

V V + VI VII VIIIfx

SPARC64

Performance (GF) = x 3

Power (W) = x 0.5

SPARC64TM VIIIfx SummarySPARC64TM VIIIfx has been designed to be used for

Fujitsu’s supercomputer for the PETA-scale computing age.

HPC-ACE instruction sets overcomes the limitation of conventional SPARC-V9 architecture.

SPARC64TM VIIIfx has combined high performance and low power.

SPARC64TM VIIIfx chip is up and running in the lab.

Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era.

Abbreviations• SPARC64TM VIIIfx

– IB: Instruction Buffer – RSA: Reservation Station for Address generation– RSE: Reservation Station for Execution– RSF: Reservation Station for Floating-point– RSBR: Reservation Station for Branch– GUB: General Update Buffer– FUB: Floating point Update Buffer– GPR: General Purpose Register– FPR: Floating Point Register– CSE: Commit Stack Entry

SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ...

Documents

Transcript of SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ...

Fujitsu Limited Ver 15, 26 Apr. 2010...SPARC64 VIIIfx Extensions Ver 15, 26 Apr. 2010 7.6.1 Traps Defined by SPARC V9 As Mandatory 50 7.6.2 SPARC V9 Optional Traps That Are Mandatory

SPARC64™ X+: Fujitsu’s Next Generation … SPARC64 X+: Fujitsu’s Next Generation Processor for UNIX servers Author 富士通株式会社 Created Date 8/15/2013 3:38:33 PM

スーパーコンピュータ「京」のCPU SPARC64 VIIIfx...SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd.ʼs 45-nm CMOS

Post-K Computer with Fujitsu's Original CPU, A64FX Powered ... · Title: Post-K Computer with Fujitsu's Original CPU, A64FX Powered by Arm ISA Author: Fujitsu Limited Created Date:

Insomni'hack2015 CTF writeups...Exploit|JurassicSparc 0solve • Sparc64 service, custom protocol • Provided python client • Credentials + magic word= flag • 8 byte stack buffer

SPARC64 VIIIfx: CPU for the K computer - FujitsuCore 5 Core 4 Core 0 MAC Core 7 Core 6 Core 3 Core 2 MAC MAC MAC FUITSU Sci. Tech. ., Vol. 48, No. 3 (uly 2012) 275 T. Yoshida et al.:

Fujitsu's Managed Mobile for Android Smartphones and Tablets

Preface - Fujitsu you for your continuing loyalty to Fujitsu's semiconductor products. Electronic equipment is continually becoming smaller, lighter, and less

PRIMEQUEST Virtual Machine Function · function with Fujitsu's Systemwalker integrated operation management software product set. Automatic operation • By combining the virtual

SPARC64(TM) VII Fujitsu's Next Generation Quad-Core Processor · 2014. 5. 2. · Tr=190M CMOS Cu 130nm Tr=30M CMOS Cu 180nm / 150nm Tr=190M CMOS Cu 130nm SPARC64 V SPARC64 GP Tr=30M

SPARC64™ XIfx Extensions - Fujitsu...7.128. Partitioned Unsigned Compare ..... 301 7.129. Floating-Point Lexicographic

The Future of SPARC64 - Fujitsu Japanjp.fujitsu.com/.../c-session/pdf/TheFutureofSPARC64TM_V14_fix.pdf · The Future of SPARC64TM Kevin Oltendorf ... mandatory for current & future

リアルタイム処理を実現するSPARC64 X SPARC64 Xが実現する … · スポーツスケジューリング問題をベースとしました。スポーツスケジューリング問題とは、リーグ戦の対戦スケジュールを作成する問題です。

SPARC64 XIfx: Fujitsu's Next Generation Processor for HPC · Author: FUJITSU LIMITEDတတတတတတတတ Created Date: 8/13/2014 12:12:40 AM

Advanced module6 [Read-Only] · Computational Science K computer, SPARC64 VIIIfx 2.0GHz,/ 2011 Fujitsu Japan 548,352 8.162 8.774 9.899 2 National Supercomputing Center in Tianjin

Fujitsu Limited Ver 15, 26 Apr. 2010 · SPARC64™ VIIIfx Extensions. Fujitsu Limited Ver 15, 26 Apr. 2010. Fujitsu Limited 4-1-1 Kamikodanaka Nakahara-ku, Kawasaki, 211-8588 Japan

How XBRL Transformed Fujitsu's IT Platformarchive.xbrl.org/18th/sites/18thconference.xbrl.org/files/hanaoka.pdf · How XBRL Transformed Fujitsu's IT Platform Oct. 15th 2008 HANAOKA

Interconnect link bandwidth Supercomputer · Supercomputer PRIMEHPC FX10 Specifications Theoretical peak performance 236.5 Gigaflops Processor SPARC64 ⅠXfx (1.848 GHz/16-core）

Shaping the Future together - Fujitsu's Channel Organization and German SMEs

SPARC64 V Processor Whitepaper For UNIX Server