Micro-architecture of Godson-3 Multi-Core Processor · 2013. 7. 28. · References The architecture...
Transcript of Micro-architecture of Godson-3 Multi-Core Processor · 2013. 7. 28. · References The architecture...
1
Micro-architecture of Godson-3 Multi-Core Processor
Weiwu Hu, Jian Wang, Xiang Gao, and Yunji Chen
Institute of Computing Technology (ICT)
Chinese Academy of Sciences
Hotchips 2008, San Francisco, August 26
2
Contents
A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS
Godson is the academic name of LoongsonTM
3
National Project High performance CPUs are of national strategic importance
Chinese ICT industry is growing to a significant scale2007 Demand: ICT market = 5.6 trillion RMB 2007 Supply: only 22% by domestic companies, 3.75% profits
Godson CPU is supported by National 863 projectNational 973 projectNational Science Foundation of ChinaNational key projectKey project of Chinese Academy of Sciences
4
Godson CPU Briefs
ICT started Godson CPU design in 2001 The 32-bit Godson-1 CPU in 2002 is the
first general purpose CPU in China The 64-bit Godson-2B in 2003.10 The 64-bit Godson-2C in 2004.12 The 64-bit Godson-2E in 2006.03 Each tripled the performance of its
previous one
5
Godson Development
10
100
1000
10000
1999 2000 2001 2002 2003 2004 2005 2006
Intel/AMD/HP/IBM/SGI/Sparc SPEC cpu2000 rate
Godson rate
6
Programs Reftime Run time Ratio
164.gzip 1400 403 347
175.vpr 1400 273 512
176.gcc 1100 221 497
181.mcf 1800 307 586
186.crafty 1000 167 598
197.parser 1800 472 382
252.eon 1300 188 690
253.perlbmk 1800 354 508
254.gap 1100 240 458
255.vortex 1900 263 722
256.bzip2 1500 365 411
300.twolf 3000 645 465
SPEC int2000 <503>
Godson-2E SPEC CPU2000 RatePrograms Ref time Run time Ratio
168.wupwise 1600 238 672
171.swim 3100 660 469
172.mgrid 1800 579 311
173.applu 2100 549 382
177.mesa 1400 221 634
178.galgel 2900 412 704
179.art 2600 416 624
183.equake 1300 208 624
187.facerec 1900 300 632
188.ammp 2200 432 509
189.lucas 2000 396 506
191.fma3d 2100 531 395
200.sixtrack 1100 345 319
301.apsi 2600 528 493
SPEC fp2000 <503>
7
Godson-2E and Godson-2F 1.0GHz@90nm CMOS, 5-7W 47M xtors, area 36mm^2 Godson-2 CPU core
64-bit MIPS III compatible Four-issue, OOO 64KB+64KB L1 (four-way) 512KB L2 (four-way)
On-chip DDR controller SysAD Front-end bus
1.0GHz@90nm CMOS, 3-5W 51M xtors, area 43mm^2 Godson-2 CPU core
64-bit MIPS III compatible Four-issue, OOO 64KB+64KB L1 (four-way) 512KB L2 (four-way)
On-Chip DDR2 controller PCI/PCIX, local IO, GPIO, etc. Volume production
8
Low end roadmap: From CPU to SOC
CPU
NorthBridge
SouthBridge
Graphic
CPU+NB
SouthBridge
Graphic
CPU+NB
GPU+SouthBridge
CPU+GPU+NB+SB
2E 2006 2H 20092F 2007 2G 2008
PCI
PCI PCI/HT
9
Contents
A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS
10
Scalable architecture Reconfigurable CPU core and L2 X86 binary translation optimization Low power consumption >1.0GHz@65nm
Godson-3 Briefs
11
Scalable Architecture Design Scalable interconnection network
Crossbar + MeshSingle crossbar connects cores, L2s, and four directions
Directory-based cache coherence protocolDistributed L2 caches are globally addressedEach cache block has a directory entry Both data cache and instruction cache are recorded in directory
8x8 Xbar
P0 P1 P2 P3
L2 L2 L2 L2
E
S
W
N
E
S
W
N
12
Reconfigurable
Architecture
6*6 AXI Switch
S0
P0 P1 P2 P3
S1 S2 S3
m1 m2 m3 m4
m5
s5s1 s2 s3 s4
5*4 AXI Switch
MC1MC0
m0
s0
DM
A C
on
trolle
r
HT
HT
DM
A C
on
trolle
r
PC
I...Shared L2 can be configured as internal RAM; DMA to internal RAM directly
General Purpose Core:64-bit, 4-issue, OOO,AXI interface
Multiple Purpose Core: LINPACK, biology, signal processing, AXI interface
DMA engine supports pre-fetch and matrix transpose
8 configurable address windows of each master port allow pages migration across L2 and memory
13
GS464 general purpose core MIPS64, 200+ more instructions for X86
binary translation and media acceleration Four-issue superscalar OOO pipeline Two fix, two FP, one memory units Two FP units each supports full pipelined
double/paired-single MAC operation 48-bit VA and PA, 128-bit memory access 64KB icache and 64KB dcache, 4-way 64-entry fully associated TLB, 16-entry
ITLB, variable page size Non-blocking accesses, load speculation Directory-based cache-coherence for CMP Parity check for icache, ECC for dcache EJTAG for debugging Standard 128-bit AXI interface
DMA+AXI Controller
XBAR
1024*64
4W4R
AXI interface
Micro-Controller
Micro-Code Store
1024*64
4W4R
1024*64
4W4R
1024*64
4W4R
1024*64
4W4R
1024*64
4W4R
1024*64
4W4R
1024*64
4W4R
Processor Interface
BTB
BHT
ITLB
ICache
FixQueue
Reorder Queue
FloatingPoint
RegisterFile
ALU1
ALU2
FPU1
FPU2
FloatQueue
CP0Queue
DCACHE
DTLB
ROQBRQ
IntegerRegister
File
AGU
Write back Bus
Commit Bus
Map Bus
missqueue
Refill Bus
imemread dmemwrite
PC
PC
+16
dmemread, duncacheucqueue wtbkqueue
AXI Interface
EJTAG TAP ControllerTest Controller
JTAG InterfaceTest Interface
GS464 Architecture
Decode
Bus
Branch Bus
clock, reset, int, …
Pre
de
co
de
r
Deco
de
r
Reg
iste
r Ma
pp
er
Tag
Com
pare
GStera multiple purpose core Target for LINPACK, biology
computation, signal processing 8-16 MACs per node Big multi-port register file Reconfigurable based on applications. Standard 128-bit AXI interface
14
Matrix Transposing Performance
15+times faster
0
50000000
100000000
150000000
200000000
250000000
256x256 512x512 1024x1024
dsp
cpu
15
Hardware Support for X86 Binary Translation
Define new instructionsX86 ISA function and MIPS ISA formatBinary translation mechanism supporting>200 instructions are defined with 5% additional silicon cost
Speed up X86-to-MIPS binary translation 10 times compared to software only QEMU
649 6121010
421
1643
4178
2312
5430
4851
6743
9032
6086
9980
8616
10212
7954
14350
11611
9964
0
2000
4000
6000
8000
10000
12000
14000
16000
IDCT FFT(FX) FFT(FP) GP EFLAG
No HO, No SO
HO Only
Both
Ideal
MS windows Linux apps. on X86
Linux apps. on MIPS
System level X86 VM Process level X86 VM
Linux on MIPS
Enhanced MIPS decode
Enhanced Godson internal operations
Figure 2. The architecture of Godson-2 virtual machine
16
Contents
A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS
17
Physical implementation
65nm CMOS LP/GP technology Cell-based design methodology
DC + ICC Manual P&R for critical cells
2008: 4-core (4GP + 0MP) + 4MB L2GP: General purpose coreMP: Multiple purpose core10w@1GHz
2009: 8-core (4GP + 4MP) + 4MB L2 20w@1GHz
18
4-core
20084-core (general purpose core)65nm, 1GHz, 10w
6*6 X1 Switch
S0
P0 P1 P2 P3
S1 S2 S3
m1 m2 m3 m4
m5
s5s1 s2 s3 s4
5*4 X2 Switch
MC1MC0
m0
s0
DM
A C
on
trolle
r
HT
, PC
IE
HT
, PC
IE
DM
A C
on
trolle
r
PC
I...
19
8-core
X1 Switch
S0
P0 P1 P2 P3
S1 S2 S3
m1 m2 m3 m4
m5
s5s1 s2 s3 s4
X2 Switch
MC0
PCILPC
m0
s0
DM
A C
on
trolle
r
HT
, PC
IE
X1 Switch
S0
P0 P1 P2 P3
S1 S2 S3
m1 m2 m3 m4
m5
s5s1 s2 s3 s4
X2 Switch
MC1
m0
s0
HT
, PC
IE
DM
A C
on
trolle
r
20098-core (4GP + 4MP)65nm, 1GHz, 20w
20
Physical register file
Size: 321um x 262um
Power: 50mW@1GHz
Delay: 470ps
TLB CAM
Size: 224 um x 235 um
Power: 55mw @ 1GHz
Delay: 550ps
Full Customer Register file and CAM
21
HT1.0 Driver & Receiver
FlipChip Compatible 2Row design
800mw @ 1.6Gbps
Size: 250um x 300um
Power: < 10mW
Freq: 1.6GHz
Jitter: 20 ps
HyperTransport PHY
22
TEST CHIP
ST 65nm
1206um x 1206um
Function:
CAM1W1R - BIST
CAM1W1R - Scan
RAM4W4R - BIST
RAM4W4R - Scan
ICT_PLL - Freq. test
HT1.0 - BIST
HT1.0 - Error rate test
Test Chip for Customer Blocks
23
Cell-based High Performance Physical Design
The Full Hierarchical Design Methodology Manual placement & route for critical paths Manual placement of all FFs and clock buffers,
manual clock gating Architecture optimization with physical feedback
ictreg1
ictreg0
Mul
Div
regfile0
rissuebus2
resbus2
fwdbus2
mapbus0/1/2/3
resbus0
resbus1
resb
us0
resb
us2
fwdb
us2
br_p
c_va
lue
resb
us2/
3
fwdb
us2/
3
jr_ta
rget
alu1
qissbus
rissbus
alu_buf_vsrc
alu2
cam0fxqueue
24
Clock Tree
H-Tree + Mesh Manual placement
of FFs Manual clock gate
25
Layout of 4-core Godson-3
Xbar
Xbar
L2L2L2 L2
GS464GS464
GS464 GS464
HT HT
DMA DMA
DDR2/3 DDR2/3
26
Contents
A brief introduction to Godson processors The architecture of Godson-3 multi-core processor Physical implementation PetaFLOPS and TeraFLOPS
27
PetaFLOPS and TeraFLOPS
PetaFLOPS for Large Scale ApplicationsTo build PetaFLOPS HPC with Godson-3 in 2010
TeraFLOPS for Personal HPCPutting desktop to pocketsPutting TeraFLOPS to desktopHigh-performance computing for the masses
28
Scaling down TeraFLOPS
$100K/2007“refrigerator”
$50K/2008“washing machine”
$10K/2009“microwave oven”
TeraFLOPS in 1997
29
References
The architecture of Godson-2 superscalar architecture is available at: Weiwu Hu, Fuxin Zhang, Zusong Li, “Microarchitecture of the Godson-2
Processor”, Journal of Computer Science and Technology, 20(2):243-249, Mar. 2005
Weiwu Hu, Jiye Zhao, Shiqiang Zhong, Xu Yang, Elio Guidetti, Chris Wu, “Implementing a 1GHz Four-issue Out-of-Order Execution Microprocessor in a Standard Cell ASIC Methodology”, Journal of Computer Science and Technology, 22(1):1-14, Jan. 2007
The experiences learning from Godson processor design is available at: Weiwu Hu, Jian Wang, “Making Effective Decisions in Computer
Architects’ Real-World: Lessons and Experiences with Godson-2 Processor Designs”, Journal of Computer Science and Technology, 23(4), July 2008
The architecture of Godson-3 multi-core is available at: HotChip’08
30
Concluding Remarks
CPU R&D are of national strategic importance Godson-3 has a low-power, scalable architecture Godson-3 will be used to build
client side systemsPetaflops machinesTeraflops systems for the masses
The Godson team at ICT: open cooperation
31
Thanks