Xianfeng Li - The SuperK Project at MPRC (cadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN...)
-
The SuperK Project at MPRC
Xianfeng LI (李险峰)
Microprocessor Research and Development Center (MPRC) Peking University
-
2
What Is SuperK
A project on low-cost, portable computers
  UniCore architecture
  Hardware multimedia enhancement
  Flash disk
  Single-chip solution
Two product lines
  OLPC-like: ~1000 RMB
  UMPC-like: 3C computer
-
3
SuperK SoC Architecture
Dual-port BIU
Globally-Asynchronous-Locally-Synchronous (GALS)
-
4
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
6
UniCoreII Overview
TSMC 0.13um: clock rate 600 MHz
TSMC 90nm: clock rate 800 MHz
-
7
Int/FP Units
UniCore32
  8-stage pipeline
  7 operation modes
  Dynamic branch prediction
  Precise interrupt
UniCore-F64
  IEEE-754 standard
  2D/3D ISA extensions
  Floating-point register file (32 x 32-bit or 16 x 64-bit)
  Precise interrupt
[Pipeline diagram: IF1, IF2, DEC, ISS, then three parallel pipes (MUL1/MUL2/MADD, EXE1/EXE2/MSW, MAG/MEM1/MEM2), WB]
-
8
Other Core Components
Coprocessors
  CP0, OCD
Caches
  Decoupled I/D caches
  16KB, 4-way
MMUs
  Hierarchical TLBs
  Support for multiple page sizes
Dual-port BIU
  64-bit memory access bus
  32-bit high-speed system bus
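As a rough illustration of what a 16KB, 4-way cache implies for address decomposition, the sketch below derives the number of sets and the offset/index/tag widths. The 32-byte line size and the 32-bit address width are assumptions made for the example; the slides do not state them.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Return (sets, offset_bits, index_bits, tag_bits) for a set-associative cache.

    All sizes are assumed to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits


# 16KB and 4-way come from the slide; 32-byte lines are a hypothetical choice.
geometry = cache_geometry(16 * 1024, 4, 32)
print(geometry)
```

With a different line size only the split between offset and index bits changes; the total capacity and associativity stay as stated on the slide.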
-
9
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
10
UniGFX Graphics Engine
[Block diagram: SuperK SoC.
  Core: UniCore32, UniCore-F64, IMMU, DMMU, 16KB ICache, 16KB DCache, CP0, OCD, Two-Port BIU (Data Bus, Instruction Bus)
  Buses and ports: 64-bit Memory Bus, 32-bit System Bus, 32-bit Peripheral Bus, DataPort 1, DataPort 2, RegPort
  Controllers: 10M/100M/1G Ethernet MAC Controller, Host/PCI 2.2 Bridge, Static Memory Controller (SRAM/Flash), 6 Channel DMA, AHB/APB Bridge, IDE SATA Controller, AC’97, USB OTG, Multi-port Controller, DDR-II SDRAM Controller, NAND Flash Controller, MMC/SD, PS/2
  System modules: GPIO, RTC, INTC, Power Manager, OS Timer, Reset Controller, I2C, SPI, UART0/IrDA, UART1]
Efficient HW implementation of 2D graphics operations (Line Draw/ROP/BLT/Alpha Blending/...)
GUI acceleration
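For readers unfamiliar with the operations listed above, here is a minimal software sketch of alpha blending, a raster operation (ROP), and a rectangular block transfer (BLT). It only illustrates what such operations compute; the function names, pixel layout, and rounding are assumptions and say nothing about how the UniGFX hardware implements them.

```python
def alpha_blend(src, dst, alpha):
    """Blend two (R, G, B) pixels; alpha is 0..255, where 255 means fully src."""
    return tuple((s * alpha + d * (255 - alpha) + 127) // 255
                 for s, d in zip(src, dst))


def rop_xor(src, dst):
    """One example raster operation: bitwise XOR of source and destination."""
    return tuple(s ^ d for s, d in zip(src, dst))


def blt(dst_fb, src_fb, sx, sy, dx, dy, w, h, op):
    """Transfer a w x h rectangle from src_fb to dst_fb, combining pixels with op(src, dst)."""
    for row in range(h):
        for col in range(w):
            dst_fb[dy + row][dx + col] = op(src_fb[sy + row][sx + col],
                                            dst_fb[dy + row][dx + col])


# Hypothetical 2x2 frame buffers: a red source blended 50/50 onto a blue destination.
red, blue = (255, 0, 0), (0, 0, 255)
src_fb = [[red, red], [red, red]]
dst_fb = [[blue, blue], [blue, blue]]
blt(dst_fb, src_fb, 0, 0, 0, 0, 2, 2, lambda s, d: alpha_blend(s, d, 128))
print(dst_fb[0][0])
```

The per-pixel inner loop is the part a 2D engine would replace with dedicated datapaths, leaving the CPU to issue one command per rectangle.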
-
11
UniGFX Video Engine
HW/SW cooperative video decoding
  UniCore: predecoding
  MME: IDCT, MC
Dedicated HW for H.264 codec (collaboration with Prof. Y-L Lin’s group, TW NTHU)
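The IDCT stage assigned to the MME above can be pictured with the textbook 8x8 inverse DCT used by MPEG-style codecs. The sketch below is the direct floating-point formula, written for clarity; it is an assumption-free reference of the math only, not the fixed-point or fast algorithm a hardware block would use, and the function name is hypothetical.

```python
import math

N = 8  # block size used by MPEG-1/2/4 style transforms


def idct_8x8(coeffs):
    """Direct (unoptimized) 2D inverse DCT of an 8x8 coefficient block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            acc = 0.0
            for u in range(N):
                for v in range(N):
                    acc += (c(u) * c(v) * coeffs[u][v]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[x][y] = acc / 4  # 2/N scaling for N = 8
    return out


# A block with only a DC coefficient decodes to a flat block.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 80.0
flat = idct_8x8(dc_only)
print(round(flat[0][0], 6))
```

The four nested loops make the cost per block obvious, which is one reason this stage is a natural candidate for offloading from the CPU.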
-
12
UniGFX Display Engine
Supports GFX and Video display channel
Supports HW Cursor
Supports VGA interface
Inner DMA channels for display data fetching
-
13
Workflow
[Flow diagram. Labels: SPEC, C Model, System-C Model, RTL Model, Performance Evaluation, SoC TLM Environment, RTL Simulation, SoC RTL Environment, FPGA Validation, SoC FPGA Platform, Test Cases, Golden References, Reference]
-
14
Pressures on Bus and Memory
Module  Bandwidth (MB/s)  Behavior characteristics  Real-time req.  Condition
GE      ~20               Random & bursty           High            1024x768@32bpp
VE      ≤60               Varies with stream        High            720x576@25fps, YUV420
DE      180               Constant                  Hard            1024x768@60fps, 32bpp
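The DE figure can be checked with simple arithmetic: a constant refresh stream is width x height x bytes per pixel x frame rate. The sketch below does that calculation; reading MB as 2^20 bytes is an assumption that happens to reproduce the 180 on the slide. For the VE, only the decoded output stream is computed; the gap to the quoted upper bound presumably comes from other traffic such as reference-frame reads, which is also an assumption.

```python
MIB = 1024 * 1024  # assumption: the slide's MB is 2**20 bytes


def stream_bandwidth_mib(width, height, bytes_per_pixel, fps):
    """Bandwidth in MiB/s of an uncompressed frame stream."""
    return width * height * bytes_per_pixel * fps / MIB


de_refresh = stream_bandwidth_mib(1024, 768, 4, 60)    # 32bpp = 4 bytes per pixel
ve_output = stream_bandwidth_mib(720, 576, 1.5, 25)    # YUV420 = 1.5 bytes per pixel

print(f"DE refresh: {de_refresh:.1f} MiB/s")
print(f"VE decoded output only: {ve_output:.1f} MiB/s")
```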
-
15
Demo – H.264
-
16
Demo – DE
DE SiS 6326
-
17
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
18
Bus Structure
-
19
Bus Monitor
-
20
Verification
Simulation (test vectors, small programs)
FPGA (real-life programs, OS)
-
21
SURP: A Verification Platform
SURP = Scalable, Unified, Reusable Platform
Implementation: OpenVera + Synopsys RVM library
Can generate different kinds of stimuli at any of the five layers
Functional coverage monitoring
-
22
Benefits
Multiple IPs were verified with SURP
3x verification efficiency
Amount of testing code: 70% reduction
Found a bug in bus arbiter
Clock-Domain-Crossing (CDC) faults [DATE’07]
-
23
FPGA Verification
Dual-chip solution
  Two Xilinx XC4VLX200 chips, connected via LVDS
  DDR/DDR2 SDRAM slots
  Two NAND flash chips
  On-board VGA/USB/MAC PHYs
  PCI, IDE slots
  JTAG
[Figure: partition of the SoC across FPGA1 and FPGA2, linked in both directions (Fpga1_to_fpga2, Fpga2_to_fpga1).
  Blocks: UniCore-2, DDR-II, Display Engine, Graph Engine, MME, H264 E, H264 D, DMA, MAC, USB, SATA, PCI, FLASH, SRAM, RTC, GPIO, UART, SD Card, AC97, PS2, I2C, SPI, PM, OST, AHB/APB Bridge
  Buses: AHB SysBus32_Alt, APB SysBus32_Alt, AHB SysBus32_Lite, AHB SysBus64_Lite]
-
24
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
25
Transaction Level Model (TLM) - The First Cut
-
26
Performance Validation
Decode speed (fps), FPGA and TLM

codec   bit rate   decode speed (fps)   difference
mpeg2   4300K      4.49   4.92           9.47%
mpeg2   6000K      4.02   4.11           2.15%
mpeg2   8000K      3.66   3.61          -1.60%
mpeg2   10000K     3.33   3.21          -3.41%
mpeg2   12000K     3.04   2.91          -4.35%
mpeg4   2600K      3.56   3.38          -4.91%
mpeg4   4000K      3.09   2.97          -3.88%
mpeg4   6000K      2.71   2.63          -2.77%
mpeg4   8000K      2.30   2.36           2.56%
mpeg4   10000K     2.15   2.14          -0.56%
mpeg4   12000K     2.00   1.98          -0.73%
-
27
Evaluation Results
[Chart: MPEG2 decode rate, fps (axis 0 to 60) vs. bit rate (4300K, 6000K, 8000K, 10000K, 12000K bps)]
[Chart: MPEG2 UniCore memory accesses, MBytes (axis 0 to 30), read and write, vs. bit rate (4300K to 12000K bps)]
[Chart: MPEG2 MME memory access, MBytes (axis 0 to 100), read and write, vs. bit rate (4300K to 12000K bps)]
[Chart: MPEG2 high-speed bus usage, percent (axis 3.55% to 3.80%) vs. bit rate (4300K to 12000K bps)]
-
28
Future work
Model construction (UniCore + MME)
Model validation (UniCore + MME)
Model construction (full-fledged)
Model optimization (performance-accuracy tradeoff)
Performance evaluation/Design space exploration
Performance evaluation (UniCore + MME)
-
29
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
30
Issues
Timing
Power
DFT
DFM
Packaging
-
31
Timing Improving Techniques
Challenges
  Core: TSMC 0.13um, 600 MHz; a tough job for a standard cell-based design
  Die size: 7x7
  Pin number: 750
  8 clock domains
Useful skew
  Traditional clock skew scheduling (CSS): 6.4%
  Our algorithm: 26% [DAC’06]
Other tricks
  ~10% additional improvement
  ...
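The idea behind useful skew can be shown on a toy case: three flip-flops A -> B -> C with unbalanced path delays, where delaying B's clock lends time from the short path to the long one. This is a textbook illustration only, not the algorithm cited on the slide; the delays are hypothetical and setup time and hold checks are ignored.

```python
def zero_skew_period(d_ab, d_bc):
    """Minimum clock period when all flops see the clock at the same time."""
    return max(d_ab, d_bc)


def useful_skew_period(d_ab, d_bc):
    """Minimum period when B's clock may be shifted by s_b (A and C fixed at 0).

    Setup constraints:
        T >= d_ab - s_b   (A launches at 0, B captures at T + s_b)
        T >= d_bc + s_b   (B launches at s_b, C captures at T)
    Both are tight when s_b = (d_ab - d_bc) / 2.
    """
    s_b = (d_ab - d_bc) / 2
    return (d_ab + d_bc) / 2, s_b


# Hypothetical path delays in ns.
t_zero = zero_skew_period(6, 4)
t_skew, s_b = useful_skew_period(6, 4)
print(t_zero, t_skew, s_b)
```

In this toy case the period drops from the longest path delay to the average of the two, which is the kind of gain skew scheduling looks for across a whole design.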
-
32
Power Considerations
Dynamic power
  Clock gating: ~30% reduction
Static power
  Multi-VT technique: high-Vt cells for non-critical paths
  EDA tools: ~20% reduction
  Our replacement algorithm: ~20% more reduction
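The multi-Vt idea (use slower, low-leakage high-Vt cells only where timing slack allows) can be sketched as a greedy swap along a single path. This is a generic illustration, not the replacement algorithm mentioned on the slide; the cell names, delay penalties (ps), leakage savings (nW), and the single-path slack model are all hypothetical simplifications.

```python
def swap_to_high_vt(cells, slack_ps):
    """Greedily pick cells on one path to swap to high-Vt.

    cells: list of (name, delay_penalty_ps, leakage_saving_nw).
    Cells with the best leakage saving per ps of added delay are tried first,
    and a swap is accepted only if the path still has enough slack.
    Returns (names_swapped, remaining_slack_ps).
    """
    chosen = []
    ranked = sorted(cells, key=lambda c: c[2] / c[1], reverse=True)
    for name, penalty, _saving in ranked:
        if penalty <= slack_ps:
            slack_ps -= penalty
            chosen.append(name)
    return chosen, slack_ps


# Hypothetical cells on a path with 25 ps of slack.
path_cells = [("u1", 10, 50), ("u2", 20, 40), ("u3", 5, 5)]
swapped, remaining = swap_to_high_vt(path_cells, 25)
print(swapped, remaining)
```

A real flow would re-run timing analysis after each batch of swaps, since cells shared between paths make slack interdependent.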
-
33
DFT
Test data compression
  Test data: 431M (after dynamic compaction)
  Mentor: 10x compression
  Our algorithm: 43x compression [ICCAD’07]
[Figure: block diagram of the compression scheme. Labels: scan slice, FSM, single coding, write coding, read coding, dictionary, scan slice selector, shift, default value, index, segment content, codeword, clk, select, we, control, address]
-
34
DFM
Dummy metal insertion and double via for higher product yield
Fix SI violations caused by dummy metal
Several iterations between timing optimization and DFM
[Flow chart. Labels: dummy metal insertion and double via, layout parasitic extraction, timing analysis, "Timing met?", timing optimization, OK]
-
35
Packaging – Flipchip
Motivation
  IR drop problem
  Thermal consideration
  Electrical consideration
Approach
  Peripheral IO design
  Using Cu for RDL
  Bump template for self-adjustment
-
36
Outline
UniGFX
SW IC
Simulator
UniCore
SoC
-
37
Software System
[Diagram: software stack. Labels: SuperK SoC, Bootloader, Linux Kernel, Device Drivers, Application System, Toolchain (as, gcc, ld, glibc)]
-
38
Software Development Platforms
System-level simulator
  Linux kernel porting
FPGA board
  Verification of SoC modules
  Driver development
ASIC board
  Software integration
  Linux kernel porting
[Diagram: Simulator, FPGA board, ASIC board]
-
39
Toolchain
Porting to UniCore architecture
  gcc, ld, as, glibc
Integrated Development Environment (IDE)
  Based on Eclipse
  Native execution or simulation
Debugging
  Remote GDB
Optimizations
  Link-time optimization
  Code compaction
  Memory access reductions
-
40
Applications
Graphics support
  X Window
  GNOME
Key applications
  OpenOffice
  Mozilla Firefox, email client, multimedia player
  JVM
Software package management tool
Chinese language support
-
41
Application Demo
-
42
Summary
A single-chip solution
  UniCore based
  SoC-centric
  Hardware multimedia enhancement
Project progress
  Multi-disciplinary work
  Close collaboration of ~8 teams
  First chip: late this year
[Block diagram: SuperK SoC.
  Core: UniCore32-II, UniCoreF64-II, IMMU, DMMU, 16KB ICache, 16KB DCache, CP0, OCD, Two-Port BIU (Data Bus, Instruction Bus)
  Multimedia: Graphics Engine, 2D Graphics Accelerator, MPEG 1/2/4 Decoder, H.264 Codec, VGA Controller
  Buses and ports: 64-bit Memory Bus, 32-bit System Bus, DataPort 1, DataPort 2, RegPort
  Controllers: 10M/100M/1G Ethernet MAC Controller, Host/PCI 2.2 Bridge, Static Memory Controller (SRAM/Flash), 6-Channel DMA, Bridge, IDE SATA Controller, AC’97, USB OTG, Multi-port Controller, DDR-II SDRAM Controller, NAND Flash Controller, MMC/SD, PS/2
  System modules: GPIO, RTC, INTC, Power Manager, OS Timer, Reset Controller, I2C, SPI, UART0/IrDA, UART1]