The SuperK Project at MPRC Xianfeng LI (李险峰) Microprocessor Research and Development Center (MPRC) Peking University

  • The SuperK Project at MPRC

    Xianfeng LI (李险峰)

    Microprocessor Research and Development Center (MPRC) Peking University

  • 2

    What Is SuperK

    A project on low-cost, portable computers

      UniCore architecture
      Hardware multimedia enhancement
      Flash disk
      Single-chip solution

    Two product lines

      OLPC-like: ~1000 RMB
      UMPC-like: 3C computer

  • 3

    SuperK SoC Architecture

    Dual-port BIU

    Globally-Asynchronous-Locally-Synchronous (GALS)

  • 4

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 5

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 6

    UniCoreII Overview

    TSMC 0.13 um: clock rate 600 MHz

    TSMC 90 nm: clock rate 800 MHz

  • 7

    Int/FP Units

    UniCore32

      8-stage pipeline
      7 operation modes
      Dynamic branch prediction
      Precise interrupt

    UniCore-F64

      IEEE-754 standard
      2D/3D ISA extensions
      Floating-point register file (32 x 32-bit or 16 x 64-bit)
      Precise interrupt

    [Figure: pipeline stages]

      IF1, IF2, DEC, ISS, then one of three execution paths, then WB

      MUL1, MUL2, MADD
      EXE1, EXE2, MSW
      MAG, MEM1, MEM2

  • 8

    Other Core Components

    Coprocessors: CP0, OCD

    Caches: decoupled I/D caches, 16KB, 4-way

    MMUs: hierarchical TLBs, support for multiple page sizes

    Dual-port BIU: 64-bit memory access bus, 32-bit high-speed system bus
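
    The cache geometry above can be turned into an address split. The sketch below is illustrative only: the slide gives the capacity (16KB) and associativity (4-way), but the line size is not stated, so `LINE_BYTES = 32` is an assumption, as is the 32-bit address and the helper name `split_address`.

```python
CACHE_BYTES = 16 * 1024  # from the slide
WAYS = 4                 # from the slide
LINE_BYTES = 32          # assumption: the slide does not give the line size

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set index, byte offset) for a 32-bit address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0x12345678)
    print(f"sets={NUM_SETS} tag={tag:#x} index={index} offset={offset}")
```

    With a different line size only `LINE_BYTES` changes; the number of sets and the tag width follow from it.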

  • 9

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 10

    UniGFX Graphics Engine

    [Figure: SuperK SoC block diagram. Labels recovered from the figure:]

      Core: UniCore32, UniCore-F64, CP0, OCD, IMMU, 16KB ICache, DMMU, 16KB DCache, Two-Port BIU (Instruction Bus, Data Bus)
      Buses and bridges: 64-bit Memory Bus, 32-bit System Bus, 32-bit Peripheral Bus, AHB/APB Bridge, Host/PCI 2.2 Bridge
      Memory and storage: Multi-port DDR-II SDRAM Controller, DataPort 1, DataPort 2, RegPort, Static Memory Controller (SRAM/Flash), NAND Flash Controller, IDE/SATA Controller, MMC/SD
      Connectivity: 10M/100M/1G Ethernet MAC Controller, USB OTG Controller, 6-Channel DMA
      System modules: GPIO, RTC, INTC, Power Manager, OS Timer, Reset Controller
      Peripherals: I2C, SPI, UART0/IrDA, UART1, AC’97, PS/2

    Efficient HW implementation of 2D graphics operations (Line Draw/ROP/BLT/Alpha Blending/...)

    GUI acceleration
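
    As a software reference for the 2D operations listed above, the sketch below shows what alpha blending, a raster operation (ROP), and a block transfer (BLT) compute per pixel. It is not the UniGFX implementation: the 0xRRGGBB pixel format, the list-of-rows framebuffer, and the names `alpha_blend`, `rop_xor`, and `blt` are all assumptions made for illustration.

```python
from typing import Callable, List

Pixel = int  # assumed pixel format: 0xRRGGBB


def alpha_blend(src: Pixel, dst: Pixel, alpha: int) -> Pixel:
    """Blend src over dst; alpha in 0..255 is the weight of src."""
    out = 0
    for shift in (16, 8, 0):
        s = (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        # Rounded integer form of s*a + d*(1-a) with a = alpha/255.
        c = (s * alpha + d * (255 - alpha) + 127) // 255
        out |= c << shift
    return out


def rop_xor(src: Pixel, dst: Pixel) -> Pixel:
    """One example raster operation: bitwise XOR of source and destination."""
    return src ^ dst


def blt(dst_fb: List[List[Pixel]], src_fb: List[List[Pixel]],
        x: int, y: int, op: Callable[[Pixel, Pixel], Pixel]) -> None:
    """Combine src_fb into dst_fb at (x, y) using op(src, dst) per pixel."""
    for row, src_row in enumerate(src_fb):
        for col, s in enumerate(src_row):
            dst_fb[y + row][x + col] = op(s, dst_fb[y + row][x + col])


if __name__ == "__main__":
    fb = [[0x000000] * 4 for _ in range(3)]
    blt(fb, [[0xFFFFFF, 0xFFFFFF]], 1, 1, rop_xor)
    print([f"{p:06x}" for p in fb[1]])
    print(f"{alpha_blend(0xFF0000, 0x0000FF, 128):06x}")
```

    A hardware engine performs the same per-pixel work without occupying the CPU, which is where the GUI acceleration comes from.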

  • 11

    UniGFX Video Engine

    [Figure: SuperK SoC block diagram, as on slide 10]

    HW/SW co-operated video decoding

      UniCore: predecoding
      MME: IDCT, MC

    Dedicated HW for H.264 codec (collaboration with Prof. Y-L Lin’s group, TW NTHU)

  • 12

    UniGFX Display Engine

    [Figure: SuperK SoC block diagram, as on slide 10]

    Supports GFX and Video display channel

    Supports HW Cursor

    Supports VGA interface

    Inner DMA channels for display data fetching

  • 13

    Workflow

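    [Figure: workflow diagram. Labels recovered from the figure:]

      Models: SPEC, C Model, System-C Model, RTL Model
      Steps: Performance Evaluation, RTL Simulation, FPGA Validation
      Environments: SoC TLM Environment, SoC RTL Environment, SoC FPGA Platform
      Other labels: Test Cases, Golden References, Reference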

  • 14

    Pressures on Bus and Memory

    Module   Bandwidth (MB/s)   Behavior characteristics   Real-time req.   Condition

    GE       ~20                Random & bursty            High             1024x768 @ 32bpp

    VE       ≤60                Varies with stream         High             720x576 @ 25fps, YUV420

    DE       180                Constant                   Hard             1024x768 @ 60fps, 32bpp
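
    The display engine figure can be checked with simple arithmetic: an uncompressed frame stream needs width x height x bytes per pixel x frame rate. The sketch below assumes the slide's "MB" means 2**20 bytes, under which the 1024x768, 32bpp, 60fps refresh comes to 180 MB/s. For the video engine it computes only the decoded YUV420 output rate, which is a lower bound and not the slide's ≤60 MB/s figure. The function name is illustrative.

```python
MIB = 1024 * 1024  # assumption: the slide's "MB" is read as 2**20 bytes


def stream_bandwidth_mib(width: int, height: int,
                         bytes_per_pixel: float, fps: int) -> float:
    """Bandwidth of an uncompressed frame stream, in MiB/s."""
    return width * height * bytes_per_pixel * fps / MIB


# Display engine: 1024x768, 32 bpp (4 bytes per pixel), 60 frames per second.
de_refresh = stream_bandwidth_mib(1024, 768, 4, 60)

# Video engine output only: 720x576 YUV420 (12 bits = 1.5 bytes per pixel), 25 fps.
ve_output = stream_bandwidth_mib(720, 576, 1.5, 25)

if __name__ == "__main__":
    print(f"DE refresh:        {de_refresh:.1f} MiB/s")
    print(f"VE decoded output: {ve_output:.1f} MiB/s")
```

    The display refresh is a constant, hard real-time load, which is why it dominates the bandwidth budget in the table.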

  • 15

    Demo – H.264

  • 16

    Demo – DE

    DE SiS 6326

  • 17

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 18

    Bus Structure

  • 19

    Bus Monitor

  • 20

    Verification

    Simulation (test vectors, small programs)

    FPGA (real-life programs, OS)

  • 21

    SURP: A Verification Platform

    SURP = Scalable, Unified, Reusable Platform

      Implementation: OpenVera + Synopsys RVM library
      Can generate different kinds of stimuli at any of the five layers
      Functional coverage monitoring

  • 22

    Benefits

    Multiple IPs were verified with SURP

    3x verification efficiency

    Amount of testing code: 70% reduction

    Found a bug in bus arbiter

    Clock-Domain-Crossing (CDC) faults [DATE’07]

  • 23

    FPGA Verification

    Dual-chip solution

      Two Xilinx XC4VLX200 chips, connected via LVDS
      DDR/DDR2 SDRAM slots
      Two NAND flash chips
      On-board VGA/USB/MAC PHYs
      PCI, IDE slots
      JTAG

    [Figure: partitioning of the SoC across FPGA1 and FPGA2. Labels recovered from the figure:]

      Blocks: UniCore-2, DDR-II, Display Engine, Graph Engine, MME, H264 E, H264 D, DMA, MAC, USB, SATA, PCI, FLASH, SRAM, RTC, GPIO, UART, SD Card, AC97, PS2, I2C, SPI, PM, OST, AHB/APB Bridge
      Buses: AHB SysBus32_Alt, APB SysBus32_Alt, AHB SysBus32_Lite, AHB SysBus64_Lite
      Inter-FPGA links: Fpga1_to_fpga2, Fpga2_to_fpga1 (dual directions)

  • 24

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 25

    Transaction Level Model (TLM) - The First Cut

  • 26

    Performance Validation

    Decode speed (fps) on FPGA and in the TLM, by bit rate

    Codec   Bit rate   FPGA   TLM    Difference

    mpeg2   4300K      4.49   4.92    9.47%
    mpeg2   6000K      4.02   4.11    2.15%
    mpeg2   8000K      3.66   3.61   -1.60%
    mpeg2   10000K     3.33   3.21   -3.41%
    mpeg2   12000K     3.04   2.91   -4.35%

    mpeg4   2600K      3.56   3.38   -4.91%
    mpeg4   4000K      3.09   2.97   -3.88%
    mpeg4   6000K      2.71   2.63   -2.77%
    mpeg4   8000K      2.30   2.36    2.56%
    mpeg4   10000K     2.15   2.14   -0.56%
    mpeg4   12000K     2.00   1.98   -0.73%
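
    The difference column can be recomputed from the two decode speeds. The sketch below assumes that the first speed in each row is the FPGA measurement, the second is the TLM estimate, and the difference is (TLM - FPGA) / FPGA; the slide does not state the formula. Because the speeds are rounded to two decimals, the recomputed values only approximate the reported ones.

```python
# (codec, bit rate, first speed in fps, second speed in fps, reported difference in %)
# Assumption: first speed = FPGA, second speed = TLM.
ROWS = [
    ("mpeg2", "4300K", 4.49, 4.92, 9.47),
    ("mpeg2", "6000K", 4.02, 4.11, 2.15),
    ("mpeg2", "8000K", 3.66, 3.61, -1.60),
    ("mpeg2", "10000K", 3.33, 3.21, -3.41),
    ("mpeg2", "12000K", 3.04, 2.91, -4.35),
    ("mpeg4", "2600K", 3.56, 3.38, -4.91),
    ("mpeg4", "4000K", 3.09, 2.97, -3.88),
    ("mpeg4", "6000K", 2.71, 2.63, -2.77),
    ("mpeg4", "8000K", 2.30, 2.36, 2.56),
    ("mpeg4", "10000K", 2.15, 2.14, -0.56),
    ("mpeg4", "12000K", 2.00, 1.98, -0.73),
]


def relative_difference(fpga_fps: float, tlm_fps: float) -> float:
    """Model error relative to the FPGA measurement, in percent."""
    return (tlm_fps - fpga_fps) / fpga_fps * 100.0


recomputed = [relative_difference(fpga, tlm) for _, _, fpga, tlm, _ in ROWS]
worst_reported = max(abs(reported) for *_, reported in ROWS)

if __name__ == "__main__":
    for (codec, rate, _, _, reported), value in zip(ROWS, recomputed):
        print(f"{codec} {rate:>6}: reported {reported:+.2f}%, recomputed {value:+.2f}%")
    print(f"largest reported |difference|: {worst_reported:.2f}%")
```

    Apart from the lowest MPEG2 bit rate, every reported difference is within 5%.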

  • 27

    Evaluation Results

    [Figure: four charts of MPEG2 results over bit rates of 4300K, 6000K, 8000K, 10000K, and 12000K bps]

      MPEG2 decode rate (fps; axis 0 to 60)
      MPEG2 UniCore memory accesses (MBytes, read and write; axis 0 to 30)
      MPEG2 MME memory access (MBytes, read and write; axis 0 to 100)
      MPEG2 high-speed bus usage (axis 3.55% to 3.80%)

  • 28

    Future work

    Model construction (UniCore + MME)

    Model validation (UniCore + MME)

    Model construction (full-fledged)

    Model optimization (performance-accuracy tradeoff)

    Performance evaluation/Design space exploration

    Performance evaluation (UniCore + MME)

  • 29

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 30

    Issues

    Timing

    Power

    DFT

    DFM

    Packaging

  • 31

    Timing Improving Techniques

    Challenges

      Core: TSMC 0.13 um, 600 MHz (a tough job for standard cell-based design)
      Die size: 7x7
      Pin number: 750
      8 clock domains

    Useful skew

      Traditional clock skew scheduling (CSS): 6.4%
      Our algorithm: 26% [DAC’06]

    Other tricks

      ~10% additional improvement
      ...
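
    To illustrate the idea behind useful skew, the sketch below implements textbook clock skew scheduling, not the DAC’06 algorithm referenced above. Each register-to-register path contributes a setup and a hold constraint on the clock arrival times; a clock period is feasible when this system of difference constraints has no negative cycle (checked with Bellman-Ford), and the minimum period is found by binary search. The three-register ring and its delays are hypothetical, and setup time and clock-to-Q delay are ignored.

```python
from typing import List, Tuple

# (launch register, capture register, max path delay, min path delay)
Path = Tuple[int, int, float, float]


def feasible(n: int, paths: List[Path], period: float) -> bool:
    """True if clock arrival times s[0..n-1] exist that meet every constraint."""
    edges = []
    for i, j, dmax, dmin in paths:
        edges.append((j, i, period - dmax))  # setup: s[i] - s[j] <= period - dmax
        edges.append((i, j, dmin))           # hold:  s[j] - s[i] <= dmin
    dist = [0.0] * n  # as if a virtual source reached every node at distance 0
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True   # converged: dist is a valid skew schedule
    return False          # still relaxing after n passes: negative cycle


def min_period(n: int, paths: List[Path], tol: float = 1e-6) -> float:
    """Smallest feasible clock period, by binary search."""
    lo, hi = 0.0, max(p[2] for p in paths)  # zero skew works at the max delay
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(n, paths, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical ring of three registers with unbalanced path delays.
RING: List[Path] = [(0, 1, 4.0, 1.0), (1, 2, 2.0, 1.0), (2, 0, 3.0, 1.0)]

if __name__ == "__main__":
    zero_skew = max(p[2] for p in RING)
    skewed = min_period(3, RING)
    print(f"zero-skew period {zero_skew:.2f}, with useful skew {skewed:.2f}")
```

    In this toy ring the period drops from the longest path delay (4) toward the average delay around the loop (3), because the skew lends slack from the short paths to the long one. The percentages on the slide come from the real design and are not reproduced by this example.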

  • 32

    Power Considerations

    Dynamic power

      Clock gating: ~30% reduction

    Static power

      Multi-Vt technique: high-Vt cells for non-critical paths
      EDA tools: ~20% reduction
      Our replacement algorithm: ~20% more reduction

  • 33

    DFT

    Test data compression

      Test data: 431M (after dynamic compaction)
      Mentor: 10x compression
      Our algorithm: 43x compression [ICCAD’07]

    [Figure: test data decompression architecture. Labels recovered from the figure:]

      Blocks: FSM, dictionary, Scan Slice Selector, scan slice
      Coding modes: Single Coding, Write Coding, Read Coding
      Signals and fields: codeword, index, address, Segment content, Default value, shift, Select, control, clk, we

  • 34

    DFM

    Dummy metal insertion and double via for higher product yield

    Fix SI violations caused by dummy metal

    Several iterations between timing optimization and DFM

    [Figure: DFM and timing closure flow. Labels recovered from the figure:]

      Dummy metal insertion and double via
      Layout parasitic extraction
      Timing analysis
      Timing met?
      Timing optimization
      OK

  • 35

    Packaging – Flipchip

    Motivation

      IR drop problem
      Thermal consideration
      Electrical consideration

    Approach

      Peripheral IO design
      Using Cu for RDL
      Bump template for self-adjustment

  • 36

    Outline

    UniGFX

    SW IC

    Simulator

    UniCore

    SoC

  • 37

    Software System

    SuperK SoC

    Linux Kernel

    Bootloader

    Application System

    as

    gcc

    ld

    glibc

    Toolchain

    Device Drivers

  • 38

    Software Development Platforms

    System-level simulator

      Linux kernel porting

    FPGA board

      Verification of SoC modules
      Driver development

    ASIC board

      Software integration
      Linux kernel porting

    Simulator, FPGA board, ASIC board

  • 39

    Toolchain

    Porting to UniCore architecture

      gcc, ld, as, glibc

    Integrated Development Environment (IDE)

      Based on Eclipse
      Native execution or simulation

    Debugging

      Remote GDB

    Optimizations

      Link-time optimization
      Code compaction
      Memory access reductions

  • 40

    Applications

    Graphics support

      X-window
      Gnome

    Key applications

      OpenOffice
      Mozilla Firefox, email client, multimedia player
      JVM

    Software package management tool

    Chinese language support

  • 41

    Application Demo

  • 42

    Summary

    A single-chip solution

      UniCore based
      SoC-centric
      Hardware multimedia enhancement

    Project progress

      Multi-disciplinary work
      Close collaboration of ~8 teams
      First chip: late this year

    [Figure: full SuperK SoC block diagram. In addition to the blocks shown on slide 10, the labels include UniCore32-II, UniCoreF64-II, 2D Graphics Accelerator, MPEG 1/2/4 Decoder, H.264 Codec, VGA Controller, and Graphics Engine]