Synchronous Latency Insensitive Design

Christer Svensson, ASYNC 2004 1 Synchronous Latency Insensitive Design Christer Svensson and Anders Edman Linköping University


  • Christer Svensson, ASYNC 2004 1

    Synchronous Latency Insensitive Design

    Christer Svensson and Anders Edman
    Linköping University

  • Christer Svensson, ASYNC 2004 2

    Outline

    • Introduction
    • Overview of wire properties
    • Architectural view of future systems
    • Synchronous Latency Insensitive Design
    • Multiple clocks
    • Conclusion

  • Christer Svensson, ASYNC 2004 3

    Introduction

    The wire delay problem was recognized very early (Anceau 1982)

    Despite the 1982 “alarm”, we still manage multigigahertz synchronous designs, BUT today with considerable problems.

    ASIC-style designs are normally limited to a 300-500 MHz clock, with severe “timing closure” problems.

    Multigigahertz designs require a very demanding full-custom design style.

    Wire delay ~ $L^2/s^2$, gate delay ~ $s^\alpha$, where $s$ = feature size and $\alpha$ = 1..2
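The scaling relations on this slide can be made concrete with a small Python sketch. It treats both delays as pure proportionalities (wire delay ~ L²/s², gate delay ~ s^α) and drops all constants; the function name `delay_scaling` and the example numbers are illustrative assumptions, not taken from the talk.

```python
def delay_scaling(shrink, alpha=1.0, length_ratio=1.0):
    """Relative change in wire and gate delay when the feature size s
    is multiplied by `shrink` (e.g. 0.5 for a halving).

    Uses only the proportionalities wire delay ~ L^2/s^2 and
    gate delay ~ s^alpha; all constants are dropped.
    """
    wire = length_ratio ** 2 / shrink ** 2
    gate = shrink ** alpha
    return wire, gate


# Halving the feature size while keeping the wire length unchanged:
wire, gate = delay_scaling(0.5, alpha=1.0)
print(f"wire delay x{wire:.1f}, gate delay x{gate:.1f}")
```

With a halved feature size and α = 1, the wire becomes four times slower while the gate becomes twice as fast, so the wire-to-gate delay ratio grows eightfold in this simple model.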

  • Christer Svensson, ASYNC 2004 4

    Introduction

    The synchronous design paradigm is VERY well established, and we need to keep it.
    (It is easy to keep track of the exact timing of all events; performance is predictable.)
    Vast experience is used to manage ever-increasing complexity.

    Critical: Timing relations between clock and data

    Present solution: “Flat” clock distribution (skew-free clock)
    Does not solve the problem with data delays

    [Figure: balanced clk net with no skew; wire delay still affects data]

  • Christer Svensson, ASYNC 2004 5

    Overview of wire properties

    [Figure: wire types. Cables: twisted pair, coaxial cable. Circuit boards and chips: microstrip, coplanar waveguide, with ground planes.]

    We will concentrate on microstrip in the following

  • Christer Svensson, ASYNC 2004 6

    Overview of wire properties
    Skin effect loss

    Higher frequencies: skin effect
    Fields penetrate the metal to skin depth δ

    [Figure: current flow confined to depth δ (skin depth)]

    Resistance per unit length, r:

    $$ r = r_s \sqrt{\omega} $$

    Including current phase and low-frequency resistance:

    $$ r = r_{DC} + r_s (1 + j) \sqrt{\omega} $$

    Frequency dependence (dispersion) gives rise to signal distortion
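As a rough numerical illustration of the skin depth δ mentioned on this slide, the sketch below uses the standard textbook expression δ = sqrt(2ρ/(ωμ0)) for a non-magnetic conductor. That formula and the copper resistivity value are assumptions brought in for the example; they do not appear in the slides.

```python
import math

MU_0 = 4e-7 * math.pi  # permeability of free space, H/m


def skin_depth(resistivity, frequency_hz):
    """Skin depth in metres for a non-magnetic conductor."""
    omega = 2 * math.pi * frequency_hz
    return math.sqrt(2 * resistivity / (omega * MU_0))


RHO_CU = 1.7e-8  # ohm*m, assumed value for copper near room temperature

for f_ghz in (1, 10):
    delta_um = skin_depth(RHO_CU, f_ghz * 1e9) * 1e6
    print(f"{f_ghz:>2} GHz: skin depth = {delta_um:.2f} um")
```

Because δ falls as 1/sqrt(ω), the resistance of a wire thicker than δ rises as sqrt(ω), which is the frequency dependence of r given on this slide.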

  • Christer Svensson, ASYNC 2004 7

    Overview of wire properties

    We discuss 2 wire properties in the following

    • Delay (latency)
    • Capacity (maximum data rate)

  • Christer Svensson, ASYNC 2004 8

    Overview of wire properties

    High-loss case (RC case), $r_{DC}L/Z_0 > 2\ln 2$. Elmore delay is a good approximation:

    $$ t_d = \left[ R_S (C_S + C_w + C_L) + R_w \left( \frac{C_w}{2} + C_L \right) \right] \ln 2 $$
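A minimal Python sketch of the RC-case delay estimate follows. It assumes the usual Elmore form for a driver feeding a distributed wire and a load, t_d = [R_S(C_S + C_w + C_L) + R_w(C_w/2 + C_L)] ln 2, with R_S and C_S read as the driver's output resistance and capacitance, R_w and C_w as the total wire resistance and capacitance, and C_L as the load capacitance. The function name and the example values are hypothetical.

```python
import math


def elmore_delay(r_s, c_s, r_w, c_w, c_l):
    """50% delay estimate: ln(2) times the Elmore time constant of a
    driver (r_s, c_s) feeding a distributed wire (r_w, c_w) and load c_l."""
    tau = r_s * (c_s + c_w + c_l) + r_w * (c_w / 2 + c_l)
    return tau * math.log(2)


# Hypothetical example: 100 ohm driver, 1 kohm / 200 fF wire, 10 fF load
t_d = elmore_delay(r_s=100.0, c_s=5e-15, r_w=1e3, c_w=200e-15, c_l=10e-15)
print(f"t_d = {t_d * 1e12:.1f} ps")
```

For an ideal driver and no load (R_S = 0, C_L = 0) the estimate reduces to (ln 2 / 2) R_w C_w, so the delay grows with the square of the wire length.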

    Low-loss case (LC case), $r_{DC}L/Z_0 < 2\ln 2$

  • Christer Svensson, ASYNC 2004 9

    Overview of wire properties
    Capacity or maximum data rate

    [Figure: single pulse and eye diagram, with the eye opening and the step-response value S(T) at symbol time T marked]

    Eye opening = 2S(T)-1, where S(t) is the step response and T the symbol time

    We need a minimum opening for safe data detection, say 64%

    For long wires we may afford a simple equalizer, allowing 0%

  • Christer Svensson, ASYNC 2004 10

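    Overview of wire properties
    Capacity or maximum data rate

    RC-wire: Step response:

    $$ S(T) = 1 - e^{-2T/(R_w C_w)} $$

    Eye opening of 64% yields S(T)=0.82 or T=0.85 $R_w C_w$

    Max data rate:

    $$ B = \frac{1}{T} = b_{RC} \frac{A}{L^2} $$

    LC-wire: Step response (skin effect):

    $$ S(T) = 1 - \mathrm{erf}\left( \frac{L}{2 w Z_0} \sqrt{\frac{\rho \mu_0}{T}} \right) $$

    Max data rate:

    $$ B = b_{LC} \frac{A}{L^2} $$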
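The link between eye opening and symbol time for the RC wire can be checked numerically. The sketch below assumes the RC step response S(T) = 1 - exp(-2T/(RwCw)), which is the form consistent with the quoted pair S(T) = 0.82 and T = 0.85 RwCw; the function names are illustrative.

```python
import math


def required_step_level(eye_opening):
    """Invert eye opening = 2*S(T) - 1 to get the step-response level S(T)."""
    return (1.0 + eye_opening) / 2.0


def rc_symbol_time(eye_opening):
    """Minimum symbol time T, in units of Rw*Cw, for an RC wire with
    S(T) = 1 - exp(-2*T / (Rw*Cw))."""
    s = required_step_level(eye_opening)
    return -math.log(1.0 - s) / 2.0


for eye in (0.64, 0.0):
    print(f"eye {eye:.0%}: S(T) = {required_step_level(eye):.2f}, "
          f"T = {rc_symbol_time(eye):.2f} * Rw*Cw")
```

In this model a 64% opening needs T of about 0.86 RwCw (0.85 on the slide), while accepting a 0% opening with an equalizer shortens T to about 0.35 RwCw, roughly a 2.5 times higher data rate.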

  • Christer Svensson, ASYNC 2004 11

    Overview of wire properties
    Note the difference between latency and data rate

    [Figure: signal waveforms with wire delay t_d and symbol time T_s marked, including the case T_s > t_d]

  • Christer Svensson, ASYNC 2004 12

    Overview of wire properties
    Estimated data rates

    [Figure: estimated data rates for typical wires, with a low-delay region marked]

    • Board wire: 10 Gb/s @ 0.5 m
    • Top metal chip wire: 10 Gb/s @ 15 mm
    • Low-level metal wire: 10 Gb/s @ 1 mm

  • Christer Svensson, ASYNC 2004 13

    Overview of wire properties

    Low-level on-chip wires
    • Wire delay limits the diameter of a synchronous block
    • System partition: “Globally Asynchronous Locally Synchronous”

    Upper on-chip wires
    • Low-delay, high-data-rate global communication
    • Inter-block communication

    Circuit board wires
    • Can be used up to at least 10 Gb/s per wire
    • Facilitate very high on-board bandwidths

  • Christer Svensson, ASYNC 2004 14

    Overview of wire properties
    On-chip local

    Future processes, feature size f = 0.1 - 0.035 µm
    • Wire cross section ~ $3f^2$; for 0.1 µm: $3\cdot10^{-14}\,\mathrm{m}^2$
    • 10 Gb/s up to 1.25 mm length
    • A 1 mm wire will have a delay of 26 ps (26% of a 10 GHz clock cycle)

    We may use a 10 GHz clock frequency in a fully synchronous block of diameter 1 mm. Such a block can contain 250,000 gates.
    (Compare to Sylvester and Keutzer: 50-100 kgates)

    Note that block area (diameter squared) scales as $f^2$ and the number of gates per unit area as $f^{-2}$, so 250 kgates is maintained down to 0.035 µm (or further) at 10 GHz.
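The arithmetic behind these numbers can be sketched in a few lines of Python. The sketch reads the scaling note as "block area ~ f² and gate density ~ f⁻²", which is an interpretation rather than something the slide spells out; the function names are illustrative.

```python
def clock_fraction(delay_ps, clock_ghz):
    """Fraction of one clock period consumed by a delay."""
    period_ps = 1000.0 / clock_ghz
    return delay_ps / period_ps


def gates_in_block(gates_ref, f_ref_um, f_new_um):
    """Gate count of a maximal synchronous block after scaling,
    assuming block area ~ f^2 and gate density ~ f^-2."""
    k = f_new_um / f_ref_um
    return gates_ref * (k ** 2) * (k ** -2)


print(f"{clock_fraction(26, 10):.0%} of the clock period")
print(f"{gates_in_block(250_000, 0.1, 0.035):,.0f} gates at 0.035 um")
```

The two scaling factors cancel, which is why the gate count of a maximal synchronous block stays at about 250 kgates across process generations under these assumptions.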

  • Christer Svensson, ASYNC 2004 15

    Overview of wire properties

    On-chip global

    Traditional alternative
    • Automatic insertion of repeaters along long wires
    • With wave pipelining, allows >10 Gb/s per wire
    • Delays may exceed one clock cycle

    Utilizing upper thick metal layer
    • Data rate >10 Gb/s
    • Delays close to velocity-of-light, still on the order of one clock cycle

  • Christer Svensson, ASYNC 2004 16

    Overview of wire properties
    Upper wire/driver example

    [Figure: wire cross-section with dimensions 2 µm, 3.5 µm, 12 µm and 4 µm marked]

    • 2 µm x 4 µm copper wire, low loss
    • 12 µm spacing, X-talk
    • Wire length 2 cm
    • Driver: inverter in 0.18 µm CMOS, Wn = 88 µm, Wp = 194 µm, RS = 20 Ω

    [Figure: step responses: actual step response, step response without overdrive, step response when terminated]

  • Christer Svensson, ASYNC 2004 17

    Overview of wire properties
    Upper wire/driver example

    Estimated performance (length 2 cm)

    • Simulated velocity: $10^8$ m/s ($c_0/3$)
    • Simulated maximum data rate: 10 Gb/s
    • Each link is 16 bits wide; 2 links carry 320 Gb/s (bidirectionally)
    • Each pair of links needs 544 µm width
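The figures for this 2 cm link also illustrate the earlier distinction between latency and data rate. The Python sketch below recomputes the flight time, the number of bits in flight on one wire, and the aggregate bandwidth from the quoted numbers; the function and constant names are illustrative.

```python
def flight_time_ps(length_m, velocity_m_s):
    """Time of flight along the wire, in picoseconds."""
    return length_m / velocity_m_s * 1e12


def bits_in_flight(length_m, velocity_m_s, rate_gbps):
    """Number of bits simultaneously travelling on one wire."""
    bit_time_ps = 1000.0 / rate_gbps
    return flight_time_ps(length_m, velocity_m_s) / bit_time_ps


LENGTH_M = 0.02        # 2 cm wire
VELOCITY = 1e8         # m/s, simulated velocity (c0/3)
RATE_GBPS = 10         # per-wire data rate
WIRES_PER_LINK = 16
LINKS = 2

print(f"flight time: {flight_time_ps(LENGTH_M, VELOCITY):.0f} ps")
print(f"bits in flight per wire: {bits_in_flight(LENGTH_M, VELOCITY, RATE_GBPS):.1f}")
print(f"aggregate: {WIRES_PER_LINK * LINKS * RATE_GBPS} Gb/s")
```

The flight time of 200 ps equals two bit times at 10 Gb/s, so each wire carries about two bits at once: the link has a latency of more than one symbol while still sustaining the full data rate.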

  • Christer Svensson, ASYNC 2004 18

    Architectural view of future systems

    [Figure: system with a common clock: chips connected by high-speed board links; each chip contains synchronous blocks connected by on-chip global links]

  • Christer Svensson, ASYNC 2004 19

    Architectural view of future systems


    Challenges

    Allow scaling of clock rates and bandwidths

    Mitigate synchronization and clock skew problems

    Keep an unchanged synchronous design paradigm

  • Christer Svensson, ASYNC 2004 20

    Architectural view of future systems

    Wire delays are inevitable: we must accept latency.

    The latency/delay problem should be managed at two levels

    • System level (predictability)

    • Implementation level (error-free)

  • Christer Svensson, ASYNC 2004 21

    Architectural view of future systems

    System level
    Partition the system into blocks of limited size.
    (Preferably a natural partition: processors, memories, IP blocks, etc.)

    We may define a system where only the order of events is important.
    (“Classical” asynchronous; patient systems (Carloni et al. 1999))
    We may then accept any latency between blocks.

    We may define a system with fixed latency between blocks.
    (If the fixed latency is n clock cycles, the system is synchronous.)
    We may then accept any latency < nTc between blocks.
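The fixed-latency condition (physical latency < nTc) can be expressed as a small check. This is a sketch under simplifying assumptions: it ignores clock skew and any margin needed by the interface hardware, works in integer picoseconds, and uses hypothetical function names and a hypothetical 10 GHz clock.

```python
def min_fixed_latency_cycles(link_latency_ps, clock_period_ps):
    """Smallest integer n with link_latency < n * Tc (strict inequality)."""
    return link_latency_ps // clock_period_ps + 1


def latency_ok(link_latency_ps, n_cycles, clock_period_ps):
    """True if the physical latency fits the fixed system-level latency."""
    return link_latency_ps < n_cycles * clock_period_ps


TC_PS = 100  # hypothetical 10 GHz clock
for latency in (26, 99, 100, 230):
    n = min_fixed_latency_cycles(latency, TC_PS)
    print(latency, "ps ->", n, "cycle(s)", latency_ok(latency, n, TC_PS))
```

Any physical latency below the chosen nTc gives the same system-level behaviour, which is what lets the system model fix n before the actual wire delays are known.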

  • Christer Svensson, ASYNC 2004 22

    Architectural view of future systems
    Implementation level (we must avoid synchronization errors)

    • Use synchronizers with long decision time (extra latency, nonzero error probability)

    • Use stoppable clocks to synchronize communication (classical GALS, Chapiro 1984)

    • Adapt clock phase to data (mesochronous clocks) (Mu 2001)

    • Use FIFOs to isolate clock regions
      (FIFOs initialized with synchronizers, Chakraborty 2001)
      (FIFOs initialized via system reset, Edman 2004)

  • Christer Svensson, ASYNC 2004 23

    Architectural view of future systems
    Implementation level, examples

    [Figure: choice of clock phase (Mu 2001). Labels: data in, data out, metastability detector, Rx clk]

    [Figure: FIFO solution (Chakraborty 2001, Edman 2004). “Circular” FIFO with labels: write pointer, read pointer, Tx clk, Rx clk, data in, data out]

  • Christer Svensson, ASYNC 2004 24

    Synchronous Latency Insensitive Design

    Problem formulation

    Find a method to mitigate wire-induced latencies within a synchronous paradigm

  • Christer Svensson, ASYNC 2004 25

    Synchronous Latency Insensitive Design

    Concept

    [Figure: clock-true model: synchronous blocks on a common clk, connected by communication links with fixed delays (n clk cycles); an arrow labelled “Synthesis” leads to the final design]

    During synthesis we replace the fixed delays with synchronizing ports (elastic FIFOs) that absorb all link latencies and clock skews.

    The final design agrees exactly with the clock-true model, independently of link delays and clock skews.

  • Christer Svensson, ASYNC 2004 26

    Synchronous Latency Insensitive Design

    Design flow

    • System partition: “natural” partition (processors, memories, IP blocks…) into isochronous regions

    • Clock-true model & verification: NEW: insertion of dummy delays between isochronous regions; clock-true verification

    • Synthesis & back-end: replace dummy delays with elastic FIFOs

    • Timing verification: considerably easier; feedback can be avoided

  • Christer Svensson, ASYNC 2004 27

    Synchronous Latency Insensitive Design

    Implementation

    [Figure: example with three blocks and two links on a common clk; each link carries data and strobe. Synchronizing port with labels: data, strobe, input counter, reg, select, output counter, local clock]

    Fixed nominal delay preset in counters

  • Christer Svensson, ASYNC 2004 28

    Synchronous Latency Insensitive Design

    Implementation

    System reset used as initialization mechanism (example n=2)

    [Figure: timing diagram for blocks Tx1, Tx2 and Rx with clk and rst. Traces: reset; clk at root; data at Tx1; data at Rx, written into FIFO(2) by strobe; clk at Rx; FIFO(2), read by Rx clk after 2 counts; data in Rx]

    Note that the relation of data to clk period number is predictable
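The behaviour of the reset-initialized synchronizing port can be illustrated with a small behavioural model in Python. This is a sketch under stated assumptions, not the authors' circuit: the FIFO is ideal (no setup or hold times), it has four slots addressed by 2-bit counters, the input counter advances on each arriving strobe, and the output counter advances on each local Rx clock edge with an offset of n counts. The function name `simulate_link` and all parameter values are hypothetical.

```python
def simulate_link(words, link_delay, rx_skew=0.0, n=2, depth=4, tc=1.0):
    """Return the words Rx reads on its clock edges n, n+1, ...

    Tx launches word k at time k*tc; it reaches the FIFO after
    `link_delay` and is written to slot k % depth (input counter).
    Rx clock edge j occurs at j*tc + rx_skew and reads slot
    (j - n) % depth (output counter offset by n counts).
    """
    events = []
    for k in range(len(words)):
        events.append((k * tc + link_delay, 0, k))       # strobe write
    for j in range(n, n + len(words)):
        events.append((j * tc + rx_skew, 1, j))          # Rx clock read
    events.sort()                                        # process in time order

    fifo = [None] * depth
    received = []
    for _, kind, idx in events:
        if kind == 0:
            fifo[idx % depth] = words[idx]
        else:
            received.append(fifo[(idx - n) % depth])
    return received


words = list("ABCDEFGH")
print(simulate_link(words, link_delay=0.3))
print(simulate_link(words, link_delay=1.7, rx_skew=0.2))
print(simulate_link(words, link_delay=2.5) == words)
```

In this model the first two runs deliver the same sequence, each word exactly n = 2 Rx cycles after it was launched, even though the link delay and clock skew differ. The third run uses a delay larger than nTc, so the first read happens before the word has arrived and the sequence is corrupted.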

  • Christer Svensson, ASYNC 2004 29

    Synchronous Latency Insensitive Design

    Simulation

    [Figure: simulated waveforms, 0 to 60 ns, for blocks Tx1, Tx2 and Rx on a common clk. Traces: Clk, Tx1 out, Rx1 in, Tx2 out, Rx2 in, Rx1 out, Rx2 out, and the counter values below]

    Rx1 in count:  00 01 10 11 00 01 10 11 00 01
    Rx1 out count: 10 11 00 01 10 11 00 01 10 11
    Rx2 in count:  00 01 10 11 00 01 10 11 00 01 10
    Rx2 out count: 10 11 00 01 10 11 00 01 10 11

  • Christer Svensson, ASYNC 2004 30

    Synchronous Latency Insensitive Design
    Implementation example, receiver in 0.18 µm CMOS

    • fc = 2.75 GHz
    • Area ≈ 3500 µm²
    • Data sent over 2 mm wire
    • Latency 2 cycles
    • Rx clk delay 1 cycle

    (SPICE circuit level @ 110 °C)

    [Figure: simulated waveforms. Traces: Rx input, Tx clk, Rx clk, read data, reference data]

  • Christer Svensson, ASYNC 2004 31

    Synchronous Latency Insensitive Design

    New method to ease timing closure in large DSM chips

    • Correct clock-true verification before synthesis

    • Synchronous design paradigm and design tools kept

    • Implementation induced data delays and clock skews mitigated

    • Implementation in standard libraries

    • Full clock alignment between blocks

    • No synchronizers, no risk for metastability

  • Christer Svensson, ASYNC 2004 32

    Multiple clocks

    Can a multiple clock system be synchronous?

    Example – rationally related clocks

    [Figure: clock waveforms fc1 and fc2 = (2/3) fc1, with the annotation “Synchronous to fc1”]
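Why rationally related clocks can still be treated as synchronous is easy to see numerically: their edge pattern repeats with a fixed period. The Python sketch below assumes ideal clocks whose edges are aligned at t = 0 and uses exact rational arithmetic; the function name `coincident_edges` is illustrative.

```python
from fractions import Fraction


def coincident_edges(ratio, n_cycles):
    """Indices k of clk1 edges (0..n_cycles) that line up with a clk2 edge.

    `ratio` is fc2/fc1 as a Fraction. With time measured in clk1 periods,
    clk1 edge k is at t = k and clk2 edge m is at t = m / ratio, so they
    coincide exactly when k * ratio is an integer.
    """
    return [k for k in range(n_cycles + 1) if (k * ratio).denominator == 1]


ratio = Fraction(2, 3)                 # fc2 = (2/3) fc1
print(coincident_edges(ratio, 12))     # edges that coincide
print("pattern period:", ratio.denominator, "clk1 cycles =",
      ratio.numerator, "clk2 cycles")
```

For fc2 = (2/3) fc1 the edges coincide every 3 cycles of clk1, which is every 2 cycles of clk2, so the phase relation between the two clocks is known in every cycle of the repeating pattern.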

  • Christer Svensson, ASYNC 2004 33

    Multiple clocks

    FIFO synchronization can be extended to rationally related clocks
    (FIFO used for mitigation of delays and introduced clock jitter)

    Chakraborty 2003 (our proposal 2004)

    Chakraborty extended his scheme to any clock frequency relation

    [Figure: FIFO with labels: write pointer, read pointer, jitter accepted]

  • Christer Svensson, ASYNC 2004 34

    Conclusions

    • Wire delays are inevitable
    • Wire delays may be limited to velocity-of-light delays
    • Synchronous blocks may include 250 kgates @ 10 GHz clock
    • Delays must be managed at system level and implementation level
    • Our proposed scheme facilitates:
      • synchronous flow from system to implementation
      • clock-true verification before synthesis
      • mitigation of clock skews and data latencies
    • “Synchronous” schemes can be extended to multiple clocks

  • Christer Svensson, ASYNC 2004 35

    References

    F. Anceau, “A Synchronous Approach for Clocking VLSI Systems”, IEEE J. Solid-State Circuits, Vol. 17, pp. 51-56, 1982.

    D. M. Chapiro, “Globally-Asynchronous Locally-Synchronous Systems”, PhD Thesis, Stanford University, Oct. 1984.

    M. Afghahi and C. Svensson, “Performance of Synchronous and Asynchronous Schemes for VLSI Systems”, IEEE Trans. on Computers, Vol. 41, pp. 858-872, 1992.

    D. Sylvester and K. Keutzer, “Getting to the bottom of deep submicron”, IEEE/ACM Int. Conference on Computer Aided Design 1998, Digest of Technical Papers, pp. 203-211, 1998.

    L. P. Carloni, K. L. McMillan, A. Saldanha and A. L. Sangiovanni-Vincentelli, “A Methodology for Correct-by-Construction Latency Insensitive Design”, 1999 IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, pp. 309-315, Nov. 1999.

    F. Mu and C. Svensson, “Self-tested self-synchronization circuit for mesochronous clocking”, IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 48, pp. 129-140, Feb. 2001.

    A. Chakraborty and M. R. Greenstreet, “A Minimal Source-Synchronous Interface”, 15th Annual IEEE International ASIC/SOC Conference, pp. 443-447, Sept. 2002.

    C. Svensson, “Electrical Interconnects Revitalized”, IEEE Trans. on Very Large Scale Integration, Vol. 10, pp. 777-788, Dec. 2002.

    J. Xu and W. Wolf, “A Wave-Pipelined On-chip Interconnect Structure for Network-on-Chips”, Proc. of the 11th Symp. on High Performance Interconnect, pp. 10-14, 2003.

    A. Chakraborty and M. R. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains”, Proceedings of Ninth International Symposium on Asynchronous Circuits and Systems, pp. 78-88, May 2003.

    A. Edman and C. Svensson, “Timing Closure through a Globally Synchronous, Timing Partitioned Design Methodology”, accepted for presentation at the 41st Design Automation Conference, 2004.