Power of Priority Nocs07

  • 8/9/2019 Power of Priority Nocs07

    1 E. Bolotin The Power of Priority, NoCs 2007

    The Power of Priority:

    NoC-based Distributed Cache Coherency

    Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny

    QNoC Research Group

    Technion

    EE Department

    Technion, Haifa, Israel

    Chip Multi-Processor (CMP)

    Dual-Core: monolithic shared cache

    Multi-Core: large cache, shared cache, distributed cache

    [Figure: processors P0-P7 surrounding a distributed L2 cache with banks numbered 0-63]

    NoC-based: How?

    Future Cache - Physics Perspective

    Large cache -> large access time

    [Figure: Global Wires Delay. Gate delay vs. global wire delay (log scale, 0.1 to 100) across technology nodes 250, 180, 130, 90, 65, 45, 32. Source: ITRS 2003]

    [Figure: Fraction of chip reachable in 1 clock cycle. Source: Keckler et al. ISSCC 2003]

    Distance reached in a single cycle:

      Today: ~25% of chip

      In 10 years: ~1% of chip

    A large monolithic cache is not scalable

    NUCA - Non-Uniform Cache Architecture

    NUCA = non-uniform access times

    Banked cache over NoC

      Smaller bank -> smaller access time

      Multiple banks -> multiple ports

      Closer bank -> smaller access time

    Cache-line placement policy

      Static NUCA (SNUCA)

      Dynamic NUCA (DNUCA)

    Sources: Kim et al. ASPLOS 2002; Beckmann et al. MICRO 2004
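
    The static placement policy and the distance-dependent access time can be sketched as follows. This is a minimal illustration rather than the authors' model: the 8x8 bank grid, the 64-byte line, the low-order-bits mapping, the requester sitting at a grid coordinate, and the per-hop and per-bank cycle counts are all assumptions made for the example.

```python
# Hypothetical parameters, chosen for illustration only.
GRID = 8          # assumed 8x8 grid of L2 banks (banks 0..63)
LINE_BYTES = 64   # assumed cache-line size
HOP_CYCLES = 2    # assumed router + link latency per hop
BANK_CYCLES = 4   # assumed access time of one small bank


def snuca_bank(address: int) -> int:
    """Static NUCA: the bank is fixed by low-order bits of the line address."""
    line = address // LINE_BYTES
    return line % (GRID * GRID)


def bank_xy(bank: int) -> tuple[int, int]:
    """Grid coordinates (column, row) of a bank, numbering banks row by row."""
    return bank % GRID, bank // GRID


def access_cycles(requester_xy: tuple[int, int], bank: int) -> int:
    """Round trip: Manhattan hops to the bank and back, plus the bank access."""
    bx, by = bank_xy(bank)
    rx, ry = requester_xy
    hops = abs(bx - rx) + abs(by - ry)
    return 2 * hops * HOP_CYCLES + BANK_CYCLES


if __name__ == "__main__":
    bank = snuca_bank(0x1F40)
    print(bank, access_cycles((0, 0), bank))
```

    Under these assumptions a closer bank costs fewer cycles than a distant one, which is the non-uniformity that a dynamic policy (DNUCA) tries to exploit by moving lines toward their users.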

    Issues in NUCA-based CMP

    [Figure: processors P0-P7 surrounding a distributed L2 cache with banks numbered 0-63]

    NoC performance -> CMP performance

    Cache coherency and transaction order (correctness)

    Search (in DNUCA)

    Different traffic types (e.g. fetch vs. prefetch)

    Synchronization (locks)

    NoC Services for CMP?

    Cache Coherency over NoC

    [Figure: processors P0-P7 surrounding a distributed L2 cache with banks numbered 0-63]

    How do we maintain coherency over NoC?

      Snooping

      Central directory

      Distributed directory

    [Figure: cache bank with distributed directory. Each entry holds a cache line, its status, and a directory vector (D)]
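
    A per-line directory record of the kind shown in the cache-bank figure could look like the sketch below. The three-state encoding, the bit-vector representation of sharers, and all names (`State`, `DirectoryEntry`, `add_sharer`, `set_owner`, `holders`) are assumptions for illustration; the slides only say that each line carries a status and a directory vector.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


@dataclass
class DirectoryEntry:
    """Directory record stored next to one cache line in its L2 bank."""
    state: State = State.INVALID
    sharers: int = 0  # bit i set => processor i holds a copy in its L1

    def add_sharer(self, cpu: int) -> None:
        """Record an additional reader of the line."""
        self.sharers |= 1 << cpu
        self.state = State.SHARED

    def set_owner(self, cpu: int) -> None:
        """Record a single exclusive (modified) holder."""
        self.sharers = 1 << cpu
        self.state = State.MODIFIED

    def holders(self, num_cpus: int = 8) -> list[int]:
        """Processor indices whose bit is set in the sharer vector."""
        return [i for i in range(num_cpus) if (self.sharers >> i) & 1]
```

    Keeping the record in the bank that owns the line means a coherence lookup needs no extra NoC hop beyond reaching that bank.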

    Distributed Cache Coherency

    Example: Simple read transaction

    [Figure: P0 L1 and the L2 directory connected through the NoC]

      1. READ REQ

      2. READ RESP (data transfer)

    Directory state after the transaction: P0-Shared

    Cache access -> multiple NoC transactions

    Legend: Ctrl. packet, Data packet

    Read Transaction of Modified Block

    [Figure, first part: P0 L1, P2 L1 and the L2 directory connected through the NoC]

      1. READ EXCL. REQ

      2. READ RESP (data transfer)

    Directory state: P2-MOD.

    [Figure, second part: same nodes]

      3. READ REQ

      4. WR BACK REQ

      5. WR BACK RESP (data transfer)

      6. READ RESP (data transfer)

    Directory state: P0-SHARED

    Legend: Ctrl. packet, Data packet

    Read Exclusive of Shared Block

    [Figure, first part: P0 L1, P1 L1, P2 L1 and the L2 directory connected through the NoC]

      1. READ REQ (one from each reader)

      2. READ RESP (data transfer)

    Directory state: P1-Shared, P2-Shared

    [Figure, second part: same nodes]

      3. READ EXCL. REQ

      4. INVALID. REQ

      5. INVALID. ACK (one from each sharer)

      6. READ EXCL. RESP (data transfer)

    Directory state: P0-MOD.

    Legend: Ctrl. packet, Data packet
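
    The read-exclusive example can be written out as the list of NoC packets it generates, which makes the ratio of short control packets to long data packets visible. This is a sketch under assumptions: the packet endpoints are inferred from the step names (the figure itself is not in the transcript), and `Packet` and `read_exclusive` are hypothetical names.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    step: int   # step number as used on the slide
    name: str
    src: str
    dst: str
    kind: str   # "ctrl" (short) or "data" (long)


def read_exclusive(requester: str, sharers: list[str]) -> list[Packet]:
    """Packets for a read-exclusive request to a block held Shared by `sharers`."""
    pkts = [Packet(3, "READ EXCL. REQ", requester, "L2", "ctrl")]
    victims = [s for s in sharers if s != requester]
    # The directory invalidates every other sharer and collects the acks.
    pkts += [Packet(4, "INVALID. REQ", "L2", s, "ctrl") for s in victims]
    pkts += [Packet(5, "INVALID. ACK", s, "L2", "ctrl") for s in victims]
    # Only the final response carries the cache line itself.
    pkts.append(Packet(6, "READ EXCL. RESP", "L2", requester, "data"))
    return pkts


if __name__ == "__main__":
    for p in read_exclusive("P0", ["P1", "P2"]):
        print(p)
```

    For two sharers this yields five control packets and a single data packet, all of which must complete before the requester can proceed.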

    Basic NoC to Support CMP

    Off-the-shelf (Vanilla) NoC:

      Grid of wormhole routers

      Unicast only

      Ordering in network -> static routing

      No virtual channels

    Smart interfaces

    [Figure: Vanilla NoC carrying the read-exclusive transaction (steps 3-6) between P0 L1, P1 L1, P2 L1 and the L2 directory; final state P0-MOD.]

    Can We Do Better?

    Observations: L2 Access

    A) Delay = Queueing + NoC transactions

    B) All NoC transactions are equally important

    C) NoC transactions consist of:

      Short ctrl. packets

      Long data packets

    Idea: Differentiate between Ctrl. and Data

    Solution: Preemptive Priority NoC

      Give priority to short ctrl. packets

    [Figure: read-exclusive transaction (steps 3-6) between P0 L1, P1 L1, P2 L1 and the L2 directory; final state P0-MOD.]

    Preemptive Priority NoC: QNoC

    Service Levels (SL):

      Dedicated wormhole buffer

      Preemptive priority scheduling

    [Figure: Multiple SL Router. QNoC router with input ports and output ports, per-port buffers of size BufSize for SL 0 to SL 3, a cross-bar, a scheduler, and credit-based control]

    [Figure: Multiple SL link. One physical link between an output port and an input port, carrying SL 0 to SL 3]
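
    Preemptive priority scheduling over per-service-level buffers can be sketched as below. This is a simplified model and not the QNoC router itself: it covers a single output link, ignores credit-based flow control and routing, and assumes that SL 0 is the highest priority. `PriorityLink` and `run` are hypothetical names.

```python
from collections import deque


class PriorityLink:
    """One output link with a dedicated FIFO per service level (SL)."""

    def __init__(self, num_sl: int = 4) -> None:
        self.queues = [deque() for _ in range(num_sl)]

    def enqueue(self, sl: int, packet_id: str, flits: int) -> None:
        # A packet (worm) is modelled as `flits` identical flit tokens.
        self.queues[sl].extend([packet_id] * flits)

    def tick(self):
        """Send one flit per cycle from the highest-priority non-empty SL."""
        for q in self.queues:  # SL 0 first: assumed to be the highest priority
            if q:
                return q.popleft()
        return None  # link idle this cycle


def run(link: PriorityLink,
        arrivals: dict[int, list[tuple[int, str, int]]],
        cycles: int) -> list:
    """Feed (sl, packet_id, flits) arrivals per cycle; return the flit trace."""
    trace = []
    for t in range(cycles):
        for sl, pid, flits in arrivals.get(t, []):
            link.enqueue(sl, pid, flits)
        trace.append(link.tick())
    return trace


if __name__ == "__main__":
    # A 4-flit data packet "D" starts at cycle 0; a 1-flit ctrl packet "C"
    # arrives at cycle 2 and preempts it at flit granularity.
    print(run(PriorityLink(), {0: [(3, "D", 4)], 2: [(0, "C", 1)]}, 6))
```

    Because each service level has its own buffer, the long worm is paused rather than dropped, and it resumes as soon as no higher-priority flit is waiting.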

    Example: Vanilla NoC

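    Without contention:

      X: delay of long packet

      (symbol lost in extraction): delay of short packet

    Transaction 1: Long Data

    Transaction 2: Short Req., Long Resp.

    [Figure: Vanilla NoC example, both transactions crossing the link between A and B]

    Blue delay ~ X

    Red delay ~ 2X + (short-packet delay)

    Average delay ~ 1.5X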

    Example: Priority NoC

    Without contention:

      X: delay of long packet

      (symbol lost in extraction): delay of short packet

    Transaction 1: Long Data

    Transaction 2: Short Req., Long Resp.

    Vanilla NoC example (link between A and B):

      Blue delay = X

      Red delay = 2X + (short-packet delay)

      Average delay ~ 1.5X

    Priority NoC example:

      Blue delay = X + (short-packet delay)

      Red delay = X + (short-packet delay)

      Average delay ~ X

    Potential delay reduction ~ 0.5X
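
    The arithmetic behind the two examples can be checked with a few lines of code. The sketch assumes that the blue delay belongs to Transaction 1 (the long data packet, sent first) and the red delay to Transaction 2 (short request followed by long response); that mapping is an inference, since the figure is not in the transcript. `eps` is a stand-in name for the short-packet delay, whose original symbol was lost.

```python
def vanilla_delays(x: float, eps: float) -> tuple[float, float]:
    """(Transaction 1, Transaction 2) delays when the link serves packets in order."""
    data_only = x            # long data packet goes first, unobstructed
    req_resp = x + eps + x   # request waits X, takes eps, then the response takes X
    return data_only, req_resp


def priority_delays(x: float, eps: float) -> tuple[float, float]:
    """(Transaction 1, Transaction 2) delays when short packets preempt long ones."""
    data_only = x + eps      # long packet is paused for eps by the short request
    req_resp = eps + x       # request passes at once, then the response takes X
    return data_only, req_resp


def average(delays: tuple[float, float]) -> float:
    return sum(delays) / len(delays)


if __name__ == "__main__":
    x, eps = 100.0, 2.0      # illustrative values, with eps much smaller than X
    print(average(vanilla_delays(x, eps)), average(priority_delays(x, eps)))
```

    With a short-packet delay that is small relative to X, the averages approach 1.5X and X, matching the ~0.5X reduction stated on the slide.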

    Priority NoC: Different Destinations

    Very important in wormhole routing: when a ctrl. packet is blocked by other worms

    [Figure: a Short Req. blocked behind Long Data heading to a different destination]

    Protocol Correctness

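    [Figure: P0 L1, P1 L1 and the L2 directory]

      1. Read Req.

      2. Read Resp.

      3. Read Excl. Req.

      4. Invalidation Req.

    Legend: High Priority (ctrl.), Low Priority (data)

    With priorities, a high-priority Invalidation Req. (4) can overtake the low-priority Read Resp. (2) on its way to the same L1.

    Need state-preserving serialization of transactions in the processor interface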
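
    One possible way to serialize transactions at the processor interface is to defer an invalidation that arrives while a read for the same line is still outstanding, and apply it only after the data has arrived. The slides do not specify the mechanism, so the policy below and the names `L1Interface`, `send_read`, `on_invalidate` and `on_read_resp` are assumptions for illustration only.

```python
class L1Interface:
    """Processor-side NoC interface that serializes events per cache line."""

    def __init__(self) -> None:
        self.state: dict[int, str] = {}       # line -> "I" or "S"
        self.pending_read: set[int] = set()   # lines with a read outstanding
        self.deferred_inv: set[int] = set()   # invalidations that overtook data

    def send_read(self, line: int) -> None:
        """Record that a Read Req. for `line` has been sent."""
        self.pending_read.add(line)

    def on_invalidate(self, line: int) -> bool:
        """Return True if the line was invalidated now, False if deferred."""
        if line in self.pending_read:
            self.deferred_inv.add(line)       # remember it, keep protocol state
            return False
        self.state[line] = "I"
        return True

    def on_read_resp(self, line: int) -> bool:
        """Install the data; return True if a deferred invalidation followed."""
        self.pending_read.discard(line)
        self.state[line] = "S"
        if line in self.deferred_inv:
            self.deferred_inv.discard(line)
            self.state[line] = "I"            # apply in the directory's order
            return True
        return False
```

    The effect is that the L1 observes the read response and the invalidation in the order in which the directory issued them, even though the network delivered them in the opposite order.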

    Numerical Evaluation

    CMP simulator (SIMICS)

    Simulate parallel benchmarks

    Obtain L2-cache access traces

    QNoC simulator (OPNET)

    Simulate distributed coherence protocol over NoC

    Measure total RD/RX L2-access delay

    Measure total program throughput

    [Figure: processors P0-P7 surrounding a distributed L2 cache with banks numbered 0-63]

    Priority NoC: Results

    Short ctrl. packet gets high priority; long data packet gets low priority

    Delay Reduction vs. Network Load

    [Figure: RD Delay - Apache. Av. Delay of L2-Read in Apache, Delay [cycles] vs. Link Capacity [Gbps] (1, 4, 16), Vanilla NoC vs. Priority-based NoC. Bar labels as extracted: 234, 5762 (apparently two fused labels), 286, 1301, 994]

    [Figure: RD/RX Delay Reduction - Apache. Av. Delay Reduction of L2-Transaction in Apache, Delay Reduction [%] vs. Link Capacity [Gbps] (1, 4, 16), Read vs. Read Exclusive]

    Priority NoC: Several Benchmarks

    Delay Reduction

    [Figure: L2 Access Delay Reduction by Priority-based NoC. Delay Reduction [%] for apache, zeus, fft, ocean, radix; series Read and Read Exclusive. Bar labels in extraction order: 22.6, 31.8, 19.6, 28.4, 13.5, 25.3, 18.3, 32.9, 22.3, 28.0]

    Program Speedup

    [Figure: Total Program Speedup by Priority-based NoC. Speedup [%] for apache, zeus, fft, ocean, radix. Bar labels in extraction order: 9.4, 8.7, 9.0, 8.6, 5.0]

    So Far: The Power of Priority

    Simplicity - Almost for Free

    Significant CMP Speed-up

    Good For:

    Coherency

    Traffic differentiation (e.g. Fetch vs. Pre-Fetch)

    Search in DNUCA

    Synchronization (Locks)

    [Figure: processors P0-P7 surrounding a distributed L2 cache with banks numbered 0-63]

    Advanced Support Functions

    Special Broadcast for Short Messages

      Broadcast service (e.g. search in DNUCA)

      Wormhole broadcast: slow and expensive

      Store-and-forward (S&F) broadcast embedded in wormhole

    Virtual Ring

      No Additional Cost

      For Invalidation Multicast

      Snooping or synchronization

    [Figure: broadcast from a source node S through replicating and forwarding routers, over the CMP with processors P0-P7 and L2 banks 0-63]
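
    To make the virtual-ring idea concrete, the sketch below builds one closed path that visits every node of an 8x8 mesh exactly once using only neighbor-to-neighbor links, so a short message forwarded along it reaches all nodes. The slides do not say how the ring is laid out; this particular construction and the name `virtual_ring` are assumptions for illustration.

```python
def virtual_ring(n: int = 8) -> list[tuple[int, int]]:
    """Hamiltonian cycle over an n x n mesh (n even), as (row, col) nodes."""
    if n < 2 or n % 2:
        raise ValueError("n must be even and >= 2")
    ring = [(0, c) for c in range(n)]              # row 0, left to right
    for r in range(1, n):                          # snake through cols 1..n-1
        cols = range(n - 1, 0, -1) if r % 2 else range(1, n)
        ring += [(r, c) for c in cols]
    ring += [(r, 0) for r in range(n - 1, 0, -1)]  # return up column 0
    return ring


def is_ring(path: list[tuple[int, int]]) -> bool:
    """True if consecutive nodes (including last -> first) are mesh neighbors."""
    pairs = zip(path, path[1:] + path[:1])
    return all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in pairs)


if __name__ == "__main__":
    ring = virtual_ring(8)
    print(len(ring), is_ring(ring))
```

    Since the ring reuses existing mesh links, it needs no extra wiring, which is consistent with the "No Additional Cost" point above.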

    Summary

    NoC at CMP Service!

    Shared cache over NoC

    Priority is powerful

    Built-in support functions