5/8/2015 slide 1 PCOD: MIMD II Lecture (Coherence) Per Stenström (c) 2008, Sally A. McKee (c) 2011...

Snoop-based multiprocessor design

Correctness issues
  semantic model: coherence and memory consistency
  deadlock, livelock, and starvation

Design issues, from simplistic to realistic, one by one:
  Single-level cache and an atomic bus
  Multi-level cache design issues
  Split-transaction bus design issues

Scalable snoop-based design techniques

More Architectural Support for MIMD

Key goals

Correctness
Design simplicity (verification is costly)
High performance

Design simplicity and performance are often at odds

Get picture of bus-based coherence organization, dual tags, processor-side and bus-side controllers

Correctness Requirements

Semantic model: contract between HW/SW
  cache coherence -> write serialization
  sequential consistency -> program order, write atomicity

Deadlock: no forward progress and no system activity
  resources being held in a cyclic relationship

Livelock: no forward progress but system activity
  allocation/de-allocation of resources with no progress

Starvation: some processes are denied service
  often temporary

Single-Level Cache and Atomic Bus

Single-level caches and an atomic bus
  Tag and cache controller design issues
  Snoop protocol design
  Race conditions: non-atomic state transitions

Correctness issues
  serialization
  deadlock, livelock, and starvation

Atomic (synchronization) operations

Cache Controller Design

Extension for snoop support: bus requests also access cache
  processor-side controller
  bus-side controller

Recall actions on a cache access:

1. Indexing cache with tag check

2. Get/request data

3. Update state bits

Performance issue:

Simultaneous tag accesses from processor and bus

Solution:

Duplicate tags but keep them consistent

[Figure: cache with cached data and two tag arrays (Tags, Tags), accessed by processor requests and bus requests]
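The duplicate-tag idea can be sketched as follows. This is a minimal illustration, not the design of any particular machine: the class name `DualTagCache`, its method names, and the use of MSI state letters with "I" as the default are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DualTagCache:
    """Toy cache whose tag/state array is duplicated for the bus-side controller."""
    proc_tags: dict = field(default_factory=dict)  # block address -> state (processor side)
    bus_tags: dict = field(default_factory=dict)   # duplicate copy (bus side)

    def set_state(self, addr, state):
        # Every state change is applied to both copies to keep them consistent.
        self.proc_tags[addr] = state
        self.bus_tags[addr] = state

    def processor_lookup(self, addr):
        # Processor requests read only the processor-side copy.
        return self.proc_tags.get(addr, "I")

    def snoop_lookup(self, addr):
        # Snoops read only the duplicate, so they do not compete with the processor.
        return self.bus_tags.get(addr, "I")

    def consistent(self):
        return self.proc_tags == self.bus_tags
```

Lookups from the two sides never touch the same array; only `set_state` has to update both, which is where a real controller would need to arbitrate.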

Reporting Snoop Results

Where to read (memory or cache) and what state transition to make?
  support wired-and/or bus lines

When is the snoop result available? (main alternatives)
  synchronous: requires dual tags and must adapt to worst-case because of updates of state bits caused by processor
  asynchronous (variable delay snoop): assume minimum delay but add enough cycles if necessary
  memory state bit to distinguish between valid/invalid memory block
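As a rough sketch of how wired-OR lines can answer both questions at once, assume each snooping cache drives two hypothetical lines, "shared" and "dirty", and that the requester follows a MESI-style rule on a read miss. The line names, the function name, and the choice of MESI are assumptions for illustration; the slide only says that wired-and/or lines are used.

```python
def combine_snoop_results(reports):
    """reports: list of (shared, dirty) booleans, one pair per snooping cache."""
    shared_line = any(shared for shared, _ in reports)  # wired-OR of all "shared" outputs
    dirty_line = any(dirty for _, dirty in reports)     # wired-OR of all "dirty" outputs
    # Where to read: a dirty copy in some cache overrides memory.
    source = "cache" if dirty_line else "memory"
    # Which transition to make on a read miss (MESI-style assumption).
    new_state = "S" if (shared_line or dirty_line) else "E"
    return source, new_state
```

For example, `combine_snoop_results([(False, True), (False, False)])` selects the cache as the data source because one snooper raised the dirty line.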

Dealing with Write-backs

One would like to service miss before writing back the replaced block

Two implications:
  Add a write-back buffer
  Bus snoops must also look into write-back buffer
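The two implications can be sketched together. The class below is a toy model under stated assumptions: the names (`CacheWithWriteBackBuffer`, `replace`, `snoop_read`, `drain`) are hypothetical, "M" marks a modified block, and memory is a plain dictionary.

```python
class CacheWithWriteBackBuffer:
    def __init__(self):
        self.lines = {}      # block address -> (state, data)
        self.wb_buffer = {}  # block address -> dirty data awaiting write-back

    def replace(self, victim_addr):
        """Evict a block; dirty data moves to the buffer so the miss can be serviced first."""
        state, data = self.lines.pop(victim_addr)
        if state == "M":
            self.wb_buffer[victim_addr] = data

    def snoop_read(self, addr):
        """Return the data this node must supply for a bus read, or None."""
        if addr in self.lines and self.lines[addr][0] == "M":
            return self.lines[addr][1]
        # A buffered block is still the only up-to-date copy, so snoops must see it.
        return self.wb_buffer.get(addr)

    def drain(self, memory):
        """Perform the delayed write-backs once the bus is available."""
        memory.update(self.wb_buffer)
        self.wb_buffer.clear()
```

Without the lookup in `snoop_read`, a bus read arriving between `replace` and `drain` would be answered with stale data from memory.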

Baseline Architecture

[Figure: baseline architecture, including the write-back buffer]

State Transitions Must Appear Atomic

Assume a block is in shared state in both caches, and both Cache 1 and Cache 2 want to issue an Upgrade (Upgr)

1. Cache 1 awaits use of bus

2. Cache 2 gets access to bus

3. Upgrade from Cache 2 updates state of Cache 1 to invalid

4. Upgrade from Cache 1 is performed. However, an Upgrade is no longer appropriate, since Cache 1's copy has been invalidated

Non-Atomic State Transitions

Time window between issuing and performing of a bus operation
Problem: another transaction may change action
Solution: extend with non-atomic state
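One way to picture the extra state is a controller that records a transient state while its request waits for the bus. The sketch below is an assumption-laden toy: the transient state names "S->M" and "I->M", the class `Controller`, and its methods are invented for illustration, and data transfer is ignored entirely.

```python
class Controller:
    def __init__(self, state="S"):
        self.state = state    # stable states: "M", "S", "I"
        self.pending = None   # bus request waiting for the bus, if any

    def processor_write(self):
        if self.state == "S":
            self.state, self.pending = "S->M", "Upgr"   # transient state
        elif self.state == "I":
            self.state, self.pending = "I->M", "ReadX"  # transient state

    def snoop_invalidate(self):
        """Another cache's Upgr or ReadX for this block appears on the bus."""
        if self.state == "S->M":
            # Our copy is gone, so the queued Upgr must become a ReadX.
            self.state, self.pending = "I->M", "ReadX"
        elif self.state in ("S", "M"):
            self.state = "I"

    def bus_granted(self):
        """The queued request goes on the bus; the transition completes."""
        request, self.pending = self.pending, None
        self.state = "M"
        return request
```

With two such controllers both starting in "S", the one that loses arbitration ends up issuing a ReadX rather than the Upgr it originally queued.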

Correctness Issues

Write serialization: ownership acquisition and cache block modification should appear atomic
  processor may not write data into cache until read-exclusive request is on bus, at which point the write is committed

Deadlock: Two cache controllers may be in a circular dependence relation if one is locking the cache while waiting for the bus (fetch deadlock)

Livelock: If several controllers issue read-exclusive requests for same block at the same time
  Let each one complete before taking care of next

Starvation: Bus arbitration is unfair to some nodes

A Fetch-Deadlock Situation

[Figure: Cache 1 and Cache 2, holding blocks A and B, with pending requests ReadX B and BusRd A]

1. Await use of bus, but Cache 1 is locked

2. Cache 2 gets access to bus

3. Cache 2 waits for Cache 1 to respond and Cache 1 waits for Cache 2 to release the bus

Deadlock!
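The cyclic dependence in step 3 can be checked mechanically with a wait-for relation. This is only a sketch: the function name `has_deadlock` and the dictionary encoding (each agent maps to the one agent it is blocked on, or to None) are assumptions chosen for the example.

```python
def has_deadlock(waits_for):
    """waits_for maps each agent to the agent it is blocked on (or None)."""
    for start in waits_for:
        seen = set()
        node = start
        # Follow the chain of dependences until it ends or revisits an agent.
        while node is not None and node not in seen:
            seen.add(node)
            node = waits_for.get(node)
        if node is not None and node == start:
            return True   # the chain returned to where it began: a cycle
    return False


# Cache 1 is locked while waiting for the bus; Cache 2 holds the bus and
# waits for Cache 1's snoop response.
locked = {"cache1": "cache2", "cache2": "cache1"}

# If Cache 1 keeps answering snoops while it waits, Cache 2 is not blocked.
servicing = {"cache1": "cache2", "cache2": None}
```

Here `has_deadlock(locked)` is true and `has_deadlock(servicing)` is false, which is the effect of letting a controller service incoming snoops while it waits for the bus.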

A Livelock Situation

A read exclusive operation involves:
  1. Acquisition of an exclusive block
  2. Reattempting the write in the local cache

Cache 1 and Cache 2 both issue ReadX A:

1. Try to get bus

2. Make Cache 1’s copy invalid

3. Make Cache 2’s copy invalid

Etc. ... Livelock!

Remedies to Correctness Issues

Do not update cache until Upgrade is on bus
Service incoming snoops while waiting for bus
Complete the transaction with no interruption

[Figure: Cache 1 and Cache 2, each issuing an Upgr]

Implementation of Atomic Memory Operations

Test&set should result in atomic read-modify-write
  Cacheable test&set vs memory-based implementation
    lower latency & bandwidth for spinning and self-acquisition
    longer time to transfer lock to other node
    memory-based requires bus to be locked down

Load-linked (LL) and store-conditional (SC) implementation
  Lock flag and lock address register at each processor
  LL reads block, sets lock flag, puts block address in register
  Incoming invalidates checked against address: if match, reset flag
  SC checks lock flag as indicator of intervening conflicting write: if reset, fail; if not, succeed
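The LL/SC mechanism described above can be sketched in software. This is a simplified model, not a hardware description: the class `Processor` and its method names are hypothetical, memory is a shared dictionary, each address stands for a whole block, and a successful SC is assumed to invalidate the other processors' copies directly.

```python
class Processor:
    def __init__(self, memory):
        self.memory = memory     # shared dict: block address -> value
        self.lock_flag = False
        self.lock_addr = None    # lock address register

    def load_linked(self, addr):
        # LL reads the block, sets the lock flag, and records the address.
        self.lock_flag, self.lock_addr = True, addr
        return self.memory[addr]

    def snoop_invalidate(self, addr):
        # An incoming invalidate that matches the register resets the flag.
        if self.lock_flag and addr == self.lock_addr:
            self.lock_flag = False

    def store_conditional(self, addr, value, others):
        # A reset flag signals an intervening conflicting write: fail.
        if not (self.lock_flag and addr == self.lock_addr):
            return False
        self.memory[addr] = value
        self.lock_flag = False
        for p in others:         # the write invalidates other copies
            p.snoop_invalidate(addr)
        return True
```

If two processors both execute `load_linked` on the same lock and then both try `store_conditional`, only the first succeeds; the second finds its flag reset and must retry from the LL.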

Multi-Level Cache Designs

Coherence needs to be extended across L1 and L2
  L1 on-chip. Snoop support in L1 expensive

Is snoop support needed in L1?

[Figure: hierarchy of P, L1, L2, and M]

Definition: L1 included in L2 iff all blocks in L1 also in L2

If inclusion maintained then snoop support only needed at L2 (must be able to invalidate blocks in L1)

Consequence: a block in owned state in L1 (M in MSI) must be marked modified in L2

Maintaining Inclusion

Violations to the inclusion property:
  Set-associative L1 with history-based replacement algorithm
  Split I- and D-caches at L1 and unified at L2
  Different cache block sizes in L1 and L2

Techniques to maintain inclusion:

Direct-mapped L1 and L2 with any associativity given some additional constraints for block size, fetch policy, …

Note: One can always displace a block in L1 on replacement in L2 to maintain inclusion
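The closing note can be sketched as a small two-level model. The assumptions are loud ones: both levels are fully associative with LRU replacement, L1 hits are invisible to L2 (so L2's replacement history differs from L1's), and the class and method names are invented for this example.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    def __init__(self, l1_blocks, l2_blocks):
        self.l1 = OrderedDict()   # block address -> True, oldest first (LRU order)
        self.l2 = OrderedDict()
        self.l1_blocks, self.l2_blocks = l1_blocks, l2_blocks

    def access(self, addr):
        if addr in self.l1:                   # L1 hit: L2 does not see the access
            self.l1.move_to_end(addr)
            return
        if addr in self.l2:
            self.l2.move_to_end(addr)
        else:
            if len(self.l2) >= self.l2_blocks:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)     # displace the victim from L1 as well
            self.l2[addr] = True
        if len(self.l1) >= self.l1_blocks:
            self.l1.popitem(last=False)
        self.l1[addr] = True

    def inclusion_holds(self):
        return all(addr in self.l2 for addr in self.l1)
```

With `InclusiveHierarchy(2, 3)` and the access sequence 1, 2, 1, 3, 1, 4, block 1 is the most recently used block in L1 but the least recently used in L2, so the final L2 replacement also displaces it from L1. Removing the `self.l1.pop(victim, None)` line would leave block 1 in L1 only, violating inclusion.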

Split Transaction Buses

Challenging issues:
  Avoid conflicting requests in progress simultaneously
  Buffers needed => flow control
  Correctness issues (coherence, SC, deadlock, livelock,...)

Separate request-response phases improve bus utilization

[Figure: bus timeline with bus arbitration, Address/CMD phases, memory access delays, and Data phases of several transactions]

Example of Conflict Situation

With atomic bus, Upgrade is committed when bus is granted
Here, two Upgrades can be on bus and may invalidate both copies

[Figure: Cache 1 and Cache 2, each issuing an Upgr]

Some real examples

Details can be interesting
Supports historical emphasis of the course
SGI Power Challenge

SGI Challenge 1(4)

High-level design decisions
  Avoid conflicts: Allow a fixed number of requests to different blocks in progress at a time
  Flow-control: Limited buffers, so NACK when full and retry
  Ordering: Allow out-of-order responses (to cope with non-uniform delays)

[Figure: bus timing over time for read operation 1 and read operation 2; each bus cycle has Arb, Rslv, Addr, Dcd, and Ack phases; rows for address bus (Addr req, Addr, Addr ack), data arbitration (Data req, Tag check, Grant), and data bus (Tag, D0 to D3)]

SGI Challenge 2(4)

Separate request-response buses
  Request phase: (use address request bus)
    present the address and initiate snooping
    report snoop result (prolong or nack if necessary)
  Response phase: (use data request bus)
    send data back

[Figure: bus timing over time for read operation 1 and read operation 2; each bus cycle has Arb, Rslv, Addr, Dcd, and Ack phases; rows for address bus (Addr req, Addr, Addr ack), data arbitration (Data req, Tag check, Grant), and data bus (Tag, D0 to D3)]

Design of SGI Challenge 3(4)

Max 8 outstanding requests
  3-bit tag to separate requests
Request table in each node to keep track of outstanding requests
Writes are committed when request is granted
Flow control: NACK and retry when buffers are full

[Figure: bus interface of a node, showing the request table (entries 0 to 7 with tag, address, originator, my response, and miscellaneous information), request buffer, request + response queue, response queue, issue + merge check, comparator, snoop state from $, data buffer, write-back buffer, write-backs, responses, data to/from $, the Addr + cmd bus, and the Data + tag bus]

Conflict resolution
  Before address request is done, request table is checked
  Memory and caches check request independently
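A request table with eight entries and 3-bit tags can be sketched as below. This is a toy model under assumptions: the class name `RequestTable` and its methods are hypothetical, an entry holds only the block address, and both a conflict and a full table are reported the same way (the caller must retry later).

```python
class RequestTable:
    SIZE = 8   # max outstanding requests, so a tag fits in 3 bits

    def __init__(self):
        self.entries = [None] * self.SIZE   # index is the tag; value is the block address

    def issue(self, block_addr):
        """Return the assigned tag, or None if the request must be retried later."""
        if block_addr in self.entries:
            return None   # conflict: this block already has a request in progress
        for tag, entry in enumerate(self.entries):
            if entry is None:
                self.entries[tag] = block_addr
                return tag
        return None       # table full

    def complete(self, tag):
        """A response carrying this tag arrived; free the entry and return its block."""
        block_addr, self.entries[tag] = self.entries[tag], None
        return block_addr
```

Because a response is matched to its request by tag rather than by arrival order, entries can be freed in any order, which is what allows out-of-order responses.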

Serialization and SC 4(4)

Serialization to a single location guaranteed
  1. Only a single request to each block allowed
  2. Request committed when request on bus

Problems to guarantee SC:
  requires serialization across writes to different locations
  requests can be reordered in buffers so being committed is not same as performed

A solution:
  Servicing incoming requests before processor’s own requests guarantees write atomicity

Multiple Outstanding Processor Requests

Modern processors allow multiple outstanding memory operations
Problem: may violate sequential consistency
Solution:
  Buffer all outstanding requests
  Don’t make writes visible to any other processor until committed
  Don’t perform reads before previously issued requests are committed

Lockup-free caches implement the buffering capability to enforce ordering of uncommitted memory operations

Commercial Machines

SGI Challenge: 36 MIPS R8000 processors with a 1.2 GB/s bus

Peak: 5.4 GFLOPS

[Figure: (b) Machine organization: R4400 CPUs and caches, interleaved memory (16 GB maximum), and I/O subsystem (VME-64, SCSI-2, Graphics, HPPI) on the Powerpath-2 bus (256 data, 40 address, 47.6 MHz)]

Sun Enterprise 6000: 30 UltraSparc processors with 2.67 GB/s bus

Peak: 9 GFLOPS

[Figure: CPU/Mem cards (two processors P with $ and $2, mem ctrl, bus interface / switch) and I/O cards (bus interface) on the Gigaplane™ bus (256 data, 41 address, 83 MHz)]

Look these up on the net
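As a quick sanity check on the bus figures, the raw data bandwidth is the data width in bytes times the clock rate. The helper name below is hypothetical, and GB is taken as 10**9 bytes, which is an assumption about how the slide's numbers were stated.

```python
def raw_bus_bandwidth_gb_per_s(data_bits, clock_mhz):
    """Bytes per cycle times cycles per second, in GB/s (10**9 bytes)."""
    return (data_bits / 8) * clock_mhz * 1e6 / 1e9


gigaplane = raw_bus_bandwidth_gb_per_s(256, 83)      # about 2.66 GB/s
powerpath2 = raw_bus_bandwidth_gb_per_s(256, 47.6)   # about 1.52 GB/s
```

The Gigaplane result is close to the quoted 2.67 GB/s. The Powerpath-2 result is above the quoted 1.2 GB/s; one possible reading is that 1.2 GB/s is a sustained rather than raw rate, but that is an assumption the slide does not confirm.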