Date posted: 22-Dec-2015
Architecture and Design of AlphaServer GS320
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren
ASPLOS’2000
Presented By: Alok Garg
Motivations
• Coherence Protocol
– Bandwidth limitations of snoopy-based protocols
– Inefficiencies in directory protocols
– Correctness issues related to rare protocol races
• Implementation of Consistency Models
– Burdens the common transaction flow
Paper Contributions
• Exploiting network ordering to simplify cache coherence protocol
• Solutions to decrease network occupancy
• Elegant solution for deadlock, livelock, starvation, and fairness problems
• Techniques for efficiently supporting memory ordering
Overview
• Architecture Overview
• MOESI Cache Coherence Protocol
• GS320 optimized Cache Coherence Protocol
• Alpha Consistency Model
• Consistency Model Implementation
• Performance
Block Diagram
[Diagram: eight QBBs connected through an 8x8 Global Crossbar Switch; link label: 1.6 GB/s]
QBB – Quad-Processor Building Block
Quad-Processor Building Block (QBB)
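[Diagram: components attached to a 10-Port Local Crossbar Switch; link labels: 1.6 GB/s and 3.2 GB/s]
• Four processors (P), each with an L2 cache
• SDRAM Memory: 8 GB, 64-bit, 200 MHz
• 64-entry Cache
• I/O: 4 PCI buses, 64-bit, 33 MHz
• Global Port
• DTAG: Duplicate Tag Store
• DIR: Directory
• TTT: Transactions In Transit Table
• Arbitration Point
• 32 Alpha 21264 processors in the full system (8 QBBs x 4 processors)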
The Directory
• 14 bits per 64-byte memory line: a 6-bit Owner field plus sharer bits S0 to S7
• Example entry shown: Owner = 0
[Diagram labels: Forward, QBB0 DTAG, QBB3 (P0 P1 P2 P3), Invalidate, Invalidate]
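The 14-bit directory entry can be sketched as a packed integer. This is a minimal illustration, assuming the 6-bit owner field sits above an 8-bit sharer vector with one bit per QBB (S0 to S7); the bit layout and the names `pack_entry` and `unpack_entry` are hypothetical and not taken from the paper.

```python
OWNER_BITS = 6    # owner field width, as stated on the slide
SHARER_BITS = 8   # assumption: one sharer bit per QBB (S0..S7)


def pack_entry(owner, sharer_qbbs):
    """Pack an owner id and a collection of sharer QBB indices into 14 bits."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner does not fit in 6 bits")
    vector = 0
    for qbb in sharer_qbbs:
        if not 0 <= qbb < SHARER_BITS:
            raise ValueError("QBB index out of range")
        vector |= 1 << qbb
    # Assumed layout: owner in the high 6 bits, sharer vector in the low 8.
    return (owner << SHARER_BITS) | vector


def unpack_entry(entry):
    """Return (owner, sorted list of sharer QBB indices)."""
    owner = entry >> SHARER_BITS
    vector = entry & ((1 << SHARER_BITS) - 1)
    sharers = [q for q in range(SHARER_BITS) if vector & (1 << q)]
    return owner, sharers


entry = pack_entry(owner=0, sharer_qbbs=[3])
print(entry < (1 << 14), unpack_entry(entry))
```

Tracking sharers per QBB rather than per processor is what keeps the entry this small; an invalidate sent to a QBB would then be fanned out locally, for example with help from that QBB's DTAG.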
Crossbar Switch
Network bisection bandwidth:
• Global Switch (8x8): 12.8 GB/s
• Local Switch (10-port): 6.4 GB/s
MOESI - Directory
• States
– Invalid (I): No valid copy
– Shared (S): Valid, (potentially) shared, clean
– Exclusive (E): Valid, exclusive, clean
– Modified (M): Valid, exclusive, (potentially) dirty
– Owner (O): Valid, (potentially) shared, (potentially) dirty; responsible for supplying data instead of memory (potentially)
• Request Messages
– Read (Rd): Data needed in shared state (S/E)
– Read Exclusive (RE): Data needed in Modified state (M)
– Exclusive (Ex): Ownership needed in Modified state (M); the requester already holds the data
• Home node
– Original owner of data (directory)
MOESI Read-Exclusive
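[Diagram: home/directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/O, N3/I, N4/I, N5/S]
• Messages shown: RE, Forward, Marker, Invalidate, Ack, Reply
• State changes shown: N5/I, N2/I, and finally N3/E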
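The message fan-out for a Read-Exclusive at the home node can be sketched as follows. This is a simplified illustration, assuming the home sends a Forward to the current owner, a Marker to the requester, and an Invalidate to each sharer, and that it updates the directory immediately; the dictionary layout and the name `handle_read_exclusive` are hypothetical.

```python
def handle_read_exclusive(entry, requester):
    """Return the messages the home sends for an RE and update the directory.

    entry: dict with 'owner' (node name, or None when memory owns the line)
           and 'sharers' (set of node names). Hypothetical layout.
    """
    messages = []
    if entry["owner"] is not None and entry["owner"] != requester:
        # A remote owner must supply the data; the requester gets a marker.
        messages.append(("Forward", entry["owner"]))
        messages.append(("Marker", requester))
    else:
        # Assumption: the home memory supplies the data directly.
        messages.append(("Reply", requester))
    for node in sorted(entry["sharers"] - {requester}):
        messages.append(("Invalidate", node))
    # The directory changes at once: the requester is the new owner.
    entry["owner"] = requester
    entry["sharers"] = set()
    return messages


line = {"owner": "N2", "sharers": {"N5"}}
print(handle_read_exclusive(line, "N3"))
print(line)
```

With the diagram's initial states (N2 as owner, N5 as sharer, N3 as requester), the sketch produces a Forward to N2, a Marker to N3, and an Invalidate to N5.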
GS320 Optimized Cache Coherence Protocol
• Dirty Sharing
• No negative acknowledgment
– 3 Deadlock Conditions due to races
Late Request Race Condition
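[Diagram: home/directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/M, N3/I, N4/I, N5/I]
• Messages shown: Rd, Forward, Marker, Write Back, Ack, Reply
• N2 writes the line back (N2/X) while a Forward to it is in flight; the Write Buffer still holds the data
• Final state shown: N5/S
• DEADLOCK?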
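One way to handle the late request without a negative acknowledgment is sketched below. It assumes, as the Write Buffer and Ack labels in the diagram suggest, that the old owner keeps a valid copy of a written-back line until the home acknowledges the write back, so a late Forward can still be serviced. The class and method names are hypothetical.

```python
class OwnerNode:
    """Toy model of an owner that has a write back in flight."""

    def __init__(self):
        self.cache = {}         # address -> data
        self.write_buffer = {}  # address -> data awaiting the home's Ack

    def write_back(self, addr):
        # The line leaves the cache but stays readable in the write buffer.
        self.write_buffer[addr] = self.cache.pop(addr)

    def on_forward(self, addr):
        # A late Forward is serviced from whichever place holds the data.
        if addr in self.cache:
            return self.cache[addr]
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        raise LookupError("no valid copy of %r" % (addr,))

    def on_writeback_ack(self, addr):
        # Only the home's Ack allows the buffered copy to be dropped.
        del self.write_buffer[addr]


n2 = OwnerNode()
n2.cache["A"] = 42
n2.write_back("A")
print(n2.on_forward("A"))   # the late Forward still finds the data
n2.on_writeback_ack("A")
```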
Early Request Race Condition
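[Diagram: home/directory (H/D) and nodes N1 to N5. Initial states: N1/I, N2/I, N3/M, N4/I, N5/I]
• Messages shown: RE, Forward, Marker, Rd, Forward, Marker, Reply, Reply
• State changes shown: N3/I, N2/M, then N2/O and N5/S
• A Forward can reach the new owner before that owner's own data Reply has arrived
• DEADLOCK?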
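The early request case can be sketched as a node that defers a Forward until its own data arrives. This is a minimal illustration, assuming the new owner holds an early Forward rather than rejecting it, and that it keeps its copy after serving a read (dirty sharing); the class and method names are hypothetical.

```python
class NewOwner:
    """Toy model of a node that owns a line before its data has arrived."""

    def __init__(self):
        self.data = {}      # address -> value, once the Reply has arrived
        self.deferred = {}  # address -> requesters whose Forward came early

    def on_forward(self, addr, requester):
        """Return the replies to send now; defer if the data is missing."""
        if addr in self.data:
            return [(requester, self.data[addr])]
        self.deferred.setdefault(addr, []).append(requester)
        return []

    def on_data_reply(self, addr, value):
        """Install the data, then serve every deferred Forward in order."""
        self.data[addr] = value
        return [(r, value) for r in self.deferred.pop(addr, [])]


n2 = NewOwner()
print(n2.on_forward("A", "N5"))   # early Forward: nothing to send yet
print(n2.on_data_reply("A", 7))   # data arrives: N5 is served
```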
Crossbar Network
• Q0 Queue: Requests to Home Node (point-to-point order)
• Q1 Queue: Forwards, Replies, and Invalidations from Home Node (global order)
• Q2 Queue: Data Replies from Owner to Requester Node
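Total Ordering on Q1
[Diagram: P1 caches A (O) and P2 caches B (O), each with a Q1 Inbound Queue; homes HA and HB each have a Q1 Outbound Queue into the Crossbar Switch]
• Directory initially: A (P1), B (P2)
• After RE1(B) and RE2(A): directory A (P2), B (P1); forwards P1 – RE2(A), P2 – RE1(B)
• After RE4(A) and RE3(B): directory A (X), B (Y); forwards P1 – RE3(B), P2 – RE4(A)
• Queue order shown: P1 – RE3(B), P2 – RE4(A) ahead of P1 – RE2(A), P2 – RE1(B)
• DEADLOCK?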
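The deadlock question in this diagram can be explored with a toy model. The sketch below assumes each processor services its Q1 inbound queue strictly in FIFO order and can only service a forward for a line it currently holds; the function name `drains` and the data layout are hypothetical. It compares a delivery order that matches the order in which the homes issued the forwards against the reordered delivery shown in the diagram.

```python
from collections import deque


def drains(queues, holdings):
    """Return True if every forward can be serviced, False if all heads block.

    queues:   {processor: [(line, requester), ...]} in delivery order
    holdings: {processor: set of lines currently held}
    """
    queues = {p: deque(q) for p, q in queues.items()}
    holdings = {p: set(lines) for p, lines in holdings.items()}
    progress = True
    while progress:
        progress = False
        for proc, q in queues.items():
            # Only the head of the queue may be serviced (head-of-line rule).
            if q and q[0][0] in holdings[proc]:
                line, requester = q.popleft()
                holdings[proc].remove(line)
                holdings.setdefault(requester, set()).add(line)
                progress = True
    return all(not q for q in queues.values())


start = {"P1": {"A"}, "P2": {"B"}}
in_order = {"P1": [("A", "P2"), ("B", "Y")],   # RE2(A), then RE3(B)
            "P2": [("B", "P1"), ("A", "X")]}   # RE1(B), then RE4(A)
reordered = {"P1": [("B", "Y"), ("A", "P2")],  # RE3(B) ahead of RE2(A)
             "P2": [("A", "X"), ("B", "P1")]}  # RE4(A) ahead of RE1(B)

print(drains(in_order, start), drains(reordered, start))
```

In the reordered case each processor's head forward asks for a line it will only receive once the other processor services the forward stuck behind its own head, so neither queue can advance. A total order on Q1 rules out that delivery order.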
Desirable Characteristics
• Dirty sharing - efficient for migratory accesses
• All directory changes take effect immediately; each transaction needs only a single access to the home node and directory
• Eliminate livelock, starvation, and fairness problems
• Writes can start as soon as Exclusive request is issued
Alpha Consistency Model
MB: Memory Barrier
[Diagram: LOAD and STORE operations listed in program order, starting from the oldest memory operation, with MBs separating groups of operations]
Atomicity is not violated: read others' write early
Consistency Model Implementation
• Barrier Performance (Commit Event)
– Early acknowledgment of Invalidates
– Early acknowledgment of Forwards (of Exclusive, Read-Exclusive, and Read requests)
• Overall Performance
– Relax the total order condition on Q1 at commit points: let replies (Q1->Q2) bypass forwards (Q1) and invalidations (Q1)
Early Acknowledgement of Invalidation Request
[Diagram: P1 caches A = 0 and B = 0; P2 caches A = 0. Each processor has a Q1 Inbound Queue on the Crossbar Switch, with commit points marked]
• P1 executes: A = 1; B = 1;
• P2 executes: u = B; v = A;
• Question posed under SC: can P2 observe u = 1, v = 0?
• Diagram labels: EX, INVAL A, A = 1, B = 1, Rd, Marker, P1, INV, INV Ack, Commit, MB, Races, Not a Race Condition
1. Optimize the memory barrier at P1 for write-to-write/read ordering
2. Commit events travel in the Q1 queue for ordering purposes in the case of replies
3. Sufficient condition: commit events must not bypass invalidates
4. The memory barrier at P2 waits for all the commits before going ahead
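One plausible reading of this slide can be sketched from P2's side only. The toy model assumes INVAL A has already been acknowledged to P1 once it was queued at P2 (the early acknowledgment), that the data for B reaches P2 on Q2 while the matching commit event sits in Q1 behind INVAL A, and that an MB at P2 applies everything queued ahead of that commit event. The function name `run_p2` and the message tuples are hypothetical.

```python
from collections import deque


def run_p2(use_barrier):
    """Return (u, v) for P2's program: u = B; [MB;] v = A."""
    memory = {"A": 1, "B": 1}   # values once P1's writes have been performed
    cache = {"A": 0}            # P2's stale copy of A
    # INVAL A was queued (and early-acknowledged) ahead of the commit
    # event for P2's read of B; commit events do not bypass invalidates.
    q1_inbound = deque([("INVAL", "A"), ("COMMIT", "B")])

    u = memory["B"]             # the data reply for B arrives on Q2

    if use_barrier:
        # MB: wait for the commit event of the earlier read; everything
        # ahead of it in Q1 is applied to the cache first.
        while q1_inbound:
            kind, addr = q1_inbound.popleft()
            if kind == "INVAL":
                cache.pop(addr, None)
            elif kind == "COMMIT" and addr == "B":
                break

    # A cache hit returns the cached value; a miss fetches the current one.
    v = cache["A"] if "A" in cache else memory["A"]
    return u, v


print(run_p2(use_barrier=False), run_p2(use_barrier=True))
```

Without the barrier the stale hit on A yields u = 1, v = 0. With the barrier the queued invalidate is applied before v is read, so the outcome is u = 1, v = 1.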
Early Acknowledge of Forwards
[Diagram: P1 caches A = 0 and B = 0; P2 caches A = 0. Each processor has a Q1 Inbound Queue on the Crossbar Switch, with commit points marked]
• P1 executes: A = 1; MB; B = 1;
• P2 executes: u = B; MB; v = A;
• Question posed: can P2 observe u = 1, v = 0?
• Diagram labels: Read B, Commit/RdB, Fwd Ack, INVAL A, Commit/INV A, A = 1, bypass
Sufficient condition: commit events must not bypass invalidates, reads, and read-exclusive forwards
Optimization Summary
• Dirty Sharing
– Reduces home node traffic
• No negative acknowledgements
– Reduces network traffic (Home Node)
– Simple implementation of directory
– Removes livelock, starvation, and fairness problems
– Network total ordering avoids deadlocks
– Write optimization
• Bypass of replies in Q1 queue
– Improves overall performance
• Improved barrier performance
– Early invalidation acknowledgements
– Early Forward responses (Rd, RE, EX)
– Memory ordering based on commit events