1 E. Bolotin The Power of Priority, NoCs 2007
The Power of Priority:
NoC-Based Distributed Cache Coherency
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny
QNoC Research Group
Technion
EE Department
Technion, Haifa, Israel
-
Chip Multi-Processor (CMP)
Dual-Core
Monolithic shared cache
[Figure: processors P0-P7 around a distributed L2 with banks numbered 0-63]
Multi-Core
Large cache
Shared cache
Distributed cache
NoC-based: How?
-
Future Cache - Physics Perspective
Large cache -> large access time
[Figure: global wire delay vs. gate delay. Source: ITRS 2003]
[Figure: fraction of chip reachable in 1 clock cycle. Source: Keckler et al., ISSCC 2003]
Distance reached in a single cycle:
Today: ~25% of chip
In 10 years: ~1% of chip
Large monolithic cache is not scalable
-
NUCA - Non Uniform Cache Architecture
NUCA = non-uniform access times
Banked cache over NoC:
Smaller bank -> smaller access time
Multiple banks -> multiple ports
Closer bank -> smaller access time
Cache-line placement policy:
Static NUCA (SNUCA)
Dynamic NUCA (DNUCA)
Sources: Kim et al., ASPLOS 2002; Beckmann et al., MICRO 2004
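The difference between the two placement policies can be sketched in a few lines. This is only an illustration, assuming an 8x8 grid of 64 banks, 64-byte cache lines, a low-order-bits static mapping, and a one-hop migration rule; the names `static_bank`, `hops` and `migrate_one_step` are hypothetical and are not taken from the slides or the cited papers.

```python
GRID = 8          # assumed 8x8 grid of banks, numbered 0..63
LINE_BYTES = 64   # assumed cache-line size


def static_bank(address: int) -> int:
    """SNUCA-style placement: the bank is a fixed function of the address."""
    return (address // LINE_BYTES) % (GRID * GRID)


def coords(bank: int) -> tuple[int, int]:
    """Return the (row, column) of a bank in the grid."""
    return divmod(bank, GRID)


def hops(src_bank: int, dst_bank: int) -> int:
    """Manhattan distance between two banks, a proxy for NoC access time."""
    (r1, c1), (r2, c2) = coords(src_bank), coords(dst_bank)
    return abs(r1 - r2) + abs(c1 - c2)


def migrate_one_step(bank: int, requester_bank: int) -> int:
    """DNUCA-style toy rule: move the line one bank closer to the requester."""
    (r, c), (rr, rc) = coords(bank), coords(requester_bank)
    if r != rr:
        r += 1 if rr > r else -1
    elif c != rc:
        c += 1 if rc > c else -1
    return r * GRID + c


home = static_bank(0x1234)            # fixed home bank under static placement
closer = migrate_one_step(home, 63)   # one migration step toward bank 63
print(home, hops(home, 63), closer, hops(closer, 63))
```

With static placement the hop count to a given line never changes; with the toy migration rule each access by the same requester shortens it by one hop.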
-
Issues in NUCA-based CMP
[Figure: processors P0-P7 around a distributed L2 with banks numbered 0-63]
NoC performance -> CMP performance
Cache coherency and transaction order (correctness)
Search (in DNUCA)
Different traffic types (e.g. fetch vs. prefetch)
Synchronization (locks)
NoC Services for CMP?
-
Cache Coherency over NoC
[Figure: processors P0-P7 around a distributed L2 with banks numbered 0-63]
How do we maintain coherency over NoC?
Snooping
Central directory
Distributed directory
[Figure: cache bank with distributed directory; rows of cache lines, each with a status vector]
-
Distributed Cache Coherency
Example: Simple read transaction
[Figure: P0 L1 and the L2 directory connected by the NoC; legend: ctrl. packet, data packet]
1. READ REQ
2. READ RESP (data transfer)
Directory state: P0-Shared
Cache access -> multiple NoC transactions
-
Read Transaction of Modified Block
[Figure, two panels: P0 L1, P2 L1 and the L2 directory connected by the NoC; legend: ctrl. packet, data packet]
Panel 1:
1. READ EXCL. REQ
2. READ RESP (data transfer)
Directory state: P2-MOD.
Panel 2:
3. READ REQ
4. WR BACK REQ
5. WR BACK RESP (data transfer)
6. READ RESP (data transfer)
Directory state: P0-SHARED
-
Read Exclusive of Shared Block
[Figure, two panels: P0 L1, P1 L1, P2 L1 and the L2 directory connected by the NoC; legend: ctrl. packet, data packet]
Panel 1:
1. READ REQ (from each reader)
2. READ RESP (data transfer)
Directory state: P1-Shared, P2-Shared
Panel 2:
3. READ EXCL. REQ
4. INVALID. REQ
5. INVALID. ACK (one per sharer)
6. READ EXCL. RESP (data transfer)
Directory state: P0-MOD.
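The three transactions on the preceding slides can be replayed with a small directory model that counts how many control and data packets each L2 access generates. This is a sketch under assumptions, not the authors' protocol: the state names, the `DirectoryEntry` class and the helper functions are hypothetical, the previous owner is dropped from the holder list after a write-back because the slide shows only P0-SHARED, and a read-exclusive of a modified block is left out because the slides do not show it.

```python
from dataclasses import dataclass, field

CTRL, DATA = "ctrl", "data"


@dataclass
class DirectoryEntry:
    state: str = "INVALID"                  # INVALID, SHARED or MODIFIED
    holders: set = field(default_factory=set)


def read(entry, requester):
    """Serve a read; return the (message, packet type) pairs it needs."""
    msgs = [("READ REQ", CTRL)]
    if entry.state == "MODIFIED":
        # The owner must write the block back before the L2 can answer.
        msgs += [("WR BACK REQ", CTRL), ("WR BACK RESP", DATA)]
        entry.holders.clear()               # assumption, see the lead-in
    entry.holders.add(requester)
    entry.state = "SHARED"
    msgs.append(("READ RESP", DATA))
    return msgs


def read_exclusive(entry, requester):
    """Serve a read-exclusive; every other sharer is invalidated first."""
    if entry.state == "MODIFIED":
        raise NotImplementedError("case not shown on the slides")
    msgs = [("READ EXCL. REQ", CTRL)]
    sharers = sorted(entry.holders - {requester})
    msgs += [("INVALID. REQ", CTRL)] * len(sharers)
    msgs += [("INVALID. ACK", CTRL)] * len(sharers)
    entry.holders = {requester}
    entry.state = "MODIFIED"
    msgs.append(("READ EXCL. RESP", DATA))
    return msgs


def count(msgs, kind):
    """Count the packets of one type in a message list."""
    return sum(1 for _, k in msgs if k == kind)


entry = DirectoryEntry()
read(entry, "P1")
read(entry, "P2")
rx = read_exclusive(entry, "P0")   # read exclusive of a shared block
rd = read(entry, "P2")             # read of a modified block
print(count(rx, CTRL), count(rx, DATA), count(rd, CTRL), count(rd, DATA))
```

In this model most packets of a transaction are short control packets, and the single data response cannot leave until all of them have been delivered.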
-
Basic NoC to Support CMP
Off-the-shelf (Vanilla) NoC:
Grid of wormhole routers
Unicast only
Ordering in network: static routing
No virtual channels
Smart interfaces
[Figure: the read-exclusive transaction (messages 3-6) carried over the vanilla NoC]
Can We Do Better?
-
Observations: L2 Access
A) Delay = queueing + NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of:
Short ctrl. packets
Long data packets
Idea: Differentiate between ctrl. and data
Solution: Preemptive Priority NoC
Give priority to short ctrl. packets
[Figure: the read-exclusive transaction (messages 3-6), repeated from the previous slide]
-
Preemptive Priority NoC: QNoC
[Figure: Multiple-SL link; output and input sides of a physical link, each with buffers for SL 0 to SL 3]
[Figure: Multiple-SL router; input and output ports with per-SL buffers (SL 0 to SL 3, size BufSize), crossbar, scheduler, credit-based control]
Service Levels:
Dedicated wormhole buffer
Preemptive priority scheduling
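Preemptive priority scheduling over per-service-level buffers can be sketched as one output port that, on every cycle, sends a flit from the highest-priority non-empty queue. This is a minimal illustration, assuming that SL 0 is the highest priority, that preemption happens at flit granularity, and that credits are always available; the class `PriorityOutputPort` and its methods are hypothetical names, not the QNoC router design.

```python
from collections import deque


class PriorityOutputPort:
    """One output port with a dedicated flit queue per service level (SL)."""

    def __init__(self, num_sl=4):
        # Index 0 is assumed to be the highest-priority service level.
        self.queues = [deque() for _ in range(num_sl)]

    def enqueue(self, sl, packet_id, num_flits):
        """Split a packet into flits and queue them at the given SL."""
        for i in range(num_flits):
            self.queues[sl].append((packet_id, i))

    def next_flit(self):
        """Send one flit per cycle from the highest-priority non-empty queue."""
        for q in self.queues:
            if q:
                return q.popleft()
        return None


port = PriorityOutputPort()
port.enqueue(3, "data", 4)      # long, low-priority data packet
sent = [port.next_flit()]       # its first flit is already on the link
port.enqueue(0, "ctrl", 1)      # a short ctrl. packet arrives mid-transfer
sent += [port.next_flit() for _ in range(4)]
print(sent)
```

The control flit overtakes the remaining data flits, so the short packet waits for at most one flit time instead of a whole long packet.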
-
Example: Vanilla NoC
Without contention:
X: delay of long packet
δ: delay of short packet
Transaction 1: long data
Transaction 2: short req. + long resp.
[Figure: vanilla NoC example on the path between A and B]
Blue delay ~ X
Red delay ~ 2X + δ
Average delay ~ 1.5X
-
Example: Priority NoC
Without contention:
X: delay of long packet
δ: delay of short packet
Transaction 1: long data
Transaction 2: short req. + long resp.
Vanilla NoC example (path between A and B):
Blue delay = X
Red delay = 2X + δ
Average delay ~ 1.5X
Priority NoC example (path between A and B):
Blue delay = X + δ
Red delay = X + δ
Average delay ~ X
Potential delay reduction ~ 0.5X
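The arithmetic of the two examples can be checked with a short calculation. The mapping used here is an assumption, since the figure is not available: blue is taken to be Transaction 1 (one long data packet), red is Transaction 2 (a short request followed by a long response), both start at the same time on the same link, and the response returns without contention. The function names and the sample values X = 100 and δ = 2 are illustrative only.

```python
def vanilla_delays(x, d):
    """Vanilla NoC: the short request waits behind the whole long packet."""
    blue = x                 # long data packet goes first
    red = x + d + x          # wait X, send request (d), then long response (X)
    return blue, red


def priority_delays(x, d):
    """Priority NoC: the short request preempts the long data packet."""
    red = d + x              # request sent at once, then long response
    blue = x + d             # data packet is delayed only by the short request
    return blue, red


def average(delays):
    """Mean delay over the transactions."""
    return sum(delays) / len(delays)


X, D = 100, 2                # sample long and short packet delays
van = average(vanilla_delays(X, D))
pri = average(priority_delays(X, D))
print(van, pri, van - pri)
```

The difference between the two averages is 0.5X - δ/2, which approaches the 0.5X quoted on the slide when the short packet is much shorter than the long one.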
-
Priority NoC: Different Destinations
Very important in wormhole routing, when a ctrl. packet is blocked by other worms
[Figure: a short req. and long data packets heading to different destinations]
-
Protocol Correctness
[Figure: P0 L1, P1 L1 and the L2 directory; legend: high priority (ctrl.), low priority (data)]
1. Read Req.
2. Read Resp.
3. Read Excl. Req.
4. Invalidation Req.
Need state-preserving serialization of transactions in the processor interface
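One way to read this requirement: with priorities, a high-priority Invalidation Req. can overtake the low-priority Read Resp. that the directory issued earlier to the same processor, so the interface has to apply them in the directory's order. The slide does not spell out the mechanism; the sketch below is a hypothetical interface for a single cache block that holds an early invalidation until the pending data arrives. The class `L1Interface` and its method names are assumptions.

```python
class L1Interface:
    """Toy processor-side interface for a single cache block."""

    def __init__(self):
        self.state = "INVALID"
        self.pending_read = False
        self.deferred = []       # ctrl. messages held until the data arrives
        self.log = []

    def send_read_req(self):
        self.pending_read = True

    def on_invalidation_req(self):
        if self.pending_read:
            # The invalidation overtook the read response: hold it so both
            # transactions are applied in the order the directory issued them.
            self.deferred.append("INVALIDATION REQ")
        else:
            self._invalidate()

    def on_read_resp(self):
        self.pending_read = False
        self.state = "SHARED"
        self.log.append("fill")
        while self.deferred:
            self.deferred.pop(0)
            self._invalidate()

    def _invalidate(self):
        self.state = "INVALID"
        self.log.append("invalidate+ack")


p0 = L1Interface()
p0.send_read_req()         # 1. Read Req.
p0.on_invalidation_req()   # 4. arrives early on the high-priority level
p0.on_read_resp()          # 2. arrives late on the low-priority level
print(p0.log, p0.state)
```

Without the deferral, the late data fill would leave P0 holding a copy that the directory already considers invalidated.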
-
Numerical Evaluation
CMP simulator (SIMICS)
Simulate parallel benchmarks
Obtain L2-cache access traces
QNoC simulator (OPNET)
Simulate distributed coherence protocol over NoC
Measure total RD/RX L2-access delay
Measure total program throughput
[Figure: processors P0-P7 around a distributed L2 with banks numbered 0-63]
-
Priority NoC: Results
Short ctrl. packet gets high priority
Long data packet gets low priority
Delay Reduction vs. Network Load
[Figure: RD Delay - Apache. Av. delay of L2-Read in Apache; delay [cycles] (axis 0 to 1400) vs. link capacity [Gbps] (1, 4, 16); series: Vanilla NoC, Priority-based NoC; bar labels as extracted: 234, 5762, 286, 1301, 994]
[Figure: RD/RX Delay Reduction - Apache. Av. delay reduction of L2-transaction in Apache; delay reduction [%] (axis 0 to 30) vs. link capacity [Gbps] (1, 4, 16); series: Read, Read Exclusive]
-
Priority NoC: Several Benchmarks
Delay Reduction
[Figure: L2 Access Delay Reduction by Priority-based NoC; delay reduction [%] (axis 0 to 35) for apache, zeus, fft, ocean, radix; series: Read, Read Exclusive; bar labels in extracted order: 22.6, 31.8, 19.6, 28.4, 13.5, 25.3, 18.3, 32.9, 22.3, 28.0]
Program Speedup
[Figure: Total Program Speedup by Priority-based NoC; speedup [%] (axis 0 to 10) for apache, zeus, fft, ocean, radix; bar labels in extracted order: 9.4, 8.7, 9.0, 8.6, 5.0]
-
So Far: The Power of Priority
Simplicity - Almost for Free
Significant CMP Speed-up
Good For:
Coherency
Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
Search in DNUCA
Synchronization (Locks)
[Figure: processors P0-P7 around a distributed L2 with banks numbered 0-63]
-
Advanced Support Functions
Special Broadcast for Short Messages:
Broadcast service (e.g. search in DNUCA)
Wormhole broadcast is slow and expensive
S&F (store-and-forward) broadcast embedded in wormhole
Virtual Ring:
No additional cost
For invalidation multicast
Snooping or synchronization
[Figure: broadcast from a source S across the grid, with replicating and forwarding nodes; processors P0-P7 around banks 0-63]
-
Summary
NoC at CMP Service!
Shared cache over NoC
Priority is powerful
Built-in support functions