Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
Goal: Improving energy efficiency in snoop-based CMPs.
Motivation: Broadcasting and processing the entire tag is inefficient.
Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)
This Work: Improving Snoop Coherency
Our Solution (PTC) vs. Conventional
[Figure: data caches (D$) connected through an interconnect to an upper-level cache, drawn once for the conventional design and once for our solution.]
Conventional: Fast (+); Power & Bandwidth (−)
Our solution: Fast (++, early miss detection); Power & Bandwidth Efficient (+)
Conventional Snooping
[Figure: CPUs and their D$s attached to the address bus, snoop bus, and command bus, plus a controller; numbered steps trace a conventional snoop request.]
Redundant (miss): ~70%
Snoop Filters
Goal: Eliminate redundant snoop requests.
Examples: RegionScout (ISCA'05), CGCT (ISCA'05), SSP (ASPLOS'08)
PTC:
(1) Early miss detection using a subset of the tag bits.
(2) Once a miss is detected, the snoop is avoided.
How often is that possible?
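The early miss detection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware design: the helper names `partial_tag` and `early_miss` and the tag values are made up, and the partial tag is taken to be the 8 least-significant tag bits. The key property is that a partial mismatch against every cached tag proves a miss, while a partial match only means the full snoop is still needed.

```python
def partial_tag(tag: int, n_bits: int = 8) -> int:
    """Return the n least-significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def early_miss(request_tag: int, cached_tags: list[int], n_bits: int = 8) -> bool:
    """True when no cached tag shares the request's low n bits (a definite miss)."""
    wanted = partial_tag(request_tag, n_bits)
    return all(partial_tag(t, n_bits) != wanted for t in cached_tags)


# One 4-way set in a remote cache (hypothetical tag values).
remote_set = [0x3A17, 0x0B42, 0x7FC9, 0x1205]

# Low byte 0x88 matches nothing in the set: the snoop can be skipped.
print(early_miss(0x5E88, remote_set))
# Low byte 0x17 matches 0x3A17 even though the full tags differ: snoop is still required.
print(early_miss(0x9917, remote_set))
```

The second request shows the cost of using fewer bits: a partial match that is not a real hit leads to a snoop that a full tag comparison would have ruled out.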
How often are n tag bits enough to detect a miss?
95+% of misses can be detected using 8 bits.
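One way to measure this kind of figure is to count, over a trace, how many true misses are already visible from the low n tag bits. The sketch below uses a tiny hand-made trace, so its numbers are purely illustrative and do not reproduce the 95+% result; the function name `detected_miss_fraction` and all tag values are assumptions.

```python
def detected_miss_fraction(requests, cached_tags, n_bits):
    """Fraction of true misses that the low n_bits of the tag already reveal."""
    mask = (1 << n_bits) - 1
    cached_full = set(cached_tags)
    cached_partial = {t & mask for t in cached_tags}
    misses = [r for r in requests if r not in cached_full]
    if not misses:
        return 0.0
    detected = sum(1 for r in misses if (r & mask) not in cached_partial)
    return detected / len(misses)


# Hypothetical 16-bit tags: four cached tags, five requests (the first is a hit).
cached_tags = [0x3A17, 0x0B42, 0x7FC9, 0x1205]
requests = [0x3A17, 0x5E88, 0x9917, 0x6604, 0x2A52]

for n in (2, 4, 8, 16):
    print(n, detected_miss_fraction(requests, cached_tags, n))
```

With the full 16 bits every miss is detected; with fewer bits some misses alias with a cached tag and go undetected, which is the trade-off the slide quantifies.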
[Figure: PTC-Filter lookup using the tag LSBs taken from the address bus.]
Filter miss: avoid the snoop and access the upper level.
Filter hit: snoop the potential targets.
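The decision on this slide (filter miss: avoid the snoop and access the upper level; filter hit: snoop the potential targets) can be written as a small routing function. This is a minimal sketch; the function name `route_request`, the action strings, and the per-core dictionary are illustrative assumptions rather than anything named in the slides.

```python
def route_request(filter_hits: dict[int, bool]) -> tuple[str, list[int]]:
    """Pick the next step from per-core partial-tag results.

    filter_hits maps a remote core id to True when the partial tag matched
    an entry for that core (a potential hit) and False otherwise.
    """
    targets = sorted(core for core, hit in filter_hits.items() if hit)
    if not targets:
        # No remote cache can hold the block: skip the snoop entirely.
        return ("access_upper_level", [])
    # Only the cores with a partial match need to be snooped.
    return ("snoop", targets)


print(route_request({1: False, 2: False, 3: False}))
print(route_request({1: False, 2: True, 3: False}))
```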
[Figure: PTC-Filter organization for 4-way D$s. The filter is indexed by way (0 to 3) and holds Core1's LSB, Core2's LSB, and Core3's LSB. Entry fields: V, D, LSB. Width label: 8 bits.]
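The filter organization pictured here can be approximated by a table with one entry per remote core, set, and way. The sketch below is an assumption-heavy model, not the authors' design: it reads the V field as a valid bit, takes the LSB field to be 8 bits wide, leaves out the D field because the slide does not define it, and uses hypothetical names (`PTCFilter`, `update`, `potential_targets`).

```python
from dataclasses import dataclass, field

WAYS = 4       # 4-way D$, as in the slide
LSB_BITS = 8   # assumed width of the stored partial tag
LSB_MASK = (1 << LSB_BITS) - 1


@dataclass
class FilterEntry:
    valid: bool = False  # assumed meaning of the slide's V field
    lsb: int = 0         # low bits of the remote cache's tag


@dataclass
class PTCFilter:
    remote_cores: list[int]
    num_sets: int
    table: dict = field(init=False)

    def __post_init__(self):
        # One row of WAYS entries per (remote core, set).
        self.table = {
            core: [[FilterEntry() for _ in range(WAYS)] for _ in range(self.num_sets)]
            for core in self.remote_cores
        }

    def update(self, core: int, set_index: int, way: int, tag: int) -> None:
        """Record that `core` now holds `tag` in the given set and way."""
        self.table[core][set_index][way] = FilterEntry(True, tag & LSB_MASK)

    def potential_targets(self, set_index: int, tag: int) -> list[int]:
        """Remote cores whose stored LSBs match the request in this set."""
        wanted = tag & LSB_MASK
        return [
            core for core in self.remote_cores
            if any(e.valid and e.lsb == wanted for e in self.table[core][set_index])
        ]


ptc = PTCFilter(remote_cores=[1, 2, 3], num_sets=256)
ptc.update(core=2, set_index=5, way=0, tag=0x3A17)
print(ptc.potential_targets(5, 0x9917))  # same low byte as the stored tag
print(ptc.potential_targets(5, 0x5E88))  # no match in any remote core
```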
PTC: Filter Miss
[Figure: the bus diagram (address bus, snoop bus, command bus, CPUs, D$s, controller) with numbered steps for a request that misses in the PTC-Filter.]
PTC: Filter Hit
[Figure: the bus diagram (address bus, snoop bus, command bus, CPUs, D$s, controller) with numbered steps for a request that hits in the PTC-Filter; check marks (✓) and crosses (✗) annotate the caches.]
Filter Maintenance
[Figure: filter maintenance walkthrough with numbered steps 1 to 6, showing the CPU, PTC-Filter, address bus, command bus, snoop controller, and entries for Core 0 through Core i.]
Request = A; the cached tags are B, F, D, E.
A misses and is placed in the position of tag F.
Pending Request Table (columns Addr., C, W, D): {Address=A, C=0, W=1, D=1}
Place A: insert in Way 1 of Core 0.
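The maintenance walkthrough (request A replaces tag F, recorded as {Address=A, C=0, W=1, D=1}) can be mimicked with a small update routine. This is a rough sketch under assumptions: C is read as the requesting core and W as the way being replaced, D is stored without interpretation because the slide does not define it, the set index is omitted for brevity, and the numeric tag values and the names `PendingRequest` and `complete` are invented for the example.

```python
from dataclasses import dataclass

LSB_BITS = 8  # assumed partial-tag width


@dataclass
class PendingRequest:
    address: int  # the slide's Address field (the requested tag)
    c: int        # the slide's C field, read here as the requesting core
    w: int        # the slide's W field, read here as the way being replaced
    d: int        # the slide's D field, kept as-is (meaning not given)


def complete(req: PendingRequest, filter_rows: dict[int, list[int]]) -> None:
    """When the fill completes, overwrite the filter entry for (core, way)."""
    filter_rows[req.c][req.w] = req.address & ((1 << LSB_BITS) - 1)


# Hypothetical numeric tags standing in for the slide's A, B, D, E, F.
TAG = {"A": 0x1A0A, "B": 0x2B0B, "D": 0x3D0D, "E": 0x4E0E, "F": 0x5F0F}

# Filter row for Core 0: LSBs of tags B, F, D, E in ways 0 to 3.
filter_rows = {0: [TAG[name] & 0xFF for name in "BFDE"]}

pending = PendingRequest(address=TAG["A"], c=0, w=1, d=1)
complete(pending, filter_rows)
print([hex(x) for x in filter_rows[0]])  # way 1 now holds A's LSBs instead of F's
```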
Methodology
• SESC simulator, 4-way CMP
• SPLASH-2 benchmarks
• CACTI 6.0
• DL1/IL1: 64KB/32KB, 4-way/2-way, 3-cycle latency
• L2: 4 MB, 4-banked, 16-way, 10-cycle latency
• 64 B cache line
• 500-cycle memory access
• 6-cycle arbitration + 2-cycle core-to-controller latency
• Crossbar data network
• MESI protocol
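The cache parameters above determine how many sets each cache has and how the address splits into offset, index, and tag. The arithmetic can be checked with a short helper; note that the 32-bit address width is an assumption (the slides do not state it), so the tag widths are indicative only, and `cache_geometry` is a hypothetical name.

```python
def cache_geometry(size_bytes, ways, line_bytes=64, banks=1, addr_bits=32):
    """Derive set count and address field widths (sizes must be powers of two)."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return {
        "sets": sets,
        "sets_per_bank": sets // banks,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": addr_bits - offset_bits - index_bits,  # depends on assumed addr_bits
    }


KB, MB = 1024, 1024 * 1024
dl1 = cache_geometry(64 * KB, ways=4)
il1 = cache_geometry(32 * KB, ways=2)
l2 = cache_geometry(4 * MB, ways=16, banks=4)

for name, geom in (("DL1", dl1), ("IL1", il1), ("L2", l2)):
    print(name, geom)
```

Under the 32-bit assumption the DL1 tag is 18 bits wide, so an 8-bit partial tag compares fewer than half of the tag bits.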
Performance
Average: 2.9%
Bandwidth
Average: 78.5%
Tag Power
Average: 52%
Discussion
Why do benchmarks show different performance improvements?
• Different cache miss frequency
• Different early miss detection frequency
• Not all cache misses are on the critical path
Filter overhead:
• Timing: 1 cycle
• Power: 78.5% of a single tag array access
Summary
PTC: Using a subset of tag bits to improve bandwidth/power efficiency.
Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%
Global vs. Local Miss
[Figure: two copies of the D$ / interconnect / upper-level cache diagram. Global Miss: every D$ asked "Have B?" answers NO. Local Miss: one D$ answers YES while the others answer NO.]
Local miss detection: better power/bandwidth profile.
Remote miss detection (source-based approach) vs. (destination-based filter)
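The distinction drawn on this slide can be stated compactly: a request is a global miss when no queried cache holds block B, and a local miss at a particular cache when that cache answers NO even though another cache may answer YES. The sketch below encodes that reading; the function names and core ids are hypothetical, and the definitions are inferred from the figure labels rather than spelled out in the slides.

```python
def is_global_miss(has_block: dict[int, bool]) -> bool:
    """True when every queried cache answers NO to "Have B?"."""
    return not any(has_block.values())


def local_misses(has_block: dict[int, bool]) -> list[int]:
    """Caches that answer NO, regardless of what the others answer."""
    return sorted(core for core, has in has_block.items() if not has)


# Left-hand scenario on the slide: everyone answers NO.
global_case = {1: False, 2: False}
# Right-hand scenario: one cache answers YES, the others NO.
local_case = {1: False, 2: True, 3: False}

print(is_global_miss(global_case), local_misses(global_case))
print(is_global_miss(local_case), local_misses(local_case))
```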
Partial tag lookup: global miss
Partial tag lookup: local miss