Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors

Page 1: Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors

Ali Shafiee, Narges Shahidi, Amirali Baniasadi

Sharif University of Technology, University of Victoria

Page 2: This Work: Improving Snoop Coherency

Goal: Improving energy efficiency in snoop-based CMPs.

Motivation: Broadcasting/processing entire tag is inefficient.

Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.

Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)

Page 3: Our Solution (PTC) vs. Conventional

[Figure: two organizations side by side, each with several D$s connected through an interconnect to the upper level cache.]

Conventional: Fast (+), Power & Bandwidth (−)

Our solution: Fast (++, early miss detection), Power & Bandwidth Efficient (+)

Page 4: Conventional Snooping

[Figure: CPUs with their D$s connected by the address bus, snoop bus, and command bus, plus a controller; steps numbered 1 to 5.]

Redundant (miss): ~70%

Page 5: Snoop Filters

Goal: Eliminate redundant snoop requests.

Example: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)

PTC:
(1) Early miss detection using subset of tag bits.
(2) Once a miss is detected, snoop is avoided.

How often is that possible?

Page 6: How often using n bits is enough to detect a miss?

95+% of misses can be detected using 8 bits.
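The early miss check can be illustrated with a minimal sketch. It assumes the partial tag is the low-order 8 bits of the full tag (the slides speak of LSBs); the function names `partial`, `definitely_miss`, and `detectable_fraction` are hypothetical and not taken from the slides. The key property is that a partial mismatch proves a full mismatch, while a partial match is only a possible hit.

```python
PARTIAL_BITS = 8  # the slide reports 95+% of misses detectable with 8 bits


def partial(tag: int, bits: int = PARTIAL_BITS) -> int:
    """Return the low-order `bits` bits of a tag (assumed partial-tag choice)."""
    return tag & ((1 << bits) - 1)


def definitely_miss(request_tag, stored_tags, bits=PARTIAL_BITS):
    """True when no stored tag shares the request's low-order bits.

    True is always safe: differing partial tags imply differing full tags.
    False only means "possible hit"; the full tags may still differ.
    """
    want = partial(request_tag, bits)
    return all(partial(t, bits) != want for t in stored_tags)


def detectable_fraction(requests, stored_tags, bits=PARTIAL_BITS):
    """Fraction of true misses that the partial check alone detects."""
    misses = [r for r in requests if r not in stored_tags]
    if not misses:
        return 0.0
    early = sum(definitely_miss(r, stored_tags, bits) for r in misses)
    return early / len(misses)
```

In this toy model, a request whose low byte collides with a stored tag (for example 0xFF2B against a stored 0x1A2B) is a true miss that the partial check cannot detect, which is why the detectable fraction stays below 100%.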

Page 7: PTC-Filter

[Figure: the LSBs of the address on the address bus are looked up in the PTC-Filter next to the D$. Miss: avoid snoop, access upper level. Hit: snoop potential targets.]

Page 8: PTC-Filter

[Figure: four 4-way D$s, each with its own PTC-Filter. A filter holds the LSBs of the other cores (Core1's LSB, Core2's LSB, Core3's LSB) for ways 0 to 3. Entry fields: V, D, LSB (8 bits).]
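The filter organization on this slide can be modeled roughly as below. This is a sketch under assumptions, not the authors' design: the class name `PTCFilter`, its methods, and the per-set indexing are hypothetical, and the D field shown on the slide is left out because the slide does not define it. Only the 4 ways, the per-remote-core sections, the V bit, and the 8-bit LSB come from the slide.

```python
from dataclasses import dataclass

WAYS = 4      # 4-way D$, as on the slide
LSB_BITS = 8  # 8-bit partial tag, as on the slide


@dataclass
class Entry:
    valid: bool = False  # the V field
    lsb: int = 0         # low-order LSB_BITS bits of the remote line's tag


class PTCFilter:
    """Hypothetical model: partial tags of every remote core's D$."""

    def __init__(self, remote_cores, num_sets):
        # One section per remote core; per-set indexing is an assumption.
        self.table = {
            core: [[Entry() for _ in range(WAYS)] for _ in range(num_sets)]
            for core in remote_cores
        }

    def update(self, core, set_index, way, tag):
        """Record that `core` now holds `tag` in (set_index, way)."""
        entry = self.table[core][set_index][way]
        entry.valid = True
        entry.lsb = tag & ((1 << LSB_BITS) - 1)

    def potential_targets(self, set_index, tag):
        """Remote cores whose partial tags match; empty means filter miss."""
        want = tag & ((1 << LSB_BITS) - 1)
        return [
            core for core, sets in self.table.items()
            if any(e.valid and e.lsb == want for e in sets[set_index])
        ]
```

An empty result corresponds to the filter miss case (no snoop, go to the upper level); a non-empty result lists the only cores that need to be snooped.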

Page 9: PTC: Filter Miss

[Figure: CPUs with their D$s connected by the address bus, snoop bus, and command bus, plus a controller; steps numbered 1 to 3.]

Page 10: PTC: Filter Hit

[Figure: CPUs with their D$s connected by the address bus, snoop bus, and command bus, plus a controller; steps numbered 1 to 6; one location is marked ✓ and the others ✗.]

Page 11: Filter Maintenance

[Figure: a CPU whose D$ set holds tags B, F, D, E issues Request = A on the address bus (steps numbered 1 to 6). The snoop controller keeps a Pending Request Table with one row per core (Core 0 ... Core i) and columns Addr., C, W, D, and drives the command bus to the PTC-Filters.]

Miss on A: place it in the position of tag F.

Pending Request Table entry: {Address=A, C=0, W=1, D=1}

Filter update: place A, insert in Way 1 of core 0.
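The slide's example (A replaces F in way 1 of core 0) can be traced with a small sketch. It assumes C is the requesting core and W the victim way, which matches "insert in Way 1 of core 0"; the D field is carried through untouched because the slide does not define it. The names `PendingRequest` and `apply_pending`, the single-set view, and the numeric tag values are all hypothetical.

```python
from dataclasses import dataclass

LSB_BITS = 8  # partial tag width used elsewhere in the slides


@dataclass(frozen=True)
class PendingRequest:
    address: int  # tag of the requested line (A in the slide)
    c: int        # assumed: requesting core
    w: int        # assumed: way being replaced in that core's D$
    d: int        # carried as is; its meaning is not defined on the slide


def apply_pending(filter_rows, req):
    """Overwrite the filter slot for (core C, way W) with the new partial tag.

    `filter_rows` maps a core id to the list of partial tags, one per way,
    that a remote filter holds for a single set of that core's D$.
    """
    row = filter_rows[req.c]
    row[req.w] = req.address & ((1 << LSB_BITS) - 1)


# Illustrative tag values standing in for A, B, D, E, F on the slide.
A, B, D, E, F = 0xA1, 0xB2, 0xD4, 0xE5, 0xF6
rows = {0: [B, F, D, E]}  # core 0 holds B, F, D, E in ways 0 to 3
apply_pending(rows, PendingRequest(address=A, c=0, w=1, d=1))
```

After the update, the filter row for core 0 reads B, A, D, E, so later requests for F no longer match core 0 while requests for A do.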

Page 12: Methodology

• SESC simulator, 4-way CMP
• SPLASH-2 benchmarks
• CACTI 6.0

• L2: 4 MB, 4-banked, 16-way, 10 cycle latency
• DL1/IL1: 4-way/2-way, 64KB/32KB, 3 cycle latency
• 64 B cache line
• 500 cycle memory access
• 6 cycle arbitration
• 2 cycle core to controller latency
• Crossbar data network
• MESI protocol
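To put the 8-bit partial tag in context, the cache parameters above determine how an address splits into offset, index, and tag bits. The sketch below does that arithmetic. The helper name `cache_geometry` is hypothetical, and the 32-bit address width is an assumption: the slides do not state the address width, so the resulting tag width is illustrative only.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Return (sets, offset_bits, index_bits, tag_bits) for a cache.

    addr_bits=32 is an assumption; the slides do not give the address width.
    Sizes are expected to be powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits


KB = 1024
dl1 = cache_geometry(64 * KB, ways=4, line_bytes=64)  # DL1 from the slide
il1 = cache_geometry(32 * KB, ways=2, line_bytes=64)  # IL1 from the slide
```

Under the 32-bit assumption, the DL1 has 256 sets, 6 offset bits, 8 index bits, and an 18-bit tag, so an 8-bit partial tag covers a bit less than half of the full tag.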

Page 13: Performance

[Figure: per-benchmark performance results.]

Average: 2.9%

Page 14: Bandwidth

[Figure: per-benchmark bandwidth results.]

Average: 78.5%

Page 15: Tag Power

[Figure: per-benchmark tag power results.]

Average: 52%

Page 16: Discussion

Why do benchmarks show different performance improvement?
• Different cache miss frequency
• Different early miss detection frequency
• Not all cache misses are on the critical path

Filter overhead:
• Timing: 1 cycle
• Power: 78.5% of single tag array access

Page 17: Summary

PTC: Using subset of tag bits to improve bandwidth/power efficiency.

Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%


Page 19: Global vs. Local Miss

[Figure: two panels, each with several D$s connected through an interconnect to the upper level cache, answering the question "Have B?". Global Miss: the caches answer NO. Local Miss: the answers include both NO and YES.]

Local miss detection: better power/bandwidth profile.

Remote miss detection (source-based approach) vs. (destination-based filter)

Page 20: Partial tag lookup: global miss

[Figure: partial tag lookup results for global misses.]

Page 21: Partial tag lookup: local miss

[Figure: partial tag lookup results for local misses.]