Adaptive Insertion Policies for Managing Shared Caches
Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi,
Julien Sebot, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
[email protected]
Conference on Parallel Architectures and Compilation Techniques (PACT)
Paper Motivation
• Shared caches common and more so with increasing # of cores
• More concurrent applications → more contention for the shared cache
• High performance → manage the shared cache efficiently
[Figure: cache hierarchies. Single Core (SMT): Core 0 with FLC and LLC. Dual Core (ST/SMT): Cores 0-1, each with an FLC, sharing an LLC. Quad-Core (ST/SMT): Cores 0-3, each with an FLC and MLC, sharing an LLC.]
Problems with LRU-Managed Shared Caches
• Conventional LRU policy allocates resources based on rate of demand
– Applications that do not benefit from cache cause destructive cache interference
[Figure: misses per 1000 instructions (under LRU) for soplex and h264ref, and cache occupancy (0-100%) under LRU replacement in a 2MB shared cache]
Addressing Shared Cache Performance
• Conventional LRU policy allocates resources based on rate of demand
– Applications that do not benefit from cache cause destructive cache interference
• Cache Partitioning: reserves cache resources based on application benefit rather than rate of demand
– Requires HW to detect cache benefit
– Requires changes to existing cache structure
– Not scalable to large # of applications
[Figure: same soplex / h264ref MPKI and cache-occupancy plot as the previous slide (2MB shared cache)]
Eliminate Drawbacks of Cache Partitioning
Paper Contributions
• Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Goals: Design a dynamic hardware mechanism that:
1. Provides High Performance by Allocating Cache on a Benefit-basis
2. Is Robust Across Different Concurrently Executing Applications
3. Scales to Large Number of Competing Applications
4. Requires Low Design Overhead
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP) that improves average throughput by 12-18% for 2-, 4-, 8-, and 16-core systems with two bytes of storage per HW-thread
TADIP, Unlike Cache Partitioning, DOES NOT Attempt to Reserve Cache Space
“Adaptive Insertion Policies for High-Performance Caching”
Moinuddin Qureshi, Aamer Jaleel, Yale Patt, Simon Steely Jr., Joel Emer
Appeared in ISCA’07
Review Insertion Policies
Cache Replacement 101 – ISCA’07
Two components of cache replacement:
• Victim Selection:
– Which line to replace for the incoming line? (e.g. LRU, Random, etc.)
• Insertion Policy:
– With what priority is the new line placed in the replacement list? (e.g. insert new line at the MRU position)
Simple changes to the insertion policy can minimize cache thrashing and improve cache performance for memory-intensive workloads
Static Insertion Policies – ISCA’07
• Conventional (MRU Insertion) Policy:
– Choose victim, promote to MRU
• LRU Insertion Policy (LIP):
– Choose victim, DO NOT promote to MRU
– Unless reused, lines stay at the LRU position
• Bimodal Insertion Policy (BIP):
– LIP does not age older lines
– Infrequently insert some misses at MRU
– Bimodal throttle ε: we used ε ≈ 3%
Example recency stack (MRU … LRU): a b c d e f g h
Reference to 'i' with conventional LRU policy: i a b c d e f g (inserted at MRU)
Reference to 'i' with LIP: a b c d e f g i (inserted at LRU)
Reference to 'i' with BIP:
if( rand() < ε ) insert at MRU position
else insert at LRU position
Applications Prefer Either Conventional LRU or BIP…
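As an aside, a minimal C++ sketch of how these insertion choices could be modeled over a single set's recency stack; the list-based stack, function names, and exact ε value are illustrative assumptions, not the hardware design from the paper:

```cpp
#include <cstdlib>
#include <list>

// Recency stack for one cache set: front = MRU way, back = LRU way.
using RecencyStack = std::list<int>;

constexpr double kBipEpsilon = 0.03;   // bimodal throttle, ~3% as on the slide (assumed value)

enum class InsertPolicy { ConventionalLru, Lip, Bip };

// On a miss: evict the LRU way and decide where the incoming line enters
// the recency stack. Only the insertion position differs between policies.
int fillOnMiss(RecencyStack& stack, InsertPolicy policy) {
    int victimWay = stack.back();      // victim selection: always the LRU way
    stack.pop_back();

    bool insertAtMru = false;
    switch (policy) {
        case InsertPolicy::ConventionalLru:
            insertAtMru = true;                                   // promote to MRU
            break;
        case InsertPolicy::Lip:
            insertAtMru = false;                                  // stay at LRU unless reused
            break;
        case InsertPolicy::Bip:
            insertAtMru = (std::rand() < kBipEpsilon * RAND_MAX); // rarely promote to MRU
            break;
    }
    if (insertAtMru) stack.push_front(victimWay);
    else             stack.push_back(victimWay);
    return victimWay;                  // way that now holds the incoming line
}

// On a hit, all three policies move the referenced way to MRU, which is how
// a reused LIP/BIP line escapes the LRU position.
void touchOnHit(RecencyStack& stack, int way) {
    stack.remove(way);
    stack.push_front(way);
}
```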
Dynamic Insertion Policy (DIP) via “Set-Dueling” – ISCA’07
[Figure: cache sets divided into SDM-LRU sets, SDM-BIP sets, and Follower sets]
• Set Dueling Monitors (SDMs): dedicated sets to estimate the performance of a pre-defined policy
• Divide the cache in three:
– SDM-LRU: dedicated LRU sets
– SDM-BIP: dedicated BIP sets
– Follower sets
• PSEL: n-bit saturating counter
– misses to SDM-LRU: PSEL++
– misses to SDM-BIP: PSEL--
• Follower sets insertion policy:
– Use LRU if PSEL MSB = 0
– Use BIP if PSEL MSB = 1
• Based on analytical and empirical studies:
– 32 sets per SDM
– 10-bit PSEL counter
HW Required: 10 bits + Combinational Logic
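A sketch of what the set-dueling logic could look like in a simulator, assuming a 4096-set cache, the 32-sets-per-SDM and 10-bit PSEL parameters above, and a purely illustrative static mapping of sets to SDMs:

```cpp
#include <cstdint>

constexpr uint32_t kPselBits = 10;
constexpr uint32_t kPselMax  = (1u << kPselBits) - 1;     // 10-bit saturating counter
uint32_t psel = kPselMax / 2;                              // start unbiased

enum class SetType { SdmLru, SdmBip, Follower };

// Illustrative mapping for a 4096-set cache: 32 sets duel for LRU,
// 32 sets duel for BIP, the rest follow the winner.
SetType classifySet(int setIndex) {
    if (setIndex % 128 == 0) return SetType::SdmLru;
    if (setIndex % 128 == 1) return SetType::SdmBip;
    return SetType::Follower;
}

// PSEL is updated only on misses to the dedicated SDM sets.
void onMiss(int setIndex) {
    SetType t = classifySet(setIndex);
    if (t == SetType::SdmLru && psel < kPselMax) ++psel;   // LRU sets are missing
    if (t == SetType::SdmBip && psel > 0)        --psel;   // BIP sets are missing
}

// Insertion policy for a fill into the given set.
bool fillUsesBip(int setIndex) {
    switch (classifySet(setIndex)) {
        case SetType::SdmLru:   return false;                         // always LRU insertion
        case SetType::SdmBip:   return true;                          // always BIP insertion
        case SetType::Follower: return (psel >> (kPselBits - 1)) & 1; // MSB = 1 -> BIP
    }
    return false;
}
```

Only 64 of the 4096 sets are dedicated under these assumptions, so a wrong choice inside the SDMs costs little, while the follower sets always get whichever policy is currently missing less.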
Extending DIP to Shared Caches
• DIP uses a single policy (LRU or BIP) for all applications competing for the cache
• DIP cannot distinguish between apps that benefit from the cache and those that do not
• Example: soplex + h264ref w/ 2MB cache
– DIP learns LRU for both apps
– soplex causes destructive interference
– Desirable that only h264ref follow LRU and soplex follow BIP
[Figure: misses per 1000 instructions (under LRU) for soplex and h264ref]
Need a Thread-Aware Dynamic Insertion Policy (TADIP)
Thread-Aware Dynamic Insertion Policy (TADIP)
• Assume an N-core CMP running N apps; what is the best insertion policy for each app? (LRU=0, BIP=1)
• The insertion policy decision can be thought of as an N-bit binary string:
< P0, P1, P2 … PN-1 >
– If Px = 1, application x uses BIP, else it uses LRU
– e.g. 0000 = always use conventional LRU, 1111 = always use BIP
• With an N-bit string, there are 2^N possible combinations (e.g. 65,536 for N = 16). How to find the best one?
– Offline Profiling: input-set/system dependent & impractical with large N
– Brute-Force Search using SDMs: infeasible with large N
Need a PRACTICAL and SCALABLE Implementation of TADIP
Using Set-Dueling As a Practical Approach to TADIP
• Unnecessary to exhaustively search all 2^N combinations
• Some bits of the best binary insertion string can be learned independently
– Example: always use BIP for applications that create interference
• Exponential Search Space → Linear Search Space
– Learn the best policy (BIP or LRU) for each app in the presence of all other apps
Use Per-Application SDMs To Decide:
In the presence of other apps, does an app cause destructive interference?
If so, use BIP for this app, else use the LRU policy
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 4 applications: APP0 APP1 APP2 APP3
[Figure: high-level and set-level views of the cache. Each APPx owns two SDMs, one filling with its bit forced to 0 (LRU) and one with its bit forced to 1 (BIP): < 0, P1, P2, P3 > and < 1, P1, P2, P3 > for APP0, < P0, 0, P2, P3 > and < P0, 1, P2, P3 > for APP1, and so on. Misses to APPx's two SDMs increment/decrement PSELx. Follower sets fill with < P0, P1, P2, P3 >, where Px = MSB( PSELx ). The SDMs answer: in the presence of the other apps, does APPx doing LRU or BIP improve cache performance?]
TADIP Using Set-Dueling Monitors (SDMs)
• Divide the cache in three:
– LRU SDMs for each APP
– BIP SDMs for each APP
– Follower sets
• Per-APP PSEL saturating counters (32 sets per SDM, 10-bit PSEL):
– misses to the LRU SDM: PSEL++
– misses to the BIP SDM: PSEL--
• Follower sets insertion policy:
– SDMs of one thread are follower sets of another thread
– Let Px = MSB[ PSELx ]
– Fill Decision: < P0, P1, P2, P3 >
HW Required: (10*T) bits + Combinational Logic
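A per-thread sketch under the same assumptions (10-bit PSELs, an illustrative set-to-SDM ownership mapping, and counters updated by all misses in the owner's SDMs, following the slide's wording). Thread t's two SDMs force Pt to 0 or 1 while every other thread keeps following the MSB of its own PSEL, which is the linear search described earlier:

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kPselBits = 10;
constexpr uint32_t kPselMax  = (1u << kPselBits) - 1;

class Tadip {
public:
    explicit Tadip(int numThreads) : psel_(numThreads, kPselMax / 2) {}

    // Illustrative ownership: every 32nd set is an LRU SDM, the set after it
    // a BIP SDM, ownership rotating over threads; all other sets are followers.
    int lruSdmOwner(int setIndex) const {
        return (setIndex % 32 == 0) ? (setIndex / 32) % numThreads() : -1;
    }
    int bipSdmOwner(int setIndex) const {
        return (setIndex % 32 == 1) ? (setIndex / 32) % numThreads() : -1;
    }

    // A miss in thread t's LRU SDM bumps PSELt up, a miss in its BIP SDM bumps
    // it down (slide: misses to LRU SDM: PSEL++, misses to BIP SDM: PSEL--).
    void onMiss(int setIndex) {
        int t = lruSdmOwner(setIndex);
        if (t >= 0 && psel_[t] < kPselMax) ++psel_[t];
        t = bipSdmOwner(setIndex);
        if (t >= 0 && psel_[t] > 0) --psel_[t];
    }

    // Fill decision < P0, P1, ..., PT-1 >: the filling thread's bit is forced
    // in its own SDMs and is MSB(PSELt) everywhere else, so each thread's SDMs
    // double as follower sets for all the other threads.
    bool fillUsesBip(int setIndex, int thread) const {
        if (lruSdmOwner(setIndex) == thread) return false;     // forced LRU insertion
        if (bipSdmOwner(setIndex) == thread) return true;      // forced BIP insertion
        return (psel_[thread] >> (kPselBits - 1)) & 1;         // Pt = MSB( PSELt )
    }

private:
    int numThreads() const { return static_cast<int>(psel_.size()); }
    std::vector<uint32_t> psel_;
};
```

In a simulator loop, every LLC miss would call onMiss(set) and every fill would call fillUsesBip(set, thread) to pick the insertion position for that thread's line.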
Summarizing Insertion Policies
Policy          | Insertion Policy Search Space                     | # of SDMs | # Counters
LRU Replacement | < 0, 0, 0, … 0 >                                  | 0         | 0
DIP             | < 0, 0, 0, … 0 > and < 1, 1, 1, … 1 >             | 2         | 1
Brute Force     | < 0, 0, 0, … 0 > … < 1, 1, 1, … 1 >               | 2^N       | 2^N
TADIP           | < P0, P1, P2, … PN-1 > and strings at Hamming distance 1 | 2N | N
TADIP is SCALABLE with Large N
Experimental Setup
• Simulator and Benchmarks:
– CMP$im: a Pin-based multi-core performance simulator
– 17 representative SPEC CPU2006 benchmarks
• Baseline Study:
– 4-core CMP with in-order cores (assuming L1-hit IPC of 1)
– Three-level cache hierarchy: 32KB L1, 256KB L2, 4MB L3
– 15 workload mixes of four different SPEC CPU2006 benchmarks
• Scalability Study:
– 2-core, 4-core, 8-core, 16-core systems
– 50 workload mixes of 2, 4, 8, & 16 different SPEC CPU2006 benchmarks
soplex + h264ref Sharing 2MB Cache
[Figure: MPKI, % MRU insertions, cache usage, and APKI for SOPLEX and H264REF under the baseline LRU policy / DIP vs. TADIP. APKI: accesses per 1000 instructions; MPKI: misses per 1000 instructions]
TADIP Improves Throughput by 27% over LRU and DIP
TADIP Results – Throughput
[Figure: throughput normalized to LRU (1.00-1.60) for DIP and TADIP across MIX_0 … MIX_14 and the geometric mean; several mixes show no gains from DIP]
DIP and TADIP are ROBUST and Do Not Degrade Performance over LRU
Making Thread-Aware Decisions is 2x Better than DIP
TADIP Compared to Offline Best Static Policy
[Figure: throughput normalized to LRU (1.00-1.60) for DIP, TADIP, and BEST STATIC across MIX_0 … MIX_14 and the geometric mean]
TADIP is within 85% of Best Offline Determined Insertion Policy Decision
Best Static is almost always better because the insertion string with the best IPC is chosen as “Best Static”, while TADIP optimizes for fewer misses. TADIP can be used to optimize other metrics (e.g. IPC).
TADIP Better Due to Phase Adaptation
TADIP Vs. UCP ( MICRO’06 )
[Figure: throughput normalized to LRU (1.00-1.60) for TADIP and UCP across MIX_0 … MIX_14 and the geometric mean]
TADIP Out-Performs UCP Without Requiring Any Cache Partitioning Hardware
Cost Per Thread (bytes): UCP = 1920, TADIP = 2
Unlike Cache Partitioning Schemes, TADIP Does NOT Reserve Cache Space
TADIP Does Efficient CACHE MANAGEMENT by Changing the Insertion Policy
UCP: Utility-Based Cache Partitioning
TADIP Results – Sensitivity to Cache Size
[Figure: throughput normalized to 4MB LRU (1.00-2.00) for TADIP-4MB, LRU-8MB, TADIP-8MB, and LRU-16MB across MIX_0 … MIX_14 and the geometric mean]
TADIP Provides Performance Equivalent to Doubling Cache Size
TADIP Results – Scalability
[Figure: throughput normalized to the LRU of the respective baseline system (1.00-2.00) for 50 workloads on 2-, 4-, 8-, and 16-thread systems]
TADIP Scales to Large Number of Concurrently Executing Applications
Summary
• The Problem: For shared caches, the conventional LRU policy allocates cache resources based on rate of demand rather than benefit
• Solution: Thread-Aware Dynamic Insertion Policy (TADIP)
1. Provides High Performance by Allocating Cache on a Benefit-Basis
- Up to 94%, 64%, 26%, and 16% performance improvement on 2-, 4-, 8-, and 16-core CMPs
2. Is Robust Across Different Workload Mixes
- Does not significantly hurt performance when LRU works well
3. Scales to Large Number of Competing Applications
- Evaluated up to 16-cores in our study
4. Requires Low Design Overhead
- < 2 bytes per HW-thread and NO CHANGES to existing cache structure
Q&A
Journal of Instruction-Level Parallelism
1st Data Prefetching Championship (DPC-1), Sponsored by: Intel, JILP, IEEE TC-uARCH
In Conjunction with: HPCA-15
Paper & Abstract Due: December 12th, 2008
Notification: January 16th, 2009
Final Version: January 30th, 2009
More Information and Prefetch Download Kit At:http://www.jilp.org/dpc/
TADIP Results – Weighted Speedup
[Figure: weighted speedup normalized to LRU (1.00-1.20) for DIP and TADIP across MIX_0 … MIX_14 and the geometric mean]
TADIP Provides More Than Two Times the Performance Gain of DIP
TADIP Improves Performance over LRU by 18%
TADIP Results – Fairness Metric
[Figure: harmonic mean of normalized IPCs (0.00-1.00) for LRU, DIP, and TADIP across MIX_0 … MIX_14 and the geometric mean]
TADIP Improves the Fairness
TADIP In The Presence of Prefetching (4-core CMP)
[Figure: throughput normalized to LRU + prefetching (0.90-1.70) for 50 workloads]
TADIP Improves Performance Even In Presence of HW Prefetching
Insertion Policy to Control Cache Occupancy (16-Cores)
• Changing insertion policy directly controls the amount of cache resources provided to an application
• In the figure, only the TADIP-selected insertion policy for xalancbmk & sphinx3 is shown
• TADIP improves performance by 28%
Sixteen Core Mix with 16MB LLC
Insertion Policy Directly Controls Cache Occupancy
[Figure: MPKI, % MRU insertions, cache usage, and APKI for the sixteen-core mix]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1
• 32 sets per SDM, 9-bit PSEL
[Figure: high-level and set-level views. APP0's SDMs fill with < 0, P1 > and < 1, P1 >; APP1's SDMs fill with < P0, 0 > and < P0, 1 >; misses to the LRU/BIP SDMs increment/decrement the corresponding PSEL; follower sets fill with < P0, P1 >, where P = MSB( PSEL ). The SDMs answer: in the presence of the other app, should APPx do LRU or BIP?]
TADIP Using Set-Dueling Monitors (SDMs)
• Assume a cache shared by 2 applications: APP0 and APP1 (32 sets per SDM, 9-bit PSEL counter)
• Divide the cache in three:
– LRU SDMs for each APP
– BIP SDMs for each APP
– Follower sets
• PSEL0, PSEL1: per-APP PSEL counters
– misses to the LRU SDM: PSEL++
– misses to the BIP SDM: PSEL--
• Follower sets insertion policy:
– SDMs of one thread are follower sets of another thread
– Let Px = MSB[ PSELx ]
– Fill Decision: < P0, P1 >
HW Required: (9*T) bits + Combinational Logic