RATHIJIT SEN DAVID A. WOOD Reuse-based Online Models for Caches 6/20/2013 ACM SIGMETRICS 2013 @ CMU,...
RATHIJIT SEN, DAVID A. WOOD
Reuse-based Online Models for Caches
6/20/2013, ACM SIGMETRICS 2013 @ CMU, Pittsburgh, PA
1
2
The Problem
Caches: power vs performance
Reconfigurable caches e.g., IvyBridge
The Problem: Which configuration to select?
e.g., to get the best energy-efficiency?
[Diagram: eight cores and eight LLC banks backed by DRAM. Labels: Miss, Fetch.]
3
Cache Performance Prediction
We propose a framework: h = (r · B) · φ
    h: hit ratio
    r: reuse-distance distribution (novel hardware support)
    B: stochastic Binomial matrix
    φ: hit function (LRU, PLRU, RANDOM, NMRU)
Case study: Energy-Delay Product (EDP) within 7% of minimum
4
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
5
Cache Overview
Limited storage
Sets of (usually 64-byte) blocks
#blocks/set = associativity (#ways)
Set index + address tags identify data
[Diagram: the cache as an array of blocks (b), Sets (S) by Associativity (A). The address tag is compared against the tags of the indexed set: a match (Y) is a hit, no match (N) is a miss.]
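As a rough illustration of the lookup on this slide, the sketch below splits a byte address into a tag and a set index and checks the tag against the ways of that set. It assumes 64-byte blocks and plain modulo indexing; the function names and the dictionary-based cache model are hypothetical and only meant to show the idea.

```python
BLOCK_BYTES = 64  # assumed block size, per the "usually 64-byte" note


def split_address(addr, num_sets):
    """Return (tag, set_index) for a byte address, assuming modulo indexing."""
    block = addr // BLOCK_BYTES          # drop the block-offset bits
    return block // num_sets, block % num_sets


def lookup(cache, addr, num_sets):
    """Return True on a hit: the tag is present among the ways of the indexed set."""
    tag, index = split_address(addr, num_sets)
    return tag in cache.get(index, [])


# Toy cache with S = 4 sets; set 1 currently holds tags 0 and 7.
cache = {1: [0, 7]}
hit = lookup(cache, 0x40, 4)     # block 1 -> set 1, tag 0
miss = lookup(cache, 0x140, 4)   # block 5 -> set 1, tag 1
```

The number of tags a set can hold at once is the associativity A; the sketch does not enforce it or model replacement.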
6
Last-Level Cache (LLC)
Workload Variation
[Plot: misses per 1000 instructions (y-axis, 0 to 30) vs. LLC size (x-axis, 2MB to 32MB), one curve per workload or workload group: swim; ammp, blackscholes, bodytrack, fluidanimate, freqmine, swaptions; equake, gafort, wupwise; apache; mgrid; zeus; oltp; jbb; fma3d.]
7
Bad configurations hurt!
EDP (energy-delay product)
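[Bar chart: maximum EDP relative to minimum EDP (y-axis, 1 to 3.5) for each workload: blackscholes, bodytrack, fluidanimate, freqmine, swaptions, ammp, equake, fma3d, gafort, mgrid, swim, wupwise, apache, jbb, oltp, zeus. Legend: Minimum, Maximum. Annotations: 27% worse; 218% worse.]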
8
Problem Summary
Reconfigurable caches
Multiple replacement policies
Goal: Online miss-ratio prediction
[Diagram: the cache as an array of blocks (b), Sets (S) by Associativity (A).]
9
Indexing Assumption
Mapping of unique addresses to cache sets
    Assumption: independent, uniform [Smith, 1978]
    Unique accesses as Bernoulli trials
(Partial) hashing
    POWER4, POWER5, POWER6, Xeon
    Simple XOR-based function [similar to Cypher, 2008]
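To make the hashing point concrete, here is a minimal sketch of one generic XOR-folding index function. It is an assumption chosen for illustration, not the function used by the processors listed above or by the authors; `xor_set_index` is a hypothetical name.

```python
from collections import Counter


def xor_set_index(block_addr, index_bits):
    """Fold the block address into index_bits bits by XOR-ing successive chunks."""
    mask = (1 << index_bits) - 1
    index = 0
    while block_addr:
        index ^= block_addr & mask
        block_addr >>= index_bits
    return index


# A stride of 2^10 blocks maps every address to set 0 under plain modulo
# indexing, but spreads across sets once the upper bits are folded in.
blocks = [i << 10 for i in range(1024)]
modulo_sets = Counter(b & 1023 for b in blocks)
xor_sets = Counter(xor_set_index(b, 10) for b in blocks)
```

Spreading strided streams over the sets in this way brings the mapping closer to the independent, uniform assumption the model relies on.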
10
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
11
Temporal Locality Metrics
Unique Reuse Distance (URD)
    #unique intervening addresses
    x y z z y x : URD(x)=2
    Stack Distance [Mattson, 1970] – 1
    Large cache → large distances to track
Absolute Reuse Distance (ARD)
    #intervening addresses
    x y z z y x : ARD(x)=4
[Diagram: r drawn as a histogram whose entry i holds P(URD=i). Label: Size?]
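The two metrics can be checked on the slide's own trace with a short sketch. The function below is a naive, quadratic-time illustration (the name `reuse_distances` is hypothetical), not an efficient profiler.

```python
def reuse_distances(trace):
    """For each reuse in the trace, return (address, URD, ARD)."""
    last_pos = {}
    out = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            # Accesses strictly between the previous use and this one.
            between = trace[last_pos[addr] + 1:pos]
            out.append((addr, len(set(between)), len(between)))
        last_pos[addr] = pos
    return out


result = reuse_distances("x y z z y x".split())
# The last entry is the reuse of x: URD(x)=2 and ARD(x)=4, as on the slide.
```

Normalizing a histogram of the URD values gives the distribution r used in the rest of the talk.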
12
Per-set Locality, r(S)
r(S) is “compressed” as S (#sets) increases
Less of the tail is important
[Plot: probability (y-axis, 0 to 0.6) vs. per-set URD (unique reuse distance; x-axis, 0 to 32) for S=2^10, 2^11, 2^12, 2^13, 2^14.]
[Diagram: a reuse of x with #sets: S, and with #sets: S′ > S; r drawn as a histogram whose entry i holds P(URD=i).]
[Plot: cumulative probability (y-axis, 0 to 1) vs. per-set URD (x-axis, 0 to 32) for S=2^10, 2^11, 2^12, 2^13, 2^14.]
13
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
14
Estimating per-set locality
Generalized stochastic Binomial matrices [Strum, 1977]
    r(S) = r(1) · B(1 – 1/S, 1/S)
Composition, for S′ > S:
    r(S′) = r(S) · B(1 – S/S′, S/S′)
[Diagram: the row vector r (entry i holds P(URD=i)) multiplied by the lower-triangular matrix B. Entry (i, k) of B is P(k successes in i trials), i.e., P(k of i to the same set).]
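A small numerical sketch may help here. It builds a truncated Binomial matrix whose entry (i, k) is P(k successes in i trials) and applies it to a made-up whole-cache distribution. The function names, the toy values, and the truncation to eight entries are assumptions for illustration only.

```python
from math import comb


def binomial_matrix(n, p):
    """n x n lower-triangular matrix with B[i][k] = P(k successes in i trials)."""
    return [[comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
             for k in range(n)] for i in range(n)]


def transform(r, p):
    """Return r . B, where p is the chance that an address maps to the same set."""
    B = binomial_matrix(len(r), p)
    return [sum(r[i] * B[i][k] for i in range(len(r))) for k in range(len(r))]


# Toy whole-cache (S = 1) distribution over URD = 0..7; the values are made up.
r1 = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]
r4_direct = transform(r1, 1 / 4)                       # r(4) = r(1) . B(3/4, 1/4)
r4_composed = transform(transform(r1, 1 / 2), 1 / 2)   # r(1) -> r(2) -> r(4)
```

Both routes give the same vector up to rounding, which is the composition property: going from 1 set to 4 sets directly matches going through 2 sets. Because each row of B sums to 1, the transformed vector keeps the same total probability while shifting mass toward smaller per-set distances.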
15
Computation reuse & speedup
“Shorter” tail → smaller matrices
[Diagram: r(1), r(2^10), r(2^11), r(2^12), r(2^13), and r(2^14), shown twice; r drawn as a histogram whose entry i holds P(URD=i). Labels: Poisson Approximation; Size?]
Now: compute
Later: hardware support
16
Size of r(2^10)?
Prediction with r(2^10) limited to URD < n
[Plot: miss ratio (y-axis, 0 to 0.3) for 2-, 4-, 8-, 16-, and 32-way configurations at each cache size (2MB, 4MB, 8MB, 16MB, 32MB). Series: n=32, n=64, n=128, n=256, n=512, Actual.]
17
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
18
Hit Function, φ
φ_k: P(x will hit | URD(x)=k)
Monotonically decreasing model
    Intuition: larger URD → same or larger eviction probability
    φ_0 = 1
    φ_k ≤ φ_(k-1)
    φ_∞ = 0
[Diagram: an access to x, intervening accesses to addresses other than x, then a reuse of x.]
19
Hit Function, φ
Example: A=8
[Plot: hit probability (y-axis, 0 to 1) vs. unique reuse distance (x-axis, 0 to 32) for LRU, PLRU, NMRU, and RANDOM.]
20
Formulating φ
φ(LRU): step function
    (r · B) · φ(LRU): [Smith, 1978], [Hill & Smith, 1989]
φ(PLRU): assumes that, on average, traffic is evenly divided between subtrees
φ(RANDOM): estimates #intervening misses using ARD
φ(NMRU): similar to φ(RANDOM), except φ_1 = 1
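For LRU the hit function is simple enough to write down directly. The sketch below assumes the usual reading of the step function: a reuse hits when fewer than A unique blocks of the same set intervened. The per-set distribution is made up, and `phi_lru` and `hit_ratio` are hypothetical names. The PLRU, RANDOM, and NMRU functions are not reproduced here because the slide gives only their outline.

```python
def phi_lru(assoc, n):
    """Step function: phi_k = 1 for k < assoc, 0 otherwise, for k = 0..n-1."""
    return [1.0 if k < assoc else 0.0 for k in range(n)]


def hit_ratio(r_set, phi):
    """h = r . phi: dot product of per-set URD probabilities and hit probabilities."""
    return sum(p * f for p, f in zip(r_set, phi))


# Made-up per-set URD distribution over URD = 0..5. The remaining 0.1 of the
# accesses (cold or very distant reuses) lies outside the vector and misses.
r_set = [0.4, 0.2, 0.1, 0.1, 0.05, 0.05]
h2 = hit_ratio(r_set, phi_lru(2, len(r_set)))   # 0.4 + 0.2
h4 = hit_ratio(r_set, phi_lru(4, len(r_set)))   # 0.4 + 0.2 + 0.1 + 0.1
```

Changing the associativity only changes where the step falls, so one per-set distribution can be reused for every A with the same number of sets.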
21
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
22
Prediction Accuracy
LRU, PLRU(A=2), NMRU(A=2): exact per-set model
Others: approximate per-set model
[Plot: cumulative probability (y-axis, 0 to 1) vs. abs((predicted-actual)/actual) miss ratio (x-axis, -1% to 6%) for LRU, PLRU, RANDOM, and NMRU.]
23
Overheads
r′ = r · B : 6 to 80 μsec
    Binomial → Poisson approximation for each row of B
h = (r · B) · φ : 20 to 30 μsec
    Average over 24 configurations
    B applied 8 times
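The slide does not say how the approximation is implemented. As a hedged sketch, the code below compares one exact Binomial row of B with a Poisson row of mean i·p, built from a running term so that no binomial coefficients are needed. The row index, the probability, and the truncation to 16 entries are arbitrary choices for illustration.

```python
from math import comb, exp


def binomial_row(i, p, n):
    """First n entries of row i of B: P(k successes in i trials), k = 0..n-1."""
    return [comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
            for k in range(n)]


def poisson_row(i, p, n):
    """Poisson(i * p) approximation of the same row, built with a running term."""
    lam = i * p
    row, term = [], exp(-lam)
    for k in range(n):
        row.append(term)
        term *= lam / (k + 1)
    return row


i, p = 2048, 1 / 1024          # e.g. 2048 unique addresses, S = 2^10 sets
exact = binomial_row(i, p, 16)
approx = poisson_row(i, p, 16)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
```

The approximation works well when p is small, which is the regime of interest here since p is 1/S or a ratio of set counts.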
24
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
25
Computation reuse & speedup
“Shorter” tail → smaller matrices
[Diagram: r(1), r(2^10), r(2^11), r(2^12), r(2^13), and r(2^14), shown twice; r drawn as a histogram whose entry i holds P(URD=i). Labels: Poisson Approximation; Size=512; Now.]
Now: compute
Later: hardware support
26
Insights
x y z z y x : URD(x)=2
Unique → “remember” addresses
    Only cardinality, not full addresses
    Bloom filter for compact (approximate) representation
r(2^10) is seen by any set of a cache with S=2^10
    Filter address stream
[Diagram: r drawn as a histogram whose entry i holds P(URD=i).]
27
Hardware Support for estimating r(2^10)
[Block diagram: Reference address register; Set Filter; Control Logic; 1024-bit Bloom Filter (2 hash fns); 9-bit Counter; 512-entry Histogram array. Signals: access, filtered access, load, insert, hit, inc, reset, read.]
[Flowchart: Start Sample, then Addr match? If Y: End Sample. If N: Unique? If Y (not hit): Remember.]
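The sampling loop in the flowchart can be mimicked in software. The sketch below is a hypothetical model, not the authors' hardware: the two hash functions, the saturating counter, and the policy of starting a new sample on the next access are all assumptions, and the input stream is assumed to be already filtered down to one set.

```python
class UrdSampler:
    """Software model that takes one URD sample at a time for a single set."""

    BLOOM_BITS = 1024    # from the slide
    HIST_ENTRIES = 512   # from the slide; a 9-bit counter covers 0..511

    def __init__(self):
        self.histogram = [0] * self.HIST_ENTRIES
        self.reference = None

    def _hashes(self, addr):
        # Two hypothetical hash functions; the slide does not specify them.
        h1 = (addr * 2654435761) % 2**32 >> 22
        h2 = (addr ^ (addr >> 10)) % self.BLOOM_BITS
        return h1, h2

    def access(self, addr):
        """Feed one access that was already filtered to the monitored set."""
        if self.reference is None:                    # Start Sample
            self.reference = addr
            self.bloom = [False] * self.BLOOM_BITS
            self.counter = 0
            return
        if addr == self.reference:                    # Addr match? -> End Sample
            self.histogram[self.counter] += 1
            self.reference = None
            return
        h1, h2 = self._hashes(addr)
        if not (self.bloom[h1] and self.bloom[h2]):   # Unique? (no Bloom hit)
            self.bloom[h1] = self.bloom[h2] = True    # Remember
            # Assumption: saturate instead of wrapping at the last entry.
            self.counter = min(self.counter + 1, self.HIST_ENTRIES - 1)


sampler = UrdSampler()
for addr in [100, 1, 2, 2, 1, 100]:   # same shape as "x y z z y x"
    sampler.access(addr)
```

After this trace the histogram holds a single sample at URD = 2. A Bloom-filter false positive would make a new address look already seen and undercount the distance, which is why the representation is described as approximate.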
28
Agenda
The Problem
Framework: h = (r · B) · φ
    Locality (r)
    Matrix transformations (B)
    Hit functions (φ)
Hardware support
Case Study
+ way counters
29
LRU Way Counters [Suh, et al. 2002]
One counter per logical way (stack position)
Determining logical position is hard
    not totally (re-)ordered with every access
    heuristics, e.g., for PLRU [Kedzierski, et al. 2010]
Other limitations
    Inclusion property
    Fixed #sets
S′ = S: special case of reuse framework
S′ ≠ S? Use B
    provided enough tail of r(S) is available
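For comparison, way counters turn into hit ratios by a running sum. The sketch below assumes true LRU, where the inclusion property holds, so a cache with a ways captures every hit recorded at stack positions 0 to a-1. The counter values and the function name are made up.

```python
def hit_ratios_from_way_counters(way_hits, accesses):
    """way_hits[i] = hits at LRU stack position i; returns hit ratio for 1..A ways."""
    ratios, running = [], 0
    for hits in way_hits:
        running += hits
        ratios.append(running / accesses)
    return ratios


# Made-up counters for a 4-way cache observed over 100 accesses.
ratios = hit_ratios_from_way_counters([50, 20, 10, 5], 100)
```

The result covers associativities 1 through 4 at the monitored number of sets only, which is the fixed-#sets limitation noted on the slide.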
30
Min. EDP configuration
EDP within 7% of minimum
Reuse models outperform PLRU way counters in most cases
[Bar chart: EDP relative to minimum EDP (y-axis, 1 to 1.08) for each workload (blackscholes, bodytrack, fluidanimate, freqmine, swaptions, ammp, equake, fma3d, gafort, mgrid, swim, wupwise, apache, jbb, oltp, zeus). Series: Reuse Model, PLRU Way Counters.]
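Selecting a configuration by EDP comes down to an argmin over energy × delay. This transcript does not show how predicted miss ratios are converted into energy and delay, so the sketch below takes those estimates as inputs; the configuration names and numbers are invented for illustration.

```python
def min_edp_config(estimates):
    """estimates maps a configuration name to (energy, delay); returns the best
    configuration and every configuration's EDP relative to the minimum."""
    edp = {cfg: energy * delay for cfg, (energy, delay) in estimates.items()}
    best = min(edp, key=edp.get)
    return best, {cfg: value / edp[best] for cfg, value in edp.items()}


# Hypothetical (energy in joules, delay in seconds) estimates per configuration.
estimates = {
    "2MB/8-way": (1.0, 2.0),
    "8MB/16-way": (1.5, 1.0),
    "32MB/32-way": (4.0, 0.9),
}
best, relative = min_edp_config(estimates)
```

The relative values are in the same form as the chart's y-axis: 1.0 for the minimum-EDP configuration and larger for the rest.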
31
Summary
The Problem: Online miss-rate estimation for reconfigurable caches
We propose a framework: h = (r · B) · φ
    h: hit ratio
    r: reuse-distance distribution (novel hardware support)
    B: stochastic Binomial matrix
    φ: hit function (LRU, PLRU, RANDOM, NMRU)
Case study: EDP within 7% of minimum
Future work: More policies, applications/case studies
32
Also in the paper
r: lossy summarization of the address trace
Estimation for ARD
Optimizations for LRU
Conditions for PLRU eviction
More details on models & evaluation
Reuse-based Online Models for Caches
33
Questions?
34
Example LLC performance
OLTP (TPC-C + IBM DB2)
[Plot: miss ratio (y-axis, 0 to 0.4) for 2-, 4-, 8-, 16-, and 32-way configurations at each cache size (2MB, 4MB, 8MB, 16MB, 32MB). Series: RANDOM, NMRU, PLRU, LRU.]
35
Estimating cache performance
Hit ratio = hits/access
    = ∑_i P(URD=i) · P(hit | URD=i)
    = r · φ
Miss ratio = misses/access = 1 – hit ratio
Miss rate = misses/instruction = miss ratio × accesses/instruction
[Diagram: r drawn as a histogram whose entry i holds P(URD=i); φ drawn as a vector whose entry i holds P(hit | URD=i).]
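The three quantities on this slide chain together as in the short sketch below. The distribution, the hit function, and the accesses-per-instruction figure are made-up values chosen so the arithmetic is easy to follow; `cache_metrics` is a hypothetical name.

```python
def cache_metrics(r, phi, accesses_per_instruction):
    """Return (hit ratio, miss ratio, miss rate per instruction)."""
    hit = sum(p * f for p, f in zip(r, phi))   # sum over i of P(URD=i) * P(hit|URD=i)
    miss_ratio = 1.0 - hit
    return hit, miss_ratio, miss_ratio * accesses_per_instruction


# Made-up values: r covers URD = 0..2 and accesses outside it are misses.
r = [0.5, 0.25, 0.125]
phi = [1.0, 0.75, 0.5]
hit, miss_ratio, miss_rate = cache_metrics(r, phi, 0.02)
# 0.5*1.0 + 0.25*0.75 + 0.125*0.5 = 0.75, so the miss ratio is 0.25 and the
# miss rate is 0.005 misses per instruction, i.e. 5 misses per 1000 instructions.
```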
36
URD vs ARD
[Diagram: between two accesses to x, the intervening stream is z0 {z0}* z1 {z0,z1}* z2 {z0,z1,z2}* z3 ... zk-1 {z0,z1,z2,...,zk-1}*]
Approximation: dk = dk-1 + 1/rik
∞dk