Computer Architecture 2012 – probabilistic L1 filtering
Computer Architecture
Probabilistic L1 Cache Filtering
By Dan Tsafrir, 7/5/2012
Presentation based on slides by Yoav Etsion
Lecture is based on…
Paper titled "L1 cache filtering through random selection of memory references"
Authors: Yoav Etsion and Dror G. Feitelson (from the Hebrew U.)
Published in PACT 2007: the international conference on parallel architectures and compilation techniques
Can be downloaded from http://www.cs.technion.ac.il/~yetsion/papers/CorePact.pdf
A case for more efficient caches
CPUs get more and more cache dependent
• Growing gap between CPU and memory
• Growing popularity of multi-core chips
Common solution: larger caches, but…
• Larger caches consume more power
• Longer wire delays yield longer latencies
Efficiency through insertion policies
Need smarter caches
We focus on insertion policies
• Currently everything goes into the cache
• Need to predict which blocks will be evicted quickly… and prevent them from polluting the caches
Reducing pollution may enable use of low-latency, low-power, direct-mapped caches
• Less pollution yields fewer conflicts
The problem: "It is difficult to make predictions, especially about the future" (attributed to many)
Background:
CDF, RESIDENCY LENGTHS, MASS-COUNT DISPARITY
PDF (probability distribution function)
In statistics, a PDF is, roughly, a function f describing the likelihood to get some value in some domain
For example, f can specify how many students have a first name comprised of exactly k Hebrew letters:
f(1) = 0%
f(2) = 22% (דן, רם, שי, שי, שי, שי, גל, חן, חן, בן, גד, טל, לי, …)
f(3) = 24% (גיל, גיל, רון, שיר, שיר, שיר, שיר, נגה, משה, חיה, רחל, חנה, …)
f(4) = 25% (יואב, אחמד, אביב, מיכל, מיכל, נועה, נועה, נועה, נועה, נועה, …)
f(5) = 13% (אביטל, ירוחם, עירית, יהודה, חנניה, אביבה, אביתר, אביעד, …)
f(6) = 9% (יחזקאל, אביבית, אבינעם, אביגיל, שלומית, אבשלום, אדמונד, …)
f(7) = 6% (אביגדור, אבינועם, מתיתיהו, עמנואלה, אנסטסיה, …)
f(8) = 0.6% (אלכסנדרה, …)
f(9) = 0.4% (קונסטנטין, …)
f(10) = 0%
Note that Σk f(k) = 100%
CDF (cumulative distribution function)
In statistics, a CDF is, roughly, a function F describing the likelihood to get some value in some domain, or less
For example, F can specify how many students have a first name comprised of exactly k Hebrew letters, or less:
F(1) = 0%    = f(1)                 = 0%
F(2) = 22%   = f(1)+f(2)            = 0% + 22%
F(3) = 46%   = f(1)+f(2)+f(3)       = 0% + 22% + 24%
F(4) = 71%   = f(1)+f(2)+f(3)+f(4)  = 0% + 22% + 24% + 25%
F(5) = 84%   = F(4)+f(5)            = 71% + 13%
F(6) = 93%   = F(5)+f(6)            = 84% + 9%
F(7) = 99%   = F(6)+f(7)            = 93% + 6%
F(8) = 99.6% = F(7)+f(8)            = 99% + 0.6%
F(9) = 100%  = F(8)+f(9)            = 99.6% + 0.4%
F(10) = 100%
Generally, F(x) = Σ_{k<=x} f(k), monotonically non-decreasing
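The prefix-sum relation above can be checked with a short C++ sketch of mine (not from the slides; `cdf_from_pdf` is a hypothetical helper name), using the name-length percentages from the previous slide:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Prefix-sum sketch: build the CDF F from the PDF f.
// f[k] holds f(k+1) in percent, i.e. f = {f(1), f(2), ..., f(10)}.
std::vector<double> cdf_from_pdf(const std::vector<double>& f) {
    std::vector<double> F(f.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < f.size(); ++k) {
        sum += f[k];   // F(k) = f(1) + f(2) + ... + f(k)
        F[k] = sum;
    }
    return F;
}
```

Feeding in {0, 22, 24, 25, 13, 9, 6, 0.6, 0.4, 0} reproduces the table above, with the last entry summing to 100%.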
Cache residency
A "residency" is what we call a block of memory
• From the time it was inserted into the cache
• Until the time it was evicted
Each memory block can be associated with many residencies during a single execution
"Residency length": number of memory references (= load/store operations) served by the residency
"The mass of residency length = k": percent of memory references (throughout the entire program execution) that were served by residencies of length k
Computing residency length on-the-fly
At runtime, residency length is generated like so (assume C++):

#include <iostream>                    // for cout
using namespace std;

class Cache_Line_Residency {
private:
    int counter;                       // the residency length
public:
    Cache_Line_Residency() {           // constructor: a new object is allocated when a
        counter = 0;                   // cache line is allocated for a newly inserted memory block
    }
    ~Cache_Line_Residency() {          // destructor: called when the block is evicted
        cout << counter << endl;       // from the cache (or when the program ends)
    }
    void do_reference() {              // invoked whenever the cache line is
        counter++;                     // referenced (read from or written to)
    }
};
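To see the counting in action, here is a minimal driver of my own (not from the slides): it models the direct-mapped, two-line, 4-byte cache of the next slide's example and returns the lengths the destructors would print.

```cpp
#include <cassert>
#include <vector>

// Sketch (mine): a direct-mapped cache with two 2-byte lines (4 bytes total).
// Returns residency lengths in the order the destructors would print them:
// evictions as they happen, then the lines still resident at program end.
std::vector<int> residency_lengths(const std::vector<int>& addrs) {
    struct Line { bool valid = false; int block = 0; int counter = 0; };
    Line lines[2];
    std::vector<int> printed;
    for (int a : addrs) {
        Line& l = lines[(a % 4) / 2];        // address x maps into x % 4
        if (l.valid && l.block != a / 2) {   // eviction ends the residency
            printed.push_back(l.counter);    // what ~Cache_Line_Residency() prints
            l.counter = 0;
        }
        l.valid = true;
        l.block = a / 2;
        l.counter++;                         // do_reference()
    }
    for (const Line& l : lines)              // program end flushes the rest
        if (l.valid) printed.push_back(l.counter);
    return printed;
}
```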
Example
Assume:
• Size of cache: 4 bytes
• Size of cache line: 2 bytes (namely, there are two lines)
• Cache is direct-mapped => address x maps into x % 4
• A program references memory (order: top to bottom)
[Figure: a column of referenced addresses mapped onto the two cache lines; five residencies (#1–#5) form, and each residency's length (3, 2, 3, 90, 2) is printed when it ends, the last ones at program end]
Example – CDF of residency lengths
[Figure: the same reference trace and residencies as the previous slide]
So printed residency lengths are: 3, 2, 3, 90, 2
Thus, CDF of residency length is:
• 40% of residencies have length <= 2 = |[2,2]| / |[3,2,3,90,2]|
• 80% of residencies have length <= 3 = |[2,2,3,3]| / |[3,2,3,90,2]|
• 100% of residencies have length <= 90 = |[2,2,3,3,90]| / |[3,2,3,90,2]|
[Plot: CDF (20%–100%) vs. residency lengths (2, 3, 90)]
Example – CDF of mass of references
[Figure: the same reference trace and residencies as the previous slides]
So printed residency lengths are: 3, 2, 3, 90, 2
Thus, CDF of mass of references ("refs") is:
• 4% of refs are to residencies with length <= 2 = (2+2) / (3+2+3+90+2)
• 10% of refs are to residencies with length <= 3 = (2+2+3+3) / (3+2+3+90+2)
• 100% of refs are to residencies with length <= 90 = (2+2+3+3+90) / (3+2+3+90+2)
[Plot: CDF (20%–100%) vs. residency lengths (2, 3, 90)]
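Both example CDFs can be computed with a pair of helpers (a sketch of mine; the function names are not from the slides), fed with the printed lengths {3, 2, 3, 90, 2}:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: the two CDFs of the example, from the printed residency lengths.
double count_cdf(const std::vector<int>& lens, int k) {  // % of residencies with length <= k
    int c = 0;
    for (int l : lens) if (l <= k) c++;
    return 100.0 * c / lens.size();
}
double mass_cdf(const std::vector<int>& lens, int k) {   // % of refs served by them
    double num = 0, den = 0;
    for (int l : lens) { den += l; if (l <= k) num += l; }
    return 100.0 * num / den;
}
```

The same 5 residencies yield 80% on the count CDF at length 3 but only 10% on the mass CDF — the disparity the next slide names.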
Superimposing graphs
[Plot: the two CDFs superimposed — residency lengths (the "counters") vs. memory references (the "mass")]
"Mass-count disparity" is the term describing the phenomenon shown in the graph, whereby:
• most of the mass resides in very few counters, and
• most of the counters count very little mass
Methodology
Using all benchmarks from the SPEC-CPU 2000 benchmark suite
• In this presentation we show only four
• But we include all the rest in the averages
The benchmarks were compiled for the Alpha AXP architecture
All benchmarks were fast-forwarded 15 billion instructions (to skip any initialization code) and were then executed for another 2 billion instructions
Unless otherwise stated, all simulated runs utilize a 16KB direct-mapped cache
CDF of residency length (of 4 SPEC benchmark apps)
Vast majority of residencies are relatively short
• Which likely means they are transient
Small fraction of residencies are extremely long
[Plots: CDF vs. length of residency, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
CDF of mass of residencies (of 4 SPEC benchmark apps)
Fraction of memory references serviced by each length
Most references target residencies longer than, say, 10
[Plots: CDF vs. length of residency, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
Superimposing graphs reveals mass-count disparity
[Plots: the "Count" and "Mass" CDFs superimposed, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
Every x value along the curves reveals how many of the residencies account for how much of the mass
For example, in Crafty, 55% of the (shortest) residencies account for only 5% of the mass
• Which means the other 45% (longer) residencies account for 95% of the mass
The joint-ratio mass-disparity metric
[Plots: Count and Mass CDFs with the joint ratio marked by an arrow, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
The divergence between the distributions (= the mass-count disparity) can be quantified by the "joint ratio"
It's a generalization of the proverbial 20/80 principle
Definition: the joint ratio is the unique point in the graphs where the sum of the two CDFs is 1
Example: in the case of Vortex, the joint ratio is 13/87 (blue arrow in middle of plot), meaning 13% of the (longest) residencies hold 87% of the mass of the memory references, while the remaining 87% of the residencies hold only 13% of the mass
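The definition can be made concrete with a small search (a sketch of mine; with only a handful of discrete lengths the crossing is approximate, so the toy data in the test is chosen to cross exactly):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Sketch: find the point where the count CDF and the mass CDF sum to 100%.
// Returns {mass held by the longest residencies, their fraction}, both in %.
std::pair<double, double> joint_ratio(std::vector<int> lens) {
    std::sort(lens.begin(), lens.end());
    double total = 0;
    for (int l : lens) total += l;
    double mass = 0;
    for (std::size_t i = 0; i < lens.size(); ++i) {
        mass += lens[i];
        double countCdf = 100.0 * (i + 1) / lens.size(); // % of residencies <= lens[i]
        double massCdf  = 100.0 * mass / total;          // % of refs they serve
        if (countCdf + massCdf >= 100.0)                 // the crossing point
            return {100.0 - massCdf, 100.0 - countCdf};
    }
    return {0.0, 0.0};
}
```

For lengths {1, 1, 1, 9} the crossing is at 75/25 — the longest 25% of residencies hold 75% of the mass — which reads exactly like the Vortex 13/87 example above.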
The W1/2 mass-disparity metric
[Plots: Count and Mass CDFs with W1/2 marked, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
Definition: overall mass (in %) of the shorter half of the residencies
Example: in Vortex and Facerec, W1/2 is less than 5% of the references
Average W1/2 across all benchmarks is < 10% (median of W1/2 is < 5%)
The N1/2 mass-disparity metric
[Plots: Count and Mass CDFs with N1/2 marked, for data and for instructions; benchmarks: Crafty, Vortex, Facerec, Spsl]
Definition: % of longer residencies accounting for half of the mass
Example: in Vortex and Facerec, N1/2 is less than 1% of the residencies
Average N1/2 across all benchmarks is < 5% (median of N1/2 is < 1%)
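Both metrics can be sketched in a few lines (function names are mine), reusing the earlier example residency lengths {3, 2, 3, 90, 2}:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Sketch of the two mass-disparity metrics from the slides.
double w_half(std::vector<int> lens) {   // mass (%) of the shorter half of the residencies
    std::sort(lens.begin(), lens.end());
    double total = 0, shortMass = 0;
    for (int l : lens) total += l;
    for (std::size_t i = 0; i < lens.size() / 2; ++i) shortMass += lens[i];
    return 100.0 * shortMass / total;
}
double n_half(std::vector<int> lens) {   // % of longest residencies holding half the mass
    std::sort(lens.begin(), lens.end(), std::greater<int>());
    double total = 0;
    for (int l : lens) total += l;
    double mass = 0;
    for (std::size_t i = 0; i < lens.size(); ++i) {
        mass += lens[i];
        if (mass >= total / 2) return 100.0 * (i + 1) / lens.size();
    }
    return 100.0;
}
```

In the toy example the shorter half {2, 2} serves only 4 of the 100 references, while a single residency (20% of them) already holds well over half the mass.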
Let us utilize our understanding for…
DESIGNING A NEW INSERTION POLICY
Probabilistic insertion?
The mass disparity we've identified means:
• A small number of long residencies account for most memory references; but still, most residencies are short
So when randomly selecting a residency, it would likely be a short residency
Which means we have a way to approximate the future:
• Given a block about to be inserted into the cache, probabilistically speaking, we know with a high degree of certainty that it'd be disadvantageous to actually insert it…
• So we won't! Instead, we'll flip a coin…
• Heads = insert block into cache (small probability)
• Tails = insert block into a small filter (high probability)
Rationale:
• Long residencies will enjoy many coin flips, so chances are they'll eventually get into the cache
• Conversely, short residencies have little chance to get in
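The rationale can be quantified: a residency of length L enters L1 unless it loses all L flips. A small sketch of mine (assuming independent flips with bias p; the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Probability that a residency of length L wins at least one biased coin flip
// (p = probability of "heads", i.e. of inserting into L1 on a given reference).
double promote_probability(int L, double p) {
    return 1.0 - std::pow(1.0 - p, L);   // 1 - P(all L flips come up tails)
}
```

With p = 5/100 (the data-cache bias explored on a later slide), a 2-reference residency is promoted under 10% of the time, while a 90-reference one is promoted over 98% of the time.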
L1 with random filter
Design: direct-mapped L1 + small fully-associative filter w/ CAM
Insertion policy for lines not in L1: for each mem ref, flip a biased coin to decide if the line goes into the filter or into L1
SRAM is cache memory
• Not to be confused with DRAM
• Holds blocks that, by the coin flip, shouldn't be inserted into L1
Usage:
• First, search data in L1
• If not found, search in filter
• If not found, go to L2, and then use the above insertion policy
L1 with random filter
Result:
• Long residencies end up in L1
• Short residencies tend to end up in the filter
Benefit of randomness: filtering is purely statistical, eliminating the need to save any state or reuse information!
Explored filter sizes: 1KB, 2KB, and 4KB, consisting of 16, 32, and 64 lines, respectively
Results presented in slides were achieved using a 2KB filter
Exploring coin bias
Find the probability minimizing the miss rate
• High probability swamps the cache
• Low probability swamps the filter
Constant selection probabilities seem sufficient
• Data miss rate reduced by ~25% for P = 5/100
• Instruction miss rate reduced by >60% for P = 1/1000
[Plots: reduction in miss rate vs. selection probability, for data and for instructions]
Exploring coin bias
Random sampling with probability P turned out equivalent to periodic sampling at a rate of ~1/P
• Do not need real randomness
Majority of memory refs serviced by the L1 cache, whereas the majority of blocks remain in the filter; specifically:
• L1 services 80%–90% of refs
• With only ~35% of the blocks
[Plots: reduction in miss rate, for data and for instructions]
Problem – CAM is wasteful & slow
Fully-associative filter uses CAM (content-addressable memory)
• Input = address; output (on a hit) = "pointer" into SRAM saying where the associated data is
• CAM lookup is done in parallel
Parallel lookup drawbacks:
• Wastes energy
• Is slower (relative to direct-mapped)
Possible solution: introducing the "WLB"…
Wordline look-aside buffer (WLB)
WLB is a small direct-mapped lookup table caching the most recent CAM lookups
• (Recall: given an address, CAM returns a pointer into SRAM; it's a search like any search and therefore can be cached)
• Fast, low-power lookups
Filter usage when adding the WLB to it:
• First, search data in L1
• In parallel, search its address in the WLB
• If data not in L1 but WLB hits, access the SRAM without the CAM
• If data not in L1 and WLB misses, only then use the slower / wasteful CAM
• If not found, go to L2 as usual
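Functionally, the WLB is just a tiny direct-mapped table from addresses to SRAM wordlines. A sketch of mine (the 8-entry size matches the next slide; field and method names are assumptions):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: an 8-entry direct-mapped table caching recent CAM results,
// i.e. mapping an address to its wordline (pointer into the filter's SRAM).
struct WLB {
    struct Entry { bool valid = false; std::uint32_t addr = 0; int wordline = 0; };
    Entry e[8];
    int lookup(std::uint32_t addr) const {        // wordline on a hit, -1 on a miss
        const Entry& x = e[addr % 8];
        return (x.valid && x.addr == addr) ? x.wordline : -1;
    }
    void fill(std::uint32_t addr, int wordline) { // record a completed CAM lookup
        e[addr % 8] = Entry{true, addr, wordline};
    }
};
```

A miss simply means falling back to the full CAM search; the table's only job is to short-circuit repeated lookups of recently seen addresses.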
Effectiveness of WLB?
WLB is quite effective with only 8 entries (for both I and D)
• Eliminates 77% of CAM data lookups
• Eliminates 98% of CAM instruction lookups
Since the WLB is so small and simple (direct-mapped)
• It's fast and consumes extremely low power
• Therefore, it can be looked up in parallel with the main L1 cache
[Plots: eliminated CAM lookups vs. size of WLB (number of entries), for data and for instructions]
Methodology
4-wide, out-of-order micro-architecture (SimpleScalar)
• (You'll understand this when we learn out-of-order execution)
Simulated L1: 16K, 32K, 64K, with several set-associative configurations; latency:
• Direct-mapped: 1 cycle
• Set-associative: 2 cycles
Simulated filter: 2K, fully-associative, with 8-entry WLB; latency: 5 cycles
• 1 cycle for WLB (in parallel to accessing the cache)
• 3 cycles for CAM lookup
• 1 cycle for SRAM access
Simulated L2: 512K; latency: 16 cycles
Simulated main memory: latency: 350 cycles
Results – runtime
Comparing random-sampling filter cache to other common cache designs
• Outperforms a 4-way cache double its size!
• Interesting: DM's low latency compensates for conflict misses
[Bar charts: average relative improvement [%] for 16K DM + filter and 32K DM + filter]
Results – power consumption
Expectedly, DM-filtered loses to DM, because it's more complex
Direct-mapped cache reduces dynamic power, but the filter adds ~15% more leakage over 4-way
• Same size: 60%–80% reduction in dynamic power
• Double size: ~40% reduction in leakage
Conclusions
The mass-count disparity phenomenon can be leveraged for caching policies
Random sampling effectively identifies frequently used blocks
• Adding just a 2K filter is better than doubling the cache size, both in terms of IPC and power
The WLB is effective at eliminating costly CAM lookups
• Offering fast, low-power access while maintaining full-associativity benefits