6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE
SPRING 2013
LECTURE 9
ADVANCED MULTICORE CACHING
DANIEL SANCHEZ AND JOEL EMER
[BASED ON EE382A MATERIAL FROM KOZYRAKIS & SANCHEZ]
Administrivia
Project proposal due next week
2-3 pages
Idea, motivation, expected results
6.888 Spring 2013 - Sanchez and Emer - L09
Caches? Again?
Caches set performance and power of multi-core chips
Why?
Caches take ~50% of multi-core chip area
Our focus today: last-level caches (LLC)
Energy per operation         45nm      11nm
Compute
16b integer multiply         2 pJ      0.4 pJ
64b FP multiply-add          50 pJ     8 pJ
Memory
64b read, 8KB SRAM           14 pJ     2 pJ
256b read, 1MB SRAM          566 pJ    94 pJ
256b 10mm wire               310 pJ    174 pJ
256b DRAM interface          5,120 pJ  512 pJ
256b read DRAM               2,048 pJ  640 pJ
Motivations for Caching
Main benefit in uniprocessors
Reduce average memory access time (latency)
Additional crucial benefits in CMPs
Memory bandwidth amplification
Energy efficiency
Faster inter-thread communication
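The latency benefit can be made concrete with the standard average-memory-access-time identity (a textbook formula, not stated on the slide):

```latex
\mathrm{AMAT} = t_{\mathrm{hit}} + \text{miss rate} \times t_{\mathrm{miss\ penalty}}
```

For example, with a 4-cycle hit time, a 5% miss rate, and a 200-cycle miss penalty, AMAT = 4 + 0.05 × 200 = 14 cycles; halving the miss rate cuts it to 9 cycles.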
[Diagram: a CPU with a small, fast cache backed by a large, slow main memory behaves approximately like a CPU with a single large, fast memory.]
Outline
Shared vs private CMP caches
Addressing CMP caching issues
High access latency [shared]: placement, migration, replication
Lost capacity [private]: controlled replication
Interference [shared]: cache partitioning, replacement policies for shared caches
Underutilization [private]: capacity sharing
Private Caches
Low access latency
Isolation (capacity, bandwidth)
Lower bandwidth interconnect
Underutilization of resources (capacity, replicated data)
Expensive coherence, slow inter-core communication
[Diagram: tiled CMP in which each core has a private L1 and a private L2, with per-tile directories, an interconnect, and main memory behind them.]
Note, private caches are still coherent!
Shared Caches
Resource sharing (capacity, bandwidth)
Cheaper coherence, fast inter-core communication
High L2 avg. access latency
Requires high-bandwidth interconnect
Destructive interference (capacity)
[Diagram: tiled CMP in which each core has a private L1 and the L2 is a single logical shared cache, physically distributed into per-tile banks, behind a directory, an interconnect, and main memory.]
Notes
Can also have hybrid models (hierarchical cache)
E.g., parts of the LLC shared between a group of cores
Note the difference between logical and physical organization
E.g., a shared cache with a private-like chip layout
Notice anything interesting about this distributed way of implementing shared caches?
[Diagram: the same tiled chip layout, shown once with shared L2 banks and once with private L2s.]
Shared/Private Pros & Cons
                                  Private   Shared
Access latency                    Low       High
Duplication of read-shared data   Yes       No
Destructive interference          No        Yes
Resource underutilization         Yes       No
Interconnect bandwidth            Low       High
Coherence & communication cost    High      Low
Addressing Limitations
Shared cache limitations
High latency: line placement, migration, and replication
Interference: controlled sharing
Private cache limitations
Duplication of shared data: controlled replication
Underutilization: capacity stealing
Shared Caches:
Latency Reduction Techniques
Placement: make the line-to-bank mapping flexible
Normally, line address determines bank
Instead, cache line in bank close to cores that use it
Migration: move cache lines to close banks
Adapts to changing access patterns
Power-hungry, has pathological behavior
Replication: enable multiple copies (replicas) of frequently accessed read-shared lines
Lower access latency
Reduces total capacity
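The "line address determines bank" baseline can be sketched as follows; the line size and bank count are illustrative assumptions, not values from the lecture:

```python
# A minimal sketch of static line->bank placement in a banked shared LLC,
# assuming 64-byte lines and 16 banks (illustrative values).

LINE_BYTES = 64
NUM_BANKS = 16

def static_bank(addr: int) -> int:
    """Static placement: low-order line-address bits pick the bank,
    so a line's home bank is fixed regardless of which core uses it."""
    return (addr // LINE_BYTES) % NUM_BANKS
```

Flexible placement replaces this fixed function with per-line state (e.g., a level of indirection in the directory) so the home bank can be chosen near the cores that actually use the line.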
NUCA: Non-Uniform Cache Access
Idea: accept & manage differences in access latencies
Some banks are closer than others
From static to dynamic placement
Static: address bits determine bank
Dynamic: allow lines to migrate
Hopefully, important data are mostly in the nearby banks
NUCA Management
Approach: organize cache banks into bank sets
Bank set determined by address bits
Banks within the set provide cache associativity
Need to look in all the banks in a bank set
Cache lines can move within a set to get closer to the requesting CPU
Works because of LRU behavior: most hits normally happen in the first few ways
Mechanisms: mapping, searching, migration
Mapping: simple, fair, shared
Searching: incremental, multicast, smart
Migration: data moves closer as it is accessed; evicted data moves farther away
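The mapping/migration mechanisms above can be sketched as a toy model: the banks in a bank set act like the ways of a set-associative cache, a hit swaps the line one bank closer to the core, and a miss installs the line in the farthest bank. The set size and the one-step swap migration are illustrative assumptions:

```python
# Toy D-NUCA bank set: banks[0] is closest to the core, banks[-1] farthest.

NUM_BANKS_IN_SET = 4   # illustrative bank-set size

class BankSet:
    def __init__(self):
        self.banks = [None] * NUM_BANKS_IN_SET

    def access(self, line) -> bool:
        if line in self.banks:
            i = self.banks.index(line)
            if i > 0:
                # gradual migration: swap with the next-closer bank
                self.banks[i - 1], self.banks[i] = self.banks[i], self.banks[i - 1]
            return True   # hit
        # miss: fill into the farthest bank (its old occupant is evicted)
        self.banks[-1] = line
        return False
```

Repeated hits gradually pull a hot line into the closest bank, which is exactly the "important data end up in nearby banks" hope from the previous slide.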
NUCA & Multi-core
[Figure: per-bank access-density maps (darker = more accesses) for OLTP (on-line transaction processing) and Ocean (a scientific code).]
NUCA Discussion & Ideas
What are the complications of dynamic NUCA?
Ideas for improvements
Centralized tags but distributed data
Prediction of bank search
See syllabus for additional refs
Victim Replication
Idea: use local L2 bank as victim cache
Each line has a single home L2 bank
When evicting from L1, write data in local L2 bank
The victim can evict invalid lines, replicas, and unshared lines
It can't evict actively shared blocks whose home is the local L2
Implementation: simple modifications to shared L2
On a miss, search local L2 slice before remote L2 slices
Directory or banking structure does not change
A victim does not change the sharer info (the line is still tracked as if in the local L1)
Invalidations need to check both L1 and local L2 bank
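The eviction rules above can be sketched as a filter on the local L2 slot an L1 victim would occupy; the field names are illustrative, not from the paper:

```python
# Sketch of the victim-replication rules: an L1 victim may be written into
# the local L2 bank only over an invalid line, another replica, or an
# unshared line -- never over an actively shared line homed here.

from dataclasses import dataclass

@dataclass
class L2Line:
    valid: bool
    is_replica: bool   # a victim copy living away from its home bank
    sharers: int       # sharer count (meaningful for home lines)

def can_replace_with_victim(candidate: L2Line) -> bool:
    if not candidate.valid:
        return True                # invalid lines are free
    if candidate.is_replica:
        return True                # replicas are expendable
    return candidate.sharers <= 1  # unshared home lines may be evicted
```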
Pros/cons over shared and private?
Adaptive Selective Replication
Very useful profiling approach
Private caches always replicate, lose capacity
Idea: cost/benefit analysis to decide how much to replicate
Benefit: faster hits on replicas
Cost: more misses due to lost capacity
Implementation:
On an L1 eviction, probabilistically choose whether to keep the block as a replica
Adapt replication probability
Small victim tag buffer to profile extra misses
Count hits on replicas to estimate gains on hit latency
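The cost/benefit loop above can be sketched as a controller that replicates with probability p and nudges p each epoch; the update rule and step size are assumptions, not the paper's exact policy:

```python
# Sketch of adaptive selective replication's control loop: compare cycles
# saved by replica hits against cycles lost to the extra misses observed
# in the victim tag buffer, and adjust the replication probability.

import random

class ASRController:
    def __init__(self, p: float = 0.5, step: float = 0.1):
        self.p = p          # current replication probability
        self.step = step

    def should_replicate(self) -> bool:
        return random.random() < self.p

    def epoch_update(self, cycles_saved_by_replicas: int,
                     cycles_lost_to_extra_misses: int) -> None:
        if cycles_saved_by_replicas > cycles_lost_to_extra_misses:
            self.p = min(1.0, self.p + self.step)   # benefit wins: replicate more
        else:
            self.p = max(0.0, self.p - self.step)   # cost wins: back off
```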
Capacity sharing:
Dynamic Spill-Receive
Capacity sharing by spilling evicted lines to nearby L2s
Caches can be spillers or receivers
Spilled lines are served using cache coherence
Implementation: dedicate a few sets in each cache to always-spill or always-receive, and measure which one works best
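The set-dueling decision above can be sketched with a saturating counter that tracks which sample group misses more; the sample size and counter width are illustrative assumptions:

```python
# Sketch of dynamic spill-receive's set dueling: a few sets always spill,
# a few always receive, and the remaining (follower) sets adopt whichever
# sampled policy is missing less.

SAMPLE_SETS = 32       # dedicated sets per policy (illustrative)
PSEL_MAX = 1023        # 10-bit saturating counter (illustrative)

class DSRPolicy:
    def __init__(self):
        self.psel = PSEL_MAX // 2   # start undecided

    def on_miss(self, set_idx: int) -> None:
        if set_idx < SAMPLE_SETS:                 # always-spill set missed
            self.psel = min(PSEL_MAX, self.psel + 1)
        elif set_idx < 2 * SAMPLE_SETS:           # always-receive set missed
            self.psel = max(0, self.psel - 1)

    def follower_role(self) -> str:
        # low psel means spill sample sets missed less, so act as a spiller
        return "spill" if self.psel <= PSEL_MAX // 2 else "receive"
```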
Example of Cache Interference
Slowdown for SPEC CPU2000 apps when running in parallel with swim, sharing the L2 cache
[Bar chart: run-time slowdown per app (swim in 2nd core), ranging from baseline up to ~5x.]
Can OS Priorities Solve the Problem?
What is the problem with OS priority mechanisms?
Run-time slowdown with thread prioritization (swim in 2nd core)
[Bar chart: per-app run-time slowdown, comparing normal priority against SPEC-app-high/swim-low priority, again ranging from baseline up to ~5x.]
Is Interference a Common Problem?
Need mechanisms for isolation & QoS
[Scatter plot "Performance Impact": slowdown factor (0x-6x) across ~700 SPEC app pairs. Roughly 30% of pairs exhibit 20%-500% slowdown, another 20% exhibit 10-20% slowdown, and the rest exhibit <10% slowdown.]
Isolation via Cache Partitioning
Idea: eliminate interference by partitioning the capacity of the cache
Different apps and different uses get their own partition
We need two techniques
A policy to assign the capacities to cores
A mechanism to enforce capacity assignments
Enforcing Allocations
Way partitioning: Restrict evictions/fills to specific ways
How many partitions can we have?
What happens with associativity?
Can we partition the cache by sets?
Issues and challenges?
Any other schemes?
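Way partitioning as described above can be sketched as a constrained victim choice: each core owns a subset of the ways, and a fill evicts only within that subset, so cores cannot displace each other's data. LRU-within-partition is an assumed victim policy:

```python
# Sketch of way-partitioned victim selection: evictions are restricted to
# the ways owned by the requesting core's partition.

def pick_victim(owned_ways: list[int], lru_rank: dict[int, int]) -> int:
    """owned_ways: way indices this core may evict.
    lru_rank[w]: recency rank of way w (higher = older).
    Returns the least recently used way within the core's partition."""
    return max(owned_ways, key=lambda w: lru_rank[w])
```

This also makes the slide's questions concrete: the number of partitions is bounded by the associativity, and a small partition sees correspondingly reduced associativity.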
Capacity Management Policies
Capitalist (most systems today)
No management
If you can generate the requests, you take over resources
Communist
Equal distribution of resources across all apps
Guarantees fairness but not best utilization
Elitist
Highest prio for one app through biased resource allocation
Best effort for the rest of the apps
Utilitarian
Focus on overall efficiency (e.g., throughput)
Provide resources to whoever needs it the most
Utility-based Cache Partitioning
Idea: assign capacity to apps based on how well they use it
Maximize the total reduction in misses
Implementation: find utility of using each way
Naïve: one auxiliary set of L2 tags per core, count hits/way
Dynamic set sampling: simulate a small number of sets
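The allocation step can be sketched as a greedy loop over the per-way hit counters gathered above: ways are handed out one at a time to the core with the highest marginal gain. Greedy assignment is one common approximation, and the counter values in the test are made up:

```python
# Sketch of utility-based allocation from per-way hit counters.

def ucp_partition(way_hits: list[list[int]], assoc: int) -> list[int]:
    """way_hits[c][i]: hits core c would gain from its (i+1)-th way.
    Returns the number of ways allocated to each core."""
    alloc = [0] * len(way_hits)
    for _ in range(assoc):
        def marginal(c: int) -> int:
            # utility of giving core c one more way
            return way_hits[c][alloc[c]] if alloc[c] < assoc else -1
        best = max(range(len(way_hits)), key=marginal)
        alloc[best] += 1
    return alloc
```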
Replacement policies for CMPs
Replacement policy keeps a rank of blocks
Select least desirable candidate on an eviction
Control how to change the block's rank on an insertion or hit (promotion)
LRU
Select last line in LRU chain for eviction
Put block in head of chain (MRU) on ins/promotion
Does not work well with streaming/scanning applications (many lines without reuse) or under thrashing (working set > cache size)
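A minimal model of the policy just described: evict the tail of the recency chain, insert and promote at the head (MRU). The usage below also shows the thrashing pathology: a working set one line larger than the cache gets zero hits.

```python
# Minimal LRU cache: OrderedDict front = MRU, back = LRU.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line) -> bool:
        hit = line in self.lines
        if not hit and len(self.lines) >= self.capacity:
            self.lines.popitem(last=True)         # evict the LRU tail
        self.lines[line] = None
        self.lines.move_to_end(line, last=False)  # insert/promote to MRU
        return hit
```

Looping over 4 lines in a 3-line cache misses on every access, which is exactly why the insertion policies on the next slide exist.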
Replacement Policies: DIP
LRU insertion policy (LIP)
Insert in the LRU position, promote to MRU on a hit → scan-resistance
Bimodal insertion policy (BIP)
Randomly insert a few lines at MRU, the rest at LRU → thrash-resistance
Dynamic insertion policy (DIP)
Profile (via set dueling) and choose between LRU and BIP
Achieves good performance on LRU-friendly workloads
Thread-aware DIP
Select between BIP and LRU per thread
Scanning/thrashing/low-utility applications use BIP and get less effective capacity → similar effect to UCP
S/D/TAD-RRIP, SHiP, …
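BIP's bimodal insertion can be sketched as a coin flip on each fill; the value of epsilon here is illustrative:

```python
# Sketch of BIP: most fills land at the LRU position (as in LIP), a small
# random fraction at MRU, which lets part of a large working set stay
# resident instead of thrashing.

import random

EPSILON = 1 / 32   # fraction of fills inserted at MRU (illustrative)

def bip_insert_position(assoc: int) -> int:
    """Recency position for a newly filled line: 0 = MRU, assoc-1 = LRU."""
    if random.random() < EPSILON:
        return 0           # rare MRU insertion
    return assoc - 1       # common case: insert at LRU
```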