Lecture 17 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 18: NUCA & Prefetching
Fall 2020
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
[Figure: correlating prediction preview — a history table tracks the latest address (A1); a correlation table entry A3 → A0, A1 triggers "Prefetch A3"]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 17 Slide 2 EECS 470
Non-Uniform Cache Architecture
Proposed at ASPLOS 2002 by UT-Austin (Kim, Burger, Keckler)
Facts
❒ Large shared on-die L2
❒ Wire delay dominating on-die cache
L2 access latency vs. capacity:
❒ 3 cycles, 1MB (180nm, 1999)
❒ 11 cycles, 4MB (90nm, 2004)
❒ 24 cycles, 16MB (50nm, 2010)
Lecture 17 Slide 3 EECS 470
Multi-banked L2 cache
2MB @ 130nm: bank = 128KB; bank access time = 3 cycles; interconnect delay = 8 cycles → 11 cycles total
Lecture 17 Slide 4 EECS 470
Multi-banked L2 cache
16MB @ 50nm: bank = 64KB; bank access time = 3 cycles; interconnect delay = 44 cycles → 47 cycles total
Lecture 17 Slide 5 EECS 470
Static NUCA-1
Use private per-bank channels
Each bank has its own distinct access latency
Data location is statically decided by its address
Average access latency = 34.2 cycles
Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 bank organization — per-bank address and data buses; banks divided into sub-banks with predecoder, wordline driver and decoder, sense amplifiers, and a tag array]
Lecture 17 Slide 6 EECS 470
Static NUCA-2
Use a 2D switched network to alleviate wire area overhead
Average access latency = 24.2 cycles
Wire overhead = 5.9%
[Figure: S-NUCA-2 — banks connected by switches on a 2D network; each bank with tag array, predecoder, and wordline driver and decoder]
Lecture 17 Slide 7 EECS 470
Dynamic NUCA
Data can dynamically migrate
Move frequently used cache lines closer to the CPU
Lecture 17 Slide 8 EECS 470
Dynamic NUCA
Simple Mapping
All 4 ways of each bank set need to be searched
Farther bank sets → longer access
[Figure: 8 bank sets, ways 0–3; one set spans one bank column]
Lecture 17 Slide 9 EECS 470
Dynamic NUCA
Fair Mapping
Average access time across all bank sets is equal
[Figure: 8 bank sets, ways 0–3; sets routed so each sees equal average latency]
Lecture 17 Slide 10 EECS 470
Dynamic NUCA
Shared Mapping
Farther bank sets share the closest banks
[Figure: 8 bank sets, ways 0–3; the closest banks are shared among multiple sets]
Lecture 17 Slide 11 EECS 470
The memory wall
Today: 1 memory access ≈ 500 arithmetic ops
How to reduce memory stalls for existing SW?
[Figure: processor vs. memory performance, 1985–2010, log scale 1–10,000 — the gap widens steadily]
Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.
Lecture 17 Slide 12 EECS 470
Conventional approach #1: Avoid main memory accesses
Cache hierarchies:
Trade off capacity for speed
Add more cache levels?
Diminishing locality returns
No help for shared data in MPs
[Figure: CPU → 64K cache (2 clk) → 4M cache (20 clk) → main memory (200 clk); write data flows back from the CPU]
Lecture 17 Slide 13 EECS 470
Conventional approach #2: Hide memory latency
Out-of-order execution:
Overlap compute & memory stalls
Expand the OoO instruction window?
Issue & load-store logic hard to scale
No help for dependent instructions
[Figure: execution timelines — in-order vs. OoO, with OoO overlapping compute and memory stalls]
Lecture 17 Slide 14 EECS 470
Challenges of server apps
Frequent sharing & synchronization
Many linked data structures
❒ E.g., linked list, B+tree, …
❒ Dependent miss chains [Ranganathan98]
Low memory-level parallelism [Chou04]
50–66% of time stalled on memory [Trancoso97][Barroso98][Ailamaki99]…
Our goals:
Read misses: fetch earlier & in parallel
Write misses: never stall
[Figure: execution timelines — today vs. goal, compute vs. memory stall]
Lecture 17 Slide 15 EECS 470
What is Prefetching?
• Fetch memory ahead of time
• Targets compulsory, capacity, & coherence misses
Big challenges:
1. Knowing "what" to fetch
• Fetching useless info wastes valuable resources
2. Knowing "when" to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of "pre"-fetching
Lecture 17 Slide 16 EECS 470
Software Prefetching
Compiler/programmer places prefetch instructions
❒ Requires ISA support
❒ Why not use regular loads?
❒ Found in recent ISAs such as SPARC V9
Prefetch into
❒ Register (binding)
❒ Caches (non-binding)
Lecture 17 Slide 17 EECS 470
Software Prefetching (Cont.)
e.g.,

for (I = 1; I < rows; I++)
  for (J = 1; J < columns; J++) {
    prefetch(&x[I+1][J]);
    sum = sum + x[I][J];
  }
Lecture 17 Slide 18 EECS 470
Hardware Prefetching
What to prefetch?
❒ One block spatially ahead?
❒ Use address predictors → works for regular patterns (x, x+8, x+16, …)
When to prefetch?
❒ On every reference
❒ On every miss
❒ When prior prefetched data is referenced
❒ Upon last processor reference
❒ Use more complicated rate-matching mechanisms
Where to put prefetched data?
❒ Auxiliary buffers
❒ Caches
Lecture 17 Slide 19 EECS 470
Generalized Access Pattern Prefetchers
How do you prefetch
1. Heap data structures?
2. Indirect array accesses?
3. Generalized memory access patterns?
Taxonomy of approaches:
• Spatial prefetchers
• Address-correlating prefetchers
• Precomputation prefetchers
Lecture 17 Slide 20 EECS 470
Spatial Locality and Sequential Prefetching
Works well for the I-cache
❒ Instruction fetching tends to access memory sequentially
Doesn't work very well for the D-cache
❒ More irregular access patterns
❒ Regular patterns may have non-unit stride (e.g., matrix code)
Relatively easy to implement
❒ Large cache block sizes already have the effect of prefetching
❒ After loading one cache line, start loading the next line automatically if the line is not in cache and the bus is not busy
Lecture 17 Slide 21 EECS 470
PC-based Stride Prefetchers
Array/stride accesses correlated to a static load instruction [Baer'91]
Reference Prediction Table (indexed by load instruction PC):
❒ Record load PC, last address & stride between the last two addresses
❒ On a load → compute the distance between the current & last address
• If the new distance matches the old stride → found a pattern, go ahead and prefetch "current addr + stride"
• Update "last addr" and "last stride" for the next lookup
Table entry: Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags
Lecture 17 Slide 22 EECS 470
Stream Buffers [Jouppi]
Each stream buffer holds one stream of sequentially prefetched cache lines
On a load miss, check the head of all stream buffers for an address match
❒ If hit, pop the entry from the FIFO, update the cache with the data
❒ If not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams
[Figure: several FIFO stream buffers sitting between the D-cache and the memory interface]
Indirect array accesses (e.g., A[B[i]])?
No cache pollution
Lecture 17 Slide 23 EECS 470
Global History Buffer (GHB) [Nesbit'04]
Holds miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
Linked-list walk is short compared with L2 miss latency
[Figure: an index table (keyed by load PC) pointing into a FIFO Global History Buffer of miss addresses]
Lecture 17 Slide 24 EECS 470
GHB – Deltas
Miss address stream: 27 28 36 44 45 49 53 54 62 70 71
Delta stream: 1 8 8 1 4 4 1 8 8 1
Current delta = 1; prior occurrences of delta 1 were followed by deltas 8 and 4
Width prefetching (all successors of the current delta): 71 + 8 => 79, 71 + 4 => 75
Depth prefetching (follow the delta chain): 71 + 8 => 79, 79 + 8 => 87
Hybrid: 71 + 8 => 79, 79 + 8 => 87, 71 + 4 => 75
[Figure: Markov graph over deltas 1, 4, 8 with transition probabilities ≈ .3/.7]
Lecture 17 Slide 25 EECS 470
GHB – Stride Prefetching
GHB-Stride uses the PC to access the index table
The linked lists contain the local history of each load
Compare the last two local strides; if the same, then prefetch n+s, n+2s, …, n+ds
[Figure: index table keyed by PC; head pointers link each load's miss addresses (A B C A B C B …) through the GHB]
Lecture 17 Slide 26 EECS 470
GHB – Delta Correlation (PC/DC)
Form delta correlations within each load's local history
For example, consider the local miss address stream:
Addresses: 0 1 2 64 65 66 128 129
Deltas: 1 1 62 1 1 62 1
Correlation → prefetch predictions:
(1, 1) → 62, 1, 1
(1, 62) → 1, 1, 62
(62, 1) → 1, 62, 1
Best results among data prefetchers for SPEC2K [GraciaPérez'04]
Lecture 17 Slide 27 EECS 470
Spatial Correlation
Repetitive spatial relationships between accesses
• Irregular layout → non-strided
• Sparse → can't capture with cache blocks
• But repetitive → predict to improve memory-level parallelism
Not to be confused with spatial locality:
• Patterns may repeat over large (e.g., few kB) regions
Lecture 17 Slide 28 EECS 470
Example: Spatial Correlation [Somogyi'06]
Large-scale spatial access patterns; the pattern is a function of the program
[Figure: an 8kB database page — page header, tuple data, tuple slot index — accessed in a repeating spatial pattern]
Lecture 17 Slide 29 EECS 470
SMS Operation Summary
1. Observe: record which blocks of a region are touched between cache hits (e.g., PC1 ld A+4, PC2 ld A, PC3 ld A+3)
2. Store: on eviction (e.g., of A+3), save the region's spatial pattern (e.g., 1100000001101…, 1100001010001…) in a pattern history table
3. Predict: on the next trigger access to a new region (e.g., PC1 ld B+4, PC2 ld B, PC3 ld B+3), prefetch the stored pattern
Lecture 17 Slide 30 EECS 470
Correlation-Based Prefetching [Charney 96]
Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
[Figure: Markov model over addresses A–F with edge transition probabilities learned from the stream]
Lecture 17 Slide 31 EECS 470
Correlation-Based Prefetching (Cont.)
Track the likely next addresses after seeing a particular address
Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)
Prefetch accuracy can be improved by using longer history
❒ Decide which address to prefetch next by looking at the last K load addresses instead of just the current one
❒ E.g., index with the XOR of the data addresses from the last K loads
❒ Using a history of a couple of loads can increase accuracy dramatically
This technique can also be applied to just the load miss stream
Table entry: Load Data Addr (tag) | Prefetch Candidate 1, Confidence | … | Prefetch Candidate N, Confidence
Lecture 17 Slide 32 EECS 470
Example: Markov Prefetchers [Joseph'07]
• Correlate subsequent cache misses
• Trigger prefetch on a miss
• Width prefetching: predict & prefetch four candidates
❒ Predicting only one results in low coverage!
• Prefetch into a buffer
Lecture 17 Slide 33 EECS 470
Tag-Correlating Prefetchers [Kaxiras'04]
Correlation-based prefetching:
• Tables are too big
• They grow with data working set size
Much similarity in block addresses mapping to sets
• When marching through arrays, tags across sets are identical!
• Save space in correlation tables by correlating tags only (not full addresses)
Lecture 17 Slide 34 EECS 470
Revisit: Global History Buffer (GHB) [Nesbit'04]
Holds miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
Linked-list walk is short compared with L2 miss latency
[Figure: the index table, now keyed by miss address, pointing into the FIFO Global History Buffer]
Lecture 17 Slide 35 EECS 470
GHB (G/AC) – Example
Miss address stream: 27 28 29 27 28 29 28
The index table is keyed by miss address; head pointers link each address's prior occurrences through the GHB. On the current miss to 28, walking 28's linked list shows that 29 followed it before → prefetch 29
[Figure: GHB with per-address linked lists for 27, 28, and 29]
EECS 470
Linked-Data Prefetchers
• When traversing a linked structure:
• Learn/record load-to-load dependences
• Get ahead of the processor by traversing the structure in an FSM
• The FSM gets ahead of the processor by skipping computation
❒ Similar proposals with SW help (e.g., helper/scout threads)
• Example:

while (p != NULL) {
  if (p->key == MATCH)
    p->val++;
  p = p->next;
}
EECS 470
Linked Data Structure Access
[Figure: list nodes with fields at offsets 0, 4, 8, 12, 14; the next pointer lives at offset 14 of each node]
EECS 470
Detecting Recursive Accesses
Example: p = p->next;
The same static load, LOAD rdest, rsrc(14), runs repeatedly: with rsrc = a it loads b (the producer of b); on the next iteration, with rsrc = b, it consumes b and produces c. The value loaded by one instance and the base register of the next instance hold the same value.
[Figure: nodes at addresses a, b, c, each with its next pointer at offset 14]
EECS 470
Roth, Moshovos, Sohi (HW) [Roth'98]
PC-A: LOAD rdest, rsrc(14) with rsrc = a → produces b
PC-B: LOAD rdest, rsrc(14) with rsrc = b → consumes b (and produces c)
The value loaded by PC-A and the base register of PC-B hold the same value
Potential Producer Window: Memory Value Loaded | Producer Instruction Address → (b, PC-A)
Correlation Table: Producer Instruction Address | Consumer Instruction Address | Consumer Instruction Template → (PC-A, PC-B, LOAD r,r(14))
Lecture 17 Slide 40 EECS 470
Runahead Execution [Mutlu'03]
Memory-level parallelism of a large window without building it!
When the oldest instruction is an L2 miss:
❒ Checkpoint state and enter runahead mode
In runahead mode:
❒ Instructions are speculatively pre-executed
❒ To discover other L2 misses
❒ The processor continues to run
Runahead mode ends when the original L2 miss returns
❒ The checkpoint is restored and normal execution resumes
Lecture 17 Slide 41 EECS 470
Runahead Example
Perfect caches: Compute → Load 1 hit → Compute → Load 2 hit → Compute
Small window: Compute → Load 1 miss → stall → Load 1 hit → Compute → Load 2 miss → stall → Load 2 hit → Compute
Runahead: Compute → Load 1 miss → runahead (discovers Load 2 miss) → Load 1 hit → Compute → Load 2 hit → Compute ⇒ saved cycles
Lecture 17 Slide 42 EECS 470
Benefits of Runahead Execution
Avoids stalling during an L2 cache miss!
Pre-executed loads/stores generate accurate prefetches
❒ Both regular and irregular access patterns
Instructions on the predicted path
❒ Prefetched into the I-cache and L2
The hardware prefetcher and branch predictor
❒ Are trained using future access information
Lecture 17 Slide 43 EECS 470
Improving Cache Performance: Summary
Miss rate
❒ Large block size
❒ Higher associativity
❒ Victim caches
❒ Skewed-/pseudo-associativity
❒ Hardware/software prefetching
❒ Compiler optimizations
Miss penalty
❒ Give priority to read misses over writes/writebacks
❒ Subblock placement
❒ Early restart and critical word first
❒ Non-blocking caches
❒ Multi-level caches
Hit time (difficult?)
❒ Small and simple caches
❒ Avoiding translation during L1 indexing (later)
❒ Pipelining writes for fast write hits
❒ Subblock placement for fast write hits in write-through caches