Lec5a Supp
Transcript of Lec5a Supp
-
7/29/2019 Lec5a Supp
Review: Major Components of a Computer
[Figure: Processor (Control + Datapath), Input and Output Devices, Cache, Main Memory, Secondary Memory (Disk).]
CSE431 Chapter 5A.2 Irwin, PSU, 2008
Processor-Memory Performance Gap
[Figure: Performance vs. year, 1980-2004, log scale. Per Moore's Law, processor performance grows 55%/year (2X/1.5yr) while DRAM performance grows only 7%/year (2X/10yrs), so the Processor-Memory Performance Gap grows 50%/year.]
The Memory Wall
Processor vs DRAM speed disparity continues to grow
[Figure: Clocks per instruction (Core) and clocks per DRAM access (Memory), log scale, from VAX/1980 through PPro/1996 to 2010+.]
Good memory hierarchy (cache) design is increasingly important to overall performance
The Memory Hierarchy Goal
Fact: Large memories are slow and fast memories are small
How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
- With hierarchy
- With parallelism
Memory Hierarchy Technologies
Caches use SRAM for speed and technology compatibility
- Fast (typical access times of 0.5 to 2.5 nsec)
- Low density (6 transistor cells), higher power, expensive ($2000 to $5000 per GB in 2008)
- Static: content will last forever (as long as power is left on)
Main memory uses DRAM for size (density)
- Slower (typical access times of 50 to 70 nsec)
- High density (1 transistor cells), lower power, cheaper ($20 to $75 per GB in 2008)
- Dynamic: needs to be refreshed regularly (~ every 8 ms)
  - refreshing consumes 1% to 2% of the active cycles of the DRAM
- Addresses divided into 2 halves (row and column)
  - RAS or Row Access Strobe triggering the row decoder
  - CAS or Column Access Strobe triggering the column selector
The Memory Hierarchy: Why Does it Work?
Temporal Locality (locality in time)
- If a memory location is referenced then it will tend to be referenced again soon
- So keep most recently accessed data items closer to the processor
Spatial Locality (locality in space)
- If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
- So move blocks consisting of contiguous words closer to the processor
The Memory Hierarchy: Terminology
Block (or line): the minimum unit of information that is present (or not) in a cache
Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy
- Hit Time: time to access that level, which consists of RAM access time + time to determine hit/miss
Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor
Hit Time << Miss Penalty
Characteristics of the Memory Hierarchy
[Figure: The memory hierarchy pyramid. Processor at the top (4-8 bytes per word), then L1$ (8-32 bytes per block), L2$ (1 to 4 blocks), Main Memory (1,024+ bytes; disk sector = page), and Secondary Memory (disk). Distance from the processor and access time increase going down the hierarchy, as does the (relative) size of the memory at each level. The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.]
How is the Hierarchy Managed?
registers ↔ memory
- by compiler (programmer?)
cache ↔ main memory
- by the cache controller hardware
main memory ↔ disks
- by the operating system (virtual memory)
- virtual to physical address mapping assisted by the hardware (TLB)
- by the programmer (files)
Cache Basics
Two questions to answer (in hardware):
- Q1: How do we know if a data item is in the cache?
- Q2: If it is, how do we find it?
Direct mapped
- Each memory block is mapped to exactly one block in the cache
  - lots of lower level blocks must share blocks in the cache
- Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache)
- Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1)
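The tag/index/offset split described above can be sketched in Python (the function name and arguments are illustrative, not from the lecture):

```python
def split_address(addr, num_blocks, block_size_bytes):
    """Split a byte address into (tag, index, offset) for a direct mapped cache.

    Hypothetical helper: num_blocks and block_size_bytes are powers of two.
    """
    block_addr = addr // block_size_bytes   # strip the byte-in-block offset
    index = block_addr % num_blocks         # (block address) modulo (# of blocks)
    tag = block_addr // num_blocks          # upper portion of the address
    offset = addr % block_size_bytes
    return tag, index, offset

# A word at byte address 44 (word address 11) in a 4-block cache of
# one-word (4B) blocks: index = 11 mod 4 = 3, tag = 11 // 4 = 2
print(split_address(44, num_blocks=4, block_size_bytes=4))  # → (2, 3, 0)
```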
Caching: A Simple First Example
[Figure: A four-block direct mapped cache (Index 00, 01, 10, 11; each entry has a Valid bit, Tag, and Data) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; two low order bits define the byte in the word (32b words).]
Q1: Is it there?
- Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache
Q2: How do we find it?
- Use the next 2 low order memory address bits, the index, to determine which cache block (i.e., (block address) modulo (# of blocks in the cache))
Direct Mapped Cache
Consider the word reference string 0 1 2 3 4 3 4 15. Start with an empty cache - all blocks initially marked as not valid.
- 0 miss, 1 miss, 2 miss, 3 miss: Mem(0), Mem(1), Mem(2), Mem(3) are installed with tag 00
- 4 miss: Mem(4) (tag 01) replaces Mem(0) in block 0
- 3 hit, 4 hit
- 15 miss: Mem(15) (tag 11) replaces Mem(3) in block 3
- 8 requests, 6 misses
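The trace above can be checked with a small simulator (a sketch; names are illustrative, not from the lecture):

```python
def simulate_direct_mapped(refs, num_blocks):
    """Count misses for a word reference string on a direct mapped cache
    of one-word blocks."""
    blocks = {}  # index -> tag of the resident block (absent = not valid)
    misses = 0
    for word_addr in refs:
        index, tag = word_addr % num_blocks, word_addr // num_blocks
        if blocks.get(index) != tag:   # not valid, or tag mismatch -> miss
            misses += 1
            blocks[index] = tag        # install the new block
    return misses

print(simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15], num_blocks=4))  # → 6
```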
Multiword Block Direct Mapped Cache
[Figure: A four-word block direct mapped cache with 256 entries. The 32-bit address 31 30 . . . 13 12 11 . . . 4 3 2 1 0 is split into a 20-bit Tag, an 8-bit Index, a 2-bit Block offset, and a 2-bit Byte offset. The index selects one of 256 (Valid, Tag, Data) entries (rows 1, 2, . . ., 253, 254, 255); the tag compare produces Hit, and a multiplexer driven by the block offset selects the 32-bit Data word.]
What kind of locality are we taking advantage of?
Taking Advantage of Spatial Locality
Consider the same word reference string 0 1 2 3 4 3 4 15 with two-word blocks. Start with an empty cache - all blocks initially marked as not valid.
- 0 miss: block Mem(1) Mem(0) is installed (tag 00); 1 hit
- 2 miss: block Mem(3) Mem(2) is installed (tag 00); 3 hit
- 4 miss: block Mem(5) Mem(4) (tag 01) replaces Mem(1) Mem(0); 3 hit, 4 hit
- 15 miss: block Mem(15) Mem(14) (tag 11) replaces Mem(3) Mem(2)
- 8 requests, 4 misses
Miss Rate vs Block Size vs Cache Size
[Figure: Miss rate (%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes 8 KB, 16 KB, 64 KB, and 256 KB.]
Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)
Cache Field Sizes
The number of bits in a cache includes both the storage for data and for the tags
- For a direct mapped cache with 2^n blocks, n bits are used for the index
- For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
- The size of the tag field is then 32 - (n + m + 2) for a 32-bit address
The total number of bits in a direct-mapped cache is then 2^n × (block size + tag field size + valid field size)
How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks assuming a 32-bit address?
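A worked answer to the question above, as a sketch (assuming one valid bit per block and no dirty bit, matching the formula; the function name is illustrative):

```python
def direct_mapped_cache_bits(cache_data_bytes, words_per_block, addr_bits=32):
    """Total storage bits: 2^n x (data bits + tag bits + valid bit)."""
    n = (cache_data_bytes // (4 * words_per_block)).bit_length() - 1  # index bits
    m = words_per_block.bit_length() - 1                              # word-in-block bits
    tag_bits = addr_bits - n - m - 2                                  # minus index, word, byte offsets
    block_bits = 32 * words_per_block                                 # data bits per block
    return (2 ** n) * (block_bits + tag_bits + 1)                     # +1 for the valid bit

# 16KB of data, 4-word blocks: 1024 blocks -> n = 10, m = 2, tag = 18,
# so total = 1024 x (128 + 18 + 1) = 150,528 bits (about 18.4KB for 16KB of data)
print(direct_mapped_cache_bits(16 * 1024, 4))  # → 150528
```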
Handling Cache Hits
Read hits (I$ and D$)
- this is what we want!
Write hits (D$ only)
1. require the cache and memory to be consistent
   - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
   - writes run at the speed of the next level in the memory hierarchy - so slow! - or can use a write buffer and stall only if the write buffer is full
2. allow cache and memory to be inconsistent
   - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is evicted)
   - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted - can use a write buffer to help buffer write-backs of dirty blocks
Sources of Cache Misses
Compulsory (cold start or process migration, first reference):
- First access to a block, "cold" fact of life, not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant
- Solution: increase block size (increases miss penalty; very large blocks could increase the miss rate)
Capacity:
- The cache cannot contain all the blocks accessed by the program
- Solution: increase cache size (may increase access time)
Conflict (collision):
- Multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity (stay tuned) (may increase access time)
Handling Cache Misses (Single Word Blocks)
Read misses (I$ and D$)
- stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
Write misses (D$ only)
1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
or
2. Write allocate - just write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall
or
3. No-write allocate - skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn't full
Multiword Block Considerations
Read misses (I$ and D$)
- Processed the same as for single word blocks - a miss returns the entire block from memory
- Miss penalty grows as block size grows
  - Early restart - the processor resumes execution as soon as the requested word of the block is returned
  - Requested word first - the requested word is transferred from the memory to the cache (and processor) first
- Nonblocking cache - allows the processor to continue to access the cache while the cache is handling an earlier miss
Write misses (D$)
- If using write allocate, must first fetch the block from memory and then write the word to the block (or could end up with a garbled block in the cache - e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block)
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
[Figure: CPU with on-chip Cache, connected by a bus (32-bit data and 32-bit addr per cycle) to DRAM Memory.]
One word wide organization (one word wide bus and one word wide memory)
Assume
1. 1 memory bus clock cycle to send the address
2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, 4th words (column access time)
3. 1 memory bus clock cycle to return a word of data
Memory-Bus to Cache bandwidth
- number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
Review: (DDR) SDRAM Operation
After a row is read into the SRAM register
- Input CAS as the starting "burst" address along with a burst length
- Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
  - The memory bus clock controls transfer of successive words in the burst
[Figure: An N rows x N cols DRAM array with a Row Address decoder, an N x M SRAM row register, a Column Address selector (+1), and an M-bit Output. Timing: RAS latches the Row Address, CAS latches the Col Address, then the 1st, 2nd, 3rd, and 4th M-bit accesses stream out within one cycle time before the next Row Address.]
DRAM Size Increase
Add a table like figure 5.12 to show DRAM growth since 1980
CSE431 Chapter 5A.29 Irwin, PSU, 2008
One Word Wide Bus, One Word Blocks
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory
- 1 memory bus clock cycle to send the address
- 15 memory bus clock cycles to read DRAM
- 1 memory bus clock cycle to return the data
- 17 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
- 4/17 = 0.235 bytes per memory bus clock
One Word Wide Bus, Four Word Blocks
What if the block size is four words and each word is in a different DRAM row?
- 1 cycle to send the 1st address
- 4 x 15 = 60 cycles to read DRAM
- 1 cycle to return the last data word
- 62 total clock cycles miss penalty
[Timing: four 15-cycle DRAM row accesses; each word's bus transfer overlaps the next access.]
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
- (4 x 4)/62 = 0.258 bytes per clock
One Word Wide Bus, Four Word Blocks
What if the block size is four words and all words are in the same DRAM row?
- 1 cycle to send the 1st address
- 15 + 3*5 = 30 cycles to read DRAM
- 1 cycle to return the last data word
- 32 total clock cycles miss penalty
[Timing: one 15-cycle row access followed by three 5-cycle column accesses; bus transfers overlap the accesses.]
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
- (4 x 4)/32 = 0.5 bytes per clock
Interleaved Memory, One Word Wide Bus
For a block size of four words (one word in each of four memory banks)
- 1 cycle to send the 1st address
- 15 cycles to read DRAM (the four banks are read in parallel)
- 4*1 = 4 cycles to return the data words
- 20 total clock cycles miss penalty
[Figure: CPU with on-chip Cache connected over the bus to Memory banks 0, 1, 2, and 3; the four 15-cycle reads overlap.]
Number of bytes transferred per clock cycle (bandwidth) for a single miss is
- (4 x 4)/20 = 0.8 bytes per clock
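The three four-word-block organizations can be compared with a short calculation (a sketch following the slide's timing assumptions - 1 cycle to send the address, 15 cycles per row access, 5 cycles per column access, 1 cycle per returned word; the cycle counts below are derived from those assumptions):

```python
def miss_penalty(send_cycles, dram_cycles, return_cycles):
    """Miss penalty in memory bus clocks, and bandwidth in bytes/clock,
    for a 4-word (16-byte) block."""
    total = send_cycles + dram_cycles + return_cycles
    return total, 16 / total

# Different DRAM rows:   1 + 4x15 + 1 (last word; other transfers overlap)
# Same row (burst):      1 + (15 + 3x5) + 1
# Four interleaved banks: 1 + 15 (parallel reads) + 4x1
for name, args in [("different rows", (1, 4 * 15, 1)),
                   ("same row",       (1, 15 + 3 * 5, 1)),
                   ("interleaved",    (1, 15, 4 * 1))]:
    cycles, bw = miss_penalty(*args)
    print(f"{name}: {cycles} cycles, {bw:.3f} bytes/clock")
```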
DRAM Memory System Summary
It's important to match the cache characteristics
- caches access one block at a time (usually more than one word)
with the DRAM characteristics
- use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache
and with the memory-bus characteristics
- make sure the memory-bus can support the DRAM access rates and patterns
- with the goal of increasing the Memory-Bus to Cache bandwidth
Measuring Cache Performance
Assuming cache hit costs are included as part of the normal CPU execution cycle, then
CPU time = IC × CPI × CC = IC × (CPIideal + Memory-stall cycles) × CC
where CPIideal + Memory-stall cycles = CPIstall
Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)
Read-stall cycles = reads/program × read miss rate × read miss penalty
Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
For write-through caches, we can simplify this to
Memory-stall cycles = accesses/program × miss rate × miss penalty
Impacts of Cache Performance
Relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI)
- The memory speed is unlikely to improve as fast as processor cycle time. When calculating CPIstall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
- The lower the CPIideal, the more pronounced the impact of stalls
A processor with a CPIideal of 2, a 100 cycle miss penalty, 36% load/store instrs, and 2% I$ and 4% D$ miss rates
- Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
So CPIstalls = 2 + 3.44 = 5.44
- more than twice the CPIideal!
What if the CPIideal is reduced to 1? 0.5? 0.25?
What if the D$ miss rate went up 1%? 2%?
What if the processor clock rate is doubled (doubling the miss penalty)?
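The worked example and the "what ifs" above can be reproduced with a small helper (illustrative names; every instruction makes 1 I$ access, and loads/stores make the given fraction of D$ accesses):

```python
def cpi_stall(cpi_ideal, miss_penalty, ls_frac, i_miss, d_miss):
    """CPIstall = CPIideal + I$ stall cycles + D$ stall cycles, per instruction."""
    stalls = 1.0 * i_miss * miss_penalty + ls_frac * d_miss * miss_penalty
    return cpi_ideal + stalls

print(round(cpi_stall(2, 100, 0.36, 0.02, 0.04), 2))  # → 5.44
print(round(cpi_stall(1, 100, 0.36, 0.02, 0.04), 2))  # CPIideal = 1 → 4.44
print(round(cpi_stall(2, 200, 0.36, 0.02, 0.04), 2))  # doubled clock rate → 8.88
```

Note that the stall component (3.44 cycles) is unchanged across the first two lines, which is why a lower CPIideal makes the stalls relatively more pronounced.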
Average Memory Access Time (AMAT)
A larger cache will have a longer access time. An increase in hit time will likely add another stage to the pipeline. At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.
Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses
AMAT = Time for a hit + Miss rate × Miss penalty
What is the AMAT for a processor with a 20 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache access time of 1 clock cycle?
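A quick check of the question above (a sketch; AMAT is computed in clock cycles and then converted to psec using the 20 psec clock):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Time for a hit + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty

cycles = amat(1, 0.02, 50)          # 1 + 0.02 x 50 = 2 cycles
print(cycles, "cycles =", cycles * 20, "psec")
```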
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
In a direct mapped cache a memory block maps to exactly one cache block
At the other extreme, we could allow a memory block to be mapped to any cache block - a fully associative cache
A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
(block address) modulo (# sets in the cache)
Another Reference String Mapping
Consider the word reference string 0 4 0 4 0 4 0 4 in a direct mapped cache. Start with an empty cache - all blocks initially marked as not valid.
- 0 miss: Mem(0) (tag 00) is installed in block 0
- 4 miss: Mem(4) (tag 01) replaces Mem(0); 0 miss: Mem(0) replaces Mem(4); and so on for every reference
- 8 requests, 8 misses
Ping pong effect due to conflict misses - two memory locations that map into the same cache block
Set Associative Cache Example
[Figure: A 2-way set associative cache (Sets 0 and 1, Ways 0 and 1; each entry has a V bit, Tag, and Data) next to a 16-word Main Memory (addresses 0000xx through 1111xx). One word blocks; two low order bits define the byte in the word (32b words).]
Q1: Is it there?
- Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache
Q2: How do we find it?
- Use the next 1 low order memory address bit to determine which cache set (i.e., (block address) modulo (# of sets in the cache))
Another Reference String Mapping
Consider the word reference string 0 4 0 4 0 4 0 4 in a 2-way set associative cache. Start with an empty cache - all blocks initially marked as not valid.
- 0 miss: Mem(0) (tag 000) is installed in one way of set 0
- 4 miss: Mem(4) (tag 010) is installed in the other way of set 0
- all six remaining references hit
- 8 requests, 2 misses
Solves the ping pong effect in a direct mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist!
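Both reference-string experiments can be reproduced with a small LRU simulator (a sketch; names are illustrative, not from the lecture):

```python
from collections import OrderedDict

def simulate_set_associative(refs, num_sets, ways):
    """Count misses for one-word blocks with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]   # tag -> None, in LRU order
    misses = 0
    for word_addr in refs:
        s, tag = sets[word_addr % num_sets], word_addr // num_sets
        if tag in s:
            s.move_to_end(tag)            # hit: mark most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used way
            s[tag] = None
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print(simulate_set_associative(refs, num_sets=4, ways=1))  # direct mapped → 8
print(simulate_set_associative(refs, num_sets=2, ways=2))  # 2-way → 2
```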
CSE431 Chapter 5A.47 Irwin, PSU, 2008
Four-Way Set Associative Cache
[Figure: 2^8 = 256 sets, each with four ways (each with one block). The 32-bit address 31 30 . . . 13 12 11 . . . 2 1 0 is split into a 22-bit Tag, an 8-bit Index, and a 2-bit Byte offset. The index selects one set; the four ways' (V, Tag, Data) entries (rows 0, 1, . . ., 253, 254, 255) are read and their tags compared in parallel, and a 4x1 select driven by the matching way delivers the 32-bit Data on a Hit.]
Range of Set Associative Caches
For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets - it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: Tag | Index | Block offset | Byte offset
- Tag: used for tag compare. Index: selects the set. Block offset: selects the word in the block.
- Increasing associativity shifts bits from the index into the tag; decreasing associativity shifts them back
- Fully associative (only one set): Tag is all the bits except block and byte offset
- Direct mapped (only one way): smaller tags, only a single comparator
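The bit-trading claim above can be illustrated for a hypothetical 16KB cache with 16B blocks (the function and parameter names are not from the lecture):

```python
def field_widths(cache_data_bytes, block_bytes, ways, addr_bits=32):
    """Tag/index/offset widths: doubling the associativity halves the
    number of sets, shrinking the index by 1 bit and growing the tag by 1."""
    num_sets = cache_data_bytes // (block_bytes * ways)
    index_bits = num_sets.bit_length() - 1
    offset_bits = block_bytes.bit_length() - 1     # block + byte offset together
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 16KB cache, 16B blocks: direct mapped up to fully associative (1024 ways)
for ways in (1, 2, 4, 1024):
    print(ways, "way:", field_widths(16 * 1024, 16, ways))
```

Each doubling of `ways` prints one more tag bit and one fewer index bit; at 1024 ways (fully associative) the index disappears entirely.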
Costs of Set Associative Caches
When a miss occurs, which way's block do we pick for replacement?
- Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
  - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set
  - For 2-way set associative, takes one bit per set - set the bit when a block is referenced (and reset the other way's bit)
N-way set associative cache costs
- N comparators (delay and area)
- MUX delay (set selection) before data is available
- Data available only after set selection (and Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision
  - So it is not possible to just assume a hit and continue, recovering later if it was a miss
Benefits of Set Associative Caches
The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
[Figure: Miss rate (%) vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes 4KB, 8KB, 16KB, 64KB, 128KB, 256KB, and 512KB. Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2003.]
Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)
Reducing Cache Miss Rates #2
2. Use multiple levels of caches
With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of caches - normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache
For our example: CPIideal of 2, 100 cycle miss penalty (to main memory), a 25 cycle miss penalty (to UL2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and a 0.5% UL2$ miss rate
CPIstalls = 2 + .02×25 + .36×.04×25 + .005×100 + .36×.005×100 = 3.54
(as compared to 5.44 with no L2$)
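The two-level CPI example can be reproduced as follows (a sketch; argument names are illustrative - L1 misses pay the 25-cycle trip to the UL2$, and UL2$ misses pay the full main memory penalty):

```python
def cpi_with_l2(cpi_ideal, ls_frac, l1_i_miss, l1_d_miss,
                l2_penalty, l2_miss, mem_penalty):
    """CPIstall for an L1 (split I$/D$) backed by a unified L2 cache."""
    l1_stalls = (1.0 * l1_i_miss + ls_frac * l1_d_miss) * l2_penalty
    l2_stalls = (1.0 + ls_frac) * l2_miss * mem_penalty   # all accesses, UL2$ misses
    return cpi_ideal + l1_stalls + l2_stalls

print(round(cpi_with_l2(2, 0.36, 0.02, 0.04, 25, 0.005, 100), 2))  # → 3.54
```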
Multilevel Cache Design Considerations
Design considerations for L1 and L2 caches are very different
- The primary cache should focus on minimizing hit time in support of a shorter clock cycle
  - Smaller, with smaller block sizes
- The secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
  - Larger, with larger block sizes
  - Higher levels of associativity
The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache - so it can be smaller (i.e., faster) but have a higher miss rate
For the L2 cache, hit time is less important than miss rate
- The L2$ hit time determines the L1$'s miss penalty
- L2$ local miss rate >> than the global miss rate
Using the Memory Hierarchy Well
Include plots from Figure 5.18
Two Machines Cache Parameters
L1 cache organization & size - Nehalem: Split I$ and D$; 32KB for each per core; 64B blocks. Barcelona: Split I$ and D$; 64KB for each per core; 64B blocks
L1 associativity - Nehalem: 4-way (I), 8-way (D) set assoc.; ~LRU replacement. Barcelona: 2-way set assoc.; LRU replacement
L1 write policy - both: write-back, write-allocate
L2 cache organization & size - Nehalem: Unified; 256KB (0.25MB) per core; 64B blocks. Barcelona: Unified; 512KB (0.5MB) per core; 64B blocks
L2 associativity - Nehalem: 8-way set assoc.; ~LRU. Barcelona: 16-way set assoc.; ~LRU
L2 write policy - both: write-back, write-allocate
L3 cache organization & size - Nehalem: Unified; shared by cores; 64B blocks. Barcelona: Unified; shared by cores; 64B blocks
L3 associativity - Nehalem: 16-way set assoc. Barcelona: 32-way set assoc.; evicts the block shared by the fewest cores
L3 write policy - both: write-back, write-allocate
Two Machines Cache Parameters
                   Intel P4                                 AMD Opteron
L1 organization    Split I$ and D$                          Split I$ and D$
L1 cache size      8KB for D$, 96KB for trace cache (~I$)   64KB for each of I$ and D$
L1 block size      64 bytes                                 64 bytes
L1 associativity   4-way set assoc.                         2-way set assoc.
L1 replacement     ~LRU                                     LRU
L1 write policy    write-through                            write-back
L2 organization    Unified                                  Unified
L2 cache size      512KB                                    1024KB (1MB)
L2 associativity   8-way set assoc.                         16-way set assoc.
L2 replacement     ~LRU                                     ~LRU
L2 write policy    write-back                               write-back
FSM Cache Controller
Key characteristics for a simple L1 cache
- Direct mapped
- Write-back using write-allocate
- Block size of 4 32-bit words (so 16B); cache size of 16KB (so 1024 blocks)
- 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset, dirty bit, valid bit, LRU bits (if set associative)
[Figure: Processor ↔ Cache Controller ↔ DDR DRAM. Processor side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 32-bit data to and from the controller, 1-bit Ready. Memory side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 128-bit data to and from DRAM, 1-bit Ready.]
Four State Cache Controller
[FSM diagram, four states:
- Idle: wait for a Valid CPU request, then go to Compare Tag
- Compare Tag: if Valid && Hit, it is a Cache Hit - mark cache ready, set Valid, set Tag, if Write set Dirty, and return to Idle. On a Cache Miss, go to Allocate if the old block is clean, or to Write Back if the old block is Dirty
- Write Back: write the old block to memory; when Memory Ready, go to Allocate (wait while Memory Not Ready)
- Allocate: read the new block from memory; when Memory Ready, return to Compare Tag (wait while Memory Not Ready)]
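The four-state controller's transitions can be sketched as a simple function (a hypothetical software encoding for illustration, not the lecture's hardware):

```python
# States of the simple blocking cache controller
IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE = "Idle", "CompareTag", "WriteBack", "Allocate"

def next_state(state, valid_request=False, hit=False, dirty=False, mem_ready=False):
    """One transition of the four-state controller."""
    if state == IDLE:
        return COMPARE_TAG if valid_request else IDLE
    if state == COMPARE_TAG:
        if hit:
            return IDLE                                 # cache hit: mark ready
        return WRITE_BACK if dirty else ALLOCATE        # miss: evict dirty block first
    if state == WRITE_BACK:
        return ALLOCATE if mem_ready else WRITE_BACK    # old block written to memory
    if state == ALLOCATE:
        return COMPARE_TAG if mem_ready else ALLOCATE   # new block fetched, re-check tag

# Miss on a dirty block: Idle -> CompareTag -> WriteBack -> Allocate -> CompareTag
print(next_state(COMPARE_TAG, hit=False, dirty=True))  # → WriteBack
```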
Cache Coherence in Multicores
In future multicore processors it is likely that the cores will share a common physical address space, causing a cache coherence problem
[Figure: Core 1 and Core 2, each with private L1 caches, above a Unified (shared) L2. Both cores read X (each L1 and the L2 hold X = 0). Core 1 then writes 1 to X: Core 1's L1 and the L2 now hold X = 1, but Core 2's L1 still holds the stale X = 0.]
A Coherent Memory System
Any read of a data item should return the most recently written value of the data item
- Coherence defines what values can be returned by a read
  - Writes to the same location are serialized (two writes to the same location must be seen in the same order by all cores)
- Consistency determines when a written value will be returned by a read
To enforce coherence, caches must provide
- Replication of shared data items in multiple cores' caches
  - Replication reduces both latency and contention for a read shared data item
- Migration of shared data items to a core's local cache
  - Migration reduces the latency of the access to the data and the bandwidth demand on the shared memory (L2 in our example)
Cache Coherence Protocols
Need a hardware protocol to ensure cache coherence, the most popular of which is snooping
- The cache controllers monitor (snoop) on the broadcast medium (e.g., bus) with duplicate address tag hardware (so they don't interfere with the core's access to the cache) to determine if their cache has a copy of a block that is requested
Write invalidate protocol - writes require exclusive access and invalidate all other copies
- Exclusive access ensures that no other readable or writable copies of an item exist
If two processors attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete, it must obtain a new copy of the data, which must now contain the updated value - thus enforcing write serialization
Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled two ways:
1. Write-update (write-broadcast) - the writing core broadcasts the new data over the bus and all copies are updated
   - All writes go to the bus - higher bus traffic
   - Since new values appear in caches sooner, can reduce latency
2. Write-invalidate - the writing core issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data and, if so, invalidate their cache block (allows multiple readers but only one writer)
   - Uses the bus only on the first write - lower bus traffic, so better use of bus bandwidth
Example of Snooping Invalidation
[Figure: Core 1 and Core 2, each with L1 I$ and D$, above a Unified (shared) L2. Core 1 reads X and Core 2 reads X (each L1 D$ and the L2 hold X = 0). Core 1 writes 1 to X, invalidating Core 2's copy (X = I) while Core 1 holds X = 1. Core 2 then reads X again and misses.]
When the second miss by Core 2 occurs, Core 1 responds with the value, canceling the response from the L2 cache (and also updating the L2 copy)
A Write-Invalidate CC Protocol
[FSM diagram of the write-invalidate protocol with write-back caching; three states: Invalid, Shared (clean), and Modified (dirty). Signals from the core are in red; signals from the bus are in blue.
- Invalid → Shared: read (miss)
- Invalid → Modified: write (miss)
- Shared → Shared: read (hit or miss)
- Shared → Modified: write (miss); send invalidate
- Shared → Invalid: receives invalidate (write by another core to this block)
- Modified → Modified: read or write (hit or miss)
- Modified → Shared: write-back due to read (miss) by another core to this block]
Write-Invalidate CC Examples
- I = invalid (many), S = shared (many), M = modified (only one)
[Four examples; in each, Core 1 starts with block A and Core 2 then accesses A. Main Mem (MM) holds A.
Read miss, A shared in Core 1 (S): 1. C2 read miss for A; 2. read request for A on the bus; 3. C1's snoop sees the read request for A and lets MM supply A; 4. C2 gets A from MM, in state S.
Write miss, A shared in Core 1 (S): 1. C2 write miss for A; 2. C2 writes A and changes its state to M; 3. C2 sends invalidate for A; 4. C1 changes A's state to I.
Read miss, A modified in Core 1 (M): 1. C2 read miss for A; 2. read request for A on the bus; 3. C1's snoop sees the read request for A and writes-back A to MM; 4. C2 gets A from MM and changes its state to S; 5. C1 changes A's state to S.
Write miss, A modified in Core 1 (M): 1. C2 write miss for A; 2. C2 writes A and changes its state to M; 3. C2 sends invalidate for A; 4. C1 changes A's state to I.]
Data Miss Rates
- Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses
[Figure: Capacity miss rate and coherence miss rate (%) vs. number of cores (1, 2, 4, 8, 16) for the FFT and Ocean benchmarks, for a 64KB 2-way set associative data cache with 32B blocks. From Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]
Block Size Effects
Writes to one word in a multi-word block mean that the full block is invalidated
Multi-word blocks can also result in false sharing: when two cores are writing to two different variables that happen to fall in the same cache block
- With write-invalidate, false sharing increases cache miss rates
[Figure: Core1 writes A and Core2 writes B, where A and B fall in the same 4 word cache block.]
Compilers can help reduce false sharing by allocating highly correlated data to the same cache block
Other Coherence Protocols
There are many variations on cache coherence protocols
Another write-invalidate protocol used in the Pentium 4 (and many other processors) is MESI with four states:
- Modified - (same as before) the block is dirty and this cache holds the only copy
- Exclusive - only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
  - Since there is only one copy of the block, write hits don't need to send an invalidate signal
- Shared - multiple copies of the shared data may be cached (i.e., data permitted to be cached with more than one processor); memory has an up-to-date copy
- Invalid - same (as before)
MESI Cache Coherency Protocol
[FSM diagram, four states: Invalid, Shared (clean), Exclusive (clean), Modified (dirty).
- Invalid → Shared: processor shared read miss
- Invalid → Exclusive: processor exclusive read miss
- Shared → Shared: processor shared read; another processor has a read miss for this block
- Shared → Modified: processor write; invalidate sent for this block
- Exclusive → Shared: another processor has a shared read miss for this block
- Exclusive → Modified: processor write
- Modified → Shared: another processor has a shared read miss for this block [write back block]
- any state → Invalid: invalidate received, or another processor has a read/write miss for this block
- Modified → Modified: processor write or read hit]
Summary: Improving Cache Performance
Reduce the miss penalty
- smaller blocks
- use a write buffer to hold dirty blocks being replaced so we don't have to wait for the write to complete before reading
- check write buffer (and/or victim cache) on read miss - may get lucky
- for large blocks fetch critical word first
- use multiple cache levels - L2 cache not tied to CPU clock rate
- faster backing store / improved memory bandwidth
  - wider buses
  - memory interleaving, DDR SDRAMs
Summary: The Cache Design Space
Several interacting dimensions
- cache size
- block size
- associativity
- replacement policy
- write-through vs write-back
- write allocation
The optimal choice is a compromise
- depends on access characteristics
  - workload
  - use (I-cache, D-cache, TLB)
- depends on technology / cost
[Figure: sketch of the design space - axes for Cache Size, Associativity, and Block Size, with Good and Bad regions depending on Factor A and Factor B.]
CSE431 Chapter 5A.77 Irwin, PSU, 2008