Cache Memory
CSE 410, Spring 2008
Computer Systems
http://www.cs.washington.edu/410
© 2006-07 Perkins, DW Johnson and University of Washington
Reading and References

Reading
- Computer Organization and Design, Patterson and Hennessy
  - Section 7.1 Introduction
  - Section 7.2 The Basics of Caches
  - Section 7.3 Measuring and Improving Cache Performance

References
- OSC (the dinosaur book), Chapter 8: focus on paging
- CA:AQA, Chapter 5
- See MIPS Run, D. Sweetman, Chapter 4
- IBM's Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html
Arthur W. Burks, Herman H. Goldstine, John von Neumann:

"We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

This team was forced to recognize this as early as 1946.

- Cache: a safe place to store something (Webster)
- Cache (CSE & EE): a small and fast local memory
- Caches: some consider this to be ...
The Quest for Speed - Memory

If all memory accesses (IF/lw/sw) went to main memory, programs would run dramatically slower.
- Compare memory access times: 2^4 to 2^6 times slower?
- Could be even slower depending on pipeline length!

And it's getting worse:
- processor speeds improve by ~50% annually
- memory access times improve by ~9% annually
- it's becoming harder and harder to keep these processors fed
A Solution: Memory Hierarchy

- Keep copies of the active data in the small, fast, expensive storage
- Keep all data in the big, slow, cheap storage

[Figure: two levels, a fast/small/expensive storage above a slow/large/cheap storage]
Memory Hierarchy

[Figure: typical access times at each level, in ns: 0.5-5, 10, 100, 10M]
What is a Cache?

- A subset of a larger memory
- A small and fast place to store frequently accessed items
- Can be an instruction cache, a video cache, a streaming buffer/cache
- Like a fisheye view of memory, where magnification == speed
What is a Cache?

A cache allows for fast accesses to a subset of a larger data store.

Your web browser's cache gives you fast access to pages you visited recently:
- faster because it's stored locally
- a subset because the web won't fit on your disk

The memory cache gives the processor fast access to memory that it used recently:
- faster because it's fancy and usually located on the CPU chip
- a subset because the cache is smaller than main memory
Memory Hierarchy

[Figure: CPU registers, then L1 cache, L2 cache, and main memory, top to bottom]
IBM's Cell Chips
Locality of reference

Temporal locality - nearness in time
- Data being accessed now will probably be accessed again soon
- Useful data tends to continue to be useful

Spatial locality - nearness in address
- Data near the data being accessed now will probably be needed soon
- Useful data is often accessed sequentially
Memory Access Patterns

Memory accesses don't usually look like this:
- random accesses

Memory accesses do usually look like this:
- hot variables
- stepping through arrays
Cache Terminology

Hit and Miss
- the data item is in the cache, or the data item is not in the cache

Hit rate and Miss rate
- the percentage of references for which the data item is (or is not) in the cache

Hit time and Miss time
- the time required to access data in the cache (cache access time), and the time required to access data not in the cache (memory access time)
Effective Access Time
t_effective = h × t_cache + (1 - h) × t_memory

where t_effective is the effective access time, h is the cache hit rate, (1 - h) is the cache miss rate, t_cache is the cache access time, and t_memory is the memory access time.

t_effective is also known as the Average Memory Access Time (AMAT).
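As a quick sanity check, here is a minimal C sketch of the formula; the hit rate and access times below are illustrative assumptions, not numbers from the slides.

```c
#include <stdio.h>

/* Effective (average) memory access time:
 *   t_eff = h * t_cache + (1 - h) * t_memory
 * The numbers below are assumed for illustration only. */
int main(void) {
    double h        = 0.95;   /* cache hit rate */
    double t_cache  = 1.0;    /* cache access time, ns */
    double t_memory = 100.0;  /* main memory access time, ns */

    double t_eff = h * t_cache + (1.0 - h) * t_memory;
    printf("AMAT = %.2f ns\n", t_eff);  /* 0.95*1 + 0.05*100 = 5.95 ns */
    return 0;
}
```

Note how a 5% miss rate already quintuples the average access time relative to a pure cache hit.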
# of bits in a Cache
Caches are expensive memories that carry additional metadata.

Note that the number of bits needed for a cache is a function of the number of cache blocks as well as the address size (embedded in the tag-size calculation):

total bits = 2^n × (block size + tag size + valid bit)

where 2^n is the number of blocks and 2^m is the number of words per block. With 32-bit addresses, tag size = 32 - n - m - 2.
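To make the formula concrete, here is a small C sketch; the choice of n = 10 and m = 2 (a direct-mapped cache holding 16 KB of data in 4-word blocks) is an assumed example, not from the slides.

```c
#include <stdio.h>

/* Total bits in a direct-mapped cache with 32-bit addresses:
 *   2^n blocks, 2^m words per block
 *   tag size = 32 - n - m - 2 (2 bits are the byte offset in a word)
 *   bits = 2^n * (block data bits + tag bits + 1 valid bit)
 * n = 10 and m = 2 are illustrative assumptions. */
int main(void) {
    int n = 10;                      /* 2^10 = 1024 blocks */
    int m = 2;                       /* 2^2  = 4 words per block */
    int data_bits = (1 << m) * 32;   /* 128 data bits per block */
    int tag_bits  = 32 - n - m - 2;  /* 18 tag bits */
    long total = (1L << n) * (data_bits + tag_bits + 1);
    printf("total = %ld bits (%.1f KB to hold 16 KB of data)\n",
           total, total / 8.0 / 1024.0);  /* 150528 bits ~ 18.4 KB */
    return 0;
}
```

The metadata overhead here is about 15%, which is why "a 16 KB cache" costs noticeably more than 16 KB of SRAM.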
Cache Contents

When do we put something in the cache?
- when it is used for the first time

When do we remove something from the cache?
- when we need the space in the cache for some other entry
- all of memory won't fit on the CPU chip, so not every location in memory can be cached
A small two-level hierarchy
8-word cache
32-word memory (128 bytes)
[Figure: word addresses in the 32-word memory, shown as decimal = 7-bit binary:
0 = 0000000, 4 = 0000100, 8 = 0001000, 12 = 0001100,
16 = 0010000, 20 = 0010100, ..., 116 = 1110100, 120 = 1111000, 124 = 1111100]
Fully Associative Cache

In a fully associative cache, any memory word can be placed in any cache line.
- each cache line stores an address and a data value
- accesses are slow (but not as slow as you might think)
- Why? Check all addresses at once.

Address   Valid  Value
0101100   Y      0x00012D10
0001100   N      0x00000005
1101100   Y      0x0349A291
0100000   Y      0x000123A8
1111100   N      0x00000200
0100100   Y      0x00000410
0000100   N      0x09D91D11
0010100   Y      0x00000001
Direct Mapped Caches

Fully associative caches are often too slow. With direct mapped caches, the address of the item determines where in the cache to store it.
- In our example, the lowest-order two bits are the byte offset within the word stored in the cache
- The next three bits of the address dictate the location of the entry within the cache
- The remaining higher-order bits record the rest of the original address as a tag for this entry
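A minimal C sketch of a direct-mapped lookup for this geometry (7-bit addresses: 2-bit tag, 3-bit index, 2-bit byte offset); the names such as dm_lookup are hypothetical, not from the slides.

```c
#include <stdbool.h>
#include <stdint.h>

/* One line of a hypothetical direct-mapped cache matching the
 * slide's 7-bit addresses: tag (2) | index (3) | byte offset (2). */
typedef struct {
    bool     valid;
    uint8_t  tag;    /* upper 2 address bits */
    uint32_t value;  /* one 4-byte word */
} CacheLine;

static CacheLine cache[8];  /* 2^3 lines, selected by the index */

/* Returns true on a hit and stores the cached word in *out. */
bool dm_lookup(uint8_t addr, uint32_t *out) {
    uint8_t index = (addr >> 2) & 0x7;  /* address bits 4..2 */
    uint8_t tag   = (addr >> 5) & 0x3;  /* address bits 6..5 */
    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {
        *out = line->value;
        return true;                /* hit */
    }
    return false;                   /* miss: would fetch from memory */
}
```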
Address Tags

A tag is a label for a cache entry indicating where it came from: the upper bits of the data item's address.

[Figure: a 7-bit address split as Tag (2) | Index (3) | Byte Offset (2). Example addresses: 1011101 and 0111110.]
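The split can be checked with a few shifts and masks; this small C program decodes the two example addresses from the figure (the decode helper is hypothetical).

```c
#include <stdio.h>
#include <stdint.h>

/* Decode a 7-bit address into tag (2), index (3), byte offset (2). */
static void decode(uint8_t addr) {
    printf("0x%02X -> tag %u, index %u, offset %u\n",
           addr, (addr >> 5) & 0x3, (addr >> 2) & 0x7, addr & 0x3);
}

int main(void) {
    decode(0x5D);  /* 1011101: tag 10, index 111, offset 01 */
    decode(0x3E);  /* 0111110: tag 01, index 111, offset 10 */
    /* Same index, different tags: in a direct-mapped cache these
     * two addresses would conflict for the same line. */
    return 0;
}
```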
Direct Mapped Cache

Cache Index   Tag  Valid  Value        Memory Address
000 = 0       11   Y      0x00000001   1100000
001 = 1       10   N      0x09D91D11   1000100
010 = 2       01   Y      0x00000410   0101000
011 = 3       00   Y      0x00012D10   0001100
100 = 4       10   N      0x00000005   1010000
101 = 5       11   Y      0x0349A291   1110100
110 = 6       00   Y      0x000123A8   0011000
111 = 7       10   N      0x00000200   1011100

(Each memory address is tag, then index, then a 00 byte offset.)
N-way Set Associative Caches

Direct mapped caches cannot store more than one address with the same index
- the simple case of 1-way set associative
- compare to fully associative, where N = the number of blocks
- if two addresses collide, then you overwrite the older entry

2-way set associative caches can store two different addresses with the same index
- there are 3-way, 4-way, and 8-way set associative designs too

This reduces misses due to conflicts/competition/thrashing for the same cache block, as each set can hold more entries.

Larger sets imply slower accesses: you need to compare N items, where N = 1 for direct mapped (see the sketch below).
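Here is a C sketch of the lookup, reusing the 7-bit example geometry; hardware compares all the ways in parallel, which the loop below only models sequentially. The names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 2
#define SETS 8

/* A hypothetical 2-way set-associative cache: each index selects a
 * set of WAYS lines, and every tag in the set is checked. */
typedef struct {
    bool     valid;
    uint8_t  tag;
    uint32_t value;
} Line;

static Line cache2[SETS][WAYS];

bool sa_lookup(uint8_t addr, uint32_t *out) {
    uint8_t index = (addr >> 2) & 0x7;
    uint8_t tag   = (addr >> 5) & 0x3;
    for (int w = 0; w < WAYS; w++) {   /* parallel comparators in hardware */
        Line *line = &cache2[index][w];
        if (line->valid && line->tag == tag) {
            *out = line->value;
            return true;               /* hit in way w */
        }
    }
    return false;  /* miss: pick a victim way within the set and refill */
}
```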
2-way Set Associative Cache

         Way 0                        Way 1
Index    Tag  Valid  Value            Tag  Valid  Value
000      11   Y      0x00000001       00   Y      0x00000002
001      10   N      0x09D91D11       10   N      0x0000003B
010      01   Y      0x00000410       11   Y      0x000000CF
011      00   Y      0x00012D10       10   N      0x000000A2
100      10   N      0x00000005       11   N      0x00000333
101      11   Y      0x0349A291       10   Y      0x00003333
110      00   Y      0x000123A8       01   Y      0x0000C002
111      10   N      0x00000200       10   N      0x00000005

The entry at index 101 contains values for addresses 10101xx and 11101xx (binary).
Associativity Spectrum

Direct Mapped
- fast to access
- conflict misses

N-way Set Associative
- slower to access
- fewer conflict misses

Fully Associative
- slow to access
- no conflict misses
Spatial Locality

Using the cache improves performance by taking advantage of temporal locality:
- when a word in memory is accessed, it is loaded into cache memory
- it is then available quickly if it is needed again soon

But this does nothing for spatial locality. What's the solution here?
Memory Blocks

Divide memory into blocks. If any word in a block is accessed, then load the entire block into the cache.

Block 0: 0x00000000 - 0x0000003F
Block 1: 0x00000040 - 0x0000007F
Block 2: 0x00000080 - 0x000000BF

Cache line for a 16-word block size:
tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
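With 64-byte (16-word) blocks, the block number is simply the address with its low 6 offset bits shifted away. A small C check against the block ranges above (the program itself is illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* For 64-byte blocks: block number = addr >> 6, offset = addr & 0x3F. */
int main(void) {
    uint32_t addrs[] = { 0x00000000, 0x0000003F, 0x00000040, 0x000000BF };
    for (int i = 0; i < 4; i++) {
        printf("addr 0x%08X -> block %u, offset %u\n",
               addrs[i], addrs[i] >> 6, addrs[i] & 0x3F);
    }
    /* 0x00..0x3F -> block 0, 0x40..0x7F -> block 1, 0x80..0xBF -> block 2 */
    return 0;
}
```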
Address Tags Revisited

A cache block size > 1 word requires the address to be divided differently. Instead of a byte offset into a word, we need a byte offset into the block.

Assume we have 10-bit addresses, 8 cache lines, and 4 words (16 bytes) per cache line block:

[Figure: a 10-bit address split as Tag (3) | Index (3) | Block Offset (4). Example addresses: 0101100111 and 0111110010.]
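The same shift-and-mask check works for the 10-bit layout; this C snippet decodes the two example addresses from the figure (the helper name is hypothetical).

```c
#include <stdint.h>
#include <stdio.h>

/* Decode a 10-bit address: tag (3) | index (3) | block offset (4). */
static void decode10(uint16_t addr) {
    printf("0x%03X -> tag %u, index %u, block offset %u\n",
           addr, (addr >> 7) & 0x7, (addr >> 4) & 0x7, addr & 0xF);
}

int main(void) {
    decode10(0x167);  /* 0101100111: tag 010, index 110, offset 0111 */
    decode10(0x1F2);  /* 0111110010: tag 011, index 111, offset 0010 */
    return 0;
}
```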
The Effects of Block Size

Big blocks are good
- fewer first-time misses
- exploits spatial locality
- counter: the law of diminishing returns applies here if block size becomes a significant fraction of overall cache size

Small blocks are good
- don't evict as much data when bringing in a new entry
- more likely that all items in the block will turn out to be useful
Reads vs. Writes

Caching is essentially making a copy of the data.
- When you read, the copies still match when you're done (a read is immutable)
- When you write, the results must eventually propagate to both copies (it's a hierarchy)
- Especially to the lowest level of the hierarchy, which is in some sense the permanent copy
Cache Misses

Reads are easier, due to their constraints:
- a dedicated memory unit (the memory controller) has the job of fetching and filling cache lines

Writes are trickier, as we need to maintain consistent memory!
- what if memory changed at a top level but not at main memory or the disk?
- Write-through: write to your cache, and percolate that write down (slow but straightforward, roughly a factor of 10)
  - use a cache or buffer to percolate the write down
- Optimization: Write-back (or, write-on-replace): faster, but more complex to implement
Write-Through Caches

Write all updates to both cache and memory.

Advantages
- the cache and the memory are always consistent
- evicting a cache line is cheap because no data needs to be written out to memory at eviction
- easy to implement

Disadvantages
- runs at memory speeds when writing
- one solution: use a write buffer/victim buffer to mask this delay
Write-Back Caches

Write the update to the cache only. Write to memory only when the cache block is evicted.

Advantages
- runs at cache speed rather than memory speed
- some writes never go all the way to memory
- when a whole block is written back, the cache can use a high-bandwidth transfer

Disadvantage
- complexity required to maintain consistency
Dirty bit

When evicting a block from a write-back cache, we could:
- always write the block back to memory, or
- write it back only if we changed it

Caches use a dirty bit to mark whether a line was changed:
- the dirty bit is 0 when the block is loaded
- it is set to 1 if the block is modified
- when the line is evicted, it is written back only if the dirty bit is 1
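A minimal C sketch of write-back eviction under these rules; write_word_to_memory is a hypothetical stand-in for the memory controller interface, and the 4-word block is an assumed geometry.

```c
#include <stdbool.h>
#include <stdint.h>

/* One line of a hypothetical write-back cache. */
typedef struct {
    bool     valid;
    bool     dirty;    /* set to 1 on any write while cached */
    uint32_t tag;
    uint32_t data[4];  /* one 4-word block (assumed size) */
} WbLine;

/* Assumed memory-controller hook, declared but not defined here. */
void write_word_to_memory(uint32_t addr, uint32_t word);

void evict(WbLine *line, uint32_t block_addr) {
    if (line->valid && line->dirty) {
        /* Only modified blocks are written back to memory. */
        for (int i = 0; i < 4; i++)
            write_word_to_memory(block_addr + 4 * i, line->data[i]);
    }
    line->valid = false;  /* line is now free for the incoming block */
    line->dirty = false;
}
```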
i-Cache and d-Cache

There are usually two separate caches for instructions and data.
- avoids structural hazards in pipelining
- the combined cache is twice as big but still has the access time of a small cache
- allows both caches to operate in parallel, for twice the bandwidth
- but the miss rate will slightly increase, due to the extra partition that a shared cache wouldn't have
Latency vs. Throughput

Latency: the time to get the first requested word from the cache (if present)
- the whole point of caches: this should be relatively small (compared to other memories)

Throughput: how quickly the rest of the block (which could be many words) is fetched
- defined by your hardware
- wide buses are preferred: at what cost?
- e.g., DDR: dual read/write per cycle, one transfer on each clock edge
Cache Line Replacement

How do you decide which cache block to replace?
- if the cache is direct-mapped, it's easy: only one slot per index, so only one choice!

Otherwise, common strategies:
- Random (why might this be worthwhile?)
- Least Recently Used (LRU): a FIFO approximation works reasonably well here
LRU Implementations

LRU is very difficult to implement for high degrees of associativity.

4-way approximation:
- 1 bit to indicate the least recently used pair
- 1 bit per pair to indicate the least recently used item in that pair

Another approximation: FIFO queuing. We will see this again at the operating system level.
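A C sketch of that 3-bit tree approximation ("pseudo-LRU") for one 4-way set; the bit conventions below (0 = the left side is older) are an assumption, and real designs vary.

```c
#include <stdint.h>

/* Tree pseudo-LRU for a 4-way set: bit b0 picks the LRU pair,
 * bits b1/b2 pick the LRU way within each pair. 3 bits per set. */
typedef struct { uint8_t b0, b1, b2; } Plru;

/* Which way to replace next. */
int plru_victim(const Plru *p) {
    if (p->b0 == 0)                  /* left pair (ways 0,1) is older */
        return p->b1 == 0 ? 0 : 1;
    else                             /* right pair (ways 2,3) is older */
        return p->b2 == 0 ? 2 : 3;
}

/* On an access to `way`, flip the bits on its path so the tree
 * points away from the most recently used way. */
void plru_touch(Plru *p, int way) {
    if (way < 2) {
        p->b0 = 1;                   /* right pair is now the older pair */
        p->b1 = (way == 0) ? 1 : 0;  /* sibling becomes the in-pair victim */
    } else {
        p->b0 = 0;
        p->b2 = (way == 2) ? 1 : 0;
    }
}
```

True LRU for 4 ways needs to track one of 4! = 24 orderings; the tree gets close with just 3 bits per set.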
Multi-Level Caches

Use each level of the memory hierarchy as a cache over the next lowest level (or, simply recurse again on our current abstraction).

Inserting level 2 between levels 1 and 3 allows:
- level 1 to have a higher miss rate (so it can be smaller and cheaper)
- level 3 to have a larger access time (so it can be slower and cheaper)
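Extending the earlier AMAT formula one level down shows why this works; the access times and local miss rates in this C sketch are illustrative assumptions.

```c
#include <stdio.h>

/* Two-level AMAT: an L1 miss goes to L2; an L2 miss goes to memory.
 *   AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * t_mem)
 * All numbers below are assumed for illustration. */
int main(void) {
    double t_l1 = 1.0, t_l2 = 10.0, t_mem = 100.0;  /* ns */
    double miss_l1 = 0.10, miss_l2 = 0.20;          /* local miss rates */

    double amat = t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem);
    printf("AMAT = %.2f ns\n", amat);  /* 1 + 0.1*(10 + 0.2*100) = 4.00 ns */
    return 0;
}
```

Without the L2 (miss_l1 straight to memory), the same numbers give 1 + 0.1 × 100 = 11 ns, so the middle level cuts the average access time by more than half.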
Summary: Classifying Caches

Where can a block be placed?
- direct mapped, N-way set associative, or fully associative

How is a block found?
- direct mapped: one choice, by index
- set associative: by index, then search within the set
- fully associative: by search (all at once)

What happens on a write access?
- write-back or write-through

Which block should be replaced?
- Random
- LRU (Least Recently Used)
Being Explored

Cache coherency in multiprocessor systems
- consider keeping multiple L1s in sync when they are separated by a slow bus

We want each processor to have its own cache:
- fast local access
- no interference with/from other processors
- no problem if processors operate in isolation (a thread tree per processor, for example)

But: now what happens if more than one processor accesses a cache line at the same time?
- how do we keep multiple copies consistent?
- consider multiple writes to the same data
- there is overhead in all this synchronization/message passing on a possibly slow bus
- what about synchronization with main storage?