Post on 21-Jan-2016
Chapter 5: Memory III
CSE 820
Michigan State University, Computer Science and Engineering
Miss Rate Reduction (cont’d)
Larger Block Size
• Reduces compulsory misses through spatial locality
• But:
  – miss penalty increases: higher bandwidth helps
  – miss rate can increase: a fixed cache size with larger blocks means fewer blocks in the cache
Notice the “U” shape: some is good, too much is bad.
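The tradeoff behind the "U" shape follows from the usual average-memory-access-time formula. A minimal sketch; the function name and all the numbers in the usage note are illustrative assumptions, not figures from the slides:

```c
/* AMAT = hit time + miss rate * miss penalty (all in cycles here).
   A larger block lowers the miss rate (spatial locality) but raises
   the miss penalty (more words to transfer). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a larger block that cuts the miss rate from 4% to 3% but raises the miss penalty from 40 to 60 cycles makes things worse: amat(1, 0.04, 40) = 2.6 cycles versus amat(1, 0.03, 60) = 2.8 cycles, which is the right-hand side of the "U".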
Larger Caches
• Reduces capacity misses
• But:
  – Increased hit time
  – Increased cost ($)
• Over time, L2 and higher-level cache sizes have increased
Higher Associativity
• Reduces the miss rate by reducing conflict misses
• But:
  – Increased hit time (tag check)
• Note:
  – An 8-way set-associative cache has nearly the same miss rate as a fully associative one
Way Prediction
• Predict which way of a set-associative L1 cache will be accessed next
  – Alpha 21264: a correct prediction costs 1 cycle; an incorrect prediction costs 3 cycles
  – On SPEC95, the prediction is 85% correct
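The payoff is a simple expected-latency calculation over the slide's Alpha 21264 numbers (1 cycle correct, 3 cycles incorrect, 85% accuracy); the function name is just for illustration:

```c
/* Expected access latency under way prediction:
   accuracy * correct_cost + (1 - accuracy) * incorrect_cost */
double predicted_latency(double accuracy, double correct_cost,
                         double incorrect_cost) {
    return accuracy * correct_cost + (1.0 - accuracy) * incorrect_cost;
}
```

With these numbers, 0.85 * 1 + 0.15 * 3 = 1.3 cycles on average, still well under the latency of reading all ways and selecting late.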
Compiler Techniques
• Reduce conflicts in the I-cache: a 1989 study showed reordering code reduced misses by 50% for a 2 KB cache and by 75% for an 8 KB cache
• The D-cache behaves differently
Compiler Data Optimizations: Loop Interchange
• Before (inner loop strides through memory by a whole row length; bounds N and M are illustrative):
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
• After (inner loop walks memory sequentially):
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
• Improved spatial locality: in C's row-major layout, the inner loop now touches adjacent elements
Blocking: Improve Spatial Locality
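The before/after pair can be sketched with matrix multiply, the classic blocking example; the dimension N and tile size B below are tiny illustrative values (real tiles are sized so three B×B tiles fit in the cache, and production code handles N not divisible by B):

```c
#define N 4   /* matrix dimension (illustrative) */
#define B 2   /* tile size (illustrative); must divide N in this sketch */

/* Before: y += a * x. Every row of x is streamed from memory N times,
   so for large N each pass misses again. */
void mm(double y[N][N], double a[N][N], double x[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += a[i][k] * x[k][j];
            y[i][j] += r;
        }
}

/* After: operate on BxB tiles so each tile of x stays cache-resident
   while it is reused. */
void mm_blocked(double y[N][N], double a[N][N], double x[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += a[i][k] * x[k][j];
                    y[i][j] += r;
                }
}
```

Both versions compute the same result; only the order of memory accesses changes.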
Miss Rate and Miss Penalty Reduction via Parallelism
Nonblocking Caches
• Reduces stalls on cache misses
• A blocking cache refuses all requests while waiting for miss data
• A nonblocking cache continues to handle other requests while waiting for the data of an earlier miss
• Increases cache-controller complexity
Nonblocking Cache (8 KB direct-mapped L1; 32-byte blocks)
Hardware Prefetch
• On a miss, fetch two blocks: the desired block plus the next one
• The "next" block goes into a stream buffer; on a fetch, check the stream buffer first
• Performance:
  – a single-block instruction stream buffer caught 15% to 25% of L1 misses
  – a 4-block stream buffer caught 50%
  – a 16-block stream buffer caught 72%
Hardware Prefetch
• Data prefetch:
  – a single data stream buffer caught 25% of L1 misses
  – 4 data stream buffers caught 43%
  – 8 data stream buffers caught 50% to 70%
• Prefetch from multiple addresses: the UltraSPARC III handles 8 prefetches and calculates the stride for the next prediction
Software Prefetch
• Many processors, such as the Itanium, have prefetch instructions
• Remember that they are nonfaulting: a prefetch of an illegal address simply does nothing instead of raising an exception
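As a sketch of how software prefetch looks in practice, GCC and Clang expose it through the __builtin_prefetch builtin, which compiles to the target's nonfaulting prefetch instruction where one exists; the prefetch distance below is an arbitrary illustrative choice, and tuning it depends on memory latency and loop cost:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* illustrative distance, in elements */

/* Sum an array, prefetching a few cache lines ahead of the element
   currently being read. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* args: address, 0 = read, 1 = moderate temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```

Because the prefetch is a hint, dropping the bounds check would still be safe on most targets, but guarding it keeps the sketch portable.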
Hit Time Reduction
Small, Simple Caches
• Hit time has two components:
  – indexing
  – comparing the tag
• A small cache indexes quickly
• A simple direct-mapped cache allows tag comparison in parallel with the data load
• One option: keep the L2 tags on chip with the data off chip
Time vs cache size & organization
Perspective on previous graph
• Recall:
  – a 1 ns clock is 10^-9 sec/cycle; 1 GHz is 10^9 cycles/sec
• Therefore:
  – a 2 ns clock is 500 MHz
  – a 4 ns clock is 250 MHz
• Conclusion: small differences in ns represent large differences in MHz
Virtual vs Physical Address in L1
• Translating a virtual address to a physical address as part of a cache access adds time on the critical path
• Translation is needed for both the index and the tag
• Making the common case fast suggests avoiding translation on hits (misses must be translated)
Why are (almost all) L1 caches physical?
• Security (protection): page-level protection must be checked on each access (though protection data can be copied into the cache)
• A process switch can change the virtual mapping, requiring a cache flush (or a process ID in the tag) [see next slide]
• Synonyms: two virtual addresses for the same (shared) physical address
Virtually-addressed cache context-switch cost
Hybrid: Virtually Indexed, Physically Tagged
• Index with the part of the page offset that is identical in virtual and physical addresses, i.e. the index bits are a subset of the page-offset bits
• In parallel with indexing, translate the virtual address to check the physical tag
• Limitation: a direct-mapped cache can be no larger than the page size (determined by the address bits); set-associative caches can be bigger, since fewer bits are needed for the index
Example
• Pentium III: 8 KB pages with a 16 KB 2-way set-associative cache
• IBM 3033: 4 KB pages with a 64 KB 16-way set-associative cache (note that 8-way would be sufficient, but 16-way is needed to keep the index bits sufficiently small)
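The constraint behind both examples is that the bytes indexed per way must fit within a page, i.e. cache size / associativity ≤ page size. A tiny helper (hypothetical name) makes the check explicit:

```c
#include <stdbool.h>

/* Virtually indexed, physically tagged: the index and block-offset
   bits must come from the page offset, so the size of one way
   (cache_size / associativity) cannot exceed the page size. */
bool vipt_index_fits(unsigned cache_size, unsigned assoc,
                     unsigned page_size) {
    return cache_size / assoc <= page_size;
}
```

This confirms the slide's parenthetical: the 64 KB IBM 3033 cache at 8-way would need 8 KB per way, which overflows its 4 KB pages, while 16-way brings each way down to exactly 4 KB.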
Trace Cache
• Pentium 4 NetBurst architecture
• I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
• Advantage over regular large cache blocks, which contain branches and hence many unused instructions (e.g., AMD Athlon 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch)
• Disadvantage: complex addressing
Trace Cache
• The P4 trace cache (I-cache) sits after decode and branch prediction, so it contains:
  – μops
  – only the desired instructions
• The trace cache holds 12K μops
• The branch-predict BTB is 4K entries (a 33% improvement over the PIII)
Summary (so far)
• Figure 5.26 summarizes all of these techniques
Main Memory
• Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster:
  – a wider path to memory brings in more words at a time, e.g., one address request brings in 4 words (reducing overhead)
  – interleaved memory allows the memory banks to respond faster
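A small worked example makes the two options concrete. All timings below are assumptions for illustration (not figures from the slides): 1 cycle to send the address, 15 cycles per memory access, 1 cycle to transfer one word, 4-word cache blocks.

```c
/* Assumed timings, in cycles (illustrative only). */
enum { ADDR = 1, ACCESS = 15, XFER = 1, WORDS = 4 };

/* One-word-wide memory: every word pays the full access latency. */
int penalty_narrow(void)      { return ADDR + WORDS * (ACCESS + XFER); }

/* Four-word-wide memory: one access delivers the whole block. */
int penalty_wide(void)        { return ADDR + ACCESS + XFER; }

/* Four-way interleaved banks: accesses overlap across banks,
   only the word transfers serialize on the bus. */
int penalty_interleaved(void) { return ADDR + ACCESS + WORDS * XFER; }
```

Under these assumptions the miss penalties are 65, 17, and 20 cycles respectively: interleaving captures most of the benefit of a wide bus without the cost of widening it.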