Post on 21-Jan-2016
Chapter 5: Memory III
CSE 820
Michigan State University, Computer Science and Engineering
Miss Rate Reduction (cont’d)
Larger Block Size
• Reduces compulsory misses through spatial locality
• But:
  – miss penalty increases: higher bandwidth helps
  – miss rate can increase: a fixed cache size with larger blocks means fewer blocks in the cache
Notice the “U” shape: some is good, too much is bad.
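The tradeoff behind the "U" shape follows from the usual average-memory-access-time formula. A minimal sketch; the function name and all the numbers in the usage note are illustrative assumptions, not figures from the slides:

```c
/* AMAT = hit time + miss rate * miss penalty (all in cycles here).
   A larger block lowers the miss rate (spatial locality) but raises
   the miss penalty (more words to transfer). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a larger block that cuts the miss rate from 4% to 3% but raises the miss penalty from 40 to 60 cycles makes things worse: amat(1, 0.04, 40) = 2.6 cycles versus amat(1, 0.03, 60) = 2.8 cycles, which is the right-hand side of the "U".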
Larger Caches
• Reduces capacity misses
• But:
  – Increased hit time
  – Increased cost ($)
• Over time, L2 and higher-level cache sizes have increased
Higher Associativity
• Reduces the miss rate by reducing conflict misses
• But:
  – Increased hit time (tag check)
• Note:
  – An 8-way set-associative cache has nearly the same miss rate as a fully associative one
Way Prediction
• Predict which way of a set-associative L1 cache will be accessed next
  – Alpha 21264: a correct prediction costs 1 cycle; an incorrect prediction costs 3 cycles
  – On SPEC95, the prediction is 85% correct
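The payoff is a simple expected-latency calculation over the slide's Alpha 21264 numbers (1 cycle correct, 3 cycles incorrect, 85% accuracy); the function name is just for illustration:

```c
/* Expected access latency under way prediction:
   accuracy * correct_cost + (1 - accuracy) * incorrect_cost */
double predicted_latency(double accuracy, double correct_cost,
                         double incorrect_cost) {
    return accuracy * correct_cost + (1.0 - accuracy) * incorrect_cost;
}
```

With these numbers, 0.85 * 1 + 0.15 * 3 = 1.3 cycles on average, still well under the latency of reading all ways and selecting late.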
Compiler Techniques
• Reduce conflicts in the I-cache: a 1989 study showed reordering code reduced misses by 50% for a 2 KB cache and by 75% for an 8 KB cache
• The D-cache behaves differently
Compiler Data Optimizations: Loop Interchange
• Before (inner loop strides through memory by a whole row length; bounds N and M are illustrative):
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            x[i][j] = 2 * x[i][j];
• After (inner loop walks memory sequentially):
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
• Improved spatial locality: in C's row-major layout, the inner loop now touches adjacent elements
Blocking: Improve Spatial Locality
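The before/after pair can be sketched with matrix multiply, the classic blocking example; the dimension N and tile size B below are tiny illustrative values (real tiles are sized so three B×B tiles fit in the cache, and production code handles N not divisible by B):

```c
#define N 4   /* matrix dimension (illustrative) */
#define B 2   /* tile size (illustrative); must divide N in this sketch */

/* Before: y += a * x. Every row of x is streamed from memory N times,
   so for large N each pass misses again. */
void mm(double y[N][N], double a[N][N], double x[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += a[i][k] * x[k][j];
            y[i][j] += r;
        }
}

/* After: operate on BxB tiles so each tile of x stays cache-resident
   while it is reused. */
void mm_blocked(double y[N][N], double a[N][N], double x[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += a[i][k] * x[k][j];
                    y[i][j] += r;
                }
}
```

Both versions compute the same result; only the order of memory accesses changes.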
Miss Rate and Miss Penalty Reduction via Parallelism
Nonblocking Caches
• Reduces stalls on cache misses
• A blocking cache refuses all requests while waiting for miss data
• A nonblocking cache continues to handle other requests while waiting for the data of an earlier miss
• Increases cache-controller complexity
Nonblocking Cache (8 KB direct-mapped L1; 32-byte blocks)
Hardware Prefetch
• On a miss, fetch two blocks: the desired block plus the next one
• The "next" block goes into a stream buffer; on a fetch, check the stream buffer first
• Performance:
  – a single-block instruction stream buffer caught 15% to 25% of L1 misses
  – a 4-block stream buffer caught 50%
  – a 16-block stream buffer caught 72%
Hardware Prefetch
• Data prefetch:
  – a single data stream buffer caught 25% of L1 misses
  – 4 data stream buffers caught 43%
  – 8 data stream buffers caught 50% to 70%
• Prefetch from multiple addresses: the UltraSPARC III handles 8 prefetches and calculates the stride for the next prediction
Software Prefetch
• Many processors, such as the Itanium, have prefetch instructions
• Remember that they are nonfaulting: a prefetch of an illegal address simply does nothing instead of raising an exception
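As a sketch of how software prefetch looks in practice, GCC and Clang expose it through the __builtin_prefetch builtin, which compiles to the target's nonfaulting prefetch instruction where one exists; the prefetch distance below is an arbitrary illustrative choice, and tuning it depends on memory latency and loop cost:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* illustrative distance, in elements */

/* Sum an array, prefetching a few cache lines ahead of the element
   currently being read. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* args: address, 0 = read, 1 = moderate temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```

Because the prefetch is a hint, dropping the bounds check would still be safe on most targets, but guarding it keeps the sketch portable.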
Hit Time Reduction
Small, Simple Caches
• Hit time has two components:
  – indexing
  – comparing the tag
• A small cache indexes quickly
• A simple direct-mapped cache allows tag comparison in parallel with the data load
• One option: keep the L2 tags on chip with the data off chip
Time vs cache size & organization
Perspective on previous graph
• Recall:
  – a 1 ns clock is 10^-9 sec/cycle; 1 GHz is 10^9 cycles/sec
• Therefore:
  – a 2 ns clock is 500 MHz
  – a 4 ns clock is 250 MHz
• Conclusion: small differences in ns represent large differences in MHz
Virtual vs Physical Address in L1
• Translating a virtual address to a physical address as part of a cache access adds time on the critical path
• Translation is needed for both the index and the tag
• Making the common case fast suggests avoiding translation on hits (misses must be translated)
Why are (almost all) L1 caches physical?
• Security (protection): page-level protection must be checked on each access (though protection data can be copied into the cache)
• A process switch can change the virtual mapping, requiring a cache flush (or a process ID in the tag) [see next slide]
• Synonyms: two virtual addresses for the same (shared) physical address
Virtually-addressed cache context-switch cost
Hybrid: Virtually Indexed, Physically Tagged
• Index with the part of the page offset that is identical in virtual and physical addresses, i.e. the index bits are a subset of the page-offset bits
• In parallel with indexing, translate the virtual address to check the physical tag
• Limitation: a direct-mapped cache can be no larger than the page size (determined by the address bits); set-associative caches can be bigger, since fewer bits are needed for the index
Example
• Pentium III: 8 KB pages with a 16 KB 2-way set-associative cache
• IBM 3033: 4 KB pages with a 64 KB 16-way set-associative cache (note that 8-way would be sufficient, but 16-way is needed to keep the index bits sufficiently small)
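The constraint behind both examples is that the bytes indexed per way must fit within a page, i.e. cache size / associativity ≤ page size. A tiny helper (hypothetical name) makes the check explicit:

```c
#include <stdbool.h>

/* Virtually indexed, physically tagged: the index and block-offset
   bits must come from the page offset, so the size of one way
   (cache_size / associativity) cannot exceed the page size. */
bool vipt_index_fits(unsigned cache_size, unsigned assoc,
                     unsigned page_size) {
    return cache_size / assoc <= page_size;
}
```

This confirms the slide's parenthetical: the 64 KB IBM 3033 cache at 8-way would need 8 KB per way, which overflows its 4 KB pages, while 16-way brings each way down to exactly 4 KB.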
Trace Cache
• Pentium 4 NetBurst architecture
• I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around memory addresses
• Advantage over regular large cache blocks, which contain branches and hence many unused instructions (e.g., AMD Athlon 64-byte blocks contain 16-24 x86 instructions, with 1 in 5 being a branch)
• Disadvantage: complex addressing
Trace Cache
• The P4 trace cache (I-cache) sits after decode and branch prediction, so it contains:
  – μops
  – only the desired instructions
• The trace cache holds 12K μops
• The branch-predict BTB is 4K entries (a 33% improvement over the PIII)
Summary (so far)
• Figure 5.26 summarizes all of these techniques
Main Memory
• Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster:
  – a wider path to memory brings in more words at a time, e.g., one address request brings in 4 words (reducing overhead)
  – interleaved memory allows the memory banks to respond faster
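A small worked example makes the two options concrete. All timings below are assumptions for illustration (not figures from the slides): 1 cycle to send the address, 15 cycles per memory access, 1 cycle to transfer one word, 4-word cache blocks.

```c
/* Assumed timings, in cycles (illustrative only). */
enum { ADDR = 1, ACCESS = 15, XFER = 1, WORDS = 4 };

/* One-word-wide memory: every word pays the full access latency. */
int penalty_narrow(void)      { return ADDR + WORDS * (ACCESS + XFER); }

/* Four-word-wide memory: one access delivers the whole block. */
int penalty_wide(void)        { return ADDR + ACCESS + XFER; }

/* Four-way interleaved banks: accesses overlap across banks,
   only the word transfers serialize on the bus. */
int penalty_interleaved(void) { return ADDR + ACCESS + WORDS * XFER; }
```

Under these assumptions the miss penalties are 65, 17, and 20 cycles respectively: interleaving captures most of the benefit of a wide bus without the cost of widening it.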