Transcript of Improving Cache Performance

Improving Cache Performance

• Four categories of optimisation:
  – Reduce miss rate
  – Reduce miss penalty
  – Reduce miss rate or miss penalty using parallelism
  – Reduce hit time

AMAT = Hit time + Miss rate × Miss penalty
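The AMAT formula can be checked with a quick numeric sketch; the values below are illustrative, not from the slides:

```python
# AMAT (average memory access time) = hit time + miss rate * miss penalty.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1.0, 0.05, 100.0))  # 6.0 cycles
```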


5.5. Reducing Miss Rate

• Three sources of misses:
  – Compulsory
    • “Cold start” misses
  – Capacity
    • Cache is full
  – Conflict
    • Set is full / block is occupied

• Increase block size

• Increase size of cache

• Increase degree of associativity


Larger Block Size

• Bigger blocks reduce compulsory misses
  – Spatial locality

• BUT:
  – Increased miss penalty
    • More data to transfer
  – Possibly increased overall miss rate
    • More conflict and capacity misses as there are fewer blocks


Effect of Block Size

[Figure: three plots against block size: miss rate, miss penalty (access plus transfer components), and AMAT.]
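The trade-off on this slide can be reproduced with a toy model in which miss penalty is access latency plus transfer time, and miss rate first falls (spatial locality) then rises (fewer blocks). All constants below are made up for illustration:

```python
def amat(block_size, cache_size=16 * 1024):
    # Toy miss-penalty model: fixed access latency plus transfer time
    # at an assumed 16 bytes per cycle.
    miss_penalty = 40 + block_size / 16
    # Toy miss-rate model: bigger blocks exploit spatial locality, but
    # fewer blocks means more conflict/capacity misses.
    num_blocks = cache_size / block_size
    miss_rate = 0.5 / block_size + 2.0 / num_blocks
    return 1 + miss_rate * miss_penalty

rates = {b: amat(b) for b in (16, 32, 64, 128, 256, 512)}
best = min(rates, key=rates.get)  # AMAT has a minimum at a moderate block size
```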


Larger Caches

• Reduces capacity misses

• Increases hit time and cost


Higher Associativity

• Miss rates improve with higher associativity

• Two rules of thumb:
  – 8-way set associative caches are almost as effective as fully associative
    • But much simpler!
  – 2:1 cache rule
    • A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2


Way Prediction

• Set-associative cache predicts which block will be needed on next access to the set

• Only one tag check is done
  – If mispredicted, the whole set must be checked

• E.g. Alpha 21264 instruction cache
  – Prediction rate > 85%
  – Correct prediction: 1-cycle hit
  – Misprediction: 3 cycles
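Using the Alpha 21264 figures from this slide, the expected hit time works out as follows (taking the 85% figure as a lower bound on the prediction rate):

```python
# Expected hit time with way prediction: 1 cycle on a correct
# prediction, 3 cycles on a misprediction, >= 85% predicted correctly.
p_correct = 0.85
avg_hit = p_correct * 1 + (1 - p_correct) * 3  # about 1.3 cycles (an upper bound)
```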


Pseudo-Associative Caches

• Check a direct mapped cache for a hit as usual

• If it misses, check a second block
  – Invert the MSB of the index

• One fast and one slow hit time
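The two probe indices can be sketched as bit manipulation; the cache geometry below (32-byte blocks, 256 sets) is an assumption for illustration, not from the slides:

```python
# Pseudo-associative probing: on a miss in the primary block, probe the
# block whose index has its most significant bit inverted.
OFFSET_BITS, INDEX_BITS = 5, 8  # assumed: 32-byte blocks, 256 sets

def primary_index(addr):
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def secondary_index(addr):
    # Flip the MSB of the index field for the second (slow) probe.
    return primary_index(addr) ^ (1 << (INDEX_BITS - 1))

# For address 0x1234 the primary index is 0x91, the secondary 0x11.
```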


Compiler Optimisations

• Compilers can optimise code to minimise miss rates:
  – Reordering procedures
  – Aligning basic blocks with cache blocks
  – Reorganising array element accesses
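Reorganising array accesses usually means loop interchange: traversing a row-major array row by row so consecutive accesses hit consecutive memory. A sketch of the idea (the array sizes are arbitrary):

```python
# Loop interchange: for a row-major 2-D array, the row-by-row walk
# touches increasing linear addresses (cache-friendly), while the
# column-by-column walk strides by a whole row per access.
ROWS, COLS = 4, 4

order_bad = [(i, j) for j in range(COLS) for i in range(ROWS)]   # column-major walk
order_good = [(i, j) for i in range(ROWS) for j in range(COLS)]  # row-major walk

# The interchanged loop visits elements in increasing linear address i*COLS + j.
linear = [i * COLS + j for i, j in order_good]
```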


5.6. Reduce Miss Rate or Miss Penalty via Parallelism

• Three techniques that overlap instruction execution with memory access


Nonblocking caches

• Dynamic scheduling allows CPU to continue with other instructions while waiting for data

• Nonblocking cache allows other cache accesses to continue while waiting for data


Hardware Prefetching

• Fetch data/instructions before they are requested by the processor
  – Either into the cache or another buffer

• Particularly useful for instructions
  – High degree of spatial locality

• UltraSPARC III
  – Special prefetch cache for data
  – Increases effectiveness by about four times


Compiler Prefetching

• Compiler inserts “prefetch” instructions

• Two types:
  – Prefetch register value
  – Prefetch data cache block

• Can be faulting or non-faulting

• Cache continues as normal while data is prefetched


SPARC V9

• Prefetch instruction:

  prefetch [%rs1 + %rs2], fcn
  prefetch [%rs1 + imm13], fcn

• fcn = prefetch function:
  – 0 = Prefetch for several reads
  – 1 = Prefetch for one read
  – 2 = Prefetch for several writes
  – 3 = Prefetch for one write
  – 4 = Prefetch page


5.7. Reducing Hit Time

• Critical
  – Often affects the CPU clock cycle time


Small, simple caches

• Small usually equals fast in hardware

• A small cache may reside on the processor chip
  – Decreases communication
  – Compromise: tags on chip, data separate

• Direct mapped
  – Data can be read in parallel with tag checking


Avoiding address translation

• Physical caches
  – Use physical addresses
  – Address translation must happen before cache lookup

• Virtual caches
  – Use virtual addresses
  – Protection issues
  – High context-switching overhead


Virtual caches

• Minimising context-switch overhead:
  – Add a process-identifier tag to the cache

• Multiple virtual addresses may refer to a single physical address (aliasing)
  – Hardware enforces anti-aliasing
  – Software requires the less significant bits to be the same


Avoiding address translation (cont.)

• Choice of page size:
  – Bigger than cache index + offset
  – Address translation and tag lookup can then happen in parallel

[Figure: the address divides into tag, index, and offset; the CPU's page number goes to VM translation while the page offset indexes the cache in parallel.]
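The "bigger than cache index + offset" condition is simple bit arithmetic. A sketch with assumed sizes (8 KiB pages, a 16 KiB 2-way cache with 32-byte blocks; none of these numbers are from the slides):

```python
import math

# For parallel translation and cache lookup, the cache index plus block
# offset bits must fit inside the page offset.
page_offset_bits = int(math.log2(8 * 1024))   # 8 KiB pages -> 13 bits
block_offset_bits = int(math.log2(32))        # 32-byte blocks -> 5 bits
sets = (16 * 1024) // (2 * 32)                # 16 KiB, 2-way -> 256 sets
index_bits = int(math.log2(sets))             # -> 8 bits

# 5 + 8 = 13 <= 13, so this geometry permits parallel lookup.
assert index_bits + block_offset_bits <= page_offset_bits
```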


Pipelining cache access

• Split cache access into several stages

• Impacts on branch and load delays


Trace caches

• Blocks follow program flow rather than spatial locality!

• Branch prediction is taken into account by cache

• Intel NetBurst microarchitecture

• Complicates address mapping

• Minimises wasted space within blocks


Cache Optimisation Summary

• Cache optimisation is very complex
  – Improving one factor may have a negative impact on another


5.8. Main Memory

• Latency and bandwidth are both important

• Latency is composed of two factors:
  – Access time
  – Cycle time

• Two main technologies:
  – DRAM
  – SRAM


5.9. Virtual Memory

• Physical memory is divided into blocks
  – Allocated to processes
  – Provides protection
  – Allows swapping to disk
  – Simplifies loading

• Historically:
  – Overlays
    • Programmer-controlled swapping


Terminology

• Block:
  – Page
  – Segment

• Miss:
  – Page fault
  – Address fault

• Memory mapping (address translation)
  – Virtual address → physical address


Characteristics

• Block size
  – 4 kB – 64 kB

• Hit time
  – 50 – 150 cycles

• Miss penalty
  – 1 000 000 – 10 000 000 cycles

• Miss rate
  – 0.000 01% – 0.001%
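These ranges show why page faults dominate: even a vanishingly small miss rate costs measurable time. A quick calculation with mid-range values from the slide:

```python
# Effective access time for virtual memory, using the slide's ranges:
# 100-cycle hit (mid-range), 1 000 000-cycle penalty (low end),
# 0.00001% miss rate expressed as a fraction.
hit_time = 100
miss_penalty = 1_000_000
miss_rate = 0.00001 / 100   # = 1e-7

eff = hit_time + miss_rate * miss_penalty  # about 100.1 cycles
```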


Categorising VM Systems

• Fixed block size
  – Pages

• Variable block size
  – Segments
  – Difficult replacement

• Hybrid approaches
  – Paged segments
  – Multiple page sizes (2^n × smallest)


Q1: Block placement?

• Anywhere in memory
  – “Fully associative”
  – Minimises miss rate


Q2: Block identification?

• Page/segment number gives the physical page address
  – Paging: offset concatenated
  – Segmentation: offset added

• Uses a page table
  – Sized by the number of pages in the virtual address space
  – Save space: inverted page table
    • Sized by the number of pages in physical memory
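A minimal sketch of paged translation, with the offset concatenated unchanged; the page size and table contents are hypothetical:

```python
PAGE_BITS = 12  # assumed 4 KiB pages

# The page table maps virtual page number (VPN) to physical page frame
# number (PFN); here a dict stands in for the in-memory table.
page_table = {0x3: 0x0A1, 0x4: 0x0B7}  # hypothetical entries

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn not in page_table:
        raise LookupError("page fault")   # would trap to the OS
    # Paging: physical frame number concatenated with the unchanged offset.
    return (page_table[vpn] << PAGE_BITS) | offset

# translate(0x3ABC) gives 0xA1ABC: frame 0x0A1, offset 0xABC.
```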


Q3: Block replacement?

• Least-recently used (LRU)
  – Minimises miss rate
  – Hardware provides a use bit or reference bit
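One common way the OS approximates LRU with a single hardware reference bit is the clock (second-chance) algorithm; a sketch, not any particular OS's implementation:

```python
# Clock (second chance): approximates LRU with one reference bit per
# frame. The hand sweeps frames, clearing set bits; the first frame
# found with a clear bit is the victim.
class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes   # page held in each frame
        self.ref = [0] * nframes         # hardware-set reference bits
        self.hand = 0

    def access(self, page):
        if page in self.frames:                      # hit: set the use bit
            self.ref[self.frames.index(page)] = 1
            return "hit"
        while self.ref[self.hand]:                   # give recently used frames a second chance
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page                # evict victim, install page
        self.ref[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.frames)
        return "miss"
```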


Q4: Write strategy?

• Write back
  – With a dirty bit

You won’t become famous by being the first to try write through!


Fast Address Translation

• Page tables are big
  – Stored in memory themselves
  – Two memory accesses for every datum!

• Principle of locality
  – Cache recent translations
  – Translation look-aside buffer (TLB), or translation buffer (TB)
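The TLB idea can be sketched as a small fully associative translation cache with LRU replacement (64 entries, matching the UltraSPARC TLB described later in these slides; the structure below is a software illustration, not how the hardware is built):

```python
from collections import OrderedDict

# Tiny fully associative TLB: caches recent VPN -> PFN translations so
# most accesses skip the in-memory page table walk.
class TLB:
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()         # VPN -> PFN, kept in LRU order

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)    # mark as most recently used
            return self.map[vpn]
        return None                      # TLB miss: walk the page table

    def insert(self, vpn, pfn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False) # evict the least recently used entry
        self.map[vpn] = pfn
```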


Alpha 21264 TLB


Selecting a Page Size

• Big
  – Smaller page table
  – Allows parallel cache access
  – Efficient disk transfers
  – Reduces TLB misses

• Small
  – Less memory wastage (internal fragmentation)
  – Quicker process startup


Putting it ALL Together!

SPARC Revisited


Two SPARCs

• SuperSPARC
  – 1992
  – 32-bit superscalar design

• UltraSPARC
  – Late 1990s
  – 64-bit design
  – Graphics support (VIS)


UltraSPARC

• Four-way superscalar execution

• Two integer ALUs

• FP unit
  – Five functional units

• Graphics unit


Pipeline

• 9 stages:
  – Fetch
  – Decode
  – Grouping
  – Execution
  – Cache access
  – Load miss
  – Integer pipe wait (for FP/graphics pipelines)
  – Trap resolution
  – Writeback


Branch Handling

• Dynamic branch prediction
  – Two-bit scheme
  – Every second instruction in the cache has prediction bits (predicts up to 2048 branches)
  – 88% success rate (integer)

• Target prediction
  – Fetches from the predicted path
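The two-bit scheme is a saturating counter per branch: predict taken when the counter is in the upper half, and require two consecutive mispredictions to flip a strong prediction. A sketch:

```python
# Two-bit saturating counter: states 0-3; predict taken for 2 or 3.
def update(counter, taken):
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)

def predict(counter):
    return counter >= 2

c = 0                      # start strongly not-taken
for taken in (True, True): # two taken outcomes...
    c = update(c, taken)
# ...now in state 2: predicts taken, and one not-taken outcome
# only weakens the prediction rather than flipping it.
```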


FPU

• Five functional units:
  – Add
  – Multiply
  – Divide/square root
  – Two graphics units (add and multiply)

• Mostly fully pipelined (latency 3 cycles)
  – Except divide and square root (not pipelined; latency is 22 cycles for 64-bit)


Memory Hierarchy

• On-chip instruction and data caches
  – Data: 16 kB, direct-mapped, write-through
  – Instructions: 16 kB, 2-way set associative
  – Both virtually addressed

• External cache
  – Up to 4 MB


Virtual Memory

• 64-bit virtual addresses → 44-bit physical addresses

• TLB
  – 64-entry, fully associative cache


Multimedia Support (VIS)

• Integrated with the FPU

• Partitioned operations
  – Multiple smaller values packed into 64 bits

• Video compression instructions
  – E.g. a motion estimation instruction replaces 48 simple instructions in MPEG compression
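The partitioned-operation idea (one 64-bit word treated as independent lanes) can be emulated in software. This sketch adds four 16-bit lanes with per-lane wrap-around; it illustrates the concept only and does not reproduce actual VIS instruction semantics:

```python
# Partitioned add: treat a 64-bit word as four independent 16-bit
# lanes and add lane-wise, with no carry propagating between lanes.
LANES, BITS = 4, 16
MASK = (1 << BITS) - 1

def partitioned_add(a, b):
    out = 0
    for lane in range(LANES):
        s = ((a >> (lane * BITS)) + (b >> (lane * BITS))) & MASK
        out |= s << (lane * BITS)
    return out

x = 0x0001_0002_0003_0004
y = 0x0010_0020_0030_0040
# partitioned_add(x, y) yields 0x0011_0022_0033_0044: each lane added independently.
```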


The End!
