A Scalable Front-End Architecture for Fast Instruction Delivery

Transcript of A Scalable Front-End Architecture for Fast Instruction Delivery

Page 1: A Scalable Front-End Architecture for Fast Instruction Delivery

A Scalable Front-End Architecture for Fast Instruction Delivery

Paper by: Glenn Reinman, Todd Austin and Brad Calder

Presenter: Alexander Choong

Page 2: A Scalable Front-End Architecture for Fast Instruction Delivery

Conventional Pipeline Architecture

High-performance processors can be broken down into two parts:

Front-end: fetches and decodes instructions

Execution core: executes instructions

Page 3: A Scalable Front-End Architecture for Fast Instruction Delivery

Front-End and Pipeline

Simple Front-End

[Diagram: overlapped Fetch stages feeding Decode stages down the pipeline]

Page 4: A Scalable Front-End Architecture for Fast Instruction Delivery

Front-End with Prediction

[Diagram: front-end with prediction — each Fetch stage paired with a Predict stage, shown against the simple front-end's Fetch/Decode pipeline]

Page 5: A Scalable Front-End Architecture for Fast Instruction Delivery

Front-End Issues I

Flynn’s bottleneck: IPC is bounded by the number of instructions fetched per cycle.

Implication: as execution performance increases, the front-end must keep up to ensure overall performance.
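The bound can be illustrated with a toy model (Python; not from the paper — the widths and cycle count are hypothetical): however wide the execution core, sustained IPC is capped by the fetch width.

```python
# Toy model of Flynn's bottleneck: sustained IPC cannot exceed the
# number of instructions fetched per cycle, no matter how wide the core.
def sustained_ipc(fetch_width: int, issue_width: int, cycles: int) -> float:
    buffered = 0   # instructions waiting between front-end and core
    executed = 0
    for _ in range(cycles):
        buffered += fetch_width              # front-end delivers this many
        done = min(buffered, issue_width)    # core executes what it can
        buffered -= done
        executed += done
    return executed / cycles

# A 16-wide core behind an 8-wide front-end still sustains only 8 IPC.
print(sustained_ipc(fetch_width=8, issue_width=16, cycles=1000))  # 8.0
```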

Page 6: A Scalable Front-End Architecture for Fast Instruction Delivery

Front-End Issues II

Two opposing forces:

Designing a faster front-end: increase the I-cache size

Interconnect scaling problem (wire performance does not scale with feature size): decrease the I-cache size

Page 7: A Scalable Front-End Architecture for Fast Instruction Delivery

Key Contributions I

Page 8: A Scalable Front-End Architecture for Fast Instruction Delivery

Key Contributions: Fetch Target Queue

Objective: avoid using a large cache with branch prediction

Purpose: decouple the I-cache from branch prediction

Result: improves throughput

Page 9: A Scalable Front-End Architecture for Fast Instruction Delivery

Key Contributions: Fetch Target Buffer

Objective: avoid large caches with branch prediction

Implementation: a multi-level buffer

Results: delivers performance 25% better than a single-level design, and scales better with “future” feature sizes

Page 10: A Scalable Front-End Architecture for Fast Instruction Delivery

Outline

Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Page 11: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Queue: decouples the I-cache from branch prediction

The branch predictor can generate predictions independent of when the I-cache uses them

[Diagram: simple front-end with coupled Fetch and Predict stages]

Page 12: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Queue: decouples the I-cache from branch prediction

The branch predictor can generate predictions independent of when the I-cache uses them

[Diagram: front-end with FTQ — Predict stages run ahead of the decoupled Fetch stages]

Page 13: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Queue

Fetch and predict can have different latencies, which allows the I-cache to be pipelined, as long as the two stages have the same throughput.
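A minimal Python sketch of this decoupling (an assumed simplification, not the hardware design): the predictor enqueues fetch targets into the FTQ and the I-cache dequeues them later, so the two stages only need to match in throughput, not latency.

```python
from collections import deque

# Sketch of an FTQ: the branch predictor enqueues fetch targets ahead of
# time; the (possibly pipelined) I-cache dequeues them when ready.
class FetchTargetQueue:
    def __init__(self, capacity: int):
        self.entries = deque()
        self.capacity = capacity

    def enqueue(self, target: int) -> bool:
        """Predictor side: returns False (predictor stalls) when full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(target)
        return True

    def dequeue(self):
        """Fetch side: returns None (fetch stalls) when empty."""
        return self.entries.popleft() if self.entries else None

ftq = FetchTargetQueue(capacity=4)
for target in (0x1000, 0x1040, 0x1080):  # predictor runs several blocks ahead
    ftq.enqueue(target)
print(hex(ftq.dequeue()))  # 0x1000 — I-cache fetches the oldest target first
```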

Page 14: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Blocks

The FTQ stores fetch blocks: sequences of instructions starting at a branch target and ending at a strongly biased branch.

Instructions are fed directly into the pipeline.
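One plausible shape for such an entry (a sketch; the field names and fixed 4-byte instruction size are assumptions, not the paper's encoding):

```python
from dataclasses import dataclass

# Hypothetical FTQ entry: a fetch block starts at a branch target and
# ends at a strongly biased branch.
@dataclass
class FetchBlock:
    start_pc: int         # branch target that begins the block
    length: int           # number of sequential instructions in the block
    ends_in_branch: bool  # terminated by a strongly biased branch?

    def fall_through_pc(self) -> int:
        # Assuming fixed 4-byte instructions.
        return self.start_pc + 4 * self.length

block = FetchBlock(start_pc=0x1718, length=6, ends_in_branch=True)
print(hex(block.fall_through_pc()))  # 0x1730
```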

Page 15: A Scalable Front-End Architecture for Fast Instruction Delivery

Outline

Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Page 16: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Buffer: Outline
Review: Branch Target Buffer
Fetch Target Buffer
Fetch Blocks
Functionality

Page 17: A Scalable Front-End Architecture for Fast Instruction Delivery

Review: Branch Target Buffer I

Previous work (Perleberg and Smith [2]): makes fetch independent of predict

[Diagram: simple front-end vs. front-end with a Branch Target Buffer, with Fetch and Predict stages overlapped]

Page 18: A Scalable Front-End Architecture for Fast Instruction Delivery

Review: Branch Target Buffer II

Characteristics:
Hash table
Makes predictions
Caches prediction information

Page 19: A Scalable Front-End Architecture for Fast Instruction Delivery

Review: Branch Target Buffer III

The BTB is indexed by the PC; each entry holds:

Index/Tag | Branch Prediction | Predicted branch target | Fall-through address | Instructions at branch
0x1718    | Taken             | 0x1834                  | 0x1788               | add, sub
0x1734    | Taken             | 0x2088                  | 0x1764               | neq, br
0x1154    | Not taken         | 0x1364                  | 0x1200               | ld, store
…         | …                 | …                       | …                    | …
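The lookup this table performs can be sketched as follows (Python; a dict stands in for the hardware hash table, with entries copied from the example above):

```python
# BTB sketch: maps a branch PC to (prediction, taken target, fall-through).
# Entries mirror the example table; a real BTB is a tagged hardware array.
btb = {
    0x1718: ("taken",     0x1834, 0x1788),
    0x1734: ("taken",     0x2088, 0x1764),
    0x1154: ("not taken", 0x1364, 0x1200),
}

def next_fetch_pc(pc: int) -> int:
    entry = btb.get(pc)
    if entry is None:
        return pc + 4  # BTB miss: fetch the next sequential instruction
    prediction, target, fall_through = entry
    return target if prediction == "taken" else fall_through

print(hex(next_fetch_pc(0x1718)))  # 0x1834 (predicted taken)
print(hex(next_fetch_pc(0x1154)))  # 0x1200 (predicted not taken)
```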

Page 20: A Scalable Front-End Architecture for Fast Instruction Delivery

FTB Optimizations over BTB

Multi-level design solves a conundrum: the buffer must be small enough to be fast, yet have enough space to successfully predict branches.

Page 21: A Scalable Front-End Architecture for Fast Instruction Delivery

FTB Optimizations over BTB: Oversize bit

Indicates whether a block is larger than a cache line. With a multi-ported cache, this allows several smaller blocks to be loaded at the same time.

Page 22: A Scalable Front-End Architecture for Fast Instruction Delivery

FTB Optimizations over BTB: Partial fall-through address

The fall-through address is close to the current PC, so the FTB only needs to store an offset.
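The offset trick can be sketched like this (Python; the 5-bit width matches the paper's finding that 4-5 bits suffice, but the encoding itself is illustrative):

```python
OFFSET_BITS = 5  # the paper reports 4-5 fall-through bits are enough

def encode_fall_through(pc: int, fall_through: int) -> int:
    """Store only the distance from the PC, in 4-byte instructions."""
    offset = (fall_through - pc) // 4
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("fetch block too long to encode")
    return offset

def decode_fall_through(pc: int, offset: int) -> int:
    return pc + 4 * offset

offset = encode_fall_through(0x1718, 0x1730)   # a 6-instruction block
print(offset, hex(decode_fall_through(0x1718, offset)))  # 6 0x1730
```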

Page 23: A Scalable Front-End Architecture for Fast Instruction Delivery

FTB Optimizations over BTB: Doesn’t store every block

Fall-through blocks and blocks whose branches are seldom taken are omitted.

Page 24: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Buffer

Each FTB entry holds:
Next PC
Target: target address of the branch
Type: conditional, subroutine call/return
Oversize: set if the block size exceeds a cache line

Page 25: A Scalable Front-End Architecture for Fast Instruction Delivery

Fetch Target Buffer

Page 26: A Scalable Front-End Architecture for Fast Instruction Delivery

PC used as index into FTB

Page 27: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 hit: the indexed FTB entry matches the PC and supplies the next fetch block.

Page 28: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 hit, branch predicted not taken: the next PC is the fall-through address.


Page 30: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 hit, branch predicted taken: the next PC is the predicted branch target.

Page 31: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 miss: the front-end falls through to the next sequential block while the L2 is probed.

Page 32: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 miss; after an N-cycle delay, L2 hit: the entry is delivered and brought into the L1.

Page 33: A Scalable Front-End Architecture for Fast Instruction Delivery

L1 and L2 miss: the front-end keeps falling through, and a wrong fall-through guess is eventually caught as a misprediction.
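The three cases above (L1 hit; L1 miss then delayed L2 hit; both miss) can be sketched as follows (Python; the dicts and delay constant are hypothetical stand-ins for the two FTB levels):

```python
def ftb_lookup(pc, l1, l2, l2_delay=3):
    """Return (predicted target or None, extra cycles spent)."""
    if pc in l1:
        return l1[pc], 0          # L1 hit: target available immediately
    if pc in l2:                  # L1 miss: fall through while probing L2
        l1[pc] = l2[pc]           # after an N-cycle delay, fill L1 from L2
        return l2[pc], l2_delay
    return None, 0                # both miss: keep falling through; a wrong
                                  # guess is repaired at misprediction time

l1 = {0x1718: 0x1834}             # small, fast first level
l2 = {0x2000: 0x2400}             # larger, slower second level
print(ftb_lookup(0x1718, l1, l2))  # L1 hit, no delay
print(ftb_lookup(0x2000, l1, l2))  # L2 hit after the delay
```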

Page 34: A Scalable Front-End Architecture for Fast Instruction Delivery

Hybrid branch prediction

Meta-predictor selects between:
Local history predictor
Global history predictor
Bimodal predictor
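A two-way simplification of this selection (Python sketch; the slide's design arbitrates among three components, and the 2-bit counter width here is an assumption):

```python
# Tournament selection: a 2-bit meta counter per branch decides whether to
# trust the global or the local component predictor.
def hybrid_predict(meta: int, local_pred: bool, global_pred: bool) -> bool:
    return global_pred if meta >= 2 else local_pred

def update_meta(meta: int, local_ok: bool, global_ok: bool) -> int:
    if global_ok and not local_ok:
        return min(meta + 1, 3)   # reward the global predictor
    if local_ok and not global_ok:
        return max(meta - 1, 0)   # reward the local predictor
    return meta                   # both right or both wrong: no change

meta = 2
print(hybrid_predict(meta, local_pred=False, global_pred=True))   # True
meta = update_meta(meta, local_ok=True, global_ok=False)
print(hybrid_predict(meta, local_pred=False, global_pred=True))   # False
```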

Page 35: A Scalable Front-End Architecture for Fast Instruction Delivery

Branch Prediction

[Diagram: hybrid predictor — meta-predictor selecting between the bimodal predictor and the local-history and global predictors]

Page 36: A Scalable Front-End Architecture for Fast Instruction Delivery

Branch Prediction

Page 37: A Scalable Front-End Architecture for Fast Instruction Delivery

Committing Results

When full, the speculative history queue (SHQ) commits its oldest value to the local or global history.

Page 38: A Scalable Front-End Architecture for Fast Instruction Delivery

Outline

Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Page 39: A Scalable Front-End Architecture for Fast Instruction Delivery

Experimental Methodology I

Baseline architecture

Processor:
8-instruction fetch with 16-instruction issue per cycle
128-entry reorder buffer with 32-entry load/store buffer
8-cycle minimum branch misprediction penalty

Cache:
64K 2-way instruction cache
64K 4-way data cache (pipelined)

Page 40: A Scalable Front-End Architecture for Fast Instruction Delivery

Experimental Methodology II

Timing model: CACTI cache compiler
Models on-chip memory
Modified for 0.35 um, 0.18 um, and 0.10 um processes

Test set:
6 SPEC95 benchmarks
2 C++ programs

Page 41: A Scalable Front-End Architecture for Fast Instruction Delivery

Outline

Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Page 42: A Scalable Front-End Architecture for Fast Instruction Delivery

Comparing FTB to BTB

The FTB provides slightly better performance, tested for cache sizes of 64, 256, 1K, 4K, and 8K entries.

Page 43: A Scalable Front-End Architecture for Fast Instruction Delivery

Comparing multi-level FTB to single-level FTB

Two-level FTB performance:
Smaller fetch size: two-level average 6.6 vs. single-level average 7.5
Higher accuracy on average: two-level 83.3% vs. single-level 73.1%
Higher performance: 25% average speedup over the single-level design

Page 44: A Scalable Front-End Architecture for Fast Instruction Delivery

Fall-through Bits Used

Number of fall-through bits used: 4-5, because fetch distances beyond 16 instructions do not improve performance.

Page 45: A Scalable Front-End Architecture for Fast Instruction Delivery

FTQ Occupancy

Roughly indicates throughput

On average, the FTQ is empty 21.1% and full 10.7% of the time

Page 46: A Scalable Front-End Architecture for Fast Instruction Delivery

Scalability

Two-level FTBs scale well with feature size (in the speedup chart, a higher slope is better).

Page 47: A Scalable Front-End Architecture for Fast Instruction Delivery

Outline

Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Page 48: A Scalable Front-End Architecture for Fast Instruction Delivery

Analysis

25% improvement in IPC over the best-performing single-level designs

The system scales well with feature size

On average, the FTQ is empty only 21.1% of the time

The FTB design requires at most 5 bits for the fall-through address

Page 49: A Scalable Front-End Architecture for Fast Instruction Delivery

Conclusion

The FTQ and FTB design:
Decouples the I-cache from branch prediction, producing higher throughput
Uses a multi-level buffer, producing better scalability

Page 50: A Scalable Front-End Architecture for Fast Instruction Delivery

References

[1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. ACM/IEEE 26th Annual International Symposium on Computer Architecture. May 1999.

[2] Branch Target Buffer: Design and Optimization. Chris Perleberg and Alan Smith. Technical report. December 1989.

Page 51: A Scalable Front-End Architecture for Fast Instruction Delivery

Thank you

Questions?