A Scalable Front-End Architecture for Fast Instruction Delivery
Paper by: Glenn Reinman, Todd Austin and Brad Calder
Presenter: Alexander Choong
Conventional Pipeline Architecture
High-performance processors can be broken down into two parts:
Front-end: fetches and decodes instructions
Execution core: executes instructions
Front-End and Pipeline
[Diagram: a simple front-end pipelines Fetch stages followed by Decode; a front-end with prediction pairs each Fetch with a Predict stage]
Front-End Issues I
Flynn's bottleneck: IPC is bounded by the number of instructions fetched per cycle
Implication: as execution performance increases, the front-end must keep up to ensure overall performance
Front-End Issues II
Two opposing forces:
Designing a faster front-end pushes toward a larger I-cache
The interconnect scaling problem (wire performance does not scale with feature size) pushes toward a smaller I-cache
Key Contributions I
Fetch Target Queue (FTQ)
Objective: avoid using a large cache with branch prediction
Purpose: decouple the I-cache from branch prediction
Result: improves throughput
Key Contributions II
Fetch Target Buffer (FTB)
Objective: avoid using a large cache with branch prediction
Implementation: a multi-level buffer
Results: delivers performance 25% better than a single-level design, and scales better with "future" feature sizes
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Fetch Target Queue
Decouples the I-cache from branch prediction
The branch predictor can generate predictions independent of when the I-cache uses them
[Diagram: in the simple front-end, every Fetch is paired with a Predict; in the front-end with FTQ, Predict runs ahead of Fetch, with the queue buffering predictions between them]
Fetch Target Queue
Fetch and predict can have different latenciesAllows for I-cache to be pipelined
As long as they have the same throughput
Fetch Blocks
The FTQ stores fetch blocks: sequences of instructions starting at a branch target and ending at a strongly biased branch
Instructions are fed directly into the pipeline (see the sketch below)
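To make the decoupling concrete, here is a minimal C sketch of an FTQ entry and queue; the field names, widths, and queue size are illustrative assumptions, not details from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

/* One FTQ entry describes a fetch block (names/sizes assumed). */
typedef struct {
    uint32_t start_pc;   /* branch target that begins the block */
    uint32_t fall_thru;  /* address just past the block's ending branch */
    bool     taken;      /* predicted direction of the ending branch */
    uint32_t target;     /* predicted target if taken */
} ftq_entry;

#define FTQ_SIZE 4

typedef struct {
    ftq_entry entries[FTQ_SIZE];
    int head, tail, count;
} ftq;

/* Predictor side: enqueue a block as soon as it is predicted. */
bool ftq_push(ftq *q, ftq_entry e) {
    if (q->count == FTQ_SIZE) return false;  /* full: predictor stalls */
    q->entries[q->tail] = e;
    q->tail = (q->tail + 1) % FTQ_SIZE;
    q->count++;
    return true;
}

/* I-cache side: dequeue a block whenever fetch is ready, independent
 * of when the prediction was made. */
bool ftq_pop(ftq *q, ftq_entry *out) {
    if (q->count == 0) return false;         /* empty: fetch stalls */
    *out = q->entries[q->head];
    q->head = (q->head + 1) % FTQ_SIZE;
    q->count--;
    return true;
}
```

Because push and pop work on opposite ends of the queue, predict and fetch can run at different latencies as long as their average rates match.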
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Fetch Target Buffer: Outline
Review: Branch Target Buffer
Fetch Target Buffer
Fetch Blocks
Functionality
Review: Branch Target Buffer I
Previous work (Perleberg and Smith [2]): makes fetch independent of prediction
[Diagram: the simple front-end versus a front-end with a Branch Target Buffer]
Review: Branch Target Buffer II
Characteristics: a hash table that makes predictions and caches prediction information
Review: Branch Target Buffer III
The BTB is indexed by PC; example contents (see the sketch after the table):

Index/Tag  Prediction  Predicted target  Fall-through  Instructions at branch
0x1718     Taken       0x1834            0x1788        add, sub
0x1734     Taken       0x2088            0x1764        neq, br
0x1154     Not taken   0x1364            0x1200        ld, store
...        ...         ...               ...           ...
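As a reading aid, the table's columns map onto a record like the following; the C names and widths are assumptions, not from Perleberg and Smith's design.

```c
#include <stdbool.h>
#include <stdint.h>

/* One BTB entry mirroring the table above (names/widths assumed). */
typedef struct {
    uint32_t tag;        /* PC bits used to match the entry */
    bool     taken;      /* cached direction prediction */
    uint32_t target;     /* predicted branch target */
    uint32_t fall_thru;  /* full fall-through address */
} btb_entry;
```

Note that the BTB stores the full fall-through address; the FTB optimizations below shrink this field to a small offset.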
FTB Optimizations over BTB
Multi-level: solves a conundrum, namely the need for a small, fast cache versus the need for enough space to successfully predict branches
FTB Optimizations over BTB
Oversize bit: indicates whether a block is larger than a cache line
With a multi-ported cache, this allows several smaller blocks to be loaded at the same time
FTB Optimizations over BTB
Only stores a partial fall-through address: the fall-through address is close to the current PC, so only an offset needs to be stored (see the sketch below)
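A minimal sketch of the reconstruction, assuming 4-byte instructions and an offset counted in instructions; both assumptions are illustrative.

```c
#include <stdint.h>

/* Rebuild the full fall-through address from the block's start PC and
 * a small stored offset (in instructions). With fetch distances capped
 * around 16 instructions, 4-5 offset bits are enough. */
uint32_t fall_through(uint32_t block_start, uint8_t offset_insns) {
    return block_start + ((uint32_t)offset_insns << 2); /* x4 bytes */
}
```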
FTB Optimizations over BTB
Doesn't store every block: fall-through blocks and blocks that are seldom taken are omitted
Fetch Target Buffer
Each entry produces the next PC and holds:
Target of the branch
Type: conditional, subroutine call/return
Oversize bit: set if block size > cache line
Fetch Target Buffer: Functionality
The PC is used as an index into the FTB.
L1 hit, branch not taken: fetch continues at the fall-through address.
L1 hit, branch taken: fetch continues at the predicted target.
L1 miss: the front-end falls through while the L2 is probed; after an N-cycle delay, an L2 hit brings the entry into the L1.
L1 and L2 miss: the fall-through prediction eventually mispredicts, and the FTB is updated.
A sketch of this lookup follows.
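Here is a minimal C sketch of the two-level lookup, assuming direct-mapped levels, full-PC tags, and a single-bit direction prediction; the sizes and the immediate L2 probe are simplifications of the paper's N-cycle pipelined design.

```c
#include <stdbool.h>
#include <stdint.h>

/* One FTB entry (names/widths assumed; simplified direction bit). */
typedef struct {
    bool     valid;
    uint32_t tag;          /* full PC used as the tag for simplicity */
    uint32_t target;       /* predicted branch target */
    uint8_t  fall_offset;  /* partial fall-through address (offset) */
    bool     oversize;     /* block larger than a cache line */
    bool     taken;        /* direction prediction */
} ftb_entry;

#define L1_ENTRIES 64
#define L2_ENTRIES 1024

static ftb_entry l1[L1_ENTRIES], l2[L2_ENTRIES];

/* Predict the next fetch PC for the block starting at pc. */
uint32_t ftb_next_pc(uint32_t pc) {
    ftb_entry *e = &l1[(pc >> 2) % L1_ENTRIES];
    if (e->valid && e->tag == pc) {
        /* L1 hit: taken -> target, not taken -> fall-through. */
        return e->taken ? e->target
                        : pc + ((uint32_t)e->fall_offset << 2);
    }
    /* L1 miss: predict fall-through immediately... */
    uint32_t guess = pc + 4;
    /* ...and probe the L2 (really completes after an N-cycle delay). */
    ftb_entry *e2 = &l2[(pc >> 2) % L2_ENTRIES];
    if (e2->valid && e2->tag == pc)
        l1[(pc >> 2) % L1_ENTRIES] = *e2;  /* fill L1 for next time */
    /* If the L2 also misses, this fall-through guess may later
     * mispredict, and the FTB is updated with the resolved block. */
    return guess;
}
```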
Hybrid Branch Prediction
A meta-predictor selects between a local history predictor, a global history predictor, and a bimodal predictor
[Diagram: meta-predictor selecting among the bimodal, local-history, and global predictors]
Committing Results
When full, the speculative history queue (SHQ) commits its oldest value to the local or global history (a sketch of the meta-selection follows)
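A minimal sketch of meta-selection with a 2-bit saturating counter, shown for two components; the real design arbitrates among three predictors and keeps speculative history in the SHQ, so take the names and update rule as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating meta counter: 0-1 favour local, 2-3 favour global. */
typedef struct { uint8_t ctr; } meta_counter;

/* Choose which component prediction to use for this branch. */
bool meta_select(const meta_counter *m, bool local_pred, bool global_pred) {
    return (m->ctr >= 2) ? global_pred : local_pred;
}

/* On branch resolution, train the counter toward whichever component
 * was correct; no change when both were right or both were wrong. */
void meta_update(meta_counter *m, bool local_ok, bool global_ok) {
    if (global_ok && !local_ok && m->ctr < 3) m->ctr++;
    if (local_ok && !global_ok && m->ctr > 0) m->ctr--;
}
```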
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Experimental Methodology I
Baseline architecture:
Processor: 8-instruction fetch with 16-instruction issue per cycle; 128-entry reorder buffer with a 32-entry load/store buffer; 8-cycle minimum branch misprediction penalty
Cache: 64K 2-way instruction cache; 64K 4-way data cache (pipelined)
Experimental Methodology II
Timing model: the CACTI cache compiler, which models on-chip memory, modified for 0.35 um, 0.18 um, and 0.10 um processes
Test set: 6 SPEC95 benchmarks and 2 C++ programs
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Comparing FTB to BTB
The FTB provides slightly better performance
Tested at various cache sizes: 64, 256, 1K, 4K, and 8K entries
[Chart: FTB vs. BTB performance across cache sizes; higher is better]
Comparing Multi-Level FTB to Single-Level FTB
Two-level FTB performance:
Smaller fetch size: two-level average 6.6 vs. single-level average 7.5
Higher accuracy on average: two-level 83.3% vs. single-level 73.1%
Higher performance: 25% average speedup over the single-level design
Fall-Through Bits Used
Number of fall-through bits: 4-5, because fetch distances beyond 16 instructions do not improve performance
[Chart: performance vs. number of fall-through bits; higher is better]
FTQ Occupancy
Roughly indicates throughput
On average, the FTQ is empty 21.1% and full 10.7% of the time
[Chart: FTQ occupancy distribution]
Scalability
The two-level FTB scales well with feature size
[Chart: performance across feature sizes; a higher slope is better]
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Analysis
25% improvement in IPC over the best-performing single-level designs
The system scales well with feature size
On average, the FTQ is empty only 21.1% of the time
The FTB design requires at most 5 bits for the fall-through address
Conclusion
The FTQ and FTB design decouples the I-cache from branch prediction, producing higher throughput
Its multi-level buffer produces better scalability
References
[1] Glenn Reinman, Todd Austin, and Brad Calder. A Scalable Front-End Architecture for Fast Instruction Delivery. ACM/IEEE 26th Annual International Symposium on Computer Architecture, May 1999.
[2] Chris Perleberg and Alan Smith. Branch Target Buffer Design and Optimization. Technical Report, December 1989.
Thank you
Questions?