A Scalable Front-End Architecture for Fast Instruction Delivery


A Scalable Front-End Architecture for Fast Instruction Delivery

Paper by: Glenn Reinman, Todd Austin and Brad Calder

Presenter: Alexander Choong

Conventional Pipeline Architecture

High-performance processors can be broken down into two parts:
Front-end: fetches and decodes instructions
Execution core: executes instructions

Front-End and Pipeline

Simple Front-End

[Diagram: a simple front-end pipeline of alternating Fetch and Decode stages]

Front-End with Prediction

[Diagram: the same pipeline with a Predict stage alongside each Fetch]

Front-End Issues I

Flynn’s bottleneck: IPC is bounded by the number of instructions fetched per cycle. This implies that as execution-core performance increases, the front-end must keep up to sustain overall performance.

Front-End Issues II

Two opposing forces:
Designing a faster front-end argues for increasing the I-cache size
The interconnect scaling problem (wire performance does not scale with feature size) argues for decreasing the I-cache size

Key Contributions I

Key Contribution: Fetch Target Queue
Objective: avoid using a large cache with branch prediction
Purpose: decouple the I-cache from branch prediction
Result: improves throughput

Key Contribution: Fetch Target Buffer
Objective: avoid using a large cache with branch prediction
Implementation: a multi-level buffer
Results: delivers performance 25% better than a single-level design, and scales better with future feature sizes

Outline

Scalable Front-End and Components
Fetch Target Queue
Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Fetch Target Queue

Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache uses them.

[Diagram: in the simple front-end each Fetch is paired with a Predict; with the FTQ, Predict stages run ahead and queue targets for later Fetch stages]

Fetch Target Queue

Fetch and predict can have different latencies, which allows the I-cache to be pipelined, as long as the two stages have the same throughput.
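To see why matching throughput is enough, here is a minimal cycle-level sketch (hypothetical names, queue depth, and parameters, not the paper's simulator):

```python
from collections import deque

FTQ_DEPTH = 8  # assumed queue depth

def run_front_end(predicted_targets, fetch_latency, cycles):
    """The FTQ is a bounded FIFO between predictor and I-cache: the
    predictor enqueues one fetch target per cycle, and the I-cache
    dequeues whenever its previous access has completed."""
    ftq = deque()
    fetched = []
    cache_busy_until = 0
    targets = iter(predicted_targets)
    for cycle in range(cycles):
        # Predict stage: enqueue a target unless the FTQ is full.
        if len(ftq) < FTQ_DEPTH:
            t = next(targets, None)
            if t is not None:
                ftq.append(t)
        # Fetch stage: start an I-cache access when the cache is free.
        if cycle >= cache_busy_until and ftq:
            fetched.append(ftq.popleft())
            cache_busy_until = cycle + fetch_latency
    return fetched
```

The predictor runs ahead and the queue absorbs the latency difference, so a multi-cycle (or pipelined) I-cache stays fed.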

Fetch Blocks

The FTQ stores fetch blocks: sequences of instructions starting at a branch target and ending at a strongly biased branch. Instructions are fed directly into the pipeline.
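A fetch block might be sketched as a small record (hypothetical layout; the paper defines the concept, not these fields):

```python
from dataclasses import dataclass

@dataclass
class FetchBlock:
    """A run of instructions starting at a branch target and ending
    at a strongly biased branch, as described above."""
    start_pc: int  # branch target that begins the block
    length: int    # number of sequential instructions in the block

    def instruction_pcs(self):
        # Per-instruction addresses, assuming fixed 4-byte instructions.
        return [self.start_pc + 4 * i for i in range(self.length)]
```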

Outline

Scalable Front-End and Components
Fetch Target Queue
Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Fetch Target Buffer: Outline
Review: Branch Target Buffer
Fetch Target Buffer
Fetch Blocks
Functionality

Review: Branch Target Buffer I

Previous work (Perleberg and Smith [2]) makes fetch independent of prediction.

[Diagram: front-end pipeline stages without and with a Branch Target Buffer]

Review: Branch Target Buffer II

Characteristics:
Hash table
Makes predictions
Caches prediction information

Review: Branch Target Buffer III

The PC is used as the index/tag into the table:

Index/Tag | Branch Prediction | Predicted Branch Target | Fall-Through Address | Instructions at Branch
0x1718    | Taken             | 0x1834                  | 0x1788               | add, sub
0x1734    | Taken             | 0x2088                  | 0x1764               | neq, br
0x1154    | Not taken         | 0x1364                  | 0x1200               | ld, store
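A toy model of the table above, with a Python dict standing in for the hardware hash table (values taken from the rows shown):

```python
# Toy BTB keyed by branch PC; each value caches prediction information.
btb = {
    0x1718: {"prediction": "taken",     "target": 0x1834, "fall_through": 0x1788},
    0x1734: {"prediction": "taken",     "target": 0x2088, "fall_through": 0x1764},
    0x1154: {"prediction": "not taken", "target": 0x1364, "fall_through": 0x1200},
}

def next_fetch_pc(pc):
    """Predicted next fetch address for a branch at `pc`."""
    entry = btb.get(pc)
    if entry is None:
        return pc + 4  # no entry: assume sequential execution
    return entry["target"] if entry["prediction"] == "taken" else entry["fall_through"]
```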

FTB Optimizations over BTB

Multi-level: solves the conundrum of needing a small (fast) cache while needing enough space to successfully predict branches.

FTB Optimizations over BTB

Oversize bit: indicates whether a block is larger than a cache line. With a multi-ported cache, several smaller blocks can be loaded at the same time.

FTB Optimizations over BTB

Stores only a partial fall-through address: since the fall-through address is close to the current PC, only an offset needs to be stored.
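A quick sketch of the saving (the 5-bit width and the units are illustrative assumptions):

```python
OFFSET_BITS = 5  # the results later suggest 4-5 bits suffice

def encode_fall_through(start_pc, fall_through):
    # Store only the distance to the fall-through address.
    offset = fall_through - start_pc
    assert 0 <= offset < (1 << OFFSET_BITS), "block too large for offset field"
    return offset

def decode_fall_through(start_pc, offset):
    # Reconstruct the full address from the PC already in hand.
    return start_pc + offset

# e.g. a block at 0x1780 falling through to 0x1788 stores just 8,
# not the full 32-bit address 0x1788.
```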

FTB Optimizations over BTB

Doesn't store every block: fall-through blocks and blocks that are seldom taken are omitted.

Fetch Target Buffer

Each FTB entry holds:
Next PC
Target: target of the branch
Type: conditional, subroutine call/return
Oversize: set if the block size > cache line
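The fields above, as a record (a sketch with hypothetical Python types; the paper's entries are bit-fields of specific widths):

```python
from dataclasses import dataclass

@dataclass
class FTBEntry:
    tag: int            # partial PC tag matched on lookup
    target: int         # predicted target of the terminating branch
    fall_through: int   # small offset from the entry's PC (see above)
    branch_type: str    # "conditional", "call", or "return"
    oversize: bool      # True if the fetch block exceeds one cache line
```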

Fetch Target Buffer

The PC is used as an index into the FTB. The possible outcomes:

L1 hit, branch predicted not taken: fetch continues at the fall-through address
L1 hit, branch predicted taken: fetch continues at the predicted target
L1 miss, L2 hit: fetch falls through while the L2 is probed; after an N-cycle delay the L2 supplies the entry
L1 and L2 miss: fetch keeps falling through, which is eventually repaired as a misprediction if wrong
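A sketch of these cases (using the hypothetical FTBEntry record above; the N-cycle L2 delay is returned as an extra-cycle count rather than modeled structurally):

```python
DEFAULT_FETCH_DISTANCE = 16  # hypothetical guess when no entry is found

def ftb_lookup(pc, l1, l2, predict_taken, l2_delay):
    """Return (next_fetch_pc, extra_cycles) for the cases above.

    l1 and l2 map a PC to an FTBEntry (see the earlier sketch);
    predict_taken(pc) is the direction predictor's verdict.
    """
    entry = l1.get(pc)
    if entry is not None:                    # L1 hit
        if predict_taken(pc):
            return entry.target, 0           # taken: fetch from the target
        return pc + entry.fall_through, 0    # not taken: fall through
    entry = l2.get(pc)
    if entry is not None:                    # L1 miss, L2 hit
        l1[pc] = entry                       # fill the L1 for next time
        return pc + entry.fall_through, l2_delay
    # L1 and L2 miss: guess fall-through; a wrong guess is eventually
    # caught and repaired as a branch misprediction.
    return pc + DEFAULT_FETCH_DISTANCE, 0
```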

Branch Prediction

Hybrid branch prediction: a meta-predictor selects between a local history predictor, a global history predictor, and a bimodal predictor.

[Diagram: meta-predictor selecting among the bimodal, local-history, and global-history components]
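A minimal chooser in this spirit (a 2-bit counter table selecting between two of the components; this mechanism is an assumption, not the paper's exact design):

```python
meta = {}  # pc -> 2-bit counter in [0, 3]; >= 2 means "trust global"

def choose(pc, local_pred, global_pred):
    """Meta-predictor: pick which component's prediction to use."""
    return global_pred if meta.get(pc, 1) >= 2 else local_pred

def update_meta(pc, local_correct, global_correct):
    # Nudge the counter toward whichever component was right.
    c = meta.get(pc, 1)
    if global_correct and not local_correct:
        meta[pc] = min(3, c + 1)
    elif local_correct and not global_correct:
        meta[pc] = max(0, c - 1)
```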

Committing Results

When full, the SHQ commits its oldest value to the local or global history.
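Assuming SHQ stands for a speculative history queue (my reading; the slide does not expand the abbreviation), its commit behavior might look like:

```python
from collections import deque

class SHQ:
    """Speculative history queue sketch: predicted outcomes wait here;
    when the queue is full, the oldest is committed to the local or
    global history."""
    def __init__(self, depth):
        self.depth = depth
        self.pending = deque()

    def push(self, pc, outcome, commit_fn):
        if len(self.pending) == self.depth:
            commit_fn(*self.pending.popleft())  # commit oldest value
        self.pending.append((pc, outcome))
```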

Outline

Scalable Front-End and Components
Fetch Target Queue
Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Experimental Methodology I

Baseline Architecture

Processor:
8-instruction fetch with 16-instruction issue per cycle
128-entry reorder buffer with 32-entry load/store buffer
8-cycle minimum branch misprediction penalty

Cache:
64K 2-way instruction cache
64K 4-way data cache (pipelined)

Experimental Methodology II

Timing model: the CACTI cache compiler, which models on-chip memory, modified for 0.35 um, 0.18 um, and 0.10 um processes.

Test set: 6 SPEC95 benchmarks and 2 C++ programs.

Outline

Scalable Front-End and Components
Fetch Target Queue
Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Comparing FTB to BTB

The FTB provides slightly better performance than the BTB, tested across buffer sizes of 64, 256, 1K, 4K, and 8K entries.

[Chart: FTB vs. BTB performance across buffer sizes; higher is better]

Comparing Multi-Level FTB to Single-Level FTB

Two-level FTB performance:
Smaller average fetch block size: 6.6 instructions (two-level) vs. 7.5 (single-level)
Higher accuracy on average: 83.3% (two-level) vs. 73.1% (single-level)
Higher performance: 25% average speedup over the single-level design

Fall-through Bits Used

Number of fall-through bits: 4-5Because fetch

distances 16 instructions do not improve performance

Better

FTQ Occupancy

FTQ occupancy roughly indicates throughput. On average, the FTQ is empty 21.1% of the time and full 10.7% of the time.

[Chart: FTQ occupancy distribution]

Scalability

The two-level FTB scales well with feature size (a higher slope is better).

[Chart: performance scaling across feature sizes]

Outline

Scalable Front-End and Components
Fetch Target Queue
Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion

Analysis

25% improvement in IPC over the best-performing single-level designs.

The system scales well with feature size.

On average, the FTQ is empty only 21.1% of the time.

The FTB design requires at most 5 bits for the fall-through address.

Conclusion

The FTQ and FTB design:

Decouples the I-cache from branch prediction, producing higher throughput.

Uses a multi-level buffer, producing better scalability.

References

[1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. 26th Annual International Symposium on Computer Architecture (ISCA). May 1999.

[2] Branch Target Buffer: Design and Optimization. Chris Perleberg and Alan Smith. Technical Report. December 1989.

Thank you

Questions?