Alpha 21264 Microarchitecture Onur/Aditya 11/6/2001


  • Alpha 21264 Microarchitecture

    Onur/Aditya, 11/6/2001

  • Key Features of 21264

    • Introduced in Feb 98 at 500 MHz
    • 15M transistors, 2.2 V, 0.35-micron, 6-metal-layer CMOS process
    • Implements 64-bit Alpha ISA
    • Out-of-order execution (unlike 21164)
    • 4-wide fetch (like 21164)
    • Max 6 inst/cycle execution bandwidth
    • 7-stage pipeline
    • Hybrid two-level branch prediction (tournament predictor)
    • Clustered integer pipeline
    • 80 in-flight instructions

  • Overview of the Presentation

    • Overview of 21264 pipeline
    • Fetch and Branch Prediction mechanism
    • Register Renaming in 21264
    • Clustering
    • Memory System

  • 21264 Pipelines

    Source: Microprocessor Report, 10/28/96

  • Pipeline Structure

    Source: IEEE Micro, March-April 1999

  • Instruction Fetch Mechanism

    • Two features:
      – Line and way prediction
      – Branch prediction
    • The line-way predictor predicts the line and way of the I-cache that will be accessed in the next cycle.
    • Line-way prediction takes the branch predictor out of the critical fetch loop.
    • On cache fills, the line predictor value at each line points to the next sequential fetch line.
    • The line predictor is later trained by the branch predictor.
    • In effect, the line-way predictor is similar to a very fast BTB.
    • The line predictor's prediction is verified in stage 1 (instruction slot). If the line-way prediction is incorrect, the slot stage is flushed and the PC generated from the branch predictor information is used to redirect fetch.
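The fill, predict, and verify steps of the line predictor can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the 21264 logic: it tracks only the line (the way is ignored), and the class and method names are hypothetical.

```python
class LinePredictor:
    """Toy next-line predictor: one next-line pointer per I-cache line.

    Hypothetical model for illustration; way prediction is omitted.
    """

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.next_line = [None] * num_lines

    def fill(self, line):
        # On a cache fill the pointer defaults to the next sequential line.
        self.next_line[line] = (line + 1) % self.num_lines

    def predict(self, line):
        # Used directly to index the I-cache in the next cycle.
        return self.next_line[line]

    def verify(self, line, branch_predictor_line):
        """Slot-stage check against the branch predictor's next line.

        Returns True if the fetch was correct. Otherwise retrains the
        pointer and returns False, meaning fetch must be redirected.
        """
        if self.next_line[line] == branch_predictor_line:
            return True
        self.next_line[line] = branch_predictor_line
        return False
```

After one wrong guess the pointer holds the branch predictor's answer, which is the sense in which the line predictor behaves like a very fast BTB.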

  • Fetch - Line and Way Prediction

    Source: IEEE Micro, March-April 1999

  • Branch Prediction Mechanism

    • Hybrid Branch Predictor
    • Global Predictor
      – Good for inter-correlated branches.
      – Indexed by global path history register (T/NT status of last 12 branches)
      – 4K-entry table of 2-bit counters
    • Local Predictor
      – Good for self-correlated branches.
      – 10 bits of PC index a per-address local history table, which in turn indexes a 1K-entry table of 3-bit counters.
      – Aliasing among branches is a problem.
    • Choice Predictor
      – Decides which predictor to use.
      – Indexed by global path history register
      – 4K-entry table of 2-bit counters
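A minimal Python sketch of the tournament scheme described above, using the table sizes from the slide. Several details are assumptions not stated in the slides: the initial counter values, which PC bits form the local index, the taken thresholds, and the rule that the choice counter is trained only when the two components disagree.

```python
def _sat(value, up, maximum):
    """Saturating counter step in the range 0..maximum."""
    return min(value + 1, maximum) if up else max(value - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.ghist = 0                    # 12-bit global T/NT history
        self.global_ctr = [1] * 4096      # 2-bit counters, taken if >= 2
        self.choice_ctr = [1] * 4096      # 2-bit counters, >= 2 picks global
        self.local_hist = [0] * 1024      # 10-bit history per PC index
        self.local_ctr = [3] * 1024       # 3-bit counters, taken if >= 4

    def _lookup(self, pc):
        lh_idx = (pc >> 2) & 0x3FF        # assumed: PC bits above the 4-byte offset
        lhist = self.local_hist[lh_idx]
        local_taken = self.local_ctr[lhist] >= 4
        global_taken = self.global_ctr[self.ghist] >= 2
        use_global = self.choice_ctr[self.ghist] >= 2
        return lh_idx, lhist, local_taken, global_taken, use_global

    def predict(self, pc):
        _, _, local_taken, global_taken, use_global = self._lookup(pc)
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lh_idx, lhist, local_taken, global_taken, _ = self._lookup(pc)
        g = self.ghist
        if local_taken != global_taken:
            # Move the choice counter toward whichever component was right.
            self.choice_ctr[g] = _sat(self.choice_ctr[g], global_taken == taken, 3)
        self.global_ctr[g] = _sat(self.global_ctr[g], taken, 3)
        self.local_ctr[lhist] = _sat(self.local_ctr[lhist], taken, 7)
        self.local_hist[lh_idx] = ((lhist << 1) | int(taken)) & 0x3FF
        self.ghist = ((g << 1) | int(taken)) & 0xFFF
```

With these sizes the tables hold 4096×2 + 4096×2 + 1024×10 + 1024×3 = 29,696 bits. A single branch that alternates taken/not-taken is learned by the local component once its 10-bit history settles into the two alternating patterns.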

  • Branch Prediction Mechanism

    • Minimum branch penalty: 7 cycles
    • Typical branch penalty: 11+ cycles (IQ delay)
    • 48K bits of target addresses stored in I-cache
    • 32-entry return address stack
    • Predictor tables are reset on a context switch

    Source: Microprocessor Report, 10/28/96

  • Instruction Slotting

    • Check the line predictor's prediction
    • The branch predictor compares the next cache index it generates with the one generated by the line predictor
    • Determine the subclusters integer instructions will go to
    • Some subclusters are specialized (resource constraints)
    • Perform load balancing on subclusters

  • Register Renaming

    • 31 Int, 31 FP architectural registers
    • 41 Int, 41 FP extra physical registers
    • Uses a merged rename and architectural register file, one for Int, one for FP
    • Same physical register holds the result of an instruction before and after commit
    • No separate architectural register file (no data copying on commit)
    • Register map table stores current mappings of architectural registers.
    • A map silo contains old mappings of up to 20 previous decode cycles (used in case of misprediction)

  • Register Renaming Logic

    • On decoding an instruction:
      – Search map CAMs for the source registers
      – Find the physical registers currently containing the values of the architectural source registers
      – Access free physical register list
      – Map the found free physical register to the architectural destination register

    Source: Presentation by R. Kessler, August 1998.

  • Register Renaming Logic

    • On completing an instruction:
      – Write result into the physical destination register
      – Mark the physical destination register as valid in the register scoreboard
      – Broadcast results to issue queue entries
      – Physical destination register number is broadcast as tag
    • On committing an instruction:
      – Mark the physical destination register as committed
      – Free the physical register that corresponds to the old mapping of the same architectural register
    • On a misprediction/exception:
      – Roll back the map state to what it was when the mispredicted or exception-causing instruction was renamed
      – To be able to do this, instructions must be associated with map entries
      – This is done using inums. Each instruction is given a unique 8-bit identifier during register mapping
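The decode, commit, and rollback steps can be sketched with a merged register file model in Python. This is an illustration under assumptions: it substitutes a per-instruction undo log for the map silo (which stores mappings per decode cycle), uses a plain list where the hardware uses CAMs, and the `RenameMap` name and its methods are hypothetical. Running out of free registers is not handled.

```python
from collections import deque


class RenameMap:
    def __init__(self, num_arch=31, num_extra=41):
        self.map = list(range(num_arch))     # arch reg -> physical reg
        self.free = deque(range(num_arch, num_arch + num_extra))
        self.log = deque()                   # (inum, arch_dst, new_phys, old_phys)
        self.next_inum = 0

    def rename(self, srcs, dst):
        # Read source mappings first, so "r3 = r3 + 1" sees the old r3.
        phys_srcs = [self.map[s] for s in srcs]
        new = self.free.popleft()            # raises IndexError if none free
        old = self.map[dst]
        self.map[dst] = new
        inum = self.next_inum
        self.next_inum = (inum + 1) % 256    # 8-bit instruction number
        self.log.append((inum, dst, new, old))
        return inum, phys_srcs, new

    def commit(self):
        # The oldest in-flight instruction retires: its result register
        # becomes architectural and the previous mapping is freed.
        _, _, _, old = self.log.popleft()
        self.free.append(old)

    def rollback(self, inum):
        # Undo instruction `inum` and everything younger, youngest first.
        while self.log:
            i, dst, new, old = self.log.pop()
            self.map[dst] = old
            self.free.appendleft(new)
            if i == inum:
                break
```

No data moves on commit: only the old physical register returns to the free list, which is the point of a merged rename and architectural register file.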

  • Physical Register States

    • 4 states
    • Initially n architectural registers are in AR state.
    • Rest are Available
    • When an instruction with a destination register is issued, one of the available registers is allocated as rename buffer (RB)
    • When instruction finishes execution, state is set to valid
    • On instruction commit, state is set to AR and old AR mapping is reclaimed

    Source: Sima, D. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
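The four states and their transitions can be written down as a small table-driven state machine. The sketch below covers only the transitions listed on the slide; the event names are hypothetical, and the transitions taken when a speculative instruction is squashed are left out.

```python
from enum import Enum


class RegState(Enum):
    AVAILABLE = "available"
    RENAME_BUFFER = "rename buffer, result pending"
    VALID = "rename buffer, result valid"
    ARCHITECTURAL = "architectural (AR)"


# (current state, event) -> next state
TRANSITIONS = {
    (RegState.AVAILABLE, "allocate"): RegState.RENAME_BUFFER,
    (RegState.RENAME_BUFFER, "finish"): RegState.VALID,
    (RegState.VALID, "commit"): RegState.ARCHITECTURAL,
    # A later commit to the same architectural register reclaims this one.
    (RegState.ARCHITECTURAL, "reclaim"): RegState.AVAILABLE,
}


def step(state, event):
    """Return the next state, or raise ValueError for an illegal event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None
```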

  • Integer Issue Queues - Clustering

    • 20 entries, maximum 4 issued per cycle
    • Two arbiters pick the instructions that will issue (one for upper subclusters, one for lower subclusters)
    • Each queue entry asserts a request to the arbiter when it contains an instruction that can be executed by the subcluster (if operand values are available within that subcluster)
    • 4 request signals (U0, U1, L0, L1)
    • Arbiters choose between simultaneous requesters of a subcluster based on the age of the request
    • Older instructions are given priority
    • Each arbiter picks 2 of the possible 20 requesters for service
    • A given instruction can request only upper or lower subclusters (load balancing based on the assignment done by Stage 1)
    • Subcluster assignment is static (Stage 1)
    • Cluster selection on issue is dynamic (Stage 2)
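The age-based arbitration can be approximated in Python as follows. This is a simplified sketch, not the hardware's selection logic: the queue is a list ordered oldest first, each entry is a dictionary with hypothetical keys `name`, `half`, and `requests`, and the grant is greedy, so an old instruction that could use either cluster always takes cluster 0 when it is free.

```python
def pick(queue, half):
    """One arbiter ('U' or 'L'): grant each subcluster to its oldest requester.

    `queue` is ordered oldest first. Each entry has:
      'name'     - label for the instruction
      'half'     - 'U' or 'L', assigned statically at slot time
      'requests' - subclusters where its operands are available, e.g. {'U0', 'U1'}
    """
    grants = {}
    for entry in queue:
        if entry["half"] != half:
            continue
        for cluster in (0, 1):
            sub = f"{half}{cluster}"
            if sub in entry["requests"] and sub not in grants:
                grants[sub] = entry["name"]
                break                      # an instruction issues at most once
    return grants


def issue_cycle(queue):
    """Both arbiters run in the same cycle: up to 2 + 2 = 4 grants."""
    return {**pick(queue, "U"), **pick(queue, "L")}
```

For example, with three upper instructions `a`, `b`, `c` (oldest first) and one lower instruction `d`, at most two of the upper ones issue in a cycle:

```python
queue = [
    {"name": "a", "half": "U", "requests": {"U0", "U1"}},
    {"name": "b", "half": "U", "requests": {"U1"}},
    {"name": "c", "half": "U", "requests": {"U0", "U1"}},
    {"name": "d", "half": "L", "requests": {"L1"}},
]
grants = issue_cycle(queue)   # a -> U0, b -> U1, d -> L1; c waits
```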

  • Integer/FP Execution Pipes

    • Integer cluster communication latency: 1 cycle
    • Advantage of clustering:
      – Fewer read/write ports to the register file
      – Register file will not be a cycle time limiter
    • FP issue queue:
      – 15 entries
      – 2 inst/cycle

    Source: IEEE Micro, March-April 1999

  • Memory References

    • Load Queue
      – Reorder buffer for loads
      – 32 entries, in-order
      – Maintains state of loads issued but not yet retired
    • Store Queue
      – Reorder buffer for stores
      – 32 entries, in-order
      – Maintains state of stores issued but not yet written to the data cache
      – Holds data associated with store instructions
      – Forwards data to younger matching loads
    • Miss Address File
      – Holds physical addresses associated with pending L1 cache misses (instruction or data)
      – Maximum 8 misses to off-chip memory system

  • Load/Store Ordering

    • New memory references check their address and age against other in-flight references.
    • For example, when a store issues:
      – LDQ compares the store address to the addresses of younger loads (CAM search)
      – If the store matches the address of a younger load that has already issued, LDQ squashes the load and initiates recovery
    • When a load is ready to issue:
      – STQ compares the load address to the addresses of older stores
      – If a match is found:
        • If store data is available, STQ forwards the data
        • Else load issue is delayed until store data becomes available
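The store queue check made when a load issues can be sketched as a small lookup function. It assumes the load is compared against stores that precede it in program order, that all of those store addresses are already known, and that addresses match exactly (no partial overlap). The tuple layout and function name are hypothetical.

```python
def stq_lookup(store_queue, load_age, load_addr):
    """Decide where a load gets its data.

    store_queue: list of (age, addr, data); data is None while the store's
                 data is not yet available. A smaller age means older.
    Returns ('forward', data), ('wait', None), or ('cache', None).
    """
    older_matches = [s for s in store_queue
                     if s[0] < load_age and s[1] == load_addr]
    if not older_matches:
        return ("cache", None)             # no conflict: read the data cache
    # The most recent older store holds the value the load must see.
    _, _, data = max(older_matches, key=lambda s: s[0])
    if data is None:
        return ("wait", None)              # delay until the store data arrives
    return ("forward", data)
```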

  • Load/Store Ordering

    • When a load is ready to issue:
      – If an older store exists in STQ with an unknown address:
        • Predict that the ready load will not access the same memory location, unless this load was incorrectly ordered before (check the load wait table)
      – Exposes more ILP if prediction is correct
      – In case of misprediction:
        • Minimum 14 cycle penalty
        • Initiate recovery: load and all subsequent instructions are squashed and re-executed
        • Mark the load in the load wait table so that it will wait for all older stores to compute addresses next time around
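The load wait table amounts to a one-bit-per-entry predictor indexed by the load's PC. The sketch below is illustrative: the table size, the PC hashing, and the `clear` method (for resetting stale bits) are assumptions, not details taken from the slides.

```python
class LoadWaitTable:
    """One wait bit per entry, indexed by load PC (size is an assumption)."""

    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def _index(self, pc):
        return (pc >> 2) % len(self.bits)   # assumed: drop the 4-byte offset bits

    def must_wait(self, pc):
        return self.bits[self._index(pc)]

    def record_violation(self, pc):
        # Called during recovery after the load was wrongly issued ahead
        # of a store to the same address.
        self.bits[self._index(pc)] = True

    def clear(self):
        # Stale bits cost ILP, so a real design would reset them eventually.
        self.bits = [False] * len(self.bits)

    def can_issue(self, load_pc, older_store_addrs_known):
        # Speculate past unknown store addresses unless this load has
        # been caught before.
        return older_store_addrs_known or not self.must_wait(load_pc)
```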

  • Load/Store Ordering Example

    Source: IEEE Micro, March-April 1999

  • Features of Memory System

    • Data cache
      – 64 KB, 2-way, virtually indexed, physically tagged (translation in parallel with access)
      – Write-back, read/write allocate
      – 64-byte block size + ECC bits
      – Prevents synonyms by not allowing different virtual addresses corresponding to the same physical address to co-exist in the cache
      – Load hit/miss prediction to minimize load-use latency (data cache access is 3 cycles after the issue queue + 1 cycle to get the hit/miss signal to the issue queue)
    • Victim Buffer (victim address and data files)
      – Contains evicted L1 (data and inst) and L2 cache lines
      – 8 entries, serial access
    • Off-chip L2 cache
      – Minimum data cache miss latency 13 cycles
      – Up to 16 MB
      – Dedicated access to L2 cache
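A short calculation shows why a virtually indexed data cache of this size has to deal with synonyms. The cache parameters come from the slide; the 8 KB page size is an assumption brought in from the Alpha architecture and is not stated in the slides. The function name is hypothetical.

```python
def cache_geometry(size_bytes, ways, block_bytes, page_bytes):
    """Derive index widths for a set-associative cache (sizes are powers of 2)."""
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    page_bits = page_bytes.bit_length() - 1
    # Index bits above the page offset come from the virtual page number,
    # so they can differ between two virtual aliases of one physical line.
    virtual_index_bits = max(0, offset_bits + index_bits - page_bits)
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "virtual_index_bits": virtual_index_bits,
        "possible_alias_sets": 2 ** virtual_index_bits,
    }


geometry = cache_geometry(64 * 1024, 2, 64, 8 * 1024)
```

Under these assumptions the cache has 512 sets, the block offset and index together span 15 address bits, and 2 of the index bits lie above the 13-bit page offset. One physical line could therefore sit in any of 4 sets depending on the virtual address used, which is the case the synonym-prevention rule has to rule out.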

  • Overall System Diagram

    Source: Microprocessor Report, 10/28/96

  • The Processor Itself

  • References

    • R.E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro. March/April 1999.
    • D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-order Execution Microprocessor. COMPCON97, 1997.
    • Compaq Computer Corporation. Alpha 21264/EV6 Hardware Reference Manual.
    • R. Kessler, E. McLellan, and D. Webb. The Alpha 21264 Microprocessor Architecture. International Conference on Computer Design. October 1998.
    • B.A. Gieseke et al. A 600 MHz Superscalar RISC Microprocessor with Out-of-order Execution. International Solid State Circuits Conference. 1997.
    • L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report. October 28, 1996.
    • Dezso Sima. The Design Space of Register Renaming Techniques. IEEE Micro. September/October 2000.
    • P.E. Gronowski et al. High Performance Microprocessor Design. IEEE Journal of Solid State Circuits. May 1998.