ARM Pipelining

  • 7/28/2019 ARM Pipelining

    1/31

    1

    Lecture 5 - ARM Organization and Implementation

    ICE 1222/2342

    Fall, 2008

    Daeyoung Kim

    [email protected]

    http://resl.icu.ac.kr/~kimd

  • 7/28/2019 ARM Pipelining

    2/31

    2

    Contents

    3-stage pipeline ARM organization & implementation

    5-stage pipeline ARM organization & implementation

  • 7/28/2019 ARM Pipelining

    3/31

    3

    3-stage pipeline ARM Organization

    ARM processors up to ARM7

    [Figure: 3-stage pipeline ARM datapath. Blocks: address register, incrementer, register bank (including the PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses and signals: A[31:0], D[31:0], A bus, B bus, ALU bus, PC, control.]

  • 7/28/2019 ARM Pipelining

    4/31

    4

    3-stage pipeline

    Fetch
        Instruction is fetched and placed in the instruction pipeline

    Decode
        The instruction is decoded and the datapath control signals prepared for the next cycle
        The instruction owns the decode logic but not the datapath

    Execute
        The instruction owns the datapath
        Register bank is read, an operand is shifted, ALU result generated and written back into a destination register

  • 7/28/2019 ARM Pipelining

    5/31

    5

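    ARM single-cycle instruction 3-stage pipeline operation

    [Figure: pipeline timing diagram, instruction number against time. Instructions 1, 2 and 3 each pass through fetch, decode and execute, and each instruction starts one cycle after the previous one.]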
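The overlap shown in the timing diagram can be sketched in a few lines of Python. This is only an illustrative model, assuming every instruction takes exactly one cycle per stage; the names `STAGES` and `pipeline_schedule` are invented for this sketch and do not come from the lecture.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(num_instructions):
    """Return one dict per clock cycle mapping stage name -> instruction number.

    Instructions are numbered from 1. A stage that holds no instruction in a
    given cycle maps to None. Assumes single-cycle instructions only.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            # The instruction in a deeper stage entered the pipeline earlier.
            instr = cycle - depth + 1
            row[stage] = instr if 1 <= instr <= num_instructions else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(3), start=1):
        print(cycle, row)
```

With three instructions the model needs five cycles in total, and from the third cycle onward one instruction completes per cycle, which is the point of the diagram.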

  • 7/28/2019 ARM Pipelining

    6/31

    6

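    ARM multi-cycle instruction 3-stage pipeline operation

    [Figure: pipeline timing diagram, instruction number against time, for five instructions: 1 ADD, 2 STR, 3 ADD, 4 ADD, 5 ADD. Each ADD passes through fetch, decode and execute. The STR passes through fetch, decode, calc. addr. and data xfer, so it occupies the datapath for two cycles.]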

  • 7/28/2019 ARM Pipelining

    7/31

    7

    To achieve higher performance

    $T_{prog} = N_{inst} \times CPI / f_{clk}$

    Increase the clock rate, $f_{clk}$
        The logic in each pipeline stage must be simplified and, therefore, the number of pipeline stages increased

    Reduce the average number of clock cycles per instruction, CPI
        Instructions which occupy more than one pipeline slot are re-implemented to occupy fewer slots
        Pipeline stalls caused by dependencies between instructions are reduced

    Memory bottleneck (Von Neumann bottleneck)
        Deliver more than 32 bits per access
        Separate instruction and data memory
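As a worked example of the execution-time formula, the sketch below evaluates it for two design points. All numbers (instruction count, CPI values, clock rates) are made up for illustration and are not measurements of any ARM core; `program_time` is a name chosen for this sketch.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds: Tprog = Ninst x CPI / fclk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical design points, for illustration only.
baseline = program_time(n_inst=1_000_000, cpi=1.9, f_clk_hz=20e6)
improved = program_time(n_inst=1_000_000, cpi=1.5, f_clk_hz=40e6)
speedup = baseline / improved

if __name__ == "__main__":
    print(f"baseline: {baseline * 1e3:.1f} ms")
    print(f"improved: {improved * 1e3:.1f} ms")
    print(f"speedup:  {speedup:.2f}x")
```

The two levers multiply: doubling the clock rate and lowering the CPI each contribute their own factor to the speedup.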

  • 7/28/2019 ARM Pipelining

    8/31

    8

    ARM9TDMI 5-stage pipeline organization

    Fetch
        Instruction is fetched and placed in the instruction pipeline

    Decode
        The instruction is decoded and register operands read

    Execute
        An operand is shifted and ALU result generated
        Load/Store: memory address is calculated in ALU

    Buffer/Data
        Data memory is accessed if required
        Otherwise ALU result is simply buffered

    Write-back
        Result is written back to register file

    [Figure: ARM9TDMI 5-stage pipeline organization. Stages: fetch, instruction decode, execute, buffer/data, write-back. Blocks: I-cache, I decode, register read, shift, mul, ALU, mux, D-cache, rot/sgn ex, byte repl., register write, +4 incrementers. Labelled paths: forwarding paths, immediate fields, next pc, reg shift, load/store address, pc + 4, pc + 8, r15, pre-index, post-index, LDM/STM, B, BL, MOV pc, SUBS pc, LDR pc.]

  • 7/28/2019 ARM Pipelining

    9/31

    9

    Data Forwarding

    A major source of complexity in the 5-stage pipeline
        Instruction execution is spread across the stages
        Forwarding paths resolve data dependencies without stalling the pipeline

    Even with forwarding we cannot avoid every stall
        LDR rN, [..]
        ADD r2, r1, rN
        One cycle stall required: rN is available only at the end of the buffer/data stage

    Use instruction-level scheduling
        Do not put a dependent instruction immediately after a load instruction
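The load-use rule can be illustrated with a small checker. This is a simplified sketch, not a model of the real ARM9TDMI interlock logic: it assumes a one-cycle penalty only when the instruction directly after an LDR reads the loaded register, and the tuple format `(opcode, destination, sources)` and the example registers are assumptions of this sketch.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded register immediately.

    `program` is a list of (opcode, dest_register, source_registers) tuples.
    Only the adjacent load-use case described on the slide is modelled.
    """
    stalls = 0
    for prev, nxt in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, next_sources = nxt
        if prev_op == "LDR" and prev_dest in next_sources:
            stalls += 1
    return stalls


# Hypothetical instruction sequence: the ADD needs r3 straight after the load.
naive = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),
    ("SUB", "r4", ("r5", "r6")),
]

# Scheduled version: the independent SUB is moved between the LDR and the ADD.
scheduled = [naive[0], naive[2], naive[1]]

if __name__ == "__main__":
    print(count_load_use_stalls(naive), count_load_use_stalls(scheduled))
```

Moving the independent SUB into the slot after the load is legal here because it neither reads nor writes any register used by the other two instructions.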


  • 7/28/2019 ARM Pipelining

    10/31

    10

    Data Processing Instructions

    [Figure: datapath activity for data processing instructions, in two panels. Blocks shown in both panels: address register, increment, registers, mult, data out, data in, i. pipe. Shifter and ALU are labelled "as ins." and "as instruction".]

    (a) register - register operations
        Registers: Rd, Rn, Rm, PC

    (b) register - immediate operations
        Registers: Rd, Rn, PC; immediate from instruction bits [7:0]

  • 7/28/2019 ARM Pipelining

    11/31

    11

    Data Transfer Instructions (STR)

    [Figure: datapath activity for STR, in two panels. Blocks shown: address register, increment, registers, shifter, ALU, mult, data out, data in, i. pipe.]

    (a) 1st cycle - compute address
        Registers: Rn, PC; shifter: lsl #0; ALU: = A / A + B / A - B; immediate offset from instruction bits [11:0]

    (b) 2nd cycle - store data & auto-index
        Registers: Rn, Rd, PC; ALU: = A + B / A - B; data out: byte?

    If store byte, replicates it four times. Lowest two bits are used for proper by

  • 7/28/2019 ARM Pipelining

    12/31

    12

    Branch Instructions

    [Figure: datapath activity for branch instructions, in two panels. Blocks shown: address register, increment, registers, shifter, ALU, mult, data out, data in, i. pipe.]

    (a) 1st cycle - compute branch target
        Registers: PC; shifter: lsl #2; ALU: = A + B; offset from instruction bits [23:0]

    (b) 2nd cycle - save return address
        Registers: R14, PC; ALU: = A

  • 7/28/2019 ARM Pipelining

    13/31

    13

    ARM Implementation - 1

    Clocking Scheme
        Most ARMs do not operate with edge-sensitive registers
        Based around 2-phase non-overlapping clocks generated internally from a single input clock signal
        Allows level-sensitive transparent latches
        Data movement is controlled by passing the data alternately through latches open during phase 1 and latches open during phase 2
        Non-overlapping property ensures no race condition

    [Figure: 2-phase non-overlapping clock waveforms, phase 1 and phase 2, over 1 clock cycle]

  • 7/28/2019 ARM Pipelining

    14/31

    14

    ARM Implementation - 2

    Datapath Timing (1)

    [Figure: datapath timing across phase 1 and phase 2. Intervals: register read time, shift time, ALU time, register write time. Events: ALU operands latched, read bus valid, shift out valid, ALU out, precharge invalidates buses.]

  • 7/28/2019 ARM Pipelining

    15/31

    15

    ARM Implementation - 3

    Datapath Timing (2)

    The minimum datapath cycle time is the sum of
        Register read time
        Shifter delay
        ALU delay
            Dominates cycle time
            Logical operations relatively faster than arithmetic operations. Why?
        Register write set-up time
        Phase 2 and phase 1 non-overlap time

  • 7/28/2019 ARM Pipelining

    16/31

    16

    ARM Implementation - 4

    Adder Design - 1
        http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.html

    32-bit addition time has a significant effect on the datapath cycle time
        Influences the maximum clock rate and processor performance

    The first ARM processor prototype
        Ripple-carry adder circuit
        Worst-case carry path is 32 gates long

    [Figure: ripple-carry adder cell with inputs A, B, Cin and outputs sum, Cout]
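A ripple-carry adder can be modelled bit by bit to show why the carry path grows with the word width. The sketch below is a behavioural model in Python, not the circuit of the ARM prototype; the function names are chosen for illustration.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout


def ripple_carry_add(a, b, cin=0, width=32):
    """Add two `width`-bit values by chaining full adders.

    The carry out of bit i feeds bit i + 1, so in the worst case the carry
    has to travel through every bit position before the result is valid.
    Returns (sum, carry_out).
    """
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    # Worst case: a carry generated at bit 0 ripples through all 32 bits.
    print(ripple_carry_add(0xFFFFFFFF, 1))
```

The loop makes the serial dependency explicit: each iteration needs the carry from the previous one, which is exactly what the look-ahead and carry-select schemes try to shorten.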
  • 7/28/2019 ARM Pipelining

    17/31

    17

    ARM Implementation - 5

    Adder Design - 2

    ARM2 4-bit look-ahead scheme
        To reduce the worst-case carry path length

    [Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], with signals labelled P and G]

  • 7/28/2019 ARM Pipelining

    18/31

    18

    Carry-Look-Ahead (CLA) Adder -1

    Calculating the carry signals in advance

    A carry signal will be generated
        when both bits $A_i$ and $B_i$ are 1, or
        when one of the two bits is 1 and the carry-in (carry of the previous stage) is 1

    $C_{out} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i$    (1)
    $C_{i+1} = G_i + P_i \cdot C_i$    (2)
    $G_i = A_i \cdot B_i$    (3) Generate
    $P_i = A_i \oplus B_i$    (4) Propagate

    Propagate and Generate terms only depend on the input bits
        They will be valid after one gate delay

    If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value. Let's apply this to a 4-bit adder

  • 7/28/2019 ARM Pipelining

    19/31

    19

    Carry-Look-Ahead (CLA) Adder -2

    Let's apply this to a 4-bit adder

    $C_1 = G_0 + P_0 \cdot C_0$    (5)
    $C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0$    (6)
    $C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0$    (7)
    $C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0$    (8)

    The carry-out bit, $C_{i+1}$, of the last stage will be available after three delays (one delay to calculate the Propagate signal and two delays as a result of the AND and OR gate)

    Sum signal can be calculated as follows

    $S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$    (9)
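Equations (3) to (9) translate almost directly into code. The following Python sketch evaluates every carry straight from the generate and propagate terms and the carry-in, with no carry passed from one bit to the next; it is a behavioural check of the equations, and `cla_add4` is a name invented here.

```python
def cla_add4(a, b, c0=0):
    """4-bit carry-look-ahead addition following equations (3)-(9).

    Returns (sum, carry_out) where sum is 4 bits wide and carry_out is C4.
    """
    # Equations (3) and (4): generate and propagate per bit.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]

    # Equations (5)-(8): each carry depends only on g, p and c0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    # Equation (9): sum bit i is the propagate term XOR the carry into bit i.
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i
    return total, c4


if __name__ == "__main__":
    print(cla_add4(0b1011, 0b0110))
```

Checking the function against ordinary integer addition for all 512 input combinations is a quick way to confirm that the expanded carry terms were transcribed correctly.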

  • 7/28/2019 ARM Pipelining

    20/31

    20

    Carry-Look-Ahead (CLA) Adder -3

    [Figure: 4-bit carry-look-ahead adder]

  • 7/28/2019 ARM Pipelining

    21/31

    21

    Carry-Look-Ahead (CLA) Adder -4

    16-bit adder (Group)

    $PG = P_3 \cdot P_2 \cdot P_1 \cdot P_0$    (10)
    $GG = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$    (11)
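The group terms in equations (10) and (11) let a 16-bit adder treat each 4-bit block as a single unit whose carry-out is GG + PG.Cin. The Python sketch below is one possible illustration under simplifying assumptions: the sum bits inside each block are computed with ordinary integer addition for brevity, and the block carries are evaluated one after another in a loop, whereas hardware would typically evaluate them in parallel with a second level of look-ahead logic. The names `group_pg` and `cla_add16` are invented for this sketch.

```python
def group_pg(a4, b4):
    """Group propagate and generate for one 4-bit block (equations 10, 11)."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) ^ ((b4 >> i) & 1) for i in range(4)]
    pg = p[3] & p[2] & p[1] & p[0]
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return pg, gg


def cla_add16(a, b, c0=0):
    """16-bit addition built from four 4-bit blocks.

    The carry into each block comes from the group terms of the previous
    block (carry_out = GG + PG.carry_in), never from a bit-level ripple.
    Returns (sum, carry_out).
    """
    total = 0
    carry = c0
    for block in range(4):
        shift = 4 * block
        a4 = (a >> shift) & 0xF
        b4 = (b >> shift) & 0xF
        # Sum bits of this block (integer addition used for brevity).
        total |= ((a4 + b4 + carry) & 0xF) << shift
        pg, gg = group_pg(a4, b4)
        carry = gg | (pg & carry)
    return total, carry


if __name__ == "__main__":
    print(cla_add16(0xFFFF, 0x0001))
```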

  • 7/28/2019 ARM Pipelining

    22/31

    22

    ARM Implementation - 6

    ALU functions
        Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on

    fs5  fs4  fs3  fs2  fs1  fs0   ALU output
    0    0    0    1    0    0     A and B
    0    0    1    0    0    0     A and not B
    0    0    1    0    0    1     A xor B
    0    1    1    0    0    1     A plus not B plus carry
    0    1    0    1    1    0     A plus B plus carry
    1    1    0    1    1    0     not A plus B plus carry
    0    0    0    0    0    0     A
    0    0    0    0    0    1     A or B
    0    0    0    1    0    1     B
    0    0    1    0    1    0     not B
    0    0    1    1    0    0     zero
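The function-select table can be expressed as a lookup from the six fs bits to an operation. The Python sketch below models only the outputs listed in the table on 32-bit values, not the gate-level way the fs lines steer the logic; it assumes fs5 is the most significant bit of the code, and `ALU_FUNCTIONS` and `alu` are names chosen here.

```python
MASK = 0xFFFFFFFF  # results are truncated to 32 bits

# Keys are the fs5..fs0 bits from the table, with fs5 as the most significant.
ALU_FUNCTIONS = {
    0b000100: lambda a, b, c: a & b,                 # A and B
    0b001000: lambda a, b, c: a & ~b,                # A and not B
    0b001001: lambda a, b, c: a ^ b,                 # A xor B
    0b011001: lambda a, b, c: a + (~b & MASK) + c,   # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,             # A plus B plus carry
    0b110110: lambda a, b, c: (~a & MASK) + b + c,   # not A plus B plus carry
    0b000000: lambda a, b, c: a,                     # A
    0b000001: lambda a, b, c: a | b,                 # A or B
    0b000101: lambda a, b, c: b,                     # B
    0b001010: lambda a, b, c: ~b,                    # not B
    0b001100: lambda a, b, c: 0,                     # zero
}


def alu(fs, a, b, carry=0):
    """Return the 32-bit ALU output for function-select code `fs`."""
    return ALU_FUNCTIONS[fs](a, b, carry) & MASK


if __name__ == "__main__":
    # "A plus not B plus carry" with carry = 1 is two's-complement A - B.
    print(alu(0b011001, 10, 3, carry=1))
```

The two rows with an inverted operand show how subtraction falls out of the adder: inverting B and setting the carry-in to 1 gives A - B, and inverting A instead gives B - A.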

  • 7/28/2019 ARM Pipelining

    23/31

    23

    ARM Implementation - 7

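    ALU functions
        The ARM2 ALU logic for one result bit

    [Figure: ARM2 ALU logic for one result bit. Inputs: NA bus, NB bus and function select lines fs 0 to 5. Outputs: ALU bus, plus G and P signals to the carry logic.]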

  • 7/28/2019 ARM Pipelining

    24/31

    24

    ARM Implementation - 8

    ARM6 Carry-Select Adder
        Computes the sums of various fields of the word for a carry-in of both zero and one
        The final result is selected by using the correct carry-in bit

    [Figure: carry-select adder. Inputs a,b[3:0] up to a,b[31:28] feed adders that produce s and s+1 ("+, +1"); multiplexers (mux) controlled by the carry c select sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]
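The carry-select idea can be sketched behaviourally: each field computes both candidate sums up front, and the incoming carry only has to drive a multiplexer. The Python model below uses the field boundaries that appear on the slide's sum outputs ([3:0], [7:4], [15:8], [31:16]); it is an illustration of the principle, not the ARM6 circuit, and `FIELDS` and `carry_select_add` are names invented here.

```python
# (lowest bit, width) of each field, following the sum outputs on the slide.
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]


def carry_select_add(a, b, cin=0):
    """32-bit carry-select addition. Returns (sum, carry_out)."""
    result = 0
    carry = cin
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa = (a >> low) & mask
        fb = (b >> low) & mask
        # Both candidates can be computed without knowing the carry-in.
        sum_if_0 = fa + fb
        sum_if_1 = fa + fb + 1
        # The multiplexer: the real carry-in only selects between them.
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << low
        carry = chosen >> width
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0x12345678, 0x11111111))
```

In hardware the benefit is that all fields add in parallel, so the critical path is roughly one field addition plus a short chain of multiplexers rather than a full 32-bit carry ripple.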

  • 7/28/2019 ARM Pipelining

    25/31

    25

    ARM Implementation - 9

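    ARM6 ALU Organization

    [Figure: ARM6 ALU organization. Blocks: A operand latch, B operand latch, XOR gates (on each operand), adder, logic functions, result mux, zero detect. Control inputs: invert A, invert B, C in, function, logic/arithmetic. Outputs: result and flags N, Z, C, V.]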

  • 7/28/2019 ARM Pipelining

    26/31

    26

    ARM Implementation - 10

    Barrel Shifter
        The shifter performance is critical
        Shifter time contributes to the datapath cycle time

    [Figure: 4 x 4 cross-bar switch matrix with inputs in[0] to in[3] and outputs out[0] to out[3]; control lines are labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3]

  • 7/28/2019 ARM Pipelining

    27/31

    27

    ARM Implementation - 10

    The ARM register bank

    [Figure: register bank floorplan. Blocks: A bus read decoders, B bus read decoders, write decoders, register cells, PC. Connections: Vdd, Vss, ALU bus, PC bus, INC bus, A bus, B bus.]

  • 7/28/2019 ARM Pipelining

    28/31

    28

    ARM Implementation - 11

    Control Structures

    [Figure: control logic organization. Blocks: decode PLA, cycle count, multiply control, load/store multiple, address control, register control, ALU control, shifter control. External signals: instruction, coprocessor.]

  • 7/28/2019 ARM Pipelining

    29/31

    29

    ARM Coprocessor Interface - 1

    A general-purpose extension of its instruction set through the addition of hardware coprocessors
        Also supports software emulation of coprocessors through the undefined instruction trap

    Coprocessor Architecture
        16 logical coprocessors
        Each coprocessor has up to 16 private registers of any reasonable size
        Load-store architecture
            Internal operations on registers
            Load and store from and to the memory
            Move data to or from an ARM register

    Implementation
        Board-level coprocessor: slow speed
        On-chip coprocessor: high clock speed, cache and memory management, etc.

  • 7/28/2019 ARM Pipelining

    30/31

    30

    ARM Coprocessor Interface - 2

    ARM7TDMI Coprocessor interface

    Bus watching
        Coprocessor is attached to a bus where the ARM instruction stream flows into the ARM
        Coprocessor copies the instructions into an internal pipeline

    Handshake between ARM and coprocessor
        cpi* (from ARM to all coprocessors): Coprocessor instruction
        cpa (from the coprocessors to ARM): Coprocessor absent
        cpb (from the coprocessors to ARM): Coprocessor busy

  • 7/28/2019 ARM Pipelining

    31/31

    31

    ARM Coprocessor Interface - 3

    Handshake outcomes

    ARM may decide not to execute it
        It falls in a branch shadow or fails condition code test
        cpi* high

    ARM may decide to execute it (cpi* low), but cpa high
        Undefined instruction trap

    ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet
        cpa low but cpb high
        Busy-wait while stalling instruction stream
        Enabled interrupt request arrives? Handle it and retry coprocessor instruction later

    ARM decides to execute it and coprocessor accepts it and executes it immediately
        cpi* low, cpa low, cpb low
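The four handshake outcomes amount to a small decision table over the three signals. The Python sketch below encodes that table as described on the slides; the signal levels are passed as booleans (True for high), and the function name and the returned strings are labels chosen for this illustration rather than ARM terminology.

```python
def handshake_outcome(cpi_n, cpa, cpb):
    """Classify a coprocessor handshake from the three signal levels.

    cpi_n: level of cpi* (True = high, ARM is not executing the instruction)
    cpa:   coprocessor absent (True = high, no coprocessor claimed it)
    cpb:   coprocessor busy (True = high, accepted but cannot start yet)
    """
    if cpi_n:
        # Branch shadow or failed condition code test.
        return "not executed"
    if cpa:
        # ARM wants to execute it, but no coprocessor is present.
        return "undefined instruction trap"
    if cpb:
        # Accepted, but the coprocessor is not ready: ARM busy-waits.
        return "busy-wait"
    return "executed"


if __name__ == "__main__":
    for signals in [(True, True, True), (False, True, True),
                    (False, False, True), (False, False, False)]:
        print(signals, handshake_outcome(*signals))
```

The order of the checks mirrors the priority on the slides: cpi* is examined first, cpa only matters once ARM has committed to the instruction, and cpb only matters once a coprocessor has accepted it.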