10/25/99 ©UCB Fall 1999 CS152 / Kubiatowicz Lec16.1

CS152 Computer Architecture and Engineering

Lecture 16

Dynamic Scheduling (Cont), Speculation, and ILP

October 25, 1999

John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Review: Compiler techniques for parallelism

° Loop unrolling ⇒ Multiple iterations of loop in software:
  • Amortizes loop overhead over several iterations
  • Gives more opportunity for scheduling around stalls

° Software Pipelining ⇒ Take one instruction from each of several iterations of the loop
  • Software overlapping of loop iterations
  • Today will show hardware overlapping of loop iterations

° Very Long Instruction Word machines (VLIW) ⇒ Multiple operations coded in single, long instruction
  • Requires sophisticated compiler to decide which operations can be done in parallel
  • Trace scheduling ⇒ find common path and schedule code as if branches didn’t exist (+ add “fixup code”)

° All of these require additional registers
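As a rough software-level picture of loop unrolling, the sketch below scales every element of a list, first one element per trip and then four per trip with a cleanup loop. The function names `scale_rolled` and `scale_unrolled4` are hypothetical and not from the lecture. Python only illustrates the structure; the actual benefit shows up in compiled code, where fewer branch and counter instructions execute and the four independent multiplies can be scheduled around stalls.

```python
def scale_rolled(xs, k):
    """Baseline: one multiply per trip, so the loop test runs len(xs) times."""
    out = list(xs)
    i = 0
    while i < len(out):
        out[i] *= k
        i += 1
    return out


def scale_unrolled4(xs, k):
    """Unrolled by 4: one loop test and one index update per four multiplies."""
    out = list(xs)
    n = len(out)
    i = 0
    while i + 4 <= n:
        # Four independent operations that a scheduler is free to interleave.
        out[i] *= k
        out[i + 1] *= k
        out[i + 2] *= k
        out[i + 3] *= k
        i += 4
    while i < n:
        # Cleanup loop for the 0-3 leftover elements.
        out[i] *= k
        i += 1
    return out
```

Both functions return the same list; the unrolled version needs more temporaries in compiled form, which is where the "additional registers" cost comes from.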

Review: Dynamic hardware for out-of-order execution

° HW exploitation of ILP
  • Works when dependences can’t be known at compile time
  • Code for one machine runs well on another

° Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
  • Enables out-of-order execution => out-of-order completion
  • ID stage checked both for structural & data dependencies
  • Original version didn’t handle forwarding
  • No automatic register renaming ⇒ stalls for WAR and WAW hazards
  • Are these fundamental limitations??? (No)

The Big Picture: Where are We Now?

° The Five Classic Components of a Computer
  [Figure: Processor (Control, Datapath), Memory, Input, Output]

° Today’s Topics:
  • Recap last lecture
  • Hardware loop unrolling with Tomasulo algorithm
  • Administrivia
  • Speculation, branch prediction
  • Reorder buffers

Another Dynamic Algorithm: Tomasulo Algorithm

° For IBM 360/91 about 3 years after CDC 6600 (1966)

° Goal: High Performance without special compilers

° Differences between IBM 360 & CDC 6600 ISA
  • IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
  • IBM has memory-register ops

° Why Study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Algorithm vs. Scoreboard

° Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  • FU buffers called “reservation stations”; have pending operands

° Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  • Avoids WAR, WAW hazards
  • More reservation stations than registers, so can do optimizations compilers can’t

° Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs

° Loads and Stores treated as FUs with RSs as well

° Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
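To make the renaming bullet concrete, here is a minimal sketch that renames destinations in the four arithmetic instructions of the example used later in this lecture. The tags `T1`, `T2`, ... and the function `rename` are illustrative assumptions; in Tomasulo's scheme the tag is the name of a reservation station, and an operand that is already available is copied as a value rather than kept as a register name.

```python
from itertools import count


def rename(program):
    """Rename destinations of (op, dest, src1, src2) tuples to fresh tags.

    Assumption: unlimited tags. Real hardware is limited by the number of
    reservation stations.
    """
    latest = {}        # architectural register -> tag of its newest pending producer
    fresh = count(1)
    renamed = []
    for op, dest, s1, s2 in program:
        # A source refers to the newest producer if there is one,
        # otherwise to the register's current (already available) value.
        a = latest.get(s1, s1)
        b = latest.get(s2, s2)
        tag = f"T{next(fresh)}"
        latest[dest] = tag
        renamed.append((op, tag, a, b))
    return renamed


program = [
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),   # reads F6 ...
    ("ADDD", "F6", "F8", "F2"),    # ... which ADDD later overwrites (WAR)
]

for row in rename(program):
    print(row)
```

After renaming, DIVD still reads the old F6 while ADDD writes a fresh tag, so ADDD no longer has to wait for DIVD to read its operand. The same mechanism removes WAW hazards, because two writes to one register receive different tags.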

Tomasulo Organization

[Figure: block diagram of the Tomasulo organization; labels not recoverable from the extracted text]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Values of source operands
  • Store buffers have a V field: the result to be stored

Qj, Qk: Reservation stations producing the source registers (value to be written)
  • Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready
  • Store buffers only have Qi for the RS producing the result

Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.
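The fields above map naturally onto a small record type. The following is only a sketch: the class name `ReservationStation`, the use of `None` in place of "Qj = 0", and the `ready` helper are assumptions made for illustration, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                      # e.g. "Add1", "Mult2"
    busy: bool = False
    op: Optional[str] = None       # operation to perform
    vj: Optional[float] = None     # value of first source operand, once known
    vk: Optional[float] = None     # value of second source operand, once known
    qj: Optional[str] = None       # station producing Vj; None plays the role of 0
    qk: Optional[str] = None       # station producing Vk; None plays the role of 0

    def ready(self) -> bool:
        """No separate ready flag: both Q fields empty means both operands are present."""
        return self.busy and self.qj is None and self.qk is None


# Register result status: which station will write each register (None = blank).
register_status = {"F0": "Mult1", "F2": "Load2", "F4": None}

# A multiply waiting for its first operand from Load2; the value 2.5 is made up.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj="Load2")
print(mult1.ready())   # False until Load2's result arrives
```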

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available

° Normal data bus: data + destination (“go to” bus)

° Common data bus: data + source (“come from” bus)
  • 64 bits of data + 4 bits of Functional Unit source address
  • Write if matches expected Functional Unit (produces result)
  • Does the broadcast
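The "come from" behaviour of the Common Data Bus can be sketched as a function that every waiting unit effectively runs when a result appears. The dictionary layout and the name `broadcast` are assumptions for illustration, and the operand values are made up. The state mirrors the point in the example below where Load2 writes its result while Add1 and Mult1 are waiting on it.

```python
def broadcast(tag, value, stations, register_status, registers):
    """Deliver (tag, value) to every unit whose pending source matches tag."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    # A register is written only if tag is still its expected producer.
    for reg, producer in list(register_status.items()):
        if producer == tag:
            registers[reg] = value
            register_status[reg] = None


stations = [
    {"name": "Add1", "op": "SUBD", "Vj": 1.5, "Vk": None, "Qj": None, "Qk": "Load2"},
    {"name": "Mult1", "op": "MULTD", "Vj": None, "Vk": 4.0, "Qj": "Load2", "Qk": None},
]
register_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
registers = {"F2": 0.0}

broadcast("Load2", 3.0, stations, register_status, registers)
```

The sender never names its receivers. Each receiver compares the broadcast tag against the tags it is waiting for, which is why every station needs an associative comparison on the bus.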

Tomasulo Example

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2                                   Load1  No
LD     F2     45+  R3                                   Load2  No
MULTD  F0     F2   F4                                   Load3  No
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
0     FU

Tomasulo Example Cycle 1

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1                               Load1  Yes   34+R2
LD     F2     45+  R3                                   Load2  No
MULTD  F0     F2   F4                                   Load3  No
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
1     FU                             Load1

Tomasulo Example Cycle 2

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1                               Load1  Yes   34+R2
LD     F2     45+  R3   2                               Load2  Yes   45+R3
MULTD  F0     F2   F4                                   Load3  No
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
2     FU           Load2             Load1

° Note: Unlike 6600, can have multiple loads outstanding

Tomasulo Example Cycle 3

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3                        Load1  Yes   34+R2
LD     F2     45+  R3   2                               Load2  Yes   45+R3
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
3     FU  Mult1    Load2             Load1

° Note: register names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard
° Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4                        Load2  Yes   45+R3
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   Yes   SUBD   M(A1)                      Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
4     FU  Mult1    Load2             M(A1)    Add1

° Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
2     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
5     FU  Mult1    M(A2)             M(A1)    Add1     Mult2

Tomasulo Example Cycle 6

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
1     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
6     FU  Mult1    M(A2)             Add2     Add1     Mult2

° Issue ADDD here vs. scoreboard?

Tomasulo Example Cycle 7

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
0     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
7     FU  Mult1    M(A2)             Add2     Add1     Mult2

° Add1 completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
2     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
8     FU  Mult1    M(A2)             Add2     (M-M)    Mult2

Tomasulo Example Cycle 9

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
1     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
6     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
9     FU  Mult1    M(A2)             Add2     (M-M)    Mult2

Tomasulo Example Cycle 10

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
0     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
5     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
10    FU  Mult1    M(A2)             Add2     (M-M)    Mult2

° Add2 completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
11    FU  Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

° Write result of ADDD here vs. scoreboard?
° All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
12    FU  Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 13

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
13    FU  Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 14

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3                               Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
14    FU  Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 15

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3      15                       Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
15    FU  Mult1    M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 16

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3      15         16            Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
16    FU  M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3      15         16            Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
1     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
55    FU  M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

Tomasulo Example Cycle 56

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3      15         16            Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5      56
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
56    FU  M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

° Mult2 completing; what is waiting for it?

Tomasulo Example Cycle 57

Instruction status:                                     Load buffers:
Instruction   j    k    Issue  Exec Comp  Write Result  Name   Busy  Address
LD     F6     34+  R2   1      3          4             Load1  No
LD     F2     45+  R3   2      4          5             Load2  No
MULTD  F0     F2   F4   3      15         16            Load3  No
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5      56         57
ADDD   F6     F8   F2   6      10         11

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock     F0       F2       F4       F6       F8       F10      F12  ...  F30
56    FU  M*F4     M(A2)             (M-M+M)  (M-M)    Mult2

° Once again: In-order issue, out-of-order execution and completion.
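The Issue / Exec Comp / Write Result columns of this walkthrough can be reproduced with a small timing-only model. This is a sketch under stated assumptions, not a full Tomasulo simulator. The latencies (load 2, add/sub 2, multiply 10, divide 40) are inferred from the countdown in the Time column and from the load timings. The model issues one instruction per cycle and ignores reservation-station shortages and Common Data Bus conflicts, neither of which occurs in this six-instruction sequence. The names `LATENCY`, `PROGRAM` and `schedule` are hypothetical.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination, FP source registers); loads have no FP sources.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, exec_complete, write_result) per instruction."""
    last_write = {}   # register -> cycle its newest producer writes the CDB
    rows = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        issue = n     # in-order issue, one instruction per cycle
        # Sources are bound at issue to the newest producer (renaming), so a
        # later write to the same register cannot disturb this instruction.
        operands_ready = max([last_write.get(s, 0) for s in srcs], default=0)
        start = max(issue, operands_ready) + 1
        done = start + LATENCY[op] - 1
        write = done + 1
        last_write[dest] = write
        rows.append((op, dest, issue, done, write))
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Under these assumptions the model yields the same three columns as the table above, including ADDD writing F6 at cycle 11, long before DIVD (which read the old F6) finishes at cycle 57.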

Compare to Scoreboard Cycle 62

                        Scoreboard                                 Tomasulo
Instruction   j    k    Issue  Read Oper  Exec Comp  Write Result  Issue  Exec Comp  Write Result
LD     F6     34+  R2   1      2          3          4             1      3          4
LD     F2     45+  R3   5      6          7          8             2      4          5
MULTD  F0     F2   F4   6      9          19         20            3      15         16
SUBD   F8     F6   F2   7      9          11         12            4      7          8
DIVD   F10    F0   F6   8      21         61         62            5      56         57
ADDD   F6     F8   F2   13     14         16         22            6      10         11

° Why take longer on scoreboard/6600?
  • Structural Hazards
  • Lack of forwarding

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

                    Tomasulo (IBM 360/91)            Scoreboard (CDC 6600)
Functional units    Pipelined Functional Units       Multiple Functional Units
                    (6 load, 3 store, 3 +, 2 x/÷)    (1 load/store, 1 +, 2 x, 1 ÷)
Window size         ~ 14 instructions                ~ 5 instructions
Structural hazard   No issue on structural hazard    same
WAR                 renaming avoids                  stall completion
WAW                 renaming avoids                  stall issue
Results             Broadcast results from FU        Write/read registers
Control             reservation stations             central scoreboard

Tomasulo Drawbacks

° Complexity
  • delays of 360/91, MIPS 10000, IBM 620?

° Many associative stores (CDB) at high speed

° Performance limited by Common Data Bus
  • Multiple CDBs => more FU logic for parallel assoc stores

Administrivia

° Should be debugging Lab 5 by now!
  • Remember: a working processor is necessary for full credit…

° Tomorrow: Sections are back in classroom

° More info on some of the things that we have been talking about in the last two lectures:
  • Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson

° Next: Memory systems
  • Start reading Chapter 7 (of your text) now…
  • Lab 6 will be using memory systems.

Administrivia: Be careful about clock edges in lab5!

[Figure: pipelined datapath. Recoverable labels: Next PC, PC, Inst. Mem, IR, Valid, Dcd Ctrl, Reg File, IRex, Ex Ctrl, A, B, Exec, Equal, S, IRmem, Mem Ctrl, Mem Access, Data Mem, M, D, IRwb, WB Ctrl, Reg. File]

° Since Register file has edge-triggered write:
  • Must have everything set up at end of memory stage
  • This means that “M” register here is not actual register!

° Same with edge-triggered memory ⇒ “D” register appears “inside” memory

Tomasulo Loop Example

Loop: LD     F0   0   R1
      MULTD  F4   F0  F2
      SD     F4   0   R1
      SUBI   R1   R1  #8
      BNEZ   R1   Loop

° Assume Multiply takes 4 clocks

° Assume first load takes 8 clocks (cache miss), second load takes 1 clock (hit)

° To be clear, will show clocks for SUBI, BNEZ

° Reality: integer instructions ahead
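For orientation, the loop above is a scale-in-place over a block of memory walked from high addresses to low. The sketch below mirrors it one instruction per line. The function name `scale_in_place` and the dictionary used as memory are illustrative assumptions; the starting value R1 = 80 is taken from the register status row of the tables that follow.

```python
def scale_in_place(memory, r1, f2):
    """memory maps byte addresses to doubles; r1 starts at the highest address."""
    while True:
        f0 = memory[r1]        # LD    F0, 0(R1)
        f4 = f0 * f2           # MULTD F4, F0, F2
        memory[r1] = f4        # SD    F4, 0(R1)
        r1 -= 8                # SUBI  R1, R1, #8
        if r1 == 0:            # BNEZ  R1, Loop (fall through when R1 reaches 0)
            break
    return memory


# With R1 = 80 the loop touches addresses 80, 72, ..., 8.
mem = {addr: float(addr) for addr in range(8, 88, 8)}
scale_in_place(mem, 80, 2.0)
```

Each iteration reuses F0 and F4. That reuse is the false dependence that register renaming removes when the hardware overlaps iterations.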

Loop Example

Instruction status:                                           Load/store buffers:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result  Name    Busy  Addr  Fu
1     LD     F0     0    R1                                   Load1   No
1     MULTD  F4     F0   F2                                   Load2   No
1     SD     F4     0    R1                                   Load3   No
2     LD     F0     0    R1                                   Store1  No
2     MULTD  F4     F0   F2                                   Store2  No
2     SD     F4     0    R1                                   Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock R1      F0       F2       F4       F6       F8       F10      F12  ...  F30
0     80  Fu

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.38

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Fu Load1

Loop Example Cycle 1

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.39

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Fu Load1 Mult1

Loop Example Cycle 2

Loop Example Cycle 3

Instruction status:                                           Load/store buffers:
ITER  Instruction   j    k    Issue  Exec Comp  Write Result  Name    Busy  Addr  Fu
1     LD     F0     0    R1   1                               Load1   Yes   80
1     MULTD  F4     F0   F2   2                               Load2   No
1     SD     F4     0    R1   3                               Load3   No
2     LD     F0     0    R1                                   Store1  Yes   80    Mult1
2     MULTD  F4     F0   F2                                   Store2  No
2     SD     F4     0    R1                                   Store3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd           R(F2)    Load1
      Mult2  No

Register result status:
Clock R1      F0       F2       F4       F6       F8       F10      F12  ...  F30
3     80  Fu  Load1             Mult1

° Implicit renaming sets up “DataFlow” graph

What does this mean physically?

[Figure: Tomasulo organization diagram annotated with the state after cycle 3]
  Load buffer:      addr: 80
  Store buffer:     Addr: 80, waiting on Mult1
  Mult1 station:    mul, R(F2), Load1
  Register status:  F0: Load1, F4: Mult1

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.42

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 Store1 Yes 80 Mult12 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F304 80 Fu Load1 Mult1

° Dispatching SUBI Instruction

Loop Example Cycle 4

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.43

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 Store1 Yes 80 Mult12 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F305 72 Fu Load1 Mult1

° And, BNEZ instruction

Loop Example Cycle 5

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.44

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F306 72 Fu Load2 Mult1

° Notice that F0 never sees Load from location 80

Loop Example Cycle 6

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.45

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F307 72 Fu Load2 Mult2

° Register file completely detached from iteration 1

Loop Example Cycle 7

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.46

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F308 72 Fu Load2 Mult2

Loop Example Cycle 8

° First and Second iteration completely overlapped

What does this mean physically?

[Figure: Tomasulo organization diagram annotated with the state after cycle 8]
  Load buffers:     addr: 80, addr: 72
  Store buffers:    Addr: 80 waiting on Mult1, Addr: 72 waiting on Mult2
  Mult stations:    mul, R(F2), Load1; mul, R(F2), Load2
  Register status:  F0: Load2, F4: Mult2

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.48

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F309 72 Fu Load2 Mult2

° Load1 completing: who is waiting?° Note: Dispatching SUBI

Loop Example Cycle 9

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.49

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3010 64 Fu Load2 Mult2

° Load2 completing: who is waiting?° Note: Dispatching BNEZ

Loop Example Cycle 10

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.50

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3011 64 Fu Load3 Mult2

° Next load in sequence

Loop Example Cycle 11

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.51

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3012 64 Fu Load3 Mult2

° Why not issue third multiply?

Loop Example Cycle 12

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.52

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3013 64 Fu Load3 Mult2

Loop Example Cycle 13

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.53

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3014 64 Fu Load3 Mult2

° Mult1 completing. Who is waiting?

Loop Example Cycle 14

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.54

Instruction status: Exec Write

ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3015 64 Fu Load3 Mult2

° Mult2 completing. Who is waiting?

Loop Example Cycle 15

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.55

Instruction status:

ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2      14         15
1     SD     F4    0   R1   3
2     LD     F0    0   R1   6      10         11
2     MULTD  F4    F0  F2   7      15         16
2     SD     F4    0   R1   8

        Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  No

Reservation Stations:

                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:

Clock  R1        F0     F2     F4     F6  F8  F10  F12  ...  F30
16     64   Fu   Load3         Mult1

Loop Example Cycle 16


Instruction status:

ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2      14         15
1     SD     F4    0   R1   3
2     LD     F0    0   R1   6      10         11
2     MULTD  F4    F0  F2   7      15         16
2     SD     F4    0   R1   8

        Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:

                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:

Clock  R1        F0     F2     F4     F6  F8  F10  F12  ...  F30
17     64   Fu   Load3         Mult1

Loop Example Cycle 17


Instruction status:

ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2      14         15
1     SD     F4    0   R1   3      18
2     LD     F0    0   R1   6      10         11
2     MULTD  F4    F0  F2   7      15         16
2     SD     F4    0   R1   8

        Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:

                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:

Clock  R1        F0     F2     F4     F6  F8  F10  F12  ...  F30
18     64   Fu   Load3         Mult1

Loop Example Cycle 18


Instruction status:

ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2      14         15
1     SD     F4    0   R1   3      18         19
2     LD     F0    0   R1   6      10         11
2     MULTD  F4    F0  F2   7      15         16
2     SD     F4    0   R1   8      19

        Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  No
Store2  Yes   72    [72]*R2
Store3  Yes   64    Mult1

Reservation Stations:

                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:

Clock  R1        F0     F2     F4     F6  F8  F10  F12  ...  F30
19     64   Fu   Load3         Mult1

Loop Example Cycle 19


Instruction status:

ITER  Instruction  j   k    Issue  Exec Comp  Write Result
1     LD     F0    0   R1   1      9          10
1     MULTD  F4    F0  F2   2      14         15
1     SD     F4    0   R1   3      18         19
2     LD     F0    0   R1   6      10         11
2     MULTD  F4    F0  F2   7      15         16
2     SD     F4    0   R1   8      19         20

        Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  No
Store2  No
Store3  Yes   64    Mult1

Reservation Stations:

                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD     F0  0   R1
MULTD  F4  F0  F2
SD     F4  0   R1
SUBI   R1  R1  #8
BNEZ   R1  Loop

Register result status:

Clock  R1        F0     F2     F4     F6  F8  F10  F12  ...  F30
20     64   Fu   Load3         Mult1

Loop Example Cycle 20


Why can Tomasulo overlap iterations of loops?

° Register renaming

• Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

• Replace static register names from code with dynamic register “pointers”

• Effectively increases size of register file

• Permit instruction issue to advance past integer control flow operations.

° Crucial: integer unit must “get ahead” of floating point unit so that we can issue multiple iterations

° Other idea: Tomasulo building “DataFlow” graph.
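The renaming idea above can be sketched in a few lines of Python. This is an illustrative model only: it hands out fresh tags `T0`, `T1`, ... where the real hardware would use reservation station or load buffer names, it leaves R1 unrenamed because the SUBI is omitted, and `rename` is a hypothetical helper name.

```python
def rename(instrs, num_iters):
    """Replay a loop body num_iters times, renaming every destination.

    instrs: list of (op, dest, src1, src2) using architectural names or None.
    Returns a list of (op, new_dest, new_src1, new_src2).
    """
    table = {}        # architectural register -> tag of its newest producer
    next_tag = 0
    out = []
    for _ in range(num_iters):
        for op, dest, s1, s2 in instrs:
            # Sources read the newest mapping, or keep their name if never renamed.
            p1 = table.get(s1, s1)
            p2 = table.get(s2, s2)
            pd = None
            if dest is not None:
                pd = f"T{next_tag}"   # fresh tag: no WAR or WAW with older uses
                next_tag += 1
                table[dest] = pd
            out.append((op, pd, p1, p2))
    return out


body = [("LD", "F0", None, "R1"), ("MULTD", "F4", "F0", "F2"), ("SD", None, "F4", "R1")]
for row in rename(body, 2):
    print(row)
```

After renaming, the second iteration's MULTD reads `T2` and writes `T3`, so it shares no register name with the first iteration's `T0` and `T1`. That is the sense in which the hardware unrolls the loop dynamically.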


Recall: Unrolled Loop That Minimizes Stalls

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    0(R1),F4
          SD    -8(R1),F8
          SD    -16(R1),F12
          SUBI  R1,R1,#32
          BNEZ  R1,Loop
          SD    8(R1),F16    ; 8-32 = -24


Why issue in order?

° In-order issue permits us to analyze data flow of program

° This way, we know exactly which results should flow to which subsequent instructions

• If we issued out-of-order, we would confuse RAW and WAR hazards!

• The most advanced machines that I know of all issue in order.

° This idea works perfectly well “in principle” with multiple instructions issued per clock:

• Need to multi-port “rename table” and be able to rename a sequence of instructions together

• Need to be able to issue to multiple reservation stations in a single cycle.

• Need to have 2x number of read ports and 1x number of write ports in register file.

° In-order issue can be serious bottleneck when issuing multiple instructions per clock-cycle


Branches must be resolved quickly for loop overlap!

° In our example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap!

    Loop: LD     F0  0   R1
          MULTD  F4  F0  F2
          SD     F4  0   R1
          SUBI   R1  R1  #8
          BNEZ   R1  Loop

° What happens if branch depends on result of multd??

• We completely lose all of our advantages!

• Need to be able to “predict” branch outcome.

• If we were to predict that branch was taken, this would be right most of the time.

° Problem much worse for superscalar machines!

Independent “Fetch” unit

[Figure: an “Instruction Fetch with Branch Prediction” block and an “Out-Of-Order Execution Unit” block, connected by “Stream of Instructions To Execute” and “Correctness Feedback On Branch Results”]

° Instruction fetch decoupled from execution

° Often issue logic (+ rename) included with Fetch


° Prediction has become essential to getting good performance from scalar instruction streams.

° We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions:

• At what point does computation become a probabilistic operation + verification?

• We are pretty close with control hazards already…

° Why does prediction work?

• Underlying algorithm has regularities.

• Data that is being operated on has regularities.

• Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems.

° Prediction ⇒ Compressible information streams?

Prediction: Branches, Dependencies, Data


Dynamic Branch Prediction

° Prediction could be “Static” (at compile time) or “Dynamic” (at runtime)

• For our example, if we were to statically predict “taken”, we would only be wrong once each pass through loop

° Is dynamic branch prediction better than static branch prediction?

• Seems to be. Still some debate to this effect

• Today, lots of hardware being devoted to dynamic branch predictors.


° Address of branch is the index used to get prediction AND branch address (if taken)

• Must check for branch match now, since can’t use wrong branch address

• Grab predicted PC from table since may take several cycles to compute

° Update predicted PC when branch is actually resolved

° Return instruction addresses predicted with stack

Simple dynamic prediction: Branch Target Buffer (BTB)
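As a rough illustration of the lookup-and-check behavior listed on this slide, here is a small direct-mapped BTB in Python. It is a sketch under stated assumptions: 4-byte instructions, a 16-entry table, and entries dropped when a branch resolves not taken. The class name `BranchTargetBuffer` and its methods are hypothetical, and the return-address stack is left out.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: index with low PC bits, then tag-check the full PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries    # each slot: (branch_pc, predicted_target)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        """Return the predicted next PC: stored target on a hit, else fall through."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:   # match required: a wrong target is useless
            return slot[1]
        return pc + 4

    def update(self, pc, taken, target):
        """Called once the branch is actually resolved."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None         # stop predicting taken for this branch


btb = BranchTargetBuffer()
btb.update(0x40, True, 0x10)
print(hex(btb.predict(0x40)))   # 0x10
print(hex(btb.predict(0x80)))   # 0x84
```

Address 0x80 maps to the same slot as 0x40 in this 16-entry table, but the tag check fails, so the fetch unit falls through instead of jumping to another branch's target.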


Dynamic Branch Prediction

° Performance = ƒ(accuracy, cost of misprediction)

° Branch History Table: Lower bits of PC address index table of 1-bit values

• Says whether or not branch taken last time

• No address check

° Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):

• End of loop case, when it exits instead of looping as before

• First time through loop on next time through code, when it predicts exit instead of looping
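The two mispredictions per loop execution can be checked with a short simulation. This sketch models a single 1-bit entry and assumes a loop branch that is taken 9 times and then falls through once; `mispredictions_1bit` is a hypothetical helper name.

```python
def mispredictions_1bit(outcomes, initial=True):
    """Count mispredictions of one 1-bit predictor entry over a list of outcomes."""
    last, misses = initial, 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken          # remember only the most recent outcome
    return misses


# Loop branch: taken 9 times, then not taken at exit; the loop is run 3 times.
one_pass = [True] * 9 + [False]
print(mispredictions_1bit(one_pass * 3))   # 5
```

The first execution costs one misprediction (the exit) because the entry starts out as "taken"; every later execution costs two, one at the exit and one on re-entry.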


° Solution: 2-bit scheme where we change the prediction only after mispredicting twice: (Figure 4.13, p. 264)

° Red: stop, not taken

° Green: go, taken

° Adds hysteresis to decision making process

Dynamic Branch Prediction

[State diagram: two “Predict Taken” states and two “Predict Not Taken” states, with transitions labeled T (taken) and NT (not taken)]


BHT Accuracy

° Mispredict because either:

• Wrong guess for that branch

• Got branch history of wrong branch when indexing the table

° 4096 entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

° 4096 about as good as infinite table (in Alpha 21164)


Correlating Branches

° Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

° Two possibilities; Current branch depends on:

• Last m most recently executed branches anywhere in program. Produces a “GA” (global history; the “A” stands for “adaptive”) in the Yeh and Patt classification (e.g. GAg)

• Last m most recent outcomes of same branch. Produces a “PA” (per-address history) in same classification (e.g. PAg)

° Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry

• A single history table shared by all branches (appends a “g” at end), indexed by history value.

• Address is used along with history to select table entry (appends a “p” at end of classification)

• If only portion of address used, often appends an “s” to indicate “set-indexed” tables (e.g. GAs)


Correlating Branches

(2,2) GAs predictor

• First 2 means that we keep two bits of history

• Second means that we have 2 bit counters in each slot.

• Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction

• Note that the original two-bit counter solution would be a (0,2) GAs predictor

• Note also that aliasing is possible here...

[Figure: (2,2) predictor. Labels: Branch address; 2-bits per branch predictors; Prediction; 2-bit global branch history register]

° For instance, consider global history, set-indexed BHT. That gives us a GAs history table.
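A (2,2) predictor of the kind described on this slide can be sketched as follows. The model assumes 4-byte instructions, 16 sets, counters that start at 1 (weakly not taken), and saturating 2-bit counters; `CorrelatingPredictor` is a hypothetical class name, and setting `m=0` reduces it to a plain 2-bit table.

```python
class CorrelatingPredictor:
    """(m, 2) predictor sketch: m bits of global history pick one of 2**m
    two-bit counters inside the set chosen by low branch-address bits."""

    def __init__(self, m=2, sets=16):
        self.m, self.sets = m, sets
        self.history = 0        # last m outcomes, newest in bit 0
        self.counters = [[1] * (1 << m) for _ in range(sets)]

    def _slot(self, pc):
        return (pc >> 2) % self.sets, self.history

    def predict(self, pc):
        s, h = self._slot(pc)
        return self.counters[s][h] >= 2

    def update(self, pc, taken):
        s, h = self._slot(pc)
        c = self.counters[s][h]
        # Update only the counter that made this prediction.
        self.counters[s][h] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


# A branch that alternates taken / not taken is learned after a short warm-up.
p = CorrelatingPredictor()
misses = []
for i in range(20):
    taken = (i % 2 == 0)
    misses.append(p.predict(0x40) != taken)
    p.update(0x40, taken)
print(sum(misses))   # 2
```

Two different branches that map to the same set and see the same history still share a counter, which is the aliasing the slide warns about.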


Accuracy of Different Schemes

[Figure: frequency of mispredictions (axis 0% to 18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]


HW support for More ILP

° Avoid branch prediction by turning branches into conditionally executed instructions:

    if (x) then A = B op C else NOP

• If false, then neither store result nor cause exception

• Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.

• EPIC: 64 1-bit condition fields, selected to allow conditional execution

° Drawbacks to conditional instructions

• Still takes a clock even if “annulled”

• Stall if condition evaluated late

• Complex conditions reduce effectiveness; condition becomes known late in pipeline


Now what about exceptions???

° Out-of-order commit really messes up our chance to get precise exceptions!

• When committing results out-of-order, register file contains results from later instructions while earlier ones have not completed yet.

• What if need to cause exception on one of those early instructions??

° Need to be able to “rollback” register file to consistent state

• Remember that “precise” means that there is some PC such that: all instructions before have committed results, and none after have committed results.

° Big problem for branch prediction as well: What if prediction wrong??


° Speculation is a form of guessing.

° Important for branch prediction:

• Need to “take our best shot” at predicting branch direction.

• If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:

- Consider 4 instructions per cycle

- If take single cycle to decide on branch, waste from 4 - 7 instruction slots!

° If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:

• This is exactly same as precise exceptions!

° Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

Relationship between precise interrupts and speculation:
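The "4 - 7 instruction slots" figure on this slide can be reproduced with a little arithmetic. The sketch assumes that, without prediction, the slots after the branch in its own issue group are lost along with every slot in the cycles spent resolving it; `wasted_slots` is a hypothetical helper name.

```python
def wasted_slots(issue_width, branch_slot, resolve_cycles=1):
    """Issue slots lost while an unpredicted branch is resolved.

    branch_slot is the 0-based position of the branch within its issue group.
    """
    rest_of_group = issue_width - 1 - branch_slot
    return rest_of_group + issue_width * resolve_cycles


print([wasted_slots(4, s) for s in range(4)])   # [7, 6, 5, 4]
```

A branch in the first slot of a 4-wide group wastes 3 + 4 = 7 slots; a branch in the last slot wastes only the 4 slots of the following cycle.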


HW support for precise interrupts

° Need HW buffer for results of uncommitted instructions: reorder buffer

• 3 fields: instr, destination, value

• Reorder buffer can be operand source => more registers like RS

• Use reorder buffer number instead of reservation station when execution completes

• Supplies operands between execution complete & commit

• Once operand commits, result is put into register

• Instructions commit

• As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: FP Op Queue, Reorder Buffer, FP Regs, reservation stations, and FP adders]


1. Issue: get instruction from FP Op Queue

• If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution: operate on operands (EX)

• When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result: finish execution (WB)

• Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit: update register with reorder result

• When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer.

• Mispredicted branch or interrupt flushes reorder buffer (sometimes called “graduation”)

Four Steps of Speculative Tomasulo Algorithm
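The commit step is what restores program order, and a small model makes that concrete. The sketch below keeps the three fields named earlier (instr, destination, value) plus a done flag. `ReorderBuffer` and its method names are hypothetical, and `flush` simply drops every uncommitted entry, which matches the case where the mispredicted branch or faulting instruction has reached the head.

```python
from collections import deque


class ReorderBuffer:
    """In-order commit sketch: each entry is [instr, dest, value, done]."""

    def __init__(self):
        self.entries = deque()
        self.regs = {}                      # committed architectural state

    def issue(self, instr, dest):
        entry = [instr, dest, None, False]
        self.entries.append(entry)          # allocate at the tail, in program order
        return entry

    def write_result(self, entry, value):
        entry[2], entry[3] = value, True    # may happen out of order

    def commit(self):
        """Retire finished instructions from the head only; return those retired."""
        retired = []
        while self.entries and self.entries[0][3]:
            instr, dest, value, _ = self.entries.popleft()
            if dest is not None:
                self.regs[dest] = value
            retired.append(instr)
        return retired

    def flush(self):
        """Mispredicted branch or exception: discard all uncommitted work."""
        self.entries.clear()


rob = ReorderBuffer()
ld = rob.issue("LD F0,10(R2)", "F0")
add = rob.issue("ADDD F10,F4,F6", "F10")
rob.write_result(add, 7.0)      # the younger instruction finishes first
print(rob.commit())             # []
rob.write_result(ld, 3.0)
print(rob.commit())             # ['LD F0,10(R2)', 'ADDD F10,F4,F6']
```

Because `regs` changes only inside `commit`, a flush never has to undo a register write. That is why the same mechanism serves both precise exceptions and branch misprediction recovery.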


Tomasulo With Reorder buffer:

[Figure: reorder buffer contents. Legible entries: LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; LD F4,0(R3) with value M[10]; ADDD F0,F4,F6; ST 0(R3),F0. Other legible entries: “2 ADDD R(F4),ROB1”, “3 DIVD ROB2,R(F6)”, “1 10+R2”.]


Dynamic Scheduling in PowerPC 604 and Pentium Pro

° Both In-order Issue, Out-of-order execution, In-order Commit

PPro central reservation station for any functional units with one bus shared by a branch and an integer unit


Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter                            PPC           PPro
Max. instructions issued/clock       4             3
Max. instr. complete exec./clock     6             5
Max. instr. committed/clock          6             3
Instructions in reorder buffer       16            40
Number of rename buffers             12 Int/8 FP   40
Number of reservation stations       12            20
No. integer functional units (FUs)   2             2
No. floating point FUs               1             1
No. branch FUs                       1             1
No. complex integer FUs              1             0
No. memory FUs                       1             1 load + 1 store


Dynamic Scheduling in Pentium Pro

° PPro doesn’t pipeline 80x86 instructions

° PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)

° Sends micro-operations to reorder buffer & reservation stations

° Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations

° Most instructions translate to 1 to 4 micro-operations

° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations


Limits to Multi-Issue Machines

° Inherent limitations of ILP

• 1 branch in 5: How to keep a 5-way superscalar busy?

• Latencies of units: many operations must be scheduled

• Need about Pipeline Depth x No. Functional Units of independent instructions to keep fully busy

• Increase ports to Register File

- VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg

• Increase ports to memory

• Current state of the art: Many hardware structures (such as issue/rename logic) have delay proportional to square of number of instructions issued/cycle


° Conflicting studies of amount of ILP, depending on:

• Benchmarks (vectorized Fortran FP vs. integer C programs)

• Hardware sophistication

• Compiler sophistication

° How much ILP is available using existing mechanisms with increasing HW budgets?

° Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

• Intel MMX

• Motorola AltiVec

• Supersparc Multimedia ops, etc.

Limits to ILP


Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided

2. Branch prediction: perfect; no mispredictions

3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available

4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Limits to ILP
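Under the ideal-machine assumptions listed on this slide, only true data dependences limit the schedule, so the achievable IPC is the instruction count divided by the length of the longest dependence chain. A minimal Python sketch, assuming the trace is already given as lists of producer indices; `ideal_ipc` is a hypothetical helper, not the tool used for the studies in this lecture.

```python
def ideal_ipc(deps):
    """IPC on an ideal machine: unit latency, unlimited issue, perfect prediction
    and renaming, so only true (RAW) dependences delay an instruction.

    deps[i] lists the indices (all smaller than i) that instruction i reads from.
    The trace must be non-empty.
    """
    finish = []     # cycle in which each instruction completes
    for producers in deps:
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + 1)
    return len(deps) / max(finish)


# Two independent chains of three dependent instructions each.
trace = [[], [0], [1], [], [3], [4]]
print(ideal_ipc(trace))   # 2.0
```

Limiting the window, the rename registers, or the predictor accuracy can only lengthen the schedule relative to this bound.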


[Figure: IPC on the ideal machine, by program: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1. Integer: 18 - 60; FP: 75 - 150.]

Upper Limit to ILP: Ideal Machine


[Figure: IPC (axis 0 to 60) for gcc, espresso, li, fpppp, doducd, tomcatv under five branch predictors: Perfect; Selective predictor (pick correlating or BHT); Standard 2-bit (BHT, 512 entries); Static (profile); None (no prediction). Change from infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle. Integer: 6 - 12; FP: 15 - 45.]

More Realistic HW: Branch Impact


[Figure: IPC (axis 0 to 60) for gcc, espresso, li, fpppp, doducd, tomcatv with Infinite, 256, 128, 64, 32, or no rename registers. Change: 2000 instr window, 64 instr issue, 8K 2 level prediction. Integer: 5 - 15; FP: 11 - 45.]

More Realistic HW: Register Impact (rename regs)


[Figure: IPC (axis 0 to 50) for gcc, espresso, li, fpppp, doducd, tomcatv with four levels of memory alias analysis: Perfect; Global/stack perfect (heap conflicts); Inspection (of assembly); None. Change: 2000 instr window, 64 instr issue, 8K 2 level prediction, 256 renaming registers. Integer: 4 - 9; FP: 4 - 45 (Fortran, no heap).]

More Realistic HW: Alias Impact


[Figure: IPC (axis 0 to 60) for gcc, espresso, li, fpppp, doducd, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, 4. Perfect disambiguation (HW), 1K selective prediction, 16 entry return, 64 registers, issue as many as window. Integer: 6 - 12; FP: 8 - 45.]

Realistic HW for ‘9X: Window Impact


° 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

[Figure: per-benchmark results (axis 0 to 900) for espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]

Brainiac vs. Speed Demon (1993)


Summary #1/2

° Reservation stations: renaming to larger set of registers + buffering source operands

• Prevents registers as bottleneck

• Avoids WAR, WAW hazards of Scoreboard

• Allows loop unrolling in HW

° Not limited to basic blocks (integer unit gets ahead, beyond branches)

° Helps cache misses as well

° Lasting Contributions

• Dynamic scheduling

• Register renaming

• Load/store disambiguation

° 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264


° Dynamic hardware schemes can unroll loops dynamically in hardware

° Branch prediction very important to good performance

° Precise exceptions/Speculation: Out-of-order execution, In-order commit (reorder buffer)

° Superscalar and VLIW: CPI < 1 (IPC > 1)

• Dynamic issue vs. Static issue

• More instructions issue at same time => larger hazard penalty

• Limitation is often number of instructions that you can successfully fetch and decode per cycle ⇒ “Flynn barrier”

° SW Pipelining

• Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead

Summary #2/2