Post on 16-Oct-2021
EE 457 Questions and Answers for Special Topics
Out-of-Order Execution, Exception,
Branch Prediction, CMP
Gandhi Puvvada, Weirong Jiang & Tony Toghia, USC 2008
Out of Order (OoO) Execution
Dynamic Scheduling of Instructions (The Tomasulo Algorithm)
[Figure: block diagram, simplified for EE457, provided by Prof. Dubois. Labeled blocks: I-Cache, BPB, I-Fetch Queue, Dispatch, Reg File, Re-order Buffer, TAG FIFO (tags 0 to 63), Integer Queue, Load/Store Queue, Div Queue, Mult Queue, Issue Unit, Exe Units (Integer, Multiplier, Int. Divider), Cache, Add Buff, CDB; grouped into Front-end and Back-end.]
OoO Execution and In-Order Committing with ROB (Re-Order Buffer)
Q#1 What is the important difference between the two block diagrams?
Which one supports precise exceptions?
A#1 ROB is the important difference between the two block diagrams.
The right-side block diagram supports precise exceptions.
Q#2 Choose the right attributes to describe the block diagrams.
1. Left Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.
2. Right Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.
A#2
1. Left Block Diagram (without ROB): In-Order Issue, Out-of-Order Execute, Out-of-Order Complete.
2. Right Block Diagram (with ROB): In-Order Issue, Out-of-Order Execute, In-Order Complete.
Out-of-Order Execution (with ROB)
Q#3 When we refer to an out-of-order processor with ROB, do we mean:
a. instructions are issued out-of-order?
b. instructions start execution out-of-order?
c. instructions finish execution out-of-order?
d. instructions retire out of order?
• A#3: b and c. Instructions are issued and retired in-order, to maintain the functionality of in-order execution. What happens in between, however, namely the start and completion of execution (in the integer and floating-point units), can be done out-of-order.
TAG FIFO (Token FIFO) in the left diagram
Q#4
Q#4.1 Is it necessary to hold the 64 tokens in the 0 to 63 order initially on reset?
Q#4.2 Is FIFO used for convenience or is it necessary that we follow the “First-In-First-Out” order?
Q#4.3 Can the FIFO overflow?
Q#4.4 Can the FIFO become empty?
TAG FIFO (Token FIFO)
A#4
A#4.1 It is not necessary to hold the 64 tokens in the 0 to 63 order initially on reset.
A#4.2 FIFO is used for convenience. It is not necessary that we follow the “First-In-First-Out” order.
A#4.3 The FIFO cannot overflow, as we cannot receive more tokens than we issued.
A#4.4 The FIFO can become empty if the backend capacity exceeds the total number of tokens.
TAGs for destinations or sources or for both? (in ROB-less design)
• A new tag is assigned to the destination register of the instruction being dispatched.
• For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already in RAS) is conveyed to the instruction.
• If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.
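The dispatch rules above can be sketched in a few lines of Python. This is only an illustrative model, not the EE457 hardware: the names `dispatch`, `reg_tag`, and `free_tags` are hypothetical, and the CDB broadcast that later clears a tag is left out.

```python
from collections import deque

def dispatch(instr, reg_file, reg_tag, free_tags):
    """Dispatch one instruction given as (dest, src1, src2).

    Each source receives either a value or the tag of its pending producer;
    the destination then receives a fresh tag. All names are illustrative.
    """
    operands = []
    for src in instr[1:]:
        if src in reg_tag:
            # Source is tagged: its producer has not yet announced a value.
            operands.append(("tag", reg_tag[src]))
        else:
            # Source is untagged: the register file holds the valid value.
            operands.append(("value", reg_file[src]))
    # Sources are read before the destination is tagged, so an instruction
    # such as "add $8, $8, $8" picks up the older tag of $8.
    dest_tag = free_tags.popleft()
    reg_tag[instr[0]] = dest_tag
    return dest_tag, operands

reg_file = {"$2": 100, "$3": 7, "$8": 0}
reg_tag = {}                     # register -> tag of its pending producer
free_tags = deque(range(64))     # 64 tokens, as in the TAG FIFO

first = dispatch(("$8", "$2", "$3"), reg_file, reg_tag, free_tags)
second = dispatch(("$8", "$8", "$8"), reg_file, reg_tag, free_tags)
print(first)    # (0, [('value', 100), ('value', 7)])
print(second)   # (1, [('tag', 0), ('tag', 0)])
```

The second instruction receives tag 0 for both sources and must wait until the producer holding tag 0 announces its value.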
Unique TAG
• Like SSN, we need a unique TAG
• SSNs are reused.
• Similarly TAGs can be reused.
• TAGs are similar to the number TOKENs.
(in ROB-less design)
TAGs (= Tokens)
• How many Tokens should the bank cashier have to start with?
• What happens if the tokens run out?
• Does he need to have any order in holding tokens and issuing tokens?
• Does he have to collect tokens back?
(in ROB-less design)
TAG FIFO (FIFOs are taught in EE560)
• To issue and collect Tokens (TAGs), use a circular FIFO (First-in-First-Out) unit.
• Filled with (say) 64 tokens (in any order) initially on reset.
• Tokens return out of order anyway.
• Put tokens back in stack and issue.
[Figure: circular TAG FIFO holding tokens 0 to 63, with write pointer (wp) and read pointer (rp), shown in three states: Full, 2 tokens issued, and 1 token returned.]
(in ROB-less design)
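A minimal software sketch of such a circular token FIFO, assuming a depth of 64 and a simple occupancy counter; the class and method names (`TagFifo`, `issue`, `give_back`) are made up for illustration and are not taken from the course design.

```python
class TagFifo:
    """Circular FIFO of tags with a read pointer (rp) and a write pointer (wp)."""

    def __init__(self, depth=64):
        self.depth = depth
        self.slots = list(range(depth))   # any initial order would do
        self.rp = 0                       # next slot to issue from
        self.wp = 0                       # next slot to return into
        self.count = depth                # full on reset

    def issue(self):
        if self.count == 0:
            return None                   # empty: dispatch has to stall
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.count -= 1
        return tag

    def give_back(self, tag):
        # Cannot overflow: only tags that were issued can come back.
        assert self.count < self.depth
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.depth
        self.count += 1

fifo = TagFifo()
a, b = fifo.issue(), fifo.issue()   # 2 tokens issued
fifo.give_back(b)                   # 1 token returned, out of order
print(a, b, fifo.count)             # 0 1 63
```

Because returned tags are simply written at `wp`, the order in which tokens come back does not matter, which is why the FIFO discipline is a convenience rather than a requirement.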
• Q#5 What is meant by retirement in an out-of-order processor?
• Q#6 What two conditions are required for retirement?
• A#5: Retirement is the point at which an instruction’s results can be committed (written into the register file or memory) or, if it is a conditional branch or an exception, it can be taken. In short, its execution is ensured and it is no longer speculative. Note: In speculative execution, conditional branches are executed based on prediction, and if it turns out to be a misprediction, wrong-path instructions are flushed.
• A#6: Execution must be completed, and the instruction must be the oldest instruction not yet retired. (It is the oldest instruction in the re-order buffer.)
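The two retirement conditions can be illustrated with a toy re-order buffer. This is a sketch under simplifying assumptions: each entry carries only a hypothetical `name` and a `done` flag, and commit side effects (register or memory writes) are omitted.

```python
from collections import deque

def retire(rob):
    """Retire from the head of the ROB while the oldest entry has completed."""
    retired = []
    while rob and rob[0]["done"]:
        retired.append(rob.popleft()["name"])
    return retired

# The oldest entry is leftmost. i2 and i3 finished early, but i1 has not.
rob = deque([
    {"name": "i1", "done": False},
    {"name": "i2", "done": True},
    {"name": "i3", "done": True},
])
print(retire(rob))       # [] because i1 is the oldest and is not complete
rob[0]["done"] = True    # i1 completes execution
print(retire(rob))       # ['i1', 'i2', 'i3']
```

Completed instructions behind an incomplete head stay in the buffer, which is what keeps retirement in program order.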
• Q#7 __________________ (Architectural / Physical) registers are visible to software (i.e. can be used in instructions)
• Q#8 __________________ (Architectural / Physical) registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.
• A#7 Architectural registers are visible to software (i.e. can be used in instructions).
• A#8 Physical registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.
Limited Architectural Registers, More Physical Registers
Register Renaming
lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);
It is clear that the compiler is using $8 as a temporary register.
If there is a delay in obtaining $2, the first part of the code cannot proceed.
Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.
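To see how renaming removes this name dependency, here is a small sketch that gives every register write a fresh physical name. The `rename` helper and the `p0`, `p1`, ... names are illustrative assumptions; real hardware would use tags or a physical register file with a free list.

```python
def rename(program):
    """Rename registers: every write gets a fresh physical register."""
    mapping = {}       # architectural name -> current physical name
    next_phys = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up before the destination mapping is updated.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = dest
        if dest is not None:
            new_dest = f"p{next_phys}"
            next_phys += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("lw", "$8", ["$2"]), ("add", "$8", ["$8", "$8"]), ("sw", None, ["$8", "$2"]),
    ("lw", "$8", ["$3"]), ("add", "$8", ["$8", "$8"]), ("sw", None, ["$8", "$3"]),
]
renamed = rename(program)
for line in renamed:
    print(line)
# The second half now reads and writes p2 and p3 only, so it no longer
# shares a register name with the first half (p0 and p1).
```

After renaming, a delay in obtaining $2 stalls only the first three instructions.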
Q#9 Register renaming can NOT solve
a. RAW hazards
b. WAR hazards
c. WAW hazards
Note: In a design with ROB, WAW and WAR will never occur as all writes are performed strictly in-order. So answer the above question for the ROB-less design.
• A#9: a. The RAW (Read After Write) hazard is the only hazard which cannot be solved by register renaming.
• For WAW (Write After Write) hazard:
– if the instruction order is that $1 gets written twice, and if the later write (W2) can execute before the first write (W1), then the register renaming mechanism allows the earlier write to be discarded in a ROB-less design.
• For WAR (Write After Read) hazard:
– register renaming allows the older version of the register to be read and held in the Issue Queues, so that the later write can proceed.
• For RAW (Read After Write) hazard:
– a dependent read MUST wait and cannot execute before a write to the same location. (The to-be-written value must be determined before it can be read by a later instruction.) The dependent instruction waits in the Issue Queues for the operand to be broadcast on the CDB.
Q#10 What resource is the major bottleneck of Tomasulo algorithm?
IFQ / Dispatcher / Issue Queues / Execution Units / CDB
A#10 CDB
The issue unit has to throttle issuing instructions to the execution units based on the CDB’s availability. It does not let multiple execution units finish execution at the same time.
• Q#11a Suppose the following lw instruction is in progress and is currently waiting for the cache to respond.
lw $2, 0($4)
Which of the following instructions in the integer issue queue will begin execution the earliest?
#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $6
#1 (oldest) add $1, $2, $3
• A#11a #2. #1 cannot begin execution, because it reads $2, which is still being written by the lw instruction (RAW hazard). Instruction #2 can begin execution. (Note: Register renaming solves the WAR hazard on $4.)
#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $6
#1 (oldest) add $1, $2, $3
• Q#11b Given the same situation (lw $2, 0($4)) as the previous problem, now which of the following instructions in the integer issue queue will begin execution the earliest?
#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $1 (was $6)
#1 (oldest) add $1, $2, $3
• A#11b Instruction #4 is the earliest instruction that does not read a value that is modified by an earlier instruction. (#1 waits for $2 from the lw, #2 waits for $1 from #1, and #3 waits for $4 from #2.)
#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $1 (was $6)
#1 (oldest) add $1, $2, $3
Without or with ROB?
• Q#11c Are your answers to Q#11a and Q#11b for the first design without ROB or the second design with ROB?
• A#11c For both! RAW dependency is the true dependency and every implementation has to honor that dependency.
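The reasoning in A#11a and A#11b amounts to following RAW chains through the issue queue. The sketch below is a hedged illustration: the function `earliest_ready` and the tuple encoding of instructions are assumptions made here, and name dependencies (WAR, WAW) are ignored on the premise that renaming has removed them.

```python
def earliest_ready(queue, pending):
    """Return the number of the oldest queued instruction with no RAW dependency.

    queue: list of (number, dest, sources), oldest first.
    pending: registers whose values are still being produced.
    """
    pending = set(pending)        # copy, so the caller's set is not modified
    for number, dest, sources in queue:
        if any(src in pending for src in sources):
            # RAW: this instruction waits, so its own result is pending too.
            pending.add(dest)
        else:
            return number
    return None

lw_dest = {"$2"}   # lw $2, 0($4) is still waiting for the cache
q11a = [(1, "$1", ["$2", "$3"]), (2, "$4", ["$4", "$6"]),
        (3, "$5", ["$3", "$4"]), (4, "$6", ["$7", "$8"])]
q11b = [(1, "$1", ["$2", "$3"]), (2, "$4", ["$4", "$1"]),
        (3, "$5", ["$3", "$4"]), (4, "$6", ["$7", "$8"])]
print(earliest_ready(q11a, lw_dest))   # 2
print(earliest_ready(q11b, lw_dest))   # 4
```

The same function gives the same answers with or without a ROB, since it models only true dependencies.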
Q#12 ROB is the important difference between the two block diagrams.
Compare and contrast
A#12 Compare and contrast
1. Without ROB: TAG FIFO provides unique TAGs.
   With ROB: ROB location IDs are TAGs.
2. Without ROB: Register Status Table specifies if a register is obsolete.
   With ROB: ROB needs to be searched associatively to find the latest register content.
3. Without ROB: Allows out-of-order completion.
   With ROB: Enforces in-order-only completion.
4. Without ROB: Cannot support exceptions.
   With ROB: Can support exceptions.
5. Without ROB: Cannot support speculative execution.
   With ROB: Can support speculative execution.
6. Without ROB: No speculation, no BPB.
   With ROB: Has BPB to aid in branch prediction.
7. Without ROB: No good for real implementation.
   With ROB: Good for real implementation.
8. Without ROB: Writes are out of order. Hence dispatch is suspended after dispatching a conditional branch, until the branch is resolved.
   With ROB: Writes are in-order. Dispatch continues based on prediction. Design provides for flushing wrong-path execution.
9. Without ROB: Stores write to cache when they come out of LSQ (load/store queue).
   With ROB: Stores write to cache when they reach the top of ROB.
10. Without ROB: Memory disambiguation rules are stricter.
    With ROB: Since WAW and WAR are not present, rules are simpler.
11. Without ROB: Only RAR is irrelevant. So two loads from the same address can execute in any order. The rest of the loads and stores with matching addresses have to go in-order.
    With ROB: Only RAW needs to be looked at. Loads read cache before going into ROB. Hence, loads have to wait until senior stores with matching addresses finish.
12. Without ROB: Suppose a senior load is yet to calculate its memory address. A junior load (but not store) can leave LSQ. (No RAR, but WAR.) Suppose a senior store is yet to calculate its memory address. A junior load/store cannot leave. (RAW, WAW.)
    With ROB: Stores leave a copy of their address in Address Buffer near LSQ, so that junior loads can figure out (without looking up the ROB) if they can read cache. It means junior stores, with a senior load yet to calculate its address, cannot leave LSQ. It also means junior stores with an address matching a senior load should not leave LSQ. Or they can leave if senior loads with matching address make a note of this.
Exceptions
• Q#1 What is the definition of an exception?
• Q#2 What is the difference between asynchronous and synchronous exceptions? Give two examples of each.
• Q#3 Precise exceptions are _______________ (synchronous, asynchronous ) and the excepting instruction _________ (must be/does not need to be) re-executed .
• A#1: Exceptions are very rare events forcing a transfer of program control to a software handler.
• A#2: Synchronous exceptions are triggered by specific instructions (e.g. Divide by zero, illegal instruction, page fault, etc.). Asynchronous exceptions include the hardware interrupts and are not tied to a specific executing instruction (e.g. keyboard interrupt, real-time clock, power failure)
• A#3: Precise exceptions are synchronous, and the excepting instruction must be re-executed (e.g. in the case of a page fault, ....).
Q#4
• Interrupts are ___________ (Asynchronous/Synchronous) to program execution.
• Traps are ___________ (Asynchronous/Synchronous) to program execution.
A#4
• Interrupts are Asynchronous to program execution. Example: Keyboard interrupt.
• Traps are Synchronous to program execution. Example: addition overflow trap.
Q#5
• Match the exceptions with the 5 pipeline stages (IF, ID, EX, MEM, WB):
Page Fault
Integer Overflow
Undefined Opcode
Memory Protection Violation
A#5
• Match the exceptions with the 5 pipeline stages (IF, ID, EX, MEM, WB):
Page Fault: IF, MEM
Integer Overflow: EX
Undefined Opcode: ID
Memory Protection Violation: IF, MEM
Q#6 For precise exceptions, the exceptions should be taken in
a. process order
b. temporal order
• A#6: Process order. Exceptions on earlier instructions must be handled before exceptions due to later instructions, regardless of when they are detected.
Q#7
• For precise exceptions in the 5-stage pipeline, an exception should be taken in which stage? Why?
• A#7: WB Stage. This is to ensure that no earlier instruction in program order triggers an exception.
As discussed in our class, an exception can instead be taken in the MEM stage, as the instruction in the WB stage would not cause a new exception.
Q#8
• What are the functions of the Cause Register and Exception PC (EPC)?
• A#8: Cause register records what type of exception occurred, and the EPC tells the exception handler on which instruction the exception occurred.
Q#9 What are the requirements of precise exception handling in a pipelined processor?
A#9:
• All preceding instructions in process order must complete.
• All instructions following the faulting instruction, plus the faulting instruction itself, must be squashed.
• The execution of the handler must be started.
• Q#10
First run (before first exception handled)
Second run (after page fault handled)
A#10: First run (before first exception handled)
Stages: IF, ID, EX, MEM, WB
Cycle 1: IF = SW; ID = Illegal (exception detected); EX = ADD; MEM = LW (exception detected)
Cycle 2: IF = start of exception handler; ID = NOP; EX = NOP; MEM = NOP; WB = NOP (exception)
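A#10: Second run (after page fault handled)
Stages: IF, ID, EX, MEM, WB
Cycle 1: IF = SW; ID = Illegal (exception detected); EX = ADD; MEM = LW
Cycle 2: IF = NOP; ID = NOP; EX = NOP (exception); MEM = ADD; WB = LW
Cycle 3: IF = NOP; ID = NOP; EX = NOP; MEM = NOP (exception); WB = ADD
Cycle 4: IF = start of exception handler; ID = NOP; EX = NOP; MEM = NOP; WB = NOP (exception)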
Branch Prediction
Q#1 Which types of branches need prediction?
a. Indirect branch due to return from function call
b. Conditional branch
c. Unconditional branch
A#1: b. Conditional branch (for direction prediction).
Q#2 Given a simple 1-bit (2-state) pattern history predictor, and assuming the initial branch is predicted not taken, what is the misprediction rate for the following loop? (Assume there are no other branches in the loop.)
for (i=0; i<4; i++)
The misprediction rate (increases/decreases/stays the same) if the loop is re-executed.
[Figure: the branch PC indexes the Branch Prediction Buffer; each entry is a 1-bit state, N or T.]
A#2 The predictor will predict the 1st branch not taken, and it will predict the 2nd, 3rd, 4th, and 5th branches taken. The 1st and last predictions will be incorrect. So, the misprediction rate is 2/5 = 40%.
The misprediction rate stays the same for all subsequent runs of the loop.
for (i=0; i<4; i++)
i     0 1 2 3 4
Pred  N T T T T
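The 40% figure can be checked with a short simulation of a 1-bit predictor. This is a sketch under the assumptions of Q#2: a single branch, outcomes T T T T N per run of the loop, and an initial prediction of not taken. The function name `run_one_bit` is illustrative.

```python
def run_one_bit(outcomes, state="N"):
    """Simulate a 1-bit predictor; return (predictions, mispredictions, final state)."""
    predictions = []
    misses = 0
    for actual in outcomes:
        predictions.append(state)   # predict whatever happened last time
        if state != actual:
            misses += 1
        state = actual              # remember the latest outcome
    return predictions, misses, state

loop = list("TTTTN")                # i = 0..3 taken, i = 4 exits the loop
preds1, misses1, state = run_one_bit(loop, "N")
preds2, misses2, state = run_one_bit(loop, state)
print("".join(preds1), misses1 / len(loop))   # NTTTT 0.4
print("".join(preds2), misses2 / len(loop))   # NTTTT 0.4
```

Since each run ends on a not-taken outcome, the next run starts in state N again, which is why the rate does not improve.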
Examples
How often is branch outcome != previous outcome? (Each T to N and N to T transition counts.)
DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT … (100,000 iterations): 2 / 100,000, prediction rate 99.998%
DC44: TTTTT ... TNTTTTT … TNTTTTT …: 2 / 100, prediction rate 98.0%
DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT …: 2 / 2, prediction rate 0.0%
© Murali Annavaram, Gabe Loh & Gary Tyson, All rights reserved
Brandon Franzke, USC 2006
Use two bit history
• 2-bit history
– Start as strongly not taken
– Update BPB after every branch execution
[Figure: the branch PC indexes the Branch Prediction Buffer; each entry is one of four states: SN, N, T, ST.]
TWO-BIT PREDICTOR
2-BIT UP-DOWN SATURATING COUNTER IN EACH ENTRY OF THE BPB
TAKEN ==> ADD 1; UNTAKEN ==> SUBTRACT 1
NOW IT TAKES 2 MISPREDICTIONS IN A ROW TO CHANGE THE PREDICTION
FOR THE NESTED LOOP, THE MISPREDICTION AT ENTRY IS AVOIDED
COULD HAVE MORE THAN 2-BITS, BUT TWO BITS COVER MOST PATTERNS (LOOPS)
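[Figure: state diagram of the 2-bit counter. States: 00 (SN, Strongly Not Taken, predict U), 01 (N, Not Taken, predict U), 10 (T, Taken, predict T), 11 (ST, Strongly Taken, predict T). A taken branch (T) moves one state toward 11; an untaken branch (U) moves one state toward 00.]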
EE557 Michel Dubois USC 2007
• Q#3 Show the states and predictions for 2 runs of the loop shown in Q#2 using the 2-bit pattern history predictor (states: SN, N, T, ST).
First run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       SN
Prediction  N
Second run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State
Prediction
• A#3 The 2-bit predictor works better than the 1-bit predictor after the initial training period. We can improve the initial training period by starting in the state.
First run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       SN  N   T   ST  ST
Prediction  N   N   T   T   T
Second run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       T   ST  ST  ST  ST
Prediction  T   T   T   T   T
State after the second run: T
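The state and prediction rows of A#3 can be reproduced with a 2-bit up-down saturating counter. The sketch assumes the encoding SN = 0, N = 1, T = 2, ST = 3 and the starting state SN; the function name `run_two_bit` is illustrative.

```python
NAMES = ["SN", "N", "T", "ST"]      # counter values 0..3

def run_two_bit(outcomes, counter=0):
    """Simulate one branch under a 2-bit saturating counter."""
    states, predictions = [], []
    for actual in outcomes:
        states.append(NAMES[counter])
        predictions.append("T" if counter >= 2 else "N")
        if actual == "T":
            counter = min(counter + 1, 3)   # taken: add 1, saturate at ST
        else:
            counter = max(counter - 1, 0)   # untaken: subtract 1, saturate at SN
    return states, predictions, counter

loop = list("TTTTN")
states1, preds1, counter = run_two_bit(loop, 0)
states2, preds2, counter = run_two_bit(loop, counter)
print(states1, preds1)   # ['SN', 'N', 'T', 'ST', 'ST'] ['N', 'N', 'T', 'T', 'T']
print(states2, preds2)   # ['T', 'ST', 'ST', 'ST', 'ST'] ['T', 'T', 'T', 'T', 'T']
print(NAMES[counter])    # T
```

The first run has three mispredictions (iterations 0, 1, and 4) while the second has only one (iteration 4), which is the improvement over the 1-bit predictor after training.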
Q#4 (Global / Local) predictors make use of the PC, while (global / local) predictors do not.
A#4 Local (also known as per-address) predictors make use of the PC to distinguish between different branch instructions. Global predictors do not.
Correlating Branches
(2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction
[Figure: the branch address selects a set of 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction.]
CS252 UC Berkeley David A. Patterson
• Q#5 Two-Level Prediction:
• Given the following branch history / pattern history predictor:
– 2-bit global branch history register (Shift-Left)
– 3 bits of PC used to access pattern history table.
– All predictors are 2-bit predictors.
– Instruction width = 32 bits
– Assume the next branch instruction is at PC = 8004, and it will be taken eventually.
• On the following page:
– Provide the bits of the PC used by the predictor.
– Indicate if the prediction is taken/not taken.
– Show any changes to the branch history register and pattern history table after the branch taken outcome info is provided.
BHR: 0 1
PC bits used: A__ - A__
Pattern History Table (rows: PC bits 000 to 111; columns: BHR value)
PC bits   00  01  10  11
000       00  10  11  10
001       11  10  01  01
010       01  01  01  11
011       00  01  00  10
100       00  10  11  10
101       11  10  01  01
110       01  01  01  11
111       00  01  00  10
BHR: 0 1
PC bits used: A4 - A2 = 001
(Pattern History Table as given in Q#5.)
A#5: 8004H => PC bits A4 - A2 = 001; with BHR = 01 the selected entry is 10 => Predict T (Taken)
This branch is eventually taken, as predicted. Hence:
• Branch History Register shifts left from 01 to 11 (shift in a 1).
• The selected pattern entry changes from state 10 to state 11 (refer to the 2-bit predictor state diagram).
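The lookup and update in A#5 can be traced in code. This is a sketch with stated assumptions: the table values are as read from the Q#5 figure, with rows indexed by PC bits A4..A2 and columns by BHR value in the order 00, 01, 10, 11 (the column order is inferred, not stated on the slide), and the function name `predict_and_update` is hypothetical.

```python
# Pattern History Table: rows are PC bits 000..111, columns are BHR 00, 01, 10, 11.
# The column order is an assumption made for this illustration.
PHT = [
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
]

def predict_and_update(pc, bhr, taken, pht):
    """Look up one branch, then update the selected counter and the BHR."""
    row = (pc >> 2) & 0b111          # A1..A0 are 0 for 32-bit instructions
    counter = pht[row][bhr]
    predicted_taken = counter >= 2   # states 10 and 11 predict taken
    pht[row][bhr] = min(counter + 1, 3) if taken else max(counter - 1, 0)
    new_bhr = ((bhr << 1) | int(taken)) & 0b11   # shift left, shift in outcome
    return row, predicted_taken, new_bhr

row, predicted_taken, bhr = predict_and_update(0x8004, 0b01, True, PHT)
print(format(row, "03b"), predicted_taken, format(bhr, "02b"), format(PHT[row][0b01], "02b"))
# 001 True 11 11
```

Only the one selected entry changes; the other three counters in row 001 are untouched.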
Q#6 Is the following statement true or false? Explain.
“A predictor with more bits can always achieve a better performance”
A#6: False. More bits often just increase training time, which reduces the accuracy for shorter loops. Also, more bits mean more hysteresis, which in turn means “refusing” to “adopt” or “change”.
Q#7 With a branch target buffer, the address of the next instruction can be predicted while the branch is in _____ (IF/ID/EX/MEM/WB) stage.
A#7: IF Stage. The branch target buffer compares the PC against the known predicted taken branches and supplies the next address. Since only the PCs are being compared, the instruction does not have to be decoded. For accurately predicted branches, this results in zero clock penalty.
CMP
Q#1 Uniprocessor pipelines (with no multithreading) are constrained by ___________ level parallelism
Q#2 Dynamic power considerations favor ____ (Uniprocessor / Parallel Processor)
A#1 Uniprocessor pipelines (with no multithreading) are constrained by instruction level parallelism (ILP)
A#2 Dynamic power considerations favor the Parallel Processor
Q#3a Which types of processor multithreading need context switch through Process Control Block?
a. Software multithreading
b. Hardware multithreading
Q#3b Which has high overhead of context switching?
a. Software multithreading
b. Hardware multithreading
A#3a: a. Software multithreading
A#3b: a. Software multithreading
Q#4 Does Niagara have the cache coherence issue? If Yes, in which level of cache?
A#4: Yes, in L1 cache since it’s not shared.
Q#5a Is L1 cache shared across cores?
Q#5b Is L1 cache shared (used) by the different threads running on a single core?
A#5a: No.
A#5b: Yes.
• Q#6 Uniprocessors place greater burden on (hardware / software) designers, while parallel processors place greater burden on (hardware / software) designers.
• A#6 Uniprocessors place greater burden on hardware designers, while parallel processors place greater burden on software designers.