Out-of-Order Execution, Exception, Branch Prediction, CMP

EE 457 Questions and Answers for Special Topics

Gandhi Puvvada, Weirong Jiang & Tony Toghia, USC 2008

Out of Order (OoO) Execution: Dynamic Scheduling of Instructions (The Tomasulo Algorithm)

[Figure: two block diagrams, simplified for EE457; block diagram provided by Prof. Dubois. Left diagram labels: Issue Unit, TAG FIFO, Integer, Multiplier. Right diagram, titled "OoO Execution and In-Order Committing with ROB (Re-Order Buffer)", labels: Front-end, Back-end, I-Cache, I-Fetch Queue, Dispatch, Integer Queue, Load/Store Queue, Mult Queue, Issue Unit, Exe Units, Cache, Add Buff, Re-order Buffer, Reg File.]

Q#1 What is the important difference between the two block diagrams? Which one supports precise exceptions?

A#1 The ROB is the important difference between the two block diagrams. The right-side block diagram supports precise exceptions.

Q#2 Choose the right attributes to describe the block diagrams.

1. Left Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.

2. Right Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.

A#2 Choose the right attributes to describe the block diagrams.

1. Left Block Diagram: In-Order Issue, Out of Order Execute, Out of Order Complete.

2. Right Block Diagram: In-Order Issue, Out of Order Execute, In-Order Complete.

Out-of-Order Execution (with ROB)

Q#3 When we refer to an out-of-order processor with ROB, do we mean:
a. instructions are issued out-of-order?
b. instructions start execution out-of-order?
c. instructions finish execution out-of-order?
d. instructions retire out of order?

• A#3: b and c. Instructions are issued and retired in-order, to maintain the functionality of in-order execution. What happens in between, however (the start and completion of execution in the integer and floating point units), can be done out-of-order.

TAG FIFO (Token FIFO) in the left diagram

Q#4
Q#4.1 Is it necessary to hold the 64 tokens in the 0 to 63 order initially on reset?
Q#4.2 Is the FIFO used for convenience, or is it necessary that we follow the “First-In-First-Out” order?
Q#4.3 Can the FIFO overflow?
Q#4.4 Can the FIFO become empty?

A#4
A#4.1 It is not necessary to hold the 64 tokens in the 0 to 63 order initially on reset.

A#4.2 The FIFO is used for convenience. It is not necessary that we follow the “First-In-First-Out” order.

A#4.3 The FIFO cannot overflow, as we cannot receive more tokens than we issued.

A#4.4 The FIFO can become empty if the backend capacity exceeds the total number of tokens.

TAGs for destinations or sources or for both? (in ROB-less design)

• A new tag is assigned to the destination register of the instruction being dispatched.

• For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already in RAS) is conveyed to the instruction.

• If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.
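The three bullets above can be sketched in a few lines of Python. This is only an illustrative model, not the EE457 hardware: the function name `dispatch`, the dictionary-based register status table `reg_tags`, and the shape of the returned issue-queue entry are all assumptions made for this example.

```python
from collections import deque

def dispatch(instr, reg_values, reg_tags, free_tags):
    """Build a (hypothetical) issue-queue entry for one dispatched instruction.

    instr:      (dest, src1, ...) register names
    reg_values: register name -> value currently in the register file
    reg_tags:   register name -> tag of the in-flight instruction that will write it
    free_tags:  deque of unused tags (plays the role of the TAG FIFO)
    """
    dest, *sources = instr
    operands = []
    for src in sources:
        if src in reg_tags:
            # Source was tagged by an earlier instruction: convey the tag and wait.
            operands.append(("tag", reg_tags[src]))
        else:
            # Source is not tagged: convey the value from the register file.
            operands.append(("value", reg_values[src]))
    # Sources are looked up BEFORE the destination is re-tagged, so an
    # instruction such as add $8, $8, $8 waits on the older version of $8.
    new_tag = free_tags.popleft()
    reg_tags[dest] = new_tag
    return {"dest_tag": new_tag, "operands": operands}

reg_values = {"$2": 100, "$3": 7, "$8": 0}
reg_tags = {}
free_tags = deque(range(64))

lw_entry = dispatch(("$8", "$2"), reg_values, reg_tags, free_tags)         # lw  $8, 40($2)
add_entry = dispatch(("$8", "$8", "$8"), reg_values, reg_tags, free_tags)  # add $8, $8, $8
print(lw_entry)
print(add_entry)
```

In this sketch the `add` receives the tag of the `lw` for both of its sources, so it would wait for that tag to appear on the CDB.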

Unique TAG

• Like SSN, we need a unique TAG

• SSNs are reused.

• Similarly TAGs can be reused.

• TAGs are similar to numbered TOKENs.

(in ROB-less design)

TAGs (= Tokens)

• How many Tokens should the bank cashier have to start with?

• What happens if the tokens run out?

• Does he need to have any order in holding tokens and issuing tokens?

• Does he have to collect tokens back?

(in ROB-less design)

TAG FIFO (FIFOs are taught in EE560)

• To issue and collect Tokens (TAGs), use a circular FIFO (First-in-First-Out) unit.

• Filled with (say) 64 tokens (in any order) initially on reset.

• Tokens return out of order anyway.

• Put tokens back in stack and issue.

2 tokens issued

1 token returned

(in ROB-less design)
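A minimal Python sketch of the token pool described above, under the assumption that a plain queue of integers is an adequate stand-in for the hardware FIFO. The class name `TagFifo` and its methods are hypothetical names chosen for this example.

```python
from collections import deque

class TagFifo:
    """Pool of unique tags, issued at dispatch and returned when results are announced."""

    def __init__(self, num_tags=64):
        self.num_tags = num_tags
        # The initial order does not matter; 0..num_tags-1 is just convenient.
        self.free = deque(range(num_tags))

    def issue(self):
        """Hand out a tag, or None when the FIFO is empty (dispatch must stall)."""
        if not self.free:
            return None
        return self.free.popleft()

    def give_back(self, tag):
        """Return a tag. Overflow is impossible: only issued tags ever come back."""
        assert len(self.free) < self.num_tags
        self.free.append(tag)

fifo = TagFifo(num_tags=4)
first = fifo.issue()       # 2 tokens issued
second = fifo.issue()
fifo.give_back(second)     # 1 token returned (out of order)
remaining = [fifo.issue() for _ in range(3)]
stalled = fifo.issue()     # nothing left until more tokens return
print(first, second, remaining, stalled)
```

The returned tag simply joins the back of the queue, which illustrates why neither the initial order nor strict first-in-first-out order is essential.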

• Q#5 What is meant by retirement in an out-of-order processor?

• Q#6 What two conditions are required for retirement?

• A#5: Retirement is the point at which an instruction’s results can be committed (can be written into the register file or memory) or, if it is a conditional branch or an exception, it can be taken. In short, its execution is ensured and it is no longer speculative. Note: In speculative execution, conditional branches are executed based on prediction, and if it turns out to be a misprediction, wrong-path instructions are flushed.

• A#6: Execution must be completed, and the instruction must be the oldest instruction not yet retired. (It is the oldest instruction in the re-order buffer.)

• Q#7 __________________ (Architectural / Physical) registers are visible to software (i.e. can be used in instructions)

• Q#8 __________________ (Architectural / Physical) registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.

• A#7 Architectural registers are visible to software (i.e. can be used in instructions).

• A#8 Physical registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.

Limited Architectural Registers, More Physical Registers

Register Renaming

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);

It is clear that the compiler is using $8 as a temporary register.

If there is a delay in obtaining $2, the first part of the code cannot proceed.

Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.
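To illustrate how renaming removes this name dependency, here is a small Python sketch. It is a simplified model, not the course design: the function `rename`, the tuple encoding of instructions, and the physical register names `p0`, `p1`, ... are assumptions made for this example.

```python
from itertools import count

def rename(program):
    """Give every register write a fresh physical register.

    program: list of (op, dest, sources); dest is None for sw.
    Sources read the newest mapping of a register, or keep the
    architectural name if the register has not been renamed yet.
    """
    mapping = {}
    fresh = count()
    renamed = []
    for op, dest, sources in program:
        new_sources = [mapping.get(s, s) for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = f"p{next(fresh)}"
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

program = [
    ("lw",  "$8", ["$2"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$2"]),
    ("lw",  "$8", ["$3"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$3"]),
]

renamed = rename(program)
for entry in renamed:
    print(entry)
```

After renaming, the first group uses only `p0` and `p1` while the second group uses only `p2` and `p3`, so in this model the second group no longer has to wait for the first one just because both were written with $8.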

Q#9 Register renaming can NOT solve
a. RAW hazards
b. WAR hazards
c. WAW hazards

Note: In a design with ROB, WAW and WAR will never occur as all writes are performed strictly in-order. So answer the above question for the ROB-less design.

• A#9: a. The RAW (Read After Write) hazard is the only hazard that cannot be solved by register renaming.

• For a WAW (Write After Write) hazard:
– if the instruction order is that $1 gets written twice, and the later write (W2) can execute before the first write (W1), then the register renaming mechanism allows the earlier write to be discarded in a ROB-less design.

• For a WAR (Write After Read) hazard:
– register renaming allows the older version of the register to be read and held in the Issue Queues, so that the later write can proceed.

• For a RAW (Read After Write) hazard:
– a dependent read MUST wait and cannot execute before a write to the same location. (The to-be-written value must be determined before it can be read by a later instruction.) The dependent instruction waits in the Issue Queues for the operand to be broadcast on the CDB.

Q#10 What resource is the major bottleneck of Tomasulo algorithm?

IFQ / Dispatcher / Issue Queues / Execution Units / CDB

A#10 The CDB (Common Data Bus). The issue unit has to throttle issuing instructions to the execution units based on the CDB’s availability. It does not let multiple execution units finish execution at the same time.

• Q#11a Suppose the following lw instruction is in progress and is currently waiting for the cache to respond: lw $2, 0($4). Which of the following instructions in the integer issue queue will begin execution the earliest?

#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $6
#1 (oldest) add $1, $2, $3

• A#11a Instruction #2. Instruction #1 cannot begin execution because it reads $2, which is still being written by the lw instruction (RAW hazard). Instruction #2 can begin execution. (Note: Register renaming solves the WAR hazard on $4.)
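The reasoning in A#11a can be checked with a short Python sketch. It models only RAW dependences (name dependences are assumed to be removed by renaming); the function name `earliest_ready` and the tuple encoding of the queue are assumptions made for this example.

```python
def earliest_ready(queue, pending_dests):
    """Return the label of the oldest instruction with no outstanding RAW dependence.

    queue:         list of (label, dest, sources), oldest first
    pending_dests: registers whose values have not been produced yet
    """
    pending = set(pending_dests)
    for label, dest, sources in queue:
        if not pending.intersection(sources):
            return label
        # This instruction must wait, so its own result is pending as well.
        pending.add(dest)
    return None

# Integer issue queue from Q#11a, oldest first, operands exactly as listed there.
queue_11a = [
    ("#1", "$1", ["$2", "$3"]),   # add  $1, $2, $3
    ("#2", "$4", ["$4", "$6"]),   # sub  $4, $4, $6
    ("#3", "$5", ["$3", "$4"]),   # addi $5, $3, $4
    ("#4", "$6", ["$7", "$8"]),   # subi $6, $7, $8
]

# lw $2, 0($4) is still waiting on the cache, so $2 is pending.
answer = earliest_ready(queue_11a, pending_dests={"$2"})
print(answer)
```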

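• Q#11b Given the same situation (lw $2, 0($4)) as the previous problem, now which of the following instructions in the integer issue queue will begin execution the earliest?

[Figure: the instruction list of Q#11a with one register operand changed (annotated "Was $6"); #1 (oldest) is still add $1, $2, $3]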

• A#11b Instruction #4 is the earliest instruction that does not read a value that is modified by an earlier instruction.

Without or with ROB?

• Q#11c Are your answers to Q#11a and Q#11b for the first design without ROB or the second design with ROB?

• A#11c For both! RAW dependency is the true dependency and every implementation has to honor that dependency.

Q#12 ROB is the important difference between the two block diagrams. Compare and contrast.

A#12 Compare and contrast: Without ROB / With ROB

1. Without ROB: TAG FIFO provides unique TAGs.
   With ROB: ROB location IDs are TAGs.

2. Without ROB: Register Status Table specifies if a register is obsolete.
   With ROB: ROB needs to be searched associatively to find the latest register content.

3. Without ROB: Allows out-of-order completion.
   With ROB: Enforces in-order-only completion.

4. Without ROB: Cannot support exceptions.
   With ROB: Can support exceptions.

5. Without ROB: Cannot support speculative execution.
   With ROB: Can support speculative execution.

6. Without ROB: No speculation, no BPB.
   With ROB: Has BPB to aid in branch prediction.

7. Without ROB: No good for real implementation.
   With ROB: Good for real implementation.

8. Without ROB: Writes are out of order. Hence dispatch is suspended after dispatching a conditional branch, until the branch is resolved.
   With ROB: Writes are in-order. Dispatch continues based on prediction. Design provides for flushing wrong-path execution.

9. Without ROB: Stores write to cache when they come out of the LSQ (load/store queue).
   With ROB: Stores write to cache when they reach the top of the ROB.

10. Without ROB: Memory disambiguation rules are stricter.
    With ROB: Since WAW and WAR are not present, rules are simpler.

11. Without ROB: Only RAR is irrelevant. So two loads from the same address can execute in any order. The rest of the loads and stores with matching addresses have to go in-order.
    With ROB: Only RAW needs to be looked at. Loads read cache before going into the ROB. Hence, loads have to wait until senior stores with matching addresses finish.

12. Without ROB: Suppose a senior load is yet to calculate its memory address. A junior load (but not store) can leave the LSQ. (No RAR, but WAR.) Suppose a senior store is yet to calculate its memory address. A junior load/store cannot leave. (RAW, WAW)
    With ROB: Stores leave a copy of their address in the Address Buffer near the LSQ, so that junior loads can figure out (without looking up the ROB) if they can read cache. It means junior stores, with a senior load yet to calculate its address, cannot leave the LSQ. It means junior stores with an address matching that of a senior load should not leave the LSQ. Or they can leave if senior loads with matching addresses make a note of this.
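The ROB-less memory disambiguation rules in items 11 and 12 can be written as a small predicate. This Python sketch is an interpretation of those two items only, under the assumption that the junior access already knows its own address; the function name `can_leave_lsq` and the dictionary encoding of LSQ entries are hypothetical.

```python
def can_leave_lsq(junior, seniors):
    """Decide whether a junior load/store may leave the LSQ in the ROB-less design.

    junior:  {"kind": "load" or "store", "addr": int}
    seniors: older entries still in the LSQ; "addr" is None if not yet calculated
    """
    for senior in seniors:
        if junior["kind"] == "load" and senior["kind"] == "load":
            continue  # RAR is irrelevant: two loads may go in any order
        # RAW, WAR or WAW is possible: wait on an unknown or matching address.
        if senior["addr"] is None or senior["addr"] == junior["addr"]:
            return False
    return True

senior_load_unknown = [{"kind": "load", "addr": None}]
senior_store_unknown = [{"kind": "store", "addr": None}]

print(can_leave_lsq({"kind": "load", "addr": 0x40}, senior_load_unknown))    # junior load
print(can_leave_lsq({"kind": "store", "addr": 0x40}, senior_load_unknown))   # junior store
print(can_leave_lsq({"kind": "load", "addr": 0x40}, senior_store_unknown))   # junior load
```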

Exceptions

• Q#1 What is the definition of an exception?

• Q#2 What is the difference between asynchronous and synchronous exceptions? Give two examples of each.

• Q#3 Precise exceptions are _______________ (synchronous, asynchronous ) and the excepting instruction _________ (must be/does not need to be) re-executed .

• A#1: Exceptions are very rare events forcing a transfer of program control to a software handler.

• A#2: Synchronous exceptions are triggered by specific instructions (e.g. Divide by zero, illegal instruction, page fault, etc.). Asynchronous exceptions include the hardware interrupts and are not tied to a specific executing instruction (e.g. keyboard interrupt, real-time clock, power failure)

• A#3: Precise exceptions are synchronous and the excepting instruction must be re-executed (e.g. in the case of a page fault).

Q#4
• Interrupts are ___________ (Asynchronous/Synchronous) to program execution.

• Traps are ___________ (Asynchronous/Synchronous) to program execution.

A#4
• Interrupts are Asynchronous to program execution. Example: Keyboard interrupt.

• Traps are Synchronous to program execution. Example: addition overflow trap.

Q#5
• Match the exceptions with the 5 pipeline stages

                              IF   ID   EX   MEM  WB
Page Fault
Integer Overflow
Undefined Opcode
Memory Protection Violation

A#5
• Match the exceptions with the 5 pipeline stages

                              IF   ID   EX   MEM  WB
Page Fault                    X              X
Integer Overflow                        X
Undefined Opcode                   X
Memory Protection Violation

Q#6 For precise exceptions, the exceptions should be taken in
a. process order
b. temporal order

• A#6: Process order. Exceptions on earlier instructions must be handled before exceptions due to later instructions, regardless of when they are detected.

Q#7
• For precise exceptions in the 5-stage pipeline, an exception should be taken in which stage? Why?

• A#7: WB Stage. This is to ensure that no earlier instruction in program order triggers an exception.

As discussed in our class, an exception can also be taken in the MEM stage (instead of the WB stage), as the instruction in the WB stage would not cause a new exception.

Q#8
• What are the functions of the Cause Register and Exception PC (EPC)?

• A#8: Cause register records what type of exception occurred, and the EPC tells the exception handler on which instruction the exception occurred.

Q#9 What are the requirements of precise exception handling in a pipelined processor?

A#9:
• All preceding instructions in process order must complete.
• All instructions following the faulting instruction, plus the faulting instruction itself, must be squashed.
• The execution of the handler must be started.
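As a rough illustration of process order, the following Python sketch picks which pending exception must be handled first when several are detected in the same cycle. It only decides which exception wins, not the cycle in which it is taken. The function name, the dictionary encoding of the pipeline, and the example contents (an SW, an undefined opcode, an ADD and a page-faulting LW) are assumptions made for this example.

```python
# In a 5-stage pipeline the instruction furthest along is the oldest in program order.
STAGES_OLDEST_FIRST = ["WB", "MEM", "EX", "ID", "IF"]

def oldest_pending_exception(pipeline):
    """Return (stage, instruction, exception) for the oldest excepting instruction.

    pipeline: stage name -> (instruction, exception or None)
    """
    for stage in STAGES_OLDEST_FIRST:
        instr, exc = pipeline.get(stage, (None, None))
        if exc is not None:
            return stage, instr, exc
    return None

first_run = {
    "IF": ("SW", None),
    "ID": ("Illegal", "undefined opcode"),
    "EX": ("ADD", None),
    "MEM": ("LW", "page fault"),
}
# Same pipeline contents after the page fault has been handled.
second_run = dict(first_run, MEM=("LW", None))

print(oldest_pending_exception(first_run))
print(oldest_pending_exception(second_run))
```

Although the undefined opcode is detected in the same cycle, the LW is older in process order, so in this model its page fault is selected first.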

• Q#10

First run (before first exception handled)

Second run (after page fault handled)

A#10: First run (before first exception handled)

Cycle 1: IF = SW; ID = Illegal (Exception Detected); EX = ADD; MEM = LW (Exception Detected)
Cycle 2: IF = Start of Exception Handler; ID = NOP; EX = NOP; MEM = NOP; WB = NOP (Exception)

A#10: Second run (after page fault handled)

Cycle 1: IF = SW; ID = Illegal (Exception Detected); EX = ADD; MEM = LW
Cycle 2: IF = NOP; ID = NOP; EX = NOP (Exception); MEM = ADD; WB = LW
Cycle 3:
Cycle 4: Start of Exception Handler

Branch Prediction

Q#1 Which types of branches need prediction (direction prediction)?
a. Indirect branch due to return from function call
b. Conditional branch
c. Unconditional branch

A#1: Conditional branch

Q#2 Given a simple 1-bit (2-state) pattern history predictor, and assuming the initial branch is predicted not taken, what is the misprediction rate for the following loop? (Assume there are no other branches in the loop.)

for (i=0; i<4; i++)

The misprediction rate (increases/decreases/stays the same) if the loop is re-executed.

[Figure: branch PC indexing the Branch Prediction Buffer]

A#2 The predictor will predict the 1st branch not taken, and it will predict the 2nd, 3rd, 4th, and 5th branches taken. The 1st and last predictions will be incorrect. So, the misprediction rate is 2/5 = 40%.

for (i=0; i<4; i++)

i     0 1 2 3 4
Pred  N T T T T

The misprediction rate stays the same for all subsequent runs of the loop.
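The 40% figure can be reproduced with a few lines of Python. This is a sketch under the same assumptions as A#2 (the loop branch is evaluated five times with outcomes T T T T N, and the predictor starts at not taken); the function name `run_one_bit` is a name chosen for this example.

```python
def run_one_bit(outcomes, state="N"):
    """1-bit predictor: predict the previous outcome, then remember the actual one."""
    predictions = []
    for actual in outcomes:
        predictions.append(state)
        state = actual
    return predictions, state

loop = list("TTTTN")   # branch outcomes for i = 0..4

first_preds, state = run_one_bit(loop)
second_preds, state = run_one_bit(loop, state)

first_miss = sum(p != a for p, a in zip(first_preds, loop))
second_miss = sum(p != a for p, a in zip(second_preds, loop))
print("".join(first_preds), first_miss / len(loop))
print("".join(second_preds), second_miss / len(loop))
```

Because the loop exit leaves the predictor at not taken, the second run starts in the same state as the first, which is why the rate does not change on re-execution.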

Examples: How often is branch outcome != previous outcome?

DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT … (100,000 iterations): 2 / 100,000, prediction rate 99.998%

DC44: TTTTT ... TNTTTTT … TNTTTTT …: 2 / 100, prediction rate 98.0%

DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT …

Brandon Franzke, USC 2006

Use two bit history

• 2-bit history
– Start as strongly not taken
– Update BPB after every branch execution

[Figure: branch PC indexing the Branch Prediction Buffer]

TWO-BIT PREDICTOR

2-BIT UP-DOWN SATURATING COUNTER IN EACH ENTRY OF THE BPB

• TAKEN: ADD 1; UNTAKEN: SUBTRACT 1
• NOW IT TAKES 2 MISPREDICTIONS IN A ROW TO CHANGE THE PREDICTION
• FOR THE NESTED LOOP, THE MISPREDICTION AT ENTRY IS AVOIDED
• COULD HAVE MORE THAN 2 BITS, BUT TWO BITS COVER MOST PATTERNS (LOOPS)

States (U: Untaken, T: Taken):
00 (SN, Strongly Not Taken): Predict U
01 (N, Not Taken): Predict U
10 (T, Taken): Predict T
11 (ST, Strongly Taken): Predict T

EE557 Michel Dubois USC 2007

• Q#3 Show the states and predictions for 2 runs of the loop shown in Q#2 using the 2-bit pattern history predictor.

States: SN N T ST

First run:
Iteration   0  1  2  3  4
Actual      T  T  T  T  N
Prediction  N

Second run:
Iteration   0  1  2  3  4
Actual      T  T  T  T  N
Prediction

• A#3 The 2-bit predictor works better than the 1-bit predictor after the initial training period. We can improve the initial training period by starting in the ____ state.

States: SN N T ST

First run:
Iteration   0  1  2  3  4
Actual      T  T  T  T  N
Prediction  N  N  T  T  T
State       SN N  T  ST ST

Second run:
Iteration   0  1  2  3  4
Actual      T  T  T  T  N
Prediction  T  T  T  T  T
State       T  ST ST ST ST
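The two runs can be traced with a short Python sketch of a 2-bit saturating counter. It assumes the encoding 0 = SN, 1 = N, 2 = T, 3 = ST and a start in the strongly-not-taken state; the function name `run_two_bit` is a name chosen for this example.

```python
STATES = ["SN", "N", "T", "ST"]   # counter values 0..3

def run_two_bit(outcomes, counter=0):
    """2-bit up-down saturating counter: predict taken when the counter is 2 or 3."""
    states, predictions = [], []
    for actual in outcomes:
        states.append(STATES[counter])
        predictions.append("T" if counter >= 2 else "N")
        if actual == "T":
            counter = min(counter + 1, 3)   # taken: add 1, saturate at ST
        else:
            counter = max(counter - 1, 0)   # untaken: subtract 1, saturate at SN
    return states, predictions, counter

loop = list("TTTTN")
run1_states, run1_preds, counter = run_two_bit(loop)
run2_states, run2_preds, counter = run_two_bit(loop, counter)
print(run1_states, run1_preds)
print(run2_states, run2_preds)
```

In this trace the single not-taken outcome at loop exit only moves the counter from ST to T, so the second run mispredicts just once, at the exit.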

Q#4 (Global / Local) predictors make use of the PC, while (global / local) predictors do not.

A#4 Local (also known as per-address) predictors make use of the PC to distinguish between different branch instructions. Global predictors do not.

Correlating Branches

(2,2) predictor
– Behavior of recent branches selects between four predictions of the next branch, updating just that prediction

[Figure: the branch address and a 2-bit global branch history select one of the 2-bit per-branch predictors, which supplies the prediction]

CS252 UC Berkeley David A. Patterson

• Q#5 Two-Level Prediction:
• Given the following branch history / pattern history predictor:
– 2-bit global branch history register (Shift-Left)
– 3 bits of PC used to access the pattern history table.
– All predictors are 2-bit predictors.
– Instruction width = 32 bits
– Assume the next branch instruction is at PC = 8004, and it will be taken eventually.
• On the following page:
– Provide the bits of the PC used by the predictor.
– Indicate if the prediction is taken/not taken.
– Show any changes to the branch history register and pattern history table after the branch taken outcome info is provided.

PC A__ - A__

Pattern History Table (8 rows selected by the PC bits, 4 columns selected by the BHR):

000: 00 10 11 10
001: 11 10 01 01
010: 01 01 01 11
011: 00 01 00 10
100: 00 10 11 10
101: 11 10 01 01
110: 01 01 01 11
111: 00 01 00 10

A#5: PC bits used by the predictor: A4 - A2.

8004H => PC bits A4-A2 = 001; with BHR = 01 the selected entry is 10 => Predict T (Taken)

This branch is eventually taken, as predicted. Hence:
• The Branch History Register shifts left from 01 to 11 (shift in a 1).
• The pattern history entry changes from state 10 to state 11 (refer to the 2-bit predictor state diagram).
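The lookup and update in A#5 can be sketched in Python. The row and column ordering of the table is an assumption (rows 000 to 111 from top to bottom, columns for BHR values 00 to 11 from left to right); it is the ordering under which the table entries agree with A#5. The function name `predict_and_update` is a name chosen for this example.

```python
# Pattern history table: pht[pc_bits][bhr], each entry a 2-bit counter.
PHT = [
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
]

def predict_and_update(pht, bhr, pc, taken):
    """Return (prediction, new_bhr) and update the selected counter in place."""
    row = (pc >> 2) & 0b111          # PC bits A4..A2 (32-bit instructions)
    counter = pht[row][bhr]
    prediction = counter >= 0b10     # states 10 and 11 predict taken
    pht[row][bhr] = min(counter + 1, 3) if taken else max(counter - 1, 0)
    new_bhr = ((bhr << 1) | int(taken)) & 0b11   # shift left, shift in the outcome
    return prediction, new_bhr

prediction, new_bhr = predict_and_update(PHT, bhr=0b01, pc=0x8004, taken=True)
print(prediction, format(new_bhr, "02b"), format(PHT[1][0b01], "02b"))
```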

Q#6 Is the following statement true or false? Explain.

“A predictor with more bits can always achieve a better performance”

A#6: False. More bits often just increase training time, which reduces the accuracy for shorter loops. Also, more bits mean more hysteresis, which in turn means “refusing” to “adopt” or “change”.

Q#7 With a branch target buffer, the address of the next instruction can be predicted while the branch is in _____ (IF/ID/EX/MEM/WB) stage.

A#7: IF Stage. The branch target buffer compares the PC against the known predicted taken branches and supplies the next address. Since only the PCs are being compared, the instruction does not have to be decoded. For accurately predicted branches, this results in zero clock penalty.
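A branch target buffer lookup can be sketched as a table keyed by PC. This is a minimal model, assuming 32-bit instructions (so the fall-through address is PC + 4) and a plain dictionary in place of the hardware table; the function names and the example addresses are hypothetical.

```python
def next_fetch_pc(pc, btb):
    """Predict the next PC while the instruction at `pc` is still being fetched.

    Only the PC is compared, so no decoding is needed: a hit supplies the
    stored target, a miss falls through to the next sequential instruction.
    """
    return btb.get(pc, pc + 4)

def update_btb(btb, pc, taken, target):
    """After the branch resolves, keep only predicted-taken branches in the BTB."""
    if taken:
        btb[pc] = target
    else:
        btb.pop(pc, None)

btb = {}
miss_pc = next_fetch_pc(0x8004, btb)      # first encounter: not in the BTB yet
update_btb(btb, 0x8004, taken=True, target=0x9000)
hit_pc = next_fetch_pc(0x8004, btb)       # next encounter: target supplied in IF
print(hex(miss_pc), hex(hit_pc))
```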

CMP

Q#1 Uniprocessor pipelines (with no multithreading) are constrained by ___________ level parallelism.

Q#2 Dynamic power considerations favor ____ (Uniprocessor / Parallel Processor).

A#1 Uniprocessor pipelines (with no multithreading) are constrained by instruction level parallelism (ILP).

A#2 Dynamic power considerations favor the Parallel Processor.

Q#3a Which types of processor multithreading need a context switch through the Process Control Block?
a. Software multithreading
b. Hardware multithreading

Q#3b Which has the high overhead of context switching?

A#3a a. Software multithreading needs a context switch through the Process Control Block.

A#3b Software multithreading has the high overhead of context switching.

Q#4 Does Niagara have the cache coherence issue? If Yes, in which level of cache?

A#4: Yes, in L1 cache since it’s not shared.

Q#5a Is L1 cache shared across cores?

Q#5b Is L1 cache shared (used) by the different threads running on a single core?

A#5a No. The L1 cache is not shared across cores.

A#5b Yes. The L1 cache is shared (used) by the different threads running on a single core.

• Q#6 Uniprocessors place greater burden on (hardware / software) designers, while parallel processors place greater burden on (hardware / software) designers.

• A#6 Uniprocessors place greater burden on hardware designers, while parallel processors place greater burden on software designers.
