March 24, 2005 Prof. Andreas Savvides Spring 2005 eng.yale/courses/2005s/eeng449b
EENG449b/SavvidesLec 16.1
3/25/05
March 24, 2005
Prof. Andreas Savvides
Spring 2005
http://www.eng.yale.edu/courses/2005s/eeng449b
EENG 449bG/CPSC 439bG Computer Systems
Lecture 16
Instruction Level Parallelism II: Dynamic Branch Prediction
Announcements
• Reading for this lecture: Chapter 3, sections 3.4 & 3.5
• Homework #2
Why do we Need Dynamic Hardware Prediction?
• Basic blocks are short, and we have already optimized them with dynamic scheduling in Tomasulo’s algorithm
– Now the bottleneck is control dependences
• Branches disrupt sequential flow of execution
– Need to find ways to avoid stalls from branches
• Need to predict 2 things
– Branch outcome
– Branch target address (what is the next address we should execute code from?)
Static Prediction Strategies
• Several static strategies can apply
– Predict all branches as NOT TAKEN
– Predict all branches as TAKEN
– Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN
– Predict all forward branches as NOT TAKEN and all backward branches as TAKEN
– Opcodes have default predictions that the compiler may reverse at compile time
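As a small illustration of the forward/backward strategy in the list above, the sketch below predicts a branch from nothing but its own address and its target address. The function name `predict_btfn` and the example addresses are hypothetical, chosen only for this illustration.

```python
def predict_btfn(branch_pc, target_pc):
    """Static 'backward taken, forward not taken' prediction.

    A backward branch (target below the branch address) usually closes a loop,
    so it is predicted taken; a forward branch is predicted not taken.
    Returns True for a TAKEN prediction.
    """
    return target_pc < branch_pc


# Hypothetical addresses: a loop-closing branch and a forward skip.
print(predict_btfn(branch_pc=0x120, target_pc=0x100))  # backward branch
print(predict_btfn(branch_pc=0x100, target_pc=0x120))  # forward branch
```

No run-time state is involved, which is what makes the strategy static.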
Dynamic Branch Prediction
• Builds on the premise that history matters
– Observe the behavior of branches in previous instances and try to predict future branch behavior
– Try to predict the outcome of a branch early on in order to avoid stalls
– Branch prediction is critical for multiple issue processors
» In an n-issue processor, branches arrive up to n times faster than in a single-issue processor
Branch Prediction Metrics
• To evaluate the effectiveness of branch prediction you need to consider
– Prediction accuracy
– Penalties associated with branch taken and branch not taken
– The associated penalties are artifacts of
» Pipeline design
» Type of predictor
» Branch frequency
» Strategy to deal with the misprediction
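To see how these metrics combine, a rough sketch follows. It assumes a simplified model with one misprediction penalty for all branches (the slide distinguishes taken and not-taken penalties), and all numbers in the example call are made up for illustration.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Effective CPI under a simplified model (an assumption, not from the slides):
    every mispredicted branch costs the same number of stall cycles."""
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles


# Illustrative numbers only: 20% branches, 10% of them mispredicted, 3-cycle penalty.
print(effective_cpi(base_cpi=1.0, branch_freq=0.20, mispredict_rate=0.10, penalty_cycles=3))
```

The product shows why accuracy alone is not enough: the same misprediction rate hurts more when branches are frequent or the pipeline penalty is large.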
Basic Branch Predictor
• Use a 1-bit branch prediction buffer, also called a branch history table
• 1 bit of memory stating whether the branch was recently taken or not
– Indexed by the lower portion of the branch instruction address
• Bit entry updated each time the branch instruction is executed
• Problem with 1-bit prediction
– A loop branch is typically mispredicted twice per execution of the loop
– Imagine executing a loop
» Predictor will be wrong on the first and last iteration
A One-Bit Predictor
State  Prediction  Next state if Taken  Next state if Not Taken
1      Taken       1                    0
0      Not Taken   1                    0
Actual    T  T  T  NT T  T  T  T  NT T  NT T  NT T
State     1  1  1  1  0  1  1  1  1  0  1  0  1  0  1
Predicts  T  T  T  T  NT T  T  T  T  NT T  NT T  NT
Hit/Miss  H  H  H  M  M  H  H  H  M  M  M  M  M  M
• Predictor misses twice on typical loop branches
– Once at the end of loop
– Once at the end of the 1st iteration of next execution of loop
• The outcome sequence NT-T-NT-T makes it miss all the time
State diagram: State 0 (Predict Not Taken) and State 1 (Predict Taken); a T outcome leads to State 1 and an NT outcome leads to State 0
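The trace on this slide can be reproduced with a few lines of code. The sketch below models a single 1-bit entry that starts in State 1 (as in the slide's trace); the helper names `one_bit_trace` and `parse` are hypothetical.

```python
def one_bit_trace(outcomes, state=1):
    """Simulate one 1-bit predictor entry.

    outcomes: list of booleans, True = taken.
    state: 1 = predict taken, 0 = predict not taken.
    Returns a list of (prediction, hit) pairs, one per branch execution.
    """
    trace = []
    for taken in outcomes:
        prediction = state == 1
        trace.append((prediction, prediction == taken))
        state = 1 if taken else 0  # the bit simply remembers the last outcome
    return trace


def parse(seq):
    """Turn 'T T NT' into [True, True, False]."""
    return [tok == "T" for tok in seq.split()]


actual = parse("T T T NT T T T T NT T NT T NT T")
trace = one_bit_trace(actual)
print(" ".join("H" if hit else "M" for _, hit in trace))
# prints: H H H M M H H H M M M M M M
```

Feeding it the alternating sequence `parse("NT T NT T")` gives a miss on every branch, matching the last bullet.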
A Two-Bit Predictor
State  Prediction  Next state if Taken  Next state if Not Taken
3      Taken       3                    2
2      Taken       3                    0
0      Not Taken   1                    0
1      Not Taken   3                    0
• A four-state Moore machine
• Predictor misses once on typical loop branches
– hence popular
• Outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss all the time
State diagram: States 3 and 2 predict Taken, States 0 and 1 predict Not Taken; transitions on T and NT follow the table above
A Two-Bit Predictor
Actual    T  T  T  NT T  T  T  T  NT NT T  T  NT NT
State     3  3  3  3  2  3  3  3  3  2  0  1  3  2  0
Predicts  T  T  T  T  T  T  T  T  T  T  NT NT T  T
Hit/Miss  H  H  H  M  H  H  H  H  M  M  M  M  M  M
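The two-bit trace can be checked the same way. The sketch below encodes the slide's state table directly as a dictionary and starts in State 3; the names `NEXT_STATE`, `two_bit_trace`, and `parse` are hypothetical.

```python
# state -> (next state if taken, next state if not taken), copied from the slide's table
NEXT_STATE = {3: (3, 2), 2: (3, 0), 0: (1, 0), 1: (3, 0)}


def two_bit_trace(outcomes, state=3):
    """Return a list of booleans (True = hit) for one 2-bit predictor entry."""
    hits = []
    for taken in outcomes:
        prediction = state >= 2          # states 3 and 2 predict taken
        hits.append(prediction == taken)
        state = NEXT_STATE[state][0 if taken else 1]
    return hits


def parse(seq):
    """Turn 'T T NT' into [True, True, False]."""
    return [tok == "T" for tok in seq.split()]


hits = two_bit_trace(parse("T T T NT T T T T NT NT T T NT NT"))
print(" ".join("H" if h else "M" for h in hits))
# prints: H H H M H H H H M M M M M M
```

Running it on `parse("NT NT T T NT NT T T")` yields a miss every time, while a repeated loop pattern such as `T T T NT` costs only one miss per loop execution.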
Branch Prediction Implementation Implications
• Branch predictors held in branch predictor buffers
– Implemented as small caches accessed with instruction address at the IF phase of a pipeline
– OR it could be implemented as a pair of bits attached to each block in the instruction cache
• This branch prediction scheme does not help in the basic 5-stage pipeline
– The decision whether a branch is taken and the target address are computed at the same stage…
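A minimal sketch of the first implementation option (a small table accessed with the instruction address) is shown below. It assumes 4-byte instructions, a power-of-two table size, and the common saturating-counter form of the 2-bit predictor rather than the exact state table from the earlier slide; the class name and addresses are hypothetical.

```python
class BranchHistoryTable:
    """2-bit counters indexed by low-order bits of the branch address (no tags)."""

    def __init__(self, entries=1024):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.counters = [1] * entries  # values 0..3; initial value is an arbitrary choice

    def _index(self, pc):
        # Drop the 2 byte-offset bits (assumes 4-byte instructions), keep the low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable(entries=1024)
bht.update(0x1000, True)
bht.update(0x1000, True)
print(bht.predict(0x1000))         # trained entry
print(bht.predict(0x1000 + 4096))  # a different branch that maps to the same entry
```

Because there are no tags, two branches whose addresses share the same low-order bits share one entry, so one can disturb the other's prediction.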
Prediction is Program Dependent: Branch Prediction Accuracy on SPEC 89 Benchmark
• Using 2-bit prediction, 4KB cache
[Figure: prediction accuracy per benchmark, grouped into FP programs and integer programs]
Performance on SPEC 89 Benchmark
• Remember
– To evaluate performance you need to know the branch frequencies and misprediction penalties
• FP programs typically come from scientific applications and are more loop based
• Branches are harder to predict in integer programs
– Typically have higher branch frequency
• How can this be improved?
– Perhaps increase the size of the prediction buffer
– Increase the effectiveness of the predictor
Effects of Cache Buffer Size
Increasing the branch predictor buffer size has little impact on prediction accuracy
Correlating Bit Predictors
• Need to change predictor structure
• What about considering the behavior of branches other than the one we are trying to predict?
– The branch outcome may be predicted based on the outcome of the previous k branches
• Goal: Use correlating or 2-level predictors to exploit the correlation between consecutive branches…
Branch Correlation Example
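if (aa==2)
    aa=0;
if (bb==2)
    bb=0;
if (aa!=bb){
        DSUBUI R3,R1,#2   ; aa is in R1, bb is in R2
        BNEZ   R3,L1      ; branch b1 (aa!=2)
        DADD   R1,R0,R0   ; aa=0
L1:     DSUBUI R3,R2,#2
        BNEZ   R3,L2      ; branch b2 (bb!=2)
        DADD   R2,R0,R0   ; bb=0
L2:     DSUBU  R3,R1,R2   ; R3=aa-bb
        BEQZ   R3,L3      ; branch b3 (aa==bb)
Branch b3 is correlated with b1 and b2
– If b1 and b2 are both not taken, then aa=bb=0 and b3 will be taken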
Correlated Branch Example
Consider the following code:
if (d==0)
    d=1;
if (d==1)
        BNEZ   R1,L1      ; branch b1 (d!=0), d is in R1
        DADDUI R1,R0,#1   ; d==0, so d=1
L1:     DADDUI R3,R1,#-1
        BNEZ   R3,L2      ; branch b2 (d!=1)
        …
L2:
What are the possible execution sequences when d=0,1,2?
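One way to answer the question is to model the fragment directly. The sketch below mirrors the two branches in plain Python; `run_fragment` is a hypothetical helper, and it assumes d is the only input that matters.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken, final_d) for one pass through the fragment."""
    b1_taken = d != 0          # BNEZ R1, L1
    if not b1_taken:
        d = 1                  # DADDUI R1, R0, #1
    b2_taken = (d - 1) != 0    # DADDUI R3, R1, #-1 followed by BNEZ R3, L2
    return b1_taken, b2_taken, d


for d in (0, 1, 2):
    b1, b2, final_d = run_fragment(d)
    print(f"d={d}: b1 {'T' if b1 else 'NT'}, b2 {'T' if b2 else 'NT'}, d becomes {final_d}")
```

The enumeration shows the correlation: whenever b1 is not taken, d is set to 1 and b2 is not taken either.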
Using a 1-bit Predictor
Consider a sequence of d=2,0,2,0 and a 1-bit predictor
P=prediction, A=action, NP=new prediction
     P. b1  A. b1  NP. b1  P. b2  A. b2  NP. b2
d=2  NT     T      T       NT     T      T
d=0  T      NT     NT      T      NT     NT
d=2  NT     T      T       NT     T      T
d=0  T      NT     NT      T      NT     NT
All branches are mispredicted !!!
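The claim that every branch is mispredicted can be checked with a short simulation. The sketch assumes each branch has its own 1-bit entry, both initialized to not taken as in the table; the function names are hypothetical.

```python
def branch_outcomes(d):
    """Outcomes (b1_taken, b2_taken) of the code fragment for a given d."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2


def one_bit_misses(d_values):
    """Return one boolean per executed branch, True meaning mispredicted."""
    state = {"b1": False, "b2": False}  # False = predict not taken
    misses = []
    for d in d_values:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            misses.append(state[name] != taken)
            state[name] = taken  # 1-bit predictor: remember the last outcome
    return misses


misses = one_bit_misses([2, 0, 2, 0])
print(f"{sum(misses)} of {len(misses)} branches mispredicted")
# prints: 8 of 8 branches mispredicted
```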
Using a 1-bit Predictor with 1-bit Correlation
Each entry holds a pair of prediction bits, written X/X
– First X: prediction if last branch was NOT taken
– Second X: prediction if last branch was taken
NOTE: "last branch" refers to the preceding branch instruction, not the previous execution of the current branch instruction
Using a 1-bit Predictor with 1-bit Correlation
Consider a sequence of d=2,0,2,0 and a 1-bit predictor with 1 bit of correlation
     P. b1  A. b1  NP. b1  P. b2  A. b2  NP. b2
d=2  NT/NT  T      T/NT    NT/NT  T      NT/T
d=0  T/NT   NT     T/NT    NT/T   NT     NT/T
d=2  T/NT   T      T/NT    NT/T   T      NT/T
d=0  T/NT   NT     T/NT    NT/T   NT     NT/T
Misprediction only on the first iteration of d=2!
This is called a (1,1) predictor
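The (1,1) table can be reproduced with a small simulation. The sketch below keeps two prediction bits per branch, selected by the outcome of the most recently executed branch. It assumes all bits start as not taken and that the initial "last branch" history is not taken, which is the setting that reproduces the table; the function names are hypothetical.

```python
def branch_outcomes(d):
    """Outcomes (b1_taken, b2_taken) of the code fragment for a given d."""
    b1 = d != 0
    if not b1:
        d = 1
    b2 = d != 1
    return b1, b2


def correlating_1_1_misses(d_values):
    """Return one boolean per executed branch, True meaning mispredicted."""
    # pred[name][h] = prediction for branch `name` when the last branch outcome was h
    pred = {"b1": {False: False, True: False},
            "b2": {False: False, True: False}}
    last = False  # assumed initial global history: last branch not taken
    misses = []
    for d in d_values:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            misses.append(pred[name][last] != taken)
            pred[name][last] = taken  # update only the bit that was used
            last = taken              # this branch becomes the "last branch"
    return misses


misses = correlating_1_1_misses([2, 0, 2, 0])
print("".join("M" if m else "H" for m in misses))
# prints: MMHHHHHH
```

Only the two branches of the first d=2 iteration miss; after that the bit selected by the previous branch's outcome is always the right one.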
(m,n) Predictors
• Use the behavior of the last m branches to choose from 2^m branch predictors. Each is an n-bit predictor for a single branch
Ex. A (2,2) branch predictor
Why do we have 4, 2-bit values per line?
Example
How many branch-selected entries are in a (2,2) predictor that has a total of 8K bits in the prediction buffer?
2^2 x 2 x Number of prediction entries = 8K
=> 1K of prediction entries selected by the branch
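The same arithmetic works for any (m,n) predictor: each branch-selected entry holds 2^m predictors of n bits each. A small sketch, with the hypothetical helper name `branch_selected_entries`:

```python
def branch_selected_entries(total_bits, m, n):
    """Number of entries selected by the branch address in an (m,n) predictor.

    Each entry holds 2**m predictors (one per global-history pattern),
    and each predictor is n bits wide.
    """
    bits_per_entry = (2 ** m) * n
    return total_bits // bits_per_entry


print(branch_selected_entries(total_bits=8 * 1024, m=2, n=2))  # prints: 1024
```

For comparison, a plain 2-bit predictor is the (0,2) case, so the same 8K bits would give 4K branch-selected entries.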
Tournament Predictors
• N-bit predictors – use local information
• (m,n) predictors – use global information
• Tournament predictors
– Local + global – enhanced performance
• Example of tournament predictors
– Multilevel branch predictors
» Uses several levels of branch prediction table
» Has an algorithm to select from multiple predictors
» Advantage: Select the right predictor for the right branch
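The selection idea can be sketched as follows. This is only an illustration under several assumptions: the selector is a 2-bit saturating counter per branch address, and the two component predictors are deliberately trivial stand-ins (not real local and global predictors). All class names are hypothetical.

```python
class AlwaysTaken:
    """Stand-in component predictor: always predicts taken."""
    def predict(self, pc):
        return True

    def update(self, pc, taken):
        pass


class LastOutcome:
    """Stand-in component predictor: 1 bit per branch address."""
    def __init__(self):
        self.last = {}

    def predict(self, pc):
        return self.last.get(pc, False)

    def update(self, pc, taken):
        self.last[pc] = taken


class TournamentPredictor:
    """Per-branch 2-bit selector: 0-1 favour `first`, 2-3 favour `second`."""
    def __init__(self, first, second):
        self.first, self.second = first, second
        self.selector = {}

    def predict(self, pc):
        use_second = self.selector.get(pc, 1) >= 2
        return (self.second if use_second else self.first).predict(pc)

    def update(self, pc, taken):
        ok_first = self.first.predict(pc) == taken
        ok_second = self.second.predict(pc) == taken
        counter = self.selector.get(pc, 1)
        if ok_second and not ok_first:
            counter = min(3, counter + 1)   # move toward the second predictor
        elif ok_first and not ok_second:
            counter = max(0, counter - 1)   # move toward the first predictor
        self.selector[pc] = counter
        self.first.update(pc, taken)
        self.second.update(pc, taken)


tp = TournamentPredictor(AlwaysTaken(), LastOutcome())
before = tp.predict(0x40)
tp.update(0x40, False)   # the branch at this (made-up) address is not taken
after = tp.predict(0x40)
print(before, after)
```

The selector only moves when the two components disagree, so each branch drifts toward whichever component has been right for it.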
Comparing Predictors
High Performance Instruction Delivery
• What else can be done besides branch prediction?
• Need to have high bandwidth instruction delivery
– Modern multiple issue processors require 4-8 instructions per clock cycle
• To achieve that we consider
– Branch Target Buffers
– Integrated Instruction Fetch Units
– Branch Target Cache
Branch-Target Buffers (BTB)
• How can we further reduce branch penalty?
• We need to know what the next instruction is by the end of IF
• If we know the instruction is a branch and we know the next PC, then the penalty would be zero
• Branch-target buffer – stores the predicted address for the next instruction after a branch
• Advantage for a 5-stage pipeline
– Know the predicted instruction address 1 cycle earlier: in the IF stage instead of the ID stage
BTB has a cache structure
Note that only predicted taken branches need to be stored
Represent addresses of known branches
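As an illustration of how such a buffer is consulted at fetch and updated once a branch resolves, here is a minimal sketch. It uses an unbounded dictionary in place of a finite, tagged cache and assumes 4-byte instructions; the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the address of a predicted-taken branch to its predicted target."""

    def __init__(self):
        self.targets = {}  # branch PC -> predicted target PC

    def next_fetch_pc(self, pc, instr_bytes=4):
        # Hit: fetch from the stored target.
        # Miss: treat as not a taken branch and fetch the next sequential instruction.
        return self.targets.get(pc, pc + instr_bytes)

    def resolve(self, pc, taken, target):
        # Called when the branch outcome is known; only taken branches are kept.
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_fetch_pc(0x100)))   # miss: sequential fetch
btb.resolve(0x100, taken=True, target=0x80)
print(hex(btb.next_fetch_pc(0x100)))   # hit: predicted target
```

Because the lookup uses only the fetch address, the prediction is available during IF, before the instruction has been decoded.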
Branch Target Buffer Operation
Integrated Instruction Fetch Units
• Instead of using instruction fetch as one of the pipeline phases, use a more advanced instruction fetch unit
– To support the demands of multiple issue processors
• Integrated IF has 3 main units
– Integrated Branch Prediction
– Instruction Prefetch
» autonomously fetching ahead the given instructions
– Instruction memory access and buffering
» Tries to hide the overhead associated with fetching instructions from multiple cache lines by buffering instructions
Return Address Predictors
• Predict the target address of jumps whose destination is not known at compile time
– Returns from procedure calls
» Procedures get called at different points in the code
• Use a small stack of return addresses
– When a procedure is called, push the return address on the stack; pop the stack on return
– If the stack has enough depth – optimal prediction
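The push/pop behaviour, and what happens when the stack is too shallow, can be sketched as below. The overflow policy (discard the oldest entry) is an assumption made for this illustration, and the class name and addresses are hypothetical.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: discard the oldest entry (assumed policy)
        self.stack.append(return_address)

    def predict_return(self):
        # Returns None when the stack is empty, i.e. no prediction is available.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for addr in (0x10, 0x20, 0x30):   # three nested calls, but only two slots
    ras.on_call(addr)
predictions = [ras.predict_return() for _ in range(3)]
print(predictions)  # prints: [48, 32, None]
```

With depth 2, the two innermost returns are predicted correctly and the outermost one is lost, which is why prediction is optimal only when the stack is at least as deep as the call nesting.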
Prediction Stack Performance
Results based on a number of SPEC benchmarks
Recap
So far we have seen
• Dynamic Scheduling – reduce stalls from data dependences
– Tomasulo’s algorithm
• Dynamic Branch Prediction – trying to reduce stalls from control dependences
– N-bit predictors, (m,n) predictors, Tournament Predictors
• Achieve an ideal CPI of 1
– Branch target buffer, integrated IF, return address prediction
Next Lecture
• Multiple issue processors
• Speculation
• Completion of Ch. 3