Lecture 4: EIT090 Computer Architecture
Anders Ardö
EIT – Electrical and Information Technology, Lund University
Sept 23, 2009
A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 1 / 78
Outline
1 Reiteration
2 MIPS R4000
3 Instruction Level Parallelism
4 Branch Prediction
5 Dependencies
6 Instruction Scheduling
7 Scoreboard
Previous lecture
Introduction to pipeline basics
- General principles of pipelining (review from EIT070 Computer Organization)
- Techniques to avoid pipeline stalls due to hazards
- What makes pipelining hard to implement?
- Support of multi-cycle instructions in a pipeline
A Pipelined MIPS Datapath
Pipelining Lessons
- Pipelining doesn't help the latency of a single instruction; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill and time to drain the pipeline reduce speedup
Summary
Pipelining:
- Speeds up throughput, not latency
- Speedup ≤ #stages

Speedup = [CPI_unpipelined / (Ideal CPI + #stall-cycles/instr)] * (TC_unpipelined / TC_pipelined)
        ≈ #stages / (1 + #stall-cycles/instruction)
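As a quick numeric check of the approximation, a sketch with hypothetical stall rates:

```python
# Speedup ≈ #stages / (1 + stall cycles per instruction)
def speedup(stages, stalls_per_instr):
    return stages / (1 + stalls_per_instr)

print(speedup(5, 0.0))   # 5.0: ideal, one instruction completes per cycle
print(speedup(5, 0.5))   # ~3.33: half a stall per instruction erodes the gain
```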
Hazards limit performance and generate stalls:
- Structural: need more HW
- Data (RAW, WAR, WAW): need forwarding and compiler scheduling
- Control: delayed branch, branch prediction

Complications:
- Precise exceptions may be difficult to implement
- The instruction set can be more or less suited for pipelining
Lecture 4 agenda
Appendix A.6-A.7 and Chapters 2.1-2.6, 2.9 in "Computer Architecture"
1 Reiteration
2 MIPS R4000
3 Instruction Level Parallelism
4 Branch Prediction
5 Dependencies
6 Instruction Scheduling
7 Scoreboard
Lecture 4
- A case study of a deep pipeline: the MIPS R4000
- Introduction to Instruction Level Parallelism: ILP
- Branch prediction
- Instruction scheduling
The MIPS R4000
R4000 - MIPS64:
- First (1992) true 64-bit architecture (addresses and data)
- Clock frequency (1997): 100-250 MHz
- Deep 8-stage pipeline (superpipelined)

Implications of the deeper pipeline:
- Load latency: 2 cycles
- Branch latency: 3 cycles (incl. one delay slot)
  ⇒ high demands on the compiler
- Bypassing (forwarding) from more stages
Performance equation: CPI ∗ Tc must be lower for the longerpipeline to make it worthwhile
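A deeper pipeline raises CPI but lowers Tc; the product decides. A back-of-the-envelope comparison with hypothetical numbers (not measured R4000 data):

```python
# Time per instruction = CPI * Tc; the deeper pipeline wins only if
# its product is smaller. All four numbers below are assumptions.
cpi_5, tc_5 = 1.5, 10.0   # 5-stage pipeline at 100 MHz (hypothetical)
cpi_8, tc_8 = 2.0, 5.0    # 8-stage pipeline at 200 MHz (hypothetical)
t5, t8 = cpi_5 * tc_5, cpi_8 * tc_8
print(t5, t8)             # 15.0 vs 10.0 ns per instruction
print(t5 / t8)            # 1.5x: the deeper pipeline is worthwhile here
```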
High Clock Frequency Implications
A 200 MHz clock frequency corresponds to a 5 ns cycle time in each pipe stage. While an ALU can be clocked at this frequency, memory accesses become the main bottleneck.
The MIPS R4000 has 8-16 Kbyte I- and D-caches. The extra pipeline stages come from decomposing the memory access.
Operations in a cache access:
- Address translation
- Cache look-up
- Hit/miss detection
The MIPS R4000 Pipeline
- Instruction fetch is done in the IF and IS stages, speculatively, independent of a cache hit or miss
- RF performs the same work as ID in the classic 5-stage pipeline, but also checks for an instruction cache hit/miss
- The MEM stage is divided into three stages (DF, DS, and TC); cache misses are detected in TC (tag check)
- WB is the same as in the classic 5-stage pipeline
Branch Penalties
- One branch delay slot
- The predict-not-taken scheme squashes the next two sequentially fetched instructions if the branch is taken
Branch Penalties
branch taken:

Instruction        Clock number
                   1    2    3    4    5    6    7    8    9
Branch instr       IF   IS   RF   EX   DF   DS   TC   WB
Delay slot              IF   IS   RF   EX   DF   DS   TC   WB
Stall                        s    s    s    s    s    s    s
Stall                             s    s    s    s    s    s
Branch target                          IF   IS   RF   EX   DF

branch not taken:

Instruction        Clock number
                   1    2    3    4    5    6    7    8    9
Branch instr       IF   IS   RF   EX   DF   DS   TC   WB
Delay slot              IF   IS   RF   EX   DF   DS   TC   WB
Branch instr +1              IF   IS   RF   EX   DF   DS   TC
Branch instr +2                   IF   IS   RF   EX   DF   DS
Load Penalties
Instruction        Clock number
                   1    2    3    4      5      6      7    8    9
LD   R1,...        IF   IS   RF   EX     DF     DS     TC   WB
DADD R2,R1,R4           IF   IS   RF     stall  stall  EX   DF   DS
DSUB R3,R1,R5                IF   IS     stall  stall  RF   EX   DF
- Data is available after the DS stage (the second 'Data memory' stage)
  ⇒ 2 load delay slots
- Four bypasses instead of two, because of the two extra pipe stages to read from memory
R4000 Performance
CPI penalties (SPEC92 CPI measurements):

            Average   Load     Branch   FP data   FP struct
            CPI       hazard   hazard   hazard    hazard
            2.00      0.10     0.36     0.46      0.09

- The penalty for control hazards is very high for integer programs (up to 0.61)
- The penalty for FP data hazards is also high
- The higher clock frequency compensates for the higher CPI
Instruction Level Parallelism - ILP
ILP: overlap execution of unrelated instructions, e.g. by pipelining

Two main approaches:
- DYNAMIC ⇒ hardware detects parallelism
- STATIC ⇒ software detects parallelism
Often a mix of both.

Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
Instruction Level Parallelism - ILP
Example:
    DIVD F0, F2, F4
    ADDD F10, F1, F2
    SUBD F8, F8, F14

- These instructions are independent and could be executed in parallel
- gcc has 17% control-transfer instructions: about 5 sequential instructions per branch
- We must look beyond a single basic block to get more ILP
- Loop-level parallelism is one opportunity, in SW and HW
Loop-level Parallelism
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + 10;

- There is very little available parallelism within an iteration
- However, there is parallelism between loop iterations; each iteration is independent of the others
- Potential speedup = 1000!
MIPS-code for the Loop
loop: LD     F0, 0(R1)    ; F0 = array element
      ADDD   F4, F0, F2   ; add scalar constant
      SD     F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement array pointer
      BNE    R1, R2, loop ; repeat if R1 != R2

Instruction producing result   Instruction using result   Latency
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
Loop Showing Stalls
1  loop: LD     F0, 0(R1)    ; F0 = array element
2        stall
3        ADDD   F4, F0, F2   ; add scalar constant
4        stall
5        stall
6        SD     F4, 0(R1)    ; store result
7        DADDUI R1, R1, #-8  ; decrement array pointer
8        stall
9        BNE    R1, R2, loop ; repeat if R1 != R2
10       nop                 ; branch delay slot
How can we get rid of the stalls?
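The 10-cycle count can be reproduced with a small in-order latency simulator. This is a sketch, not a course tool: the instruction-kind encoding and the extra integer-op-to-branch latency (needed for the stall before BNE, which the latency table does not list) are assumptions.

```python
# In-order issue: each instruction issues one cycle after its
# predecessor, or later if a source operand is not yet ready.
LATENCY = {              # (producer kind, consumer kind) -> extra cycles
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption: branch waits a cycle for the ALU result
}

def count_cycles(instrs):
    produced = {}        # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, srcs in instrs:
        issue = cycle + 1
        for s in srcs:
            if s in produced:
                pkind, pcycle = produced[s]
                issue = max(issue, pcycle + 1 + LATENCY.get((pkind, kind), 0))
        cycle = issue
        if dest is not None:
            produced[dest] = (kind, cycle)
    return cycle

loop = [
    ("load",   "F0", []),      # LD     F0, 0(R1)
    ("fp_alu", "F4", ["F0"]),  # ADDD   F4, F0, F2
    ("store",  None, ["F4"]),  # SD     F4, 0(R1)
    ("int",    "R1", []),      # DADDUI R1, R1, #-8
    ("branch", None, ["R1"]),  # BNE    R1, R2, loop
    ("int",    None, []),      # nop (branch delay slot)
]
print(count_cycles(loop))      # -> 10, matching the listing above
```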
Restructured Loop Minimizing Stalls
1  loop: LD     F0, 0(R1)    ; F0 = array element
2        DADDUI R1, R1, #-8  ; decrement array pointer
3        ADDD   F4, F0, F2   ; add scalar constant
4        stall
5        BNE    R1, R2, loop ; repeat if R1 != R2
6        SD     F4, 8(R1)    ; store result (in the delay slot)

- DADDUI and SD are swapped by changing the offset in SD from 0 to 8
- 6 clock cycles per iteration
- Sophisticated compiler analysis required
- Can we do better?
Unroll Loop Four Times
1  loop: LD     F0, 0(R1)
2        ADDD   F4, F0, F2
3        SD     F4, 0(R1)    ; drop DADDUI & BNE
4        LD     F6, -8(R1)
5        ADDD   F8, F6, F2
6        SD     F8, -8(R1)   ; drop DADDUI & BNE
7        LD     F10, -16(R1)
8        ADDD   F12, F10, F2
9        SD     F12, -16(R1) ; drop DADDUI & BNE
10       LD     F14, -24(R1)
11       ADDD   F16, F14, F2
12       SD     F16, -24(R1)
13       DADDUI R1, R1, #-32 ; step changed to 4*8
14       BNE    R1, R2, loop
15       NOP

- 15 + 4*(1+2) + 1 = 28 clock cycles, or 7 per iteration
- Assumes the number of loop iterations is a multiple of 4
Scheduled Unrolled Loop
1  loop: LD     F0, 0(R1)
2        LD     F6, -8(R1)
3        LD     F10, -16(R1)
4        LD     F14, -24(R1)
5        ADDD   F4, F0, F2
6        ADDD   F8, F6, F2
7        ADDD   F12, F10, F2
8        ADDD   F16, F14, F2
9        SD     F4, 0(R1)
10       SD     F8, -8(R1)
11       DADDUI R1, R1, #-32
12       SD     F12, 16(R1)  ; 16 - 32 = -16
13       BNE    R1, R2, loop
14       SD     F16, 8(R1)   ; 8 - 32 = -24

- 14 clock cycles, or 3.5 per iteration
- When is it safe to move instructions?
Why loop unrolling works
- Longer sequences of straight-line code without branches (longer basic blocks) allow easier static rescheduling by the compiler
- Longer basic blocks also facilitate dynamic rescheduling, such as scoreboarding and Tomasulo's algorithm
Dynamic Branch Prediction
Branches limit performance because of:
- Branch penalties
- The limit they place on available Instruction Level Parallelism

Solution: dynamic branch prediction of the outcome of conditional branches.

Benefits:
- Reduce the time until the branch condition is known
- Reduce the time to calculate the branch target address
Branch History Table
- The simplest branch prediction scheme
- The branch-prediction buffer is indexed by bits from branch-instruction PC values
- The bit corresponding to a branch indicates whether the branch is predicted as taken or not
- When the prediction is wrong: invert the bit
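A minimal sketch of such a 1-bit table in Python (the PC value, table size, and loop pattern are illustrative):

```python
# 1-bit branch-history table: indexed by low-order PC bits,
# the bit is simply inverted on every misprediction.
class OneBitBHT:
    def __init__(self, index_bits=4):
        self.table = [True] * (1 << index_bits)  # initially: predict taken
        self.mask = (1 << index_bits) - 1
    def predict(self, pc):
        return self.table[pc & self.mask]
    def update(self, pc, taken):
        self.table[pc & self.mask] = taken

bht = OneBitBHT()
outcomes = [True, True, True, False] * 2  # loop branch: taken 3x, then exit
hits = 0
for t in outcomes:
    hits += (bht.predict(0x40) == t)
    bht.update(0x40, t)
print(hits, "/", len(outcomes))  # 5 / 8: the single bit mispredicts both
                                 # the loop exit and the re-entry after it
```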
A two-bit prediction scheme
- Requires the prediction to miss twice in order to change the prediction ⇒ better performance
- 1%-18% misprediction frequency for SPEC89
- Integer programs have a higher misprediction frequency than FP programs
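A sketch of the 2-bit scheme as a saturating counter per entry (sizes and the loop pattern are illustrative): on a loop-like branch, only the loop exits now mispredict.

```python
# 2-bit saturating counter (0..3): predict taken when counter >= 2,
# so a single misprediction does not flip the prediction.
class TwoBitBHT:
    def __init__(self, index_bits=4):
        self.ctr = [3] * (1 << index_bits)  # initially: strongly taken
        self.mask = (1 << index_bits) - 1
    def predict(self, pc):
        return self.ctr[pc & self.mask] >= 2
    def update(self, pc, taken):
        i = pc & self.mask
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)

bht = TwoBitBHT()
outcomes = [True, True, True, False] * 2  # loop branch: taken 3x, then exit
hits = 0
for t in outcomes:
    hits += (bht.predict(0x40) == t)
    bht.update(0x40, t)
print(hits, "/", len(outcomes))  # 6 / 8: only the two loop exits miss
```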
Branch Target Buffer - BTB
- Predicts the branch target address in the IF stage
- Can be combined with 2-bit branch prediction
Branch Target Buffer (BTB) Algorithm
Branch prediction penalties for a Branch Target Buffer scheme
Instruction   Prediction   Actual      Penalty
in buffer                  branch      cycles
Yes           Taken        Taken       0
Yes           Taken        Not taken   2
No                         Taken       2
No                         Not taken   0
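The table can be turned into an expected penalty per branch. The rates below are illustrative assumptions, not SPEC measurements:

```python
# Expected penalty = P(in buffer)*P(wrong)*2 + P(not in buffer)*P(taken)*2
hit_rate = 0.90   # assumed: fraction of branches found in the BTB
accuracy = 0.90   # assumed: prediction correct when found
p_taken  = 0.60   # assumed: branch actually taken when not in the BTB
penalty = hit_rate * (1 - accuracy) * 2 + (1 - hit_rate) * p_taken * 2
print(round(penalty, 2))  # 0.3 cycles per branch on average
```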
Dependencies
- Two instructions must be independent in order to execute in parallel
- There are three general types of dependencies that limit parallelism:
  - Data dependencies
  - Name dependencies
  - Control dependencies
- Dependencies are properties of the program
- Whether a dependency leads to a hazard or not is a property of the pipeline implementation
Data Dependencies
An instruction j is data dependent on instruction i if:
- instruction i produces a result used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

Example:
    LD   F0, 0(R1)
    ADDD F4, F0, F2
    SD   F4, 0(R1)

Easy to detect for registers
Name Dependencies
Two instructions use the same name (register or memory address) but do not exchange data.

Anti-dependence (WAR if a hazard in HW):
    ADDD F2, F0, F2   ; must execute before the LD
    LD   F0, 0(R1)

Output dependence (WAW if a hazard in HW):
    ADDD F0, F2, F2   ; must execute before the LD
    LD   F0, 0(R1)
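The three dependence kinds can be detected mechanically from register names. A sketch (the (dest, sources) encoding is illustrative):

```python
# Classify dependences between two instructions, i before j in
# program order, given destination and source registers.
def classify(i_dest, i_srcs, j_dest, j_srcs):
    deps = []
    if i_dest and i_dest in j_srcs:
        deps.append("RAW")  # true data dependence: j reads i's result
    if j_dest and j_dest in i_srcs:
        deps.append("WAR")  # anti-dependence: j overwrites i's source
    if i_dest and i_dest == j_dest:
        deps.append("WAW")  # output dependence: same destination
    return deps

# ADDD F2,F0,F2 followed by LD F0,0(R1): anti-dependence on F0
print(classify("F2", ["F0", "F2"], "F0", []))  # ['WAR']
# ADDD F0,F2,F2 followed by LD F0,0(R1): output dependence on F0
print(classify("F0", ["F2", "F2"], "F0", []))  # ['WAW']
```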
Where Are the Name Dependencies?
1  loop: LD     F0, 0(R1)
2        ADDD   F4, F0, F2
3        SD     F4, 0(R1)    ; drop DADDUI & BNE
4        LD     F0, -8(R1)
5        ADDD   F4, F0, F2
6        SD     F4, -8(R1)   ; drop DADDUI & BNE
7        LD     F0, -16(R1)
8        ADDD   F4, F0, F2
9        SD     F4, -16(R1)  ; drop DADDUI & BNE
10       LD     F0, -24(R1)
11       ADDD   F4, F0, F2
12       SD     F4, -24(R1)
13       DADDUI R1, R1, #-32 ; step changed to 4*8
14       BNE    R1, R2, loop
15       NOP
How can we remove them?
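Register renaming is one answer: give every new definition a fresh register and rewrite later uses. A sketch (the (dest, sources) encoding and "P" register names are illustrative):

```python
from itertools import count

def rename(code):
    fresh = ("P%d" % n for n in count())  # fresh "physical" registers
    cur = {}                              # architectural -> current name
    out = []
    for dest, srcs in code:
        srcs = [cur.get(s, s) for s in srcs]
        if dest is not None:
            cur[dest] = next(fresh)       # new definition gets a new name
            dest = cur[dest]
        out.append((dest, srcs))
    return out

# Two unrolled iterations, both reusing F0 and F4:
body = [("F0", []), ("F4", ["F0"]), (None, ["F4"]),
        ("F0", []), ("F4", ["F0"]), (None, ["F4"])]
for ins in rename(body):
    print(ins)
# The iterations now write distinct registers, so the WAR and WAW
# name dependencies are gone and the loads can be hoisted.
```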
Control Dependencies
Control dependencies determine the ordering of an instruction with respect to a branch instruction.

Example:
    if Test1 then { S1 }
    if Test2 then { S2 }

- S1 is control dependent on Test1
- S2 is control dependent on Test2, but not on Test1
- We cannot move an instruction that is control dependent on a branch to before the branch instruction
- We cannot move an instruction that is not control dependent on a branch to after the branch instruction
Instruction scheduling
Scheduling is the process that determines when to start a particular instruction, when to read its operands, and when to write its result.
Goal of scheduling: rearrange instructions to reduce stalls when data or control dependencies are present.

Static (compiler, at compile time) scheduling:
- Pro: may look into the future; no HW support needed
- Con: cannot detect all dependencies, especially across branches

Dynamic (hardware, at run time) scheduling:
- Pro: can potentially detect all dependencies
- Con: cannot look into the future; needs HW support, which complicates the pipeline hardware
Dynamic instruction scheduling
Key idea: allow instructions behind a stall to proceed

    DIVD  F0, F2, F4    ; takes a long time
    ADDD  F10, F0, F8   ; stalls waiting for F0
    MULTD F12, F8, F14  ; let this instruction bypass the ADDD

- MULTD is not data dependent on anything in the pipeline
- Enables out-of-order execution ⇒ out-of-order completion
- The ID stage checks for structural and data dependencies
- Scoreboarding dates to the CDC 6600 in 1963; Tomasulo's algorithm to the IBM 360/91 in 1967
Dynamic instruction scheduling
Why in hardware at run time?
- It works when we cannot know the real dependencies at compile time
- It makes the compiler simpler
- Code compiled for one implementation runs well on another
Scoreboard pipeline
- The goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle by executing each instruction as early as possible
- Instructions execute out of order when there are sufficient resources and no data dependencies
- A scoreboard is a hardware unit that keeps track of:
  - the instructions that are in the process of being executed,
  - the functional units that are doing the executing,
  - and the registers that will hold the results of those units
- The scoreboard centrally performs all hazard detection and resolution, and thus controls instruction progression from one step to the next
Scoreboard pipeline
- Issue: decode and check for structural hazards
- Read operands: wait until no data hazards, then read operands
- All data hazards are handled by the scoreboard
Scoreboard complications
Out-of-order execution ⇒ WAR & WAW hazards

- WAR: stall the instruction in the Write stage until all previously issued instructions (with a WAR hazard) have read their operands
- WAW: detect the hazard and stall in Issue until the other instruction completes
- RAW: handled in Read Operands
- The scoreboard keeps track of dependencies and the state of operations
Scoreboard functionality
Issue: an instruction is issued if:
- the needed functional unit is free (there is no structural hazard)
- no functional unit has a destination operand equal to the destination of the instruction (resolves WAW hazards)

Read: wait until no data hazards, then read operands
- performed in parallel for all functional units
- resolves RAW hazards dynamically

EX: normal execution
- notify the scoreboard when the result is ready

Write: the instruction can update its destination if:
- all instructions issued earlier have read their operands (resolves WAR hazards)
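The Issue conditions can be written as a small predicate. This is a sketch with simplified structures; the unit and register names are illustrative:

```python
# Issue succeeds only without a structural hazard (FU busy) or a
# WAW hazard (destination register already pending a write).
def can_issue(fu, dest, fu_busy, reg_result):
    if fu_busy[fu]:
        return False               # structural hazard: unit occupied
    if reg_result.get(dest) is not None:
        return False               # WAW hazard: dest will be written
    return True

fu_busy = {"add1": False, "mult1": True}
reg_result = {"F0": "div1"}        # a DIVD in flight will write F0
print(can_issue("mult1", "F2", fu_busy, reg_result))  # False: unit busy
print(can_issue("add1", "F0", fu_busy, reg_result))   # False: WAW on F0
print(can_issue("add1", "F2", fu_busy, reg_result))   # True: safe to issue
```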
Data structures in the scoreboard
1. Instruction status: keeps track of which of the 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU), with 9 fields per FU:
   - Busy: whether the unit is busy or not
   - Op: operation to perform in the unit (e.g. add or sub)
   - Fi: destination register name
   - Fj, Fk: source register names
   - Qj, Qk: names of the functional units producing Fj, Fk
   - Rj, Rk: flags indicating when Fj and Fk are ready
3. Register result status: indicates which functional unit, if any, will write each register
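The nine per-FU fields map directly onto a record type. A sketch in Python (field names follow the slide; the example instruction is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    busy: bool = False           # unit occupied?
    op: Optional[str] = None     # operation to perform, e.g. "add"
    fi: Optional[str] = None     # destination register
    fj: Optional[str] = None     # first source register
    fk: Optional[str] = None     # second source register
    qj: Optional[str] = None     # FU that will produce Fj (None if ready)
    qk: Optional[str] = None     # FU that will produce Fk (None if ready)
    rj: bool = False             # Fj ready and not yet read
    rk: bool = False             # Fk ready and not yet read

# ADDD F6, F2, F4 occupying the adder, both operands ready:
add = FUStatus(busy=True, op="add", fi="F6", fj="F2", fk="F4",
               rj=True, rk=True)
print(add)
```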
Scoreboard Example
Detailed Scoreboard Pipeline Control
See figure A.54 in “Computer Architecture”
Scoreboard Example, CP 1-9, 11-14, 16-17, 19-22, 61-62
(a sequence of figure-only slides stepping the scoreboard state forward one clock period at a time; see the corresponding figures in "Computer Architecture")
Factors that limit performance
The scoreboard technique is limited by:The amount of parallelism available in codeThe number of scoreboard entries (window size)
A large window can look ahead across more instructionsThe number and types of functional units
Contention for functional units leads to structural hazardsThe presence of anti-dependencies and output dependencies
These lead to WAR and WAW hazards that are handled by stallingthe pipeline in the Scoreboard
Number of datapaths to registers
Scoreboard Summary
Main advantages:
- manages multiple FUs
- out-of-order execution of multi-cycle operations
- maintains all data dependencies (RAW, WAW, WAR)

Scoreboard limitations:
- single-issue scheme (however, the scheme is extendable to multiple issue)
- in-order issue
- anti-dependencies and output dependencies may lead to WAR and WAW stalls
- no forwarding hardware: all results go through the registers
Summary
ILP:
- Rescheduling and loop unrolling are important to take advantage of potential Instruction Level Parallelism

Dynamic instruction scheduling:
- an alternative to compile-time scheduling
- does not need recompilation to increase performance
- used in most new processor implementations

Dynamic branch prediction:
- reduces branch penalties by early prediction of conditional branch outcomes