Post on 17-Jan-2022
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 1
1 Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 3
Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition
2 Copyright © 2012, Elsevier Inc. All rights reserved.
Introduction n Pipelining become universal technique in 1985
n Overlaps execution of instructions n Exploits “Instruction Level Parallelism”
n Beyond this, there are two main approaches: n Hardware-based dynamic approaches
n Used in server and desktop processors n Not used as extensively in PMP processors
n Compiler-based static approaches n Not as successful outside of scientific applications
Introduction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 2
3 Copyright © 2012, Elsevier Inc. All rights reserved.
Instruction-Level Parallelism n When exploiting instruction-level parallelism,
goal is to maximize CPI n Pipeline CPI =
n Ideal pipeline CPI + n Structural stalls + n Data hazard stalls + n Control stalls
n Parallelism with basic block is limited n Typical size of basic block = 3-6 instructions n Must optimize across branches
Introduction
4 Copyright © 2012, Elsevier Inc. All rights reserved.
Data Dependence n Loop-Level Parallelism
n Unroll loop statically or dynamically n Use SIMD (vector processors and GPUs)
n Challenges: n Data dependency
n Instruction j is data dependent on instruction i if n Instruction i produces a result that may be used by instruction j n Instruction j is data dependent on instruction k and instruction k
is data dependent on instruction i
n Dependent instructions cannot be executed simultaneously
Introduction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 3
5 Copyright © 2012, Elsevier Inc. All rights reserved.
Data Dependence n Dependencies are a property of programs n Pipeline organization determines if dependence
is detected and if it causes a stall
n Data dependence conveys: n Possibility of a hazard n Order in which results must be calculated n Upper bound on exploitable instruction level
parallelism
n Dependencies that flow through memory locations are difficult to detect
Introduction
6 Copyright © 2012, Elsevier Inc. All rights reserved.
Name Dependence n Two instructions use the same name but no flow
of information n Not a true data dependence, but is a problem when
reordering instructions n Antidependence: instruction j writes a register or
memory location that instruction i reads n Initial ordering (i before j) must be preserved
n Output dependence: instruction i and instruction j write the same register or memory location
n Ordering must be preserved
n To resolve, use renaming techniques
Introduction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 4
7 Copyright © 2012, Elsevier Inc. All rights reserved.
Other Factors n Data Hazards
n Read after write (RAW) n Write after write (WAW) n Write after read (WAR)
n Control Dependence n Ordering of instruction i with respect to a branch
instruction n Instruction control dependent on a branch cannot be moved
before the branch so that its execution is no longer controller by the branch
n An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
Introduction
8 Copyright © 2012, Elsevier Inc. All rights reserved.
Examples n OR instruction dependent
on DADDU and DSUBU
n Assume R4 isn’t used after skip n Possible to move DSUBU
before the branch
Introduction • Example 1: DADDU R1,R2,R3 BEQZ R4,L DSUBU R1,R1,R6
L: … OR R7,R1,R8
• Example 2:
DADDU R1,R2,R3 BEQZ R12,skip DSUBU R4,R5,R6 DADDU R5,R4,R9
skip: OR R7,R8,R9
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 5
9 Copyright © 2012, Elsevier Inc. All rights reserved.
Compiler Techniques for Exposing ILP
n Pipeline scheduling n Separate dependent instruction from the source
instruction by the pipeline latency of the source instruction
n Example: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;
Com
piler Techniques
10 Copyright © 2012, Elsevier Inc. All rights reserved.
Pipeline Stalls
Loop: L.D F0,0(R1) stall ADD.D F4,F0,F2 stall stall S.D F4,0(R1) DADDUI R1,R1,#-8 stall (assume integer load latency is 1) BNE R1,R2,Loop
Com
piler Techniques
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 6
11 Copyright © 2012, Elsevier Inc. All rights reserved.
Pipeline Scheduling Scheduled code: Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8 ADD.D F4,F0,F2 stall stall S.D F4,8(R1) BNE R1,R2,Loop
Com
piler Techniques
12 Copyright © 2012, Elsevier Inc. All rights reserved.
Loop Unrolling n Loop unrolling
n Unroll by a factor of 4 (assume # elements is divisible by 4) n Eliminate unnecessary instructions
Loop: L.D F0,0(R1) ADD.D F4,F0,F2 S.D F4,0(R1) ;drop DADDUI & BNE L.D F6,-8(R1) ADD.D F8,F6,F2 S.D F8,-8(R1) ;drop DADDUI & BNE L.D F10,-16(R1) ADD.D F12,F10,F2 S.D F12,-16(R1) ;drop DADDUI & BNE L.D F14,-24(R1) ADD.D F16,F14,F2 S.D F16,-24(R1) DADDUI R1,R1,#-32 BNE R1,R2,Loop
Com
piler Techniques
n note: number of live registers vs. original loop
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 7
13 Copyright © 2012, Elsevier Inc. All rights reserved.
Loop Unrolling/Pipeline Scheduling
n Pipeline schedule the unrolled loop:
Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 ADD.D F12,F10,F2 ADD.D F16,F14,F2 S.D F4,0(R1) S.D F8,-8(R1) DADDUI R1,R1,#-32 S.D F12,16(R1) S.D F16,8(R1) BNE R1,R2,Loop
Com
piler Techniques
14 Copyright © 2012, Elsevier Inc. All rights reserved.
Strip Mining n Unknown number of loop iterations?
n Number of iterations = n n Goal: make k copies of the loop body n Generate pair of loops:
n First executes n mod k times n Second executes n / k times n “Strip mining”
Com
piler Techniques
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 8
15 15
Three Limits to Loop Unrolling 1. Decrease in amount of overhead
amortized with each extra unrolling • Amdahl’s Law
2. Growth in code size • For larger loops, concern
it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling • If it is not possible to allocate all live values to registers,
we may lose some or all of its advantage
n Loop unrolling reduces impact of branches on pipeline; another way is branch prediction
Conclude: a little loop unrolling is beneficial; trying to do a lot is hard.
16 16
12%
22%
18%
11% 12%
4%6%
9% 10%
15%
0%
5%
10%
15%
20%
25%
compresseqntott
espresso gc
c li
doduc
ear
hydro2d
mdljdp
su2cor
Mis
pred
ictio
n R
ate
Static Branch Prediction n To reorder code around branches,
need to predict branch statically when compiled n Simplest scheme is to predict a branch as taken
n Average misprediction = untaken branch frequency = 34% SPEC
n More accurate scheme predicts branches using profile information collected from earlier runs, and modify prediction based on last run:
Integer Floating Point
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 9
17 17
Dynamic Branch Prediction n Why does prediction work?
n Underlying algorithm has regularities n Data that is being operated on has regularities n Instruction sequence has redundancies that are artifacts of way that
humans/compilers think about problems
n Is dynamic branch prediction better than static branch prediction? n Seems to be n There are a small number of important branches in programs that have
dynamic behavior
18 18
Dynamic Branch Prediction n Performance = ƒ(accuracy, cost of
misprediction) n Branch History Table:
Lower bits of PC address index table of 1-bit values n Says whether or not branch taken last time n No address check
n Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit): n End of loop case, when it exits instead of looping as before n First time through loop on next time through code,
when it predicts exit instead of looping
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 10
19 19
n Solution: 2-bit scheme where change prediction only if get misprediction twice
n Red: stop, not taken n Green: go, taken n Adds hysteresis to decision making process
Dynamic Branch Prediction
T
T NT
NT
Predict Taken
Predict Not Taken
Predict Taken
Predict Not Taken T
NT T
NT
20 20
18%
5%
12%10% 9%
5%
9% 9%
0% 1%
0%2%4%6%8%10%12%14%16%18%20%
eqntott
espresso gc
c lispice
doducspice
fpppp
matrix300nasa7
Mis
pred
ictio
n R
ate
BHT Accuracy n Mispredict because either:
n Wrong guess for that branch n Got branch history of wrong branch when index the table
n 4096 entry table:
Integer Floating Point
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 11
21 21
Correlated Branch Prediction n Idea: record m most recently executed branches as
taken or not taken, and use that pattern to select the proper n-bit branch history table
n In general, (m,n) predictor means record last m branches
to select between 2m history tables, each with n-bit counters
n Thus, old 2-bit BHT is a (0,2) predictor
n Global Branch History: m-bit shift register keeping T/NT status of last m branches.
n Each entry in table has m n-bit predictors.
22 22
Correlating Branches
(2,2) predictor – Behavior of recent
branches selects between four predictions of next branch, updating just that prediction
Branch address
2-bits per branch predictor
Prediction
2-bit global branch history
4
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 12
23 23
0%
Freq
uenc
y of
Misp
redi
ctio
ns
0% 1%
5% 6% 6%
11%
4%
6% 5%
1% 2%
4%
6%
8%
10%
12%
14%
16%
18%
20%
4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)
Accuracy of Different Schemes
4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT
nasa
7
mat
rix30
0
dodu
cd
spic
e
fppp
p
gcc
expr
esso
eqnt
ott li
tom
catv
24 24
Tournament Predictors n Multilevel branch predictor
n Use n-bit saturating counter to choose between predictors
n Usual choice between global and local predictors
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 13
25 25
Tournament Predictors Tournament predictor using, say, 4K 2-bit
counters indexed by local branch address. Chooses between:
n Global predictor n 4K entries index by history of last 12 branches (212 = 4K)
n Each entry is a standard 2-bit predictor
n Local predictor n Local history table: 1024 10-bit entries recording last 10 branches,
index by branch address
n The pattern of the last 10 occurrences of that particular branch used to index table of 1K entries with 3-bit saturating counters
26 26
Comparing Predictors (Fig. 2.8) Advantage of tournament predictor is ability to
select the right predictor for a particular branch n Particularly crucial for integer benchmarks. n A typical tournament predictor will select the global predictor
almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 14
27 27
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
11
13
7
12
9
10 0 0
5
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa
Bra
nch
mis
pre
dic
tion
s p
er 1
00
0 I
nst
ruct
ion
s
SPECint2000 SPECfp2000
≈6% misprediction rate per branch SPECint (19% of INT instructions are branch)
≈2% misprediction rate per branch SPECfp (5% of FP instructions are branch)
28
• Branch target calculation is costly and stalls the instruction fetch.
• BTB stores PCs the same way as caches • The PC of a branch is sent to the BTB • When a match is found
the corresponding Predicted PC is returned • Even if the branch was predicted taken,
instruction fetch continues at the returned Predicted PC, i.e. no stall
Branch Target Buffers (BTB)
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 15
29
Branch Target Buffers
Organization?
Proceed normally?
PC of instruction to fetch? What if prediction is wrong?
30 30
Dynamic Branch Prediction Summary n Prediction becoming important part of execution n Branch History Table: 2 bits for loop accuracy n Correlation: Recently executed branches correlated with
next branch n Either different branches (GA) n Or different executions of same branches (PA)
n Tournament predictors take insight to next level by using multiple predictors
n usually one based on global information and one based on local information, and combining them with a selector
n In 2006, tournament predictors using ≈ 30K bits are in processors like the Power5 and Pentium 4
n Branch Target Buffer: include branch address & prediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 16
31 Copyright © 2012, Elsevier Inc. All rights reserved.
Branch Prediction n Basic 2-bit predictor:
n For each branch: n Predict taken or not taken n If the prediction is wrong two consecutive times, change prediction
n Correlating predictor: n Multiple 2-bit predictors for each branch n One for each possible combination of outcomes of preceding n
branches n Local predictor:
n Multiple 2-bit predictors for each branch n One for each possible combination of outcomes for the last n
occurrences of this branch n Tournament predictor:
n Combine correlating predictor with local predictor
Branch P
rediction
32 Copyright © 2012, Elsevier Inc. All rights reserved.
Branch Prediction Performance
Branch P
rediction
Branch predictor performance
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 17
33 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynamic Scheduling n Rearrange order of instructions to reduce stalls
while maintaining data flow
n Advantages: n Compiler doesn’t need to have knowledge of
microarchitecture n Handles cases where dependencies are unknown at
compile time
n Disadvantage: n Substantial increase in hardware complexity n Complicates exceptions
Branch P
rediction
34 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynamic Scheduling n Dynamic scheduling implies:
n Out-of-order execution n Out-of-order completion
n Creates the possibility for WAR and WAW hazards
n Tomasulo’s Approach n Tracks when operands are available n Introduces register renaming in hardware
n Minimizes WAW and WAR hazards
Branch P
rediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 18
35 Copyright © 2012, Elsevier Inc. All rights reserved.
Register Renaming n Example:
DIV.D F0,F2,F4 ADD.D F6,F0,F8 S.D F6,0(R1) SUB.D F8,F10,F14 MUL.D F6,F10,F8
+ name dependence with F6
Branch P
rediction
antidependence
antidependence
36 Copyright © 2012, Elsevier Inc. All rights reserved.
Register Renaming n Example:
DIV.D F0,F2,F4 ADD.D S,F0,F8 S.D S,0(R1) SUB.D T,F10,F14 MUL.D F6,F10,T
n Now only RAW hazards remain, which can be strictly
ordered
Branch P
rediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 19
37 Copyright © 2012, Elsevier Inc. All rights reserved.
Register Renaming n Register renaming is provided by reservation stations
(RS) n Contains:
n The instruction n Buffered operand values (when available) n Reservation station number of instruction providing the
operand values n RS fetches and buffers an operand as soon as it becomes
available (not necessarily involving register file) n Pending instructions designate the RS to which they will send
their output n Result values broadcast on a result bus, called the common data bus (CDB)
n Only the last output updates the register file n As instructions are issued, the register specifiers are renamed
with the reservation station n May be more reservation stations than registers
Branch P
rediction
38 Copyright © 2012, Elsevier Inc. All rights reserved.
Tomasulo’s Algorithm n Load and store buffers
n Contain data and addresses, act like reservation stations
n Top-level design:
Branch P
rediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 20
39 Copyright © 2012, Elsevier Inc. All rights reserved.
Tomasulo’s Algorithm n Three Steps:
n Issue n Get next instruction from FIFO queue n If available RS, issue the instruction to the RS with operand values if
available n If operand values not available, stall the instruction
n Execute n When operand becomes available, store it in any reservation
stations waiting for it n When all operands are ready, issue the instruction n Loads and store maintained in program order through effective
address n No instruction allowed to initiate execution until all branches that
proceed it in program order have completed n Write result
n Write result on CDB into reservation stations and store buffers n (Stores must wait until address and value are received)
Branch P
rediction
40 40
Advantages of Dynamic Scheduling
n Dynamic scheduling - hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
n It handles cases when dependences are unknown at compile time n it allows the processor to tolerate unpredictable delays such as cache
misses, by executing other code while waiting for the miss to resolve
n It allows code that compiled for one pipeline to run efficiently on a different pipeline
n It simplifies the compiler n Hardware speculation, a technique with
significant performance advantages, builds on dynamic scheduling (next lecture)
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 21
41 41
HW Schemes: Instruction Parallelism n Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4 ADDD F10,F0,F8 SUBD F12,F8,F14
n Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
n In a dynamically scheduled pipeline, all instructions still pass through issue stage in order (in-order issue)
n Will distinguish when an instruction begins execution and when it completes execution; between two times, the instruction is in execution
n Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder
42 42
Dynamic Scheduling Step 1
n Simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
n Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
n Issue—Decode instructions, check for structural hazards
n Read operands—Wait until no data hazards, then read operands
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 22
43 43
A Dynamic Algorithm: Tomasulo’s
n For IBM 360/91 (before caches!) n ⇒ Long memory latency
n Goal: High Performance without special compilers n Small number of floating point registers (four in 360)
prevented interesting compiler scheduling of operations n This led Tomasulo to try to figure out how to get more registers — renaming
in hardware!
n Why Study 1966 Computer? n The descendants of this have flourished!
n Alpha 21264, Pentium 4, AMD Opteron, Power 5, …
44 44
Tomasulo Algorithm n Control & buffers distributed with Function Units (FU)
n FU buffers called “reservation stations”; have pending operands
n Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming; n Renaming avoids WAR, WAW hazards n More reservation stations than registers,
so can do optimizations compilers can’t n Results to FU from RS, not through registers, over Common
Data Bus that broadcasts results to all FUs n Avoids RAW hazards by executing an instruction
only when its operands are available
n Load and Stores treated as FUs with RSs as well n Integer instructions can go past branches (predict taken),
allowing FP ops beyond basic block in FP queue
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 23
45 45
Tomasulo Organization
FP adders
Add1 Add2 Add3
FP multipliers
Mult1 Mult2
From Mem FP Registers
Reservation Stations
Common Data Bus (CDB)
To Mem
FP Op Queue
Load Buffers
Store Buffers
Load1 Load2 Load3 Load4 Load5 Load6
46 46
Reservation Station Components Op: Operation to perform in the unit (e.g., + or –) Vj, Vk: Value of Source operands
n Store buffers has V field, result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written) n Note: Qj,Qk=0 => ready n Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 24
47 47
Three Stages of Tomasulo Algorithm
1. "Issue—get instruction from FP Op Queue If reservation station free (no structural hazard),
control issues instruction & sends operands (renames registers). 2. "Execute—operate on operands (EX)
When both operands ready then execute; if not ready, watch Common Data Bus for result
3. "Write result—finish execution (WB) Write on Common Data Bus to all awaiting units;
mark reservation station available n Normal data bus: data + destination (“go to” bus) n Common data bus: data + source (“come from” bus)
n 64 bits of data + 4 bits of Functional Unit source address n Write if matches expected Functional Unit (produces result) n Does the broadcast
n Example speed: 3 clocks for Fl .pt. +,-; 10 for * ; 40 clocks for /
48 48
Tomasulo Example Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F300 FU
Clock cycle counter
FU count down
Instruction stream
3 Load/Buffers
3 FP Adder R.S. 2 FP Mult R.S.
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 25
49 49
Tomasulo Example Cycle 1 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1
50 50
Tomasulo Example Cycle 2 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1
Note: Can have multiple loads outstanding
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 26
51 51
Tomasulo Example Cycle 3 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1
• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued • Load1 completing; what is waiting for Load1?
52 52
Tomasulo Example Cycle 4 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(A1) Add1
• Load2 completing; what is waiting for Load2?
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 27
53 53
Tomasulo Example Cycle 5 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No
10 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(A2) M(A1) Add1 Mult2
• Timer starts down for Add1, Mult1
54 54
Tomasulo Example Cycle 6 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(A2) Add2 Add1 Mult2
• Issue ADDD here despite name dependency on F6?
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 28
55 55
Tomasulo Example Cycle 7 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(A2) Add2 Add1 Mult2
• Add1 (SUBD) completing; what is waiting for it?
56 56
Tomasulo Example Cycle 8 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 No2 Add2 Yes ADDD (M-M) M(A2)
Add3 No7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(A2) Add2 (M-M) Mult2
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 29
57 57
Tomasulo Example Cycle 9 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 No1 Add2 Yes ADDD (M-M) M(A2)
Add3 No6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 M(A2) Add2 (M-M) Mult2
58 58
Tomasulo Example Cycle 10 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 No0 Add2 Yes ADDD (M-M) M(A2)
Add3 No5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 M(A2) Add2 (M-M) Mult2
• Add2 (ADDD) completing; what is waiting for it?
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 30
59 59
Tomasulo Example Cycle 11 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 No
4 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
• Write result of ADDD here? • All quick instructions complete in this cycle!
60 60
Tomasulo Example Cycle 12 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 No
3 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 31
61 61
Tomasulo Example Cycle 13 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 No
2 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
62 62
Tomasulo Example Cycle 14 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 No
1 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 32
63 63
Tomasulo Example Cycle 15 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 No
0 Mult1 Yes MULTDM(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 M(A2) (M-M+M)(M-M) Mult2
• Mult1 (MULTD) completing; what is waiting for it?
64 64
Tomasulo Example Cycle 16 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(A2) (M-M+M)(M-M) Mult2
• Just waiting for Mult2 (DIVD) to complete
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 33
65 65
Faster than light computation (skip a couple of cycles)
66 66
Tomasulo Example Cycle 55 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU M*F4 M(A2) (M-M+M)(M-M) Mult2
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 34
67 67
Tomasulo Example Cycle 56 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Mult2
• Mult2 (DIVD) is completing; what is waiting for it?
68 68
Tomasulo Example Cycle 57 Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Result
• Once again: In-order issue, out-of-order execution and out-of-order completion.
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 35
69 69
Why can Tomasulo overlap iterations of loops?
n Register renaming n Multiple iterations use different physical destinations
for registers (dynamic loop unrolling). n Reservation stations
n Permit instruction issue to advance past integer control flow operations
n Also buffer old values of registers—totally avoiding the WAR stall n Other perspective: Tomasulo building data flow
dependency graph on the fly
70 70
Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic n distributed reservation stations and the CDB n If multiple instructions waiting on single result, and each
instruction has other operands, then instructions can be released simultaneously by broadcast on CDB
n If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 36
71 71
Tomasulo Drawbacks
n Complexity n delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon! n Many associative stores (CDB) at high speed n Performance limited by Common Data Bus
n Each CDB must go to multiple functional units ⇒high capacitance, high wiring density
n Number of functional units that can complete per cycle limited to one!
n Multiple CDBs ⇒ more FU logic for parallel assoc stores n Non-precise interrupts!
n We will address this later
72 Copyright © 2012, Elsevier Inc. All rights reserved.
Example
Branch P
rediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 37
73 Copyright © 2012, Elsevier Inc. All rights reserved.
Hardware-Based Speculation n Execute instructions along predicted execution
paths but only commit the results if prediction was correct
n Instruction commit: allowing an instruction to update the register file when instruction is no longer speculative
n Need an additional piece of hardware to prevent any irrevocable action until an instruction commits n I.e. updating state or taking an execution
Branch P
rediction
74 Copyright © 2012, Elsevier Inc. All rights reserved.
Reorder Buffer n Reorder buffer – holds the result of instruction
between completion and commit
n Four fields: n Instruction type: branch/store/register n Destination field: register number n Value field: output value n Ready field: completed execution?
n Modify reservation stations: n Operand source is now reorder buffer instead of
functional unit
Branch P
rediction
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 38
75 Copyright © 2012, Elsevier Inc. All rights reserved.
Reorder Buffer n Register values and memory values are not
written until an instruction commits n On misprediction:
n Speculated entries in ROB are cleared
n Exceptions: n Not recognized until it is ready to commit
Branch P
rediction
76 Copyright © 2012, Elsevier Inc. All rights reserved.
Multiple Issue and Static Scheduling
n To achieve CPI < 1, need to complete multiple instructions per clock
n Solutions: n Statically scheduled superscalar processors n VLIW (very long instruction word) processors n dynamically scheduled superscalar processors
Multiple Issue and S
tatic Scheduling
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 39
77 Copyright © 2012, Elsevier Inc. All rights reserved.
Multiple Issue M
ultiple Issue and Static S
cheduling
78 Copyright © 2012, Elsevier Inc. All rights reserved.
VLIW Processors
n Package multiple operations into one instruction
n Example VLIW processor: n One integer instruction (or branch) n Two independent floating-point operations n Two independent memory references
n Must be enough parallelism in code to fill the available slots
Multiple Issue and S
tatic Scheduling
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 40
79 Copyright © 2012, Elsevier Inc. All rights reserved.
VLIW Processors
n Disadvantages: n Statically finding parallelism n Code size n No hazard detection hardware n Binary code compatibility
Multiple Issue and S
tatic Scheduling
80 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynamic Scheduling, Multiple Issue, and Speculation
n Modern microarchitectures: n Dynamic scheduling + multiple issue + speculation
n Two approaches: n Assign reservation stations and update pipeline
control table in half clock cycles n Only supports 2 instructions/clock
n Design logic to handle any possible dependencies between the instructions
n Hybrid approaches
n Issue logic can become bottleneck
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 41
81 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
Overview of Design
82 Copyright © 2012, Elsevier Inc. All rights reserved.
n Limit the number of instructions of a given class that can be issued in a “bundle” n I.e. on FP, one integer, one load, one store
n Examine all the dependencies amoung the instructions in the bundle
n If dependencies exist in bundle, encode them in reservation stations
n Also need multiple completion/commit
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
Multiple Issue
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 42
83 Copyright © 2012, Elsevier Inc. All rights reserved.
Loop: LD R2,0(R1) ;R2=array element DADDIU R2,R2,#1 ;increment R2 SD R2,0(R1) ;store result DADDIU R1,R1,#8 ;increment pointer BNE R2,R3,LOOP ;branch if not last element
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
Example
84 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
Example (No Speculation)
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 43
85 Copyright © 2012, Elsevier Inc. All rights reserved.
Dynam
ic Scheduling, M
ultiple Issue, and Speculation
Example
86 Copyright © 2012, Elsevier Inc. All rights reserved.
n Need high instruction bandwidth! n Branch-Target buffers
n Next PC prediction buffer, indexed by current PC
Adv. Techniques for Instruction D
elivery and Speculation
Branch-Target Buffer
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 44
87 Copyright © 2012, Elsevier Inc. All rights reserved.
n Optimization: n Larger branch-target buffer n Add target instruction into buffer to deal with longer
decoding time required by larger buffer n “Branch folding”
Adv. Techniques for Instruction D
elivery and Speculation
Branch Folding
88 Copyright © 2012, Elsevier Inc. All rights reserved.
n Most unconditional branches come from function returns
n The same procedure can be called from multiple sites n Causes the buffer to potentially forget about the
return address from previous calls n Create return address buffer organized as a
stack
Adv. Techniques for Instruction D
elivery and Speculation
Return Address Predictor
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 45
89 Copyright © 2012, Elsevier Inc. All rights reserved.
n Design monolithic unit that performs: n Branch prediction n Instruction prefetch
n Fetch ahead
n Instruction memory access and buffering n Deal with crossing cache lines
Adv. Techniques for Instruction D
elivery and Speculation
Integrated Instruction Fetch Unit
90 Copyright © 2012, Elsevier Inc. All rights reserved.
n Register renaming vs. reorder buffers n Instead of virtual registers from reservation stations and
reorder buffer, create a single register pool n Contains visible registers and virtual registers
n Use hardware-based map to rename registers during issue n WAW and WAR hazards are avoided n Speculation recovery occurs by copying during commit n Still need a ROB-like queue to update table in order n Simplifies commit:
n Record that mapping between architectural register and physical register is no longer speculative
n Free up physical register used to hold older value n In other words: SWAP physical registers on commit
n Physical register de-allocation is more difficult
Adv. Techniques for Instruction D
elivery and Speculation
Register Renaming
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 46
91 Copyright © 2012, Elsevier Inc. All rights reserved.
n Combining instruction issue with register renaming: n Issue logic pre-reserves enough physical registers
for the bundle (fixed number?) n Issue logic finds dependencies within bundle, maps
registers as necessary n Issue logic finds dependencies between current
bundle and already in-flight bundles, maps registers as necessary
Adv. Techniques for Instruction D
elivery and Speculation
Integrated Issue and Renaming
92 Copyright © 2012, Elsevier Inc. All rights reserved.
n How much to speculate n Mis-speculation degrades performance and power
relative to no speculation n May cause additional misses (cache, TLB)
n Prevent speculative code from causing higher costing misses (e.g. L2)
n Speculating through multiple branches n Complicates speculation recovery n No processor can resolve multiple branches per
cycle
Adv. Techniques for Instruction D
elivery and Speculation
How Much?
The University of Adelaide, School of Computer Science 8 October 2012
Chapter 2 — Instructions: Language of the Computer 47
93 Copyright © 2012, Elsevier Inc. All rights reserved.
n Speculation and energy efficiency n Note: speculation is only energy efficient when it
significantly improves performance
n Value prediction n Uses:
n Loads that load from a constant pool n Instruction that produces a value from a small set of values
n Not been incorporated into modern processors n Similar idea--address aliasing prediction--is used on
some processors
Adv. Techniques for Instruction D
elivery and Speculation
Energy Efficiency