Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
Transcript of Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
![Page 1: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/1.jpg)
Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters
J. Nelson Amaral
![Page 2: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/2.jpg)
Tomasulo Algorithm
![Page 3: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/3.jpg)
IBM 360/91 Floating Point Arithmetic Unit
Tomasulo Algorithm:
A reservation station for each functional unit.
Baer, p. 97
Free/Occupied bit
Flag = on → Data = value
Flag = off → Data = tag
A tag (pointer) to the ROB entry that will store result.
![Page 4: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/4.jpg)
Decode-rename Stage
Reservation Station
Available?
Structural Hazard: stall incoming instructions
No
Free ROB Entry?
Structural Hazard: stall incoming instructions
No
Assign reservation station and tail of ROB
to instruction
Yes
Yes
Baer p. 97
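The stall logic in this chart can be sketched as a short routine. This is a minimal illustrative sketch, not Baer's notation; the `ROB` class and the `free_stations` list are assumptions:

```python
# Sketch of the decode-rename stage: stall on a structural hazard,
# otherwise assign a reservation station and the tail entry of the ROB.

class ROB:
    def __init__(self, size):
        self.size = size
        self.entries = []              # tail = end of the list

    def full(self):
        return len(self.entries) >= self.size

    def allocate(self, instr):
        self.entries.append({"instr": instr, "flag": 0, "data": None})
        return len(self.entries) - 1   # tag of the new tail entry

def decode_rename(instr, free_stations, rob):
    """Return (station, ROB tag), or None to signal a stall."""
    if not free_stations:              # no reservation station available
        return None                    # structural hazard: stall
    if rob.full():                     # no free ROB entry
        return None                    # structural hazard: stall
    station = free_stations.pop()
    tag = rob.allocate(instr)          # instruction enters tail of ROB
    return station, tag
```

Deallocation of ROB entries at commit (the circular-queue behavior) is omitted to keep the sketch short.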
![Page 5: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/5.jpg)
Dispatch Stage
Map for each source
operand?
ROB Entry
Forward ROB tag to RS.
ReadyBit(RS) ← 0
LogicalRegister
ROB Entry Flag?
Forward value to Reservation Station (RS)
ReadyBit(RS) ← 1
Tag
Value
Map result register to tag
Enter tag into RS
Enter instruction at tail of ROB
ResultFlag(tail of ROB) ← 0
Baer p. 98
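The per-operand decision in this chart can be sketched as one function (illustrative names: `reg_map` maps a logical register to a ROB tag, or to nothing when the register file holds the committed value):

```python
def dispatch_operand(reg, reg_map, rob_entries, regfile):
    """Return (ReadyBit, payload) for one source operand."""
    tag = reg_map.get(reg)
    if tag is None:                  # map points to the logical register
        return 1, regfile[reg]       # forward the value, ReadyBit <- 1
    entry = rob_entries[tag]
    if entry["flag"]:                # ROB entry already holds the value
        return 1, entry["data"]      # forward the value, ReadyBit <- 1
    return 0, tag                    # forward the ROB tag, ReadyBit <- 0
```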
![Page 6: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/6.jpg)
Issue Stage
Both Flags in RS are
on?
Issue instruction to functional unit to start
execution
Yes
No
No
Functional unit stalled? (waiting for CDB)
Yes
If multiple functional units of the same type are available, use a scheduling algorithm.
CDB = Common Data Bus
Baer p. 98
![Page 7: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/7.jpg)
Execute
Last cycle of execution?
Broadcast result and associated tag
Yes
No
Got ownership of CDB?
No
If multiple functional units request ownership of the Common Data Bus (CDB) on the same cycle, a hardwired priority protocol picks the winner.
Baer p. 98
ROB stores result in entry identified by tag.
Set corresponding ReadyBit.
RSs with same tag store result and set corresponding flag.
Yes
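The broadcast action by the winning unit can be sketched as follows (illustrative structures: the ROB is a list of entries, and each reservation station is a dict with two flag/operand pairs):

```python
def broadcast(tag, value, rob_entries, stations):
    """One CDB broadcast: the ROB entry named by the tag stores the
    result, and every reservation-station operand still waiting on
    that tag captures the value and sets its flag."""
    rob_entries[tag]["data"] = value
    rob_entries[tag]["flag"] = 1          # corresponding ReadyBit
    for rs in stations:
        for i in (1, 2):
            if rs["flag%d" % i] == 0 and rs["oper%d" % i] == tag:
                rs["oper%d" % i] = value  # store the result
                rs["flag%d" % i] = 1      # operand is now ready
```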
![Page 8: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/8.jpg)
Commit Stage
Is there a result at the
head of ROB?
No
Store result in logical register
Delete ROB entry
Yes
Baer p. 97
![Page 9: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/9.jpg)
Operation Timings
Assuming no dependencies
Baer, p. 98
Time: 0 1 2 3 4 5 6 7
Addition: decoded → dispatched → issued → finish execution → broadcast → commit (if head of ROB)
Multiplication: decoded → dispatched → issued → finish execution → broadcast → commit (if head of ROB)
![Page 10: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/10.jpg)
Example
i1: R4 ← R0 * R2    # use reservation station 1 of multiplier
i2: R6 ← R4 * R8    # use reservation station 2 of multiplier
i3: R8 ← R2 + R12   # use reservation station 1 of adder
i4: R4 ← R14 + R16  # use reservation station 2 of adder
![Page 11: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/11.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E1 (index 4), E2 (index 6)
ROB
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
ROB (Flag, Data, Log. Reg):
0 E1 R4 (head)
0 E2 R6
(tail)
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 0 E1 1 (R8) E2
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0 0
Status: i1 Executing; i2 Dispatched (i2 is in reservation station 2 of the multiplier).
![Page 12: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/12.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E4 (index 4), E2 (index 6), E3 (index 8)
ROB
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
ROB (Flag, Data, Log. Reg):
0 E1 R4 (head)
0 E2 R6
0 E3 R8
0 E4 R4
(tail)
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 0 E1 1 (R8) E2
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 1 (R14) 1 (R16) E4
0
Status: i1 Executing; i2 Dispatched; i3 Ready to Broadcast; i4 Dispatched.
“register R4, which was renamed as ROB entry E1 and tagged as such in the reservation station Mult2, is now mapped to ROB entry E4.” (Baer, p. 102)
![Page 13: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/13.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E4 (index 4), E2 (index 6), E3 (index 8)
ROB
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
ROB (Flag, Data, Log. Reg):
0 E1 R4 (head)
0 E2 R6
1 (i3) R8
0 E4 R4
(tail)
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 0 E1 1 (R8) E2
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0 0
Status: i1 Ready to Broadcast; i2 Dispatched; i3 Broadcast; i4 Ready to Broadcast.
Assume the Adder has priority to broadcast.
![Page 14: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/14.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E4 (index 4), E2 (index 6), E3 (index 8)
ROB
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
ROB (Flag, Data, Log. Reg):
0 E1 R4 (head)
0 E2 R6
1 (i3) R8
1 (i4) R4
(tail)
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 0 E1 1 (R8) E2
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0 0
Status: i1 Ready to Broadcast; i2 Dispatched; i4 Broadcast.
Assume the Adder has priority to broadcast.
![Page 15: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/15.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E4 (index 4), E2 (index 6), E3 (index 8)
ROB
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
ROB (Flag, Data, Log. Reg):
1 (i1) R4 (head)
0 E2 R6
1 (i3) R8
1 (i4) R4
(tail)
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0
1 1 (i1) 1 (R8) E2
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0 0
Status: i1 Broadcast; i2 Dispatched.
![Page 16: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/16.jpg)
Register Map (Index ⋅⋅⋅ 4 5 6 7 8): E4 (index 4), E2 (index 6), E3 (index 8)
i1: R4 ← R0 * R2
i2: R6 ← R4 * R8
i3: R8 ← R2 + R12
i4: R4 ← R14 + R16
Time: 0 1 2 3 4 5 6 7
Multiplier Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0
Adder Reservation Stations (Free, Flag1, Oper1, Flag2, Oper2, Tag):
0 0 0
ROB (Flag, Data, Log. Reg):
0 (i1) R4
0 E2 R6 (head)
0 (i3) R8
0 (i4) R4
(tail)
Status: i1 Commit; i2 Executing.
![Page 17: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/17.jpg)
IBM 360/91 – unveiled in 1966
![Page 18: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/18.jpg)
Some variant of the Tomasulo algorithm is the basis for the
design of all out-of-order processors.
Baer p. 97
![Page 19: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/19.jpg)
Data dependencies between instructions
Where should these instructions wait?
How do they become ready for issue?
Several instructions get to the end of the front end and have to wait for operands.
Baer p. 177
![Page 20: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/20.jpg)
Wakeup Stage
Detects instruction readiness.
We hope for m instructions to be woken up on each cycle.
Baer p. 177
![Page 21: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/21.jpg)
Select Step
• Also called the scheduling step: arbitrates between multiple instructions vying for the same functional unit.
– Variations of first-come-first-served (or FIFO)
• Bypassing (or forwarding) of operands to units allows earlier selection.
• Critical instructions may have preference for selection.
Baer p. 177
![Page 22: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/22.jpg)
Out-of-Order Architectures
Key idea: allow instructions following a stalled one to start execution out of order.
A FIFO schedule is not a good idea!
Where to store stalled instructions?
Baer p. 178
![Page 23: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/23.jpg)
Two Extreme Solutions
Tomasulo: a separate reservation station for each functional unit (distributed window), as in the IBM PowerPC series.
Instruction Window: a centralized reservation station for all functional units (centralized window), as in the Intel P6 architecture.
Baer p. 178
![Page 24: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/24.jpg)
A Hybrid Solution
Reservation stations are shared among groups of functional units (hybrid window).
MIPS R10000: 3 sets of reservation stations:
• address calculations
• floating-point units
• load-store units
Baer p. 178
![Page 25: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/25.jpg)
How does a design team select between a centralized, distributed, or hybrid window?
What are the compromises?
Baer p. 179
![Page 26: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/26.jpg)
Window design
• Resource allocation: centralized is better
– static partitioning of resources is worse than dynamic allocation
• Large windows: speed and power come into play
Baer p. 179
![Page 27: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/27.jpg)
Two-Step Instruction Issue
Wakeup: instruction is ready for execution
Select: instruction is assigned to an execution unit.
![Page 28: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/28.jpg)
Wakeup Step
Baer p. 180
f
Functional units
Window entries
w
![Page 29: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/29.jpg)
Window entry with buses from 8 exec units
![Page 30: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/30.jpg)
Wakeup Step
Baer p. 180
f
Functional units
Window entries
w
We need one bus from each functional unit to each window entry.
We also need two comparators for each functional unit in each window entry.
Thus we need 2fw comparators.
If we separate the functional units and window slots into two equal-size groups, we only need fw/2 comparators.
We will also need fewer (shorter) buses from units to slots.
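The comparator counts can be checked with a little arithmetic (f = 8 units and w = 64 entries are made-up numbers for illustration):

```python
def comparators_monolithic(f, w):
    # every window entry snoops both of its operand tags
    # against the tag broadcast by every functional unit
    return 2 * f * w

def comparators_per_cluster(f, w):
    # each of the two clusters snoops only its own f/2 units
    # across its own w/2 window entries
    return 2 * (f // 2) * (w // 2)

print(comparators_monolithic(8, 64))   # 2fw
print(comparators_per_cluster(8, 64))  # fw/2 per cluster
```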
![Page 31: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/31.jpg)
Select Step
• Priority encoder: a circuit that receives several requests and issues one grant
• Woken-up instructions vying for the same unit send requests.
• Priority is related to position in the window.
• Smaller window → smaller priority encoder
Baer p. 181
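A position-based selection policy of this kind can be sketched as follows (a functional model only; a real priority encoder is a combinational circuit, not a loop):

```python
def select(requests):
    """Grant the requester closest to the head of the window.
    requests[i] is true if the instruction in window position i
    (lower index = older instruction) requests the functional unit."""
    for pos, req in enumerate(requests):
        if req:
            return pos        # one grant per cycle
    return None               # no requests this cycle
```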
![Page 32: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/32.jpg)
When should a centralized window be replaced by a distributed or
hybrid one?
When the wakeup and select steps are on the critical path.
The threshold appears to be windows with around 64 entries on a 4-wide superscalar processor.
Baer p. 182
![Page 33: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/33.jpg)
Intel Pentium 4: 2 large windows, 2 schedulers per window
Intel Pentium III and Intel Core: smaller centralized window
AMD Opteron:4 sets of reservation stations
Baer p. 182
![Page 34: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/34.jpg)
Relation between Select and Wake Up
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
The name given to the result of instruction i (R51) must be broadcast as soon as instruction i is selected.
Broadcasting the tag of R51 wakes up instruction i+1.
For single-cycle latency instructions, the start of execution is too late to broadcast the tag.
Baer p. 183
![Page 35: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/35.jpg)
Speculative Wake Up and Select
i: R51 ← load(R22)
i+1: R43 ← R27 – R51
i+2: R35 ← R51 + R28
Example:
In this case the tag of the destination of instruction i is broadcast.
Instructions i+1 and i+2 are speculatively woken up and selected based on a cache-hit latency.
In the case of a cache miss, all dependent instructions that have been woken up and selected must be aborted.
Baer p. 183
![Page 36: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/36.jpg)
Speculative Selection and the Reservation Stations
• An instruction must remain in a reservation station after it is scheduled
– A bit indicates that the instruction has been selected
– The station is freed once it is certain that the instruction selection is no longer speculative
• Windows are large in comparison with the number of functional units
– They accommodate many instructions in flight, some speculatively.
Baer p. 183
![Page 37: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/37.jpg)
Integrated Register File
Tomasulo Reservation Stations
What happens upon selection of an instruction?
Functional Unit
Reservation Station
Opcode
Operands
Opcode
Operands
Functional Unit
Instruction Window
PhysicalRegister File
Baer p. 183
![Page 38: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/38.jpg)
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
Functional Unit A
Compute i
Functional Unit B
Compute i+1
Output of A must be forwarded to B, bypassing storage.
Baer p. 183
![Page 39: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/39.jpg)
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
Functional Unit A
Compute i
Functional Unit B
Compute i+1
Now the bypass must forward the output to the input of A. But the hardware has to implement both buses.
Baer p. 183
![Page 40: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/40.jpg)
The complexity of Bypassing
i: R51 ← R22 + R33
i+1: R43 ← R27 – R51
Example:
Functional Unit A
Compute i
Functional Unit B
Compute i+1
Also, we need buses to forward the output of B.
In general, given k functional units we may need k² buses.
Buses become long to avoid crossing each other.
Forwarding may limit the number of functional units in a processor.
Forwarding may need more than one cycle to complete.
Baer p. 184
![Page 41: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/41.jpg)
Load Speculation
• Load Address Speculation
– Used for data prefetching
• Memory dependence prediction
– Used to speculate data flow from a store to a subsequent load.
Baer p. 185
![Page 42: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/42.jpg)
Store Buffer
• Store Buffer: a circular queue
– Entry allocated when the store instruction is decoded
– Entry removed when the store is committed
• Keeps data for stores that have not yet committed
Baer p. 185
![Page 43: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/43.jpg)
States of a Store Buffer Entry
AV: Available
AD: Address is known (the data to be stored is still to be computed by another instruction)
RE: Result and Address known
CO: Committed
Transitions: address computation (AV → AD); store instruction reaches top of ROB (RE → CO); data written to cache (CO → AV).
What happens with store buffer on a branch misprediction?
Baer p. 185
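The entry life cycle above can be written down as a small state table (a sketch; the event names are made up, the states are the slide's AV/AD/RE/CO):

```python
# State machine for one store-buffer entry.
TRANSITIONS = {
    ("AV", "address_computed"): "AD",  # address becomes known
    ("AD", "data_arrives"):     "RE",  # result and address known
    ("RE", "reaches_rob_head"): "CO",  # store commits
    ("CO", "written_to_cache"): "AV",  # entry is recycled
}

def step(state, event):
    # events that do not apply in the current state leave it unchanged
    return TRANSITIONS.get((state, event), state)
```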
![Page 44: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/44.jpg)
Handling Store Buffer on Branch Misprediction and Exceptions
• Entries preceding the mispredicted branch:
– are in COMMIT state
– must be written to cache
• Entries following the misprediction:
– become AVAILABLE
• Exceptions: similar
– Must write the COMMIT entries to cache before handling the exception
Baer p. 186
![Page 45: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/45.jpg)
Load Instructions and Load Speculation
Baer p. 187
![Page 46: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/46.jpg)
Load/Store Window Implementation – Most Restricted
Load/Store Window (FIFO)
Loads/stores inserted in program order.
Loads/stores removed in the same order – at most one per cycle.
Single window for loads and stores.
Baer p. 187
![Page 47: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/47.jpg)
Load Bypassing
• Compare the address of the load with all addresses in the store buffer
– Load bypassing: if there is no match → the load can proceed
– What happens if the operand address of any entry in the store buffer is not yet computed?
• The load cannot proceed
– What happens if there is a match to an entry that is not committed?
• The load cannot access the cache
• The “match” is the last match in program order
• Need associative search of operand addresses in the store buffer
Baer p. 187
![Page 48: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/48.jpg)
Load Forwarding
• If these conditions are true:
– The load matches a store buffer entry AND
– The result is available for the entry (the entry is in RE or CO state)
• Then the result can be sent to the register specified by the load
• If the match is with an entry in AD state then:
– The load waits for the entry to reach RE state
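The bypassing and forwarding rules on the last two slides combine into one decision per ready load. A sketch under simplifying assumptions (every store older than the load is in the buffer; an entry whose address is not yet computed has `addr` set to `None`):

```python
def load_check(load_addr, store_buffer):
    """Decide what a ready load may do.
    store_buffer is ordered oldest to youngest; each entry is a
    (state, addr, data) tuple. Returns ('bypass', None) to access
    the cache, ('forward', data), or ('wait', None)."""
    # an unknown store address could alias the load: must wait
    if any(addr is None for (_, addr, _) in store_buffer):
        return "wait", None
    # the relevant match is the *last* one in program order
    match = None
    for state, addr, data in store_buffer:
        if addr == load_addr:
            match = (state, data)
    if match is None:
        return "bypass", None          # load bypassing
    state, data = match
    if state in ("RE", "CO"):
        return "forward", data         # load forwarding
    return "wait", None                # entry still in AD state
```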
![Page 49: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/49.jpg)
Load Speculation in Out-of-Order Architectures
Dynamic Memory Disambiguation Problem:
Loads are issued speculatively ahead of preceding stores in program order. How to ensure that data dependences are not violated?
![Page 50: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/50.jpg)
Three approaches
Pessimistic: wait until it is certain that the load can proceed (as in load forwarding and bypassing).
Optimistic: the load always proceeds speculatively. Needs a recovery mechanism.
Dependence prediction: use a predictor to decide whether or not to speculate. Try to have fewer recoveries.
![Page 51: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/51.jpg)
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 188
true dependency
Pessimistic: i3 and i4 cannot issue until i2 has computed its result:
• i2 must be at least in RE (Result)
• i4 proceeds once i1 and i2 are in AD (Address)
![Page 52: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/52.jpg)
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
true dependency
Optimistic: i3 and i4 issue as soon as possible (load-buffer entries are created).
When a store reaches CO, its address is compared associatively with the load-buffer entries.
![Page 53: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/53.jpg)
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
true dependency
Store Buffer:
AD i1 memadd1
AD i2 memadd2
Load Buffer:
1 i3 memadd3
1 i4 memadd4
(the leading bit indicates that the load is speculative)
When a store reaches CO, nothing happens because there is no match in the load buffer.
![Page 54: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/54.jpg)
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
true dependency
Store Buffer:
CO i1 memadd1
CO i2 memadd2
Load Buffer:
1 i3 memadd3
1 i4 memadd4
When i2 commits, its address matches i3's load-buffer entry:
i3 has to be reissued.
i4 has to be reissued because it is after i3 in program order.
Some implementations only reissue instructions that depend on i3.
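The recovery check in the optimistic scheme can be sketched as follows. This models the conservative variant that reissues everything after the first colliding load; the (spec_bit, address) entry format is illustrative:

```python
def store_commit_check(store_addr, load_buffer):
    """When a store reaches CO, compare its address associatively
    with the speculative load-buffer entries (oldest first). Return
    the indices of the loads that must be reissued: the first
    colliding speculative load and every later load in program order."""
    for i, (spec_bit, addr) in enumerate(load_buffer):
        if spec_bit and addr == store_addr:
            return list(range(i, len(load_buffer)))
    return []                          # no match: nothing happens
```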
![Page 55: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/55.jpg)
Example
i1: st R1, memadd1
⋅⋅⋅
i2: st R2, memadd2
⋅⋅⋅
i3: ld R3, memadd3
⋅⋅⋅
i4: ld R4, memadd4
Baer p. 189
true dependency
Dependence Prediction: with correct predictions, i4 can proceed and we avoid reissuing i3.
![Page 56: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/56.jpg)
Motivation: Optimistic
Memory dependencies are rare: less than 10% of loads depend on an earlier store.
Baer p. 190
![Page 57: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/57.jpg)
Motivation: Dependence Prediction
Load misspeculations are expensive and predictors can reduce them.
What strategy should we use for predicting profitable speculations?
Baer p. 190
![Page 58: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/58.jpg)
Simple Strategy
Memory dependencies are infrequent.
Predict that all loads can be speculated
If a load L is misspeculated
All subsequent instances of L must wait
We need a bit to remember. Where should this bit be stored?
Baer p. 190
![Page 59: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/59.jpg)
Simple strategy (cont.)
A single prediction bit P is associated with the instruction in the cache.
When a load instruction is brought into the cache → P = 1
Load is misspeculated → P = 0
Line evicted from cache and reloaded → P = 1
Strategy used in the DEC Alpha 21264.
Baer p. 190
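The single-bit scheme can be sketched as a tiny predictor keyed by the load's PC. This is an illustrative model of the mechanism, not the 21264's actual implementation:

```python
class OneBitLoadPredictor:
    def __init__(self):
        self.p = {}                 # load PC -> prediction bit P

    def on_cache_fill(self, pc):
        self.p[pc] = 1              # line (re)loaded: predict speculate

    def on_misspeculation(self, pc):
        self.p[pc] = 0              # subsequent instances must wait

    def may_speculate(self, pc):
        return self.p.get(pc, 1) == 1
```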
![Page 60: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/60.jpg)
Principle Behind Load Prediction
“static store-load instruction pairs thatcause most of the dynamic data misspredictionare relatively few and exhibit temporal locality.”
Moshovos A. , Breach S. E., Vijaykumar T. N., Sohi G. S.,“Dynamic Speculation and Synchronization of DataDependences,” International Symposium on ComputerArchitecture, (ISCA) 1997, Denver, CO, USA
![Page 61: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/61.jpg)
Ideal load speculation
Avoids mis-speculation.
Allows loads to execute as early as possible.
Loads with no true dependences → execute without delay.
A load with a true dependence → executes as soon as the store that produces the data commits.
Moshovos, ISCA 1997.
![Page 62: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/62.jpg)
A Real Predictor
Moshovos, ISCA 1997.
(i) Dynamically identify store-load pairs that are likely to be data dependent.
(ii) Provide a synchronization mechanism to instances of these dependences.
(iii) Use this mechanism to synchronize the store and the load.
![Page 63: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/63.jpg)
Load Predictor Table
Baer p. 190
Hash based on PC
Saturating counters
Predictor states:
• 00: strong no-speculate
• 01: weak no-speculate
• 10: weak speculate
• 11: strong speculate
Load buffer entry fields:
• tag
• op.address: memory address of the operand
• spec.bit: speculative load?
• update.bit: should the predictor be updated at commit/abort?
Each load instruction has a loadspec bit.
Incrementing a saturating counter moves it toward strong speculate.
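The four states form a standard 2-bit saturating counter; its update rules can be sketched as follows (the reset-on-abort rule comes from the load-abort action described on a later slide):

```python
STRONG_NOSPECULATE, STRONG_SPECULATE = 0, 3

def increment(counter):
    # move one step toward strong speculate, saturating at 11
    return min(counter + 1, STRONG_SPECULATE)

def on_load_abort(counter):
    # a misspeculation resets the predictor to strong no-speculate
    return STRONG_NOSPECULATE

def predict_speculate(counter):
    # the high bit decides: 10 and 11 mean speculate
    return counter >= 2
```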
![Page 64: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/64.jpg)
Load/Decode Stage
• Set loadspec bit according to value of counter associated with the load PC
Baer p. 190
![Page 65: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/65.jpg)
After Operand Address is Computed
Uncommitted younger stores?
No → enter in the load buffer: tag, op.ad, spec.bit = 0, update.bit = 0; issue cache access.
Yes → check the loadspec bit:
On → enter in the load buffer: tag, op.ad, spec.bit = 1, update.bit = 0; issue cache access.
Off → enter in the load buffer: tag, op.ad, spec.bit = 0, update.bit = 1; wait (as in the pessimistic solution).
Baer p. 190
![Page 66: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/66.jpg)
Store Commit Stage
For all matches in the load buffer, check the spec.bit:
Off → update.bit ← 1 (it was correct to not speculate, and the load should keep not speculating in the future)
On → Load Abort: Predictor ← Strong NoSpeculate; recover from the misspeculated load
Baer p. 191
![Page 67: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/67.jpg)
Load Commit Stage
spec.bit On → increment the saturating counter (speculating was correct)
spec.bit Off → check the update.bit:
Off → increment the saturating counter (we would like to speculate in the future)
On → predictor ← strong no-speculate
Baer p. 191
![Page 68: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/68.jpg)
Store Sets
Baer p. 191
![Page 69: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/69.jpg)
Motivation for Store Sets
• The past is a good predictor of future memory-order violations.
• Must also predict:
– when one load is dependent on multiple stores
– when multiple loads depend on one store.
(Figure: stores A, B, C and loads D, E, F illustrating both dependence patterns.)
Chrysos, G. Z. and Emer, J. S., “Memory Dependence Prediction using Store Sets,” International Symposium on Computer Architecture (ISCA), 1998, pp. 142-153.
![Page 70: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/70.jpg)
Store Set Definition
Given a load L, the store set of L is the set of all stores that L has ever depended upon.
Ideally, any time a store-load dependence is detected, the store is added to the load's store set table.
To make a prediction, the store set table of the load is searched for all uncommitted younger stores.
ChrysosISCA98
Too expensive! We need an approximation.
![Page 71: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/71.jpg)
Implementation of Store Sets Memory Dependence Prediction
Both loads and stores have entries in Store Set ID Table.
Chrysos, ISCA 1998.
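The two tables can be sketched as follows. This is a toy model: the table size, the modulo hash, and the SSID assignment on a violation are made-up simplifications of the Chrysos-Emer design.

```python
SSIT_SIZE = 16   # illustrative size

ssit = {}        # hashed PC -> store-set ID (SSID)
lfst = {}        # SSID -> sequence number of the last fetched store

def on_violation(store_pc, load_pc):
    """A memory-order violation puts both PCs into the same set."""
    ssid = store_pc % SSIT_SIZE          # toy SSID assignment
    ssit[store_pc % SSIT_SIZE] = ssid
    ssit[load_pc % SSIT_SIZE] = ssid

def on_store_fetch(store_pc, seq_num):
    """A fetched store becomes the last fetched store of its set."""
    ssid = ssit.get(store_pc % SSIT_SIZE)
    if ssid is not None:
        lfst[ssid] = seq_num

def load_must_wait_for(load_pc):
    """A load waits only for the last fetched store in its set."""
    ssid = ssit.get(load_pc % SSIT_SIZE)
    return lfst.get(ssid) if ssid is not None else None
```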
![Page 72: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/72.jpg)
Store Set Examples: Multiple loads depend on one store

j: load add1
k: load add2
⋅⋅⋅
i: store add3

[Diagram: SSIT and LFST entries for i, j, and k.]
Baer p. 192
![Page 73: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/73.jpg)
Store Set Examples: One load depends on multiple stores

i: store add2
j: store add3
⋅⋅⋅
k: load add1

[Diagram: SSIT and LFST entries for i, j, and k.]
Baer p. 192
![Page 74: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/74.jpg)
Store Set Examples: Multiple loads depend on multiple stores

i: store add2
j: store add3
⋅⋅⋅
k: load add1
⋅⋅⋅
l: load add4

[Diagram: SSIT and LFST entries for i, j, k, and l.]

There is a conflict between the LFST entries associated with i and l.
The winner is the entry with the smaller index in the SSIT; the loser is made to point to the winner’s entry.
Baer p. 192
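The merge rule above fits in a few lines. Representing the SSIT as a plain list and store-set IDs as small integers is an assumption for illustration only.

```python
def merge_store_sets(ssit, idx_a, idx_b):
    """Merge the store sets of two SSIT entries.

    The set with the smaller ID wins; both entries are redirected to it,
    so the two instructions end up sharing one LFST entry.
    """
    winner = min(ssit[idx_a], ssit[idx_b])
    ssit[idx_a] = winner
    ssit[idx_b] = winner
    return winner
```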
![Page 75: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/75.jpg)
Evaluating Load Speculation
• Performance benefits from load speculation depend on:
– the speculation miss rate
– the cost of misspeculation recovery
Baer p. 194
![Page 76: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/76.jpg)
Evaluating Load Speculation - Terminology
Conflicting load: at the time the load is ready to issue, there is a previous store in the instruction window whose operand address is unknown.
Colliding load: the load is dependent on one of the stores with which it conflicts.
Baer p. 194
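The two definitions above can be made concrete with a small classifier. The representation is an assumption: each older store is a pair (address known at issue time?, eventual address), so we can tell after the fact whether an unknown-address store actually matched the load.

```python
def classify_load(load_addr, older_stores):
    """Classify a load that is ready to issue.

    older_stores: list of (addr_known_at_issue, actual_addr) pairs for
    stores preceding the load in the instruction window.
    """
    # Addresses that were still unknown when the load was ready to issue.
    unknown = [addr for known, addr in older_stores if not known]
    if not unknown:
        return "non-conflicting"   # every older store address was known
    if load_addr in unknown:
        return "colliding"         # the load depends on a conflicting store
    return "conflicting"           # conflict existed, but no real dependence
```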
![Page 77: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/77.jpg)
Evaluating Load Speculation – Typical measurements
• In a 32-entry load-store window:
– 25% of loads are non-conflicting
– of the 75% conflicting loads, only 10% actually collide
• In larger windows:
– the percentage of non-conflicting loads increases
– the percentage of colliding loads decreases
Baer p. 194
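As a quick check of the figures above: if 75% of loads conflict and only 10% of those actually collide, then colliding loads are only 0.75 × 0.10 = 7.5% of all loads, which is why speculating on the absence of a dependence usually pays off.

```python
# Worked arithmetic for the measurements above.
conflicting = 0.75             # fraction of all loads that conflict
collide_given_conflict = 0.10  # fraction of conflicting loads that collide
colliding_overall = conflicting * collide_given_conflict
print(colliding_overall)       # 0.075, i.e., 7.5% of all loads
```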
![Page 78: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/78.jpg)
Back-End Optimizations
• Branch prediction
– “a must”
• Load speculation (loads bypassing stores)
– “important” because other instructions depend on the load
• Prediction of load latency
– “common,” to hide load latency in the cache hierarchy
Baer p. 195
![Page 79: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/79.jpg)
Other Back-End Optimizations
• Value Prediction
– predict the value that an instruction will compute
• may be restricted to the values loaded by loads
• Critical Instructions
– predict which instructions are on the critical path
Baer p. 196-201
![Page 80: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/80.jpg)
Clustered Microarchitectures
Baer p. 201
![Page 81: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/81.jpg)
Back-end Limitations to m
Large windows: a large m requires large windows, which are expensive in hardware and power dissipation.
Many functional units: require many (long) buses, which affect forwarding.
Centralized resources (e.g., the register file): large structures with many ports.
Baer p. 201
![Page 82: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/82.jpg)
Definition of a Cluster
• A cluster is formed by:
– a set of functional units
– a register file
– an instruction window (or reservation stations)
Baer p. 201
![Page 83: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/83.jpg)
Clustered Microarchitecture
Baer p. 202
![Page 84: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/84.jpg)
Register File Replication
• A copy of the register file in each cluster
– small number of clusters
– can use a crossbar switch for interconnection
– example (Alpha 21264):
• the integer unit is two clusters
• each cluster has a full copy of the 80 registers
Baer p. 202
![Page 85: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/85.jpg)
Changes because of Clustering
• Front end
– steer each instruction to the window of one cluster
• static: compile-time decision
• dynamic: decided by hardware at runtime
• Back end
– copy results into the registers of other clusters
– intercluster latency affects wakeup and select
Baer p. 202
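A minimal sketch of a dynamic steering heuristic, under a simple assumed model (all names are hypothetical): send each instruction to the cluster that produces most of its source operands, and break ties toward the least-loaded cluster so work stays balanced.

```python
def steer(src_regs, producer_cluster, cluster_load, n_clusters=2):
    """Pick a cluster for an instruction.

    src_regs: names of the instruction's source registers.
    producer_cluster: register name -> cluster that produces it.
    cluster_load: current number of queued instructions per cluster.
    """
    # Count how many source operands each cluster produces.
    votes = [0] * n_clusters
    for r in src_regs:
        if r in producer_cluster:
            votes[producer_cluster[r]] += 1
    best = max(votes)
    # Among clusters tied for most local producers, pick the least loaded.
    candidates = [c for c in range(n_clusters) if votes[c] == best]
    return min(candidates, key=lambda c: cluster_load[c])
```

With no operand information the heuristic degenerates to pure load balancing, which matches the second of the two conflicting goals above.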
![Page 86: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/86.jpg)
Effect of Clustering on Performance
• Latency to forward results between clusters
• Sensitive to load balancing between clusters
• Conflicting goals:
– keep producers and consumers of data in the same cluster
– balance the workload
Baer p. 202
![Page 87: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/87.jpg)
Distributed Register Files
• Steering affects renaming
– Assume that an instruction a is assigned to cluster ci
• a free register from ci will be used for the result of a
– If an operand of a is produced by an instruction b in a cluster cj, what needs to be done?
1. Another free register of ci is assigned to this operand.
2. A copy instruction is inserted in cj immediately after b.
3. The copy is kept in ci for use by other instructions.
Baer p. 203
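The three renaming steps above can be sketched as follows, under assumed data structures (free-register lists per cluster, a cache of already-copied operands); every name here is illustrative, not a real machine's mechanism.

```python
def rename_operand(src, ci, cj, copies, free_regs, copy_list):
    """Rename one source operand of an instruction assigned to cluster ci.

    src: physical register holding the operand, produced in cluster cj.
    copies[(src, ci)]: caches the local register of an operand already
    copied into ci, so later consumers in ci reuse it (step 3).
    """
    if ci == cj:
        return src                           # same cluster: read directly
    if (src, ci) in copies:
        return copies[(src, ci)]             # step 3: reuse an earlier copy
    local = free_regs[ci].pop()              # step 1: free register in ci
    copy_list.append((cj, src, ci, local))   # step 2: copy emitted in cj
    copies[(src, ci)] = local
    return local
```

A second consumer in ci of the same remote value finds the cached mapping and no extra copy instruction is emitted.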
![Page 88: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/88.jpg)
Clustered microarchitectures can be seen as astep in the evolution from monolithic processorsto multiprocessors.
![Page 89: Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters](https://reader035.fdocuments.net/reader035/viewer/2022062314/56814544550346895db2115d/html5/thumbnails/89.jpg)
Chapter Summary: the back end is important for performance
– Tomasulo Algorithm
– Centralized/Distributed/Hybrid windows
– Wakeup/Select steps
– Scheduling: critical instructions first
– Loads:
• bypassing stores
• forwarding values
• speculating on the absence of dependences with stores
– Clustering to reduce wiring complexity