Out-of-Order Execution Structures Optimizations

A. Moshovos © ECE1773 - Fall ’07, ECE Toronto

Transcript of Out-of-Order Execution Structures Optimizations

Page 1: Out-of-Order Execution Structures Optimizations

Out-of-Order Execution Structures Optimizations

Page 2: Out-of-Order Execution Structures Optimizations

Tag Elimination

Page 3: Out-of-Order Execution Structures Optimizations

Conventional Schedulers are Overdesigned

• For MIPS-like ISA
  – Two source tags
  – One destination tag
• Not all instructions use two source operands
  – E.g., addi $1, $2, 10
• Not all instructions produce a result that is interesting for scheduling
  – E.g., beq
• Some operands are ready when the instruction enters the scheduler
• Source: Efficient Dynamic Scheduling Through Tag Elimination, Dan Ernst and Todd Austin, ISCA 2002

Page 4: Out-of-Order Execution Structures Optimizations

Some Operands are Ready when the Instruction Enters the Scheduler

Page 5: Out-of-Order Execution Structures Optimizations

Window Specialization

• Have reservation stations with different source operand wait capabilities

Page 6: Out-of-Order Execution Structures Optimizations

Window Specialization

• At rename check how many source operands are not ready
• If there is an appropriate slot proceed to schedule
• If not, stall at rename
• Advantages:
  – Destination bus only runs over reservation stations with comparators
  – Load on the destination bus is reduced
• Disadvantages:
  – Stalls due to unavailability of reservation stations
  – Complexity of reservation station assignment
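The rename-time check above can be sketched as follows. This is only an illustration: the class name `SpecializedWindow`, the slot counts, and the "smallest slot that fits" policy are assumptions made for the example, not details taken from the slides or the paper.

```python
class SpecializedWindow:
    """Toy model of reservation stations grouped by comparator count (0, 1 or 2)."""

    def __init__(self, free_slots):
        # free_slots[k] = number of free stations able to wait on up to k tags.
        self.free_slots = dict(free_slots)

    def allocate(self, unready_sources):
        """Return the capacity of the slot taken, or None to stall at rename."""
        # Assumed policy: use the smallest slot that can still wait on
        # every unready operand, keeping larger slots for later instructions.
        for capacity in sorted(self.free_slots):
            if capacity >= unready_sources and self.free_slots[capacity] > 0:
                self.free_slots[capacity] -= 1
                return capacity
        return None  # no appropriate slot: stall at rename

    def release(self, capacity):
        """Give a slot back once its instruction has issued."""
        self.free_slots[capacity] += 1


window = SpecializedWindow({0: 1, 1: 1, 2: 1})
print(window.allocate(2))  # takes the only two-tag slot
print(window.allocate(2))  # no two-tag slot left, so rename would stall
```

The second call shows the disadvantage listed above: an instruction can stall even though other, less capable slots are still free.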

Page 7: Out-of-Order Execution Structures Optimizations

Window Specialization - Performance

Performance as IPC – Actual Clock Frequency not considered

Page 8: Out-of-Order Execution Structures Optimizations

Window Specialization - Performance

Performance as IPC per ns

Page 9: Out-of-Order Execution Structures Optimizations

Last Tag Prediction

• Observe:
  – Instruction becomes ready after the last tag it waits for appears
• Last Tag prediction
  – Predict which of the two tags that will be
• Speculatively execute
  – Correct speculation: that was the last tag
  – Incorrect speculation:
    • Need to reschedule
    • Detection? Try to read a value that is not available

Page 10: Out-of-Order Execution Structures Optimizations

GShare-Style Last Tag Prediction

Two-bit saturating counters
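A gshare-style table of two-bit saturating counters can be sketched as below. The slide only names the style and the counter width; the table size, the PC/history hash, the "left"/"right" encoding, and passing the history in as an argument are all assumptions made for this example.

```python
class LastTagPredictor:
    """Toy gshare-style predictor: which source tag (left or right) arrives last."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Two-bit counters in 0..3; values 0-1 predict "left", 2-3 predict "right".
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc, history):
        # gshare-style hash: XOR the instruction address with a history register.
        # Which history is used is an assumption; the caller supplies it here.
        return ((pc >> 2) ^ history) & self.mask

    def predict(self, pc, history):
        return "right" if self.counters[self._index(pc, history)] >= 2 else "left"

    def update(self, pc, history, last_was_right):
        i = self._index(pc, history)
        if last_was_right:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


predictor = LastTagPredictor(index_bits=4)
predictor.update(0x40, 0b1010, last_was_right=True)
print(predictor.predict(0x40, 0b1010))
```

Because the counters saturate, a single atypical outcome does not flip a strongly biased prediction.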

Page 11: Out-of-Order Execution Structures Optimizations

Accuracy

• Over all instructions with two outstanding operands

Page 12: Out-of-Order Execution Structures Optimizations

Window Specialization - Performance

Performance as IPC – Actual Clock Frequency not considered

Page 13: Out-of-Order Execution Structures Optimizations

Window Specialization - Performance

Performance as IPC per ns

Page 14: Out-of-Order Execution Structures Optimizations

Prescheduling

Data-flow prescheduling for large instruction windows in out-of-order processors, Pierre Michaud, André Seznec, HPCA 2001

Page 15: Out-of-Order Execution Structures Optimizations

Prescheduling

• Predict latencies
• Put scheduled instructions into a FIFO
• Slide into a smaller window
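The three steps above can be illustrated with a small software model. It is a sketch under stated assumptions: the function name `preschedule`, the tuple layout of an instruction, and the fixed per-opcode latencies are invented for the example, and sources with no in-flight producer are treated as already available.

```python
def preschedule(instructions, latency):
    """Order instructions by predicted issue cycle.

    instructions: list of (name, op, source_names) in program order.
    latency: dict mapping op to its predicted latency in cycles.
    Returns (fifo_order, predicted_issue_cycle).
    """
    ready_at = {}   # producer name -> cycle its result is predicted to be available
    issue_at = {}   # instruction name -> predicted issue cycle
    for name, op, sources in instructions:
        # An instruction can issue once its slowest in-flight producer is done.
        issue = max((ready_at[s] for s in sources if s in ready_at), default=0)
        issue_at[name] = issue
        ready_at[name] = issue + latency[op]
    # Stable sort: ties keep program order. A small issue window would then
    # only need to look at the head of this FIFO.
    fifo = sorted(issue_at, key=issue_at.get)
    return fifo, issue_at


program = [
    ("i1", "load", []),
    ("i2", "add", ["i1"]),
    ("i3", "add", []),
    ("i4", "mul", ["i2", "i3"]),
]
fifo, issue_at = preschedule(program, {"load": 3, "add": 1, "mul": 4})
print(fifo)      # ['i1', 'i3', 'i2', 'i4']
print(issue_at)  # {'i1': 0, 'i2': 3, 'i3': 0, 'i4': 4}
```

If a latency prediction is wrong (for example a load that misses in the cache), the order is no longer right, which is why the predicted order feeds a small conventional window rather than replacing it.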

Page 16: Out-of-Order Execution Structures Optimizations

Prescheduling Method

Page 17: Out-of-Order Execution Structures Optimizations

Prescheduling Example

Page 18: Out-of-Order Execution Structures Optimizations

Latency Prediction

Page 19: Out-of-Order Execution Structures Optimizations

Latency Prediction Contd.

Page 20: Out-of-Order Execution Structures Optimizations

Broadcast Free Scheduler

Page 21: Out-of-Order Execution Structures Optimizations

Broadcast Free Scheduler

• Cyclone design
  – D. Ernst, A. Hamel, T. Austin
  – ISCA 2003
• Preschedule instructions
• Put them into a dual strip cyclical FIFO
• Vertical paths allow for motion between the strips

Page 22: Out-of-Order Execution Structures Optimizations

Cyclone Architecture

Will be ready in cycle + 6

Page 23: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 1

Page 24: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 2

Page 25: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 3

Page 26: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 4

Page 27: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 5

Page 28: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 6

Page 29: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Cycle + 6

Page 30: Out-of-Order Execution Structures Optimizations

Cyclone Architecture – Mis-scheduling

Estimate new latency

Page 31: Out-of-Order Execution Structures Optimizations

Pre-scheduler

Insert instruction with predicted latency N at the front of the FIFO

Have it switch at N/2

Can only do two cascaded MAX calculations due to timing considerations

Page 32: Out-of-Order Execution Structures Optimizations

Cyclone IPC Performance

Page 33: Out-of-Order Execution Structures Optimizations

Cyclone True Performance and Area

Page 34: Out-of-Order Execution Structures Optimizations

Matrix Schedulers

Page 35: Out-of-Order Execution Structures Optimizations

Conventional Scheduler

WS requests

IW grants

Page 36: Out-of-Order Execution Structures Optimizations

Conventional Scheduler Timing

[Timing diagram: instructions A2, B1, B3]

Can’t pipeline without introducing bubbles between dependent instructions

Source: A High-Speed Dynamic Instruction Scheduling Scheme for Superscalar Processors, Masahiro Goshima, Kengo Nishino, Yasuhiko Nakashima, Shin-ichiro Mori, Toshiaki Kitamura, Shinji Tomita, MICRO 2001

Page 37: Out-of-Order Execution Structures Optimizations

Towards a Matrix Scheduler

• Observe:
  – In conventional scheduling dependences are discovered twice:
    • Once at renaming
    • Once during scheduling
  – Why? Dependences are implicitly represented
    • Producer and Consumer link via a name
    • This is indirect
• Matrix Scheduler idea:
  – Represent dependences explicitly
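One way to picture "representing dependences explicitly" is a bit matrix with one row per waiting instruction and one column per possible producer. The sketch below is a software toy, not the circuit from the paper: the class and method names are invented, and it assumes single-cycle producers so that wakeup can happen at issue time.

```python
class MatrixScheduler:
    """Toy dependence matrix: matrix[consumer][producer] is True while waiting."""

    def __init__(self, size):
        self.size = size
        self.matrix = [[False] * size for _ in range(size)]
        self.valid = [False] * size

    def insert(self, slot, producer_slots):
        """At rename, write a full row naming the slots this instruction waits on."""
        self.matrix[slot] = [False] * self.size
        for producer in producer_slots:
            self.matrix[slot][producer] = True
        self.valid[slot] = True

    def ready(self):
        """An instruction requests issue when its row holds no set bit."""
        return [s for s in range(self.size)
                if self.valid[s] and not any(self.matrix[s])]

    def issue(self, slot):
        """Issue a slot and wake consumers by clearing its column (no tag match)."""
        self.valid[slot] = False
        for row in self.matrix:
            row[slot] = False


sched = MatrixScheduler(4)
sched.insert(0, [])       # no outstanding sources
sched.insert(1, [0])      # waits on slot 0
sched.insert(2, [0, 1])   # waits on slots 0 and 1
print(sched.ready())      # [0]
sched.issue(0)
print(sched.ready())      # [1]
```

The point of the structure is that wakeup becomes a column clear rather than an associative tag comparison across the whole window.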

Page 38: Out-of-Order Execution Structures Optimizations

Dependence Matrix

[Figure labels: "Who am I" (vertical axis); "Who do I depend upon?"; "Left source", "Right source"]

Page 39: Out-of-Order Execution Structures Optimizations

Matrix Scheduler

wakeup

Write port

Page 40: Out-of-Order Execution Structures Optimizations

Inserting an entry

wakeup

Write port

Page 41: Out-of-Order Execution Structures Optimizations

Wakeup

wakeup

Page 42: Out-of-Order Execution Structures Optimizations

Misspeculation Recovery

• Do not clean up
• Use external logic to inhibit request signals

Page 43: Out-of-Order Execution Structures Optimizations

Delay

Partial wakeup lines

0.18 µm, 1.8 V, 85 °C

Page 44: Out-of-Order Execution Structures Optimizations

Delay measurement points

Page 45: Out-of-Order Execution Structures Optimizations

Scheduling Priorities

Page 46: Out-of-Order Execution Structures Optimizations

Conflict Resolution

• More instructions ready than available issue slots
  – Which get to go?
• Age vs. Pseudo-Random Resolution
• Age is important
• Priority Enforcer picks the oldest
  – Complex

Source: Matrix Scheduler Reloaded, ISCA 2007
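The two resolution policies named above can be contrasted in a few lines. This is a behavioral sketch only: the function names and the `(age, name)` representation are assumptions for illustration, and real select logic is a hardware priority circuit rather than a sort.

```python
import random


def select_oldest(ready, issue_width):
    """Age-based select. ready is a list of (age, name); smaller age means older."""
    return [name for _, name in sorted(ready)[:issue_width]]


def select_pseudo_random(ready, issue_width, rng):
    """Pseudo-random select: pick any issue_width ready instructions."""
    picks = rng.sample(ready, min(issue_width, len(ready)))
    return [name for _, name in picks]


ready = [(7, "c"), (2, "a"), (5, "b")]
print(select_oldest(ready, 2))                           # ['a', 'b']
print(select_pseudo_random(ready, 2, random.Random(0)))  # some two of a, b, c
```

Age-based selection favors the instructions most likely to be holding up others, which is the sense in which "age is important"; the cost is the extra ordering logic.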

Page 47: Out-of-Order Execution Structures Optimizations

Compacting Scheduler

• Implemented in the Alpha 21264
• Physical order within scheduler corresponds to age
• Entry freed:
  – Shift up all younger entries
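The compaction rule above can be modeled with an ordered list in which position encodes age. The class below is a hypothetical software analogy, not a description of the Alpha 21264 circuit; its names and the `is_ready` callback are assumptions for the example.

```python
class CompactingScheduler:
    """Toy compacting queue: index 0 is always the oldest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def insert(self, name):
        """New (youngest) instructions enter at the bottom; False means full."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append(name)
        return True

    def free(self, name):
        # Removing an entry closes the gap: every younger entry shifts up one.
        self.entries.remove(name)

    def oldest_ready(self, is_ready):
        """Because position equals age, a top-down scan finds the oldest ready entry."""
        for name in self.entries:
            if is_ready(name):
                return name
        return None


queue = CompactingScheduler(capacity=4)
for name in ["i1", "i2", "i3"]:
    queue.insert(name)
queue.free("i1")
print(queue.entries)  # ['i2', 'i3']
```

Keeping entries in age order makes oldest-first selection a simple positional priority, at the price of moving entries every time a slot is freed.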

Page 48: Out-of-Order Execution Structures Optimizations

Virtual Physical Registers

• Physical register names are used for two purposes
  – Scheduling
  – Communicating
• A physical register is allocated much earlier than needed
  – We need the register only after the value is produced
• De-couple scheduling from communication names
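The decoupling described above can be sketched by handing out a scheduling-only tag at rename and binding real storage only when the value exists. Everything named here (`VirtualPhysicalRenamer`, `writeback`, `release`) is hypothetical, and the sketch assumes allocation happens at writeback; it does not say what the hardware does when no register is free.

```python
class VirtualPhysicalRenamer:
    """Toy model: virtual tags for scheduling, physical registers for values."""

    def __init__(self, num_physical):
        self.free_physical = list(range(num_physical))
        self.next_virtual = 0
        self.virtual_to_physical = {}

    def rename(self):
        """At rename, assign only a virtual tag; no storage is reserved yet."""
        tag = self.next_virtual
        self.next_virtual += 1
        return tag

    def writeback(self, tag):
        """When the value is produced, bind a physical register to the tag.

        Returns the register number, or None if none is free (handling that
        case is outside this sketch).
        """
        if not self.free_physical:
            return None
        register = self.free_physical.pop(0)
        self.virtual_to_physical[tag] = register
        return register

    def release(self, tag):
        """Free the physical register once the value is no longer needed."""
        self.free_physical.append(self.virtual_to_physical.pop(tag))


renamer = VirtualPhysicalRenamer(num_physical=1)
t0, t1 = renamer.rename(), renamer.rename()
print(len(renamer.free_physical))  # 1: renaming two instructions used no storage
print(renamer.writeback(t1))       # 0: the register is bound only now
```

With conventional renaming both instructions would have claimed a register at rename, so the second rename would already have stalled with one physical register.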

Page 49: Out-of-Order Execution Structures Optimizations

Used vs. Allocated Registers

Page 50: Out-of-Order Execution Structures Optimizations

Goal

Page 51: Out-of-Order Execution Structures Optimizations

Virtual Physical Registers

Page 52: Out-of-Order Execution Structures Optimizations

Deadlock

• Older instruction completes later than younger ones
  – No registers available
• Steal a register and re-execute

Page 53: Out-of-Order Execution Structures Optimizations

Performance vs. Physical Registers