Transcript of Chapter 3: ILP

Page 1: Chapter3 ilp

Chapter 3

ILP

1

Page 2: Chapter3 ilp

2

Three Generic Data Hazards

Instruction I occurs before instruction J in program order.

• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

I: add r1,r2,r3
J: sub r4,r1,r3

Page 3: Chapter3 ilp

3

Three Generic Data Hazards

• Write After Read (WAR): InstrJ writes an operand before InstrI reads it

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:

– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Page 4: Chapter3 ilp

4

Three Generic Data Hazards

• Write After Write (WAW): InstrJ writes an operand before InstrI writes it.

• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:

– All instructions take 5 stages, and
– Writes are always in stage 5

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Page 5: Chapter3 ilp

5

Branch Hazards

• Loop unrolling

• Branch prediction

– Static

– Dynamic

Page 6: Chapter3 ilp

6

Static Branch Prediction

• Scheduling (reordering) code around a delayed branch
• Need to predict the branch statically, at compile time
• Use profile information collected from earlier runs
• Behavior of a branch is often bimodally distributed!
  • Biased toward taken or not taken
• Effectiveness depends on
  • frequency of branches and accuracy of the scheme

[Figure: misprediction rate of profile-based static prediction per SPEC benchmark, roughly 4% to 22%; y-axis “Misprediction Rate” from 0% to 25%; integer and FP benchmarks shown separately.]

Integer benchmarks have higher branch frequency

Page 7: Chapter3 ilp

7

Dynamic Branch Prediction

• Why does prediction work?
  • Underlying algorithm has regularities
  • Data that is being operated on has regularities

• Is dynamic branch prediction better than static branch prediction?
  • Seems to be
  • There are a small number of important branches in programs which have dynamic behavior

Performance = ƒ(accuracy, cost of misprediction)

Page 8: Chapter3 ilp

8

1-Bit Branch Prediction

• Branch History Table: lower bits of the PC address index a table of 1-bit values

• Says whether or not the branch was taken last time

for (i=0; i<100; i++) {….}

        addi r10, r0, 100
        addi r1, r0, 0        ; r1 = i = 0
L1:     …
        …
        addi r1, r1, 1
        bne  r1, r10, L1
        …

[Figure: instruction addresses (0x40010100, 0x40010104, 0x40010108, …, 0x40010A04, 0x40010A08) index a 1-bit Branch History Table whose T/NT entries supply the prediction.]
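As an illustration (not from the slides), a 1-bit BHT can be sketched in C; the table size and PC indexing below are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024                      /* assumed table size */

static bool took_branch[BHT_ENTRIES];         /* 1 bit per entry: taken last time? */

static unsigned bht1_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);     /* low PC bits, word-aligned */
}

bool predict_1bit(uint32_t pc) {
    return took_branch[bht1_index(pc)];       /* predict what happened last time */
}

void update_1bit(uint32_t pc, bool taken) {
    took_branch[bht1_index(pc)] = taken;      /* remember the actual outcome */
}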

Page 9: Chapter3 ilp

9

1-Bit Bimodal Prediction (SimpleScalar Term)

• For each branch, keep track of what happened last time and use that outcome as the prediction
• Change mind fast

Page 10: Chapter3 ilp

10

1-Bit Branch Prediction

• What is the drawback of using lower bits of the PC?
  • Different branches may have the same lower-bit value

• What is the performance shortcoming of a 1-bit BHT?
  • In a loop, a 1-bit BHT will cause two mispredictions:
    • End-of-loop case, when it exits instead of looping as before
    • First time through the loop on the next pass through the code, when it predicts exit instead of looping

Page 11: Chapter3 ilp

11

2-bit Saturating Up/Down Counter Predictor

• Solution: a 2-bit scheme where the prediction is changed only after two consecutive mispredictions

[Figure: 2-bit branch prediction state diagram.]

Page 12: Chapter3 ilp

12

2-Bit Bimodal Prediction (SimpleScalar Term)

• For each branch, maintain a 2-bit saturating counter:
  if the branch is taken: counter = min(3, counter+1)
  if the branch is not taken: counter = max(0, counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
• Can be easily extended to N bits (in most processors, N=2)
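A minimal C sketch of this counter update; the table size and PC indexing are assumptions, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096                      /* assumed table size */

static uint8_t counter2[BHT_ENTRIES];         /* 2-bit saturating counters, 0..3 */

static unsigned bht2_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);     /* low PC bits index the table */
}

bool predict_2bit(uint32_t pc) {
    return counter2[bht2_index(pc)] >= 2;     /* counter >= 2 => predict taken */
}

void update_2bit(uint32_t pc, bool taken) {
    uint8_t *c = &counter2[bht2_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }        /* counter = min(3, counter+1) */
    else       { if (*c > 0) (*c)--; }        /* counter = max(0, counter-1) */
}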

Page 13: Chapter3 ilp

13

Branch History Table

• Misprediction reasons:
  • Wrong guess for that branch
  • Got the branch history of the wrong branch when indexing the table

Page 14: Chapter3 ilp

14

Branch History Table (4096-entry, 2-bits)

Branch intensive benchmarks have higher miss rate. How can we solve this problem?

Increase the buffer size, or
Increase the accuracy

Page 15: Chapter3 ilp

15

Branch History Table (Increase the size?)

Need to focus on increasing the accuracy of the scheme!

Page 16: Chapter3 ilp

16

Correlated Branch Prediction

• Standard 2-bit predictor uses local information

• Fails to look at the global picture

•Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch

Page 17: Chapter3 ilp

17

Correlated Branch Prediction

• A shift register captures the local path through the program
• For each unique path a predictor is maintained
• Prediction is based on the behavior history of each local path
• Shift register length determines program region size

Page 18: Chapter3 ilp

18

Correlated Branch Prediction

• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table (see the sketch after the code below)

• In general, an (m,n) predictor means: record the last m branches to select between 2^m history tables, each with n-bit counters

• The old 2-bit BHT is then a (0,2) predictor

if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) do something;
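A C sketch of an (m,n) predictor with m=2, n=2; the per-table size and the way history and PC combine are assumptions:

#include <stdbool.h>
#include <stdint.h>

#define M        2                            /* bits of global branch history */
#define ENTRIES  1024                         /* assumed entries per table */

static uint8_t tables[1 << M][ENTRIES];       /* 2^m tables of 2-bit counters */
static unsigned history;                      /* last m outcomes, newest in bit 0 */

static uint8_t *corr_entry(uint32_t pc) {
    unsigned t = history & ((1u << M) - 1);   /* history selects the table */
    return &tables[t][(pc >> 2) & (ENTRIES - 1)];
}

bool predict_corr(uint32_t pc) {
    return *corr_entry(pc) >= 2;              /* 2-bit counter as before */
}

void update_corr(uint32_t pc, bool taken) {
    uint8_t *c = corr_entry(pc);
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    history = (history << 1) | (taken ? 1u : 0u);   /* shift in the outcome */
}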

Page 19: Chapter3 ilp

19

Correlated Branch Prediction

Global Branch History: m-bit shift register keeping T/NT status of last m branches.

Page 20: Chapter3 ilp

20

Accuracy of Different Schemes

[Figure: “Frequency of Mispredictions” (y-axis 0% to 18%) for the SPEC89 benchmarks nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, and tomcatv, comparing a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT. Per-benchmark rates range from 0% to 11%, and the (2,2) correlating predictor generally does best.]

Page 21: Chapter3 ilp

21

Tournament Predictors

• A local predictor might work well for some branches or programs, while a global predictor might work well for others

• Provide one of each and maintain another predictor to identify which predictor is best for each branch

[Figure: the branch PC feeds both a Local Predictor and a Global Predictor; a Tournament Predictor drives a MUX that selects between their predictions.]

Page 22: Chapter3 ilp

22

Tournament Predictors

• Multilevel branch predictor
  • Selector for the Global and Local predictors of correlating branch prediction
• Use an n-bit saturating counter to choose between predictors
• Usual choice is between global and local predictors (a sketch follows below)
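A C sketch of the chooser counter; the component predictors are assumed to exist elsewhere (e.g., the 2-bit and correlating sketches above), and the chooser table size is an assumption:

#include <stdbool.h>
#include <stdint.h>

#define CH_ENTRIES 1024                       /* assumed chooser table size */

static uint8_t chooser[CH_ENTRIES];           /* >= 2 means trust the global predictor */

bool predict_local(uint32_t pc);              /* assumed component predictors */
bool predict_global(uint32_t pc);

bool predict_tournament(uint32_t pc) {
    unsigned i = (pc >> 2) & (CH_ENTRIES - 1);
    return chooser[i] >= 2 ? predict_global(pc) : predict_local(pc);
}

/* Real hardware latches both component predictions at predict time;
   re-calling them here just keeps the sketch short. */
void update_tournament(uint32_t pc, bool taken) {
    unsigned i = (pc >> 2) & (CH_ENTRIES - 1);
    bool g_ok = (predict_global(pc) == taken);
    bool l_ok = (predict_local(pc)  == taken);
    if (g_ok && !l_ok) { if (chooser[i] < 3) chooser[i]++; }  /* favor global */
    if (l_ok && !g_ok) { if (chooser[i] > 0) chooser[i]--; }  /* favor local  */
}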

Page 23: Chapter3 ilp

23

Tournament Predictors

• Advantage of a tournament predictor is the ability to select the right predictor for a particular branch
• A typical tournament predictor selects the global predictor 40% of the time for SPEC integer benchmarks

• AMD Opteron and Phenom use tournament style

Page 24: Chapter3 ilp

24

Tournament Predictors (Intel Core i7)

• Based on the predictors used in the Core Duo chip
• Combines three different predictors:
  • Two-bit
  • Global history
  • Loop exit predictor
    • Uses a counter to predict the exact number of taken branches (number of loop iterations) for a branch that is detected as a loop branch
• Tournament: tracks the accuracy of each predictor
• Main problem of speculation:
  • A mispredicted branch may lead to another branch being mispredicted!

Page 25: Chapter3 ilp

25

Branch Prediction

• Sophisticated techniques:
  • A “branch target buffer” to help us look up the destination
  • Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)

• Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.

• A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

•Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

•Modern processors predict correctly 95% of the time!

Page 26: Chapter3 ilp

26

Branch Target Buffers (BTB)

•Branch target calculation is costly and stalls the instruction fetch.

•BTB stores PCs the same way as caches

•The PC of a branch is sent to the BTB

•When a match is found the corresponding Predicted PC is returned

•If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Page 27: Chapter3 ilp

27

Branch Target Buffers (BTB)

Page 28: Chapter3 ilp

28

Pipeline without Branch Predictor

[Figure: branch handling without a predictor; PC + 4 drives instruction fetch (IF), while register read, compare, and branch-target computation happen in the following stage.]

In the 5-stage pipeline, a branch completes in two cycles
If the branch went the wrong way, one incorrect instruction is fetched
One stall cycle per incorrect branch
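As a worked example with assumed numbers: if 20% of instructions are branches and 10% of them go the wrong way, the one-cycle penalty adds 0.20 × 0.10 × 1 = 0.02 cycles per instruction, raising the CPI from 1.00 to 1.02.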

Page 29: Chapter3 ilp

29

Pipeline with Branch Predictor

[Figure: the same pipeline with a Branch Predictor added beside the PC in IF; register read, compare, and branch-target computation still happen in the following stage.]

Page 30: Chapter3 ilp

30

Dynamic Vs. Static ILP

• Static ILP:
  + The compiler finds parallelism: no extra hardware, higher clock speeds, and lower power
  + The compiler knows what is next: better global schedule
  – The compiler cannot react to dynamic events (cache misses)
  – Cannot re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low complexity/power argument)
  – Static branch prediction is poor; even statically scheduled processors use hardware branch predictors

Page 31: Chapter3 ilp

31

Dynamic Scheduling

• Hardware rearranges instruction execution to reduce stalls
  • Maintains data flow and exception behavior

• Advantages:
  • Handles cases where dependences are unknown at compile time
  • Simplifies the compiler
  • Allows the processor to tolerate unpredictable delays by executing other code
    • Cache miss delay
  • Allows code compiled with one pipeline in mind to run efficiently on a different pipeline

• Disadvantage:
  • Complex hardware

Page 32: Chapter3 ilp

32

Dynamic Scheduling

• Simple pipelining technique:
  • In-order instruction issue and execution
  • If an instruction is stalled, no later instructions can proceed
  • Dependence between two closely spaced instructions leads to a hazard
• What if there are multiple functional units?
  • Units could stay idle
• If instruction “j” depends on a long-running instruction “i”
  • All instructions after “j” stall

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

Can be eliminated by not requiring instructions to execute in-order

Page 33: Chapter3 ilp

33

Idea

• Classic five-stage pipeline
  • Structural and data hazards could be checked during ID

• What do we need to allow us to execute the SUB.D?

• Separate the ID process into two:
  • Check for hazards (Issue)
  • Decode (read operands)

• In-order instruction issue (in program order)
  • Begin execution as soon as operands are available
  • Out-of-order execution (out-of-order completion)

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

Page 34: Chapter3 ilp

34

Out-of-order complications

• Introduces the possibility of WAW and WAR hazards
  • They do not exist in the 5-stage pipeline

• Solution
  • Register renaming

DIV.D F0,F2,F4
ADD.D F6,F0,F8     ← WAR on F8 (vs. SUB.D); WAW on F6 (vs. MUL.D)
SUB.D F8,F10,F14
MUL.D F6,F10,F8

Page 35: Chapter3 ilp

35

Out-of-order complications

• Handling exceptions
  • Out-of-order completion must preserve exception behavior
  • Exactly those exceptions that would arise if the program were executed in strict program order actually do arise
• Preserve exception behavior by:
  • Ensuring that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed

Page 36: Chapter3 ilp

36

Splitting ID Stage

• Instruction Fetch:
  • Fetch into a register or queue

• Instruction Decode
  • Issue:
    • Decode instructions
    • Check for structural hazards
  • Read operands:
    • Wait until no data hazards
    • Then read operands

• Execute
  • Just as in the 5-stage pipeline execute stage
  • May take multiple cycles

Page 37: Chapter3 ilp

37

Hardware requirement

•Pipeline must allow multiple instructions to be in execution stage

• Multiple functional units

•Instructions pass through issue stage in order (in-order issue)

•Instructions can be stalled and bypass each other in the second stage (read operands)

•Instructions enter execution stage out-of-order

Page 38: Chapter3 ilp

38

Dynamic Scheduling using Tomasulo’s Method

•Sophisticated scheme to allow out-of-order execution

• Objective is to minimize RAW hazards
  • Introduces register renaming to minimize WAR and WAW hazards

• Many variations of this technique are used in modern processors

• Common features:
  • Tracking instruction dependences to allow execution as soon as (and only when) operands are available (resolves RAW)
  • Renaming destination registers (WAR, WAW)

Page 39: Chapter3 ilp

39

Dynamic Scheduling using Tomasulo’s Method

•Register renaming:

Before renaming:              After renaming (S, T are new names):
DIV.D F0,F2,F4                DIV.D F0,F2,F4
ADD.D F6,F0,F8                ADD.D S,F0,F8
S.D   F6,0(R1)                S.D   S,0(R1)
SUB.D F8,F10,F14              SUB.D T,F10,F14
MUL.D F6,F10,F8               MUL.D F6,F10,T

The example contains a true (RAW) dependence on F0, an anti (WAR) dependence on F8, and an output (WAW) dependence on F6. Finding any later use of F8 requires a sophisticated compiler or hardware!

Page 40: Chapter3 ilp

40

Dynamic Scheduling using Tomasulo’s Method

[Figure: Tomasulo hardware organization, with a FIFO instruction queue and numbered steps 1-3.]

Page 41: Chapter3 ilp

41

Steps of an instruction

Page 42: Chapter3 ilp

42

Steps of an instruction

2. Execution (EX): operate on operands
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute. Checks RAW (sometimes called “issue”).
   Note: several instructions may become ready for the same FU at the same time: arbitrary choice among them.
   Loads and stores are complicated! Need to maintain program order.

Page 43: Chapter3 ilp

43

4 Steps of Speculative Tomasulo

3. Write result (WB): finish execution
   Write on the Common Data Bus to all awaiting FUs; mark the reservation station available.
   A load waits for the memory unit to be available.
   A store waits for its operand and then for memory to be available.

Page 44: Chapter3 ilp

44

Reservation Station

• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
  • For loads, Vk is used to hold the offset
• Qj, Qk: reservation stations producing the source registers (the value to be written)
  • Note: Qj,Qk = 0 => value ready in Vj, Vk
• Busy: indicates the reservation station or FU is busy
• A: holds information for the memory address calculation for a load/store
  • Initially, the immediate value is stored in A
  • After address calculation: the effective address
• Register result status (Qi): indicates the number of the reservation station that contains the operation whose result should be stored into this register
  • Blank when there are no pending instructions to write that register
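The fields above can be sketched as C structs (field types and widths are assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    char     op[8];    /* operation to perform (e.g., "+", "-") */
    double   Vj, Vk;   /* source operand values; Vk holds the offset for loads */
    unsigned Qj, Qk;   /* RS numbers producing the sources; 0 => ready in Vj/Vk */
    bool     busy;     /* reservation station or FU is busy */
    uint32_t A;        /* immediate at issue; effective address after calculation */
} ReservationStation;

typedef struct {
    unsigned Qi;       /* RS whose result will be written to this register; 0 => none */
} RegisterStatus;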

Page 45: Chapter3 ilp

Refer to D. Patterson’s Tomasulo Slides

45

Page 46: Chapter3 ilp

46

Loop Unrolling

• Determine that loop unrolling is useful by finding that loop iterations are independent
  • Determine address offsets for the different loads/stores
  • Increases program size

•Use different registers to avoid unnecessary constraints forced by using same registers for different computations

• Stress on registers

•Eliminate the extra test and branch instructions and adjust the loop termination and iteration code

Page 47: Chapter3 ilp

47

Loop Unrolling

• If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together, so long as order within an iteration is preserved

• If a loop has dependences across iterations, it is not parallel and these dependences are referred to as “loop-carried”

Page 48: Chapter3 ilp

48

Example

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         /* S1 */

Page 49: Chapter3 ilp

49

Example

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
→ No dependences

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
→ S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
→ S1 depends on S2 from the previous iteration

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         /* S1 */
→ S1 depends on S1 from 3 iterations earlier; referred to as a recurrence; dependence distance 3; limited parallelism

Page 50: Chapter3 ilp

50

Constructing Parallel Loops

If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the previous iteration

A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];          /* S3 */
    A[i+1] = A[i+1] + B[i+1];      /* S4 */
}
B[101] = C[100] + D[100];

S4 depends on S3 of the same iteration

Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction

Page 51: Chapter3 ilp

51

Dynamic Scheduling using Tomasulo’s Method

• Register renaming is provided by reservation stations
  • They buffer the operands of instructions waiting to issue
• A reservation station fetches and buffers an operand as soon as it is available
  • Eliminates the need to get the operand from a register
• Pending instructions designate the reservation station that will provide their input (register renaming)
• Successive writes to a register: only the last one actually updates it
• There can be more reservation stations than real registers!
  • Name dependences that can’t be eliminated by a compiler can now be eliminated by hardware

Page 52: Chapter3 ilp

52

Dynamic Scheduling using Tomasulo’s Method

• Data structures attached to:
  • reservation stations
  • load/store buffers: act like reservation stations (hold data or addresses coming from or going to memory)
  • the register file

• Once an instruction has issued and is waiting for a source operand:
  • It refers to the operand by the number of the reservation station to which the producing instruction has been assigned
  • If the reservation station number is 0, the operand is already in the register file

• 1-cycle latency between source and result
  • Effective latency between a producing instruction and a consuming instruction is at least 1 cycle longer than the latency of the functional unit producing the result

Page 53: Chapter3 ilp

53

Dynamic Scheduling using Tomasulo’s Method

• Loads and Stores
  • If they access different addresses
    • Can be done out of order safely
  • Else:
    • Load and then Store in program order → WAR
    • Store and then Load in program order → RAW
    • Store and then Store → WAW

• To detect these hazards:
  • Effective address calculation must be done in program order
  • Check the “A” field in the Load/Store queue before issuing the Load/Store instruction

Page 54: Chapter3 ilp

54

Dynamic Scheduling using Tomasulo’s Method

• Advantage of distributed reservation stations and the CDB
  • If multiple instructions are waiting for the same result:
    • Instructions are released concurrently by the broadcast of the result over the CDB
    • With a centralized register file, units would read from the register file when the bus is available

• Advantage of using reservation station numbers instead of register names
  • Eliminates WAR, WAW

Page 55: Chapter3 ilp

55

Dynamic Scheduling using Tomasulo’s Method

• Disadvantage
  • Complex hardware
  • Can achieve high performance if branches are predicted accurately
    • <1 cycle per instruction with dual issue!
  • Reservation stations must be associative
  • Single CDB!

Page 56: Chapter3 ilp

56

Hardware-Based Speculation

• Branch prediction reduces the stalls attributable to branches
• For a processor executing multiple instructions:
  • Just predicting the branch is not enough
  • A multiple-issue processor may execute a branch every clock cycle

• Exploiting parallelism requires that we overcome the limitation of control dependence

Page 57: Chapter3 ilp

57

Hardware-Based Speculation

• Greater ILP: overcome control dependence by speculating in hardware on the outcome of branches and executing the program as if the guesses were correct
  • An extension over branch prediction with dynamic scheduling
  • Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
  • Dynamic scheduling alone only fetches and issues such instructions

• Essentially a data flow execution model: operations execute as soon as their operands are available

Page 58: Chapter3 ilp

58

Hardware-Based Speculation

• 3 components of HW-based speculation:
  • Dynamic branch prediction to choose which instructions to execute
  • Speculation to allow execution of instructions before control dependences are resolved
    • Ability to undo the effects of an incorrectly speculated sequence
  • Dynamic scheduling to deal with scheduling of different combinations of basic blocks
    • Without speculation, dynamic scheduling only partially overlaps basic blocks
    • It requires that a branch be resolved before actually executing any instructions in the successor basic block.

Page 59: Chapter3 ilp

59

Hardware-Based Speculation in Tomasulo

• The key idea:
  • allow instructions to execute out of order
  • force instructions to commit in order
  • prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits

• Hence:
  • Must separate execution from allowing an instruction to finish or “commit”
  • Instructions may finish execution considerably before they are ready to commit

• This additional step is called instruction commit

Page 60: Chapter3 ilp

60

Hardware-Based Speculation in Tomasulo

• When an instruction is no longer speculative, allow it to update the register file or memory

• Requires an additional set of buffers to hold the results of instructions that have finished execution but have not committed: the reorder buffer (ROB)

• The reorder buffer (ROB) is also used to pass results among instructions that may be speculated

Page 61: Chapter3 ilp

61

Reorder Buffer

• In Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file

• With speculation, the register file is not updated until the instruction commits
  • (when we know definitively that the instruction should execute)

• Thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit
  • The ROB is a source of operands for instructions, just as the reservation stations (RS) provide operands in Tomasulo’s algorithm
  • The ROB extends the architected registers, like the RS

Page 62: Chapter3 ilp

62

Reorder Buffer Structure (Four fields)

• Instruction type field
  • Indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (ALU operation or load, which has a register destination).

• Destination field
  • Supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written.

• Value field
  • Holds the value of the instruction result until the instruction commits.

• Ready field
  • Indicates that the instruction has completed execution, and the value is ready.
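The four fields might be sketched as a C struct (the enum names and types are assumptions):

#include <stdbool.h>
#include <stdint.h>

typedef enum { TYPE_BRANCH, TYPE_STORE, TYPE_REG_OP } InstrType;

typedef struct {
    InstrType type;   /* instruction type field: branch, store, or register op */
    uint32_t  dest;   /* destination field: register number or memory address */
    double    value;  /* value field: result held until commit */
    bool      ready;  /* ready field: execution complete, value valid */
} ROBEntry;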

Page 63: Chapter3 ilp

63

Reorder Buffer Operation

• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results are placed into the ROB
  • Supplies operands to other instructions between execution complete & commit => more registers, like the RS
  • Tag results with the ROB entry number instead of the reservation station number
• Instructions commit => values at the head of the ROB are placed in the registers
• As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: the FP Op Queue feeds reservation stations for the FP adders; results flow into the Reorder Buffer, whose commit path updates the FP registers.]

Page 64: Chapter3 ilp

64

Where is the store queue?

Page 65: Chapter3 ilp

65

4 Steps of Speculative Tomasulo

1. Issue: get an instruction from the instruction queue
   If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands (if available, either from the ROB or the FP registers) to the reservation station, & send the reorder buffer number allocated for the result to the reservation station (to tag the result when it is placed on the CDB)

Page 66: Chapter3 ilp

66

4 Steps of Speculative Tomasulo

2. Execution (EX): operate on operands
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute. Checks RAW.
   Loads still require a 2-step process; instructions may take multiple cycles; for stores, this step is only the effective address calculation (need Rs).

Page 67: Chapter3 ilp

67

4 Steps of Speculative Tomasulo

3. Write result (WB): finish execution
   Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available.
   If the value to be stored is available, it is written into the Value field of the ROB entry for the store.
   If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry for the store is updated.

Page 68: Chapter3 ilp

68

4 Steps of Speculative Tomasulo

4. Commit
   a) When an instruction reaches the head of the ROB and its result is present in the buffer:
      • update the register with the result and remove the instruction from the ROB.
   b) Committing a store is similar, except that memory is updated rather than a result register.
   c) If a branch with an incorrect prediction reaches the head of the ROB:
      • it indicates that the speculation was wrong;
      • the ROB is flushed and execution is restarted at the correct successor of the branch.
   d) If the branch was correctly predicted, the branch is finished.

   • Once an instruction commits, its entry in the ROB is reclaimed. If the ROB fills, we simply stop issuing instructions until an entry is made free.
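A self-contained C sketch of this commit step, restating the earlier ROBEntry sketch; the mispredicted flag, the recovery helper, and the flat regfile/memory arrays are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

typedef enum { TYPE_BRANCH, TYPE_STORE, TYPE_REG_OP } InstrType;

typedef struct {
    InstrType type;
    uint32_t  dest;          /* register number or memory address */
    double    value;
    bool      ready;
    bool      mispredicted;  /* branches only (assumed field) */
} ROBEntry;

void flush_and_restart(void);  /* hypothetical recovery routine */

void commit_head(ROBEntry rob[], int *head, int size,
                 double regfile[], double memory[]) {
    ROBEntry *e = &rob[*head];
    if (!e->ready) return;                    /* head not finished: wait   */
    switch (e->type) {
    case TYPE_REG_OP:
        regfile[e->dest] = e->value; break;   /* a) update the register    */
    case TYPE_STORE:
        memory[e->dest] = e->value;  break;   /* b) update memory instead  */
    case TYPE_BRANCH:
        if (e->mispredicted) {                /* c) speculation was wrong  */
            flush_and_restart();
            return;
        }
        break;                                /* d) correct: just finish   */
    }
    *head = (*head + 1) % size;               /* reclaim the ROB entry     */
}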

Page 69: Chapter3 ilp

Tomasulo With Reorder Buffer

[Figure: datapath with FP Op Queue, Reservation Stations, FP adders and FP multipliers, Reorder Buffer (ROB1-ROB7, oldest to newest), Registers, and paths to/from memory. The code is: LD F0,16(R2); ADDD F10,F4,F0; DIVD F2,F10,F6. After issuing the load, ROB1 holds “LD F0,16(R2)” with destination F0 and Done? = N; load reservation-station entry 1 holds the address computation 10+R2.]

Page 70: Chapter3 ilp

70

Tomasulo With Reorder Buffer

[Figure: the same datapath after also issuing ADDD F10,F4,F0. ROB2 now holds the ADDD with destination F10 and Done? = N, above ROB1’s LD (Done? = N); adder reservation-station entry 2 holds “ADDD R(F4),ROB1”, i.e., the value of F4 plus a tag for ROB1’s pending load result.]

Page 71: Chapter3 ilp

71

Tomasulo With Reorder Buffer

[Figure: the same datapath after also issuing DIVD F2,F10,F6. ROB3 holds the DIVD with destination F2 and Done? = N; the multiplier reservation station holds “3 DIVD ROB2,R(F6)” and the adder still holds “2 ADDD R(F4),ROB1”.]

Page 72: Chapter3 ilp

72

Avoiding Memory Hazards

• WAW and WAR hazards through memory are eliminated with speculation, because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence, no earlier loads or stores can still be pending

• RAW hazards through memory are avoided by two restrictions:
  1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the Addr. field of the load, and
  2. maintaining the program order for the computation of the effective address of a load with respect to all earlier stores.

• These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
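A C sketch of the first restriction as a scan over the ROB; the addr_known field and the scan order are assumptions, and treating a store with a not-yet-computed address as a conflict conservatively approximates the second restriction:

#include <stdbool.h>
#include <stdint.h>

typedef enum { E_BRANCH, E_STORE, E_REG_OP } EntryType;

typedef struct {
    EntryType type;
    bool      addr_known;  /* has the store's effective address been computed? */
    uint32_t  dest;        /* for stores: the destination memory address */
} ROBSlot;

/* May the load at rob[load_idx], with effective address load_addr,
   begin its memory access? Scan the active entries older than the load. */
bool load_may_proceed(const ROBSlot rob[], int head, int load_idx,
                      int size, uint32_t load_addr) {
    for (int i = head; i != load_idx; i = (i + 1) % size) {
        if (rob[i].type == E_STORE &&
            (!rob[i].addr_known || rob[i].dest == load_addr))
            return false;  /* an earlier store (possibly) writes this address */
    }
    return true;
}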

Page 73: Chapter3 ilp

73

Multi-Issue - Getting CPI Below 1

• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
  1. statically scheduled superscalar processors,
  2. dynamically scheduled superscalar processors, and
  3. VLIW (very long instruction word) processors (statically scheduled)
• The 2 types of superscalar processors issue varying numbers of instructions per clock
  – in-order execution if they are statically scheduled, or
  – out-of-order execution if they are dynamically scheduled
• VLIW processors, in contrast, issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)

Page 74: Chapter3 ilp

74

VLIW: Very Large Instruction Word

• Each “instruction” has explicit coding for multiple operations

– In IA-64, grouping called a “packet”

– In Transmeta, grouping called a “molecule” (with “atoms” as ops)

– Moderate LIW also used in Cray/Tera MTA-2

• Tradeoff: instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in one long instruction word are independent => can execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    » 16 to 24 bits per field => 7×16 = 112 bits to 7×24 = 168 bits wide
  – Need compiling techniques to schedule across several branches (called “trace scheduling”)

Page 75: Chapter3 ilp

75

Thrice Unrolled Loop that Eliminates Stalls for Scalar Pipeline Computers

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        ADD.D  F4,F0,F2
5        ADD.D  F8,F6,F2
6        ADD.D  F12,F10,F2
7        S.D    0(R1),F4
8        S.D    -8(R1),F8
9        DSUBUI R1,R1,#24
10       BNEZ   R1,LOOP
11       S.D    8(R1),F12    ; 8-24 = -16

11 clock cycles, or 3.67 per iteration

Minimum times between pairs of instructions:
L.D to ADD.D: 1 cycle
ADD.D to S.D: 2 cycles

A single branch delay slot follows the BNEZ.

Page 76: Chapter3 ilp

76

Loop Unrolling in VLIW

Memory ref 1 | Memory ref 2 | FP operation 1 | FP operation 2 | Int. op/branch | Clock
(schedule to be filled in on the following slides)

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        ADD.D  F4,F0,F2
5        ADD.D  F8,F6,F2
6        ADD.D  F12,F10,F2
7        S.D    0(R1),F4
8        S.D    -8(R1),F8
9        DSUBUI R1,R1,#24
10       BNEZ   R1,LOOP
11       S.D    8(R1),F12

L.D to ADD.D: +1 cycle
ADD.D to S.D: +2 cycles

Page 77: Chapter3 ilp

77

Loop Unrolling in VLIW

Memory ref 1    Memory ref 2    FP operation 1    FP operation 2    Int. op/branch    Clock
L.D F0,0(R1)                                                                          1
                                                                                      2
                                ADD.D F4,F0,F2                                        3
                                                                                      4
                                                                                      5
S.D 0(R1),F4                                                                          6

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        ADD.D  F4,F0,F2
5        ADD.D  F8,F6,F2
6        ADD.D  F12,F10,F2
7        S.D    0(R1),F4
8        S.D    -8(R1),F8
9        DSUBUI R1,R1,#24
10       BNEZ   R1,LOOP
11       S.D    8(R1),F12

L.D to ADD.D: +1 cycle
ADD.D to S.D: +2 cycles

Page 78: Chapter3 ilp

78

Loop Unrolling in VLIW

Memory ref 1       Memory ref 2       FP operation 1      FP operation 2      Int. op/branch      Clock
L.D F0,0(R1)       L.D F6,-8(R1)                                                                  1
L.D F10,-16(R1)    L.D F14,-24(R1)                                                                2
L.D F18,-32(R1)    L.D F22,-40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2                          3
L.D F26,-48(R1)                       ADD.D F12,F10,F2    ADD.D F16,F14,F2                        4
                                      ADD.D F20,F18,F2    ADD.D F24,F22,F2                        5
S.D 0(R1),F4       S.D -8(R1),F8      ADD.D F28,F26,F2                                            6
S.D -16(R1),F12    S.D -24(R1),F16                                                                7
S.D -32(R1),F20    S.D -40(R1),F24                                        DSUBUI R1,R1,#56        8
S.D 8(R1),F28                                                             BNEZ R1,LOOP            9

Unrolled 7 times to avoid stall delays from ADD.D to S.D
7 iterations in 9 clocks, or 1.3 clocks per iteration (2.8X: 1.3 vs. 3.67)
Average: 2.5 ops per clock (23 ops in 45 slots), 51% efficiency
Note: 8, not -48, after DSUBUI R1,R1,#56 (which may be out of place; see next slide)
Note: We needed more registers in VLIW (used 15 pairs vs. 6 in the superscalar version)

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        ADD.D  F4,F0,F2
5        ADD.D  F8,F6,F2
6        ADD.D  F12,F10,F2
7        S.D    0(R1),F4
8        S.D    -8(R1),F8
9        DSUBUI R1,R1,#24
10       BNEZ   R1,LOOP
11       S.D    8(R1),F12

L.D to ADD.D: +1 cycle
ADD.D to S.D: +2 cycles

Page 79: Chapter3 ilp

79

Problems with 1st Generation VLIW

• Increase in code size
  – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding

• Operated in lock-step; no hazard detection HW
  – a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  – the compiler might predict functional unit stalls, but cache stalls are hard to predict

• Binary code incompatibility
  – Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

Page 80: Chapter3 ilp

80

Multiple Issue Processors

• Exploiting ILP
  – Unrolling simple loops
  – More importantly, being able to exploit parallelism in less structured code

• Modern Processors: – Multiple Issue

– Dynamic Scheduling

– Speculation

Page 81: Chapter3 ilp

81

Multiple Issue, Dynamic, Speculative Processors

• How do you issue two instructions concurrently?
  – What happens at the reservation station if two instructions issued concurrently have a true dependency?
  – Solution 1:
    » Issue the first instruction during the first half and the second instruction during the second half of the clock cycle
  – Problem:
    » Can we issue 4 instructions?
  – Solution 2:
    » Pipeline and widen the issue logic
    » Make instruction issue take multiple clock cycles!
  – Problem:
    » Cannot pipeline indefinitely: new instructions are issued every clock cycle
    » Must be able to assign reservation stations
    » A dependent instruction issued in the same group must be able to refer to the correct reservation stations for its operands

• The issue step is the bottleneck in dynamically scheduled superscalars!

Page 82: Chapter3 ilp

82

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”

• IA-64: instruction set architecture; 64 bits per integer
  • 128 64-bit integer regs + 128 82-bit floating-point regs
  – Not separate register files per functional unit as in old VLIW

• Hardware checks dependencies (interlocks => binary compatibility over time)

• Itanium™ was the first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process

• Itanium 2™ is the name of the 2nd implementation (2005)
  – 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
  – Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

Page 83: Chapter3 ilp

83

Speculation vs. Dynamic Scheduling

• With speculation, no instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete. In contrast, with dynamic scheduling alone, the faster instructions (SUB.D and ADD.D) have also completed.

• The ROB can dynamically execute code while maintaining a precise interrupt model.

• If MUL.D caused an interrupt, we could simply wait until it reached the head of the ROB and take the interrupt, flushing any other pending instructions from the ROB. Because instruction commit happens in order, this yields a precise exception.

• By contrast, in Tomasulo’s algorithm, the SUB.D and ADD.D completed before the MUL.D raised the exception. F8 and F6 could be overwritten, and the interrupt would be imprecise.

L.D   F6,32(R2)
L.D   F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Page 84: Chapter3 ilp

84

ARM Cortex-A8 and Intel Core i7

• A8:
  • Multiple issue
  • iPad, Motorola Droid, iPhones

• i7:
  • Multiple issue
  • High-end, dynamically scheduled, speculative
  • High-end desktops, servers

Page 85: Chapter3 ilp

85

ARM Cortex-A8

• A8 design goal: low power, reasonably high clock rate
• Dual-issue
• Statically scheduled superscalar
• Dynamic issue detection
  • Issues one or two instructions per clock (in-order)
• 13-stage pipeline
  • Fully bypassed
• Dynamic branch predictor
  • 512-entry, 2-way set-associative branch target buffer
  • 4K-entry global history buffer
  • If the branch target buffer misses, prediction comes through the global history buffer
  • 8-entry return address stack

• i7: aggressive 4-issue, dynamically scheduled, speculative pipeline

Page 86: Chapter3 ilp

86

ARM Cortex-A8

The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.

Page 87: Chapter3 ilp

87

ARM Cortex-A8

The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.

Page 88: Chapter3 ilp

88

ARM Cortex-A8

The five-stage instruction decode of the A8. Multiply operations are always performed in ALU pipeline 0.

Page 89: Chapter3 ilp

89

ARM Cortex-A8

Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1 and L2 generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.

Page 90: Chapter3 ilp

90

ARM Cortex-A8 vs A9

Figure 3.40 The performance ratio for the A9 compared to the A8, both using a 1 GHz clock and the same size caches for L1 and L2, shows that the A9 is about 1.28 times faster. Both runs use a 32 KB primary cache and a 1 MB secondary cache, which is 8-way set associative for the A8 and 16-way for the A9. The block sizes in the caches are 64 bytes for the A8 and 32 bytes for the A9. As mentioned in the caption of Figure 3.39, eon makes intensive use of integer multiply, and the combination of dynamic scheduling and a faster multiply pipeline significantly improves performance on the A9. twolf experiences a small slowdown, likely due to the fact that its cache behavior is worse with the smaller L1 block size of the A9.

A9: Issues 2 instructions/clock
    Dynamic scheduling
    Speculation

Page 91: Chapter3 ilp

91

Intel Core i7

• The total pipeline depth is 14 stages.

• There are 48 load and 32 store buffers.

• The six independent functional units can each begin execution of a ready micro-op in the same cycle.

Page 92: Chapter3 ilp

92

Intel Core i7

• Instruction Fetch:
  • Multilevel branch target buffer
  • Return address stack (for function returns)
  • Fetch 16 bytes from the instruction cache
  • 16 bytes go into the predecode instruction buffer
  • Macro-op fusion: a compare followed by a branch is fused into one instruction
  • Break the 16 bytes into instructions
  • Place them into an 18-entry queue

Page 93: Chapter3 ilp

93

Intel Core i7

• Micro-op decode: translate x86 instructions into micro-ops (directly executable by the pipeline)
  • Generate up to 4 micro-ops/cycle
  • Place them into a 28-entry buffer

• Micro-op buffer:
  • Loop stream detection:
    • For a small sequence of instructions in a loop (<28 instructions)
    • Eliminates fetch and decode
  • Microfusion:
    • Fuses load/ALU and ALU/store pairs
    • Issues them to a single reservation station

Page 94: Chapter3 ilp

94

Intel Core i7 vs. Atom 230 (45nm technology)

                Intel i7 920             ARM A8                       Intel Atom 230
Cores           4 cores, each with FP    1 core, no FP                1 core, with FP
Clock rate      2.66 GHz                 1 GHz                        1.66 GHz
Power           130 W                    2 W                          4 W
Cache           3-level, all 4-way;      1-level, fully associative;  2-level, all 4-way;
                128 I, 64 D, 512 L2      32 I, 32 D                   16 I, 16 D, 64 L2
Pipeline        4 ops/cycle;             2 ops/cycle;                 2 ops/cycle;
                speculative, OOO         in-order, dynamic issue      in-order, dynamic issue
Branch pred.    Two-level                Two-level; 512-entry BTB,    Two-level
                                         4K global history,
                                         8-entry return stack

Page 95: Chapter3 ilp

Intel Core i7 vs. Atom 230 (45 nm technology)

Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.

Copyright © 2011, Elsevier Inc. All rights reserved.

Page 96: Chapter3 ilp

96

Improving Performance

• Techniques to increase performance:
  • pipelining
    → improves clock speed
    → increases the number of in-flight instructions
  • hazard/stall elimination
    → branch prediction
    → register renaming
    → out-of-order execution
    → bypassing
  • increased pipeline bandwidth

Page 97: Chapter3 ilp

97

Deep Pipelining

• Increases the number of in-flight instructions

• Decreases the gap between successive independent instructions

• Increases the gap between dependent instructions

• Depending on the ILP in a program, there is an optimal pipeline depth

• Tough to pipeline some structures; increases the cost of bypassing

Page 98: Chapter3 ilp

98

Increasing Width

• Difficult to find more than four independent instructions

• Difficult to fetch more than six instructions (else, must predict multiple branches)

• Increases the number of ports per structure

Page 99: Chapter3 ilp

99

Reducing Stalls in Fetch

• Better branch prediction
  • novel ways to index/update and avoid aliasing
  • cascading branch predictors

• Trace cache
  • stores instructions in the common order of execution, not in sequential order
  • in Intel processors, the trace cache stores pre-decoded instructions

Page 100: Chapter3 ilp

100

Reducing Stalls in Rename/Regfile

• Larger ROB/register file/issue queue

• Virtual physical registers: assign virtual register names to instructions, but assign a physical register only when the value is made available

• Runahead: while a long instruction waits, let a thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution)

• Two-level register files: values being kept around in the register file for precise exceptions can be moved to 2nd level

Page 101: Chapter3 ilp

101

Performance beyond single thread ILP

• There can be much higher natural parallelism in some applications (e.g., database or scientific codes)

• Explicit Thread-Level Parallelism or Data-Level Parallelism

• Thread: a process with its own instructions and data
  • A thread may be one process of a parallel program of multiple processes, or it may be an independent program
  • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute

• Data-Level Parallelism: perform identical operations on data, and lots of data

Page 102: Chapter3 ilp

102

Thread Level Parallelism (TLP)

•ILP exploits implicit parallel operations within a loop or straight-line code segment

•TLP explicitly represented by the use of multiple threads of execution that are inherently parallel

• Goal: use multiple instruction streams to improve
  • Throughput of computers that run many programs
  • Execution time of multi-threaded programs

•TLP could be more cost-effective to exploit than ILP

Page 103: Chapter3 ilp

103

Thread-Level Parallelism

• Motivation:
  • a single thread leaves a processor under-utilized most of the time
  • doubling processor area barely improves single-thread performance

• Strategies for thread-level parallelism:
  • multiple threads share the same large processor
    → reduces under-utilization, efficient resource allocation
    → Simultaneous Multi-Threading (SMT)
  • each thread executes on its own mini processor
    → simple design, low interference between threads
    → Chip Multi-Processing (CMP)

Page 104: Chapter3 ilp

104

New Approach: Mulithreaded Execution

• Multithreading: multiple threads share the functional units of 1 processor via overlapping
  • The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  • Memory is shared through the virtual memory mechanisms, which already support multiple processes
  • HW for fast thread switch; much faster than a full process switch (~100s to 1000s of clocks)

• When to switch?
  • Alternate instructions per thread (fine grain)
  • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Page 105: Chapter3 ilp

105

Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved

•Usually done in a round-robin fashion, skipping any stalled threads

•CPU must be able to switch threads every clock

•Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls

•Disadvantage is it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads

•Used on Sun’s Niagara

Page 106: Chapter3 ilp

106

Coarse-Grained Multithreading

•Switches threads only on costly stalls, such as L2 cache misses

• Advantages
  • Relieves the need to have very fast thread switching
  • Doesn’t slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall

•Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs

• Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen

• New thread must fill pipeline before instructions can complete

• Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in the IBM AS/400

Page 107: Chapter3 ilp

107

Simultaneous Multi-threading ...

[Figure: issue-slot diagrams over 9 cycles for 8 units (M M FX FX FP FP BR CC). With one thread, many slots stay empty; with two threads, far more slots are filled each cycle.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Page 108: Chapter3 ilp

108

Simultaneous Multithreading (SMT)

• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  • A large set of virtual registers that can be used to hold the register sets of independent threads
  • Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  • Out-of-order completion allows the threads to execute out of order and get better utilization of the HW

• Just add a per-thread renaming table and keep separate PCs
  • Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Page 109: Chapter3 ilp

109

Multithreaded Categories

[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading organizations; shading distinguishes Threads 1-5 and idle slots.]

Page 110: Chapter3 ilp

110

Head to Head ILP competition

Processor                  Microarchitecture                                     Fetch/Issue/Execute   FU            Clock (GHz)   Transistors   Die size         Power
Intel Pentium 4 Extreme    Speculative, dynamically scheduled;                   3/3/4                 7 int, 1 FP   3.8           125 M         122 mm2          115 W
                           deeply pipelined; SMT
AMD Athlon 64 FX-57        Speculative, dynamically scheduled                    3/3/4                 6 int, 3 FP   2.8           114 M         115 mm2          104 W
IBM Power5 (1 CPU only)    Speculative, dynamically scheduled; SMT;              8/4/8                 6 int, 2 FP   1.9           200 M         300 mm2 (est.)   80 W (est.)
                           2 CPU cores/chip
Intel Itanium 2            Statically scheduled, VLIW-style                      6/5/11                9 int, 2 FP   1.6           592 M         423 mm2          130 W

Page 111: Chapter3 ilp

111

Limits to ILP

• Doubling issue rates above today’s 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to:
  • issue 3 or 4 data memory accesses per cycle,
  • resolve 2 or 3 branches per cycle,
  • rename and access more than 20 registers per cycle, and
  • fetch 12 to 24 instructions per cycle.

• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  • E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Page 112: Chapter3 ilp

112

Limits to ILP

• Most techniques for increasing performance also increase power consumption

• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?

• Multiple-issue techniques are all energy inefficient:
  • Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  • There is a growing gap between peak issue rates and sustained performance

• The number of transistors switching = f(peak issue rate), while performance = f(sustained rate); the growing gap between peak and sustained performance means increasing energy per unit of performance

Page 113: Chapter3 ilp

113

Conclusion

• Limits to ILP (power efficiency, compilers, dependences …) seem to cap practical designs at 3 to 6 issue

• Explicit parallelism (data-level parallelism or thread-level parallelism) is the next step for performance

• Coarse-grained vs. fine-grained multithreading
  • Switch only on big stalls vs. every clock cycle

• Simultaneous Multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  • Instead of replicating registers, reuse the rename registers

• The balance of ILP and TLP is decided in the marketplace