Instruction Level Parallelism2

7/29/2019 Instruction Level Parallelism2

1/56

Instruction Level Parallelism


2/56

Instruction Level Parallelism(ILP)

Overlap the execution of instructions to

improve performance

Two approaches to exploit ILP

1. dynamic and hardware intensive

(desktop and server markets)

2. static and software intensive(embedded market)


3/56


Pipeline CPI = Ideal pipeline CPI +

structural stalls +

data hazard stalls +control stalls

reduce each of the terms in the RHS to reduce

the overall CPI and thus increase instructionsper cycle (IPC)


4/56


Amount of parallelism available within a Basic

Block is very small.

Therefore, we exploit ILP across multiple

blocks


5/56


Loop-level parallelism: exploit parallelism

across iterations of a loop

Ex:

for (i=1;i


6/56


Converting loop-level parallelism intoinstruction level parallelism either statically bythe compiler or dynamically by the hardware

Alternatively, use vector instructions thatoperate on a sequence of data items

ex: we need 4 vector instructions to execute

this code: 2 for loading x and y into memory, 2for adding the two vectors and 1 for storingback the result vector


7/56

Data dependences and Hazards

Finding dependences is critical in determining

how much parallelism exists in a program

Which instructions can be executed in

parallel?

Whether an instruction is dependent on other

instruction?


8/56

Data dependences and Hazards

Three different types of dependences:

1. data dependences (also called true data

dependences)2. name dependences

3. control dependences


9/56

Data dependences

An instruction j is data dependent on instruction Iif either of the following holds:

1. instruction i produces a result that may be

used by instruction j, or2. instruction j is data dependent on

instruction k, and instruction k is data

dependent on instruction i(this dependence chain can be as long as the

entire program)


10/56

Data dependences

Ex: LOOP: L.D F0, 0(R1)

ADD.D F4, F0, F2

S.D F4, 0(R1)

DAAUI R1, R1, #-8

BNE R1, R2, LOOP

If two instructions are data dependent, they cannot beexecuted simultaneously or be completely overlapped

The dependence implies that there would be a chain of oneor more data hazards between the two instructions


11/56

Data dependences

The effect of the original data dependence mustbe preserved

The presence of the dependence indicates the

potential for a hazard, but the actual hazard andthe length of any stall is a property of thepipeline.

Ex: there is a data dependence between DADDIU and BNE;

this dependence causes a stall because we moved the branchtest for the MIPS pipeline to the ID stage. Had the branch test

stayed in EX, this dependence would not cause a stall


12/56

Data dependences

The importance of the data dependences isthat a dependence

(1) indicates the possibility of a hazard.

(2) determines the order in which results must

be calculated, and

(3) sets an upper bound on how much

parallelism can possibly be exploited


13/56

Data dependences

A dependence can be overcome in two

different ways:

1. maintaining the dependence but avoiding a

hazard, &

2. eliminating a dependence by transforming

the code


14/56

Data dependences

Primary method used to avoid a hazard is byscheduling the code without altering thedependencies (we see a hardware scheme forscheduling code dynamically as it is executed)

Dependences that flow through registers areeasy to detect than dependences that flowthrough memory locations(register names are fixed in the instruction,

100(R4) & 20(R6) may be identical,

20(R4) & 20(R4) may be different)


15/56

Name dependences

Occurs when two instructions use the same

register or memory location, called a name,

but there is no flow of data between the

instructions associated with that name.


16/56

Name dependences

There are two types of name dependencesbetween an instruction i that precedesinstruction j in program order:

1. An antidependence between instruction i &instruction j occurs when instruction j

writes register or memory location that

instruction i reads. The original orderingmust be preserved to ensure that I reads the

correct values


17/56

Name dependences

2. An output dependence occurs when

instruction i & instruction j writes the same

register or memory location. The ordering

between the instructions must be preservedto ensure that value finally written

corresponds to instruction j


18/56

Name dependences

Instructions involved in a name dependence

can execute simultaneously or be reordered,

if the name( register number or memory)

used in the instruction is changed so theinstructions do not conflict (register

renaming)


19/56

Dependences and data hazards

Data hazards are of three types depending on

the order of read and write accesses in the

instructions:

Consider two instructions i and j, with i

occurring before j in program order

1. RAW (read after write) true data

dependence. Ex: LOAD followed by an ALU

instrn that directly uses the LOAD result


20/56

Dependences and data hazards

2. WAW (write after write) output

dependence.

3. WAR (write after read) - antidependence

RAR (read after read) is not a hazard


21/56

Control dependences

Determines the ordering of an instruction i

with respect to a branch instruction

Ex: if p1 {

s1;

}

if p2 {s2;

}

s1 is control dependent on p1,

and s2 is control dependent on

p2, but not on p1


22/56

Control dependences

Two constraints imposed by control

dependencies:

1. An instruction that is control dependent on a

branch cannot be moved before the branch

so that its execution is no longer controlled

by the branch

2. An instruction that is not control dependent

on a branch cannot be moved after the

branch so that its execution is controlled by

the branch


23/56

Control dependences

Control dependencies are preserved by two

properties in a simple pipeline:

1. instructions execute in program order: this

ensures that the instruction that occurs

before the branch is executed before the

branch

2. the detection of a control or branch hazard

ensures that an instruction that is control

dependent on a branch is not executed until

the branch direction is known.


24/56

Control dependences

However we may be willing to violate the

control dependencies without affecting the

correctness of the program

Two properties that are critical to program

correctness and that must be preserved using

data and control dependencies are

exception behavior and the data flow.


25/56

Control dependences

Preserving the exception behavior: reordering of

instruction execution must not change how

exceptions are raised in the program (or must not

cause any new exceptions in the program)

Ex: DADDU R2, R3, R4

BEQZ R2, L1

LW R1, 0(R2)

Speculation a hardware technique which allowsus to overcome this exception problem

To show how maintainingdata and control dependences do

not cause any new exceptions

after instruction reordering


26/56

Control dependences

Preserving the data flow : data flow is the actual

flow of data values among instructions that produce

results and those that consume them

Branches make the data flow dynamic

ex: DADDU R1, R2, R3

BEQZ R4, L

DSUBU R1, R5, R6

L1: OR R7, R1, R8

Data dependence alone is

insufficient for correct execution.

Instead when the instructions

execute, the data flow must be

preserved. This data flow is

preserved by preserving controlflow.


27/56

Overcoming Data Hazards using

dynamic scheduling

Hardware rearranges the instruction execution

to reduce the stalls while maintaining data

flow and exception behavior

Advantage of dynamic scheduling is gained at

the cost of significant increase in hardware

complexity


28/56

Dynamic scheduling

Limitation of a simple pipeline: Instructions

are issued in program order, and if an

instruction is stalled in the pipeline, no later

instruction can proceed

Ex: DIV.D F0, F2, F4

ADD.D F10, F0, F8

SUB.D F12, F8, F14

( SUB.D cannot execute because the dependence of ADD.D on

DIV.D causes the pipeline to stall even though it is not data

dependent on anything in the pipeline)


29/56

Dynamic scheduling

In the 5-stage pipeline, both structural and data

hazards are checked in the ID stage

In order to begin executing SUB.D , we must separate

the ID stage into two parts:

1. checking for structural hazards2. Waiting for the absence of data hazard

We still use in-order instruction issue, but we want

an instruction to begin execution as soon as its dataoperands are available (out-of-order execution and

hence out-of order-completion)


30/56

Dynamic scheduling

Out-of-order execution introduces the possibility of

WAR and WAW hazards

Ex: DIV.D F0, F2, F4

ADD.D F6, F0, F8

SUB.D F8, F10, F14

MUL.D F6, F10, F8

( WAR hazard because of ADD.D and SUB.D if SUB.D

executes before ADD.D.WAW hazard because of ADD.D and MUL.D )


31/56

Dynamic schedulingTomosulos

approach

The scheme

- tracks when operands for instructions are

available, to minimize RAW hazards and- uses register renaming, to minimize WAR and WAW

hazards

In Tomosulos approach, register renaming is

provided by the reservation stations


32/56

Dynamic schedulingTomasulos

approach

How register renaming eliminates WAR and WAW

hazards:

Ex: code before renaming:

DIV.D F0, F2, F4

ADD.D F6, F0, F8

S.D F6, 0(R1)

SUB.D F8, F10, F14

MUL.D F6, F10, F8

Code after renaming:

DIV.D F0, F2, F4

ADD.D S, F0, F8

S.D S, 0(R1)

SUB.D T, F10, F14

MUL.D F6, F10, T


33/56

Tomasulo-based MIPS processor


34/56


Each reservation station has 7 fields

Op:Operation to perform on source operands

Vj, Vk: Value of Source operands

Loads have offset value in Vk field

Qj, Qk: Reservation stations producing the corresponding source

operands (value of 0 indicates that the value of source operand

already available in Vj or Vk, or is unnecessary)

A : used to hold information for the memory address

calculation for a load or store

Busy: Indicates reservation station or Functional unit is busy


35/56


The register file has one field

Qi : The number/name of the reservation station that

contains the operation whose result should be

stored into this register

The Load and Store buffers each have a field

A : holds the result of the effective address


36/56

Three Stages of Tomasulo

Algorithm1. Issueget instruction from FP Op Queue (Maintains

the correct data flow)

If reservation station free (no structural hazard),

control issues instr & sends operands (renames registers).2. Executeoperate on operands (EX)

When both operands ready then execute;

if not ready, watch Common Data Bus for result

3. Write resultfinish execution (WB)Write on Common Data Bus to all awaiting units;

mark reservation station available


37/56

Dynamic Scheduling : Example

Show what the information tables look like for the following codesequence when only the first load has completed and written its

result:

1. L.D F6, 34(R2)

2. L.D F2, 45(R3)

3. MUL.D F0, F2, F4

4. SUB.D F8, F2, F6

5. DIV.D F10, F0, F6

6. ADD.D F6, F8, F2


38/56

Tomasulo ExampleInstruction status: Exec Write

Instruction j k Issue Comp Result Busy Address

L.D F6 34+ R2 Load1 NoL.D F2 45+ R3 Load2 No

MULT.D F0 F2 F4 Load3 No

SUB.D F8 F6 F2

DIV.D F10 F0 F6

ADD.D F6 F8 F2

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk

Add1 No

Add2 No

Add3 No

Mult1 No

Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

0 FU


39/56

Tomasulo Example Cycle 1Instruction status: Exec Write


LD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 No

MULTD F0 F2 F4 Load3 No

SUBD F8 F6 F2

DIVD F10 F0 F6

ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk

Add1 No

Add2 No

Add3 No

Mult1 No

Mult2 No


1 FU Load1


40/56



LD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3

MULTD F0 F2 F4 Load3 No

SUBD F8 F6 F2

DIVD F10 F0 F6

ADDD F6 F8 F2


Time Name Busy Op Vj Vk Qj Qk

Add1 No

Add2 No

Add3 No

Mult1 No

Mult2 No


2 FU Load2 Load1


41/56



LD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3

MULTD F0 F2 F4 3 Load3 No

SUBD F8 F6 F2

DIVD F10 F0 F6

ADDD F6 F8 F2


Add1 No

Add2 No

Add3 No

Mult1 Yes MULTD R(F4) Load2

Mult2 No


3 FU Mult1 Load2 Load1


42/56



LD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3


SUBD F8 F6 F2 4

DIVD F10 F0 F6

ADDD F6 F8 F2


Add1 Yes SUBD M(A1) Load2

Add2 No

Add3 No

Mult1 Yes MULTD R(F4) Load2

Mult2 No


4 FU Mult1 Load2 M(A1) Add1

Load2 completing; what is waiting for Load2?


43/56



LD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 No


SUBD F8 F6 F2 4

DIVD F10 F0 F6 5

ADDD F6 F8 F2


2 Add1 Yes SUBD M(A1) M(A2)

Add2 No

Add3 No

10 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1


5 FU Mult1 M(A2) M(A1) Add1 Mult2


44/56





SUBD F8 F6 F2 4

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6



Add2 Yes ADDD M(A2) Add1

Add3 No




6 FU Mult1 M(A2) Add2 Add1 Mult2


45/56





SUBD F8 F6 F2 4 7

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6



Add2 Yes ADDD M(A2) Add1

Add3 No




7 FU Mult1 M(A2) Add2 Add1 Mult2

Add1 completing; what is waiting for it?


46/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6


Add1 No

2 Add2 Yes ADDD (M-M) M(A2)

Add3 No




8 FU Mult1 M(A2) Add2 (M-M) Mult2


47/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6


Add1 No


Add3 No






48/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10


Add1 No


Add3 No





Add2 completing; what is waiting for it?


49/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11


Add1 No

Add2 No

Add3 No




11 FU Mult1 M(A2) (M-M+ (M-M) Mult2


50/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11


Add1 No

Add2 No

Add3 No






51/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11


Time Name Busy Op Vj Vk Qj Qk Add1 No

Add2 No

Add3 No






52/56





SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11



Add2 No

Add3 No






53/56




MULTD F0 F2 F4 3 15 Load3 No

SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11



Add2 No

Add3 No






54/56




MULTD F0 F2 F4 3 15 16 Load3 No

SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11



Add2 No

Add3 No

Mult1 No

40 Mult2 Yes DIVD M*F4 M(A1)


16 FU M*F4 M(A2) (M-M+ (M-M) Mult2


55/56

Faster than light computation

(skip a couple of cycles)

l l l


56/56



LD F6 34+ R2 1 3 4 Load1 No

LD F2 45+ R3 2 4 5 Load2 No

MULTD F0 F2 F4 3 15 16 Load3 No

SUBD F8 F6 F2 4 7 8

DIVD F10 F0 F6 5

ADDD F6 F8 F2 6 10 11



Add2 No

Add3 No

Mult1 No

1 Mult2 Yes DIVD M*F4 M(A1)


55 FU M*F4 M(A2) (M-M+ (M-M) Mult2

Instruction Level Parallelism2

Documents

Transcript of Instruction Level Parallelism2