Loop Scheduling and Software Pipelining - CAPSL Scheduling and Software Pipelining ... formulate and...

Loop Scheduling and Software Pipelining

2008-04-24 \course\cpeg421-08s\Topic-7.ppt

Reading List

• Slides: Topic 7 and 7a

• Other papers as assigned in class or homework

ABET Outcome

• Ability to apply knowledge of basic code generation techniques, e.g. loop scheduling techniques such as software pipelining, to solve code generation problems.

• An ability to identify, formulate and solve loop scheduling problems using software pipelining techniques.

• Ability to analyze the basic algorithms behind the above techniques and conduct experiments to show their effectiveness.

• Ability to use a modern compiler development platform and tools to practice the above.

• Knowledge of contemporary issues on this topic.

Outline

• Brief overview

• Problem formulation of the modulo scheduling problem

• Solution methods

• Summary

General Compiler Framework

• Good IPO

• Good LNO

• Good global optimization

• Good integration of IPO/LNO/OPT

• Smooth information passing between FE and CG

• Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation

[Figure: compiler framework. Source → ME: Inter-Procedural Optimization (IPA), Loop Nest Optimization (LNO), Global Optimization (OPT) → BE/CG: innermost loop scheduling, global inst scheduling, reg alloc, local inst scheduling → Executable. Arch Models are shown alongside.]

Questions?

• How to formulate the loop scheduling problem?

• How to model it?

• How to solve it?

Questions (cont’d)

• Instruction scheduling for code without loops: a review

• Dependence graphs may become cyclic! So the “critical path length” is less obvious!

• Is it becoming harder?

• What new insights are required to formulate and solve it?

Challenges of Loop Scheduling

[Figure: a DDG with cycles]

Right strategy?

Observations

• Execution of “good loops” tends to be “regular” and “repetitive”: a “pattern” may appear

• This gives the “cyclic scheduling” problem a new twist!

• How to efficiently derive a “pattern”?

Problem Formulation (I)

Given a weighted dependence graph, derive a schedule which is “time-optimal” under a machine model M.

Def: A schedule S of a loop L is time-optimal if, among all “legal” schedules of L, no other schedule is faster than S.

Note: There may be more than one time-optimal schedule.

A Short Tour on Data Dependence Graphs for Loops

Basic Concept and Motivation

• Data dependence between 2 accesses

- They access the same memory location

- There exists an execution path between them

- One of them is a write

• Three types of data dependence

• Dependence graphs

• Things are not simple when dealing with loops

Types of Data Dependence

• Flow dependence (δ): X = ... followed by ... = X

• Anti-dependence (δ^-1): ... = X followed by X = ...

• Output dependence (δ^0): X = ... followed by X = ...

Data Dependence

Example 1:

S1: A = 0
S2: B = A
S3: C = A + D
S4: D = 2

Sx δ Sy ⇒ Sy depends on Sx

[Figure: dependence graph over S1, S2, S3, S4]

Data Dependence (cont’d)

Example 2:

S1: A = 0
S2: B = A
S3: A = B + 1
S4: C = A

S1 δ^0 S3: output-dep

S2 δ^-1 S3: anti-dep

[Figure: dependence graph over S1, S2, S3, S4]
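The classification used in the two examples can be sketched in a few lines of Python. This is only an illustration for straight-line code with scalar variables; the function name `find_dependences` and the tuple encoding of statements are assumptions made here, not part of the slides.

```python
def find_dependences(stmts):
    """stmts: list of (label, written_var, set_of_read_vars) in program order.

    Returns (source, sink, kind) triples with kind in {"flow", "anti", "output"}.
    """
    deps = []
    for i, (src, w_src, r_src) in enumerate(stmts):
        for j in range(i + 1, len(stmts)):
            dst, w_dst, r_dst = stmts[j]
            # Variables overwritten strictly between the two statements.
            killed = {w for _, w, _ in stmts[i + 1:j]}
            if w_src in r_dst and w_src not in killed:
                deps.append((src, dst, "flow"))      # write, then read
            if w_dst in r_src and w_dst not in killed:
                deps.append((src, dst, "anti"))      # read, then write
            if w_src == w_dst and w_src not in killed:
                deps.append((src, dst, "output"))    # write, then write
    return deps


# Example 2 from the slides.
example2 = [
    ("S1", "A", set()),
    ("S2", "B", {"A"}),
    ("S3", "A", {"B"}),
    ("S4", "C", {"A"}),
]
deps2 = find_dependences(example2)
```

For Example 2 the result contains the output dependence S1 to S3 and the anti-dependence S2 to S3 listed above, plus the flow dependences S1 to S2, S2 to S3 and S3 to S4.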

Should we consider input dependence?

... = X
... = X

Is the reading of the same X important?

Well, it may be! (If we intend to group the 2 reads together for cache optimization!)

Subscript Variables

• Extension of def-use chains to employ a more precise treatment of arrays, especially in iterative loops.

DO I = 1, N
A(I + 1) = X(I + 1) + B(I)
X(I) = A(I) * 5
ENDDO

Dependence Graph (cont’d)

Applications

- register allocation

- instruction scheduling

- loop scheduling

- vectorization

- parallelization

- memory hierarchy optimization

Data Dependence in Loops

An Example

Find the dependence relations due to the array X in the program below:

(1) for I = 2 to 9 do
(2) X[I] = Y[I] + Z[I]
(3) A[I] = X[I-1] + 1
(4) end for

Solution

To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others:

      I = 2              I = 3              I = 4
(2)   X[2]=Y[2]+Z[2]     X[3]=Y[3]+Z[3]     X[4]=Y[4]+Z[4]
(3)   A[2]=X[1]+1        A[3]=X[2]+1        A[4]=X[3]+1

In our example, there is a loop-carried, lexically forward flow dependence relation.

Data Dependence in Loops (cont’d)

[Figure: data dependence graph for statements in a loop, with an edge from S2 to S3 labeled (1).]

- Loop-carried vs loop-independent

- Lexical-forward vs lexical-backward

Dependence distance = 1
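The unrolling argument can be mimicked mechanically. The sketch below hard-codes the access pattern of the example loop (statement (2) writes X[I], statement (3) reads X[I-1]) and records, for every read, how many iterations earlier the value was written. The function name is an assumption made for illustration.

```python
def loop_carried_flow_distances(lo, hi):
    """Simulate 'for I = lo to hi: X[I] = ...; A[I] = X[I-1] + 1'.

    Returns the set of dependence distances (in iterations) of the flow
    dependences from the write of X in statement (2) to the read in (3).
    """
    last_writer = {}      # element index of X -> iteration that wrote it
    distances = set()
    for i in range(lo, hi + 1):
        last_writer[i] = i               # statement (2) writes X[I]
        src = last_writer.get(i - 1)     # statement (3) reads X[I-1]
        if src is not None:              # X[lo-1] is never written in the loop
            distances.add(i - src)
    return distances


dists = loop_carried_flow_distances(2, 9)
```

Every recorded distance is 1, which agrees with the loop-carried flow dependence of distance 1 stated above.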

An Example

for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;

Assume each operation takes 1 cycle and there is only one addition unit!

time   i = 0   i = 1   i = 2   i = 3 ...
0      a
1      b
2      c       a
3              b
4              c       a
...

So, initiation interval II = 2.

[Figure: DDG with nodes a, b, c.]

Note: We use a token here to represent a flow dependence of distance = 1.

Software Pipeline: Concept

Software pipelining is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources.

[Figure: timeline (cycles 0 to 16) of overlapped loop iterations, each consisting of ldf, fadds, stf, sub, cmp, bg.]

The Structure of the SWP code

% prologue
a[0] = a[-1] + R[0];
% pattern
for i = 0 to N-2 do
b[i] = a[i] + c[i - 1];
a[i + 1] = a[i] + R[i + 1];
c[i] = b[i] + 1;
end;
% epilogue
b[N - 1] = a[N - 1] + c[N - 2];
c[N - 1] = b[N - 1] + 1

[Figure: prologue, then the repeating pattern (b_i; c_i, a_{i+1}), then epilogue.]
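One way to convince yourself that the prologue/pattern/epilogue form computes the same values as the original loop is to run both. The sketch below does this in Python; dictionaries keyed by the array index are used so that the initial elements a[-1] and c[-1] can be stored directly. The function names and the initial values are assumptions chosen for illustration, and N is assumed to be at least 1.

```python
def original_loop(N, R, a_init, c_init):
    a, b, c = {-1: a_init}, {}, {-1: c_init}
    for i in range(N):
        a[i] = a[i - 1] + R[i]
        b[i] = a[i] + c[i - 1]
        c[i] = b[i] + 1
    return a, b, c


def pipelined_loop(N, R, a_init, c_init):
    a, b, c = {-1: a_init}, {}, {-1: c_init}
    # prologue: first 'a' of iteration 0
    a[0] = a[-1] + R[0]
    # pattern: b and c of iteration i overlap with a of iteration i + 1
    for i in range(N - 1):              # i = 0 .. N-2
        b[i] = a[i] + c[i - 1]
        a[i + 1] = a[i] + R[i + 1]
        c[i] = b[i] + 1
    # epilogue: drain the last iteration
    b[N - 1] = a[N - 1] + c[N - 2]
    c[N - 1] = b[N - 1] + 1
    return a, b, c


R = [3, 1, 4, 1, 5]
same = original_loop(5, R, 10, 20) == pipelined_loop(5, R, 10, 20)
```

The comparison holds because the pipelined version performs exactly the same assignments, only grouped differently across iterations.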

Software Pipeline (Cont’d)

What limits the speed of a loop?

• Data dependencies: recurrence initiation interval (rec_mii)

• Processor resources: resource initiation interval (res_mii)

[Figure: the same timeline (cycles 0 to 16) of overlapped iterations (ldf, fadds, stf, sub, cmp, bg), with the initiation interval marked.]

Previous Approaches

• Approach I (Operational):

“Emulate” the loop execution under the machine model; a “pattern” will eventually occur

[AikenNic88, EbciogluNic89, GaoEtAl91]

• Approach II (Periodic scheduling):

Formulate the scheduling problem as a periodic scheduling problem and find an optimal solution

[Lam88, RauEtAl81, GovindAltmanGao94]

Periodic Schedule (Modulo Scheduling)

The time (cycle) at which the i-th instance of the operation v is scheduled:

t(i, v) = T * i + A[v], where T = II

so t(i + 1, v) - t(i, v) = T * (i + 1) - T * i = T

For our example:

t(i, v) = 2i + A[v]

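where A[a] = 0

A[b] = 1

A[c] = 2

(so a_i runs at cycle 2i, b_i at cycle 2i + 1 and c_i at cycle 2i + 2, as in the schedule table of the example)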

Question: Is this an optimal schedule?

Periodic Schedule (cont’d)

Yes, the schedule

t(i, v) = 2i + A[v]

is time-optimal, with II = 2!
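A periodic schedule can be checked against the dependences directly: an edge from u to v with distance d and latency 1 requires t(i + d, v) >= t(i, u) + 1. The Python sketch below performs this check for the example loop. The offsets used (a at 0, b at 1, c at 2) are read off the schedule table of the example; the function name and edge encoding are assumptions, and only dependences are checked, not resources.

```python
def is_legal_periodic_schedule(T, A, edges, latency=1, iterations=10):
    """edges: (src, dst, distance) meaning dst of iteration i + distance
    uses the value produced by src of iteration i."""
    def t(i, v):
        return T * i + A[v]

    for i in range(iterations):
        for src, dst, dist in edges:
            if t(i + dist, dst) < t(i, src) + latency:
                return False
    return True


# Dependences of the loop: a[i] uses a[i-1]; b[i] uses a[i] and c[i-1]; c[i] uses b[i].
edges = [("a", "a", 1), ("a", "b", 0), ("b", "c", 0), ("c", "b", 1)]
A = {"a": 0, "b": 1, "c": 2}

ok_with_2 = is_legal_periodic_schedule(2, A, edges)
ok_with_1 = is_legal_periodic_schedule(1, A, edges)
```

With T = 2 all dependences hold; with T = 1 the cycle through b and c is violated, which is consistent with II = 2 being the best period for this loop.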

Restate the problem

Given a DDG of a loop L, how to determine the fastest computation rate of L, i.e. its minimum initiation interval (MII)?

Hint: Consider the “critical cycles” as well as critical resource usage in L

Recurrence MII (RecMII)

• An Example

An Example (Revisit): RecMII

for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;

Assume each operation takes 1 cycle and there is only one addition unit!

time   i = 0   i = 1   i = 2   i = 3 ...
0      a
1      b
2      c       a
3              b
4              c       a
...

So, RecMII = 2.

[Figure: DDG with nodes a, b, c.]

Maximum Computation Rate

Theorem: The maximum computation rate of a loop is bounded by the following ratio:

$$ r_{opt} = \min_{C} \left\{ \frac{D_C}{W_C} \right\} $$

where C is a dependence cycle, D_C is the total dependence distance along C and W_C is the total execution time of C, i.e.

$$ D_C = \sum_{i \in C} d_i, \qquad W_C = \sum_{i \in C} w_i $$

(d_i is the dependence distance along the edge i in C; w_i is the edge weight along the edge i in C)

And the optimal period:

$$ MII = T_{opt} = 1 / r_{opt} $$

Hint: now one must think about deadlines.

RecMII (cont’d)

Def: A cycle is critical if the period of the cycle is equal to MII

(We should write it as RecMII)

Note: a loop may have multiple critical cycles!
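For a small DDG, RecMII can be computed straight from the theorem by enumerating the dependence cycles and taking the largest ratio W_C / D_C (the reciprocal of the smallest D_C / W_C). The sketch below does this with a naive depth-first search, which is only reasonable for tiny graphs; the function name and the edge encoding (src, dst, weight, distance) are assumptions made for illustration.

```python
from fractions import Fraction


def rec_mii(nodes, edges):
    """Return max over simple cycles C of W_C / D_C as a Fraction."""
    succ = {n: [] for n in nodes}
    for s, d, w, dist in edges:
        succ[s].append((d, w, dist))
    order = {n: k for k, n in enumerate(nodes)}
    best = Fraction(0)

    def dfs(start, node, w_sum, d_sum, visited):
        nonlocal best
        for nxt, w, dist in succ[node]:
            if nxt == start:
                # Closed a cycle; a legal loop has positive total distance.
                if d_sum + dist > 0:
                    best = max(best, Fraction(w_sum + w, d_sum + dist))
            elif nxt not in visited and order[nxt] > order[start]:
                # Only walk through later nodes so each cycle is found once.
                dfs(start, nxt, w_sum + w, d_sum + dist, visited | {nxt})

    for n in nodes:
        dfs(n, n, 0, 0, {n})
    return best


# DDG of the example loop, every operation taking 1 cycle.
nodes = ["a", "b", "c"]
edges = [("a", "a", 1, 1), ("a", "b", 1, 0), ("b", "c", 1, 0), ("c", "b", 1, 1)]
result = rec_mii(nodes, edges)
```

The self-loop on a gives a ratio of 1, and the cycle through b and c gives 2/1, so the computed RecMII is 2; the b-c cycle is the critical one.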

An Example (Revisit): RecMII

for i = 0 to N - 1 do
a: a[i] = a[i - 1] + R[i];
b: b[i] = a[i] + c[i - 1];
c: c[i] = b[i] + 1;
end;

Assume each operation takes 1 cycle and there is only one addition unit!

time   i = 0   i = 1   i = 2   i = 3 ...
0      a
1      b
2      c       a
3              b
4              c       a
...

So, RecMII = 2.

[Figure: DDG with nodes a, b, c.]

How About Machine Resource and Register Constraints?

• Numbers and types of FUs, etc. (ResMII)

• Number of registers

Software Pipelining

• Review of previous work

[RauFisher93] [Rau94]

• Minimum initiation interval (MII)

MII = max {RecMII, ResMII}

where

RecMII: determined by “critical” recurrence cycles in DDG

ResMII: determined by resource constraints (#s and types of FUs)

The Resource Constraint MII (ResMII)

• Can be calculated by totaling, for each resource, the usage requirement imposed by one iteration of the loop

• Can be done by bin-packing a resource reservation table (expensive)

• Usually, deriving a lower bound is enough as a starting point to begin the search process

• Is the price higher with complex hardware pipelines?

Consider a simple example of n nodes and 1 FU: what is ResMII?
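The "totaling" lower bound is easy to write down: for each resource, divide the cycles of use required by one iteration by the number of copies of that resource, round up, and take the maximum over resources. The sketch below implements only this simple bound (not the bin-packing variant); the function names and dictionary encoding are assumptions made for illustration.

```python
from math import ceil


def res_mii(usage, units):
    """usage[r]: cycles of resource r needed by one loop iteration.
    units[r]: number of copies of resource r in the machine."""
    return max(ceil(usage[r] / units[r]) for r in usage)


def mii(rec, res):
    """MII = max {RecMII, ResMII}."""
    return max(rec, res)


n = 5
one_fu = res_mii({"fu": n}, {"fu": 1})        # n nodes, each using the single FU for 1 cycle
two_adders = res_mii({"add": 3}, {"add": 2})  # 3 additions on 2 adders
```

Under the assumption that each of the n nodes occupies the single FU for one cycle, the bound gives ResMII = n.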

Example 1: Reservation Table

[Figure: (a) a pipeline M with stages 1, 2 and 3; (b) the reservation table of M, with one row per stage and one column per cycle (0 to 5). Stage 1 is marked in three cycles, stage 2 in two cycles and stage 3 in one cycle.]

A Quiz

• What is the RT of a fully pipelined adder with 3 stages?

• How about an unpipelined adder?

• How to compute ResMII for each case?

Modulo Reservation Table

• The modulo reservation table only has length = II

• Each entry in the table records a sequence of reservations, corresponding to a sequence of slots every II cycles

• Hint: think about your weekly calendar
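The weekly-calendar idea can be sketched as a small class: a reservation at cycle t occupies slot t mod II, so two uses of the same resource that are a multiple of II apart collide. The class name, method names and the (resource, offset) encoding of a reservation table below are assumptions made for illustration.

```python
class ModuloReservationTable:
    def __init__(self, ii, units):
        self.ii = ii
        self.units = units                          # resource -> number of copies
        self.used = {r: [0] * ii for r in units}    # used[r][slot]

    def can_place(self, time, reservations):
        """reservations: list of (resource, offset); the operation keeps
        'resource' busy at cycle time + offset."""
        need = {}
        for res, off in reservations:
            key = (res, (time + off) % self.ii)
            need[key] = need.get(key, 0) + 1
        return all(self.used[r][s] + n <= self.units[r]
                   for (r, s), n in need.items())

    def place(self, time, reservations):
        if not self.can_place(time, reservations):
            return False
        for res, off in reservations:
            self.used[res][(time + off) % self.ii] += 1
        return True


mrt = ModuloReservationTable(2, {"adder": 1})
first = mrt.place(0, [("adder", 0)])    # takes slot 0
clash = mrt.place(4, [("adder", 0)])    # 4 mod 2 = 0, same slot
other = mrt.place(3, [("adder", 0)])    # 3 mod 2 = 1, free slot
```

An operation that holds a single adder for three consecutive cycles cannot be placed at all when II = 2, because it would need slot 0 (or slot 1) twice within one period.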

Modulo Scheduling

[Flowchart: start with II = MII and the list L of operations. Is L = <>? If no: choose the next operation x and try to place x in a time slot i in II. If this succeeds, remove x from L and test L again. If it fails, set II = II + 1 and try again. If L is empty, a legal schedule is found.]
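The flowchart can be turned into a short greedy procedure. The sketch below is a simplified illustration, not the algorithm of any particular compiler: it assumes each operation holds one resource for a single cycle, takes the operations in a given priority order, never backtracks, and simply retries with II + 1 when an operation cannot be placed. All names are assumptions. To make the II = 2 schedule of the running example resource-feasible, the usage example assumes two adders.

```python
def try_ii(ops, edges, units, ii):
    """ops: (name, resource) in priority order; edges: (src, dst, latency, distance)."""
    used = {r: [0] * ii for r in units}      # modulo reservation table
    t = {}
    for name, res in ops:
        # Earliest start allowed by predecessors that are already scheduled.
        earliest = 0
        for s, d, lat, dist in edges:
            if d == name and s in t:
                earliest = max(earliest, t[s] + lat - ii * dist)
        for cand in range(earliest, earliest + ii):
            if used[res][cand % ii] >= units[res]:
                continue                      # slot taken in the MRT
            ok = True
            for s, d, lat, dist in edges:     # deadlines set by scheduled successors
                if s == name:
                    t_dst = cand if d == name else t.get(d)
                    if t_dst is not None and t_dst + ii * dist < cand + lat:
                        ok = False
            if ok:
                t[name] = cand
                used[res][cand % ii] += 1
                break
        else:
            return None                       # failed: caller increases II
    return t


def modulo_schedule(ops, edges, units, mii, max_ii=64):
    for ii in range(mii, max_ii + 1):
        sched = try_ii(ops, edges, units, ii)
        if sched is not None:
            return ii, sched
    return None


ops = [("a", "add"), ("b", "add"), ("c", "add")]
edges = [("a", "a", 1, 1), ("a", "b", 1, 0), ("b", "c", 1, 0), ("c", "b", 1, 1)]
found = modulo_schedule(ops, edges, {"add": 2}, mii=1)
```

Starting the search at II = 1 fails because c cannot be placed, and the retry at II = 2 yields start times a = 0, b = 1, c = 2. Because the sketch never unschedules an operation, it can give up on an II at which a less greedy scheduler would still succeed.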

Heuristic Method for Modulo Scheduling

Why may a simple variant of list scheduling not work?

Hint: consider the deadline constraints of operations in a cycle.

Counter Example I: List Scheduling May Fail!

[Figure: (a) a DDG with nodes A, B, C, D, edge weights 4, 2, 2, 4 and an edge marked [1]; (b) a partial schedule over time slots 0 to 3, MII = RecMII = 4; (c) a valid schedule over time slots 0 to 3, MII = RecMII = 4.]

Note on (b): if simple list scheduling is used, B cannot be scheduled due to the deadline set by scheduling C: deadlock!

Note on (c): we cannot fire C as early as possible!

Example I (Cont’d)

The previous figure shows an example demonstrating the problem with greedy scheduling in the presence of recurrences.

(a) The data dependence graph with a cycle.

(b) The resulting partial schedule when C is scheduled greedily. B cannot be scheduled.

(c) The resulting valid schedule when C is not scheduled greedily but is delayed by two cycles.

Example 2: Problems with Greedy Schedule due to ResMII

[Figure: (a) DDG with operations A1, C2, A3, A4, M5 and M6; (b) a greedy schedule, shown as a table over Adder, Mult and Bus, which cannot schedule A4 within ResMII; (c) a non-greedy schedule which achieves ResMII. ResMII = 6 in both (b) and (c).]

A: non-pipelined adders

M: non-pipelined multipliers

Example 2 (Cont’d)

The previous figure shows an example demonstrating the problem with greedy scheduling in the presence of complex reservation tables.

(a) The data dependence graph without a cycle.

(b) The resulting partial schedule when A1, M6, C2, and A3 are scheduled greedily. A4 cannot be scheduled.

(c) The resulting valid schedule when A3 is scheduled one cycle later.

Example 3: Infeasibility of MII

[Figure (a): the three operations A1, M2 and MA3 and their reservation tables over the resources Adder, Mult and Bus.]

Note: ResMII = 2. But is there a legal schedule under II = 2?

Note: The presence of complex reservation tables

Example 3 (cont’d)

[Figure: (b) the MRT (Adder, Mult, Bus) for MII = 2, II = 2 after A1 and M2 are placed: MA3 cannot fit under II = 2; (c) the MRT for MII = 2, II = 3: a feasible schedule for II = 3 containing A1, M2 and MA3.]

Note: It is possible that there is no valid schedule at MII!

Infeasibility of MII (cont’d)

The previous slide shows an example demonstrating the infeasibility of the MII in the presence of complex reservation tables. (a) The three operations and their reservation tables. (b) The MRT corresponding to the dead-end partial schedule for an II of 2 after A1 and M2 have been scheduled. (c) The MRT corresponding to a valid schedule for an II of 3.

Example 4: Infeasibility of MII

Assume: fully pipelined adders

[Figure: (a) a DDG with operations A1, A2, A3 and A4, edges labeled 2 and one edge marked [2]; (b) the MRT over Adder, Mult and Bus.]

ResMII = 4

RecMII = 4

Note: It is possible that there is no valid schedule at MII! And the reservation table here is simple!

Example 4 (cont’d)

The previous slide shows an example demonstrating the infeasibility of the MII due to the interaction between the recurrence constraints and the resource usage constraints.

(a) The data dependence graph.

(b) The MRT corresponding to the dead-end partial schedule for an II of 4 after A1 and A2 have been scheduled.

How to derive a best feasible schedule?

• It is possible to do so via exhaustive search.

• But it is expensive!

A Taxonomy of Software Pipelining

Software Pipelining

• Operational Approach

- Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)

- Formal Model (GaoWonNin91) PLDI'91

• Periodic Scheduling (Modulo Scheduling)

- In-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93, StouchininEtAl00)

- Exact Method:

  Basic Formulation (DongenGao92) CONPAR'92

  Register Optimal (NingGao91, NingGao93, Ning93) POPL'93

  Resource Constrained, ILP based (GovindAltGao94) Micro27

  Resource & Register (GovindAltGao95, Altman95, EichenbergerDav95) PLDI'95

  "Showdown" (RuttenbergGaoStouchininWoody96) PLDI'96

  Exhaustive Search (Altman95) EuroPar'96

- FSA Based Method (model hardware pipeline with sharing and hazards):

  FSA Co-Scheduling formulation (GovindAltmanGao'96)

  FSA Construction Method (GovindAltmanGao98)

  FSA Heuristic/optimization (ZhangGovindRyanGao99)

  Theory of Co-Scheduling (GovindAltmanGao00)

Advanced Topics

• Consider register constraints

• Realistic pipeline architecture constraints (with structural hazards)

• Loop body with conditionals

• Multi-Dimensional Loops