Chapter 10 Scheduling Presented by Vladimir Yanovsky.


The goals

• Scheduling: Mapping of parallelism within the constraints of limited available parallel resources

• In general, we must sacrifice some parallelism to fit a program within the available resources

• Our goal: Minimize the amount of parallelism sacrificed/maximize utilization of the resources

Lecture Outline

– Straight line scheduling

– Trace Scheduling

– Loops: Kernel Scheduling (Software Pipelining)

– Vector unit scheduling

Scheduling - Motivation

• Transistor sizes have shrunk. This can be exploited by:

1. Several processors on the same silicon.

2. Multiple identical execution units.

• The more parallelism the processor allows, the more important scheduling is.

Processor Types

• Superscalar

Multiple functional units controlled and scheduled by the hardware.

• VLIW (Very Large Instruction Word)

Scheduled by the compiler

VLIW vs Superscalar

• Compatibility

• Capability of run-time adjustments (branches & cache misses)

• Design simplicity

• Global view of the program

Scheduling – standard approach

• Scheduling in VLIW and Superscalar architectures:

– Receive a sequential stream of instructions

– Reorder this sequential stream to utilize available parallelism

– Reordering must preserve dependences

• Our model for this talk is VLIW

Reuse Constraints

• Need to execute:

a = b + c + d + e

• One possible sequential stream:

add a, b, c
add a, a, d
add a, a, e

• And another:

add r1, b, c
add r2, d, e
add a, r1, r2
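The difference between the two streams can be made concrete with a minimal Python sketch. It assumes unit delay for every add and, as a simplification, treats true, output and anti dependences alike (the later instruction waits for the earlier one to finish). The function name `chain_length` and the tuple encoding `(dest, src1, src2)` are illustrative choices, not part of the slides.

```python
def chain_length(stream):
    """Cycles needed by the longest dependence chain, assuming unit delay.

    Each instruction is (dest, src1, src2). Instruction i waits for every
    earlier instruction j that wrote one of i's sources (true dependence),
    wrote i's destination (output dependence) or read i's destination
    (antidependence).
    """
    finish = []  # finish[j] = first cycle in which instruction j is complete
    for i, (dest, *srcs) in enumerate(stream):
        start = 0
        for j in range(i):
            d_j, *s_j = stream[j]
            if d_j in srcs or d_j == dest or dest in s_j:
                start = max(start, finish[j])
        finish.append(start + 1)
    return max(finish)


reuse = [("a", "b", "c"), ("a", "a", "d"), ("a", "a", "e")]
fresh = [("r1", "b", "c"), ("r2", "d", "e"), ("a", "r1", "r2")]
print(chain_length(reuse), chain_length(fresh))
```

Reusing register a serializes all three additions, while the stream with r1 and r2 leaves the first two additions independent, at the cost of two extra registers.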

Fundamental Problem

• Fundamental conflict in scheduling:

– If the original instruction stream takes into account available resources, it will create artificial dependences

– If not, then there may not be enough resources to correctly execute the stream

• Which should come first, register allocation or scheduling?

Processor Model

• VLIW type

• Processor contains a number of issue units

• Each issue unit has an associated type and a delay

• Purpose: select a set of instructions for each cycle such that the number of instructions of each type is not greater than the number of execution units of that type

Straight Line Scheduling

• Scheduling a basic block: receives a dependence graph G = (N, E, type, delay)

– N: set of instructions in the code

– E: (n1, n2) ∈ E iff n2 must wait for the completion of n1 due to a dependence

– Each n ∈ N has a type, type(n), and a delay, delay(n).

Straight Line Scheduling

• A correct schedule is a mapping, S, from vertices in the graph to nonnegative integers representing cycle numbers such that:

1. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied

2. Hardware constraints are satisfied.

• The length of a schedule S, denoted L(S), is defined as: L(S) = max_n (S(n) + delay(n))

• Goal of straight-line scheduling: Find a shortest possible correct schedule.
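The two correctness conditions and the schedule length translate directly into a checker. The sketch below is only an illustration: the three-instruction graph, its delays and the single memory and integer units are hypothetical, as are the names `is_correct` and `length`.

```python
from collections import Counter


def is_correct(S, edges, delay, typ, units):
    """Check both conditions for a schedule S (instruction -> cycle)."""
    # 1. Dependences: n2 may not issue before n1 has completed.
    for n1, n2 in edges:
        if S[n1] + delay[n1] > S[n2]:
            return False
    # 2. Hardware: in each cycle at most units[t] instructions of type t issue.
    issued = Counter((cycle, typ[n]) for n, cycle in S.items())
    return all(k <= units[t] for (_, t), k in issued.items())


def length(S, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(S[n] + delay[n] for n in S)


# Hypothetical example: two loads feeding one add.
delay = {"ld1": 2, "ld2": 2, "add": 1}
typ = {"ld1": "mem", "ld2": "mem", "add": "int"}
units = {"mem": 1, "int": 1}
edges = [("ld1", "add"), ("ld2", "add")]
good = {"ld1": 0, "ld2": 1, "add": 3}
bad = {"ld1": 0, "ld2": 0, "add": 2}  # both loads need the single memory unit

print(is_correct(good, edges, delay, typ, units), length(good, delay))
print(is_correct(bad, edges, delay, typ, units))
```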

List Scheduling

• Use a variant of topological sort:

– Maintain a list of instructions which have no unscheduled predecessors in the graph

– Schedule these instructions

– This will allow other instructions to be added to the list

– Repeat until all instructions are scheduled

List Scheduling

• We maintain two arrays:

– count determines for each instruction how many predecessors are still to be scheduled

– earliest maintains the earliest cycle on which the instruction can be scheduled.

• Maintain a number of worklists which hold instructions to be scheduled for a particular cycle number. All their predecessors are scheduled.

List Scheduling - Initialization

for each n ∈ N do begin count[n] := 0; earliest[n] := 0 end
for each (n1,n2) ∈ E do begin
    count[n2] := count[n2] + 1;
    successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC - 1 do W[i] := ∅; // MaxC = max(delay) + 1
Wcount := 0; // the number of ready instructions
for each n ∈ N do begin
    if count[n] = 0 then begin // no dependencies
        W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
    end
end
c := 0; // c is the cycle number
cW := 0; // cW is the number of the worklist for cycle c
instr[c] := ∅;

List Scheduling Algorithm

while Wcount > 0 do begin
    while W[cW] = ∅ do begin
        c := c + 1; instr[c] := ∅; cW := mod(cW+1, MaxC);
    end
    nextc := mod(c+1, MaxC); // next cycle
    while W[cW] ≠ ∅ do begin
        select and remove an arbitrary instruction x from W[cW];
        if free issue units of type(x) on cycle c then begin
            instr[c] := instr[c] ∪ {x}; Wcount := Wcount - 1;
            for each y ∈ successors[x] do begin
                count[y] := count[y] - 1;
                earliest[y] := max(earliest[y], c + delay(x));
                if count[y] = 0 then begin
                    loc := mod(earliest[y], MaxC);
                    W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
                end
            end
        end
        else W[nextc] := W[nextc] ∪ {x}; // x could not be scheduled
    end
    // for each unused unit, insert a stall
end
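A compact Python rendering of list scheduling with cyclic worklists may help when tracing the algorithm by hand. It is a sketch under two assumptions that the slides do not state: every delay is at least 1, and every instruction type has at least one issue unit. The load/add/store graph is a made-up example.

```python
def list_schedule(nodes, edges, typ, delay, units):
    """Return instr, where instr[c] lists the instructions issued in cycle c."""
    count = {n: 0 for n in nodes}      # unscheduled predecessors
    earliest = {n: 0 for n in nodes}   # earliest legal issue cycle
    successors = {n: [] for n in nodes}
    for n1, n2 in edges:
        count[n2] += 1
        successors[n1].append(n2)

    max_c = max(delay.values()) + 1    # number of cyclic worklists
    W = [[] for _ in range(max_c)]
    for n in nodes:
        if count[n] == 0:              # no dependencies: ready in cycle 0
            W[0].append(n)
    wcount = len(W[0])                 # instructions waiting in some worklist

    c, instr = 0, [[]]
    while wcount > 0:
        while not W[c % max_c]:        # skip cycles with nothing ready
            c += 1
            instr.append([])
        ready, W[c % max_c] = W[c % max_c], []
        free = dict(units)             # free issue units in cycle c
        for x in ready:
            if free.get(typ[x], 0) > 0:
                free[typ[x]] -= 1
                instr[c].append(x)
                wcount -= 1
                for y in successors[x]:
                    count[y] -= 1
                    earliest[y] = max(earliest[y], c + delay[x])
                    if count[y] == 0:
                        W[earliest[y] % max_c].append(y)
                        wcount += 1
            else:                      # no free unit: retry in the next cycle
                W[(c + 1) % max_c].append(x)
    return instr


# Hypothetical block: two loads, an add and a store, one unit of each type.
nodes = ["ld1", "ld2", "add", "st"]
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st")]
typ = {"ld1": "mem", "ld2": "mem", "add": "int", "st": "mem"}
delay = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}
schedule = list_schedule(nodes, edges, typ, delay, {"mem": 1, "int": 1})
print(schedule)
```

Instructions are taken from a worklist in insertion order here; a priority such as the remaining critical-path length could replace that arbitrary choice.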

Priority

Finding the critical path

for each n ∈ N do begin count[n] := 0; remaining[n] := delay(n); end
for each (n1,n2) ∈ E do begin
    count[n1] := count[n1] + 1; // count[n] = 0 iff nothing depends on n
    predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do if count[n] = 0 then W := W ∪ {n}; // init: W holds instructions without dependents
while W ≠ ∅ do begin
    select and remove an arbitrary instruction x from W;
    for each y ∈ predecessors[x] do begin
        count[y] := count[y] - 1;
        remaining[y] := max(remaining[y], remaining[x] + delay(y));
        if count[y] = 0 then W := W ∪ {y};
    end
end
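The backward pass over the dependence graph can be sketched in Python as follows. The function name `remaining_delay` and the four-instruction example graph are illustrative assumptions; the resulting values could serve as scheduling priorities, with larger values scheduled first.

```python
def remaining_delay(nodes, edges, delay):
    """remaining[n] = longest delay-weighted path from n to the end of the block."""
    count = {n: 0 for n in nodes}              # successors not yet processed
    remaining = {n: delay[n] for n in nodes}
    predecessors = {n: [] for n in nodes}
    for n1, n2 in edges:
        count[n1] += 1
        predecessors[n2].append(n1)

    work = [n for n in nodes if count[n] == 0]  # nothing depends on these
    while work:
        x = work.pop()
        for y in predecessors[x]:
            count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if count[y] == 0:
                work.append(y)
    return remaining


# Hypothetical block: two loads, an add and a store.
nodes = ["ld1", "ld2", "add", "st"]
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st")]
delay = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}
remaining = remaining_delay(nodes, edges, delay)
print(remaining)
```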

Problems of list scheduling

• Previous basic block must complete before the next is started.

• Cannot schedule loops.

Trace Scheduling

• Exploit parallelism between several basic blocks.

• Trace: a collection of basic blocks that form a single path through all or part of the program.

• CFG without loops

Trace Scheduling

[Figure: a trace containing j = j + 1, i = i + 2 and the split "if e1"]

• i = i + 2 is moved below the split, so fixup code is inserted

Trace Scheduling

• Trace scheduling algorithm:

1. Select a trace based on profiling information.

2. Schedule the trace using the basic block scheduler, adding dependences from the splits/joins to the upstream/downstream instructions respectively.

3. Insert fixup code.

4. Remove the scheduled trace from the CFG.

5. If the CFG is not empty, go to 1.

Trace & line scheduling - conclusions

1. Problem with line & trace scheduling – cannot schedule loops effectively. Must unroll loops to have more “meat” for work.

2. Trace scheduling increases code size by inserting fixup code, may lead to exponential code increase.

3. Need up-to-date memory dependence information before any memory access can be moved.

Kernel Scheduling

• Moves instructions not only in space but also in time – across iterations.

• Allows better exploitation of parallelism between loop iterations.

Kernel Scheduling problem

• A kernel scheduling problem is a graph G = (N, E, delay, type, cross), where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2

Software Pipelining

• Example:

     ld   r1,0
     ld   r2,400
     fld  fr1,c
l0   fld  fr2,a(r1)
l1   fadd fr2,fr2,fr1
l2   fst  fr2,b(r1)
l3   ai   r1,r1,8
l4   comp r1,r2
l5   ble  l0

• A legal schedule:

     Load/Store          Integer        Floating Pt.
l0:  fld fr2,a(r1)       ai r1,r1,8
                         comp r1,r2
     fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1

[Figure: dependence chain fld, fadd, fst, with labels 2 and 3]

Software Pipelining

     ld   r1,0
     ld   r2,400
     fld  fr1,c
l0   fld  fr2,a(r1)
l1   fadd fr2,fr2,fr1
l2   fst  fr2,b(r1)
l3   ai   r1,r1,8
l4   comp r1,r2
l5   ble  l0

S[l0] = 0; I[l0] = 0
S[l1] = 2; I[l1] = 0
S[l2] = 2; I[l2] = 1
S[l3] = 0; I[l3] = 0
S[l4] = 1; I[l4] = 0
S[l5] = 2; I[l5] = 0

     Load/Store          Integer        Floating Pt.
l0:  fld fr2,a(r1)       ai r1,r1,8
                         comp r1,r2
     fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1

Software Pipelining

• Have to generate epilog and prolog to ensure correctness

• Prolog:

     ld   r1,0
     ld   r2,400
     fld  fr1,c
p1   fld  fr2,a(r1); ai r1,r1,8
p2   comp r1,r2
p3   beq  e1; fadd fr3,fr2,fr1

• Epilog:

e1   nop
e2   nop
e3   fst  fr3,b-8(r1)

Kernel Scheduling

• A solution to the kernel scheduling problem is a pair of tables (S, I), where:

– the schedule S maps each instruction n to a cycle within the kernel

– the iteration I maps each instruction to an iteration offset from zero,

such that S[n1] + delay(n1) ≤ S[n2] + (I[n2] - I[n1] + cross(n1,n2)) × Lk(S) for each edge (n1,n2) in E, where Lk(S) = max_n (S[n]) is the length of the kernel for S.

• Another name for the kernel's length is II, the initiation interval

Kernel scheduling - intuition

• S[n1] + delay(n1) ≤ S[n2] + (I[n2] - I[n1] + cross(n1,n2)) × Lk(S)

• Instructions with I[n] = 0 are running in the “current” iteration.

• If I[n]>0 this means that the instruction is delayed by I[n] iterations.

• Even if n1 has large delay, n2 can be moved to a later iteration instead of forcing it to be scheduled in the cycle S[n1] + delay(n1)
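The dependence condition can be checked mechanically for the fld/fadd/fst loop. In the sketch below the S and I tables are the ones given earlier; the delays (2 for fld, 3 for fadd, 1 otherwise), the edge list and the loop-carried edge from l3 to l0 are assumptions made for illustration, and the kernel length is passed in explicitly as L = 3, the number of cycles in the kernel shown.

```python
def kernel_ok(S, I, edges, delay, L):
    """edges maps (n1, n2) -> cross(n1, n2); L is the kernel length."""
    return all(
        S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross) * L
        for (n1, n2), cross in edges.items()
    )


S = {"l0": 0, "l1": 2, "l2": 2, "l3": 0, "l4": 1, "l5": 2}
I = {"l0": 0, "l1": 0, "l2": 1, "l3": 0, "l4": 0, "l5": 0}
delay = {"l0": 2, "l1": 3, "l2": 1, "l3": 1, "l4": 1, "l5": 1}  # assumed delays
edges = {
    ("l0", "l1"): 0,  # fld feeds fadd
    ("l1", "l2"): 0,  # fadd feeds fst
    ("l3", "l4"): 0,  # ai feeds comp
    ("l4", "l5"): 0,  # comp feeds ble
    ("l3", "l0"): 1,  # updated r1 is used by the next iteration's fld
}

print(kernel_ok(S, I, edges, delay, 3))
same_iteration = dict(I, l2=0)  # try to keep the store in the current iteration
print(kernel_ok(S, same_iteration, edges, delay, 3))
```

Under these assumed delays the store only fits because it is delayed by one iteration: with I[l2] = 0 the fadd result would be needed in cycle 2 but is not ready until cycle 5.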

Resource Constraints

• Resource usage constraint:

– No recurrence in the loop

– #t: number of instructions in each iteration that must issue in a unit of type t

– m_t: number of units of type t

$L_k(S) \ge \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$

• We can always find a schedule S such that

$L_k(S) = \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$
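The resource bound on the kernel length, the largest per-type ratio of instructions to units rounded up, is a one-line computation. The sketch assumes that the branch issues on the integer unit and that the machine has one unit of each type for the floating-point loop; both are illustrative assumptions, as is the name `resource_bound`.

```python
from collections import Counter
from math import ceil


def resource_bound(types, units):
    """Smallest kernel length allowed by the issue units alone."""
    need = Counter(types)  # number of instructions per unit type
    return max(ceil(need[t] / units[t]) for t in need)


# fld, fadd, fst, ai, comp, ble on one memory, one integer and one FP unit.
fp_loop = ["mem", "fp", "mem", "int", "int", "int"]
print(resource_bound(fp_loop, {"mem": 1, "int": 1, "fp": 1}))

# ld, ai, ai, ai, st on two memory units and three integer units.
int_loop = ["mem", "int", "int", "int", "mem"]
print(resource_bound(int_loop, {"mem": 2, "int": 3}))
```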

Kernel Scheduling

for each instruction x in G in topological order do begin
    earlyS := 0; earlyI := 0;
    for each predecessor y of x in G do begin
        thisS := S[y] + delay(y); thisI := I[y];
        if thisS ≥ L then begin
            thisI := thisI + ⌊thisS/L⌋; thisS := mod(thisS, L);
        end
        if thisI > earlyI or ((thisI = earlyI) && (thisS > earlyS)) then begin
            earlyI := thisI; earlyS := thisS;
        end
    end
    starting at cycle earlyS, find the first cycle c0 where the resource needed by x is available, wrapping to the beginning of the kernel if necessary;
    S[x] := c0;
    if c0 < earlyS then I[x] := earlyI + 1 else I[x] := earlyI; // wrapped over kernel
end

Software Pipelining Example

l0 ld a,x(i)

l1 ai a,a,1

l2 ai a,a,1

l3 ai a,a,1

l4 st a,x(i)

Memory1 Integer1 Integer2 Integer3 Memory2

l0: S=0; I=0 l1: S=0; I=1 l2: S=0; I=2 l3: S=0; I=3 l4: S=0; I=4

• 2 memory units, 3 integer units.

• II = 1 is enough: each successive instruction is pushed to the next iteration.
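The S and I values of this example can be reproduced with a short Python sketch of the simple (recurrence-free) kernel scheduling algorithm. It assumes a delay of 1 for every instruction, which the slide does not state; the function name `kernel_schedule` and the data layout are illustrative.

```python
def kernel_schedule(nodes, preds, delay, typ, units, L):
    """nodes in topological order; preds[x] lists the predecessors of x."""
    S, I = {}, {}
    used = [dict.fromkeys(units, 0) for _ in range(L)]  # busy units per kernel cycle
    for x in nodes:
        early_s = early_i = 0
        for y in preds.get(x, []):
            this_s, this_i = S[y] + delay[y], I[y]
            if this_s >= L:              # result arrives in a later iteration
                this_i += this_s // L
                this_s %= L
            if (this_i, this_s) > (early_i, early_s):
                early_i, early_s = this_i, this_s
        # First cycle at or after early_s (wrapping) with a free unit of x's type.
        for k in range(L):
            c0 = (early_s + k) % L
            if used[c0][typ[x]] < units[typ[x]]:
                break
        else:
            raise ValueError("kernel length L is too small for " + x)
        used[c0][typ[x]] += 1
        S[x] = c0
        I[x] = early_i + 1 if c0 < early_s else early_i  # wrapped over the kernel
    return S, I


nodes = ["l0", "l1", "l2", "l3", "l4"]
preds = {"l1": ["l0"], "l2": ["l1"], "l3": ["l2"], "l4": ["l3"]}
delay = dict.fromkeys(nodes, 1)  # assumed unit delays
typ = {"l0": "mem", "l1": "int", "l2": "int", "l3": "int", "l4": "mem"}
S, I = kernel_schedule(nodes, preds, delay, typ, {"mem": 2, "int": 3}, L=1)
print(S, I)
```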

Register Pressure

• l0 ld a0,x(i)
• l1 ai a1,a0,1
• l2 ai a2,a1,1
• l3 ai a3,a2,1
• l4 st a3,x(i)

1. The same register a cannot be used in 4 different iterations running simultaneously.

2. Need to keep a separate copy of the register's value for each overlapping iteration and rename the copies cyclically after each iteration.

3. Issue 2 can be solved by unrolling with renaming, though this will increase code size

Prolog & Epilog

[Figure: software pipeline with stages A to E. Successive iterations (iter 1, iter 2, …, iter x-1, iter x) start II cycles apart and overlap.]

Block code layout    What's happening
Prologue             Fill pipeline
Kernel               Steady state
Epilogue             Drain pipeline

1. Current iteration when entering the kernel is 5.

2. I(Stage A)=0, that is we execute Stage A in the same iteration as initially.

3. I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration.

4. Prolog: StageA1; StageB1,StageA2;StageC1,StageB2,StageA3…

Prolog & Epilog generation

• Prolog:

for k = 0 to max_n(I(n)) - 1, lay out the kernel replacing all n s.t. I(n) > k by NO-OP

• Epilog:

for k = 1 to max_n(I(n)), lay out the kernel replacing all n s.t. I(n) < k by NO-OP

• Compact both using list scheduling.
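The two layout rules can be sketched in Python. The kernel used here is the fld/fadd/fst kernel from the earlier software pipelining example, written as a list of cycles; the function name `prolog_epilog` is illustrative, and the sketch leaves out both branch retargeting and the final compaction step.

```python
def prolog_epilog(kernel, I):
    """kernel: list of cycles, each a list of instruction names; I: iteration offsets."""
    depth = max(I.values())

    def copy(keep):
        # One copy of the kernel with every instruction that fails `keep` blanked out.
        return [[n if keep(n) else "nop" for n in cycle] for cycle in kernel]

    prolog = [copy(lambda n: I[n] <= k) for k in range(depth)]
    epilog = [copy(lambda n: I[n] >= k) for k in range(1, depth + 1)]
    return prolog, epilog


kernel = [["fld", "ai"], ["comp"], ["fst", "ble", "fadd"]]
I = {"fld": 0, "ai": 0, "comp": 0, "fst": 1, "ble": 0, "fadd": 0}
prolog, epilog = prolog_epilog(kernel, I)
print(prolog)
print(epilog)
```

With a single delayed instruction (the store), one kernel copy is enough on each side: the prolog drops the store, and the epilog keeps only the store.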

Recurrences

• Given a recurrence (n1, n2, …, nk):

$L_k(S) \ge \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}$, where $n_{k+1} = n_1$

– The right hand side is called the slope of the recurrence. The numerator is the number of cycles it takes to complete all the computations of the recurrence; the denominator is the number of iterations available to do this.

– Over all recurrences c:

$L_k(S) \ge \max_c \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}$
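The slope of a recurrence is easy to compute once the cycle is written down as a list of instructions. The recurrence below (a load, an add and a store whose result is read again one iteration later) and its delays are hypothetical, as is the name `slope`; a recurrence is assumed to cross at least one iteration, otherwise the denominator would be zero.

```python
from fractions import Fraction
from math import ceil


def slope(recurrence, delay, cross):
    """Total delay around the cycle divided by the iterations it spans.

    recurrence: [n1, ..., nk]; cross[(ni, nj)] is given for each consecutive
    pair, including the closing edge from nk back to n1.
    """
    k = len(recurrence)
    total_delay = sum(delay[n] for n in recurrence)
    total_cross = sum(
        cross[(recurrence[i], recurrence[(i + 1) % k])] for i in range(k)
    )
    return Fraction(total_delay, total_cross)


rec = ["ld", "add", "st"]
delay = {"ld": 2, "add": 1, "st": 1}
cross_one = {("ld", "add"): 0, ("add", "st"): 0, ("st", "ld"): 1}
cross_two = {("ld", "add"): 0, ("add", "st"): 0, ("st", "ld"): 2}
print(slope(rec, delay, cross_one), slope(rec, delay, cross_two))

# Kernel length is an integer, so the bound from several recurrences is rounded up.
bound = ceil(max(slope(rec, delay, cross_one), slope(rec, delay, cross_two)))
print(bound)
```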

Kernel Scheduling – General Case

1. Compute MII to be the maximum of resource constraint and the maximum slope.

2. II=MII

3. Remove an edge from every recurrence.

4. Schedule(II) using the simple kernel scheduling algorithm.

5. If failed (the dependence of any removed edge is violated), increase II and go to 4.

Kernel Scheduling - Conclusions

• Handling control flow is difficult. May use hardware support for predicated execution or handling the “control flow regions” as black boxes.

• Increased register pressure may limit its use to single-basic-block inner loops anyway.

• Benefits from unrolling with renaming.

Vector Unit Scheduling

• A vector instruction involves the execution of many scalar instructions

• Much of the benefit from the pipelining is already achieved

• Still, something can be done

Chaining

• Chaining:

vload t1, a
vload t2, b
vadd t3, t1, t2
vstore t3, c

• Two load units

• Each operation takes 64 cycles

• 192 cycles without chaining

• 66 cycles with chaining

• Proximity within instructions required for hardware to identify opportunities for chaining
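The two cycle counts follow from simple arithmetic, sketched below. The model is an assumption chosen to match the numbers on the slide: the two loads run in parallel on the two load units, so there are three dependent stages (load, add, store), and with chaining each stage starts one cycle after the stage that feeds it.

```python
def cycles(stages, vector_length=64, chained=False):
    """Cycles for `stages` dependent vector operations of `vector_length` elements."""
    if chained:
        # Stages overlap: each one starts a cycle after its producer (assumed).
        return vector_length + (stages - 1)
    # Without chaining each stage waits for the previous one to finish.
    return stages * vector_length


print(cycles(3), cycles(3, chained=True))
```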

Chaining rearranging

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3

• Rearranging:

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3

• 2 load pipes, 1 addition pipe, 1 multiplication pipe

Instruction fusion

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3

Instruction fusion – cont.

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3

After Fusion

The End!
