Chapter 10 Scheduling Presented by Vladimir Yanovsky.

Page 1:

Chapter 10

Scheduling

Presented by Vladimir Yanovsky

Page 2:

The goals

• Scheduling: Mapping of parallelism within the constraints of limited available parallel resources

• In general, we must sacrifice some parallelism to fit a program within the available resources

• Our goal: Minimize the amount of parallelism sacrificed/maximize utilization of the resources

Page 3:

Lecture Outline

– Straight line scheduling

– Trace Scheduling

– Loops: Kernel Scheduling (Software Pipelining)

– Vector unit scheduling

Page 4:

Scheduling - Motivation

• Transistor sizes have shrunk. This can be exploited by:

1. Several processors on the same silicon.

2. Multiple identical execution units.

• The more parallelism the processor allows, the more important scheduling becomes.

Page 5:

Processor Types

• Superscalar

Multiple functional units controlled and scheduled by the hardware.

• VLIW (Very Long Instruction Word)

Scheduled by the compiler

Page 6:

VLIW vs Superscalar

• Compatibility

• Capability of run-time adjustments (branches & cache misses)

• Design simplicity

• Global view of the program

Page 7:

Scheduling – standard approach

• Scheduling in VLIW and Superscalar architectures:

– Receive a sequential stream of instructions

– Reorder this sequential stream to utilize available parallelism

– Reordering must preserve dependences

• Our model for this talk is VLIW

Page 8:

Reuse Constraints

• Need to execute:

a = b + c + d + e

• One possible sequential stream:

add a, b, c
add a, a, d
add a, a, e

• And another:

add r1, b, c
add r2, d, e
add a, r1, r2

Page 9:

Fundamental Problem

• Fundamental conflict in scheduling:

– If the original instruction stream takes into account the available resources, it will create artificial dependences

– If not, then there may not be enough resources to correctly execute the stream

• Which should run first, register allocation or scheduling?

Page 10:

Processor Model

• VLIW type

• Processor contains a number of issue units

• Each issue unit has an associated type and a delay

• Purpose: select a set of instructions for each cycle such that the number of instructions of each type is not greater than the number of execution units of that type

Page 11:

Straight Line Scheduling

• Scheduling a basic block: receives a dependence graph G = (N, E, type, delay)

– N: set of instructions in the code

– E: (n1, n2) ∈ E iff n2 must wait for the completion of n1 due to a dependence

– Each n ∈ N has a type, type(n), and a delay, delay(n).

Page 12:

Straight Line Scheduling

• A correct schedule is a mapping, S, from vertices in the graph to nonnegative integers representing cycle numbers such that:

1. If (n1,n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied

2. Hardware constraints are satisfied.

• The length of a schedule S, denoted L(S), is defined as: L(S) = max_n (S(n) + delay(n))

• Goal of straight-line scheduling: find a shortest possible correct schedule.
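The two conditions above can be checked mechanically. A minimal Python sketch, assuming a dict-based encoding of the schedule, delays, types, and per-type unit counts (the encoding itself is an illustrative assumption, not part of the chapter):

```python
from collections import Counter

def is_correct_schedule(S, edges, delay, type_of, units):
    """Check the two conditions above for a candidate schedule S
    (a dict mapping each instruction to its issue cycle)."""
    # 1. Dependences: S(n1) + delay(n1) <= S(n2) for every edge.
    if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
        return False
    # 2. Hardware: in each cycle, the instructions of each type must
    #    not outnumber the issue units of that type.
    per_cycle = {}
    for n, c in S.items():
        per_cycle.setdefault(c, Counter())[type_of[n]] += 1
    return all(cnt <= units[t]
               for counts in per_cycle.values()
               for t, cnt in counts.items())

def schedule_length(S, delay):
    """L(S) = max_n (S(n) + delay(n))."""
    return max(S[n] + delay[n] for n in S)
```

For example, with one `int` issue unit the schedule `{'a': 0, 'b': 0, 'c': 1}` violates condition 2 (two `int` instructions in cycle 0), while with two units it is correct.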

Page 13:

List Scheduling

• Use a variant of topological sort:

– Maintain a list of instructions that have no unscheduled predecessors in the graph

– Schedule these instructions

– This will allow other instructions to be added to the list

– Repeat until all instructions are scheduled

Page 14:

List Scheduling

• We maintain two arrays:

– count[n]: how many predecessors of instruction n are still to be scheduled

– earliest[n]: the earliest cycle on which instruction n can be scheduled.

• Maintain a number of worklists, one per cycle number, each holding instructions to be scheduled on that cycle. All their predecessors are scheduled.

Page 15:

List Scheduling - Initialization

for each n ∈ N do begin count[n] := 0; earliest[n] := 0 end
for each (n1,n2) ∈ E do begin
  count[n2] := count[n2] + 1;
  successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC – 1 do W[i] := ∅; // MaxC = max(delay) + 1
Wcount := 0; // the number of ready instructions
for each n ∈ N do
  if count[n] = 0 then begin // no dependences
    W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
  end
c := 0;  // c is the cycle number
cW := 0; // cW is the number of the worklist for cycle c
instr[c] := ∅;

Page 16:

List Scheduling Algorithm

while Wcount > 0 do begin
  while W[cW] = ∅ do begin
    c := c + 1; instr[c] := ∅; cW := mod(cW+1, MaxC);
  end
  nextc := mod(c+1, MaxC); // next cycle
  while W[cW] ≠ ∅ do begin
    select and remove an arbitrary instruction x from W[cW];
    if free issue units of type(x) on cycle c then begin
      instr[c] := instr[c] ∪ {x}; Wcount := Wcount – 1;
      for each y ∈ successors[x] do begin
        count[y] := count[y] – 1;
        earliest[y] := max(earliest[y], c + delay(x));
        if count[y] = 0 then begin
          loc := mod(earliest[y], MaxC);
          W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
        end
      end
    end
    else W[nextc] := W[nextc] ∪ {x}; // x could not be scheduled
  end
  for each unused unit insert a stall
end
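The pseudocode above can be rendered as a compact Python sketch. The machine model (a unit-count table per type) and the dict-based graph encoding are illustrative assumptions, and ready instructions are taken in list order rather than by any priority:

```python
from collections import defaultdict

def list_schedule(nodes, edges, delay, type_of, units):
    """Sketch of the list-scheduling loop: greedily fill each cycle
    with ready instructions, respecting per-type issue-unit limits.
    Returns a dict mapping each instruction to its issue cycle."""
    count = {n: 0 for n in nodes}       # unscheduled predecessors
    earliest = {n: 0 for n in nodes}    # earliest legal cycle
    succs = defaultdict(list)
    for n1, n2 in edges:
        count[n2] += 1
        succs[n1].append(n2)

    ready = [n for n in nodes if count[n] == 0]
    schedule, cycle = {}, 0
    while ready:
        free = dict(units)              # issue units free this cycle
        deferred = []
        for x in ready:
            if earliest[x] <= cycle and free.get(type_of[x], 0) > 0:
                free[type_of[x]] -= 1
                schedule[x] = cycle
                for y in succs[x]:      # release successors
                    count[y] -= 1
                    earliest[y] = max(earliest[y], cycle + delay[x])
                    if count[y] == 0:
                        deferred.append(y)
            else:
                deferred.append(x)      # retry on a later cycle
        ready = deferred
        cycle += 1
    return schedule
```

On a three-instruction chain a → b → c with delay(a) = 2 and one `alu` unit, this yields cycles 0, 2, 3, since b must wait out a's delay.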

Priority: the "arbitrary" selection above is where a priority heuristic (e.g. longest remaining critical path first) plugs in.

Page 17:

Finding the critical path

for each n ∈ N do begin count[n] := 0; remaining[n] := delay(n); end
for each (n1,n2) ∈ E do begin
  count[n1] := count[n1] + 1; // count[n]==0 iff nothing depends on n
  predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do if count[n] = 0 then W := W ∪ {n}; // init: W = instructions nothing depends on
while W ≠ ∅ do begin
  select and remove an arbitrary instruction x from W;
  for each y ∈ predecessors[x] do begin
    count[y] := count[y] – 1;
    remaining[y] := max(remaining[y], remaining[x] + delay(y));
    if count[y] = 0 then W := W ∪ {y};
  end
end
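This backward pass can be sketched in Python (the dict-based graph encoding is an illustrative assumption); remaining[n] then serves as the critical-path-first priority for list scheduling:

```python
def remaining_path(nodes, edges, delay):
    """Backward pass: remaining[n] is the length of the longest
    delay-weighted path from n to the end of the block."""
    out_count = {n: 0 for n in nodes}   # 0 iff nothing depends on n
    preds = {n: [] for n in nodes}
    for n1, n2 in edges:
        out_count[n1] += 1
        preds[n2].append(n1)
    remaining = {n: delay[n] for n in nodes}
    work = [n for n in nodes if out_count[n] == 0]
    while work:                         # process leaves first
        x = work.pop()
        for y in preds[x]:
            out_count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if out_count[y] == 0:
                work.append(y)
    return remaining
```

For a chain a → b → c with delays 1, 2, 3 this gives remaining values 6, 5, 3: each instruction's priority includes its own delay plus the longest path below it.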

Page 18:

Problems of list scheduling

• Previous basic block must complete before the next is started.

• Cannot schedule loops.

Page 19:

Trace Scheduling

• Exploit parallelism between several basic blocks.

• Trace: a collection of basic blocks that forms a single path through all or part of the program.

• Operates on a CFG without loops.

Page 20:

Trace Scheduling

[Figure: a trace through the CFG containing "j=j+1", "i=i+2", and a branch "if e1"]

i = i + 2 is moved below the split – fixup code is inserted

Page 21:

Trace Scheduling

• Trace scheduling algorithm:

1. Select a trace based on profiling information.

2. Schedule the trace using a basic block scheduler, adding dependences from the splits/joins to the upstream/downstream instructions respectively.

3. Insert fixup code.

4. Remove the scheduled trace from the CFG.

5. If the CFG is not empty, go to 1.

Page 22:

Trace & line scheduling - conclusions

1. Problem with line & trace scheduling – loops cannot be scheduled effectively. Must unroll loops to have more "meat" to work with.

2. Trace scheduling increases code size by inserting fixup code; this may lead to an exponential code-size increase.

3. Need up-to-date memory dependence information to do anything about moving memory accesses.

Page 23:

Kernel Scheduling

• Moves instructions not only in space but also in time – across iterations.

• Allows better exploitation of parallelism between loop iterations.

Page 24:

Kernel Scheduling problem

• A kernel scheduling problem is a graph: G = (N, E, delay, type, cross), where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2.

Page 25:

Software Pipelining

• Example:

ld r1,0
ld r2,400
fld fr1, c
l0 fld fr2,a(r1)
l1 fadd fr2,fr2,fr1
l2 fst fr2,b(r1)
l3 ai r1,r1,8
l4 comp r1,r2
l5 ble l0

• A legal schedule (kernel, one line per cycle; units: Load/Store, Integer, Floating Pt.):

l0: fld fr2,a(r1) ; ai r1,r1,8
    comp r1,r2
    fst fr3,b-16(r1) ; ble l0 ; fadd fr3,fr2,fr1

[Figure: dependence graph fld → fadd → fst, with edge delays 2 and 3]

Page 26:

Software Pipelining

ld r1,0
ld r2,400
fld fr1, c
l0 fld fr2,a(r1)
l1 fadd fr2,fr2,fr1
l2 fst fr2,b(r1)
l3 ai r1,r1,8
l4 comp r1,r2
l5 ble l0

S[l0] = 0; I[l0] = 0;
S[l1] = 2; I[l1] = 0;
S[l2] = 2; I[l2] = 1;
S[l3] = 0; I[l3] = 0;
S[l4] = 1; I[l4] = 0;
S[l5] = 2; I[l5] = 0;

The resulting kernel (one line per cycle; units: Load/Store, Integer, Floating Pt.):

l0: fld fr2,a(r1) ; ai r1,r1,8
    comp r1,r2
    fst fr3,b-16(r1) ; ble l0 ; fadd fr3,fr2,fr1

Page 27:

Software Pipelining

• Have to generate an epilog and a prolog to ensure correctness

• Prolog:

ld r1,0
ld r2,400
fld fr1, c
p1 fld fr2,a(r1); ai r1,r1,8
p2 comp r1,r2
p3 beq e1; fadd fr3,fr2,fr1

• Epilog:

e1 nop

e2 nop

e3 fst fr3,b-8(r1)

Page 28:

Kernel Scheduling

• A solution to the kernel scheduling problem is a pair of tables (S, I), where:

– the schedule S maps each instruction n to a cycle within the kernel

– the iteration I maps each instruction to an iteration offset from zero, such that:

S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)

for each edge (n1,n2) in E, where Lk(S) = max_n (S(n)) is the length of the kernel for S.

• Another name for the kernel's length is II – the initiation interval
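The per-edge constraint can be checked mechanically. A small sketch, assuming dict-encoded S/I tables and a cross table keyed by edge:

```python
def kernel_ok(S, I, edges, delay, cross, II):
    """Check the kernel-scheduling constraint for every edge:
    S[n1] + delay(n1) <= S[n2] + (I[n2] - I[n1] + cross(n1,n2)) * II."""
    return all(
        S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross[(n1, n2)]) * II
        for n1, n2 in edges)
```

For example, with delay(a) = 2, cross = 0, and II = 2, placing b in the next iteration (I[b] = 1) satisfies the constraint even though both share cycle 0; keeping b in the same iteration would violate it.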

Page 29:

Kernel scheduling - intuition

• S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)

• Instructions with I[n] = 0 are running in the “current” iteration.

• If I[n]>0 this means that the instruction is delayed by I[n] iterations.

• Even if n1 has large delay, n2 can be moved to a later iteration instead of forcing it to be scheduled in the cycle S[n1] + delay(n1)

Page 30:

Resource Constraints

• Resource usage constraint:

– No recurrence in the loop

– #t: number of instructions in each iteration that must issue in a unit of type t

– m_t: number of issue units of type t

Lk(S) ≥ max_t ⌈ #t / m_t ⌉

• We can always find a schedule S such that Lk(S) = max_t ⌈ #t / m_t ⌉
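Given per-type instruction counts and unit counts, this resource bound is a one-liner (the dict encoding is an illustrative assumption):

```python
import math

def resource_ii(counts, units):
    """Resource-constrained lower bound on the kernel length:
    max over unit types t of ceil(#t / m_t), where counts[t] is the
    number of type-t instructions per iteration and units[t] the
    number of type-t issue units."""
    return max(math.ceil(counts[t] / units[t]) for t in counts)
```

With 2 memory and 3 integer instructions on 2 memory and 3 integer units (the example on a later slide), the bound is II = 1; 3 floating-point instructions on 2 units would force II = 2.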

Page 31:

Kernel Scheduling

for each instruction x in G in topological order do begin
  earlyS := 0; earlyI := 0;
  for each predecessor y of x in G do begin
    thisS := S[y] + delay(y); thisI := I[y];
    if thisS ≥ L then begin
      thisI := thisI + ⌊thisS/L⌋; thisS := mod(thisS, L);
    end
    if thisI > earlyI or ((thisI = earlyI) and (thisS > earlyS)) then begin
      earlyI := thisI; earlyS := thisS;
    end
  end
  starting at cycle earlyS, find the first cycle c0 where the resource
  needed by x is available, wrapping to the beginning of the kernel if necessary;
  S[x] := c0;
  if c0 < earlyS then I[x] := earlyI + 1 else I[x] := earlyI; // wrapped over the kernel
end
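A hedged Python rendering of this scheduler for loop bodies without recurrences; the encoding (a topological order list, a predecessor dict, a per-type unit table) is an assumption, and `None` signals that the given II has no free slot:

```python
from collections import defaultdict

def kernel_schedule(order, preds, delay, type_of, units, II):
    """Sketch of the simple kernel scheduler: place each instruction
    at the earliest (iteration, cycle) position allowed by its
    predecessors, wrapping around a kernel of length II."""
    S, I = {}, {}
    busy = [defaultdict(int) for _ in range(II)]  # units used per cycle
    for x in order:
        earlyS, earlyI = 0, 0
        for y in preds[x]:
            thisS, thisI = S[y] + delay[y], I[y]
            if thisS >= II:                  # wrap into later iterations
                thisI += thisS // II
                thisS %= II
            if (thisI, thisS) > (earlyI, earlyS):
                earlyI, earlyS = thisI, thisS
        for k in range(II):                  # first cycle with a free
            c = (earlyS + k) % II            # unit, wrapping around
            if busy[c][type_of[x]] < units[type_of[x]]:
                break
        else:
            return None                      # no slot: II is too small
        busy[c][type_of[x]] += 1
        S[x] = c
        I[x] = earlyI + 1 if c < earlyS else earlyI  # wrapped over kernel
    return S, I
```

On the five-instruction example of the next slide (2 memory units, 3 integer units, unit delays, II = 1) this reproduces S = 0 for every instruction and I = 0, 1, 2, 3, 4.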

Page 32:

Software Pipelining Example

l0 ld a,x(i)

l1 ai a,a,1

l2 ai a,a,1

l3 ai a,a,1

l4 st a,x(i)

Units: Memory1, Integer1, Integer2, Integer3, Memory2

l0: S=0; I=0   l1: S=0; I=1   l2: S=0; I=2   l3: S=0; I=3   l4: S=0; I=4

• 2 memory units, 3 integer units.

• II = 1 is enough; each successive instruction is pushed to the next iteration.

Page 33:

Register Pressure

• l0 ld a0,x(i)
• l1 ai a1,a0,1
• l2 ai a2,a1,1
• l3 ai a3,a2,1
• l4 st a3,x(i)

1. The same register a cannot be used in 4 different iterations running simultaneously.

2. Need to store the register's value for each of the overlapping iterations and rename the registers cyclically after each iteration.

3. Issue 2 can be solved by unrolling with renaming, though this increases code size.
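Point 3 can be illustrated with a toy renamer. This token-based rewrite is purely illustrative (it gives each unrolled copy its own register names; it does not insert the moves needed when values flow between copies):

```python
import re

def unroll_with_renaming(body, regs, factor):
    """Unroll the loop body `factor` times, giving copy k its own
    version of every register in `regs` (r -> r0, r1, ...), so
    overlapped iterations no longer clash on register names."""
    pattern = re.compile(r'\b(' + '|'.join(regs) + r')\b')
    code = []
    for k in range(factor):
        for instr in body:
            # \b keeps whole-token matches only: 'a' in opcode 'ai'
            # is untouched, the operand 'a' is renamed.
            code.append(pattern.sub(lambda m: f"{m.group(1)}{k}", instr))
    return code
```

Unrolling `["ld a,x(i)", "ai a,a,1", "st a,x(i)"]` twice yields a copy using a0 followed by a copy using a1, at the cost of doubling the code.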

Page 34:

Prolog & Epilog

[Figure: five overlapped iterations, each running stages A–E, starting one II apart. The early copies form the prologue (fill pipeline), the steady state is the kernel, and the late copies form the epilogue (drain pipeline).]

1. The current iteration when entering the kernel is 5.

2. I(Stage A) = 0, i.e. Stage A executes in its original iteration.

3. I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration.

4. Prolog: StageA1; StageB1, StageA2; StageC1, StageB2, StageA3 …

Page 35:

Prolog & Epilog generation

• Prolog:

for k = 0 to max_n(I(n)) – 1: lay out the kernel, replacing every n s.t. I(n) > k by a NO-OP

• Epilog:

for k = 1 to max_n(I(n)): lay out the kernel, replacing every n s.t. I(n) < k by a NO-OP

• Compact both using list scheduling.
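The two generation rules can be sketched directly; the (instruction, name) kernel encoding is an assumption, and the list-scheduling compaction step is omitted:

```python
def prolog_epilog(kernel, I, max_stage):
    """Prolog: lay the kernel out max_stage times, blanking stages
    that have not started yet (I(n) > k). Epilog: blank stages that
    have already finished (I(n) < k). `kernel` is a list of
    (instruction_text, n) pairs; I maps n to its iteration offset."""
    prolog = [[ins if I[n] <= k else "nop" for ins, n in kernel]
              for k in range(max_stage)]
    epilog = [[ins if I[n] >= k else "nop" for ins, n in kernel]
              for k in range(1, max_stage + 1)]
    return prolog, epilog
```

With a two-stage kernel [A, B] where I(A) = 0 and I(B) = 1, the prolog is one copy running only A, and the epilog is one copy running only B, matching the fill/drain picture on the previous slide.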

Page 36:

Recurrences

• Given a recurrence (n1, n2, …, nk), with n_{k+1} taken to be n1:

Lk(S) ≥ ⌈ (Σ_{i=1..k} delay(n_i)) / (Σ_{i=1..k} cross(n_i, n_{i+1})) ⌉

– The right-hand side is called the slope of the recurrence. The numerator is the number of cycles it takes to complete all the computations of the recurrence; the denominator is the number of iterations available to do this.

– Lk(S) ≥ MAX_c slope(c), over all recurrences c.
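The slope of a single recurrence can be computed as below (dict encoding assumed; the recurrence's edges wrap from n_k back to n_1):

```python
import math

def recurrence_slope(cycle, delay, cross):
    """Slope bound: total delay around the recurrence divided by the
    number of iterations the recurrence's dependences span."""
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))  # wrap nk -> n1
    return math.ceil(sum(delay[n] for n in cycle) /
                     sum(cross[e] for e in edges))
```

For a two-node recurrence with delays 2 and 1 whose back edge crosses one iteration, the slope is ⌈3/1⌉ = 3; if both edges crossed an iteration it would drop to ⌈3/2⌉ = 2.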

Page 37:

Kernel Scheduling – General Case

1. Compute MII to be the maximum of resource constraint and the maximum slope.

2. II=MII

3. Remove an edge from every recurrence.

4. Schedule(II) using the simple kernel scheduling algorithm.

5. If it failed (the dependence of some removed edge is violated), increase II and go to 4.

Page 38:

Kernel Scheduling - Conclusions

• Handling control flow is difficult. May use hardware support for predicated execution, or handle the "control flow regions" as black boxes.

• Increased register pressure may limit applicability to single-basic-block inner loops anyway.

• Benefits from unrolling with renaming.

Page 39:

Vector Unit Scheduling

• A vector instruction involves the execution of many scalar instructions

• Much of the benefit from the pipelining is already achieved

• Still, something can be done

Page 40:

Chaining

• Chaining example:

vload t1, a
vload t2, b
vadd t3, t1, t2
vstore t3, c

• Two load units

• Each operation takes 64 cycles

• 192 cycles without chaining

• 66 cycles with chaining

• Proximity between instructions is required for the hardware to identify opportunities for chaining
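The cycle counts above follow from simple arithmetic, assuming 64-element vectors, one element per cycle, two load units, and a one-cycle forwarding delay per chained operation:

```python
VLEN = 64  # elements per vector register (slide's assumption)

# Without chaining, each dependent operation waits for the previous
# one to finish: the two loads run in parallel on the two load units,
# then the vadd, then the vstore.
unchained = VLEN + VLEN + VLEN
assert unchained == 192

# With chaining, the vadd and the vstore each start as soon as the
# first element of their input arrives, one cycle into the producer.
chained = VLEN + 1 + 1
assert chained == 66
```

The 192 vs 66 cycle figures on this slide fall out directly: chaining hides almost the entire latency of the dependent operations.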

Page 41:

Chaining – rearranging

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3

• Rearranged:

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3

• Assume 2 load pipes, 1 addition pipe, and 1 multiplication pipe.

Page 42:

Instruction fusion

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3

Page 43:

Instruction fusion – cont.

vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3

After Fusion

Page 44:

The End!