Chapter 10 Scheduling Presented by Vladimir Yanovsky.
Chapter 10
Scheduling
Presented by Vladimir Yanovsky
The goals
• Scheduling: Mapping of parallelism within the constraints of limited available parallel resources
• In general, we must sacrifice some parallelism to fit a program within the available resources
• Our goal: Minimize the amount of parallelism sacrificed/maximize utilization of the resources
Lecture Outline
– Straight line scheduling
– Trace Scheduling
– Loops: Kernel Scheduling (Software Pipelining)
– Vector unit scheduling
Scheduling - Motivation
• Transistor sizes have shrunk. This can be exploited by:
1. Several processors on the same silicon.
2. Multiple identical execution units.
• The more parallelism the processor allows, the more important scheduling is.
Processor Types
• Superscalar
Multiple functional units controlled and scheduled by the hardware.
• VLIW (Very Large Instruction Word)
Scheduled by the compiler
VLIW vs Superscalar
• Compatibility
• Capability of run-time adjustments (branches & cache misses)
• Design simplicity
• Global view of the program
Scheduling – standard approach
• Scheduling in VLIW and Superscalar architectures:
– Receive a sequential stream of instructions
– Reorder this sequential stream to utilize available parallelism
– Reordering must preserve dependences
• Our model for this talk is VLIW
Reuse Constraints
• Need to execute:
a = b + c + d + e
• One possible sequential stream:
add a, b, c
add a, a, d
add a, a, e
• And another:
add r1, b, c
add r2, d, e
add a, r1, r2
Fundamental Problem
• Fundamental conflict in scheduling:
– If the original instruction stream takes available resources into account, it will create artificial dependences
– If not, there may not be enough resources to correctly execute the stream
• Which should come first, register allocation or scheduling?
Processor Model
• VLIW type
• Processor contains a number of issue units
• Each issue unit has an associated type and a delay
• Purpose: to select a set of instructions for each cycle such that the number of instructions of each type is not greater than the number of execution units of that type
Straight Line Scheduling
• Scheduling a basic block: receives a dependence graph G = (N, E, type, delay)
– N: set of instructions in the code
– E: (n1, n2) ∈ E iff n2 must wait for the completion of n1 due to a dependence
– Each n ∈ N has a type, type(n), and a delay, delay(n).
Straight Line Scheduling
• A correct schedule is a mapping, S, from vertices in the graph to nonnegative integers representing cycle numbers such that:
1. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied
2. Hardware constraints are satisfied.
• The length of a schedule S, denoted L(S), is defined as:
L(S) = max_n (S(n) + delay(n))
• Goal of straight-line scheduling: Find a shortest possible correct schedule.
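The two correctness conditions and the length L(S) can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. It assumes that an issue unit is occupied only in the cycle in which an instruction issues, and the three-add block (the `add1`, `add2`, `add3` names and unit delays) is a hypothetical example based on a = (b + c) + (d + e).

```python
from collections import Counter

def is_correct(S, edges, delay, typ, units):
    """Check both conditions for a straight-line schedule S (instruction -> cycle)."""
    # 1. Dependences: n2 may not issue before n1 has completed.
    deps_ok = all(S[n1] + delay[n1] <= S[n2] for n1, n2 in edges)
    # 2. Hardware: in each cycle, no more issues of a type than units of that type
    #    (assumption: a unit is busy only in the issue cycle).
    issued = Counter((S[n], typ[n]) for n in S)
    hw_ok = all(k <= units[t] for (_, t), k in issued.items())
    return deps_ok and hw_ok

def length(S, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(S[n] + delay[n] for n in S)

# Hypothetical block: add1 and add2 feed add3, every delay is 1.
delay = {"add1": 1, "add2": 1, "add3": 1}
typ = {n: "int" for n in delay}
edges = [("add1", "add3"), ("add2", "add3")]
S = {"add1": 0, "add2": 0, "add3": 1}

print(is_correct(S, edges, delay, typ, {"int": 2}), length(S, delay))
```

With a single integer unit the same mapping is rejected, because `add1` and `add2` would both issue in cycle 0.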
List Scheduling
• Use a variant of topological sort:
– Maintain a list of instructions which have no unscheduled predecessors in the graph
– Schedule these instructions
– This will allow other instructions to be added to the list
– Repeat until all instructions are scheduled
List Scheduling
• We maintain two arrays:
– count determines for each instruction how many predecessors are still to be scheduled
– earliest maintains the earliest cycle on which the instruction can be scheduled
• Maintain a number of worklists that hold the instructions ready to be scheduled in a particular cycle. All of their predecessors have already been scheduled.
List Scheduling - Initialization
for each n ∈ N do begin count[n] := 0; earliest[n] := 0 end
for each (n1, n2) ∈ E do begin
  count[n2] := count[n2] + 1;
  successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC - 1 do W[i] := ∅; // MaxC = max(delay) + 1
Wcount := 0; // the number of ready instructions
for each n ∈ N do
  if count[n] = 0 then begin // no dependences
    W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
  end
c := 0; // c is the cycle number
cW := 0; // cW is the number of the worklist for cycle c
instr[c] := ∅;
List Scheduling Algorithm
while Wcount > 0 do begin
  while W[cW] = ∅ do begin
    c := c + 1; instr[c] := ∅; cW := mod(cW+1, MaxC);
  end
  nextc := mod(c+1, MaxC); // next cycle
  while W[cW] ≠ ∅ do begin
    select and remove an arbitrary instruction x from W[cW];
    if free issue units of type(x) on cycle c then begin
      instr[c] := instr[c] ∪ {x}; Wcount := Wcount - 1;
      for each y ∈ successors[x] do begin
        count[y] := count[y] - 1;
        earliest[y] := max(earliest[y], c+delay(x));
        if count[y] = 0 then begin
          loc := mod(earliest[y], MaxC);
          W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
        end
      end
    end
    else W[nextc] := W[nextc] ∪ {x}; // x could not be scheduled
  end
  for each unused unit insert stall
end
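A compact way to experiment with list scheduling is sketched below in Python. It is a simplification rather than a transcription of the slides: instead of the cyclic worklists W[0..MaxC-1] it keeps one `ready` list and tests `earliest` on every cycle, and it assumes at least one unit exists for every instruction type. The function name `list_schedule` and the sample block are hypothetical.

```python
from collections import defaultdict

def list_schedule(nodes, edges, delay, typ, units):
    """Return {instruction: issue cycle} for a dependence DAG."""
    count = {n: 0 for n in nodes}        # unscheduled predecessors
    earliest = {n: 0 for n in nodes}     # earliest legal issue cycle
    succs = defaultdict(list)
    for n1, n2 in edges:
        count[n2] += 1
        succs[n1].append(n2)

    ready = [n for n in nodes if count[n] == 0]
    S, c = {}, 0
    while ready:
        free = dict(units)               # issue units still free in cycle c
        for x in list(ready):            # snapshot: newly ready nodes wait a cycle
            if earliest[x] <= c and free[typ[x]] > 0:
                free[typ[x]] -= 1
                S[x] = c
                ready.remove(x)
                for y in succs[x]:
                    count[y] -= 1
                    earliest[y] = max(earliest[y], c + delay[x])
                    if count[y] == 0:
                        ready.append(y)
        c += 1
    return S

# Hypothetical block: add1 and add2 feed add3, every delay is 1.
nodes = ["add1", "add2", "add3"]
edges = [("add1", "add3"), ("add2", "add3")]
delay = {n: 1 for n in nodes}
typ = {n: "int" for n in nodes}
print(list_schedule(nodes, edges, delay, typ, {"int": 2}))
```

The order in which `ready` is scanned stands in for the "arbitrary instruction" choice; a priority heuristic would sort that list first.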
Priority
Finding the critical path
for each n ∈ N do begin count[n] := 0; remaining[n] := delay(n); end
for each (n1, n2) ∈ E do begin
  count[n1] := count[n1] + 1; // count[n] = 0 iff nothing depends on n
  predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do if count[n] = 0 then W := W ∪ {n}; // init: W holds the instructions that nothing depends on
while W ≠ ∅ do begin
  select and remove an arbitrary instruction x from W;
  for each y ∈ predecessors[x] do begin
    count[y] := count[y] - 1;
    remaining[y] := max(remaining[y], remaining[x] + delay(y));
    if count[y] = 0 then W := W ∪ {y};
  end
end
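The backward pass that computes the remaining critical-path length of every instruction can be sketched in Python as follows. This is an illustrative version only; the function name `remaining_path`, the sample chain and its delays (2, 3, 1, 1) are assumptions made for the example.

```python
from collections import defaultdict

def remaining_path(nodes, edges, delay):
    """remaining[n] = longest delay-weighted path from n to the end of the block."""
    count = {n: 0 for n in nodes}        # successors not yet processed
    preds = defaultdict(list)
    for n1, n2 in edges:
        count[n1] += 1
        preds[n2].append(n1)
    remaining = {n: delay[n] for n in nodes}
    work = [n for n in nodes if count[n] == 0]   # nothing depends on these
    while work:
        x = work.pop()
        for y in preds[x]:
            count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if count[y] == 0:
                work.append(y)
    return remaining

# Hypothetical block: fld -> fadd -> fst plus an independent ai.
nodes = ["fld", "fadd", "fst", "ai"]
edges = [("fld", "fadd"), ("fadd", "fst")]
delay = {"fld": 2, "fadd": 3, "fst": 1, "ai": 1}
print(remaining_path(nodes, edges, delay))
```

Using `remaining` as the priority means that, among ready instructions, the one heading the longest chain is issued first.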
Problems of list scheduling
• Previous basic block must complete before the next is started.
• Cannot schedule loops.
Trace Scheduling
• Exploit parallelism between several basic blocks.
• Trace: a collection of basic blocks that form a single path through all or part of the program.
• CFG without loops
Trace Scheduling
[Figure: CFG fragment containing j = j + 1, i = i + 2 and the split "if e1"]
i = i + 2 is moved below the split, which requires inserting fixup code
Trace Scheduling
• Trace scheduling algorithm:
1. Select a trace based on profiling information.
2. Schedule the trace using the basic block scheduler, adding dependences from the splits/joins to the upstream/downstream instructions respectively.
3. Insert fixup code.
4. Remove the scheduled trace from the CFG.
5. If the CFG is not empty, go to 1.
Trace & line scheduling - conclusions
1. Problem with line and trace scheduling: loops cannot be scheduled effectively. Loops must be unrolled to give the scheduler more “meat” to work with.
2. Trace scheduling increases code size by inserting fixup code, which may lead to exponential code growth.
3. Up-to-date memory dependence information is needed before memory accesses can be moved at all.
Kernel Scheduling
• Moves instructions not only in space but also in time – across iterations.
• Allows better exploitation of parallelism between loop iterations.
Kernel Scheduling problem
• A kernel scheduling problem is a graph:
G = (N, E, delay, type, cross)
where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2
Software Pipelining
• Example:
      ld   r1,0
      ld   r2,400
      fld  fr1,c
l0    fld  fr2,a(r1)
l1    fadd fr2,fr2,fr1
l2    fst  fr2,b(r1)
l3    ai   r1,r1,8
l4    comp r1,r2
l5    ble  l0
• A legal schedule:
      Load/Store          Integer        Floating Pt.
l0:   fld fr2,a(r1)       ai r1,r1,8
                          comp r1,r2
      fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1
[Figure: dependence chain fld, fadd, fst with edge labels 2 and 3]
Software Pipelining
      ld   r1,0
      ld   r2,400
      fld  fr1,c
      Instruction          S    I
l0    fld  fr2,a(r1)       0    0
l1    fadd fr2,fr2,fr1     2    0
l2    fst  fr2,b(r1)       2    1
l3    ai   r1,r1,8         0    0
l4    comp r1,r2           1    0
l5    ble  l0              2    0
      Load/Store          Integer        Floating Pt.
l0:   fld fr2,a(r1)       ai r1,r1,8
                          comp r1,r2
      fst fr3,b-16(r1)    ble l0         fadd fr3,fr2,fr1
Software Pipelining
• Have to generate epilog and prolog to ensure correctness
• Prolog:
      ld   r1,0
      ld   r2,400
      fld  fr1,c
p1    fld  fr2,a(r1); ai r1,r1,8
p2    comp r1,r2
p3    beq  e1; fadd fr3,fr2,fr1
• Epilog:
e1 nop
e2 nop
e3 fst fr3,b-8(r1)
Kernel Scheduling
• A solution to the kernel scheduling problem is a pair of tables (S, I), where:
– the schedule S maps each instruction n to a cycle within the kernel
– the iteration I maps each instruction to an iteration offset from zero,
such that:
S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)
for each edge (n1, n2) in E, where Lk(S) is the length of the kernel for S. With kernel cycles numbered from 0, as in the examples here, Lk(S) = max_n (S[n]) + 1.
• Another name for the kernel’s length is II, the initiation interval
Kernel scheduling - intuition
• S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)
• Instructions with I[n] = 0 are running in the “current” iteration.
• If I[n] > 0, the instruction is delayed by I[n] iterations.
• Even if n1 has a large delay, n2 can be moved to a later iteration instead of being forced into cycle S[n1] + delay(n1)
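The constraint can be tested directly on the fld/fadd/fst chain of the software pipelining example. The Python sketch below is illustrative only: the delays of 2 for fld and 3 for fadd are assumptions read off the figure labels, the delay of 1 for fst is made up, and only the two intra-iteration edges (cross = 0) are modeled.

```python
def kernel_ok(S, I, edges, delay, L):
    """edges: (n1, n2, cross). True if every dependence fits a kernel of length L."""
    return all(
        S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross) * L
        for n1, n2, cross in edges
    )

# S and I for l0 (fld), l1 (fadd), l2 (fst) as given in the example tables.
S = {"l0": 0, "l1": 2, "l2": 2}
I = {"l0": 0, "l1": 0, "l2": 1}
delay = {"l0": 2, "l1": 3, "l2": 1}          # assumed delays
edges = [("l0", "l1", 0), ("l1", "l2", 0)]   # fld -> fadd -> fst, same iteration

print(kernel_ok(S, I, edges, delay, L=3))    # three-cycle kernel
print(kernel_ok(S, I, edges, delay, L=2))    # too short for fadd -> fst
```

Under these assumptions the fadd result is ready at cycle 5, which is exactly cycle 2 of the following kernel pass, so the fst with I = 1 is legal for L = 3 but not for L = 2.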
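Resource Constraints
• Resource usage constraint:
– No recurrence in the loop
– #t: number of instructions in each iteration that must issue in a unit of type t
– m_t: number of units of type t
$$L_k(S) \ge \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$$
• We can always find a schedule S such that
$$L_k(S) = \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$$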
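The resource bound is simple to compute. The Python sketch below is an illustration; the function name `resource_bound` is hypothetical, and the unit assignment for the six-instruction loop (ble issuing on the integer unit, one unit of each type) is read off the schedule table of the earlier example.

```python
from collections import Counter

def resource_bound(typ, units):
    """max over unit types t of ceil(#t / m_t)."""
    uses = Counter(typ.values())                      # #t for every type t
    return max((uses[t] + units[t] - 1) // units[t]   # integer ceiling
               for t in uses)

# Loop body l0..l5 of the software pipelining example.
typ = {"fld": "mem", "fadd": "fp", "fst": "mem",
       "ai": "int", "comp": "int", "ble": "int"}
print(resource_bound(typ, {"mem": 1, "int": 1, "fp": 1}))
```

Three integer instructions on one integer unit give a bound of 3, which matches the three-cycle kernel shown for that loop.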
Kernel Scheduling
for each instruction x in G in topological order do begin
  earlyS := 0; earlyI := 0;
  for each predecessor y of x in G do begin
    thisS := S[y] + delay(y); thisI := I[y];
    if thisS ≥ L then begin
      thisI := thisI + ⌊thisS/L⌋; thisS := mod(thisS, L);
    end
    if thisI > earlyI or ((thisI = earlyI) && (thisS > earlyS)) then begin
      earlyI := thisI; earlyS := thisS;
    end
  end
  starting at cycle earlyS, find the first cycle c0 where the resource needed by x is available, wrapping to the beginning of the kernel if necessary;
  S[x] := c0;
  if c0 < earlyS then I[x] := earlyI + 1 else I[x] := earlyI; // wrapped over kernel
end
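A Python sketch of this recurrence-free kernel scheduler follows. It is an illustration under stated assumptions, not the presenter's implementation: the name `kernel_schedule`, the `busy` table and the unit delay of every instruction in the sample loop (a load, three increments, a store) are all hypothetical.

```python
def kernel_schedule(order, preds, delay, typ, units, L):
    """order: instructions in topological order; preds[x]: predecessors of x."""
    S, I = {}, {}
    busy = [dict.fromkeys(units, 0) for _ in range(L)]   # units used per kernel cycle
    for x in order:
        early_s, early_i = 0, 0
        for y in preds.get(x, []):
            this_s, this_i = S[y] + delay[y], I[y]
            this_i += this_s // L        # whole kernels crossed while waiting
            this_s %= L
            if (this_i, this_s) > (early_i, early_s):
                early_i, early_s = this_i, this_s
        # First cycle at or after early_s (wrapping around) with a free unit.
        for k in range(L):
            c0 = (early_s + k) % L
            if busy[c0][typ[x]] < units[typ[x]]:
                break
        else:
            raise ValueError("no free unit: L is below the resource bound")
        busy[c0][typ[x]] += 1
        S[x] = c0
        I[x] = early_i + 1 if c0 < early_s else early_i  # wrapped over the kernel
    return S, I

# Hypothetical loop: ld, three dependent increments, st; every delay assumed to be 1.
order = ["l0", "l1", "l2", "l3", "l4"]
preds = {"l1": ["l0"], "l2": ["l1"], "l3": ["l2"], "l4": ["l3"]}
delay = {n: 1 for n in order}
typ = {"l0": "mem", "l1": "int", "l2": "int", "l3": "int", "l4": "mem"}
S, I = kernel_schedule(order, preds, delay, typ, {"mem": 2, "int": 3}, L=1)
print(S, I)
```

With two memory units and three integer units every instruction lands in cycle 0 of a one-cycle kernel, and each one is pushed one iteration later than its predecessor.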
Software Pipelining Example
l0 ld a,x(i)
l1 ai a,a,1
l2 ai a,a,1
l3 ai a,a,1
l4 st a,x(i)
• 2 memory units, 3 integer units.
Unit:          Memory1   Integer1   Integer2   Integer3   Memory2
Instruction:   l0        l1         l2         l3         l4
S:             0         0          0          0          0
I:             0         1          2          3          4
• II = 1 is enough. Each successive instruction is pushed to the next iteration.
Register Pressure
• l0 ld a0,x(i)
• l1 ai a1,a0,1
• l2 ai a2,a1,1
• l3 ai a3,a2,1
• l4 st a3,x(i)
1. The same register a cannot be used in 4 different iterations running simultaneously.
2. Need to keep a separate copy of the register’s value for each overlapping iteration and rename the copies cyclically after each iteration.
3. Issue 2 can be solved by unrolling with renaming, though this will increase code size.
Prolog & Epilog
[Figure: code layout of overlapping loop iterations (iter 1, iter 2, …, iter x-1, iter x), each made of Stage A to Stage E and started II cycles after the previous one]
Block       What's Happening
Prologue    Fill pipeline
Kernel      Steady state
Epilogue    Drain pipeline
1. Current iteration when entering the kernel is 5.
2. I(Stage A) = 0, that is, Stage A executes in the iteration it originally belongs to.
3. I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration.
4. Prolog: StageA1; StageB1, StageA2; StageC1, StageB2, StageA3; …
Prolog & Epilog generation
• Prolog:
for k = 0 to max_n(I(n)) - 1, lay out the kernel replacing all n s.t. I(n) > k by NO-OP
• Epilog:
for k = 1 to max_n(I(n)), lay out the kernel replacing all n s.t. I(n) < k by NO-OP
• Compact both using list scheduling.
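The two layout rules can be sketched in a few lines of Python. This is an illustration only: the function name `prolog_epilog` is hypothetical, the kernel is represented as a list of cycles holding instruction labels, and neither the retargeting of the loop branch nor the final compaction step is modeled.

```python
def prolog_epilog(kernel, I):
    """kernel: list of cycles, each a list of instruction labels."""
    max_i = max(I.values())

    def copy(keep):
        # One copy of the kernel with the filtered-out instructions replaced by nop.
        return [[n if keep(n) else "nop" for n in cycle] for cycle in kernel]

    prolog = [copy(lambda n, k=k: I[n] <= k) for k in range(max_i)]
    epilog = [copy(lambda n, k=k: I[n] >= k) for k in range(1, max_i + 1)]
    return prolog, epilog

# Kernel of the fld/fadd/fst example: only the store (l2) has I = 1.
kernel = [["l0", "l3"], ["l4"], ["l2", "l5", "l1"]]
I = {"l0": 0, "l1": 0, "l2": 1, "l3": 0, "l4": 0, "l5": 0}
prolog, epilog = prolog_epilog(kernel, I)
print(prolog)
print(epilog)
```

For this kernel the prolog is one kernel copy without the store and the epilog is one copy containing only the store, which agrees with the shape of the prolog and epilog listed for that example.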
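Recurrences
• Given a recurrence (n1, n2, …, nk):
$$L_k(S) \ge \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}, \qquad n_{k+1} = n_1$$
– The right-hand side is called the slope of the recurrence. The numerator is the number of cycles it takes to complete all the computations of the recurrence; the denominator is the number of iterations available to do this.
– Taking the maximum over all recurrences c:
$$L_k(S) \ge \max_c \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}$$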
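The slope of a recurrence and the resulting lower bound on the kernel length can be computed as in the Python sketch below. The function names and both sample recurrences (instruction names, delays and cross values) are hypothetical, and the bound is rounded up because a kernel length is a whole number of cycles.

```python
from fractions import Fraction
from math import ceil

def slope(cycle, delay, cross):
    """cycle: [n1, ..., nk]; cross[(ni, nj)]: iterations crossed by edge ni -> nj."""
    k = len(cycle)
    total_delay = sum(delay[n] for n in cycle)
    # The last node wraps around to the first one, closing the recurrence.
    total_cross = sum(cross[(cycle[i], cycle[(i + 1) % k])] for i in range(k))
    return Fraction(total_delay, total_cross)

def recurrence_bound(cycles, delay, cross):
    """Smallest integer kernel length allowed by every recurrence."""
    return max(ceil(slope(c, delay, cross)) for c in cycles)

delay = {"fld": 2, "fadd": 3, "fst": 1, "x": 2, "y": 3}
cross = {("fld", "fadd"): 0, ("fadd", "fst"): 0, ("fst", "fld"): 1,
         ("x", "y"): 0, ("y", "x"): 2}
print(slope(["fld", "fadd", "fst"], delay, cross))   # 6 cycles per iteration
print(slope(["x", "y"], delay, cross))               # 5/2 cycles per iteration
print(recurrence_bound([["fld", "fadd", "fst"], ["x", "y"]], delay, cross))
```

A recurrence always crosses at least one iteration, so the denominator is positive for well-formed input.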
Kernel Scheduling – General Case
1. Compute MII as the maximum of the resource constraint bound and the maximum recurrence slope.
2. II = MII
3. Remove an edge from every recurrence.
4. Schedule(II) using the simple kernel scheduling algorithm.
5. If this fails (the dependence of any removed edge is violated), increase II and go to 4.
Kernel Scheduling - Conclusions
• Handling control flow is difficult. May use hardware support for predicated execution or handling the “control flow regions” as black boxes.
• Increased register pressure may limit its use to single-basic-block inner loops anyway.
• Benefits from unrolling with renaming.
Vector Unit Scheduling
• A vector instruction involves the execution of many scalar instructions
• Much of the benefit from the pipelining is already achieved
• Still, something can be done
Chaining
• Chaining:
vload t1, a
vload t2, b
vadd t3, t1, t2
vstore t3, c
• Two load units
• Each operation takes 64 cycles
• 192 cycles without chaining
• 66 cycles with chaining
• Instructions must be close to each other for the hardware to identify opportunities for chaining
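The two cycle counts can be reproduced with a small Python calculation. It is a sketch under assumptions that the slides do not state explicitly: the two loads run in parallel on the two load units and so count as one step, and each chained step starts one cycle after the step that feeds it. The function names are hypothetical.

```python
def cycles_unchained(steps, vector_len):
    # Each step must finish all its elements before the next one starts.
    return steps * vector_len

def cycles_chained(steps, vector_len):
    # Each later step starts one cycle after its producer (assumed startup cost).
    return vector_len + (steps - 1)

VECTOR_LEN = 64   # cycles per vector operation, from the slide
STEPS = 3         # (both loads in parallel), add, store

print(cycles_unchained(STEPS, VECTOR_LEN))   # 192
print(cycles_chained(STEPS, VECTOR_LEN))     # 66
```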
Chaining rearranging
vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3
• Rearranging:
vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3
• 2 load pipes, 1 addition pipe, 1 multiplication pipe
Instruction fusion
vload a,x(i)
vload b,y(i)
vadd t1,a,b
vload c,z(i)
vmul t2,c,t1
vmul t3,a,b
vadd t4,c,t3
Instruction fusion – cont.
vload a,x(i)
vload b,y(i)
vadd t1,a,b
vmul t3,a,b
vload c,z(i)
vmul t2,c,t1
vadd t4,c,t3
After Fusion
The End!