Combining Scheduling & Allocation
COMP 412, Fall 2010

Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

The Last Lecture
Combining Scheduling & Allocation
Sometimes, combining two optimizations can produce solutions that cannot be obtained by solving them independently.
• Requires bilateral interactions between optimizations
  — Click and Cooper, “Combining Analyses, Combining Optimizations,” TOPLAS 17(2), March 1995.
• Combining two optimizations can be a challenge (SCCP)

Scheduling & allocation are a classic example
• Scheduling changes variable lifetimes
• Renaming in the allocator changes dependences (false dependences)
• Spilling changes the underlying code
Many authors have tried to combine allocation & scheduling
• Underallocate to leave “room” for the scheduler
  — Can result in underutilization of registers
• Preallocate to use all registers
  — Can create false dependences
• Solving the problems together can produce solutions that cannot be obtained by solving them independently
  — See Click and Cooper, “Combining Analyses, Combining Optimizations,” TOPLAS 17(2), March 1995.
In general, these papers try to combine global allocators with local or regional schedulers — an algorithmic mismatch
Combining Scheduling & Allocation
Before we go there, a long digression about how much improvement we might expect …
Iterative Repair Scheduling
The Problem
• List scheduling has dominated the field for 20 years
• Anecdotal evidence both good & bad, little solid evidence
• No intuitive paradigm for how it works
• It works well, but will it work well in the future?
• Is there room for improvement? (e.g., with allocation?)
Schielke’s Idea
• Try more powerful algorithms from other domains
• Look for better schedules
• Look for understanding of the solution space
This led us to iterative repair scheduling
Iterative Repair Scheduling
The Algorithm
• Start from some approximation to a schedule (bad or broken)
• Find & prioritize all cycles that need repair (tried 6 schemes)
  — Either resource or data constraints
• Perform the needed repairs, in priority order
  — Break ties randomly
  — Reschedule dependent operations, in random order
  — Evaluation function on repair can reject the repair (try another)
• Iterate until repair list is empty
• Repeat this process many times to explore the solution space
  — Keep the best result!
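As a concrete illustration, the repair loop above might look like the following Python sketch. This is a minimal sketch, not Schielke's implementation: it repairs only data-dependence violations, ignores resource constraints and the evaluation function, and all names are hypothetical.

```python
import random

def iterative_repair(ops, deps, latency, restarts=10, seed=0):
    """Minimal sketch of iterative-repair scheduling (hypothetical API).

    ops     -- list of operation ids (a DAG is assumed)
    deps    -- dict: op -> set of ops it depends on
    latency -- dict: op -> latency in cycles
    Returns the best (shortest) schedule found: dict op -> issue cycle.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        # Start from a deliberately broken schedule: everything in cycle 0.
        sched = {op: 0 for op in ops}
        # Repair until no data constraint is violated.
        while True:
            broken = [op for op in ops
                      if any(sched[op] < sched[p] + latency[p]
                             for p in deps[op])]
            if not broken:
                break
            op = rng.choice(broken)  # break ties randomly
            sched[op] = max(sched[p] + latency[p] for p in deps[op])
        length = max(sched[op] + latency[op] for op in ops)
        if best is None or length < best[0]:
            best = (length, dict(sched))  # keep the best result
    return best[1]
```

Note that in this stripped-down form every restart converges to the same ASAP-style schedule; in the real algorithm, resource repairs and the repair-rejecting evaluation function are what make randomized restarts explore genuinely different schedules.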
Randomization & restart is a fundamental theme of our recent work
Iterative repair works well on many kinds of scheduling problems.
• Scheduling cargo for the space shuttle
• Typical problems in the literature involve 10s or 100s of repairs

We used it with millions of repairs.
Iterative Repair Scheduling
How does iterative repair do versus list scheduling?
• Found many schedules that used fewer registers
• Found very few faster schedules
• Were disappointed with the results
• Began a study of the properties of scheduling problems
Iterative repair, by itself, doesn’t justify the additional costs
• Can we identify schedulers where it will win?
• Can we learn about the properties of scheduling problems?
  — And about the behavior of list scheduling ...
Hopeful sign for this lecture
Instruction Scheduling Study

Methodology
• Looked at blocks & extended blocks in benchmark programs
• Used Schielke’s RBF algorithm & tested for optimality
• If non-optimal, used IR to find its best schedule (simple tests)
• Checked these results against an IP formulation using CPLEX

The Results
• List scheduling does quite well on a conventional uniprocessor
  — Over 92% of blocks scheduled optimally for speed
  — Over 73% of extended blocks scheduled optimally for speed
• CPLEX had a hard time with the easy blocks
  — Too many optimal solutions to investigate

These results were obtained with code from benchmark programs. Recall, from the local scheduling lecture, that RBF generated optimal schedules for 80% of the randomly generated blocks.
Holes in schedule? Delays on critical path?
Combining Allocation & Scheduling
The Problem
• Well understood that the problems are intricately related
• Previous work under-allocates or under-schedules
  — Except Goodman & Hsu

Our Approach
• Formulate an iterative repair framework
  — Moves for scheduling, as before
  — Moves to decrease register pressure or to spill
• Allows fair competition in a combined attack

Grows out of a search for novel techniques from other areas
Back to today’s subject
Combining Allocation & Scheduling
The Details
• Run the IR scheduler & keep the schedule with the lowest demand for registers (register pressure)
• Start with an ALAP schedule rather than an ASAP schedule
• Reject any repair that increases maximum pressure
• A cycle with pressure > k triggers “pressure repair”
  — Identify ops that reduce pressure & move one
  — A lower threshold for k seems to help
• Ran it against the classic method
  — Schedule, allocate, schedule (using Briggs’ allocator)
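The pressure-repair trigger is easy to state concretely. Here is a minimal sketch, with hypothetical names, assuming each value's live range is given as an inclusive interval of cycles:

```python
def pressure_by_cycle(live_ranges, num_cycles):
    """Register pressure in each cycle, computed from value live ranges.

    live_ranges -- dict: value -> (def_cycle, last_use_cycle), inclusive
    """
    pressure = [0] * num_cycles
    for start, end in live_ranges.values():
        for c in range(start, end + 1):
            pressure[c] += 1
    return pressure

def cycles_needing_repair(pressure, k):
    """Cycles whose pressure exceeds k trigger a 'pressure repair'."""
    return [c for c, p in enumerate(pressure) if p > k]
```

A pressure-repair move would then pick an op live in an offending cycle and reschedule it to shorten a live range, rejecting the move if maximum pressure rises.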
Combining Allocation & Scheduling
The Results
• Many opportunities to lower pressure
  — 12% of basic blocks
  — 33% of extended blocks
• These schedules may be faster, too
  — Best case was 41.3% (procedure)
  — Average case, 16 regs, was 5.4%
  — Average case, 32 regs, was 3.5% (whole applications)

This approach finds faster codes that spill fewer values. It is competing against a very good global allocator
— Rematerialization catches many of the same effects
Knowing that new solutions exist does not ensure that they are better solutions!

This work confirms years of suspicion, while providing an effective, albeit nontraditional, technique
The opportunity is present, but the IR scheduler is still quite slow …
Balancing Speed and Register Pressure
Goodman & Hsu proposed a novel scheme
• Context: debate about prepass versus postpass scheduling
• Problem: tradeoff between allocation & scheduling
• Solution:
  — Schedule for speed until fewer than Threshold registers remain
  — Schedule for registers until more than Threshold registers are free
• Details:
  — “for speed” means one of the latency-weighted priorities
  — “for registers” means an incremental adaptation of the SU scheme
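The threshold switch can be sketched as a choice between two priority functions. This is an illustration of the idea only; all names and signatures here are hypothetical, not from the Goodman & Hsu paper.

```python
def pick_next(ready, free_regs, threshold, latency_priority, pressure_priority):
    """Sketch of the Goodman & Hsu threshold idea.

    While plenty of registers remain free, pick the ready op with the best
    latency-weighted priority ("for speed"); once free registers drop below
    Threshold, switch to a pressure-reducing, SU-style priority
    ("for registers").
    """
    if free_regs >= threshold:
        return max(ready, key=latency_priority)    # schedule for speed
    return max(ready, key=pressure_priority)       # schedule for registers
```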
James R. Goodman and Wei-Chung Hsu, “Code Scheduling and Register Allocation in Large Basic Blocks,” Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, 1988, pages 442-452.
Other approaches in the literature
Local Scheduling & Register Allocation
List scheduling is a local, incremental algorithm
• Decisions made on an operation-by-operation basis
• Uses local (basic-block level) metrics

Need a local, incremental register-allocation algorithm
• Best’s algorithm, called “bottom-up local” in EaC
  — To free a register, evict the value with the furthest next use
• Uses local (basic-block level) metrics

Combining these two algorithms leads to a fair, local algorithm for the combined problem
— Idea is due to Dae-Hwan Kim & Hyuk-Jae Lee
— Can use a non-local eviction heuristic (new twist on Best’s alg.)

See Dae-Hwan Kim and Hyuk-Jae Lee, “Integrated instruction scheduling and fine-grain register allocation for embedded processors,” LNCS 4017, pages 269-278, July 2006 (6th Int’l Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2006), Samos, Greece).
Original Code for Local List Scheduling
Cycle ← 1
Ready ← leaves of D
Active ← Ø
while (Ready ∪ Active ≠ Ø)
    if (Ready ≠ Ø) then
        remove an op from Ready
        S(op) ← Cycle
        Active ← Active ∪ { op }
    Cycle ← Cycle + 1
    update the Ready queue
Paraphrasing from the local scheduling lecture …
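The pseudocode above can be rendered as a runnable sketch for a single functional unit. Names are hypothetical, and the priority function here is simply "lowest op id"; any of the priority schemes from the scheduling lectures could be substituted.

```python
def list_schedule(ops, deps, latency):
    """Minimal local list scheduler for a single functional unit.

    deps maps each op to the set of ops it must follow (its DAG parents).
    Returns dict: op -> issue cycle (S in the slides' notation).
    """
    cycle = 1
    sched = {}
    ready = {op for op in ops if not deps[op]}   # leaves of D
    active = set()
    while ready or active:
        if ready:
            op = min(ready)          # stand-in priority: lowest op id
            ready.remove(op)
            sched[op] = cycle
            active.add(op)
        cycle += 1
        # Retire finished ops, then update the Ready queue.
        active -= {op for op in active
                   if sched[op] + latency[op] <= cycle}
        for op in ops:
            if op not in sched and op not in ready:
                if all(p in sched and sched[p] + latency[p] <= cycle
                       for p in deps[op]):
                    ready.add(op)
    return sched
```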
The Combined Algorithm
Cycle ← 1
Ready ← leaves of D
Active ← Ø
while (Ready ∪ Active ≠ Ø)
    if (Ready ≠ Ø) then
        remove an op from Ready
        make operands available in registers
        allocate a register for target
        S(op) ← Cycle
        Active ← Active ∪ { op }
    Cycle ← Cycle + 1
    update the Ready queue
Reload Live on Exit values, if necessary
Bottom-up local:
• Keep a list of free registers
• On last use, put register back on free list
• To free a register, store the value used farthest in the future

Fast, simple, & effective
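A minimal sketch of that eviction step, assuming each value's next-use distance is available; the names and the calling convention are illustrative, not from EaC.

```python
def ensure_register(value, free, in_reg, next_use, spilled):
    """One step of Best's 'bottom-up local' allocation.

    free     -- list of free physical register names
    in_reg   -- dict: value -> register currently holding it
    next_use -- dict: value -> index of its next use (inf if none)
    spilled  -- set collecting values we had to store
    Returns the register assigned to `value`.
    """
    if value in in_reg:                      # already in a register
        return in_reg[value]
    if not free:
        # Evict the live value whose next use is farthest in the future.
        victim = max(in_reg, key=lambda v: next_use[v])
        spilled.add(victim)                  # a store would be emitted here
        free.append(in_reg.pop(victim))
    reg = free.pop()
    in_reg[value] = reg                      # a load would be emitted here
    return reg
```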
Notes on the Final Exam
• Closed-notes, closed-book exam
• Exam available Wednesday
• Three-hour time limit
  — I aimed for a two-hour exam, but I don’t want you to feel time pressure. You may take one break of up to fifteen minutes.
• You are responsible for the entire course
  — Exam focuses primarily on material since the midterm
  — Chapters 5, 6, 7, 8, 9.1, 9.2, 11, 12, & 13
  — All the lecture notes
• Return the exam to DH 3080 (Penny Anderson’s office) by 5 PM on the last day of exams – December 15, 2010
• If you must leave, you can email me a Word file or a PDF document
Schielke’s RBF Algorithm for Local Scheduling
Relying on randomization & restart, we can smooth the behavior of classic list scheduling algorithms
Schielke’s RBF algorithm
• Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
• Break each tie randomly
• Keep the best schedule
  — Shortest time to completion
  — Other metrics are possible (shortest time + fewest registers)
In practice, this approach does very well
— Reuses the dependence graph
Randomized Backward & Forward

My “algorithm of choice” for list scheduling …
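RBF can be sketched as a driver around any randomized list scheduler. This is a hedged sketch with hypothetical names: it models a backward pass as list scheduling over the reversed dependence graph and glosses over mirroring the backward result into forward issue order.

```python
import random

def rbf_schedule(ops, deps, latency, schedule_fn, passes=5, seed=0):
    """Sketch of Schielke's RBF driver (illustrative, simplified).

    schedule_fn -- any list scheduler (ops, deps, latency, rng) -> schedule
                   that breaks priority ties using the supplied rng
    Runs `passes` forward and `passes` backward randomized passes and
    keeps the shortest schedule found.
    """
    rng = random.Random(seed)
    # Reversed dependence graph for the backward passes.
    rev = {op: {succ for succ in ops if op in deps[succ]} for op in ops}
    best = None
    for direction_deps in (deps, rev):
        for _ in range(passes):
            sched = schedule_fn(ops, direction_deps, latency, rng)
            length = max(sched[o] + latency[o] for o in ops)
            if best is None or length < best[0]:
                best = (length, sched)       # keep the best schedule
    return best[1]
```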