LCA14-306: CPUidle & CPUfreq integration with scheduler
Transcript of the LCA14 session
Wed 5 March, 11:15am, Daniel Lezcano, Mike Turquette
Introduction
● Power aware discussion
● Patchset "Small task packing"
  − Some information shared between cpuidle and the scheduler
  − https://lwn.net/Articles/520857/
● "Line in the sand" by Ingo Molnar
  − Integrate cpuidle and cpufreq with the scheduler first
  − http://lwn.net/Articles/552885/
CPUidle + scheduler: Current design
[Diagram: the scheduler switches to the idle task, which calls cpuidle_idle_call(); the governor selects a state (cpuidle_select) and the CPUidle backend driver enters it (cpuidle_enter).]
Idle time measurement
● From the scheduler:
  − The duration the idle task is running
  − Includes the interrupt processing time
● From CPUidle:
  − The duration between interrupts
● CPUidle code runs with local interrupts disabled
● T(idle task) = Σ T(CPUidle) + Σ T(irqs)
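The equation above can be made concrete with a toy model (all numbers invented, function name illustrative): cpuidle measures the intervals between interrupts, while the scheduler measures how long the idle task runs, which also covers the interrupt handlers.

```c
#include <assert.h>

/* Toy model of the two idle-time views (invented numbers, in microseconds).
 * The scheduler's view of idle time also covers interrupt processing:
 * T(idle task) = sum T(cpuidle) + sum T(irq). */
static unsigned long sched_idle_time(const unsigned long *cpuidle_us,
                                     const unsigned long *irq_us, int n)
{
	unsigned long total = 0;

	for (int i = 0; i < n; i++)
		total += cpuidle_us[i] + irq_us[i];
	return total;
}
```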
Idle time measurement unification
● What is the impact of returning to the scheduler each time an interrupt occurs?
  − The scheduler will choose the idle task again if there is nothing to do
  − Mainloop code is simplified
  − Idle time is measured nearly the same for the scheduler and cpuidle
  − Probably a negative impact on performance to fix
Load balance
● Taking the decision to balance a task when going to idle
  ■ Uses avg_idle
  ■ Does not use how long the cpu will sleep
  ■ The idle state should be selected beforehand
  ■ CPUidle should give the state the cpu will be in
● Balancing a task to the idlest cpu
  ■ Does not use the cpu's exit latency
  ■ CPUidle should give back the state the cpu is in
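A minimal sketch of the second point, assuming CPUidle exported each CPU's current idle state: prefer the idle CPU whose state has the smallest exit latency, rather than only "the idlest" one. The names and numbers here are illustrative, not kernel API.

```c
#include <assert.h>

/* Illustrative only: pick a wake-up CPU using the exit latency of the
 * idle state each CPU currently sits in, information the current load
 * balancer does not consult. */
struct cpu_state {
	int idle;                     /* 1 if the cpu is idle */
	unsigned int exit_latency_us; /* exit latency of its idle state */
};

static int best_wake_cpu(const struct cpu_state *cpus, int ncpus)
{
	int best = -1;
	unsigned int best_lat = (unsigned int)-1;

	for (int i = 0; i < ncpus; i++) {
		if (!cpus[i].idle)
			continue;
		if (cpus[i].exit_latency_us < best_lat) {
			best_lat = cpus[i].exit_latency_us;
			best = i;
		}
	}
	return best; /* -1 when no cpu is idle */
}
```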
CPUidle main function
● Reduce the distance between the scheduler and the cpuidle framework
  − Move the idle task to kernel/sched
  − Move the cpuidle_idle function into the idle task code
  − Integrate the idle mainloop and cpuidle_idle_call
● Allows cpuidle to access the scheduler's private structure definitions
Menu governor split
● The events could be classified in three categories:
  1. Predictable → timers
  2. Repetitive → IOs
  3. Random → key stroke, incoming packet
● Category 2 could be integrated into the scheduler
IO latency tracking
● IOs are repetitive within a reasonable interval, so they can be assumed predictable enough
● Measured from the scheduler
  − io_schedule
  − io_schedule_timeout
● IO latency counted per task
  − Task migration moves the IO history, unlike the current governor
  − Provides a latency constraint for the task
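One plausible shape for the per-task accounting described above, hedged: a simple moving average kept in the (toy) task structure, so the history follows the task across migrations. The field and function names are invented for illustration.

```c
#include <assert.h>

/* Invented sketch: track per-task IO latency with a weighted moving
 * average. Because the value lives in the task's own bookkeeping, it
 * migrates with the task, unlike a per-cpu governor statistic. */
struct task_io_stats {
	unsigned long avg_io_latency_us;
};

/* Called when an io_schedule()/io_schedule_timeout() wait completes:
 * fold the measured latency into the average, weighting history 3:1. */
static void account_io_latency(struct task_io_stats *t, unsigned long lat_us)
{
	if (t->avg_io_latency_us == 0)
		t->avg_io_latency_us = lat_us;
	else
		t->avg_io_latency_us = (3 * t->avg_io_latency_us + lat_us) / 4;
}
```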
Combining information
● Move the predictable event framework into the scheduler
● Information combined between the scheduler and the menu governor will be more accurate
  − Idle balance decisions based on the idle state a cpu is in or about to enter
  − Load tracking from tasks for idle state exit latency
  − CPU computation power and topology
  − DVFS strategies for exit idle state boost
Scheduler + CPUidle
● The scheduler should have all the information to tell CPUidle:
  − How long the cpu will sleep
  − What the latency constraint is
● CPUidle should use the information provided by the scheduler:
  − Select an idle state
  − Use the backend driver idle callback
  − No more heuristics
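With those two inputs, heuristic-free selection reduces to a table walk, sketched here with invented state numbers: pick the deepest state whose target residency fits the predicted sleep length and whose exit latency respects the constraint.

```c
#include <assert.h>

/* Sketch of heuristic-free idle state selection: the scheduler supplies
 * the predicted sleep length and the latency constraint; cpuidle just
 * picks the deepest state satisfying both. States are ordered shallow to
 * deep; the numbers are made up. */
struct idle_state {
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

static int select_idle_state(const struct idle_state *states, int nstates,
			     unsigned long sleep_us, unsigned long latency_req_us)
{
	int chosen = 0; /* state 0: always-available shallow state */

	for (int i = 1; i < nstates; i++) {
		if (states[i].target_residency_us > sleep_us)
			break;
		if (states[i].exit_latency_us > latency_req_us)
			break;
		chosen = i;
	}
	return chosen;
}
```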
Status
● A lot of cleanups around the idle mainloop
● CPUidle main function inside the idle mainloop
  − Code distance reduced, scheduler/cpuidle structures shared
  − Communication between sub-systems made easier
Work in progress
● First iteration of IO latency tracking implemented
  − Validation in progress
● Simple governor for CPUidle
  − Selects a state
● Idle time unification experimentation
CPUfreq + scheduler
The title is misleading … CPUfreq may completely disappear in the future.
The goal is to initiate CPU dynamic voltage & frequency scaling (DVFS) from the Linux scheduler.
Nobody knows what this will look like, so please ask questions and raise suggestions.
CPUfreq today
• Polling workqueue
  • E.g. ondemand
  • Based on idle time / busyness
• No relation to decisions taken by the scheduler
  • The task may be run at any time
• No relation to the idle task
  • In fact, the task will not wake up during idle
Event driven behavior
• Replace the polling loop with event driven action
• The scheduler already takes actions which affect available compute capacity
  • Load balance
  • Migrating tasks to and from CPUs of different compute capacity
• DVFS transitions are a natural fit
Lots of work ahead
• Method to initiate CPU DVFS transitions from the scheduler
• Identify call sites to initiate those transitions
  • Enqueue/dequeue task
  • Load balance
  • Idle entry/exit
  • Aggressively schedule deadline tasks
  • Maybe others
• Define the interface between the scheduler & the DVFS thingy
  • Currently a power driver in Morten’s RFC
  • Remove the CPUfreq governor layer from the power driver completely?
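A hedged sketch of the event-driven idea: instead of a governor sampling on a timer, an enqueue-time hook compares tracked load against current capacity and raises a frequency request only when the load no longer fits. Everything here (names, thresholds, the doubling) is illustrative, not Morten's actual RFC code.

```c
#include <assert.h>

/* Illustrative event-driven DVFS hook. The scheduler would call this at
 * enqueue/dequeue, load balance, or idle entry/exit, and request a
 * frequency change only when the tracked load exceeds current capacity. */
struct cpu_dvfs {
	unsigned long capacity; /* current compute capacity */
	unsigned long load;     /* tracked load at this capacity */
	int freq_requests;      /* how many go-faster requests were raised */
};

static void enqueue_task_hook(struct cpu_dvfs *cpu, unsigned long task_load)
{
	cpu->load += task_load;
	if (cpu->load > cpu->capacity) {
		cpu->freq_requests++; /* would call power_driver->go_faster() */
		cpu->capacity *= 2;   /* pretend the next OPP doubles capacity */
	}
}
```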
Lots of work ahead, part 2
• Experiment with policy
  • When and where to evaluate if the frequency should be changed
  • What metrics are important to the algorithm?
  • DVFS versus race-to-idle
• Integrate with the power model
• Benchmark performance & power
  • Performance regressions
  • Does it save power?
• Make it work with non-CPUfreq things like PSCI and ACPI for changing the CPU P-state
Morten’s power aware scheduling RFC
• https://lkml.org/lkml/2013/10/11/547
• Replaces the polling loop in the CPUfreq governor with scheduler event-driven action
• CPUfreq machine drivers are re-used initially
• The CPUfreq governor becomes a shim layer to the power driver
Nitty gritty details
• The DVFS task is itself scheduled on a workqueue
  • Might not be run for some time after the scheduler determines that a DVFS transition should happen
• Kworker threads are filtered out
  • Prevents infinite reentrancy into the scheduler
  • CPU capacity is not changed when enqueuing and dequeuing these tasks
include/linux/sched/power.h
struct power_driver {
	/*
	 * Power driver calls may happen from scheduler context with irq
	 * disabled and rq locks held. This must be taken into account in
	 * the power driver.
	 */
	/* cpu already at max capacity? */
	int (*at_max_capacity) (int cpu);
	/* Increase cpu capacity hint */
	int (*go_faster) (int cpu, int hint);
	/* Decrease cpu capacity hint */
	int (*go_slower) (int cpu, int hint);
	/* Best cpu to wake up */
	int (*best_wake_cpu) (void);
	/* Scheduler call-back without rq lock held and with irq enabled */
	void (*late_callback) (int cpu);
};
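For illustration, a toy backend filling in this interface might look like the following. The struct is repeated so the sketch stands alone, and the callback bodies are placeholders stepping through invented operating points, not Morten's driver.

```c
#include <assert.h>

/* Toy power driver implementing the power_driver interface from
 * Morten's RFC. Callback bodies are placeholders, not hardware code. */
struct power_driver {
	int (*at_max_capacity)(int cpu);
	int (*go_faster)(int cpu, int hint);
	int (*go_slower)(int cpu, int hint);
	int (*best_wake_cpu)(void);
	void (*late_callback)(int cpu);
};

#define NR_OPPS 3
static int cur_opp[4]; /* per-cpu operating point index, 0..NR_OPPS-1 */

static int toy_at_max_capacity(int cpu) { return cur_opp[cpu] == NR_OPPS - 1; }

static int toy_go_faster(int cpu, int hint)
{
	(void)hint;
	if (cur_opp[cpu] < NR_OPPS - 1)
		cur_opp[cpu]++;
	return cur_opp[cpu];
}

static int toy_go_slower(int cpu, int hint)
{
	(void)hint;
	if (cur_opp[cpu] > 0)
		cur_opp[cpu]--;
	return cur_opp[cpu];
}

static int toy_best_wake_cpu(void) { return 0; }

/* Called with irqs enabled and no rq lock: sleepable work could go here. */
static void toy_late_callback(int cpu) { (void)cpu; }

static const struct power_driver toy_driver = {
	.at_max_capacity = toy_at_max_capacity,
	.go_faster       = toy_go_faster,
	.go_slower       = toy_go_slower,
	.best_wake_cpu   = toy_best_wake_cpu,
	.late_callback   = toy_late_callback,
};
```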
Incremental changes on top
• https://github.com/mturquette/linux/commits/sched-cpufreq
• Replaced the workqueue method with a per-CPU kthread
  • This allows removal of the kworker filter
  • Please commence bikeshedding over the name of this kthread
• Use SCHED_FIFO policy for the task
  • Will be run before the normal work (right?)
• These patches were just validated yesterday
  • Bugs
  • Holes in logic
  • Misunderstandings
  • Voided warranties
What’s next?
• Gather more opinions on the power driver interface
• Is go_faster/go_slower the right way?
  • Spoiler alert: probably not.
• When else might we want to evaluate CPU frequency?
  • Idle entry/exit, as mentioned by Daniel
  • Cluster-level considerations
• Sched domains
  • Not just per-core
  • Four Cortex-A9’s with a single CPU clock
• Coordinate with the power model work
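The cluster-level consideration (several cores behind a single clock, like the quad Cortex-A9 case) implies aggregating per-core requests somehow; a common approach, sketched here with an invented helper, is to run the cluster at the maximum of the per-CPU frequency votes.

```c
#include <assert.h>

/* Sketch of cluster-level DVFS aggregation: when several cores share one
 * clock, the cluster frequency must satisfy the hungriest core, so take
 * the max of the per-cpu requests. Invented helper, not kernel API. */
static unsigned int cluster_freq_khz(const unsigned int *requests, int ncpus)
{
	unsigned int max = 0;

	for (int i = 0; i < ncpus; i++)
		if (requests[i] > max)
			max = requests[i];
	return max;
}
```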
Questions?
More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members