LCA14-306: CPUidle & CPUfreq integration with scheduler
Transcript of the LCA14 session
Wed 5 March, 11:15am, Daniel Lezcano, Mike Turquette
Introduction
● Power aware discussion
● Patchset "Small task packing"
  − Some information shared between cpuidle and the scheduler
  − https://lwn.net/Articles/520857/
● "Line in the sand" by Ingo Molnar
  − Integrate cpuidle and cpufreq with the scheduler first
  − http://lwn.net/Articles/552885/
CPUidle + scheduler: Current design
[Diagram: the scheduler switches to the idle task, which calls cpuidle_idle_call(); the governor selects a state (cpuidle_select) and the CPUidle backend driver enters it (cpuidle_enter).]
Idle time measurement
● From the scheduler:
  − The duration the idle task is running
  − Includes the interrupt processing time
● From CPUidle:
  − The duration between interrupts
● CPUidle code runs with local interrupts disabled
● T(idle task) = Σ T(CPUidle) + Σ T(irqs)
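The equation above can be made concrete with a toy model (all numbers invented, function name illustrative): cpuidle measures the intervals between interrupts, while the scheduler measures how long the idle task runs, which also covers the interrupt handlers.

```c
#include <assert.h>

/* Toy model of the two idle-time views (invented numbers, in microseconds).
 * The scheduler's view of idle time also covers interrupt processing:
 * T(idle task) = sum T(cpuidle) + sum T(irq). */
static unsigned long sched_idle_time(const unsigned long *cpuidle_us,
                                     const unsigned long *irq_us, int n)
{
	unsigned long total = 0;

	for (int i = 0; i < n; i++)
		total += cpuidle_us[i] + irq_us[i];
	return total;
}
```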
Idle time measurement unification
● What is the impact of returning to the scheduler each time an interrupt occurs?
  − The scheduler will choose the idle task again if there is nothing to do
  − Mainloop code is simplified
  − Idle time is measured nearly the same for the scheduler and cpuidle
  − Probably a negative impact on performance to fix
Load balance
● Taking the decision to balance a task when going to idle
  ■ Uses avg_idle
  ■ Does not use how long the cpu will sleep
  ■ The idle state should be selected beforehand
  ■ CPUidle should give the state the cpu will be in
● Balancing a task to the idlest cpu
  ■ Does not use the cpu's exit latency
  ■ CPUidle should give back the state the cpu is in
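A minimal sketch of the second point, assuming CPUidle exported each CPU's current idle state: prefer the idle CPU whose state has the smallest exit latency, rather than only "the idlest" one. The names and numbers here are illustrative, not kernel API.

```c
#include <assert.h>

/* Illustrative only: pick a wake-up CPU using the exit latency of the
 * idle state each CPU currently sits in, information the current load
 * balancer does not consult. */
struct cpu_state {
	int idle;                     /* 1 if the cpu is idle */
	unsigned int exit_latency_us; /* exit latency of its idle state */
};

static int best_wake_cpu(const struct cpu_state *cpus, int ncpus)
{
	int best = -1;
	unsigned int best_lat = (unsigned int)-1;

	for (int i = 0; i < ncpus; i++) {
		if (!cpus[i].idle)
			continue;
		if (cpus[i].exit_latency_us < best_lat) {
			best_lat = cpus[i].exit_latency_us;
			best = i;
		}
	}
	return best; /* -1 when no cpu is idle */
}
```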
CPUidle main function
● Reduce the distance between the scheduler and the cpuidle framework
  − Move the idle task to kernel/sched
  − Move the cpuidle_idle function into the idle task code
  − Integrate the idle mainloop and cpuidle_idle_call
● Allows cpuidle to access the scheduler's private structure definitions
Menu governor split
● The events could be classified in three categories:
  1. Predictable → timers
  2. Repetitive → IOs
  3. Random → key stroke, incoming packet
● Category 2 could be integrated into the scheduler
IO latency tracking
● IOs are repetitive within a reasonable interval, so they can be assumed predictable enough
● Measured from the scheduler
  − io_schedule
  − io_schedule_timeout
● IO latency counted per task
  − Task migration moves the IO history, unlike the current governor
  − Provides a latency constraint for the task
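One plausible shape for the per-task accounting described above, hedged: a simple moving average kept in the (toy) task structure, so the history follows the task across migrations. The field and function names are invented for illustration.

```c
#include <assert.h>

/* Invented sketch: track per-task IO latency with a weighted moving
 * average. Because the value lives in the task's own bookkeeping, it
 * migrates with the task, unlike a per-cpu governor statistic. */
struct task_io_stats {
	unsigned long avg_io_latency_us;
};

/* Called when an io_schedule()/io_schedule_timeout() wait completes:
 * fold the measured latency into the average, weighting history 3:1. */
static void account_io_latency(struct task_io_stats *t, unsigned long lat_us)
{
	if (t->avg_io_latency_us == 0)
		t->avg_io_latency_us = lat_us;
	else
		t->avg_io_latency_us = (3 * t->avg_io_latency_us + lat_us) / 4;
}
```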
Combining information
● Move the predictable event framework into the scheduler
● Information combined between the scheduler and the menu governor will be more accurate
  − Idle balance decisions based on the idle state a cpu is in or about to enter
  − Load tracking from tasks for idle state exit latency
  − CPU computation power and topology
  − DVFS strategies for exit idle state boost
Scheduler + CPUidle
● The scheduler should have all the information to tell CPUidle:
  − How long the cpu will sleep
  − What the latency constraint is
● CPUidle should use the information provided by the scheduler:
  − Select an idle state
  − Use the backend driver idle callback
  − No more heuristics
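With those two inputs, heuristic-free selection reduces to a table walk, sketched here with invented state numbers: pick the deepest state whose target residency fits the predicted sleep length and whose exit latency respects the constraint.

```c
#include <assert.h>

/* Sketch of heuristic-free idle state selection: the scheduler supplies
 * the predicted sleep length and the latency constraint; cpuidle just
 * picks the deepest state satisfying both. States are ordered shallow to
 * deep; the numbers are made up. */
struct idle_state {
	unsigned int exit_latency_us;
	unsigned int target_residency_us;
};

static int select_idle_state(const struct idle_state *states, int nstates,
			     unsigned long sleep_us, unsigned long latency_req_us)
{
	int chosen = 0; /* state 0: always-available shallow state */

	for (int i = 1; i < nstates; i++) {
		if (states[i].target_residency_us > sleep_us)
			break;
		if (states[i].exit_latency_us > latency_req_us)
			break;
		chosen = i;
	}
	return chosen;
}
```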
Status
● A lot of cleanups around the idle mainloop
● CPUidle main function inside the idle mainloop
  − Code distance reduced, scheduler/cpuidle structures shared
  − Communication between sub-systems made easier
Work in progress
● First iteration of IO latency tracking implemented
  − Validation in progress
● Simple governor for CPUidle
  − Selects a state
● Idle time unification experimentation
CPUfreq + scheduler
The title is misleading … CPUfreq may completely disappear in the future.
The goal is to initiate CPU dynamic voltage & frequency scaling (DVFS) from the Linux scheduler.
Nobody knows what this will look like, so please ask questions and raise suggestions.
CPUfreq today
• Polling workqueue
  • E.g. ondemand
  • Based on idle time / busyness
• No relation to decisions taken by the scheduler
  • The task may be run at any time
• No relation to the idle task
  • In fact, the task will not wake up during idle
Event driven behavior
• Replace the polling loop with event driven action
• The scheduler already takes actions which affect available compute capacity
  • Load balance
  • Migrating tasks to and from CPUs of different compute capacity
• DVFS transitions are a natural fit
Lots of work ahead
• Method to initiate CPU DVFS transitions from the scheduler
• Identify call sites to initiate those transitions
  • Enqueue/dequeue task
  • Load balance
  • Idle entry/exit
  • Aggressively schedule deadline tasks
  • Maybe others
• Define the interface between the scheduler & the DVFS thingy
  • Currently a power driver in Morten’s RFC
  • Remove the CPUfreq governor layer from the power driver completely?
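A hedged sketch of the event-driven idea: instead of a governor sampling on a timer, an enqueue-time hook compares tracked load against current capacity and raises a frequency request only when the load no longer fits. Everything here (names, thresholds, the doubling) is illustrative, not Morten's actual RFC code.

```c
#include <assert.h>

/* Illustrative event-driven DVFS hook. The scheduler would call this at
 * enqueue/dequeue, load balance, or idle entry/exit, and request a
 * frequency change only when the tracked load exceeds current capacity. */
struct cpu_dvfs {
	unsigned long capacity; /* current compute capacity */
	unsigned long load;     /* tracked load at this capacity */
	int freq_requests;      /* how many go-faster requests were raised */
};

static void enqueue_task_hook(struct cpu_dvfs *cpu, unsigned long task_load)
{
	cpu->load += task_load;
	if (cpu->load > cpu->capacity) {
		cpu->freq_requests++; /* would call power_driver->go_faster() */
		cpu->capacity *= 2;   /* pretend the next OPP doubles capacity */
	}
}
```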
Lots of work ahead, part 2
• Experiment with policy
  • When and where to evaluate if the frequency should be changed
  • What metrics are important to the algorithm?
  • DVFS versus race-to-idle
• Integrate with the power model
• Benchmark performance & power
  • Performance regressions
  • Does it save power?
• Make it work with non-CPUfreq things like PSCI and ACPI for changing the CPU P-state
Morten’s power aware scheduling RFC
• https://lkml.org/lkml/2013/10/11/547
• Replaces the polling loop in the CPUfreq governor with scheduler event-driven action
• CPUfreq machine drivers are re-used initially
• The CPUfreq governor becomes a shim layer to the power driver
Nitty gritty details
• The DVFS task is itself scheduled on a workqueue
  • Might not be run for some time after the scheduler determines that a DVFS transition should happen
• Kworker threads are filtered out
  • Prevents infinite reentrancy into the scheduler
  • CPU capacity is not changed when enqueuing and dequeuing these tasks
include/linux/sched/power.h
struct power_driver {
	/*
	 * Power driver calls may happen from scheduler context with irq
	 * disabled and rq locks held. This must be taken into account in
	 * the power driver.
	 */
	/* cpu already at max capacity? */
	int (*at_max_capacity) (int cpu);
	/* Increase cpu capacity hint */
	int (*go_faster) (int cpu, int hint);
	/* Decrease cpu capacity hint */
	int (*go_slower) (int cpu, int hint);
	/* Best cpu to wake up */
	int (*best_wake_cpu) (void);
	/* Scheduler call-back without rq lock held and with irq enabled */
	void (*late_callback) (int cpu);
};
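For illustration, a toy backend filling in this interface might look like the following. The struct is repeated so the sketch stands alone, and the callback bodies are placeholders stepping through invented operating points, not Morten's driver.

```c
#include <assert.h>

/* Toy power driver implementing the power_driver interface from
 * Morten's RFC. Callback bodies are placeholders, not hardware code. */
struct power_driver {
	int (*at_max_capacity)(int cpu);
	int (*go_faster)(int cpu, int hint);
	int (*go_slower)(int cpu, int hint);
	int (*best_wake_cpu)(void);
	void (*late_callback)(int cpu);
};

#define NR_OPPS 3
static int cur_opp[4]; /* per-cpu operating point index, 0..NR_OPPS-1 */

static int toy_at_max_capacity(int cpu) { return cur_opp[cpu] == NR_OPPS - 1; }

static int toy_go_faster(int cpu, int hint)
{
	(void)hint;
	if (cur_opp[cpu] < NR_OPPS - 1)
		cur_opp[cpu]++;
	return cur_opp[cpu];
}

static int toy_go_slower(int cpu, int hint)
{
	(void)hint;
	if (cur_opp[cpu] > 0)
		cur_opp[cpu]--;
	return cur_opp[cpu];
}

static int toy_best_wake_cpu(void) { return 0; }

/* Called with irqs enabled and no rq lock: sleepable work could go here. */
static void toy_late_callback(int cpu) { (void)cpu; }

static const struct power_driver toy_driver = {
	.at_max_capacity = toy_at_max_capacity,
	.go_faster       = toy_go_faster,
	.go_slower       = toy_go_slower,
	.best_wake_cpu   = toy_best_wake_cpu,
	.late_callback   = toy_late_callback,
};
```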
Incremental changes on top
• https://github.com/mturquette/linux/commits/sched-cpufreq
• Replaced the workqueue method with a per-CPU kthread
  • This allows removal of the kworker filter
  • Please commence bikeshedding over the name of this kthread
• Use SCHED_FIFO policy for the task
  • Will be run before the normal work (right?)
• These patches were just validated yesterday
  • Bugs
  • Holes in logic
  • Misunderstandings
  • Voided warranties
What’s next?
• Gather more opinions on the power driver interface
• Is go_faster/go_slower the right way?
  • Spoiler alert: probably not.
• When else might we want to evaluate CPU frequency?
  • Idle entry/exit, as mentioned by Daniel
  • Cluster-level considerations
• Sched domains
  • Not just per-core
  • Four Cortex-A9’s with a single CPU clock
• Coordinate with the power model work
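The cluster-level consideration (several cores behind a single clock, like the quad Cortex-A9 case) implies aggregating per-core requests somehow; a common approach, sketched here with an invented helper, is to run the cluster at the maximum of the per-CPU frequency votes.

```c
#include <assert.h>

/* Sketch of cluster-level DVFS aggregation: when several cores share one
 * clock, the cluster frequency must satisfy the hungriest core, so take
 * the max of the per-cpu requests. Invented helper, not kernel API. */
static unsigned int cluster_freq_khz(const unsigned int *requests, int ncpus)
{
	unsigned int max = 0;

	for (int i = 0; i < ncpus; i++)
		if (requests[i] > max)
			max = requests[i];
	return max;
}
```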
Questions?
More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members