A Balancing Act with Threads
The work of the scheduler never ends

The concept of a thread has long been familiar to computing. Even before the first multicore processors entered manufacture and widespread use, threads of execution, each representing a sequence of executing instructions, were used both to isolate code at runtime and to make the best use of available hardware resources.
Multithreading, by which a computer seemingly does many things at the same time, even on a processor that at first appears only able to perform tasks sequentially, has become central to most areas of computing. What controls and manages these threads, and how do they relate to the management of power?
The thread scheduler
The thread scheduler is a component of the operating system kernel which, at its core, allocates work to the physical hardware resources capable of performing it. (It may be of interest to note that in Windows, this component is called the "dispatcher".) Fig. 5 provides a very abstract overview of any scheduler: it maintains a queue of pending tasks, assigns each to an available resource according to some algorithm, and ideally ensures that every individual task makes progress.
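To make the picture concrete, here is a minimal sketch of that abstract scheduler: a linked queue of tasks drained onto whichever resources are free. All names and structures are illustrative, not taken from any real kernel.

```c
/* A minimal sketch of the Fig. 5 scheduler: a work queue of tasks
 * assigned to idle resources. Names are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

typedef struct task {
    int id;
    struct task *next;
} task_t;

typedef struct {
    bool busy;
    task_t *current;
} resource_t;

/* Pop queued tasks and assign each to the first idle resource.
 * A real scheduler would apply a policy here instead of "first idle". */
static void dispatch(task_t **queue, resource_t *res, size_t nres)
{
    for (size_t i = 0; i < nres && *queue != NULL; i++) {
        if (!res[i].busy) {
            res[i].current = *queue;
            res[i].busy = true;
            *queue = (*queue)->next;
        }
    }
}
```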
What does it mean for all threads to make progress? A good scheduler design ensures that a heavy workload carried out in one thread of execution will not, assuming all threads have equal priority, starve the others of processor time. In modern PC architectures, hardware support for pre-emptive multitasking lets the kernel pre-empt (forcibly suspend) the currently executing thread and return control to the scheduler, which then decides which thread or threads to resume.
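A toy round-robin simulation illustrates that fairness guarantee: every runnable thread receives a fixed slice in turn, so the heaviest thread (here, 40 ms of work) cannot starve the lighter ones. The slice length and workloads are invented figures for the demonstration.

```c
/* Toy round-robin simulation: equal slices, no starvation. */
#include <stdio.h>

#define NTHREADS 3
#define SLICE_MS 10

int main(void)
{
    int remaining_ms[NTHREADS] = { 25, 40, 15 }; /* work left per thread */
    int done = 0;

    while (done < NTHREADS) {
        for (int t = 0; t < NTHREADS; t++) {
            if (remaining_ms[t] <= 0)
                continue;
            int run = remaining_ms[t] < SLICE_MS ? remaining_ms[t] : SLICE_MS;
            remaining_ms[t] -= run;   /* thread runs for one slice */
            printf("thread %d ran %d ms\n", t, run);
            if (remaining_ms[t] == 0) {
                printf("thread %d finished\n", t);
                done++;
            }
            /* pre-emption point: control returns to the scheduler here */
        }
    }
    return 0;
}
```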
In essence, a scheduling algorithm balances one or more of the following requirements in deciding the optimal allocation of resources:
- Priority and affinity: higher-priority threads receive more time to do work, and affinitised threads are restricted to run on a subset of processor cores (see the sketch after this list).
- Latency or Quality of Service requirements: a media playback thread may be lightweight, but have tight constraints on how quickly it must respond to ensure smooth playback.
- Historical execution metrics: depending on previous measured executions of a given thread, such as time to finish or the core on which it was run, a scheduler may attempt to delay pre-emption to allow work to complete, or reassign a thread to the same core to make use of relevant data still present in the core's cache.
- User-visible operations: allowing greater priority for work which is visible to users, such as the UI of an application, in which slowness would impact experience.
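As a concrete instance of the first requirement, this is how a thread on Linux might pin itself to one core and lower its own priority. pthread_setaffinity_np and nice are real glibc/Linux interfaces; error handling is omitted for brevity.

```c
/* Pin the calling thread to core 0 and lower its priority on Linux. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static void pin_self_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* affinity mask: core 0 only */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    nice(10);           /* lower priority: leave more CPU for others */
}
```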
But what about power?
Of course, this article is interested in what the scheduler can do to realise power savings in a system. We have already examined the interfaces the platform makes available in the first article, and the OS components that use them. There is, then, a fifth requirement a modern scheduler takes into account:
- CPU power usage: threads should be executed in a way which balances energy usage and performance.
How is this achieved? Recall from the previous article the responsibilities of the power manager component. It keeps track of various characteristics of each logical core in the system, including but not limited to its current power state and clock frequency, and maintains a table describing all supported states, their power draw, and the latencies associated with transitioning to or from them.
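One plausible shape for that bookkeeping, with invented field names, might look like the following; real tables (such as those ACPI describes) differ in their details.

```c
/* Hypothetical layout of the power manager's per-core state table. */
typedef struct {
    int      state_id;          /* e.g. a C-state or P-state index  */
    unsigned power_mw;          /* typical draw in this state       */
    unsigned entry_latency_us;  /* cost to enter the state          */
    unsigned exit_latency_us;   /* cost to resume execution from it */
} power_state_t;

typedef struct {
    int            current_state;  /* what the core is in right now */
    unsigned       freq_mhz;       /* current clock frequency       */
    power_state_t *states;         /* table of all supported states */
    int            nstates;
} core_power_info_t;
```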
Most importantly, in the context of threading, the power manager communicates with the thread scheduler: it provides the scheduling algorithm with profile data that factors into its decisions, and receives information on load and activity which hints at the most appropriate power state to place each core in when the power manager next makes a decision.
Introduced here, then, is the concept of a power manager quantum: a period of time during which the manager is known to make no changes to the performance state of the processor cores. Within one quantum, the thread scheduler makes its decisions and allocates cores to threads accordingly.
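A rough sketch of how one such quantum might play out, with hypothetical stubbed-out hooks standing in for the real power manager and scheduler entry points:

```c
/* Hypothetical hooks, stubbed so the sketch compiles. */
struct load_hints { unsigned runnable_threads; unsigned busy_cores; };

static void schedule_threads_to_cores(struct load_hints *h)
{
    /* stand-in: pretend we filled at most four cores with runnable work */
    h->busy_cores = h->runnable_threads > 4 ? 4 : h->runnable_threads;
}

static void choose_next_power_states(const struct load_hints *h)
{
    (void)h; /* the power manager would pick states for the next quantum */
}

static void run_one_quantum(unsigned ticks)
{
    struct load_hints hints = { .runnable_threads = 6, .busy_cores = 0 };

    /* performance states stay frozen for the whole loop */
    for (unsigned t = 0; t < ticks; t++)
        schedule_threads_to_cores(&hints);

    /* observed activity hints the power manager's next decision */
    choose_next_power_states(&hints);
}
```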
Within the process itself, the scheduler has many heuristics with which to make its call. Foremost, given data on transition costs, it will avoid unnecessarily scheduling threads onto cores that are expensive to bring back to execution. If a core is in a deep idle state, where its cache may be flushed and/or its clock heavily gated, a thread due for resumption will first be queued to run on more active cores.
The scheduler may even decide to pre-empt a lower-priority thread on an already-awake core, if the heuristic concludes that leaving another core asleep is worthwhile. Indeed, a naive algorithm that tries to maximise usage of all cores for performance rather than efficiency may incur the even greater overhead of awakening a deeply sleeping core, worsening latencies in the process. If such a thread were a critical audio thread with tight deadlines, that would be completely unacceptable.
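A minimal sketch of these two heuristics combined, using invented structures and cost weights: charge deeply idle cores their exit latency when comparing candidates, and refuse outright to pick a core whose resume cost alone would blow a thread's latency budget.

```c
/* Hypothetical core-selection heuristic weighing wake-up costs. */
#include <limits.h>
#include <stdbool.h>

struct core {
    bool     deep_idle;        /* cache flushed, clock gated      */
    unsigned exit_latency_us;  /* cost to bring the core back up  */
    unsigned queue_len;        /* threads already waiting on it   */
};

/* Returns the chosen core index, or -1 if no core fits the budget. */
static int pick_core(const struct core *cores, int ncores,
                     unsigned latency_budget_us)
{
    int best = -1;
    unsigned best_cost = UINT_MAX;

    for (int i = 0; i < ncores; i++) {
        /* Never wake a core whose resume cost alone misses the deadline
         * (think of the critical audio thread above). */
        if (cores[i].deep_idle && cores[i].exit_latency_us > latency_budget_us)
            continue;

        unsigned cost = cores[i].queue_len * 100   /* queueing delay */
                      + (cores[i].deep_idle ? cores[i].exit_latency_us : 0);
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;
}
```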
Evidently, such an example demonstrates that performance and power efficiency need not be mutually exclusive. A third major optimisation goes by the name of Timer Coalescing, and is implemented in recent OS versions from many different vendors. Once again bearing in mind the costs of changing processor states, the scheduler may group together, or in other words coalesce, threads which it knows will need to start running at around the same time. Frequent movement between performance and idle states, which would otherwise be caused by servicing each incoming thread immediately rather than in batches, is thereby reduced or eliminated entirely.
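One way such coalescing can be expressed, as a sketch: each timer's deadline is rounded up to a shared boundary whenever the timer can tolerate the added slack, so wake-ups arrive in batches instead of trickling in. The 4 ms period is an invented figure; for comparison, Linux exposes a comparable per-thread knob via prctl(PR_SET_TIMERSLACK).

```c
/* Sketch of timer coalescing: align deadlines to a common boundary. */
#include <stdint.h>

#define COALESCE_PERIOD_US 4000  /* hypothetical batching boundary */

/* Round a deadline up to the next boundary if the timer can tolerate it;
 * otherwise leave it untouched so tight deadlines are still honoured. */
static uint64_t coalesce_deadline(uint64_t deadline_us, uint64_t slack_us)
{
    uint64_t rounded = ((deadline_us + COALESCE_PERIOD_US - 1)
                        / COALESCE_PERIOD_US) * COALESCE_PERIOD_US;
    return (rounded - deadline_us <= slack_us) ? rounded : deadline_us;
}
```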
Take a look at Fig. 7. Before this optimisation is applied, average power consumption hovers around 80% of the maximum on account of needless transitions: in the worst case, cores are woken up right after being put to sleep; in the best case, they remain idle in a power-hungry state in anticipation of a new thread arriving soon. Once the grouping is applied, the scheduler can make use of well-defined periods in which the processor runs at high performance to finish all pending tasks, and rests for the remainder of the time, dramatically improving average power draw to around 50%.
Closing thoughts
A modern PC (which, of course, includes mobile phones, tablet computers, Macs, and all other personal computing devices) is complex and interconnected; that is why this website exists. The task of the operating system is to mediate all the competition for a system's resources, acting as arbitrator between power-oblivious and heavy workloads, light and conservative apps, and ultimately the user's experience and power bill. It needs to exploit everything the underlying platform has to offer in terms of both performance and efficiency, and decide which of the two to prioritise many thousands of times a second. As preserving the environment, improving battery life, and maintaining the efficiency of software become ever more important, hopefully this series of articles has helped cast light on what goes on behind the scenes to make that a reality. ~tigerw 2018