- =======================
- Energy Aware Scheduling
- =======================
- 1. Introduction
- ---------------
- Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
- the impact of its decisions on the energy consumed by CPUs. EAS relies on an
- Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
- with a minimal impact on throughput. This document aims at providing an
- introduction to how EAS works, what the main design decisions behind it are,
- and what is needed to get it to run.
- Before going any further, please note that at the time of writing::
- /!\ EAS does not support platforms with symmetric CPU topologies /!\
- EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
- because this is where the potential for saving energy through scheduling is
- the highest.
- The actual EM used by EAS is _not_ maintained by the scheduler, but by a
- dedicated framework. For details about this framework and what it provides,
- please refer to its documentation (see Documentation/power/energy-model.rst).
- 2. Background and Terminology
- -----------------------------
- To make it clear from the start:
- - energy = [joule] (resource like a battery on powered devices)
- - power = energy/time = [joule/second] = [watt]
- The goal of EAS is to minimize energy, while still getting the job done. That
- is, we want to maximize::
- performance [inst/s]
- --------------------
- power [W]
- which is equivalent to minimizing::
- energy [J]
- -----------
- instruction
- while still getting 'good' performance. It is essentially an alternative
- optimization objective to the current performance-only objective for the
- scheduler. This alternative considers two objectives: energy-efficiency and
- performance.
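- Note that the two formulations above are indeed equivalent: dividing
- performance [inst/s] by power [J/s] cancels the time component and yields
- [inst/J], which is the inverse of the energy spent per instruction.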
- The idea behind introducing an EM is to allow the scheduler to evaluate the
- implications of its decisions rather than blindly applying energy-saving
- techniques that may have positive effects only on some platforms. At the same
- time, the EM must be as simple as possible to minimize the scheduler latency
- impact.
- In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
- for the scheduler to decide where a task should run (during wake-up), the EM
- is used to break the tie between several good CPU candidates and pick the one
- that is predicted to yield the best energy consumption without harming the
- system's throughput. The predictions made by EAS rely on specific elements of
- knowledge about the platform's topology, which include the 'capacity' of CPUs,
- and their respective energy costs.
- 3. Topology information
- -----------------------
- EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
- differentiate CPUs with different computing throughput. The 'capacity' of a CPU
- represents the amount of work it can absorb when running at its highest
- frequency compared to the most capable CPU of the system. Capacity values are
- normalized in a 1024 range, and are comparable with the utilization signals of
- tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
- to capacity and utilization values, EAS is able to estimate how big/busy a
- task/CPU is, and to take this into consideration when evaluating performance vs
- energy trade-offs. The capacity of CPUs is provided via arch-specific code
- through the arch_scale_cpu_capacity() callback.
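- As an illustration of how these normalized values are used, comparing a task's
- PELT utilization against a CPU's capacity is a simple fixed-point operation.
- The sketch below is a simplified example of such a comparison (hypothetical
- helper name and a ~20% headroom margin; not the exact kernel code)::
-         /*
-          * Does a task of utilization 'util' (0..1024, from PELT) fit on a CPU
-          * of capacity 'cap' (0..1024, from arch_scale_cpu_capacity())?
-          * Keep ~20% headroom so the CPU is not (almost) saturated.
-          */
-         static inline bool task_fits_cpu(unsigned long util, unsigned long cap)
-         {
-                 return util * 1280 < cap * 1024;   /* util * 1.25 < cap */
-         }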
- The rest of platform knowledge used by EAS is directly read from the Energy
- Model (EM) framework. The EM of a platform is composed of a power cost table
- per 'performance domain' in the system (see Documentation/power/energy-model.rst
- for further details about performance domains).
- The scheduler manages references to the EM objects in the topology code when the
- scheduling domains are built, or re-built. For each root domain (rd), the
- scheduler maintains a singly linked list of all performance domains intersecting
- the current rd->span. Each node in the list contains a pointer to a struct
- em_perf_domain as provided by the EM framework.
- The lists are attached to the root domains in order to cope with exclusive
- cpuset configurations. Since the boundaries of exclusive cpusets do not
- necessarily match those of performance domains, the lists of different root
- domains can contain duplicate elements.
- Example 1.
- Let us consider a platform with 12 CPUs, split in 3 performance domains
- (pd0, pd4 and pd8), organized as follows::
- CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
- PDs: |--pd0--|--pd4--|---pd8---|
- RDs: |----rd1----|-----rd2-----|
- Now, consider that userspace decided to split the system with two
- exclusive cpusets, hence creating two independent root domains, each
- containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
- above figure. Since pd4 intersects with both rd1 and rd2, it will be
- present in the linked list '->pd' attached to each of them:
- * rd1->pd: pd0 -> pd4
- * rd2->pd: pd4 -> pd8
- Please note that the scheduler will create two duplicate list nodes for
- pd4 (one for each list). However, both just hold a pointer to the same
- shared data structure of the EM framework.
- Since the access to these lists can happen concurrently with hotplug and other
- things, they are protected by RCU, like the rest of topology structures
- manipulated by the scheduler.
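- For reference, the list nodes attached to a root domain are thin wrappers
- around the shared EM objects. A simplified sketch of such a node and of an
- RCU-protected walk of the list (close to, but not exactly, the scheduler
- code) could look like::
-         /* One node of the rd->pd list; the em_perf_domain itself is shared. */
-         struct perf_domain {
-                 struct em_perf_domain *em_pd;   /* EM data for this domain  */
-                 struct perf_domain *next;       /* next domain in this rd   */
-                 struct rcu_head rcu;            /* deferred free on rebuild */
-         };
-         /* Walking the list is done under RCU, e.g. in the wake-up path: */
-         rcu_read_lock();
-         for (pd = rcu_dereference(rd->pd); pd; pd = pd->next) {
-                 /* use pd->em_pd to estimate the energy of this domain */
-         }
-         rcu_read_unlock();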
- EAS also maintains a static key (sched_energy_present) which is enabled when at
- least one root domain meets all conditions for EAS to start. Those conditions
- are summarized in Section 6.
- 4. Energy-Aware task placement
- ------------------------------
- EAS overrides the CFS task wake-up balancing code. It uses the EM of the
- platform and the PELT signals to choose an energy-efficient target CPU during
- wake-up balance. When EAS is enabled, select_task_rq_fair() calls
- find_energy_efficient_cpu() to do the placement decision. This function looks
- for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
- each performance domain since it is the one which will allow us to keep the
- frequency the lowest. Then, the function checks if placing the task there could
- save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
- in its previous activation.
- find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
- energy consumed by the system if the waking task was migrated. compute_energy()
- looks at the current utilization landscape of the CPUs and adjusts it to
- 'simulate' the task migration. The EM framework provides the em_pd_energy() API
- which computes the expected energy consumption of each performance domain for
- the given utilization landscape.
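- In heavily simplified pseudo-code, the placement logic described above boils
- down to the following, where the helper names are illustrative rather than the
- exact kernel API::
-         static int energy_aware_wakeup(struct task_struct *p, int prev_cpu)
-         {
-                 unsigned long energy, best_energy = ULONG_MAX;
-                 int cpu, best_cpu = prev_cpu;
-                 struct perf_domain *pd;
-                 /* Baseline: estimated energy if the task stays on prev_cpu. */
-                 unsigned long prev_energy = estimate_total_energy(p, prev_cpu);
-                 for_each_perf_domain(pd) {
-                         /* Candidate: CPU with the most spare capacity here. */
-                         cpu = max_spare_capacity_cpu(pd, p);
-                         energy = estimate_total_energy(p, cpu);
-                         if (energy < best_energy) {
-                                 best_energy = energy;
-                                 best_cpu = cpu;
-                         }
-                 }
-                 /* Migrate only if that actually saves energy. */
-                 return best_energy < prev_energy ? best_cpu : prev_cpu;
-         }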
- An example of energy-optimized task placement decision is detailed below.
- Example 2.
- Let us consider a (fake) platform with 2 independent performance domains
- composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
- are big.
- The scheduler must decide where to place a task P whose util_avg = 200
- and prev_cpu = 0.
- The current utilization landscape of the CPUs is depicted on the graph
- below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
- Each performance domain has three Operating Performance Points (OPPs).
- The CPU capacity and power cost associated with each OPP is listed in
- the Energy Model table. The util_avg of P is shown on the figures
- below as 'PP'::
- CPU util.
- 1024 - - - - - - - Energy Model
- +-----------+-------------+
- | Little | Big |
- 768 ============= +-----+-----+------+------+
- | Cap | Pwr | Cap | Pwr |
- +-----+-----+------+------+
- 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
- ## ## | 341 | 150 | 768 | 800 |
- 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
- PP ## ## +-----+-----+------+------+
- 170 -## - - - - ## ##
- ## ## ## ##
- ------------ -------------
- CPU0 CPU1 CPU2 CPU3
- Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
- find_energy_efficient_cpu() will first look for the CPUs with the
- maximum spare capacity in the two performance domains. In this example,
- CPU1 and CPU3. Then it will estimate the energy of the system if P was
- placed on either of them, and check if that would save some energy
- compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
- (which is coherent with the behaviour of the schedutil CPUFreq
- governor, see Section 6. for more details on this topic).
- **Case 1. P is migrated to CPU1**::
- 1024 - - - - - - -
- Energy calculation:
- 768 ============= * CPU0: 200 / 341 * 150 = 88
- * CPU1: 300 / 341 * 150 = 131
- * CPU2: 600 / 768 * 800 = 625
- 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
- ## ## => total_energy = 1364
- 341 =========== ## ##
- PP ## ##
- 170 -## - - PP- ## ##
- ## ## ## ##
- ------------ -------------
- CPU0 CPU1 CPU2 CPU3
- **Case 2. P is migrated to CPU3**::
- 1024 - - - - - - -
- Energy calculation:
- 768 ============= * CPU0: 200 / 341 * 150 = 88
- * CPU1: 100 / 341 * 150 = 43
- PP * CPU2: 600 / 768 * 800 = 625
- 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
- ## ## => total_energy = 1485
- 341 =========== ## ##
- ## ##
- 170 -## - - - - ## ##
- ## ## ## ##
- ------------ -------------
- CPU0 CPU1 CPU2 CPU3
- **Case 3. P stays on prev_cpu / CPU 0**::
- 1024 - - - - - - -
- Energy calculation:
- 768 ============= * CPU0: 400 / 512 * 300 = 234
- * CPU1: 100 / 512 * 300 = 58
- * CPU2: 600 / 768 * 800 = 625
- 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
- ## ## => total_energy = 1437
- 341 -PP - - - - ## ##
- PP ## ##
- 170 -## - - - - ## ##
- ## ## ## ##
- ------------ -------------
- CPU0 CPU1 CPU2 CPU3
- From these calculations, Case 1 has the lowest total energy. So CPU 1
- is the best candidate from an energy-efficiency standpoint.
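- The per-CPU figures above follow the simple estimate 'energy = util / cap * pwr'
- at the OPP selected for each performance domain. As a purely illustrative check
- of the Case 1 arithmetic (the figures above round each term, so the totals can
- differ by a unit or two)::
-         #include <stdio.h>
-         /* Energy estimate used in the example: util / cap_at_OPP * pwr_at_OPP. */
-         static double cpu_energy(double util, double cap, double pwr)
-         {
-                 return util / cap * pwr;
-         }
-         int main(void)
-         {
-                 /* Case 1: P (util = 200) moves to CPU1; the littles then run
-                  * at the 341-capacity OPP, the bigs at the 768-capacity OPP. */
-                 double total = cpu_energy(200, 341, 150)    /* CPU0:  ~88 */
-                              + cpu_energy(300, 341, 150)    /* CPU1: ~132 */
-                              + cpu_energy(600, 768, 800)    /* CPU2:  625 */
-                              + cpu_energy(500, 768, 800);   /* CPU3: ~521 */
-                 printf("Case 1 total energy: %.0f\n", total);   /* ~1366 */
-                 return 0;
-         }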
- Big CPUs are generally more power hungry than the little ones and are thus used
- mainly when a task doesn't fit the littles. However, little CPUs aren't always
- necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
- of the little CPUs can be less energy-efficient than the lowest OPPs of the
- bigs, for example. So, if the little CPUs happen to have enough utilization at
- a specific point in time, a small task waking up at that moment could be better
- off executing on the big side in order to save energy, even though it would fit
- on the little side.
- And even in the case where all OPPs of the big CPUs are less energy-efficient
- than those of the little, using the big CPUs for a small task might still, under
- specific conditions, save energy. Indeed, placing a task on a little CPU can
- result in raising the OPP of the entire performance domain, and that will
- increase the cost of the tasks already running there. If the waking task is
- placed on a big CPU, its own execution cost might be higher than if it was
- running on a little, but it won't impact the other tasks of the little CPUs
- which will keep running at a lower OPP. So, when considering the total energy
- consumed by CPUs, the extra cost of running that one task on a big core can be
- smaller than the cost of raising the OPP on the little CPUs for all the other
- tasks.
- The examples above would be nearly impossible to get right in a generic way, and
- for all platforms, without knowing the cost of running at different OPPs on all
- CPUs of the system. Thanks to its EM-based design, EAS should cope with them
- correctly without too much trouble. However, in order to ensure a minimal
- impact on throughput for high-utilization scenarios, EAS also implements another
- mechanism called 'over-utilization'.
- 5. Over-utilization
- -------------------
- From a general standpoint, the use-cases where EAS can help the most are those
- involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
- being run, they will require all of the available CPU capacity, and there isn't
- much that can be done by the scheduler to save energy without severely harming
- throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
- 'over-utilized' as soon as they are used at more than 80% of their compute
- capacity. As long as no CPUs are over-utilized in a root domain, load balancing
- is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
- the most energy efficient CPUs of the system more than the others if that can be
- done without harming throughput. So, the load-balancer is disabled to prevent
- it from breaking the energy-efficient task placement found by EAS. It is safe to
- do so when the system isn't overutilized since being below the 80% tipping point
- implies that:
- a. there is some idle time on all CPUs, so the utilization signals used by
- EAS are likely to accurately represent the 'size' of the various tasks
- in the system;
- b. all tasks should already be provided with enough CPU capacity,
- regardless of their nice values;
- c. since there is spare capacity all tasks must be blocking/sleeping
- regularly and balancing at wake-up is sufficient.
- As soon as one CPU goes above the 80% tipping point, at least one of the three
- assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
- is raised for the entire root domain, EAS is disabled, and the load-balancer is
- re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
- wake-up and load balance under CPU-bound conditions, which better respects the
- nice values of tasks.
- Since the notion of overutilization largely relies on detecting whether or not
- there is some idle time in the system, the CPU capacity 'stolen' by higher
- (than CFS) scheduling classes (as well as IRQ) must be taken into account. As
- such, the detection of overutilization accounts for the capacity used not only
- by CFS tasks, but also by the other scheduling classes and IRQ.
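- Put together, the over-utilization test for a CPU conceptually compares its CFS
- utilization against the capacity left once the RT, DL and IRQ pressure has been
- subtracted, using the ~80% tipping point mentioned above. A simplified,
- illustrative sketch (not the exact kernel code)::
-         /* Capacity values and utilizations are all in the 0..1024 range. */
-         static bool cpu_overutilized_sketch(unsigned long cfs_util,
-                                             unsigned long arch_capacity,
-                                             unsigned long rt_dl_irq_util)
-         {
-                 unsigned long available = arch_capacity - rt_dl_irq_util;
-                 /* Over-utilized when cfs_util exceeds ~80% of what is left. */
-                 return cfs_util * 1280 >= available * 1024;
-         }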
- 6. Dependencies and requirements for EAS
- ----------------------------------------
- Energy Aware Scheduling depends on the CPUs of the system having specific
- hardware properties and on other features of the kernel being enabled. This
- section lists these dependencies and provides hints as to how they can be met.
- 6.1 - Asymmetric CPU topology
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- As mentioned in the introduction, EAS is only supported on platforms with
- asymmetric CPU topologies for now. This requirement is checked at run-time by
- looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
- domains are built.
- See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
- flag to be set in the sched_domain hierarchy.
- Please note that EAS is not fundamentally incompatible with SMP, but no
- significant savings on SMP platforms have been observed yet. This restriction
- could be amended in the future if proven otherwise.
- 6.2 - Energy Model presence
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^
- EAS uses the EM of a platform to estimate the impact of scheduling decisions on
- energy. So, your platform must provide power cost tables to the EM framework in
- order to make EAS start. To do so, please refer to documentation of the
- independent EM framework in Documentation/power/energy-model.rst.
- Please also note that the scheduling domains need to be re-built after the
- EM has been registered in order to start EAS.
- EAS uses the EM to forecast energy usage, and is therefore mostly interested in
- the energy difference between the possible task placement options rather than in
- absolute values. For EAS it doesn't matter whether the EM power values are expressed
- in milli-Watts or in an 'abstract scale'.
- 6.3 - Energy Model complexity
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- The task wake-up path is very latency-sensitive. When the EM of a platform is
- too complex (too many CPUs, too many performance domains, too many performance
- states, ...), the cost of using it in the wake-up path can become prohibitive.
- The energy-aware wake-up algorithm has a complexity of::
-         C = Nd * (Nc + Ns)
- with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
- total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
- A complexity check is performed at the root domain level, when scheduling
- domains are built. EAS will not start on a root domain if its C happens to be
- higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
- time of writing).
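- For example, a hypothetical platform with 4 performance domains, 16 CPUs and
- 5 OPPs per domain (hence Ns = 20) would give::
-         C = 4 * (16 + 20) = 144
- which is well below the 2048 threshold.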
- If you really want to use EAS but the complexity of your platform's Energy
- Model is too high to be used with a single root domain, you're left with only
- two possible options:
- 1. split your system into separate, smaller, root domains using exclusive
- cpusets and enable EAS locally on each of them. This option has the
- benefit to work out of the box but the drawback of preventing load
- balance between root domains, which can result in an unbalanced system
- overall;
- 2. submit patches to reduce the complexity of the EAS wake-up algorithm,
- hence enabling it to cope with larger EMs in reasonable time.
- 6.4 - Schedutil governor
- ^^^^^^^^^^^^^^^^^^^^^^^^
- EAS tries to predict which OPP the CPUs will be running at in the near future
- in order to estimate their energy consumption. To do so, it is assumed that OPPs
- of CPUs follow their utilization.
- Although it is very difficult to provide hard guarantees regarding the accuracy
- of this assumption in practice (because the hardware might not do what it is
- told to do, for example), schedutil as opposed to other CPUFreq governors at
- least _requests_ frequencies calculated using the utilization signals.
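- For context, schedutil's frequency requests are (roughly) proportional to
- utilization, with some headroom. Conceptually, and simplified from the governor
- code, the mapping looks like::
-         /*
-          * Scale the maximum frequency by the utilization ratio, plus ~25%
-          * headroom so the CPU is not driven at the edge of the requested
-          * capacity. 'util' and 'max_util' come from the PELT signals.
-          */
-         static unsigned long next_freq(unsigned long util, unsigned long max_util,
-                                        unsigned long max_freq)
-         {
-                 return (max_freq + (max_freq >> 2)) * util / max_util;
-         }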
- Consequently, the only sane governor to use together with EAS is schedutil,
- because it is the only one providing some degree of consistency between
- frequency requests and energy predictions.
- Using EAS with any other governor than schedutil is not recommended.
- 6.5 Scale-invariant utilization signals
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- In order to make accurate predictions across CPUs and for all performance
- states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
- be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
- callbacks.
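- As an illustration of what scale invariance means here, the time a task spends
- running only contributes in full to its utilization when the CPU runs at its
- highest frequency and is the most capable CPU of the system. A conceptual
- sketch of how the two callbacks are combined (simplified from the PELT code)::
-         /* Both callbacks return values normalized to 1024. */
-         static unsigned long scale_running_delta(unsigned long delta, int cpu)
-         {
-                 unsigned long scale_freq = arch_scale_freq_capacity(cpu);
-                 unsigned long scale_cpu  = arch_scale_cpu_capacity(cpu);
-                 delta = delta * scale_freq / 1024;      /* frequency invariance */
-                 delta = delta * scale_cpu  / 1024;      /* CPU invariance */
-                 return delta;
-         }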
- Using EAS on a platform that doesn't implement these two callbacks is not
- supported.
- 6.6 Multithreading (SMT)
- ^^^^^^^^^^^^^^^^^^^^^^^^
- EAS in its current form is SMT unaware and is not able to leverage
- multithreaded hardware to save energy. EAS considers threads as independent
- CPUs, which can actually be counter-productive for both performance and energy.
- EAS on SMT is not supported.