=======================
Energy Aware Scheduling
=======================

1. Introduction
---------------

Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
the impact of its decisions on the energy consumed by CPUs. EAS relies on an
Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document aims at providing an
introduction on how EAS works, what are the main design decisions behind it, and
details what is needed to get it to run.

Before going any further, please note that at the time of writing::

   /!\ EAS does not support platforms with symmetric CPU topologies /!\

EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
because this is where the potential for saving energy through scheduling is
the highest.

The actual EM used by EAS is _not_ maintained by the scheduler, but by a
dedicated framework. For details about this framework and what it provides,
please refer to its documentation (see Documentation/power/energy-model.rst).

2. Background and Terminology
-----------------------------

To make it clear from the start:

 - energy = [joule] (resource like a battery on powered devices)
 - power = energy/time = [joule/second] = [watt]

The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::

    performance [inst/s]
    --------------------
        power [W]

which is equivalent to minimizing::

     energy [J]
    -----------
    instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance.

The idea behind introducing an EM is to allow the scheduler to evaluate the
implications of its decisions rather than blindly applying energy-saving
techniques that may have positive effects only on some platforms. At the same
time, the EM must be as simple as possible to minimize the scheduler latency
impact.

In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
for the scheduler to decide where a task should run (during wake-up), the EM
is used to break the tie between several good CPU candidates and pick the one
that is predicted to yield the best energy consumption without harming the
system's throughput. The predictions made by EAS rely on specific elements of
knowledge about the platform's topology, which include the 'capacity' of CPUs,
and their respective energy costs.

3. Topology information
-----------------------

EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
differentiate CPUs with different computing throughput. The 'capacity' of a CPU
represents the amount of work it can absorb when running at its highest
frequency compared to the most capable CPU of the system. Capacity values are
normalized in a 1024 range, and are comparable with the utilization signals of
tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
to capacity and utilization values, EAS is able to estimate how big/busy a
task/CPU is, and to take this into consideration when evaluating performance vs
energy trade-offs. The capacity of CPUs is provided via arch-specific code
through the arch_scale_cpu_capacity() callback.

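As a rough illustration of how these two quantities relate, consider the short,
self-contained sketch below (plain C, not kernel code: the capacity values and
the task_fits() helper are hypothetical). Because capacities and utilization
share the same 1024 scale, they can be compared directly::

  #include <stdbool.h>
  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024  /* capacity of the biggest CPU at max frequency */

  /* Hypothetical capacities for a big.LITTLE system: 2 littles, 2 bigs. */
  static const unsigned long cpu_capacity[4] = { 430, 430, 1024, 1024 };

  /* A task 'fits' a CPU if its utilization does not exceed the CPU's capacity. */
  static bool task_fits(unsigned long task_util, int cpu)
  {
          return task_util <= cpu_capacity[cpu];
  }

  int main(void)
  {
          unsigned long util = 600;  /* a fairly big task */

          for (int cpu = 0; cpu < 4; cpu++)
                  printf("CPU%d (capacity %4lu): task util %lu %s\n",
                         cpu, cpu_capacity[cpu], util,
                         task_fits(util, cpu) ? "fits" : "does not fit");
          return 0;
  }
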
The rest of platform knowledge used by EAS is directly read from the Energy
Model (EM) framework. The EM of a platform is composed of a power cost table
per 'performance domain' in the system (see Documentation/power/energy-model.rst
for further details about performance domains).

The scheduler manages references to the EM objects in the topology code when the
scheduling domains are built, or re-built. For each root domain (rd), the
scheduler maintains a singly linked list of all performance domains intersecting
the current rd->span. Each node in the list contains a pointer to a struct
em_perf_domain as provided by the EM framework.

The lists are attached to the root domains in order to cope with exclusive
cpuset configurations. Since the boundaries of exclusive cpusets do not
necessarily match those of performance domains, the lists of different root
domains can contain duplicate elements.

Example 1.
  Let us consider a platform with 12 CPUs, split in 3 performance domains
  (pd0, pd4 and pd8), organized as follows::

        CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
        PDs:   |--pd0--|--pd4--|---pd8---|
        RDs:   |----rd1----|-----rd2-----|

  Now, consider that userspace decided to split the system with two
  exclusive cpusets, hence creating two independent root domains, each
  containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
  above figure. Since pd4 intersects with both rd1 and rd2, it will be
  present in the linked list '->pd' attached to each of them:

    * rd1->pd: pd0 -> pd4
    * rd2->pd: pd4 -> pd8

  Please note that the scheduler will create two duplicate list nodes for
  pd4 (one for each list). However, both just hold a pointer to the same
  shared data structure of the EM framework.

Since the access to these lists can happen concurrently with hotplug and other
things, they are protected by RCU, like the rest of topology structures
manipulated by the scheduler.

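Schematically, the per-root-domain list and its RCU-protected traversal look
like the sketch below (simplified declarations for illustration only; the
actual kernel structures differ in detail)::

  /* Simplified stand-ins, not the scheduler's exact definitions. */
  struct perf_domain {
          struct em_perf_domain *em_pd;  /* shared EM data, owned by the EM framework */
          struct perf_domain *next;      /* next PD intersecting this root domain */
          struct rcu_head rcu;
  };

  /* Walk the performance domains of a root domain, e.g. in the wake-up path. */
  static void walk_perf_domains(struct root_domain *rd)
  {
          struct perf_domain *pd;

          rcu_read_lock();               /* the list may be rebuilt concurrently */
          for (pd = rcu_dereference(rd->pd); pd; pd = pd->next) {
                  /* pd->em_pd describes the power cost table of this domain */
          }
          rcu_read_unlock();
  }
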
EAS also maintains a static key (sched_energy_present) which is enabled when at
least one root domain meets all conditions for EAS to start. Those conditions
are summarized in Section 6.

4. Energy-Aware task placement
------------------------------

EAS overrides the CFS task wake-up balancing code. It uses the EM of the
platform and the PELT signals to choose an energy-efficient target CPU during
wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to do the placement decision. This function looks
for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
each performance domain since it is the one which will allow us to keep the
frequency the lowest. Then, the function checks if placing the task there could
save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
in its previous activation.

find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
energy consumed by the system if the waking task was migrated. compute_energy()
looks at the current utilization landscape of the CPUs and adjusts it to
'simulate' the task migration. The EM framework provides the em_pd_energy() API
which computes the expected energy consumption of each performance domain for
the given utilization landscape.

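In pseudo-C, the decision flow described above boils down to something like the
sketch below. It is heavily simplified for illustration: for_each_perf_domain()
and cpu_with_max_spare_capacity() are made-up helpers, compute_energy() is
shown with a reduced signature, and the real code also deals with capacity
margins, over-utilization and locking::

  static int find_energy_efficient_cpu_sketch(struct task_struct *p, int prev_cpu)
  {
          unsigned long prev_energy, best_energy, energy;
          int best_cpu = prev_cpu;
          struct perf_domain *pd;

          /* Energy estimate if the task simply stays on prev_cpu. */
          prev_energy = best_energy = compute_energy(p, prev_cpu);

          for_each_perf_domain(pd) {
                  /* Most spare capacity == lowest resulting frequency. */
                  int cpu = cpu_with_max_spare_capacity(pd, p);

                  energy = compute_energy(p, cpu);
                  if (energy < best_energy) {
                          best_energy = energy;
                          best_cpu = cpu;
                  }
          }

          /* Only migrate if this is actually predicted to save energy. */
          return best_energy < prev_energy ? best_cpu : prev_cpu;
  }
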
An example of an energy-optimized task placement decision is detailed below.

Example 2.
  Let us consider a (fake) platform with 2 independent performance domains
  composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
  are big.

  The scheduler must decide where to place a task P whose util_avg = 200
  and prev_cpu = 0.

  The current utilization landscape of the CPUs is depicted on the graph
  below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
  Each performance domain has three Operating Performance Points (OPPs).
  The CPU capacity and power cost associated with each OPP is listed in
  the Energy Model table. The util_avg of P is shown on the figures
  below as 'PP'::

   CPU util.
   1024                 - - - - - - -              Energy Model
                                                   +-----------+-------------+
                                                   |  Little   |     Big     |
    768                 =============              +-----+-----+------+------+
                                                   | Cap | Pwr | Cap  | Pwr  |
                                                   +-----+-----+------+------+
    512  ===========    - ##- - - - -              | 170 |  50 | 512  | 400  |
                          ##     ##                | 341 | 150 | 768  | 800  |
    341  -PP - - - -      ##     ##                | 512 | 300 | 1024 | 1700 |
          PP              ##     ##                +-----+-----+------+------+
    170  -## - - - -      ##     ##
          ##     ##       ##     ##
        ------------    -------------
         CPU0   CPU1     CPU2   CPU3

   Current OPP: =====       Other OPP: - - -      util_avg (100 each): ##

  find_energy_efficient_cpu() will first look for the CPUs with the
  maximum spare capacity in the two performance domains. In this example,
  CPU1 and CPU3. Then it will estimate the energy of the system if P was
  placed on either of them, and check if that would save some energy
  compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
  (which is coherent with the behaviour of the schedutil CPUFreq
  governor, see Section 6. for more details on this topic).

  **Case 1. P is migrated to CPU1**::

   1024                 - - - - - - -

                                                Energy calculation:
    768                 =============            * CPU0: 200 / 341 * 150 = 88
                                                  * CPU1: 300 / 341 * 150 = 131
                                                  * CPU2: 600 / 768 * 800 = 625
    512  - - - - - -    - ##- - - - -             * CPU3: 500 / 768 * 800 = 520
                          ##     ##                  => total_energy = 1364
    341  ===========      ##     ##
                 PP       ##     ##
    170  -## - - PP-      ##     ##
          ##     ##       ##     ##
        ------------    -------------
         CPU0   CPU1     CPU2   CPU3

  **Case 2. P is migrated to CPU3**::

   1024                 - - - - - - -

                                                Energy calculation:
    768                 =============            * CPU0: 200 / 341 * 150 = 88
                                                  * CPU1: 100 / 341 * 150 = 43
                                 PP               * CPU2: 600 / 768 * 800 = 625
    512  - - - - - -    - ##- - -PP -             * CPU3: 700 / 768 * 800 = 729
                          ##     ##                  => total_energy = 1485
    341  ===========      ##     ##
                          ##     ##
    170  -## - - - -      ##     ##
          ##     ##       ##     ##
        ------------    -------------
         CPU0   CPU1     CPU2   CPU3

  **Case 3. P stays on prev_cpu / CPU 0**::

   1024                 - - - - - - -

                                                Energy calculation:
    768                 =============            * CPU0: 400 / 512 * 300 = 234
                                                  * CPU1: 100 / 512 * 300 = 58
                                                  * CPU2: 600 / 768 * 800 = 625
    512  ===========    - ##- - - - -             * CPU3: 500 / 768 * 800 = 520
                          ##     ##                  => total_energy = 1437
    341  -PP - - - -      ##     ##
          PP              ##     ##
    170  -## - - - -      ##     ##
          ##     ##       ##     ##
        ------------    -------------
         CPU0   CPU1     CPU2   CPU3

  From these calculations, Case 1 has the lowest total energy. So CPU 1
  is the best candidate from an energy-efficiency standpoint.

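The arithmetic behind these three estimates can be reproduced with the short,
self-contained program below. It applies the rule used throughout the example:
each performance domain runs at the lowest OPP whose capacity covers its
busiest CPU, and each CPU is charged util / capacity * power at that OPP. The
totals it prints differ from the figures above by a few units because of
rounding, but the ranking of the three candidates is the same::

  #include <stdio.h>

  /* One OPP: the capacity it provides and its power cost (from the EM table). */
  struct opp { double cap, pwr; };

  static const struct opp little_opps[] = { {170, 50}, {341, 150}, {512, 300} };
  static const struct opp big_opps[]    = { {512, 400}, {768, 800}, {1024, 1700} };

  /* Pick the lowest OPP whose capacity covers the busiest CPU of the domain. */
  static struct opp select_opp(const struct opp *opps, int n, double max_util)
  {
          for (int i = 0; i < n; i++)
                  if (opps[i].cap >= max_util)
                          return opps[i];
          return opps[n - 1];
  }

  /* Sum util / cap * power over the two CPUs of one performance domain. */
  static double pd_energy(const struct opp *opps, int n, double u0, double u1)
  {
          struct opp o = select_opp(opps, n, u0 > u1 ? u0 : u1);

          return u0 / o.cap * o.pwr + u1 / o.cap * o.pwr;
  }

  int main(void)
  {
          /* util of CPU0..CPU3 for Case 1, Case 2 and Case 3 respectively */
          double cases[3][4] = {
                  {200, 300, 600, 500},  /* P (200) moved to CPU1 */
                  {200, 100, 600, 700},  /* P moved to CPU3 */
                  {400, 100, 600, 500},  /* P stays on CPU0 */
          };

          for (int c = 0; c < 3; c++) {
                  double e = pd_energy(little_opps, 3, cases[c][0], cases[c][1]) +
                             pd_energy(big_opps, 3, cases[c][2], cases[c][3]);

                  printf("Case %d: total_energy ~ %.0f\n", c + 1, e);
          }
          return 0;
  }
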
Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
off executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
than those of the little, using the big CPUs for a small task might still, under
specific conditions, save energy. Indeed, placing a task on a little CPU can
result in raising the OPP of the entire performance domain, and that will
increase the cost of the tasks already running there. If the waking task is
placed on a big CPU, its own execution cost might be higher than if it was
running on a little, but it won't impact the other tasks of the little CPUs
which will keep running at a lower OPP. So, when considering the total energy
consumed by CPUs, the extra cost of running that one task on a big core can be
smaller than the cost of raising the OPP on the little CPUs for all the other
tasks.

The examples above would be nearly impossible to get right in a generic way, and
for all platforms, without knowing the cost of running at different OPPs on all
CPUs of the system. Thanks to its EM-based design, EAS should cope with them
correctly without too many troubles. However, in order to ensure a minimal
impact on throughput for high-utilization scenarios, EAS also implements another
mechanism called 'over-utilization'.

5. Over-utilization
-------------------

From a general standpoint, the use-cases where EAS can help the most are those
involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
do so when the system isn't overutilized since being below the 80% tipping point
implies that:

    a. there is some idle time on all CPUs, so the utilization signals used by
       EAS are likely to accurately represent the 'size' of the various tasks
       in the system;
    b. all tasks should already be provided with enough CPU capacity,
       regardless of their nice values;
    c. since there is spare capacity all tasks must be blocking/sleeping
       regularly and balancing at wake-up is sufficient.

As soon as one CPU goes above the 80% tipping point, at least one of the three
assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
is raised for the entire root domain, EAS is disabled, and the load-balancer is
re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This provides a better
respect of the nice values of tasks.

Since the notion of overutilization largely relies on detecting whether or not
there is some idle time in the system, the CPU capacity 'stolen' by higher
(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
such, the detection of overutilization accounts for the capacity used not only
by CFS tasks, but also by the other scheduling classes and IRQ.

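The tipping point test can be pictured with the toy check below. This is
illustrative only: the 80% margin is written as a plain comparison, and the
per-CPU utilization sums are hypothetical inputs rather than the scheduler's
actual signals::

  #include <stdbool.h>

  /*
   * Toy version of the over-utilization test: a CPU is considered
   * over-utilized when the sum of its CFS utilization and of the capacity
   * stolen by higher classes and IRQ exceeds 80% of its capacity.
   */
  static bool cpu_overutilized_sketch(unsigned long cfs_util,
                                      unsigned long rt_dl_irq_util,
                                      unsigned long capacity)
  {
          unsigned long total = cfs_util + rt_dl_irq_util;

          /* total > 80% of capacity  <=>  total * 5 > capacity * 4 */
          return total * 5 > capacity * 4;
  }

  /* The root domain is flagged as soon as any of its CPUs trips the check. */
  static bool rd_overutilized_sketch(const unsigned long cfs[],
                                     const unsigned long other[],
                                     const unsigned long cap[], int nr_cpus)
  {
          for (int i = 0; i < nr_cpus; i++)
                  if (cpu_overutilized_sketch(cfs[i], other[i], cap[i]))
                          return true;
          return false;
  }
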
6. Dependencies and requirements for EAS
----------------------------------------

Energy Aware Scheduling depends on the CPUs of the system having specific
hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.

6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
flag to be set in the sched_domain hierarchy.

Please note that EAS is not fundamentally incompatible with SMP, but no
significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.

6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
order to make EAS start. To do so, please refer to the documentation of the
independent EM framework in Documentation/power/energy-model.rst.

Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.

EAS uses the EM to make a forecasting decision on energy usage and thus it is
more focused on the difference when checking possible options for task
placement. For EAS it doesn't matter whether the EM power values are expressed
in milli-Watts or in an 'abstract scale'.

6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
states, ...), the cost of using it in the wake-up path can become prohibitive.
The energy-aware wake-up algorithm has a complexity of::

    C = Nd * (Nc + Ns)

with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).

A complexity check is performed at the root domain level, when scheduling
domains are built. EAS will not start on a root domain if its C happens to be
higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
time of writing).

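As a worked example, a hypothetical 16-CPU system split in 4 performance
domains with 6 OPPs each gives C = 4 * (16 + 24) = 160, which is well below the
threshold. The trivial program below just spells that arithmetic out::

  #include <stdio.h>

  int main(void)
  {
          int nd = 4;             /* performance domains (hypothetical platform) */
          int nc = 16;            /* CPUs */
          int ns = 4 * 6;         /* total OPPs: 6 per domain */
          int em_max_complexity = 2048;

          int c = nd * (nc + ns); /* C = Nd * (Nc + Ns) */

          printf("C = %d -> EAS %s\n", c,
                 c <= em_max_complexity ? "can be enabled" : "will not start");
          return 0;
  }
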
If you really want to use EAS but the complexity of your platform's Energy
Model is too high to be used with a single root domain, you're left with only
two possible options:

  1. split your system into separate, smaller, root domains using exclusive
     cpusets and enable EAS locally on each of them. This option has the
     benefit to work out of the box but the drawback of preventing load
     balance between root domains, which can result in an unbalanced system
     overall;
  2. submit patches to reduce the complexity of the EAS wake-up algorithm,
     hence enabling it to cope with larger EMs in reasonable time.

6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^

EAS tries to predict at which OPP the CPUs will be running in the near future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
of CPUs follow their utilization.

Although it is very difficult to provide hard guarantees regarding the accuracy
of this assumption in practice (because the hardware might not do what it is
told to do, for example), schedutil as opposed to other CPUFreq governors at
least _requests_ frequencies calculated using the utilization signals.
Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.

Using EAS with any other governor than schedutil is not recommended.

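The coupling works because schedutil derives its frequency requests from the
same utilization signals. A simplified version of that mapping, in the spirit
of the governor (with a ~25% headroom factor; not the exact kernel code), could
look like::

  /*
   * Simplified sketch of a utilization -> frequency mapping: request enough
   * frequency to serve 'util' with some headroom, where 'max' is the
   * capacity of the CPU at 'max_freq'.
   */
  static unsigned long util_to_freq(unsigned long util, unsigned long max,
                                    unsigned long max_freq)
  {
          unsigned long freq = max_freq + (max_freq >> 2);  /* 1.25 * max_freq */

          return freq * util / max;
  }
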
6.5 - Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to make accurate prediction across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
callbacks.

Using EAS on a platform that doesn't implement these two callbacks is not
supported.

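Schematically, scale invariance means that the running time accumulated by PELT
is scaled by the current frequency and by the CPU's capacity, both normalized
to 1024, so that a given utilization value represents the same amount of work
regardless of where and at which frequency it was measured. A simplified
illustration (not the kernel's implementation)::

  #define SCHED_CAPACITY_SCALE 1024

  /*
   * Schematic only: scale a raw running-time delta so that utilization is
   * comparable across CPUs and frequencies. 'freq_scale' and 'cpu_scale'
   * play the role of arch_scale_freq_capacity() and
   * arch_scale_cpu_capacity(), both normalized to 1024.
   */
  static unsigned long scale_delta(unsigned long delta,
                                   unsigned long freq_scale,
                                   unsigned long cpu_scale)
  {
          delta = delta * freq_scale / SCHED_CAPACITY_SCALE;
          delta = delta * cpu_scale / SCHED_CAPACITY_SCALE;
          return delta;
  }
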
6.6 - Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent
CPUs, which can actually be counter-productive for both performance and energy.

EAS on SMT is not supported.