aba53204cec6dd7cc3e35f17c20ea255ddfa396b
937 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
a4739eca44 |
sched/numa: Stop multiple tasks from moving to the CPU at the same time
Task migration under NUMA balancing can happen in parallel. More than one task might choose to migrate to the same CPU at the same time. This can result in: - During task swap, choosing a task that was not part of the evaluation. - During task swap, task which just got moved into its preferred node, moving to a completely different node. - During task swap, task failing to move to the preferred node, will have to wait an extra interval for the next migrate opportunity. - During task movement, multiple task movements can cause load imbalance. This problem is more likely if there are more cores per node or more nodes in the system. Use a per run-queue variable to check if NUMA-balance is active on the run-queue. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200194 203353 1.57797 1 311331 328205 5.41995 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 197654 214384 8.46429 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 192605 188553 -2.10379 1 213402 196273 -8.02664 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 52227.1 57581.2 10.2516 1 102529 103468 0.915838 There is a regression on power 9 box. If we look at the details, that box has a sudden jump in cache-misses with this patch. All other parameters seem to be pointing towards NUMA consolidation. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,345,784 13,941,377 migrations 1,127,820 1,157,323 faults 374,736 382,175 cache-misses 55,132,054,603 54,993,823,500 sched:sched_move_numa 1,923 2,005 sched:sched_stick_numa 52 14 sched:sched_swap_numa 595 529 migrate:mm_migrate_pages 1,932 1,573 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 60605 67099 numa_hint_faults_local 51804 58456 numa_hit 239945 240416 numa_huge_pte_updates 14 18 numa_interleave 60 65 numa_local 239865 240339 numa_other 80 77 numa_pages_migrated 1931 1574 numa_pte_updates 67823 77182 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,016,467 3,176,453 migrations 37,326 30,238 faults 115,342 87,869 cache-misses 11,692,155,554 12,544,479,391 sched:sched_move_numa 965 23 sched:sched_stick_numa 8 0 sched:sched_swap_numa 35 6 migrate:mm_migrate_pages 1,168 10 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 16286 236 numa_hint_faults_local 11863 201 numa_hit 112482 72293 numa_huge_pte_updates 33 0 numa_interleave 20 26 numa_local 112419 72233 numa_other 63 60 numa_pages_migrated 1144 8 numa_pte_updates 32859 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,629,724 8,478,820 migrations 221,052 171,323 faults 308,661 307,499 cache-misses 135,574,913 240,353,599 sched:sched_move_numa 147 214 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 4 migrate:mm_migrate_pages 64 89 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11481 5301 numa_hint_faults_local 10968 4745 numa_hit 89773 92943 numa_huge_pte_updates 0 0 numa_interleave 1116 899 numa_local 89220 92345 numa_other 553 598 numa_pages_migrated 62 88 numa_pte_updates 11694 5505 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,272,887 2,066,172 migrations 12,206 11,076 faults 163,704 149,544 cache-misses 4,801,186 10,398,067 sched:sched_move_numa 44 43 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 17 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 2261 3552 numa_hint_faults_local 1993 3347 numa_hit 25726 25611 numa_huge_pte_updates 0 0 numa_interleave 239 213 numa_local 25498 25583 numa_other 228 28 numa_pages_migrated 17 6 numa_pte_updates 2266 3535 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 117,980,962 99,358,136 migrations 3,950,220 4,041,607 faults 736,979 749,653 cache-misses 224,976,072,879 225,562,543,251 sched:sched_move_numa 504 771 sched:sched_stick_numa 50 14 sched:sched_swap_numa 239 204 migrate:mm_migrate_pages 1,260 1,180 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 18293 27409 numa_hint_faults_local 11969 20677 numa_hit 240854 239988 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240851 239983 numa_other 3 5 numa_pages_migrated 1190 1016 numa_pte_updates 18106 27916 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 61,053,158 60,899,307 migrations 551,586 544,668 faults 244,174 270,834 cache-misses 74,326,766,973 74,543,455,635 sched:sched_move_numa 344 735 sched:sched_stick_numa 24 25 sched:sched_swap_numa 140 174 migrate:mm_migrate_pages 568 816 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 6461 11059 numa_hint_faults_local 2283 4733 numa_hit 35661 41384 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35661 41383 numa_other 0 1 numa_pages_migrated 568 815 numa_pte_updates 6518 11323 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
7477a3504e |
sched/numa: Remove unused numa_stats::nr_running field
nr_running in struct numa_stats is not used anywhere in the code. Remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
d90707ebeb |
sched/numa: Remove unused code from update_numa_stats()
With:
commit
|
||
![]() |
4ad3831a9d |
sched/fair: Don't move tasks to lower capacity CPUs unless necessary
When lower capacity CPUs are load balancing and considering to pull something from a higher capacity group, we should not pull tasks from a CPU with only one task running as this is guaranteed to impede progress for that task. If there is more than one task running, load balance in the higher capacity group would have already made any possible moves to resolve imbalance and we should make better use of system compute capacity by moving a task if we still have more than one running. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
757ffdd705 |
sched/fair: Set rq->rd->overload when misfit
Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performing than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows: a,b,c,d are our CPU-hogging tasks _ signifies idling LITTLE_0 | a a a a _ _ LITTLE_1 | b b b b _ _ ---------|------------- big_0 | c c c c a a big_1 | d d d d b b ^ ^ Tasks end on the big CPUs, idle balance happens and the misfit tasks are pulled straight away This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
e90c8fe15a |
sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong can happen because of the compiler. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
dbbad71944 |
sched/fair: Change 'prefer_sibling' type to bool
This variable is entirely local to update_sd_lb_stats, so we can safely change its type and slightly clean up its initialisation. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
5fbdfae522 |
sched/fair: Kick nohz balance if rq->misfit_task_load
There already are a few conditions in nohz_kick_needed() to ensure a nohz kick is triggered, but they are not enough for some misfit task scenarios. Excluding asym packing, those are: - rq->nr_running >=2: Not relevant here because we are running a misfit task, it needs to be migrated regardless and potentially through active balance. - sds->nr_busy_cpus > 1: If there is only the misfit task being run on a group of low capacity CPUs, this will be evaluated to False. - rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here, misfit task needs to be migrated regardless of rt/IRQ pressure As such, this commit adds an rq->misfit_task_load condition to trigger a nohz kick. The idea to kick a nohz balance for misfit tasks originally came from Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for the Android Common Kernel - see: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
cad68e552e |
sched/fair: Consider misfit tasks when load-balancing
On asymmetric CPU capacity systems load intensive tasks can end up on CPUs that don't suit their compute demand. In this scenarios 'misfit' tasks should be migrated to CPUs with higher compute capacity to ensure better throughput. group_misfit_task indicates this scenario, but tweaks to the load-balance code are needed to make the migrations happen. Misfit balancing only makes sense between a source group of lower per-CPU capacity and destination group of higher compute capacity. Otherwise, misfit balancing is ignored. group_misfit_task has lowest priority so any imbalance due to overload is dealt with first. The modifications are: 1. Only pick a group containing misfit tasks as the busiest group if the destination group has higher capacity and has spare capacity. 2. When the busiest group is a 'misfit' group, skip the usual average load and group capacity checks. 3. Set the imbalance for 'misfit' balancing sufficiently high for a task to be pulled ignoring average load. 4. Pick the CPU with the highest misfit load as the source CPU. 5. If the misfit task is alone on the source CPU, go for active balancing. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
e3d6d0cb66 |
sched/fair: Add sched_group per-CPU max capacity
The current sg->min_capacity tracks the lowest per-CPU compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for tracking if a sched_group needs offloading to another sched_group for some scenarios, e.g. a sched_group with multiple CPUs if only one is under heavy pressure. Tracking maximum capacity isn't perfect either but a better choice for some situations as it indicates that the sched_group definitely compute capacity constrained either due to rt/irq pressure on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
3b1baa6496 |
sched/fair: Add 'group_misfit_task' load-balance type
To maximize throughput in systems with asymmetric CPU capacities (e.g. ARM big.LITTLE) load-balancing has to consider task and CPU utilization as well as per-CPU compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks with high utilization that are scheduled on a lower capacity CPU need to be identified and migrated to a higher capacity CPU if possible to maximize throughput. To implement this additional policy an additional group_type (load-balance scenario) is added: 'group_misfit_task'. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-CPU capacity. 'group_misfit_task' is only considered if the system is not overloaded or imbalanced ('group_imbalanced' or 'group_overloaded'). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each CPU is responsible for tracking misfit tasks themselves and update the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
df054e8445 |
sched/topology: Add static_key for asymmetric CPU capacity optimizations
The existing asymmetric CPU capacity code should cause minimal overhead for others. Putting it behind a static_key, it has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
882a78a9f3 |
sched/fair: Fix kernel-doc notation warning
Fix kernel-doc warning for missing 'flags' parameter description:
../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
![]() |
bb3485c8ac |
sched/fair: Fix load_balance redo for !imbalance
It can happen that load_balance() finds a busiest group and then a busiest rq but the calculated imbalance is in fact 0. In such situation, detach_tasks() returns immediately and lets the flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to have pinned tasks and removed from the load balance mask. then, we redo a load balance without the busiest CPU. This creates wrong load balance situation and generates wrong task migration. If the calculated imbalance is 0, it's useless to try to find a busiest rq as no task will be migrated and we can return immediately. This situation can happen with heterogeneous system or smp system when RT tasks are decreasing the capacity of some CPUs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: jhugo@codeaurora.org Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
287cdaac57 |
sched/fair: Fix scale_rt_capacity() for SMT
Since commit: |
||
![]() |
d0cdb3ce88 |
sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.
For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.
Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.
Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().
Based on a similar patch from John Dias <joaodias@google.com>.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: John Dias <joaodias@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel de Dios <migueldedios@google.com>
Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: kernel-team@android.com
Fixes:
|
||
![]() |
12b04875d6 |
sched/pelt: Fix update_blocked_averages() for RT and DL classes
update_blocked_averages() is called to periodiccally decay the stalled load of idle CPUs and to sync all loads before running load balance. When cfs rq is idle, it trigs a load balance during pick_next_task_fair() in order to potentially pull tasks and to use this newly idle CPU. This load balance happens whereas prev task from another class has not been put and its utilization updated yet. This may lead to wrongly account running time as idle time for RT or DL classes. Test that no RT or DL task is running when updating their utilization in update_blocked_averages(). We still update RT and DL utilization instead of simply skipping them to make sure that all metrics are synced when used during load balance. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: |
||
![]() |
958f338e96 |
Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge L1 Terminal Fault fixes from Thomas Gleixner: "L1TF, aka L1 Terminal Fault, is yet another speculative hardware engineering trainwreck. It's a hardware vulnerability which allows unprivileged speculative access to data which is available in the Level 1 Data Cache when the page table entry controlling the virtual address, which is used for the access, has the Present bit cleared or other reserved bits set. If an instruction accesses a virtual address for which the relevant page table entry (PTE) has the Present bit cleared or other reserved bits set, then speculative execution ignores the invalid PTE and loads the referenced data if it is present in the Level 1 Data Cache, as if the page referenced by the address bits in the PTE was still present and accessible. While this is a purely speculative mechanism and the instruction will raise a page fault when it is retired eventually, the pure act of loading the data and making it available to other speculative instructions opens up the opportunity for side channel attacks to unprivileged malicious code, similar to the Meltdown attack. While Meltdown breaks the user space to kernel space protection, L1TF allows to attack any physical memory address in the system and the attack works across all protection domains. It allows an attack of SGX and also works from inside virtual machines because the speculation bypasses the extended page table (EPT) protection mechanism. The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646 The mitigations provided by this pull request include: - Host side protection by inverting the upper address bits of a non present page table entry so the entry points to uncacheable memory. - Hypervisor protection by flushing L1 Data Cache on VMENTER. - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT by offlining the sibling CPU threads. The knobs are available on the kernel command line and at runtime via sysfs - Control knobs for the hypervisor mitigation, related to L1D flush and SMT control. The knobs are available on the kernel command line and at runtime via sysfs - Extensive documentation about L1TF including various degrees of mitigations. Thanks to all people who have contributed to this in various ways - patches, review, testing, backporting - and the fruitful, sometimes heated, but at the end constructive discussions. There is work in progress to provide other forms of mitigations, which might be less horrible performance wise for a particular kind of workloads, but this is not yet ready for consumption due to their complexity and limitations" * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) x86/microcode: Allow late microcode loading with SMT disabled tools headers: Synchronise x86 cpufeatures.h for L1TF additions x86/mm/kmmio: Make the tracer robust against L1TF x86/mm/pat: Make set_memory_np() L1TF safe x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert x86/speculation/l1tf: Invert all not present mappings cpu/hotplug: Fix SMT supported evaluation KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry x86/speculation: Simplify sysfs report of VMX L1TF vulnerability Documentation/l1tf: Remove Yonah processors from not vulnerable list x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d x86: Don't include linux/irq.h from asm/hardirq.h x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() cpu/hotplug: detect SMT disabled by BIOS ... |
||
![]() |
f2701b77bb |
Merge 4.18-rc7 into master to pick up the KVM dependcy
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
![]() |
b6a60cf36d |
sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
numa_migrate_preferred() is called periodically or when task preferred node changes. Preferred node evaluations happen once per scan sequence. If the scan completion happens just after the periodic NUMA migration, then we try to migrate to the preferred node and the preferred node might change, needing another node migration. Avoid this by checking for scan sequence completion only when checking for periodic migration. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25862.6 26158.1 1.14258 1 74357 72725 -2.19482 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 117019 113992 -2.58 1 179095 174947 -2.31 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 449.46 770.77 615.22 101.70 numa01.sh Sys: 132.72 208.17 170.46 24.96 numa01.sh User: 39185.26 60290.89 50066.76 6807.84 numa02.sh Real: 60.85 61.79 61.28 0.37 numa02.sh Sys: 15.34 24.71 21.08 3.61 numa02.sh User: 5204.41 5249.85 5231.21 17.60 numa03.sh Real: 785.50 916.97 840.77 44.98 numa03.sh Sys: 108.08 133.60 119.43 8.82 numa03.sh User: 61422.86 70919.75 64720.87 3310.61 numa04.sh Real: 429.57 587.37 480.80 57.40 numa04.sh Sys: 240.61 321.97 290.84 33.58 numa04.sh User: 34597.65 40498.99 37079.48 2060.72 numa05.sh Real: 392.09 431.25 414.65 13.82 numa05.sh Sys: 229.41 372.48 297.54 53.14 numa05.sh User: 33390.86 34697.49 34222.43 556.42 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 424.63 566.18 498.12 59.26 23.50% numa01.sh Sys: 160.19 256.53 208.98 37.02 -18.4% numa01.sh User: 37320.00 46225.58 42001.57 3482.45 19.20% numa02.sh Real: 60.17 62.47 60.91 0.85 0.607% numa02.sh Sys: 15.30 22.82 17.04 2.90 23.70% numa02.sh User: 5202.13 5255.51 5219.08 20.14 0.232% numa03.sh Real: 823.91 844.89 833.86 8.46 0.828% numa03.sh Sys: 130.69 148.29 140.47 6.21 -14.9% numa03.sh User: 62519.15 64262.20 63613.38 620.05 1.740% numa04.sh Real: 515.30 603.74 548.56 30.93 -12.3% numa04.sh Sys: 459.73 525.48 489.18 21.63 -40.5% numa04.sh User: 40561.96 44919.18 42047.87 1526.85 -11.8% numa05.sh Real: 396.58 454.37 421.13 19.71 -1.53% numa05.sh Sys: 208.72 422.02 348.90 73.60 -14.7% numa05.sh User: 33124.08 36109.35 34846.47 1089.74 -1.79% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
f35678b6a1 |
sched/numa: Use group_weights to identify if migration degrades locality
On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be consolidated to the closest group of nodes. In such a case, relying on group_fault metric may not always help to consolidate. There can always be a case where a node closer to the preferred node may have lesser faults than a node further away from the preferred node. In such a case, moving to node with more faults might avoid numa consolidation. Using group_weight would help to consolidate task/memory around the preferred_node. While here, to be on the conservative side, don't override migrate thread degrades locality logic for CPU_NEWLY_IDLE load balancing. Note: Similar problems exist with should_numa_migrate_memory and will be dealt separately. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25645.4 25960 1.22 1 72142 73550 1.95 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 110199 120071 8.958 1 176303 176249 -0.03 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 490.04 774.86 596.26 96.46 numa01.sh Sys: 151.52 242.88 184.82 31.71 numa01.sh User: 41418.41 60844.59 48776.09 6564.27 numa02.sh Real: 60.14 62.94 60.98 1.00 numa02.sh Sys: 16.11 30.77 21.20 5.28 numa02.sh User: 5184.33 5311.09 5228.50 44.24 numa03.sh Real: 790.95 856.35 826.41 24.11 numa03.sh Sys: 114.93 118.85 117.05 1.63 numa03.sh User: 60990.99 64959.28 63470.43 1415.44 numa04.sh Real: 434.37 597.92 504.87 59.70 numa04.sh Sys: 237.63 397.40 289.74 55.98 numa04.sh User: 34854.87 41121.83 38572.52 2615.84 numa05.sh Real: 386.77 448.90 417.22 22.79 numa05.sh Sys: 149.23 379.95 303.04 79.55 numa05.sh User: 32951.76 35959.58 34562.18 1034.05 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 493.19 672.88 597.51 59.38 -0.20% numa01.sh Sys: 150.09 245.48 207.76 34.26 -11.0% numa01.sh User: 41928.51 53779.17 48747.06 3901.39 0.059% numa02.sh Real: 60.63 62.87 61.22 0.83 -0.39% numa02.sh Sys: 16.64 27.97 20.25 4.06 4.691% numa02.sh User: 5222.92 5309.60 5254.03 29.98 -0.48% numa03.sh Real: 821.52 902.15 863.60 32.41 -4.30% numa03.sh Sys: 112.04 130.66 118.35 7.08 -1.09% numa03.sh User: 62245.16 69165.14 66443.04 2450.32 -4.47% numa04.sh Real: 414.53 519.57 476.25 37.00 6.009% numa04.sh Sys: 181.84 335.67 280.41 54.07 3.327% numa04.sh User: 33924.50 39115.39 37343.78 1934.26 3.290% numa05.sh Real: 408.30 441.45 417.90 12.05 -0.16% numa05.sh Sys: 233.41 381.60 295.58 57.37 2.523% numa05.sh User: 33301.31 35972.50 34335.19 938.94 0.661% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
30619c89b1 |
sched/numa: Update the scan period without holding the numa_group lock
The metrics for updating scan periods are local or task specific. Currently this update happens under the numa_group lock, which seems unnecessary. Hence move this update outside the lock. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25355.9 25645.4 1.141 1 72812 72142 -0.92 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
2d4056fafa |
sched/numa: Remove numa_has_capacity()
task_numa_find_cpu() helps to find the CPU to swap/move the task to. It's guarded by numa_has_capacity(). However node not having capacity shouldn't deter a task swapping if it helps NUMA placement. Further load_too_imbalanced(), which evaluates possibilities of move/swap, provides similar checks as numa_has_capacity. Hence remove numa_has_capacity() to enhance possibilities of task swapping even if load is imbalanced. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25657.9 25804.1 0.569 1 74435 73413 -1.37 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@surriel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
0ad4e3dfe6 |
sched/numa: Modify migrate_swap() to accept additional parameters
There are checks in migrate_swap_stop() that check if the task/CPU combination is as per migrate_swap_arg before migrating. However atleast one of the two tasks to be swapped by migrate_swap() could have migrated to a completely different CPU before updating the migrate_swap_arg. The new CPU where the task is currently running could be a different node too. If the task has migrated, numa balancer might end up placing a task in a wrong node. Instead of achieving node consolidation, it may end up spreading the load across nodes. To avoid that pass the CPUs as additional parameters. While here, place migrate_swap under CONFIG_NUMA_BALANCING. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25377.3 25226.6 -0.59 1 72287 73326 1.437 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
10864a9e22 |
sched/numa: Remove unused task_capacity from 'struct numa_stats'
The task_capacity field in 'struct numa_stats' is redundant. Also move nr_running for better packing within the struct. No functional changes. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25308.6 25377.3 0.271 1 72964 72287 -0.92 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Rik van Riel <riel@surriel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
0ee7e74dc0 |
sched/numa: Skip nodes that are at 'hoplimit'
When comparing two nodes at a distance of 'hoplimit', we should consider nodes only up to 'hoplimit'. Currently we also consider nodes at 'oplimit' distance too. Hence two nodes at a distance of 'hoplimit' will have same groupweight. Fix this by skipping nodes at hoplimit. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25375.3 25308.6 -0.26 1 72617 72964 0.477 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 113372 108750 -4.07684 1 177403 183115 3.21979 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 478.45 565.90 515.11 30.87 numa01.sh Sys: 207.79 271.04 232.94 21.33 numa01.sh User: 39763.93 47303.12 43210.73 2644.86 numa02.sh Real: 60.00 61.46 60.78 0.49 numa02.sh Sys: 15.71 25.31 20.69 3.42 numa02.sh User: 5175.92 5265.86 5235.97 32.82 numa03.sh Real: 776.42 834.85 806.01 23.22 numa03.sh Sys: 114.43 128.75 121.65 5.49 numa03.sh User: 60773.93 64855.25 62616.91 1576.39 numa04.sh Real: 456.93 511.95 482.91 20.88 numa04.sh Sys: 178.09 460.89 356.86 94.58 numa04.sh User: 36312.09 42553.24 39623.21 2247.96 numa05.sh Real: 393.98 493.48 436.61 35.59 numa05.sh Sys: 164.49 329.15 265.87 61.78 numa05.sh User: 33182.65 36654.53 35074.51 1187.71 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 414.64 819.20 556.08 147.70 -7.36% numa01.sh Sys: 77.52 205.04 139.40 52.05 67.10% numa01.sh User: 37043.24 61757.88 45517.48 9290.38 -5.06% numa02.sh Real: 60.80 63.32 61.63 0.88 -1.37% numa02.sh Sys: 17.35 39.37 25.71 7.33 -19.5% numa02.sh User: 5213.79 5374.73 5268.90 55.09 -0.62% numa03.sh Real: 780.09 948.64 831.43 63.02 -3.05% numa03.sh Sys: 104.96 136.92 116.31 11.34 4.591% numa03.sh User: 60465.42 73339.78 64368.03 4700.14 -2.72% numa04.sh Real: 412.60 681.92 521.29 96.64 -7.36% numa04.sh Sys: 210.32 314.10 251.77 37.71 41.74% numa04.sh User: 34026.38 45581.20 38534.49 4198.53 2.825% numa05.sh Real: 394.79 439.63 411.35 16.87 6.140% numa05.sh Sys: 238.32 330.09 292.31 38.32 -9.04% numa05.sh User: 33456.45 34876.07 34138.62 609.45 2.741% While there is a regression with this change, this change is needed from a correctness perspective. Also it helps consolidation as seen from perf bench output. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
f03bb6760b |
sched/numa: Use task faults only if numa_group is not yet set up
When numa_group faults are available, task_numa_placement only uses numa_group faults to evaluate preferred node. However it still accounts task faults and even evaluates the preferred node just based on task faults just to discard it in favour of preferred node chosen on the basis of numa_group. Instead use task faults only if numa_group is not set. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25549.6 25215.7 -1.30 1 73190 72107 -1.47 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 113437 113372 -0.05 1 196130 177403 -9.54 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 506.35 794.46 599.06 104.26 numa01.sh Sys: 150.37 223.56 195.99 24.94 numa01.sh User: 43450.69 61752.04 49281.50 6635.33 numa02.sh Real: 60.33 62.40 61.31 0.90 numa02.sh Sys: 18.12 31.66 24.28 5.89 numa02.sh User: 5203.91 5325.32 5260.29 49.98 numa03.sh Real: 696.47 853.62 745.80 57.28 numa03.sh Sys: 85.68 123.71 97.89 13.48 numa03.sh User: 55978.45 66418.63 59254.94 3737.97 numa04.sh Real: 444.05 514.83 497.06 26.85 numa04.sh Sys: 230.39 375.79 316.23 48.58 numa04.sh User: 35403.12 41004.10 39720.80 2163.08 numa05.sh Real: 423.09 460.41 439.57 13.92 numa05.sh Sys: 287.38 480.15 369.37 68.52 numa05.sh User: 34732.12 38016.80 36255.85 1070.51 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 478.45 565.90 515.11 30.87 16.29% numa01.sh Sys: 207.79 271.04 232.94 21.33 -15.8% numa01.sh User: 39763.93 47303.12 43210.73 2644.86 14.04% numa02.sh Real: 60.00 61.46 60.78 0.49 0.871% numa02.sh Sys: 15.71 25.31 20.69 3.42 17.35% numa02.sh User: 5175.92 5265.86 5235.97 32.82 0.464% numa03.sh Real: 776.42 834.85 806.01 23.22 -7.47% numa03.sh Sys: 114.43 128.75 121.65 5.49 -19.5% numa03.sh User: 60773.93 64855.25 62616.91 1576.39 -5.36% numa04.sh Real: 456.93 511.95 482.91 20.88 2.930% numa04.sh Sys: 178.09 460.89 356.86 94.58 -11.3% numa04.sh User: 36312.09 42553.24 39623.21 2247.96 0.246% numa05.sh Real: 393.98 493.48 436.61 35.59 0.677% numa05.sh Sys: 164.49 329.15 265.87 61.78 38.92% numa05.sh User: 33182.65 36654.53 35074.51 1187.71 3.368% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
8cd45eee43 |
sched/numa: Set preferred_node based on best_cpu
Currently preferred node is set to dst_nid which is the last node in the iteration whose group weight or task weight is greater than the current node. However it doesn't guarantee that dst_nid has the numa capacity to move. It also doesn't guarantee that dst_nid has the best_cpu which is the CPU/node ideal for node migration. Lets consider faults on a 4 node system with group weight numbers in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task is running on 3 and 0 is its preferred node but its capacity is full. Consider nodes 1, 2 and 3 have capacity. Then the task should be migrated to node 1. Currently the task gets moved to node 2. env.dst_nid points to the last node whose faults were greater than current node. Modify to set the preferred node based of best_cpu. Earlier setting preferred node was skipped if nr_active_nodes is 1. This could result in the task being moved out of the preferred node to a random node during regular load balancing. Also while modifying task_numa_migrate(), use sched_setnuma to set preferred node. This ensures out numa accounting is correct. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25122.9 25549.6 1.698 1 73850 73190 -0.89 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 105930 113437 7.08676 1 178624 196130 9.80047 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 435.78 653.81 534.58 83.20 numa01.sh Sys: 121.93 187.18 145.90 23.47 numa01.sh User: 37082.81 51402.80 43647.60 5409.75 numa02.sh Real: 60.64 61.63 61.19 0.40 numa02.sh Sys: 14.72 25.68 19.06 4.03 numa02.sh User: 5210.95 5266.69 5233.30 20.82 numa03.sh Real: 746.51 808.24 780.36 23.88 numa03.sh Sys: 97.26 108.48 105.07 4.28 numa03.sh User: 58956.30 61397.05 60162.95 1050.82 numa04.sh Real: 465.97 519.27 484.81 19.62 numa04.sh Sys: 304.43 359.08 334.68 20.64 numa04.sh User: 37544.16 41186.15 39262.44 1314.91 numa05.sh Real: 411.57 457.20 433.29 16.58 numa05.sh Sys: 230.05 435.48 339.95 67.58 numa05.sh User: 33325.54 36896.31 35637.84 1222.64 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 506.35 794.46 599.06 104.26 -10.76% numa01.sh Sys: 150.37 223.56 195.99 24.94 -25.55% numa01.sh User: 43450.69 61752.04 49281.50 6635.33 -11.43% numa02.sh Real: 60.33 62.40 61.31 0.90 -0.195% numa02.sh Sys: 18.12 31.66 24.28 5.89 -21.49% numa02.sh User: 5203.91 5325.32 5260.29 49.98 -0.513% numa03.sh Real: 696.47 853.62 745.80 57.28 4.6339% numa03.sh Sys: 85.68 123.71 97.89 13.48 7.3347% numa03.sh User: 55978.45 66418.63 59254.94 3737.97 1.5323% numa04.sh Real: 444.05 514.83 497.06 26.85 -2.464% numa04.sh Sys: 230.39 375.79 316.23 48.58 5.8343% numa04.sh User: 35403.12 41004.10 39720.80 2163.08 -1.153% numa05.sh Real: 423.09 460.41 439.57 13.92 -1.428% numa05.sh Sys: 287.38 480.15 369.37 68.52 -7.964% numa05.sh User: 34732.12 38016.80 36255.85 1070.51 -1.704% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
5f95ba7a43 |
sched/numa: Simplify load_too_imbalanced()
Currently load_too_imbalance() cares about the slope of imbalance. It doesn't care of the direction of the imbalance. However this may not work if nodes that are being compared have dissimilar capacities. Few nodes might have more cores than other nodes in the system. Also unlike traditional load balance at a NUMA sched domain, multiple requests to migrate from the same source node to same destination node may run in parallel. This can cause huge load imbalance. This is specially true on a larger machines with either large cores per node or more number of nodes in the system. Hence allow move/swap only if the imbalance is going to reduce. Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25058.2 25122.9 0.25 1 72950 73850 1.23 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 516.14 892.41 739.84 151.32 numa01.sh Sys: 153.16 192.99 177.70 14.58 numa01.sh User: 39821.04 69528.92 57193.87 10989.48 numa02.sh Real: 60.91 62.35 61.58 0.63 numa02.sh Sys: 16.47 26.16 21.20 3.85 numa02.sh User: 5227.58 5309.61 5265.17 31.04 numa03.sh Real: 739.07 917.73 795.75 64.45 numa03.sh Sys: 94.46 136.08 109.48 14.58 numa03.sh User: 57478.56 72014.09 61764.48 5343.69 numa04.sh Real: 442.61 715.43 530.31 96.12 numa04.sh Sys: 224.90 348.63 285.61 48.83 numa04.sh User: 35836.84 47522.47 40235.41 3985.26 numa05.sh Real: 386.13 489.17 434.94 43.59 numa05.sh Sys: 144.29 438.56 278.80 105.78 numa05.sh User: 33255.86 36890.82 34879.31 1641.98 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 435.78 653.81 534.58 83.20 38.39% numa01.sh Sys: 121.93 187.18 145.90 23.47 21.79% numa01.sh User: 37082.81 51402.80 43647.60 5409.75 31.03% numa02.sh Real: 60.64 61.63 61.19 0.40 0.637% numa02.sh Sys: 14.72 25.68 19.06 4.03 11.22% numa02.sh User: 5210.95 5266.69 5233.30 20.82 0.608% numa03.sh Real: 746.51 808.24 780.36 23.88 1.972% numa03.sh Sys: 97.26 108.48 105.07 4.28 4.197% numa03.sh User: 58956.30 61397.05 60162.95 1050.82 2.661% numa04.sh Real: 465.97 519.27 484.81 19.62 9.385% numa04.sh Sys: 304.43 359.08 334.68 20.64 -14.6% numa04.sh User: 37544.16 41186.15 39262.44 1314.91 2.478% numa05.sh Real: 411.57 457.20 433.29 16.58 0.380% numa05.sh Sys: 230.05 435.48 339.95 67.58 -17.9% numa05.sh User: 33325.54 36896.31 35637.84 1222.64 -2.12% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
305c1fac32 |
sched/numa: Evaluate move once per node
task_numa_compare() helps choose the best CPU to move or swap the selected task. To achieve this task_numa_compare() is called for every CPU in the node. Currently it evaluates if the task can be moved/swapped for each of the CPUs. However the move evaluation is mostly independent of the CPU. Evaluating the move logic once per node, provides scope for simplifying task_numa_compare(). Running SPECjbb2005 on a 4 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 16 25705.2 25058.2 -2.51 1 74433 72950 -1.99 Running SPECjbb2005 on a 16 node machine and comparing bops/JVM JVMS LAST_PATCH WITH_PATCH %CHANGE 8 96589.6 105930 9.670 1 181830 178624 -1.76 (numbers from v1 based on v4.17-rc5) Testcase Time: Min Max Avg StdDev numa01.sh Real: 440.65 941.32 758.98 189.17 numa01.sh Sys: 183.48 320.07 258.42 50.09 numa01.sh User: 37384.65 71818.14 60302.51 13798.96 numa02.sh Real: 61.24 65.35 62.49 1.49 numa02.sh Sys: 16.83 24.18 21.40 2.60 numa02.sh User: 5219.59 5356.34 5264.03 49.07 numa03.sh Real: 822.04 912.40 873.55 37.35 numa03.sh Sys: 118.80 140.94 132.90 7.60 numa03.sh User: 62485.19 70025.01 67208.33 2967.10 numa04.sh Real: 690.66 872.12 778.49 65.44 numa04.sh Sys: 459.26 563.03 494.03 42.39 numa04.sh User: 51116.44 70527.20 58849.44 8461.28 numa05.sh Real: 418.37 562.28 525.77 54.27 numa05.sh Sys: 299.45 481.00 392.49 64.27 numa05.sh User: 34115.09 41324.02 39105.30 2627.68 Testcase Time: Min Max Avg StdDev %Change numa01.sh Real: 516.14 892.41 739.84 151.32 2.587% numa01.sh Sys: 153.16 192.99 177.70 14.58 45.42% numa01.sh User: 39821.04 69528.92 57193.87 10989.48 5.435% numa02.sh Real: 60.91 62.35 61.58 0.63 1.477% numa02.sh Sys: 16.47 26.16 21.20 3.85 0.943% numa02.sh User: 5227.58 5309.61 5265.17 31.04 -0.02% numa03.sh Real: 739.07 917.73 795.75 64.45 9.776% numa03.sh Sys: 94.46 136.08 109.48 14.58 21.39% numa03.sh User: 57478.56 72014.09 61764.48 5343.69 8.813% numa04.sh Real: 442.61 715.43 530.31 96.12 46.79% numa04.sh Sys: 224.90 348.63 285.61 48.83 72.97% numa04.sh User: 35836.84 47522.47 40235.41 3985.26 46.26% numa05.sh Real: 386.13 489.17 434.94 43.59 20.88% numa05.sh Sys: 144.29 438.56 278.80 105.78 40.77% numa05.sh User: 33255.86 36890.82 34879.31 1641.98 12.11% Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
2e62c4743a |
sched/fair: Remove #ifdefs from scale_rt_capacity()
Reuse cpu_util_irq() that has been defined for schedutil and set irq util to 0 when !CONFIG_IRQ_TIME_ACCOUNTING. But the compiler is not able to optimize the sequence (at least with aarch64 GCC 7.2.1): free *= (max - irq); free /= max; when irq is fixed to 0 Add a new inline function scale_irq_capacity() that will scale utilization when irq is accounted. Reuse this funciton in schedutil which applies similar formula. Suggested-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
bbb62c0b02 |
sched/core: Remove the rt_avg code
rt_avg is not used anywhere anymore, so we can remove all related code. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
523e979d31 |
sched/core: Use PELT for scale_rt_capacity()
The utilization of the CPU by RT, DL and IRQs are now tracked with PELT so we can use these metrics instead of rt_avg to evaluate the remaining capacity available for CFS class. scale_rt_capacity() behavior has been changed and now returns the remaining capacity available for CFS instead of a scaling factor because RT, DL and IRQ provide now absolute utilization value. The same formula as schedutil is used: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg but the implementation is different because it doesn't return the same value and doesn't benefit of the same optimization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
91c27493e7 |
sched/irq: Add IRQ utilization tracking
interrupt and steal time are the only remaining activities tracked by rt_avg. Like for sched classes, we can use PELT to track their average utilization of the CPU. But unlike sched class, we don't track when entering/leaving interrupt; Instead, we take into account the time spent under interrupt context when we update rqs' clock (rq_clock_task). This also means that we have to decay the normal context time and account for interrupt time during the update. That's also important to note that because: rq_clock == rq_clock_task + interrupt time and rq_clock_task is used by a sched class to compute its utilization, the util_avg of a sched class only reflects the utilization of the time spent in normal context and not of the whole time of the CPU. The utilization of interrupt gives an more accurate level of utilization of CPU. The CPU utilization is: avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq Most of the time, avg_irq is small and neglictible so the use of the approximation CPU utilization = /Sum avg_rq was enough. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
3727e0e163 |
sched/dl: Add dl_rq utilization tracking
Similarly to what happens with RT tasks, CFS tasks can be preempted by DL tasks and the CFS's utilization might no longer describes the real utilization level. Current DL bandwidth reflects the requirements to meet deadline when tasks are enqueued but not the current utilization of the DL sched class. We track DL class utilization to estimate the system utilization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
371bf42732 |
sched/rt: Add rt_rq utilization tracking
schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks are preempted by RT tasks and in this case util_avg reflects the remaining capacity but not what CFS want to use. In such case, schedutil can select a lower OPP whereas the CPU is overloaded. In order to have a more accurate view of the utilization of the CPU, we track the utilization of RT tasks. Only util_avg is correctly tracked but not load_avg and runnable_load_avg which are useless for rt_rq. rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are the same at the root group level, so the PELT windows of the util_sum are aligned. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
c079629862 |
sched/pelt: Move PELT related code in a dedicated file
We want to track rt_rq's utilization as a part of the estimation of the whole rq's utilization. This is necessary because rt tasks can steal utilization to cfs tasks and make them lighter than they are. As we want to use the same load tracking mecanism for both and prevent useless dependency between cfs and rt code, PELT code is moved in a dedicated file. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
8fe5c5a937 |
sched/fair: Fix util_avg of new tasks for asymmetric systems
When a new task wakes-up for the first time, its initial utilization is set to half of the spare capacity of its CPU. The current implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE directly as a capacity reference. As a result, on a big.LITTLE system, a new task waking up on an idle little CPU will be given ~512 of util_avg, even if the CPU's capacity is significantly less than that. Fix this by computing the spare capacity with arch_scale_cpu_capacity(). Signed-off-by: Quentin Perret <quentin.perret@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
4520843dfa |
Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
3482d98bbc |
sched/util_est: Fix util_est_dequeue() for throttled cfs_rq
When a cfs_rq is throttled, parent cfs_rq->nr_running is decreased and
everything happens at cfs_rq level. Currently util_est stays unchanged
in such case and it keeps accounting the utilization of throttled tasks.
This can somewhat make sense as we don't dequeue tasks but only throttled
cfs_rq.
If a task of another group is enqueued/dequeued and root cfs_rq becomes
idle during the dequeue, util_est will be cleared whereas it was
accounting util_est of throttled tasks before. So the behavior of util_est
is not always the same regarding throttled tasks and depends of side
activity. Furthermore, util_est will not be updated when the cfs_rq is
unthrottled as everything happens at cfs_rq level. Main results is that
util_est will stay null whereas we now have running tasks. We have to wait
for the next dequeue/enqueue of the previously throttled tasks to get an
up to date util_est.
Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
if there is no running task so the util_est of a task remains until the
latter is dequeued even if its cfs_rq has been throttled.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
![]() |
f1d1be8aee |
sched/fair: Advance global expiration when period timer is restarted
When period gets restarted after some idle time, start_cfs_bandwidth() doesn't update the expiration information, expire_cfs_rq_runtime() will see cfs_rq->runtime_expires smaller than rq clock and go to the clock drift logic, wasting needless CPU cycles on the scheduler hot path. Update the global expiration in start_cfs_bandwidth() to avoid frequent expire_cfs_rq_runtime() calls once a new period begins. Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
512ac999d2 |
sched/fair: Fix bandwidth timer clock drift condition
I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by this commit: |
||
![]() |
03585a95cd |
sched/fair: Remove stale tg_unthrottle_up() comments
After commit:
|
||
![]() |
ba2591a599 |
sched/smt: Update sched_smt_present at runtime
The static key sched_smt_present is only updated at boot time when SMT siblings have been detected. Booting with maxcpus=1 and bringing the siblings online after boot rebuilds the scheduling domains correctly but does not update the static key, so the SMT code is not enabled. Let the key be updated in the scheduler CPU hotplug code to fix this. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
6396bb2215 |
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org> |
||
![]() |
2539fc82aa |
sched/fair: Update util_est before updating schedutil
When a task is enqueued the estimated utilization of a CPU is updated
to better support the selection of the required frequency.
However, schedutil is (implicitly) updated by update_load_avg() which
always happens before util_est_{en,de}queue(), thus potentially
introducing a latency between estimated utilization updates and
frequency selections.
Let's update util_est at the beginning of enqueue_task_fair(),
which will ensure that all schedutil updates will see the most
updated estimated utilization value for a CPU.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Fixes:
|
||
![]() |
943d355d7f |
sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
In the following commit:
|
||
![]() |
1378447598 |
sched/numa: Stagger NUMA balancing scan periods for new threads
Threads share an address space and each can change the protections of the same address space to trap NUMA faults. This is redundant and potentially counter-productive as any thread doing the update will suffice. Potentially only one thread is required but that thread may be idle or it may not have any locality concerns and pick an unsuitable scan rate. This patch uses independent scan period but they are staggered based on the number of address space users when the thread is created. The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces the total scan activity early in the lifetime of the threads. The different in headline performance across a range of machines and workloads is marginal but the system CPU usage is reduced as well as overall scan activity. The following is the time reported by NAS Parallel Benchmark using unbound openmp threads and a D size class: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 Time bt.D 442.77 ( 0.00%) 419.70 ( 5.21%) Time cg.D 171.90 ( 0.00%) 180.85 ( -5.21%) Time ep.D 33.10 ( 0.00%) 32.90 ( 0.60%) Time is.D 9.59 ( 0.00%) 9.42 ( 1.77%) Time lu.D 306.75 ( 0.00%) 304.65 ( 0.68%) Time mg.D 54.56 ( 0.00%) 52.38 ( 4.00%) Time sp.D 1020.03 ( 0.00%) 903.77 ( 11.40%) Time ua.D 400.58 ( 0.00%) 386.49 ( 3.52%) Note it's not a universal win but we have no prior knowledge of which thread matters but the number of threads created often exceeds the size of the node when the threads are not bound. However, there is a reducation of overall system CPU usage: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 sys-time-bt.D 48.78 ( 0.00%) 48.22 ( 1.15%) sys-time-cg.D 25.31 ( 0.00%) 26.63 ( -5.22%) sys-time-ep.D 1.65 ( 0.00%) 0.62 ( 62.42%) sys-time-is.D 40.05 ( 0.00%) 24.45 ( 38.95%) sys-time-lu.D 37.55 ( 0.00%) 29.02 ( 22.72%) sys-time-mg.D 47.52 ( 0.00%) 34.92 ( 26.52%) sys-time-sp.D 119.01 ( 0.00%) 109.05 ( 8.37%) sys-time-ua.D 51.52 ( 0.00%) 45.13 ( 12.40%) NUMA scan activity is also reduced: NUMA alloc local 1042828 1342670 NUMA base PTE updates 140481138 93577468 NUMA huge PMD updates 272171 180766 NUMA page range updates 279832690 186129660 NUMA hint faults 1395972 1193897 NUMA hint local faults 877925 855053 NUMA hint local percent 62 71 NUMA pages migrated 12057909 9158023 Similar observations are made for other thread-intensive workloads. System CPU usage is lower even though the headline gains in performance tend to be small. For example, specjbb 2005 shows almost no difference in performance but scan activity is reduced by a third on a 4-socket box. I didn't find a workload (thread intensive or otherwise) that suffered badly. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
dfd5c3ea64 |
Merge tag 'v4.17-rc5' into sched/core, to pick up fixes and dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
66e1c94db3 |
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner: "A mixed bag of fixes and updates for the ghosts which are hunting us. The scheduler fixes have been pulled into that branch to avoid conflicts. - A set of fixes to address a khread_parkme() race which caused lost wakeups and loss of state. - A deadlock fix for stop_machine() solved by moving the wakeups outside of the stopper_lock held region. - A set of Spectre V1 array access restrictions. The possible problematic spots were discuvered by Dan Carpenters new checks in smatch. - Removal of an unused file which was forgotten when the rest of that functionality was removed" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Remove unused file perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map() perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_* perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[] sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Introduce set_special_state() kthread, sched/wait: Fix kthread_parkme() completion issue kthread, sched/wait: Fix kthread_parkme() wait-loop sched/fair: Fix the update of blocked load when newly idle stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock |