Commit Graph

1140 Commits

Author SHA1 Message Date
Cheng Jian
60588bfa22 sched/fair: Optimize select_idle_cpu
select_idle_cpu() will scan the LLC domain for idle CPUs,
it's always expensive. so the next commit :

	1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")

introduces a way to limit how many CPUs we scan.

But it consume some CPUs out of 'nr' that are not allowed
for the task and thus waste our attempts. The function
always return nr_cpumask_bits, and we can't find a CPU
which our task is allowed to run.

Cpumask may be too big, similar to select_idle_core(), use
per_cpu_ptr 'select_idle_mask' to prevent stack overflow.

Fixes: 1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
2019-12-17 13:32:51 +01:00
Frederic Weisbecker
7c2e8bbd87 sched: Spare resched IPI when prio changes on a single fair task
The runqueue of a fair task being remotely reniced is going to get a
resched IPI in order to reassess which task should be the current
running on the CPU. However that evaluation is useless if the fair task
is running alone, in which case we can spare that IPI, preventing
nohz_full CPUs from being disturbed.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-2-frederic@kernel.org
2019-12-17 13:32:50 +01:00
Vincent Guittot
6cf82d559e sched/cfs: fix spurious active migration
The load balance can fail to find a suitable task during the periodic check
because  the imbalance is smaller than half of the load of the waiting
tasks. This results in the increase of the number of failed load balance,
which can end up to start an active migration. This active migration is
useless because the current running task is not a better choice than the
waiting ones. In fact, the current task was probably not running but
waiting for the CPU during one of the previous attempts and it had already
not been selected.

When load balance fails too many times to migrate a task, we should relax
the contraint on the maximum load of the tasks that can be migrated
similarly to what is done with cache hotness.

Before the rework, load balance used to set the imbalance to the average
load_per_task in order to mitigate such situation. This increased the
likelihood of migrating a task but also of selecting a larger task than
needed while more appropriate ones were in the list.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1575036287-6052-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Vincent Guittot
7ed735c331 sched/fair: Fix find_idlest_group() to handle CPU affinity
Because of CPU affinity, the local group can be skipped which breaks the
assumption that statistics are always collected for local group. With
uninitialized local_sgs, the comparison is meaningless and the behavior
unpredictable. This can even end up to use local pointer which is to
NULL in this case.

If the local group has been skipped because of CPU affinity, we return
the idlest group.

Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: mingo@redhat.com
Cc: mgorman@suse.de
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: bsegall@google.com
Cc: qais.yousef@arm.com
Link: https://lkml.kernel.org/r/1575483700-22153-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Quentin Perret
76e19fdf8e ANDROID: sched/fair: Fix 5.5-rc1 merge
Some android-specific EAS patches caused issues during the 5.5-rc1 merge
that contains the load-balance rework. Most of the issues these patches
were fixing should now covered by the rework, which means they are not
needed any more. And for the others, they need to be reworked to fit
with the new LB code.

Fix the 5.5-rc1 merge by removing all the android-specific code from the
load balancer.

Effectively, this reverts 65b8ddef4e ("ANDROID: sched: Prevent
unnecessary active balance of single task in sched group"), f351885fc7
("ANDROID: sched/fair: Attempt to improve throughput for asym cap
systems"), a7455f8123 ("ANDROID: sched/fair: Don't balance misfits if
it would overload local group") and 44ab1611a8 ("ANDROID: sched/fair:
Also do misfit in overloaded groups").

Fixes: d3a196a371 ("Merge 5.5-rc1 into android-mainline")
Change-Id: I15c2e044d9f571be5eac0ced929690d9bf7dec65
Signed-off-by: Quentin Perret <qperret@google.com>
2019-12-09 17:01:44 +00:00
Greg Kroah-Hartman
d3a196a371 Merge 5.5-rc1 into android-mainline
Linux 5.5-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6f952ebdd40746115165a2f99bab340482f5c237
2019-12-09 12:12:00 +01:00
Vincent Guittot
bef69dd878 sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.

When a task is attached on a CPU, we have this call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change -- > trig frequency update
  attach_entity_load_avg()
    cfs_rq_util_change -- > trig frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.

It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.

This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.

[ mingo: Minor updates. ]

Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:42:26 +01:00
Ingo Molnar
b21feab0b8 Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:41:02 +01:00
Vincent Guittot
a9723389cc sched/fair: Add comments for group_type and balancing at SD_NUMA level
Add comments to describe each state of goup_type and to add some details
about the load balance at NUMA level.

[ Valentin Schneider: Updates to the comments. ]
[ mingo: Other updates to the comments. ]

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1573570243-1903-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:33:12 +01:00
Vincent Guittot
3318544b72 sched/fair: Fix rework of find_idlest_group()
The task, for which the scheduler looks for the idlest group of CPUs, must
be discounted from all statistics in order to get a fair comparison
between groups. This includes utilization, load, nr_running and idle_cpus.

Such unfairness can be easily highlighted with the unixbench execl 1 task.
This test continuously call execve() and the scheduler looks for the idlest
group/CPU on which it should place the task. Because the task runs on the
local group/CPU, the latter seems already busy even if there is nothing
else running on it. As a result, the scheduler will always select another
group/CPU than the local one.

This recovers most of the performance regression on my system from the
recent load-balancer rewrite.

[ mingo: Minor cleanups. ]

Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Link: https://lkml.kernel.org/r/1571762798-25900-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:11:56 +01:00
Greg Kroah-Hartman
ad5859c6ae Merge 5.4-rc8 into android-mainline
Linux 5.4-rc8

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1f55e5d34dc78ddb064910ce1e1b7a7b5b39aaba
2019-11-18 08:31:11 +01:00
Vincent Guittot
b90f7c9d21 sched/pelt: Fix update of blocked PELT ordering
update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an
update of the frequency. Make sure that RT, DL and IRQ PELT signals have
been updated before calling cpufreq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: dsmythies@telus.net
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Fixes: 371bf42732 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e163 ("sched/dl: Add dl_rq utilization tracking")
Fixes: 91c27493e7 ("sched/irq: Add IRQ utilization tracking")
Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:01:31 +01:00
Peter Zijlstra
a0e813f26e sched/core: Further clarify sched_class::set_next_task()
It turns out there really is something special to the first
set_next_task() invocation. In specific the 'change' pattern really
should not cause balance callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:21 +01:00
Peter Zijlstra
2eeb01a28c sched/fair: Use mul_u32_u32()
While reading the code I encountered another site where we should be
using mul_u32_u32() because GCC just won't take a hint.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.717931380@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:20 +01:00
Peter Zijlstra
98c2f700ed sched/core: Simplify sched_class::pick_next_task()
Now that the indirect class call never uses the last two arguments of
pick_next_task(), remove them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:20 +01:00
Peter Zijlstra
5d7d605642 sched/core: Optimize pick_next_task()
Ever since we moved the sched_class definitions into their own files,
the constant expression {fair,idle}_sched_class.pick_next_task() is
not in fact a compile time constant anymore and results in an indirect
call (barring LTO).

Fix that by exposing pick_next_task_{fair,idle}() directly, this gets
rid of the indirect call (and RETPOLINE) on the fast path.

Also remove the unlikely() from the idle case, it is in fact /the/ way
we select idle -- and that is a very common thing to do.

Performance for will-it-scale/sched_yield improves by 2% (as reported
by 0-day).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.603037345@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:19 +01:00
Peter Zijlstra
7277a34c6b sched/fair: Better document newidle_balance()
Whilst chasing the pick_next_task() race, there was some confusion
about the newidle_balance() return values. Document them.

[ mingo: Minor edits. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.488364308@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:18 +01:00
Ingo Molnar
6d5a763c30 Merge tag 'v5.4-rc7' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:34:59 +01:00
Greg Kroah-Hartman
682d8bf784 Merge tag 'v5.4-rc7' into android-mainline
Linux 5.4-rc7

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I505207a0a6f68ccc3519d7f190d8faf25d9d479a
2019-11-11 06:10:55 +01:00
Greg Kroah-Hartman
8c36c8518c Revert "Revert "sched: Rework pick_next_task() slow-path""
This reverts commit 9acb7c07b6.

The root problem has now been fixed in 5.4-rc7, so revert this to
prevent merge issues.

Bug: 142182814
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-11-11 06:09:50 +01:00
Peter Zijlstra
6e2df0581f sched: Fix pick_next_task() vs 'change' pattern race
Commit 67692435c4 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().

Fixes: 67692435c4 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-11-08 22:34:14 +01:00
Quentin Perret
9acb7c07b6 Revert "sched: Rework pick_next_task() slow-path"
This reverts commit 67692435c4
temporarily in order to avoid the following splat:

[  222.515957] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  222.518871] #PF: supervisor read access in kernel mode
[  222.518871] #PF: error_code(0x0000) - not-present page
[  222.518871] PGD 800000007cdf7067 P4D 800000007cdf7067 PUD 7c868067 PMD 0
[  222.518871] Oops: 0000 [#1] PREEMPT SMP PTI
[  222.518871] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.0-rc4-mainline-g6a9b34166c05 #1
[  222.518871] Hardware name: ChromiumOS crosvm, BIOS 0
[  222.518871] RIP: 0010:set_next_entity+0xf/0xf0
[  222.518871] Code: 42 0b e5 85 31 c0 e8 90 1a fc ff 0f 0b eb 9b 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 53 48 89 f3 49 89 fe <83> 7e 40 00 74 3d 4c 89 f7 48 89 de e8 80 f9 ff ff 4c 8d 7b 18 4d
[  222.518871] RSP: 0018:ffffb95040073de8 EFLAGS: 00010046
[  222.518871] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff85ec0c30
[  222.518871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ec44c728500
[  222.518871] RBP: ffffb95040073e00 R08: 0000000000000000 R09: 0000000000000000
[  222.518871] R10: 0000000000000000 R11: ffffffff84ef0a30 R12: 0000000000000000
[  222.518871] R13: 0000000000000000 R14: ffff9ec44c728500 R15: ffffb95040073e60
[  222.518871] FS:  0000000000000000(0000) GS:ffff9ec44c700000(0000) knlGS:0000000000000000
[  222.518871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.518871] CR2: 0000000000000040 CR3: 000000007c87a002 CR4: 00000000003606e0
[  222.518871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.518871] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.518871] Call Trace:
[  222.518871]  pick_next_task_fair+0xe8/0x410
[  222.518871]  ? rcu_note_context_switch+0x3ef/0x520
[  222.518871]  ? finish_task_switch+0x10d/0x250
[  222.518871]  __schedule+0x1f8/0x6e0
[  222.518871]  schedule_idle+0x27/0x40
[  222.518871]  do_idle+0x21c/0x240
[  222.518871]  cpu_startup_entry+0x25/0x30
[  222.518871]  start_secondary+0x18d/0x1b0
[  222.518871]  secondary_startup_64+0xa4/0xb0
[  222.518871] Modules linked in: vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock_diag vsock can_raw can snd_intel8x0 snd_ac97_codec ac97_bus virtio_input virtio_pci virtio_mmio virtio_crypto mmc_block mmc_core dummy_cpufreq rtc_test dummy_hcd can_dev vcan virtio_net net_failover failover virt_wifi virtio_blk virtio_gpu ttm virtio_rng virtio_ring virtio 8250_of crypto_engine
[  222.518871] CR2: 0000000000000040
[  222.518871] ---[ end trace ae741809965b22f2 ]---
[  222.518871] RIP: 0010:set_next_entity+0xf/0xf0
[  222.518871] Code: 42 0b e5 85 31 c0 e8 90 1a fc ff 0f 0b eb 9b 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 53 48 89 f3 49 89 fe <83> 7e 40 00 74 3d 4c 89 f7 48 89 de e8 80 f9 ff ff 4c 8d 7b 18 4d
[  222.518871] RSP: 0018:ffffb95040073de8 EFLAGS: 00010046
[  222.518871] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff85ec0c30
[  222.518871] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ec44c728500
[  222.518871] RBP: ffffb95040073e00 R08: 0000000000000000 R09: 0000000000000000
[  222.518871] R10: 0000000000000000 R11: ffffffff84ef0a30 R12: 0000000000000000
[  222.518871] R13: 0000000000000000 R14: ffff9ec44c728500 R15: ffffb95040073e60
[  222.518871] FS:  0000000000000000(0000) GS:ffff9ec44c700000(0000) knlGS:0000000000000000
[  222.518871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.518871] CR2: 0000000000000040 CR3: 000000007c87a002 CR4: 00000000003606e0
[  222.518871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.518871] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.518871] Kernel panic - not syncing: Fatal exception
[  222.518871] Kernel Offset: 0x3e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Bug: 142182814
Test: 20/20 avd/avd_boot_test_kernel_triggered passing
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I62bb09b4710fdc593cbf1f42bf4bbc066e401458
2019-10-29 17:29:04 +00:00
Patrick Bellasi
b8c9636140 sched/fair/util_est: Implement faster ramp-up EWMA on utilization increases
The estimated utilization for a task:

   util_est = max(util_avg, est.enqueue, est.ewma)

is defined based on:

 - util_avg: the PELT defined utilization
 - est.enqueued: the util_avg at the end of the last activation
 - est.ewma:     a exponential moving average on the est.enqueued samples

According to this definition, when a task suddenly changes its bandwidth
requirements from small to big, the EWMA will need to collect multiple
samples before converging up to track the new big utilization.

This slow convergence towards bigger utilization values is not
aligned to the default scheduler behavior, which is to optimize for
performance. Moreover, the est.ewma component fails to compensate for
temporarely utilization drops which spans just few est.enqueued samples.

To let util_est do a better job in the scenario depicted above, change
its definition by making util_est directly follow upward motion and
only decay the est.ewma on downward.

Signed-off-by: Patrick Bellasi <patrick.bellasi@matbug.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191023205630.14469-1-patrick.bellasi@matbug.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:07 +01:00
Chris Redpath
06dc94f19d ANDROID: sched: Honor sync flag for energy-aware wakeups
Since we don't do energy-aware wakeups when we are overutilized, always
honoring sync wakeups in this state does not prevent wake-wide mechanics
overruling the flag as normal.

Having the check for (cpu_rq(cpu)->nr_running == 1) in the test condition
means that we only bypass the energy model for sync wakeups where the
cpu is about to become idle. This matches the use of this flag in Binder
which is the main place where sync wakeups come from in Android.

This patch is based upon previous work to build EAS for android products.
It replaces
commit 79e3a4a27e ("ANDROID: sched: Unconditionally honor sync flag
 for energy-aware wakeups")
in what's considered to be 'EAS'.

sync-hint code taken from wahoo
https://android.googlesource.com/kernel/msm/+/4a5e890ec60d
written by Juri Lelli <juri.lelli@arm.com>

Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit f1ec666a62de)
[ Moved the feature to find_energy_efficient_cpu() and removed the
  sysctl knob ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
[ Add 'cpu_rq(cpu)->nr_running == 1' to the condition and remove the
  word 'Unconditionally' from the patch name ]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2019-10-22 09:04:43 +00:00
Vincent Guittot
57abff067a sched/fair: Rework find_idlest_group()
The slow wake up path computes per sched_group statisics to select the
idlest group, which is quite similar to what load_balance() is doing
for selecting busiest group. Rework find_idlest_group() to classify the
sched_group and select the idlest one following the same steps as
load_balance().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
fc1273f4ce sched/fair: Optimize find_idlest_group()
find_idlest_group() now reads CPU's load_avg in two different ways.

Consolidate the function to read and use load_avg only once and simplify
the algorithm to only look for the group with lowest load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
11f10e5420 sched/fair: Use load instead of runnable load in wakeup path
Runnable load was originally introduced to take into account the case where
blocked load biases the wake up path which may end to select an overloaded
CPU with a large number of runnable tasks instead of an underutilized
CPU with a huge blocked load.

Tha wake up path now starts looking for idle CPUs before comparing
runnable load and it's worth aligning the wake up path with the
load_balance() logic.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
c63be7be59 sched/fair: Use utilization to select misfit task
Utilization is used to detect a misfit task but the load is then used to
select the task on the CPU which can lead to select a small task with
high weight instead of the task that triggered the misfit migration.

Check that task can't fit the CPU's capacity when selecting the misfit
task instead of using the load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
2ab4092fc8 sched/fair: Spread out tasks evenly when not overloaded
When there is only one CPU per group, using the idle CPUs to evenly spread
tasks doesn't make sense and nr_running is a better metrics.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
b0fb1eb4f0 sched/fair: Use load instead of runnable load in load_balance()
'runnable load' was originally introduced to take into account the case
where blocked load biases the load balance decision which was selecting
underutilized groups with huge blocked load whereas other groups were
overloaded.

The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
5e23e47443 sched/fair: Use rq->nr_running when balancing load
CFS load_balance() only takes care of CFS tasks whereas CPUs can be used by
other scheduling classes. Typically, a CFS task preempted by an RT or deadline
task will not get a chance to be pulled by another CPU because
load_balance() doesn't take into account tasks from other classes.
Add sum of nr_running in the statistics and use it to detect such
situations.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
0b0695f2b3 sched/fair: Rework load_balance()
The load_balance() algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics like the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios.

For instance, in the presence of idle CPUs, we should simply try to get at
least one task per CPU, whereas the current load-based algorithm can actually
leave idle CPUs alone simply because the load is somewhat balanced.

The current algorithm ends up creating virtual and meaningless values like
the avg_load_per_task or tweaks the state of a group to make it overloaded
whereas it's not, in order to try to migrate tasks.

load_balance() should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have:

	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_packing
	group_imbalanced
	group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before but also
some utilization, a number of task or a type of task:

	migrate_task
	migrate_util
	migrate_load
	migrate_misfit

This new load_balance() algorithm fixes several pending wrong tasks
placement:

 - the 1 task per CPU case with asymmetric system
 - the case of cfs task preempted by other class
 - the case of tasks not evenly spread on groups with spare capacity

Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code:

 - update_sd_pick_busiest() select the busiest sched_group.
 - find_busiest_group() checks if there is an imbalance between local and
   busiest group.
 - calculate_imbalance() decides what have to be moved.

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-5-git-send-email-vincent.guittot@linaro.org
[ Small readability and spelling updates. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
fcf0553db6 sched/fair: Remove meaningless imbalance calculation
Clean up load_balance() and remove meaningless calculation and fields before
adding a new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
a349834703 sched/fair: Rename sg_lb_stats::sum_nr_running to sum_h_nr_running
Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
when needed.

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
490ba971d8 sched/fair: Clean up asym packing
Clean up asym packing to follow the default load balance behavior:

- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We don't need to test twice same conditions anymore to detect asym packing
and we consolidate the calculation of imbalance in calculate_imbalance().

There is no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Greg Kroah-Hartman
630839ac24 Merge 5.4-rc3 into android-mainline
Linux 5.4-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia87ba662738dd58ddb917e32c1fbd812861e7a46
2019-10-17 05:28:13 -07:00
Xuewei Zhang
4929a4e6fa sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e192263 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:38:02 +02:00
Quentin Perret
760b82c9b8 ANDROID: sched/fair: Bias EAS placement for latency
Add to find_energy_efficient_cpu() a latency sensitive case which mimics
what was done for prefer-idle in android-4.19 and before (see [1] for
reference).

This isn't strictly equivalent to the legacy algorithm but comes real
close, and isn't very invasive. Overall, the idea is to select the
biggest idle CPU we can find for latency-sensitive boosted tasks, and
the smallest CPU where the can fit for latency-sensitive non-boosted
tasks.

The main differences with the legacy behaviour are the following:

 1. the policy for 'prefer idle' when there isn't a single idle CPU in
    the system is simpler now. We just pick the CPU with the highest
    spare capacity;

 2. the cstate awareness is implemented by minimizing the exit latency
    rather than the idle state index. This is how it is done in the slow
    path (find_idlest_group_cpu()), it doesn't require us to keep hooks
    into CPUIdle, and should actually be better because what we want is
    a CPU that can wake up quickly;

 3. non-latency-sensitive tasks just use the standard mainline
    energy-aware wake-up path, which decides the placement using the
    Energy Model;

 4. the 'boosted' and 'latency_sensitive' attributes of a task come from
    util_clamp (which now replaces schedtune).

[1] c27c56105d

Change-Id: Ia58516906e9cb5abe08385a8cd088097043d8703
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:34 +01:00
Chris Redpath
44ab1611a8 ANDROID: sched/fair: Also do misfit in overloaded groups
If we can classify the group as overloaded, that overrides
any classification as misfit but we may still have misfit
tasks present. Check the rq we're looking at to see if
this is the case.

Change-Id: Ida8eb66aa625e34de3fe2ee1b0dd8a78926273d8
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Removed stray reference to rq_has_misfit]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:34 +01:00
Chris Redpath
a7455f8123 ANDROID: sched/fair: Don't balance misfits if it would overload local group
When load balancing in a system with misfit tasks present, if we always
pull a misfit task to the local group this can lead to pulling a running
task from a smaller capacity CPUs to a bigger CPU which is busy. In this
situation, the pulled task is likely not to get a chance to run before
an idle balance on another small CPU pulls it back. This penalises the
pulled task as it is stopped for a short amount of time and then likely
relocated to a different CPU (since the original CPU just did a NEWLY_IDLE
balance and reset the periodic interval).

If we only do this unconditionally for NEWLY_IDLE balance, we can be
sure that any tasks and load which are present on the local group are
related to short-running tasks which we are happy to displace for a
longer running task in a system with misfit tasks present.

However, other balance types should only pull a task if we think
that the local group is underutilized - checking the number of tasks
gives us a conservative estimate here since if they were short tasks
we would have been doing NEWLY_IDLE balances instead.

Change-Id: I710add1ab1139482620b6addc8370ad194791beb
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:34 +01:00
Chris Redpath
f351885fc7 ANDROID: sched/fair: Attempt to improve throughput for asym cap systems
In some systems the capacity and group weights line up to defeat all the
small imbalance correction conditions in fix_small_imbalance, which can
cause bad task placement. Add a new condition if the existing code can't
see anything to fix:

If we have asymmetric capacity, and there are more tasks than CPUs in
the busiest group *and* there are less tasks than CPUs in the local group
then we try to pull something. There could be transient small tasks which
prevent this from working, but on the whole it is beneficial for those
systems with inconvenient capacity/cluster size relationships.

Change-Id: Icf81cde215c082a61f816534b7990ccb70aee409
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:33 +01:00
Morten Rasmussen
65b8ddef4e ANDROID: sched: Prevent unnecessary active balance of single task in sched group
Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions resulting the
nr_balance_failed counter to be incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().

A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.

Change-Id: I28f69c72febe0211decbe77b7bc3e48839d3d7b3
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:33 +01:00
Dietmar Eggemann
24c4437797 ANDROID: sched: Update max cpu capacity in case of max frequency constraints
Wakeup balancing uses cpu capacity awareness and needs to know the
system-wide maximum cpu capacity.

Patch "sched: Store system-wide maximum cpu capacity in root domain"
finds the system-wide maximum cpu capacity during scheduler domain
hierarchy setup. This is sufficient as long as maximum frequency
invariance is not enabled.

If it is enabled, the system-wide maximum cpu capacity can change
between scheduler domain hierarchy setups due to frequency capping.

The cpu capacity is changed in update_cpu_capacity() which is called in
load balance on the lowest scheduler domain hierarchy level. To be able
to know if a change in cpu capacity for a certain cpu also has an effect
on the system-wide maximum cpu capacity it is normally necessary to
iterate over all cpus. This would be way too costly. That's why this
patch follows a different approach.

The unsigned long max_cpu_capacity value in struct root_domain is
replaced with a struct max_cpu_capacity, containing value (the
max_cpu_capacity) and cpu (the cpu index of the cpu providing the
maximum cpu_capacity).

Changes to the system-wide maximum cpu capacity and the cpu index are
made if:

 1 System-wide maximum cpu capacity < cpu capacity
 2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu

There are no changes to the system-wide maximum cpu capacity in all
other cases.

Atomic read and write access to the pair (max_cpu_capacity.val,
max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock.

The access to max_cpu_capacity.val in task_fits_max() is still performed
without taking the max_cpu_capacity.lock.

The code to set max cpu capacity in build_sched_domains() has been
removed because the whole functionality is now provided by
update_cpu_capacity() instead.

This approach can introduce errors temporarily, e.g. in case the cpu
currently providing the max cpu capacity has its cpu capacity lowered
due to frequency capping and calls update_cpu_capacity() before any cpu
which might provide the max cpu now.

Change-Id: I5063befab088fbf49e5d5e484ce0c6ee6165283a
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>*
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(- Fixed cherry-pick issues, and conflict with a0fe2cf086 "sched/fair:
   Tune down misfit NOHZ kicks" which makes use of max_cpu_capacity
 - Squashed "sched/fair: remove printk while schedule is in progress"
   fix from Caesar Wang <wxt@rock-chips.com>)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:33 +01:00
Dietmar Eggemann
27ad2abc03 ANDROID: sched/fair: add arch scaling function for max frequency capping
To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.

Another subsystem (e.g. cpufreq) or architectural or platform specific
code can overwrite this default implementation, exactly as for frequency
and cpu invariance. It has to be enabled by the arch by defining
arch_scale_max_freq_capacity to the actual implementation.

Change-Id: I770a8b1f4f7340e9e314f71c64a765bf880f4b4d
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
( Fixed conflict with scaling against the PELT-based scale_rt_capacity )
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:33 +01:00
Chris Redpath
79e3a4a27e ANDROID: sched: Unconditionally honor sync flag for energy-aware wakeups
Since we don't do energy-aware wakeups when we are overutilized, always
honoring sync wakeups in this state does not prevent wake-wide mechanics
overruling the flag as normal.

This patch is based upon previous work to build EAS for android products.

sync-hint code taken from commit 4a5e890ec60d
"sched/fair: add tunable to force selection at cpu granularity" written
by Juri Lelli <juri.lelli@arm.com>

Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit f1ec666a62dec1083ed52fe1ddef093b84373aaf)
[ Moved the feature to find_energy_efficient_cpu() and removed the
  sysctl knob ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:33 +01:00
Patrick Bellasi
b61876ed12 ANDROID: sched/fair: EAS: Add uclamp support to find_energy_efficient_cpu()
Utilization clamping can be used to boost the utilization of small tasks
or cap that of big tasks. Thus, one of its possible usages is to bias
tasks placement to "promote" small tasks on higher capacity (less energy
efficient) CPUs or "constraint" big tasks on small capacity (more energy
efficient) CPUs.

When the Energy Aware Scheduler (EAS) looks for the most energy
efficient CPU to run a task on, it currently considers only the
effective utiliation estimated for a task.

Fix this by adding an additional check to skip CPUs which capacity is
smaller then the task clamped utilization.

Change-Id: I43fa6fa27e27c1eb5272c6a45d1a6a5b0faae1aa
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2019-10-01 10:36:32 +01:00
Quentin Perret
4892f51ad5 sched/fair: Avoid redundant EAS calculation
The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@qperret.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/20190920094115.GA11503@qperret.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Qian Cai
763a9ec06c sched/fair: Fix -Wunused-but-set-variable warnings
Commit:

   de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pauld@redhat.com
Fixes: de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Eric W. Biederman
154abafc68 tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Qian Cai
dac9f027b1 sched/fair: Remove unused cfs_rq_clock_task() function
cfs_rq_clock_task() was first introduced and used in:

  f1b17280ef ("sched: Maintain runnable averages across throttled periods")

Over time its use has been graduately removed by the following commits:

  d31b1a66cb ("sched/fair: Factorize PELT update")
  2312729688 ("sched/fair: Update scale invariance of PELT")

Today, there is no single user left, so it can be safely removed.

Found via the -Wunused-function build warning.

Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1568668775-2127-1-git-send-email-cai@lca.pw
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-17 09:55:02 +02:00