All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
as the time argument and some of them check whether or not cpu_of(rq)
is equal to smp_processor_id() before calling it, so rework it to
take a runqueue pointer as the argument and move the rq_clock(rq)
evaluation into it.
Additionally, provide a wrapper checking cpu_of(rq) against
smp_processor_id() for the cpufreq_update_util() callers that
need it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data. However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.
Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.
Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes. Make it non-modular
too to avoid having to export scheduler variables to modules at
large.
Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Vincent noted that the update_tg_load_avg() usage in commit:
3d30544f02 ("sched/fair: Apply more PELT fixes")
isn't entirely sufficient. We need to call this function every time
cfs_rq->avg.load changes, this includes when update_cfs_rq_load_avg()
returns true, but {attach,detach}_entity_load_avg() themselves also
change it. This means we need to unconditionally call
update_tg_load_avg().
Also, add more comments.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
One additional 'rule' for using update_cfs_rq_load_avg() is that one
should call update_tg_load_avg() if it returns true.
Add a bunch of comments to hopefully clarify some of the rules:
o You need to update cfs_rq _before_ any entity attach/detach,
this is important, because while for mathmatical consisency this
isn't strictly needed, it is required for the physical
interpretation of the model, you attach/detach _now_.
o When you modify the cfs_rq avg, you have to then call
update_tg_load_avg() in order to propagate changes upwards.
o (Fair) entities are always attached, switched_{to,from}_fair()
deal with !fair. This directly follows from the definition of the
cfs_rq averages, namely that they are a direct sum of all
(runnable or blocked) entities on that rq.
It is the second rule that this patch enforces, but it adds comments
pertaining to all of them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent and Yuyang found another few scenarios in which entity
tracking goes wobbly.
The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a
task is always attached to a cfs_rq load average (such that it
includes its blocked contribution) and are explicitly
detached/attached on migration to another cfs_rq.
Scenario 1: switch to fair class
p->sched_class = fair_class;
if (queued)
enqueue_task(p);
...
enqueue_entity()
enqueue_entity_load_avg()
migrated = !sa->last_update_time (true)
if (migrated)
attach_entity_load_avg()
check_class_changed()
switched_from() (!fair)
switched_to() (fair)
switched_to_fair()
attach_entity_load_avg()
If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in
attach_entity_load_avg() _twice_.
Scenario 2: change between cgroups
sched_move_group(p)
if (queued)
dequeue_task()
task_move_group_fair()
detach_task_cfs_rq()
detach_entity_load_avg()
set_task_rq()
attach_task_cfs_rq()
attach_entity_load_avg()
if (queued)
enqueue_task();
...
enqueue_entity()
enqueue_entity_load_avg()
migrated = !sa->last_update_time (true)
if (migrated)
attach_entity_load_avg()
Similar as with scenario 1, if @p is a new task, it will have
!load_update_time and we'll end up in attach_entity_load_avg()
_twice_.
Furthermore, notice how we do a detach_entity_load_avg() on something
that wasn't attached to begin with.
As stated above; the problem is that the new task isn't yet attached
to the load tracking and thereby violates the invariant assumption.
This patch remedies this by ensuring a new task is indeed properly
attached to the load tracking on creation, through
post_init_entity_util_avg().
Of course, this isn't entirely as straightforward as one might think,
since the task is hashed before we call wake_up_new_task() and thus
can be poked at. We avoid this by adding TASK_NEW and teaching
cpu_cgroup_can_attach() to refuse such tasks.
Reported-by: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new fair task is detached and attached from/to task_group with:
cgroup_post_fork()
ss->fork(child) := cpu_cgroup_fork()
sched_move_task()
task_move_group_fair()
Which is wrong, because at this point in fork() the task isn't fully
initialized and it cannot 'move' to another group, because its not
attached to any group as yet.
In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we
can just call this small part directly instead sched_move_task(). And
the task doesn't really migrate because it is not yet attached so we
need the following sequence:
do_fork()
sched_fork()
__set_task_cpu()
cgroup_post_fork()
set_task_rq() # set task group and runqueue
wake_up_new_task()
select_task_rq() can select a new cpu
__set_task_cpu
post_init_entity_util_avg
attach_task_cfs_rq()
activate_task
enqueue_task
This patch makes that happen.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added TASK_SET_GROUP to set depth properly. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vincent reported that when a new task is moved into a new cgroup it
gets attached twice to the load tracking:
sched_move_task()
task_move_group_fair()
detach_task_cfs_rq()
set_task_rq()
attach_task_cfs_rq()
attach_entity_load_avg()
se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0
enqueue_entity()
enqueue_entity_load_avg()
update_cfs_rq_load_avg()
now = clock()
__update_load_avg(&cfs_rq->avg)
cfs_rq->avg.last_load_update = now
// ages load/util for: now - 0, load/util -> 0
if (migrated)
attach_entity_load_avg()
se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
The problem is that we don't update cfs_rq load_avg before all
entity attach/detach operations. Only enqueue_task() and migrate_task()
do this.
By fixing this, the above will not happen, because the
sched_move_task() attach will have updated cfs_rq's last_load_update
time before attach, and in turn the attach will have set the entity's
last_load_update stamp.
Note that there is a further problem with sched_move_task() calling
detach on a task that hasn't yet been attached; this will be taken
care of in a subsequent patch.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The task_fork_fair() callback already calls __set_task_cpu() and takes
rq->lock.
If we move the sched_class::task_fork callback in sched_fork() under
the existing p->pi_lock, right after its set_task_cpu() call, we can
avoid doing two such calls and omit the IRQ disabling on the rq->lock.
Change to __set_task_cpu() to skip the migration bits, this is a new
task, not a migration. Similarly, make wake_up_new_task() use
__set_task_cpu() for the same reason, the task hasn't actually
migrated as it hasn't ever ran.
This cures the problem of calling migrate_task_rq_fair(), which does
remove_entity_from_load_avg() on tasks that have never been added to
the load avg to begin with.
This bug would result in transiently messed up load_avg values, averaged
out after a few dozen milliseconds. This is probably the reason why
this bug was not found for such a long time.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
did something non-obvious but also did it buggy yet latent.
The problem was exposed for real by a later commit in the v4.7 merge window:
2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")
... after which tg->load_avg and cfs_rq->load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).
Add a comment to explain the use of cfs_rq->load.weight over the
'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
for the difference in unit.
Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.
The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2159197d66 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Starting with the following commit:
fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
calc_tg_weight() doesn't compute the right value as expected by effective_load().
The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.
Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.
So stop using calc_tg_weight() and do it explicitly.
The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fde7d22e01 ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As per commit:
b7fa30c9cc ("sched/fair: Fix post_init_entity_util_avg() serialization")
> the code generated from update_cfs_rq_load_avg():
>
> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> sa->load_avg = max_t(long, sa->load_avg - r, 0);
> sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> removed_load = 1;
> }
>
> turns into:
>
> ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
> ffffffff8108706b: 48 85 c0 test %rax,%rax
> ffffffff8108706e: 74 40 je ffffffff810870b0 <update_blocked_averages+0xc0>
> ffffffff81087070: 4c 89 f8 mov %r15,%rax
> ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
> ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
> ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
> ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
> ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
> ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.
So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.
Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW its
effectively the same as having these variables be of signed type.
This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.
Note: GCC generates crap code for this, might warrant a look later.
Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
maybe we should do clamping on add too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pavan reported that in the presence of very light tasks (or cgroups)
the placement of migrated tasks can cause severe fairness issues.
The problem is that enqueue_entity() places the task before it updates
time, thereby it can place the task far in the past (remember that
light tasks will shoot virtual time forward at a high speed, so in
relation to the pre-existing light task, we can land far in the past).
This is done because update_curr() needs the current task, and we
might be placing the current task.
The obvious solution is to differentiate between the current and any
other task; placing the current before we update time, and placing any
other task after, such that !curr tasks end up at the current moment
in time, and not in the past.
This commit re-introduces the previously reverted commit:
3a47d5124a ("sched/fair: Fix fairness issue on migration")
... which is now safe to do, after we've also fixed another
underlying bug first, in:
sched/fair: Prepare to fix fairness problems on migration
and cleaned up other details in the migration code:
sched/core: Kill sched_class::task_waking
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reported that our recent attempt to fix migration problems:
3a47d5124a ("sched/fair: Fix fairness issue on migration")
broke interactivity and the signal starve test. We reverted that
commit and now let's try it again more carefully, with some other
underlying problems fixed first.
One problem is that I assumed ENQUEUE_WAKING was only set when we do a
cross-cpu wakeup (migration), which isn't true. This means we now
destroy the vruntime history of tasks and wakeup-preemption suffers.
Cure this by making my assumption true, only call
sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
the indirect call in the case we do a local wakeup.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: linux-kernel@vger.kernel.org
Fixes: 3a47d5124a ("sched/fair: Fix fairness issue on migration")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
mode and we try to minimize the ticks as much as possible. The nohz
subsystem uses a confusing terminology with the internal state
"ts->tick_stopped" which is also available through its public interface
with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
reduced with the best effort rather than stopped. In the best case the
tick can indeed be actually stopped but there is no guarantee about that.
If a timer needs to fire one second later, a tick will fire while the
CPU is in nohz mode and this is a very common scenario.
Now this confusion happens to be a problem with CPU load updates:
cpu_load_update_active() doesn't handle nohz ticks correctly because it
assumes that ticks are completely stopped in nohz mode and that
cpu_load_update_active() can't be called in dynticks mode. When that
happens, the whole previous tickless load is ignored and the function
just records the load for the current tick, ignoring potentially long
idle periods behind.
In order to solve this, we could account the current load for the
previous nohz time but there is a risk that we account the load of a
task that got freshly enqueued for the whole nohz period.
So instead, lets record the dynticks load on nohz frame entry so we know
what to record in case of nohz ticks, then use this record to account
the tickless load on nohz ticks and nohz frame end.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When asymmetric packing is set in the sched_domain and target CPU is
busy, update_sd_pick_busiest() may not select the busiest runqueue.
When target CPU is busy, find_busiest_group() will ignore checks for
asym packing and may continue to load balance using the currently
selected not-the-busiest runqueue as source runqueue.
Selecting the busiest runqueue as source when the target CPU is busy,
should result in achieving much better load balance.
Also when target CPU is not busy and asymmetric packing is set in sd,
select higher CPU as source CPU for load balancing.
While doing this change, move the check to see if target CPU is busy
into check_asym_packing().
The extent of performance benefit from this change decreases with the
increasing load. However there is benefit in undercommit as well as
overcommit conditions.
1. Record per second ebizzy (32 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 5223767.00 10368236.00 7946971.00 1753094.76
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 8617191.00 13872356.00 11383980.00 1783400.89 +24.78%
2. Record per second ebizzy (64 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 6497666.00 18399783.00 10818093.20 4051452.08
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 7567365.00 19456937.00 11674063.60 4295407.48 +4.40%
3. Record per second ebizzy (128 threads) on a 64 CPU power 7 box. (5 iterations)
4.6.0-rc2
Testcase: Min Max Avg StdDev
ebizzy: 37073983.00 40341911.00 38776241.80 1259766.82
4.6.0-rc2+asym-changes
Testcase: Min Max Avg StdDev %Change
ebizzy: 38030399.00 41333378.00 39827404.40 1255001.86 +2.54%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1459948660-16073-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>