Currently, if the RCU grace-period kthread has not yet been created,
in which case the starvation-check code will print zero for the state,
which maps to TASK_RUNNING. This could clearly be quite confusing, so
this commit prints ~0, which does not map to any legal ->state value.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Calling set_memory_rw() and set_memory_ro() for every iteration of the
loop in klp_write_object_relocations() is messy, inefficient, and
error-prone.
Change all the read-only pages to read-write before the loop and convert
them back to read-only again afterwards.
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
As agreed with Rusty, we're taking a current module-next pile through
livepatching.git, as it contains solely patches that are pre-requisity
for module page protection cleanups in livepatching. Rusty will be
restarting module-next from scratch.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently, percpu symbols from .data..percpu ELF section of a module are
not copied over and stored in final symtab array of struct module.
Consequently such symbol cannot be returned via kallsyms API (for
example kallsyms_lookup_name). This can be especially confusing when the
percpu symbol is exported. Only its __ksymtab et al. are present in its
symtab.
The culprit is in layout_and_allocate() function where SHF_ALLOC flag is
dropped for .data..percpu section. There is in fact no need to copy the
section to final struct module, because kernel module loader allocates
extra percpu section by itself. Unfortunately only symbols from
SHF_ALLOC sections are copied due to a check in is_core_symbol().
The patch changes is_core_symbol() function to copy over also percpu
symbols (their st_shndx points to .data..percpu ELF section). We do it
only if CONFIG_KALLSYMS_ALL is set to be consistent with the rest of the
function (ELF section is SHF_ALLOC but !SHF_EXECINSTR). Finally
elf_type() returns type 'a' for a percpu symbol because its address is
absolute.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Modules have three sections: text, rodata and writable data. The code
handled the case where these overlapped, however they never can:
debug_align() ensures they are always page-aligned.
This is why we got away with manually traversing the pages in
set_all_modules_text_rw() without rounding.
We create three helper functions: frob_text(), frob_rodata() and
frob_writable_data(). We then call these explicitly at every point,
so it's clear what we're doing.
We also expose module_enable_ro() and module_disable_ro() for
livepatch to use.
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Makes it easier to handle init vs core cleanly, though the change is
fairly invasive across random architectures.
It simplifies the rbtree code immediately, however, while keeping the
core data together in the same cachline (now iff the rbtree code is
enabled).
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
An exact mapping would be within_module_core(), but at this stage
(MODULE_STATE_GOING) the init section is empty, and this is clearer.
Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When setting a module's RO and NX permissions, set_section_ro_nx() is
used, but when clearing them, unset_module_{init,core}_ro_nx() are used.
The unset functions don't have the same checks the set function has for
partial page protections. It's probably harmless, but it's still
confusingly asymmetrical.
Instead, use the same logic to do both. Also add some new
set_module_{init,core}_ro_nx() helper functions for more symmetry with
the unset functions.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal. Especially given that there are regions of the scheduler that
already have interrupts disabled.
This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
by Peter Zijlstra. ]
Currently, rcu_prepare_for_idle() checks for tick_nohz_active, even on
individual NOCBs CPUs, unless all CPUs are marked as NOCBs CPUs at build
time. This check is pointless on NOCBs CPUs because they never have any
callbacks posted, given that all of their callbacks are handed off to the
corresponding rcuo kthread. There is a check for individually designated
NOCBs CPUs, but it pointelessly follows the check for tick_nohz_active.
This commit therefore moves the check for individually designated NOCBs
CPUs up with the check for CONFIG_RCU_NOCB_CPU_ALL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This function no longer has #ifdefs, so this commit removes the
header comment calling them out.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Several releases have come and gone without the warning triggering,
so remove the lock-acquisition loop. Retain the WARN_ON_ONCE()
out of sheer paranoia.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit applies an early-exit approach to rcu_sched_qs(), reducing
the nesting level and saving a line of code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Kconfig currently controlling compilation of this code is:
init/Kconfig:config TREE_RCU_TRACE
init/Kconfig: def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU )
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the file there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We could
consider moving this to an earlier initcall if desired.
We don't replace module.h with init.h since the file already has that.
We also delete the moduleparam.h include that is left over from
commit 64db4cfff9 (""Tree RCU": scalable
classic RCU implementation") since it is not needed here either.
We morph some tags like MODULE_AUTHOR into the comments at the top of
the file for documentation purposes.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[]
arrays needlessly pollute the global namespace within tree.c. This
commit therefore converts them to static local variables within
rcu_init_one().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Expedited grace periods can speed up boot, but are undesirable in
aggressive real-time systems. This commit therefore introduces a
kernel parameter rcupdate.rcu_normal_after_boot that disables
expedited grace periods just before init is spawned.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads. This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds print statements that check the rcu_node structure to
find which ->expmask bits and which ->exp_tasks structures are blocking
the current expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, if a grace period ends just as the stall-warning timeout
fires, an empty stall warning will be printed. This is not helpful,
so this commit avoids these useless warnings by rechecking completion
after awakening in synchronize_sched_expedited_wait().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the piggybacked-work checks carried out by sync_exp_work_done()
atomically increment a small set of variables (the ->expedited_workdone0,
->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
fields in the rcu_state structure), which will form a memory-contention
bottleneck given a sufficiently large number of CPUs concurrently invoking
either synchronize_rcu_expedited() or synchronize_sched_expedited().
This commit therefore moves these for fields to the per-CPU rcu_data
structure, eliminating the memory contention. The show_rcuexp() function
also changes to sum up each field in the rcu_data structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a couple lines of code and reduces indentation
by inverting the sense of an "if" statement in the function
sync_rcu_exp_select_cpus().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The memory barrier in rcu_seq_snap() is needed only for grace periods,
so this commit moves it to the grace-period-oriented wrapper
rcu_exp_gp_seq_snap().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Analogy with the ->qsmaskinitnext field might lead one to believe that
->expmaskinitnext tracks online CPUs. This belief is incorrect: Any CPU
that has ever been online will have its bit set in the ->expmaskinitnext
field. This commit therefore adds a comment to make this clear, at
least to people who read comments.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If there is only one CPU, then invoking synchronize_sched_expedited()
is by definition a grace period. This commit checks for this condition
and does a short-circuit return in that case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon as the earlier ones have to be woken up and take
their turn to get the lock.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s
After patch 3m03.0s 4m37.5s 1m43.0s 2m47.2s
For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt. This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.
The pv_wait_head() function was renamed as pv_wait_head_or_lock()
as it was modified to acquire the lock before returning. This is
necessary because of possible lock stealing attempts from other tasks.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m15.6s 10m56.1s 1m44.1s 5m29.1s
After patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s
For the overcommited case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data which
enable us to calculate the kicking and wakeup latencies which have
a heavy dependency on the CPUs being used.
The statistical counters are per-cpu variables to minimize the
performance overhead in their updates. These counters are exported
via the debugfs filesystem under the qlockstat directory. When the
corresponding debugfs files are read, summation and computing of the
required data are then performed.
The measured latencies for different CPUs are:
CPU Wakeup Kicking
--- ------ -------
Haswell-EX 63.6us 7.4us
Westmere-EX 67.6us 9.3us
The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.
A sample of statistical counters after system bootup (with vCPU
overcommit) was:
pv_hash_hops=1.00
pv_kick_unlock=1148
pv_kick_wake=1146
pv_latency_kick=11040
pv_latency_wake=194840
pv_spurious_wakeup=7
pv_wait_again=4
pv_wait_head=23
pv_wait_node=1129
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.
Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:
10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contains variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.
This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.
By doing so, the perf profile became:
9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch was as follows:
Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.
With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When building a kernel with a gcc 6 snapshot the compiler complains
about unused const static variables for prio_to_weight and prio_to_mult
for multiple scheduler files (all but core.c and autogroup.c)
The way the array is currently declared it will be duplicated in
every scheduler file that includes sched.h, which seems rather wasteful.
Move the array out of line into core.c. I also added a sched_ prefix
to avoid any potential name space collisions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.
This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.
These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we want to change a locking API, but the scheduler uses it, and a conflict
is generated by a recent scheduler fix.
Pick up the pending scheduler fixes to make life easier.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.
Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.
CPU0 CPU1
set_current_state(...)
<preempt_schedule>
context_switch(X, Y)
prepare_lock_switch(Y)
Y->on_cpu = 1;
finish_lock_switch(X)
store_release(X->on_cpu, 0);
try_to_wake_up(X)
LOCK(p->pi_lock);
t = X->on_cpu; // 0
context_switch(Y, X)
prepare_lock_switch(X)
X->on_cpu = 1;
finish_lock_switch(Y)
store_release(Y->on_cpu, 0);
</preempt_schedule>
schedule();
deactivate_task(X);
X->on_rq = 0;
if (X->on_rq) // false
if (t) while (X->on_cpu)
cpu_relax();
context_switch(X, ..)
finish_lock_switch(X)
store_release(X->on_cpu, 0);
Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.
This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.
Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.
The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.
I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:
# while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
S 4348
R 7064566
R 7064766
R 7064967
R 7065168
S 4759
S 4759
During running busy loop, it returns large value.
After applying this patch, we can see right gtime.
# while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
S 5338
R 5365
R 5465
R 5566
R 5666
S 5726
S 5726
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data, this may cause problems. For instance,
When doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated domain rule for clustered scheduling
using cpuset, because the tasks(with all the cpus allowed)
belongs to one root domain can be pulled away into another
root domain.
The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.
Do the same thing for root_domain's other cpumask memembers:
dlo_mask, span, and online.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before the system go to suspend (S3), if user create a timer
with clockid CLOCK_REALTIME_ALARM/CLOCK_BOOTTIME_ALARM and set a
"large" timeout value to this timer. The function
alarmtimer_suspend will be called to setup a timeout value to
RTC timer to avoid the system sleep over time. However, if the
system wakeup early than RTC timeout, the RTC timer will not be
cleared. And this will cause the hpet_rtc_interrupt come
unexpectedly until the RTC timeout. To fix this problem, just
adding alarmtimer_resume to cancel the RTC timer.
This was noticed because the HPET RTC emulation fires an
interrupt every 16ms(=1/2^DEFAULT_RTC_SHIFT) up to the point
where the alarm time is reached.
This program always hits this situation
(https://lkml.org/lkml/2015/11/8/326), if system wake up earlier
than alarm time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com>
[jstultz: Tweak commit subject & formatting slightly]
Signed-off-by: John Stultz <john.stultz@linaro.org>