The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/20140822132605.GA20130@ada
Signed-off-by: Ingo Molnar <mingo@kernel.org>
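A rough userspace analogy of the distinction drawn above, using C11 atomics
rather than the kernel macros. Publishing a non-NULL pointer needs a release
store so that readers see the initialised object. Storing NULL publishes
nothing, so a cheaper relaxed store is enough. The names publish_node(),
clear_node() and read_value_or() are hypothetical and only illustrate the idea;
this is not how the kernel implements either macro.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int value;
};

/* Hypothetical stand-in for rcu_assign_pointer(): the release store orders
 * the earlier initialisation of *n before the pointer becomes visible. */
static void publish_node(struct node *_Atomic *slot, struct node *n)
{
	assert(n != NULL);
	atomic_store_explicit(slot, n, memory_order_release);
}

/* Hypothetical stand-in for RCU_INIT_POINTER(p, NULL): there is no object
 * whose initialisation must be ordered, so a relaxed store suffices. */
static void clear_node(struct node *_Atomic *slot)
{
	atomic_store_explicit(slot, (struct node *)NULL, memory_order_relaxed);
}

/* Reader side: the acquire load pairs with the release store above. */
static int read_value_or(struct node *_Atomic *slot, int fallback)
{
	struct node *n = atomic_load_explicit(slot, memory_order_acquire);

	return n ? n->value : fallback;
}
```

A reader that observes NULL never dereferences anything, which is why the
ordering guarantee can be dropped for that one case.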
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: paulmck@linux.vnet.ibm.com
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: http://lkml.kernel.org/r/20140822141536.GA32051@ada
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current code can fail to migrate a waking task (silently) when TTWU_QUEUE is
enabled.
When a task is waking, it is pending on the wake_list of the rq, but it is not
queued (task->on_rq == 0). In this case, set_cpus_allowed_ptr() and
__migrate_task() will not migrate it because it is invisible to them.
This behavior is incorrect: because the task has already been woken, it will
run on the wrong CPU, without correct placement, until the next wake-up or
update of cpus_allowed.
To fix this problem, we need to finish the pending wakeups (so that the tasks
appear on the runqueue) before we migrate them.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Tested-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/538ED7EB.5050303@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enhance the test_suspend boot parameter to repeat tests multiple times
by adding an optional repeat count. The new boot parameter syntax is:
test_suspend="mem|freeze|standby[,N]"
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
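A minimal userspace sketch of how a value in the "mem|freeze|standby[,N]"
syntax could be split into a state label and a repeat count. The function
parse_test_suspend(), the struct suspend_test_opt, and the default of one
iteration when N is omitted are assumptions made for illustration; the kernel's
actual parser may differ.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: which state to test and how many times. */
struct suspend_test_opt {
	char label[16];
	int repeat;
};

/* Returns 0 on success, -1 if the label or the repeat count is invalid. */
static int parse_test_suspend(const char *arg, struct suspend_test_opt *out)
{
	static const char *const labels[] = { "mem", "freeze", "standby" };
	const char *comma;
	size_t len, i;

	assert(arg != NULL && out != NULL);
	comma = strchr(arg, ',');
	len = comma ? (size_t)(comma - arg) : strlen(arg);

	out->repeat = 1;	/* assumed default when ",N" is omitted */
	if (comma) {
		char *end;
		long n = strtol(comma + 1, &end, 10);

		/* Reject empty, non-numeric, non-positive or huge counts. */
		if (end == comma + 1 || *end != '\0' || n < 1 || n > INT_MAX)
			return -1;
		out->repeat = (int)n;
	}

	for (i = 0; i < sizeof(labels) / sizeof(labels[0]); i++) {
		if (strlen(labels[i]) == len && !strncmp(arg, labels[i], len)) {
			memcpy(out->label, labels[i], len + 1);
			return 0;
		}
	}
	return -1;
}
```

With this sketch, "mem,3" yields the label "mem" with three repetitions, while
"freeze" alone yields a single run.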
Pull cgroup fixes from Tejun Heo:
"This pull request includes Alban's patch to disallow '\n' in cgroup
names.
Two other patches from Li to fix a possible oops when cgroup
destruction races against other file operations and one from Vivek to
fix a unified hierarchy devel behavior"
* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: check cgroup liveliness before unbreaking kernfs
cgroup: delay the clearing of cgrp->kn->priv
cgroup: Display legacy cgroup files on default hierarchy
cgroup: reject cgroup names with '\n'
Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.
This patch doesn't make any functional difference.
v2: blk-mq conversion was missing. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
The rcu_bh_qs(), rcu_preempt_qs(), and rcu_sched_qs() functions use
old-style per-CPU variable access and write to ->passed_quiesce even
if it is already set. This commit therefore updates to use the new-style
per-CPU variable access functions and avoids the spurious writes.
This commit also eliminates the "cpu" argument to these functions because
they are always invoked on the indicated CPU.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
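The "avoid spurious writes" part of the change above can be pictured with a
small check-before-store sketch. The struct qs_data and its writes counter are
hypothetical and exist only to make the effect observable; the per-CPU access
functions used by the real code are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU RCU data of one CPU. */
struct qs_data {
	bool passed_quiesce;
	unsigned int writes;	/* counts stores, for illustration only */
};

/* Record a quiescent state, storing only if the flag is not yet set, so
 * repeated calls do not keep dirtying the cache line. */
static void note_quiescent_state(struct qs_data *d)
{
	if (!d->passed_quiesce) {
		d->passed_quiesce = true;
		d->writes++;
	}
}
```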
The rcu_preempt_note_context_switch() function is on a scheduling fast
path, so it would be good to avoid disabling irqs. The reason that irqs
are disabled is to synchronize process-level and irq-handler access to
the task_struct ->rcu_read_unlock_special bitmask. This commit therefore
makes ->rcu_read_unlock_special instead be a union of bools with a short
allowing single-access checks in RCU's __rcu_read_unlock(). This results
in the process-level and irq-handler accesses being simple loads and
stores, so that irqs need no longer be disabled. This commit therefore
removes the irq disabling from rcu_preempt_note_context_switch().
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
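A sketch of the "union of bools with a short" idea described above. Each
condition is set or cleared with a simple one-byte store, while a single load
of the short answers "is anything set?". The type and field names here are
illustrative assumptions, and the sketch assumes that sizeof(bool) is 1 so that
two bools exactly overlay a two-byte short.

```c
#include <assert.h>
#include <stdbool.h>

/* Two independent flags overlaid with one short for a combined check. */
union special_flags {
	struct {
		bool blocked;
		bool need_qs;
	} b;
	short s;
};

/* The overlay only works if the two views have the same size. */
_Static_assert(sizeof(((union special_flags *)0)->b) == sizeof(short),
	       "bool pair must exactly overlay the short");

/* One load covers both flags, so the fast path needs a single test. */
static bool any_special(const union special_flags *f)
{
	return f->s != 0;
}
```

Because each flag lives in its own byte, setting one never requires a
read-modify-write of the other, which is what removes the need for irq
disabling in the description above.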
The grace-period-wait loop in rcu_tasks_kthread() is under (unnecessary)
RCU protection, and therefore has no preemption points in a PREEMPT=n
kernel. This commit therefore removes the RCU protection and inserts
cond_resched().
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution. There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state. The grace period would therefore
extend indefinitely. This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU. The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is expected that many sites will have CONFIG_TASKS_RCU=y, but
will never actually invoke call_rcu_tasks(). For such sites, creating
rcu_tasks_kthread() at boot is wasteful. This commit therefore defers
creation of this kthread until the time of the first call_rcu_tasks().
This of course means that the first call_rcu_tasks() must be invoked
from process context after the scheduler is fully operational.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current RCU-tasks implementation uses strict polling to detect
callback arrivals. This works quite well, but is not so good for
energy efficiency. This commit therefore replaces the strict polling
with a wait queue.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a ten-minute RCU-tasks stall warning. The actual
time is controlled by the boot/sysfs parameter rcu_task_stall_timeout,
with values less than or equal to zero disabling the stall warnings.
The default value is ten minutes, which means that the tasks that have
not yet responded will get their stacks dumped every ten minutes, until
they pass through a voluntary context switch.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds torture tests for RCU-tasks. It also fixes a bug that
would segfault for an RCU flavor lacking a callback-barrier function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit exports the RCU-tasks synchronous APIs,
synchronize_rcu_tasks() and rcu_barrier_tasks(), to
GPL-licensed kernel modules.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Once a task has passed exit_notify() in the do_exit() code path, it
is no longer on the task lists, and is therefore no longer visible
to rcu_tasks_kthread(). This means that an almost-exited task might
be preempted while within a trampoline, and this task won't be waited
on by rcu_tasks_kthread(). This commit fixes this bug by adding an
srcu_struct. An exiting task does srcu_read_lock() just before calling
exit_notify(), and does the corresponding srcu_read_unlock() after
doing the final preempt_disable(). This means that rcu_tasks_kthread()
can do synchronize_srcu() to wait for all mostly-exited tasks to reach
their final preempt_disable() region, and then use synchronize_sched()
to wait for those tasks to finish exiting.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absence in rcutorture,
so this commit adds them. The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks. In some cases, this requires
instrumenting cond_resched(). However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.
This commit currently instruments only RCU-tasks. Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks(). This RCU flavor's quiescent states are voluntary
context switch (not preemption!) and userspace execution (not the idle
loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks). Note that unlike other RCU flavors, these quiescent states
occur in tasks, not necessarily CPUs. Includes fixes from Steven Rostedt.
This RCU flavor is assumed to have very infrequent latency-tolerant
updaters. This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single task-private linked list containing all tasks that have not
yet passed through a quiescent state. If experience shows this assumption
to be incorrect, the required additional complexity will be added.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although RCU is designed to handle arbitrary floods of callbacks, this
capability is not routinely tested. This commit therefore adds a
cbflood capability in which kthreads repeatedly register large numbers
of callbacks. One such kthread is created for each four CPUs (rounding
up), and the test may be controlled by several cbflood_* kernel boot
parameters, which control the number of bursts per flood, the number
of callbacks per burst, the time between bursts, and the time between
floods. The default values are large enough to exercise RCU's emergency
responses to callback flooding.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
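The "one kthread for each four CPUs (rounding up)" rule above is a ceiling
division. A minimal sketch, with the hypothetical helper name
cbflood_nr_kthreads():

```c
#include <assert.h>

/* Number of flood kthreads for ncpus CPUs: ceil(ncpus / 4). */
static int cbflood_nr_kthreads(int ncpus)
{
	assert(ncpus > 0);
	return (ncpus + 3) / 4;
}
```

So 1 to 4 CPUs get one kthread, 5 to 8 CPUs get two, and so on.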
Use pr_alert/pr_cont to print the logs from the rcutorture module directly
instead of writing them to a buffer and then printing it. This avoids
having to allocate such buffers. Also remove the resulting empty function.
I tested this using the parse-torture.sh script as follows:
$ dmesg | grep torture > log.txt
$ bash parse-torture.sh log.txt test
$
There were no warnings, which means that parsing went fine.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit fixes the following sparse warning by marking boost_mutex
static:
kernel/rcu/rcutorture.c:185:1: warning: symbol 'boost_mutex' was not declared. Should it be static?
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible. Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:
* Is the thread we are trying to wake up the current thread?
* Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
* Is there no thread created for this flavour, hence nothing to wake up?
If any one of these conditions is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.
Idle:
Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0
rcutorture:
Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0
Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.
There is a comment about an implied barrier supplied by the wake_up()
logic. This barrier is necessary for the awakened thread to see the
updated ->gp_flags. This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags, which provides proper ordering.
Hence this commit tries to avoid calling wake_up() whenever we can by
using the rcu_gp_kthread_wake() function.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
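The three checks listed above can be sketched as a small guard in front of the
wakeup. This is a userspace illustration only: struct gp_state, its wakeups
counter (standing in for the real wake_up() call) and the opaque task pointers
are assumptions, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-flavor state. */
struct gp_state {
	const void *gp_kthread;	/* NULL if no kthread was created */
	unsigned int gp_flags;	/* the kthread waits for this to be non-zero */
	unsigned int wakeups;	/* stand-in for calls to wake_up() */
};

/* Returns true if a wakeup was actually issued. */
static bool gp_kthread_wake(struct gp_state *st, const void *current_task)
{
	if (current_task == st->gp_kthread ||	/* case 1: waking ourselves */
	    !st->gp_flags ||			/* case 2: nothing to wait for */
	    !st->gp_kthread)			/* case 3: no kthread exists */
		return false;
	st->wakeups++;
	return true;
}
```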
The rcu_idle_enter_common() and rcu_idle_exit_common() functions contain
error checks that, to the best of my knowledge, have never triggered
over the past several years. These are nevertheless valuable when
creating new architectures or doing other low-level changes, so the
checks should not be deleted. This commit instead places these checks
under #ifdef CONFIG_RCU_TRACE so that they are executed only when
specifically requested.
The savings are significant:
Before:
   text    data     bss     dec     hex filename
   1749      39       0    1788     6fc /tmp/b/kernel/rcu/tiny.o
    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
                           ----
                           2572
After:
   text    data     bss     dec     hex filename
   1281      37       0    1318     526 /tmp/b/kernel/rcu/tiny.o
    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
                           ----
                           2102
This amounts to 470 bytes, or 18% of the original.
Switched from #ifdef to IS_ENABLED() on Josh Triplett's advice.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler
and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake
the rcuo kthread due to the queue being initially empty, but did not
do anything for the case where the queue was overflowing. This commit
therefore also defers wakeup for the overflow case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes a stale comment in rcu/tree.c that was left
behind when some code was moved around previously in commit 2036d94a7b
("rcu: Rework detection of use of RCU by offline CPUs"). For reference,
the following updated comment exists a few lines below and means
the same:
/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit f7f7bac9cb ("rcu: Have the RCU tracepoints use the tracepoint_string
infrastructure") unconditionally populates the __tracepoint_str input section,
but this section is not assigned an output section if CONFIG_TRACING is not set.
This results in the __tracepoint_str turning up in unexpected places, i.e.,
after _edata.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit uninlines rcu_read_lock_held(). According to "size vmlinux"
this saves 28549 bytes in .text:
- 5541731 3014560 14757888 23314179
+ 5513182 3026848 14757888 23297918
Note: it looks as if the data grows by 12288 bytes, but it does not
actually grow. .data starts with ALIGN(THREAD_SIZE), and since .text
shrinks, the padding grows, and thus .data appears to grow as
seen by /bin/size. diff System.map:
- ffffffff81510000 D _sdata
- ffffffff81510000 D init_thread_union
+ ffffffff81509000 D _sdata
+ ffffffff8150c000 D init_thread_union
Perhaps we can change vmlinux.lds.S to .data itself, so that /bin/size
can't "wrongly" report that .data grows if .text shrinks.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit uses true/false instead of 1/0 for bool types in rcu_gp_fqs()
and force_qs_rnp().
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Fix the sparse warning about rcu_batches_completed_preempt() being non-static
by marking it as static.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
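To illustrate the rule above, a compound operation such as an increment
through ACCESS_ONCE() both loads and stores in one expression. Splitting it
gives one marked load and one marked store. The macro ACCESS_ONCE_SKETCH()
below is a userspace approximation that relies on the GNU __typeof__ extension,
and event_count is a hypothetical variable.

```c
#include <assert.h>

/* Userspace approximation: force exactly one volatile access to x. */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

static unsigned long event_count;

/* Instead of ACCESS_ONCE_SKETCH(event_count)++, which mixes a load and a
 * store in one expression, do each access separately. */
static unsigned long bump_event_count(void)
{
	unsigned long old = ACCESS_ONCE_SKETCH(event_count);	/* load only */

	ACCESS_ONCE_SKETCH(event_count) = old + 1;		/* store only */
	return old + 1;
}
```

Note that the split form is no more atomic than the original; it only makes
explicit which access is the load and which is the store.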
Pull ACPI and power management fixes from Rafael Wysocki:
"These are regression fixes (ACPI sysfs, ACPI video, suspend test),
ACPI cpuidle deadlock fix, missing runtime validation of ACPI _DSD
output, a fix and a new CPU ID for the RAPL driver, new blacklist
entry for the ACPI EC driver and a couple of trivial cleanups
(intel_pstate and generic PM domains).
Specifics:
- Fix for recently broken test_suspend= command line argument (Rafael
Wysocki).
- Fixes for regressions related to the ACPI video driver caused by
switching the default to native backlight handling in 3.16 from
Hans de Goede.
- Fix for a sysfs attribute of ACPI device objects that returns stale
values sometimes due to the fact that they are cached instead of
executing the appropriate method (_SUN) every time (broken in
3.14). From Yasuaki Ishimatsu.
- Fix for a deadlock between cpuidle_lock and cpu_hotplug.lock in the
ACPI processor driver from Jiri Kosina.
- Runtime output validation for the ACPI _DSD device configuration
object missing from the support for it that has been introduced
recently. From Mika Westerberg.
- Fix for an unuseful and misleading RAPL (Running Average Power
Limit) domain detection message in the RAPL driver from Jacob Pan.
- New Intel Haswell CPU ID for the RAPL driver from Jason Baron.
- New Clevo W350etq blacklist entry for the ACPI EC driver from Lan
Tianyu.
- Cleanup for the intel_pstate driver and the core generic PM domains
code from Gabriele Mazzotta and Geert Uytterhoeven"
* tag 'pm+acpi-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock
ACPI / scan: not cache _SUN value in struct acpi_device_pnp
cpufreq: intel_pstate: Remove unneeded variable
powercap / RAPL: change domain detection message
powercap / RAPL: add support for CPU model 0x3f
PM / domains: Make generic_pm_domain.name const
PM / sleep: Fix test_suspend= command line option
ACPI / EC: Add msi quirk for Clevo W350etq
ACPI / video: Disable native_backlight on HP ENVY 15 Notebook PC
ACPI / video: Add a disable_native_backlight quirk
ACPI / video: Fix use_native_backlight selection logic
ACPICA: ACPI 5.1: Add support for runtime validation of _DSD package.
Pull RCU fix from Ingo Molnar:
"A boot hang fix for the offloaded callback RCU model (RCU_NOCB_CPU=y
&& (TREE_RCU=y || TREE_PREEMPT_RCU=y)) in certain bootup scenarios"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Make nocb leader kthreads process pending callbacks after spawning
Pull timer fixes from Thomas Gleixner:
"Three fixlets from the timer department:
- Update the timekeeper before updating vsyscall and pvclock. This
fixes the kvm-clock regression reported by Chris and Paolo.
- Use the proper irq work interface from NMI. This fixes the
regression reported by Catalin and Dave.
- Clarify the compat_nanosleep error handling mechanism to avoid
future confusion"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Update timekeeper before updating vsyscall and pvclock
compat: nanosleep: Clarify error handling
nohz: Restore NMI safe local irq work for local nohz kick
An overrun could happen in function start_hrtick_dl()
when a task with SCHED_DEADLINE runs in the microseconds
range.
For example, if a task with SCHED_DEADLINE has the following parameters:
  Task  runtime  deadline  period
  P1    200us    500us     500us
The deadline and period of task P1 are less than 1ms.
In order to achieve microsecond precision, we need to enable the HRTICK
feature. The following commands enable it, start tracing, and run the test:
PC#echo "HRTICK" > /sys/kernel/debug/sched_features
PC#trace-cmd record -e sched_switch &
PC#./schedtool -E -t 200000:500000:500000 -e ./test
The binary test is in an endless while(1) loop here.
Some pieces of trace.dat are as follows:
<idle>-0 157.603157: sched_switch: :R ==> 2481:4294967295: test
test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2
<idle>-0 157.605657: sched_switch: :R ==> 2481:4294967295: test
test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd
trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test
We can get the runtime of P1 from the information above:
runtime = 157.608183 - 157.605657
runtime = 0.002526s (2.526ms)
The correct runtime should be less than or equal to 200us.
The problem is caused by the conditional check "delta > 10000"
in start_hrtick_dl(): no hrtimer is started to control the remaining
runtime when the remaining runtime is less than 10us, so the task
continues to run until the next tick arrives.
Move the code that enforces the minimum time slice
from hrtick_start_fair() to hrtick_start(), because the
EDF scheduling class also needs it in start_hrtick_dl().
To fix this problem, we call hrtick_start() unconditionally in
start_hrtick_dl(), and make sure the scheduling slice won't be smaller
than 10us in hrtick_start().
Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com>
Reviewed-by: Li Zefan <lizefan@huawei.com>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
[ Massaged the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
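The fix described above amounts to always arming the timer while clamping the
delay to a 10us floor, rather than skipping the timer when the remaining
runtime is small. A minimal sketch of that clamp, with the hypothetical names
hrtick_delay_ns() and MIN_HRTICK_DELAY_NS:

```c
#include <assert.h>
#include <stdint.h>

/* 10us floor from the description above, in nanoseconds. */
#define MIN_HRTICK_DELAY_NS 10000ULL

/* Delay to program for the given remaining runtime: never below the floor,
 * so a timer is always armed even for very small remainders. */
static uint64_t hrtick_delay_ns(uint64_t remaining_runtime_ns)
{
	return remaining_runtime_ns < MIN_HRTICK_DELAY_NS ?
		MIN_HRTICK_DELAY_NS : remaining_runtime_ns;
}
```

For task P1, a full 200us budget is passed through unchanged, while a 5us
remainder is rounded up to 10us instead of leaving the task to run until the
next regular tick.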
The update_walltime() code works on the shadow timekeeper to make the
seqcount protected region as short as possible. But that update to the
shadow timekeeper does not update all timekeeper fields because it's
sufficient to do that once before it becomes live. One of these fields
is tkr.base_mono. That stays stale in the shadow timekeeper unless an
operation happens which copies the real timekeeper to the shadow.
The update function is called after the update calls to vsyscall and
pvclock. While not correct, it did not cause any problems because none
of the invoked update functions used base_mono.
Commit cbcf2dd3b3 (x86: kvm: Make kvm_get_time_and_clockread()
nanoseconds based) changed that in the kvm pvclock update function, so
the stale base_mono value got used and caused kvm-clock to malfunction.
Put the update where it belongs and fix the issue.
Reported-by: Chris J Arges <chris.j.arges@canonical.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The error handling in compat_sys_nanosleep() is correct, but
completely non-obvious. Document it and restrict it to the
-ERESTART_RESTARTBLOCK return value for clarity.
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>