The only user of this facility is ptp_clock, which does not implement any of
those functions.
Remove them to prevent accidental users. Especially the interval timer
interfaces are now more or less impossible to implement because the
necessary infrastructure has been confined to the core code. Aside of that
it's really complex to make these callbacks implemented according to spec
as the alarm timer implementation demonstrates. If at all then a nanosleep
callback might be a reasonable extension. For now keep just what ptp_clock
needs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20170530211656.145036286@linutronix.de
The alarmtimer code has another source of potentially rearming itself too
fast. Interval timers with a very samll interval have a similar CPU hog
effect as the previously fixed overflow issue.
The reason is that alarmtimers do not implement the normal protection
against this kind of problem which the other posix timer use:
timer expires -> queue signal -> deliver signal -> rearm timer
This scheme brings the rearming under scheduler control and prevents
permanently firing timers which hog the CPU.
Bringing this scheme to the alarm timer code is a major overhaul because it
lacks all the necessary mechanisms completely.
So for a quick fix limit the interval to one jiffie. This is not
problematic in practice as alarmtimers are usually backed by an RTC for
suspend which have 1 second resolution. It could be therefor argued that
the resolution of this clock should be set to 1 second in general, but
that's outside the scope of this fix.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
Andrey reported a alartimer related RCU stall while fuzzing the kernel with
syzkaller.
The reason for this is an overflow in ktime_add() which brings the
resulting time into negative space and causes immediate expiry of the
timer. The following rearm with a small interval does not bring the timer
back into positive space due to the same issue.
This results in a permanent firing alarmtimer which hogs the CPU.
Use ktime_add_safe() instead which detects the overflow and clamps the
result to KTIME_SEC_MAX.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
Some freezer related variables are only used when either CONFIG_POSIX_TIMER
or CONFIG_RTC_CLASS are enabled. Hide them when both are off.
Fixes: d3ba5a9a34 ("posix-timers: Make posix_clocks immutable")
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Helwig <hch@lst.de>
There are no more modular users providing a posix clock. The register
function is now pointless so the posix clock array can be initialized
statically at compile time and the array including the various k_clock
structs can be marked 'const'.
Inspired by changes in the Grsecurity patch set, but done proper.
[ tglx: Massaged changelog and fixed the POSIX_TIMER=n case ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Link: http://lkml.kernel.org/r/20170526090311.3377-3-hch@lst.de
A recent commit added extra printks for CPU/RT limits. This can result in
excessive spam in dmesg.
Make the printks conditional on print_fatal_signals.
Fixes: e7ea7c9806 ("rlimits: Print more information when CPU/RT limits are exceeded")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arun Raghavan <arun@arunraghavan.net>
This restores commit:
24b91e360e: ("nohz: Fix collision between tick and other hrtimers")
... which got reverted by commit:
558e8e27e7: ('Revert "nohz: Fix collision between tick and other hrtimers"')
... due to a regression where CPUs spuriously stopped ticking.
The bug happened when a tick fired too early past its expected expiration:
on IRQ exit the tick was scheduled again to the same deadline but skipped
reprogramming because ts->next_tick still kept in cache the deadline.
This has been fixed now with resetting ts->next_tick from the tick
itself. Extra care has also been taken to prevent from obsolete values
throughout CPU hotplug operations.
When the tick is stopped and an interrupt occurs afterward, we check on
that interrupt exit if the next tick needs to be rescheduled. If it
doesn't need any update, we don't want to do anything.
In order to check if the tick needs an update, we compare it against the
clockevent device deadline. Now that's a problem because the clockevent
device is at a lower level than the tick itself if it is implemented
on top of hrtimer.
Every hrtimer share this clockevent device. So comparing the next tick
deadline against the clockevent device deadline is wrong because the
device may be programmed for another hrtimer whose deadline collides
with the tick. As a result we may end up not reprogramming the tick
accidentally.
In a worst case scenario under full dynticks mode, the tick stops firing
as it is supposed to every 1hz, leaving /proc/stat stalled:
Task in a full dynticks CPU
----------------------------
* hrtimer A is queued 2 seconds ahead
* the tick is stopped, scheduled 1 second ahead
* tick fires 1 second later
* on tick exit, nohz schedules the tick 1 second ahead but sees
the clockevent device is already programmed to that deadline,
fooled by hrtimer A, the tick isn't rescheduled.
* hrtimer A is cancelled before its deadline
* tick never fires again until an interrupt happens...
In order to fix this, store the next tick deadline to the tick_sched
local structure and reuse that value later to check whether we need to
reprogram the clock after an interrupt.
On the other hand, ts->sleep_length still wants to know about the next
clock event and not just the tick, so we want to improve the related
comment to avoid confusion.
Reported-and-tested-by: Tim Wright <tim@binbash.co.uk>
Reported-and-tested-by: Pavel Machek <pavel@ucw.cz>
Reported-by: James Hartsock <hartsjc@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1492783255-5051-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-CPU state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-CPU state is also borked.
So hook into the clocksource watchdog and call a method after we've
found it to still be stable.
There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, per definition, fully avoid funny clock
values.
And since the watchdog runs less often than the tick, this is also an
optimization.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.
There is a lot of good development going on but unfortunately it
didn't quite make the merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root
Pull timer updates from Thomas Gleixner:
"The timer departement delivers:
- more year 2038 rework
- a massive rework of the arm achitected timer
- preparatory patches to allow NTP correction of clock event devices
to avoid early expiry
- the usual pile of fixes and enhancements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
timer/sysclt: Restrict timer migration sysctl values to 0 and 1
arm64/arch_timer: Mark errata handlers as __maybe_unused
Clocksource/mips-gic: Remove redundant non devicetree init
MIPS/Malta: Probe gic-timer via devicetree
clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
acpi/arm64: Add memory-mapped timer support in GTDT driver
clocksource: arm_arch_timer: simplify ACPI support code.
acpi/arm64: Add GTDT table parse driver
clocksource: arm_arch_timer: split MMIO timer probing.
clocksource: arm_arch_timer: add structs to describe MMIO timer
clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
clocksource: arm_arch_timer: refactor arch_timer_needs_probing
clocksource: arm_arch_timer: split dt-only rate handling
x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
um/time: Set ->min_delta_ticks and ->max_delta_ticks
tile/time: Set ->min_delta_ticks and ->max_delta_ticks
score/time: Set ->min_delta_ticks and ->max_delta_ticks
...
CPUCLOCK_PID(which_clock) is a pid value from userspace so compare it
against task_pid_vnr, not current->pid. As task_pid_vnr is in the tasks
pid value in the tasks pid namespace, and current->pid is in the
initial pid namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
struct timespec is not y2038 safe on 32 bit machines. Replace uses of
struct timespec with struct timespec64 in the kernel. The syscall
interfaces themselves will be changed in a separate series.
The clock_getres() interface has also been changed to use timespec64 even
though this particular interface is not affected by the y2038 problem. This
helps verification for internal kernel code for y2038 readiness by getting
rid of time_t/ timeval/ timespec completely.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timekeeping changes from John Stultz:
Main changes are the initial steps of Nicoli's work to make the clockevent
timers be corrected for NTP adjustments. Then a few smaller fixes that
I've queued, and adding Stephen Boyd to the maintainers list for
timekeeping.
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
On systems with a large number of CPUs, running sysrq-<q> can cause
watchdog timeouts. There are two slow sections of code in the sysrq-<q>
path in timer_list.c.
1. print_active_timers() - This function is called by print_cpu() and
contains a slow goto loop. On a machine with hundreds of CPUs, this
loop took approximately 100ms for the first CPU in a NUMA node.
(Subsequent CPUs in the same node ran much quicker.) The total time
to print all of the CPUs is ultimately long enough to trigger the
soft lockup watchdog.
2. print_tickdevice() - This function outputs a large amount of textual
information. This function also took approximately 100ms per CPU.
Since sysrq-<q> is not a performance critical path, there should be no
harm in touching the nmi watchdog in both slow sections above. Touching
it in just one location was insufficient on systems with hundreds of
CPUs as occasional timeouts were still observed during testing.
This issue was observed on an Oracle T7 machine with 128 CPUs, but I
anticipate it may affect other systems with similarly large numbers of
CPUs.
Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The scheduler clock framework may not use the correct timeout for the clock
wrap. This happens when a new clock driver calls sched_clock_register()
after the kernel called sched_clock_postinit(). In this case the clock wrap
timeout is too long thus sched_clock_poll() is called too late and the clock
already wrapped.
On my ARM system the scheduler was no longer scheduling any other task than
the idle task because the sched_clock() wrapped.
Signed-off-by: David Engraf <david.engraf@sysgo.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
A clockevent device's rate should be configured before or at registration
and changed afterwards through clockevents_update_freq() only.
For the configuration at registration, we already have
clockevents_config_and_register().
Right now, there are no clockevents_config() users outside of the
clockevents core.
To mitigiate the risk of drivers errorneously reconfiguring their rates
through clockevents_config() *after* device registration, make
clockevents_config() static.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.
That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.
That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.
This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.
On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.
To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
When a process is sent a SIGKILL because it exceeded CPU or RT limits,
the cause may not be obvious in userspace -- daemonised processes just
get killed, and even foreground process just see a 'Killed' message. The
lack of any information on why this might be happening in logs can be
confusing to users who are not aware of this mechanism.
Add messages which dump the process name and tid in dmesg when a process
exceeds its CPU or RT limits (soft and hard) in order to make it clearer to
people debugging such issues.
Signed-off-by: Arun Raghavan <arun@arunraghavan.net>
Link: http://lkml.kernel.org/r/20170301145309.27214-1-arun@arunraghavan.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer fixes from Ingo Molnar:
"This includes a fix for lockups caused by incorrect nsecs related
cleanup, and a capabilities check fix for timerfd"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
timerfd: Only check CAP_WAKE_ALARM when it is needed
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>