__radix_tree_insert already checks for duplicates and returns -EEXIST in
that case, so remove the duplicate (and racy) duplicates check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Change the calling convention so that get_dev_pagemap always consumes the
previous reference instead of doing this using an explicit earlier call to
put_dev_pagemap in the callers.
The callers will still need to put the final reference after finishing the
loop over the pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is a pretty big function, which should be out of line in general,
and a no-op stub if CONFIG_ZONE_DEVICЕ is not set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pass the vmem_altmap two levels down instead of needing a lookup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull cgroup fixes from Tejun Heo:
"This contains fixes for the following two non-trivial issues:
- The task iterator got broken while adding thread mode support for
v4.14. It was less visible because it only triggers when both
cgroup1 and cgroup2 hierarchies are in use. The recent versions of
systemd uses cgroup2 for process management even when cgroup1 is
used for resource control exposing this issue.
- cpuset CPU hotplug path could deadlock when racing against exits.
There also are two patches to replace unlimited strcpy() usages with
strlcpy()"
* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
cgroup: Fix deadlock in cpu hotplug path
cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
cgroup: avoid copying strings longer than the buffers
syzbot reported the following panic in the verifier triggered
by kmalloc error injection:
kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
Call Trace:
pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
push_stack kernel/bpf/verifier.c:491 [inline]
check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769
when copy_verifier_state() aborts in the middle due to kmalloc failure
some of the frames could have been partially copied while
current free_verifier_state() loop
for (i = 0; i <= state->curframe; i++)
assumed that all frames are non-null.
Simply fix it by adding 'if (!state)' to free_func_state().
Also avoid stressing copy frame logic more if kzalloc fails
in push_stack() free env->cur_state right away.
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com
Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently we use perf_event_context::task_ctx_data to save and restore
the LBR status when the task is scheduled out and in.
We don't allocate it for child contexts, which results in shorter task's
LBR stack, because we don't save the history from previous run and start
over every time we schedule the task in.
I made a test to generate samples with LBR call stack and got higher
numbers on bigger chain depths:
before: after:
LBR call chain: nr: 1 60561 498127
LBR call chain: nr: 2 0 0
LBR call chain: nr: 3 107030 2172
LBR call chain: nr: 4 466685 62758
LBR call chain: nr: 5 2307319 878046
LBR call chain: nr: 6 48713 495218
LBR call chain: nr: 7 1040 4551
LBR call chain: nr: 8 481 172
LBR call chain: nr: 9 878 120
LBR call chain: nr: 10 2377 6698
LBR call chain: nr: 11 28830 151487
LBR call chain: nr: 12 29347 339867
LBR call chain: nr: 13 4 22
LBR call chain: nr: 14 3 53
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 4af57ef28c ("perf: Add pmu specific data for perf task context")
Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Workqueues can be created early during boot before workqueue subsystem
in fully online - work items are queued waiting for later full
initialization. However, early init wasn't supported for
WQ_MEM_RECLAIM workqueues causing unnecessary annoyances for a subset
of users. Expand early init support to include WQ_MEM_RECLAIM
workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Separate out init_rescuer() from __alloc_workqueue_key() to prepare
for early init support for WQ_MEM_RECLAIM. This patch doesn't
introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Pull vfs fixes from Al Viro:
- untangle sys_close() abuses in xt_bpf
- deal with register_shrinker() failures in sget()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
sget(): handle failures of register_shrinker()
mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
Add psock NULL check to handle a racing sock event that can get the
sk_callback_lock before this case but after xchg happens causing the
refcnt to hit zero and sock user data (psock) to be null and queued
for garbage collection.
Also add a comment in the code because this is a bit subtle and
not obvious in my opinion.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.
The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.
Cc: stable@vger.kernel.org # v4.14
Fixes: 98589a0998 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
s2disk/s2both may fail unnecessarily and erratically if NR_FILE_MAPPED
is high - for instance when using VMs with VirtualBox and perhaps VMware
Player. In those situations s2disk becomes unreliable and therefore
unusable.
A typical scenario is: user issues a s2disk and it fails. User issues
a second s2disk immediately after that and it succeeds. And user
wonders why.
The problem is caused by minimum_image_size() in snapshot.c. The
value it returns is roughly 100% too high because NR_FILE_MAPPED is
subtracted in its calculation. Eventually the number of preallocated
image pages is falsely too low.
This doesn't matter as long as NR_FILE_MAPPED-values are in a normal
range or in 32bit-environments as the code allows for allocation of
additional pages from highmem.
But with the high values generated by VirtualBox-VMs (a 2-GB-VM causes
NR_FILE_MAPPED go up by 2 GB) it may lead to failure in 64bit-systems.
Not subtracting NR_FILE_MAPPED in minimum_image_size() solves the
problem.
I've done at least hundreds of successful s2both/s2disk now on an
x86_64 system (with and without VirtualBox) which gives me some
confidence that this is right. It has turned s2disk/s2both from
unusable into 100% reliable.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=97201
Signed-off-by: Rainer Fiebig <jrf@mailbox.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check
is very wrong, we need time_is_after_jiffies() to make sys_acct() work.
Ignoring the overflows, the code should "goto out" if needcheck >
jiffies, while currently it checks "needcheck < jiffies" and thus in the
likely case check_free_space() does nothing until jiffies overflow.
In particular this means that sys_acct() is simply broken, acct_on()
sets acct->needcheck = jiffies and expects that check_free_space()
should set acct->active = 1 after the free-space check, but this won't
happen if jiffies increments in between.
This was broken by commit 32dc730860 ("get rid of timer in
kern/acct.c") in 2011, then another (correct) commit 795a2f22a8
("acct() should honour the limits from the very beginning") made the
problem more visible.
Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com
Fixes: 32dc730860 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sockmap infrastructure is only aware of TCP sockets at the
moment. In the future we plan to add UDP. In both cases CONFIG_NET
should be built-in.
So lets only build sockmap if CONFIG_INET is enabled.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This was added for some work that was eventually factored out but the
helper call was missed. Remove it now and add it back later if needed.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
0day and kernelCI automatically parse kernel log - basically some sort
of grepping using the pre-defined text patterns - in order to detect
and report regressions/errors. There are several sources they get the
kernel logs from:
a) dmesg or /proc/ksmg
This is the preferred way. Because `dmesg --raw' (see later Note)
and /proc/kmsg output contains facility and log level, which greatly
simplifies grepping for EMERG/ALERT/CRIT/ERR messages.
b) serial consoles
This option is harder to maintain, because serial console messages
don't contain facility and log level.
This patch introduces a `console_msg_format=' command line option,
to switch between different message formatting on serial consoles.
For the time being we have just two options - default and syslog.
The "default" option just keeps the existing format. While the
"syslog" option makes serial console messages to appear in syslog
format [syslog() syscall], matching the `dmesg -S --raw' and
`cat /proc/kmsg' output formats:
- facility and log level
- time stamp (depends on printk_time/PRINTK_TIME)
- message
<%u>[time stamp] text\n
NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually
"dmesg -S --raw" that always prints messages in syslog format [per
Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog
format sometimes. console_msg_format=syslog enables syslog format,
thus in documentation we mention "dmesg -S --raw", not "dmesg --raw".
Per Kevin Hilman:
: Right now we can get this info from a "dmesg --raw" after bootup,
: but it would be really nice in certain automation frameworks to
: have a kernel command-line option to enable printing of loglevels
: in default boot log.
:
: This is especially useful when ingesting kernel logs into advanced
: search/analytics frameworks (I'm playing with and ELK stack: Elastic
: Search, Logstash, Kibana).
:
: The other important reason for having this on the command line is that
: for testing linux-next (and other bleeding edge developer branches),
: it's common that we never make it to userspace, so can't even run
: "dmesg --raw" (or equivalent.) So we really want this on the primary
: boot (serial) console.
Per Fengguang Wu, 0day scripts should quickly benefit from that
feature, because they will be able to switch to a more reliable
parsing, based on messages' facility and log levels [1]:
`#{grep} -a -E -e '^<[0123]>' -e '^kern :(err |crit |alert |emerg )'
instead of doing text pattern matching
`#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} |
grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern |
grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist`
[1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb
Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
- Rename from kdb_send_sig_info to kdb_send_sig
As there is no meaningful siginfo sent
- Use SEND_SIG_PRIV instead of generating a siginfo for a kdb
signal. The generated siginfo had a bogus rationale and was
not correct in the face of pid namespaces. SEND_SIG_PRIV
is simpler and actually correct.
- As the code grabs siglock just send the signal with siglock
held instead of dropping siglock and attempting to grab it again.
- Move the sig_valid test into kdb_kill where it can generate
a good error message.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull pid allocation bug fix from Eric Biederman:
"The replacement of the pid hash table and the pid bitmap with an idr
resulted in an implementation that now fails more often in low memory
situations. Allowing fuzzers to observe bad behavior from a memory
allocation failure during pid allocation.
This is a small change to fix this by making the kernel more robust in
the case of error. The non-error paths are left alone so the only
danger is to the already broken error path. I have manually injected
errors and verified that this new error handling works"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
pid: Handle failure to allocate the first pid in a pid namespace
Pull RCU updates from Paul E. McKenney:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and
in kernel/torture.c). Also a couple of fixes to avoid sending
IPIs to offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends()
and read_barrier_depends().
- Miscellaneous fixes.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export perf_event_update_userpage() so that PMU driver using them,
can be built as modules.
Acked-by: Peter Zilstra <peterz@infradead.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Pull timer fixes from Thomas Gleixner:
"A pile of fixes for long standing issues with the timer wheel and the
NOHZ code:
- Prevent timer base confusion accross the nohz switch, which can
cause unlocked access and data corruption
- Reinitialize the stale base clock on cpu hotplug to prevent subtle
side effects including rollovers on 32bit
- Prevent an interrupt storm when the timer softirq is already
pending caused by tick_nohz_stop_sched_tick()
- Move the timer start tracepoint to a place where it actually makes
sense
- Add documentation to timerqueue functions as they caused confusion
several times now"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timerqueue: Document return values of timerqueue_add/del()
timers: Invoke timer_start_debug() where it makes sense
nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
timers: Reinitialize per cpu bases on hotplug
timers: Use deferrable base independent of base::nohz_active
Pull smp fixlet from Thomas Gleixner:
"A trivial build warning fix for newer compilers"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Move inline keyword at the beginning of declaration
Pull scheduler fixes from Thomas Gleixner:
"Three patches addressing the fallout of the CPU_ISOLATION changes
especially with NO_HZ_FULL plus documentation of boot parameter
dependency"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/isolation: Document boot parameters dependency on CONFIG_CPU_ISOLATION=y
sched/isolation: Enable CONFIG_CPU_ISOLATION=y by default
sched/isolation: Make CONFIG_NO_HZ_FULL select CONFIG_CPU_ISOLATION
Pull irq fixes from Thomas Gleixner:
"A rather large update after the kaisered maintainer finally found time
to handle regression reports.
- The larger part addresses a regression caused by the x86 vector
management rework.
The reservation based model does not work reliably for MSI
interrupts, if they cannot be masked (yes, yet another hw
engineering trainwreck). The reason is that the reservation mode
assigns a dummy vector when the interrupt is allocated and switches
to a real vector when the interrupt is requested.
If the MSI entry cannot be masked then the initialization might
raise an interrupt before the interrupt is requested, which ends up
as spurious interrupt and causes device malfunction and worse. The
fix is to exclude MSI interrupts which do not support masking from
reservation mode and assign a real vector right away.
- Extend the extra lockdep class setup for nested interrupts with a
class for the recently added irq_desc::request_mutex so lockdep can
differeniate and does not emit false positive warnings.
- A ratelimit guard for the bad irq printout so in case a bad irq
comes back immediately the system does not drown in dmesg spam"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
x86/vector: Use IRQD_CAN_RESERVE flag
genirq: Introduce IRQD_CAN_RESERVE flag
genirq/msi: Handle reactivation only on success
gpio: brcmstb: Make really use of the new lockdep class
genirq: Guard handle_bad_irq log messages
kernel/irq: Extend lockdep class for request mutex
Report to the user ifindex and namespace information of offloaded
programs. If device has disappeared return -ENODEV. Specify the
namespace using dev/inode combination.
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.
Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
All bpf offload operations should now be under bpf_devs_lock,
it's safe to free and clear the entire offload structure,
not only the netdev pointer.
__bpf_prog_offload_destroy() will no longer be called multiple
times.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish. This design decision was made when rtnl lock was providing
all the locking. Use the read/write lock instead and remove the
workqueue.
Verifier will now call into the offload code, so dev_ops are moved
to offload structure. Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We currently use aux->offload to indicate that program is bound
to a specific device. This forces us to keep the offload structure
around even after the device is gone. Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We don't need the RTNL lock for all operations on offload state.
We only need to hold it around ndo calls. The device offload
initialization doesn't require it. The soon-to-come querying
of the offload info will only need it partially. We will also
be able to remove the waitqueue in following patches.
Use struct rw_semaphore because map offload will require sleeping
with the semaphore held for read.
Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
subsequently invokes tick_nohz_stop_sched_tick() are:
if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))
If need_resched() is not set, but a timer softirq is pending then this is
an indication that the softirq code punted and delegated the execution to
softirqd. need_resched() is not true because the current interrupted task
takes precedence over softirqd.
Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
timer interrupts because the timer wheel contains an expired timer, but
softirqs are not yet executed. So it returns an immediate expiry request,
which causes the timer to fire immediately again. Lather, rinse and
repeat....
Prevent that by adding a check for a pending timer soft interrupt to the
conditions in tick_nohz_stop_sched_tick() which avoid calling
get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
prevents a repetitive programming of an already expired timer.
Reported-by: Sebastian Siewior <bigeasy@linutronix.d>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
During boot and before base::nohz_active is set in the timer bases, deferrable
timers are enqueued into the standard timer base. This works correctly as
long as base::nohz_active is false.
Once it base::nohz_active is set and a timer which was enqueued before that
is accessed the lock selector code choses the lock of the deferred
base. This causes unlocked access to the standard base and in case the
timer is removed it does not clear the pending flag in the standard base
bitmap which causes get_next_timer_interrupt() to return bogus values.
To prevent that, the deferrable timers must be enqueued in the deferrable
base, even when base::nohz_active is not set. Those deferrable timers also
need to be expired unconditional.
Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
net/ipv6/ip6_gre.c is a case of parallel adds.
include/trace/events/tcp.h is a little bit more tricky. The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>