config TRACING_SUPPORT has an exception for PPC32, because PPC32
didn't have irqflags tracing support.
But that hasn't been true since commit 5d38902c48 ("powerpc: Add
irqtrace support for 32-bit powerpc") (Jun 2009).
So remove the exception for PPC32 and the comment.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When speculation flaw mitigations are opt-in (via prctl), using seccomp
will automatically opt-in to these protections, since using seccomp
indicates at least some level of sandboxing is desired.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.
This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:
Bit Define Description
0 PR_SPEC_PRCTL Mitigation can be controlled per task by
PR_SET_SPECULATION_CTRL
1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is
disabled
2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is
enabled
If all bits are 0 the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.
PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
The common return values are:
EINVAL prctl is not implemented by the architecture or the unused prctl()
arguments are not 0
ENODEV arg2 is selecting a not supported speculation misfeature
PR_SET_SPECULATION_CTRL has these additional return values:
ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO prctl control of the selected speculation misfeature is disabled
The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.
Based on an initial patch from Tim Chen and mostly rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().
When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.
Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.
However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:
- observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
the park and set TASK_RUNNING, or
- __kthread_bind()'s wait_task_inactive() to observe the competing
TASK_RUNNING store.
Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.
Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.
The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state which,
when the TASK_PARKED store got lost bad things would happen.
The comment near set_current_state() actually mentions this competing
store, but only mentions the case against TASK_RUNNING. This same
store, with different timing, can happen against a subsequent !RUNNING
store.
This normally is not a problem, because as per that same comment, the
!RUNNING state store is inside a condition based wait-loop:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!need_sleep)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.
The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.
There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Matt reported the following deadlock:
CPU0 CPU1
schedule(.prev=migrate/0) <fault>
pick_next_task() ...
idle_balance() migrate_swap()
active_balance() stop_two_cpus()
spin_lock(stopper0->lock)
spin_lock(stopper1->lock)
ttwu(migrate/0)
smp_cond_load_acquire() -- waits for schedule()
stop_one_cpu(1)
spin_lock(stopper1->lock) -- waits for stopper lock
Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context and complete the wakeup.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180420095005.GH4064@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull tracing fixes from Steven Rostedt:
"Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
- Use special field for function pointer in trace event
- Fix igrab issues in uprobes
- Fixes to the new histogram triggers"
* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Do not warn on ENOMEM
tracing: Add field modifier parsing hist error for hist triggers
tracing: Add field parsing hist error for hist triggers
tracing: Restore proper field flag printing when displaying triggers
tracing: initcall: Ordered comparison of function pointers
tracing: Remove igrab() iput() call from uprobes.c
tracing: Fix bad use of igrab in trace_uprobe.c
When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.
With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.
The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.
In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.
In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.
Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.
The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is the io_getevents equivalent of ppoll/pselect and allows to
properly mix signals and aio completions (especially with IOCB_CMD_POLL)
and atomically executes the following sequence:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ret = io_getevents(ctx, min_nr, nr, events, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
Note that unlike many other signal related calls we do not pass a sigmask
size, as that would get us to 7 arguments, which aren't easily supported
by the syscall infrastructure. It seems a lot less painful to just add a
new syscall variant in the unlikely case we're going to increase the
sigset size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Because of how the code flips between tsc-early and tsc clocksources
it might need to mark one or both unstable. The current code in
mark_tsc_unstable() only worked because previously it registered the
tsc clocksource once and then never touched it.
Since it now unregisters the tsc-early clocksource, it needs to know
if a clocksource got unregistered and the current cs->mult test
doesn't work for that. Instead use list_empty(&cs->list) to test for
registration.
Furthermore, since clocksource_mark_unstable() needs to place the cs
on the wd_list, it links the cs->list and cs->wd_list serialization.
It must not see a clocsource registered (!empty cs->list) but already
past dequeue_watchdog(). So place {en,de}queue{,_watchdog}() under the
same lock.
Provided cs->list is initialized to empty, this then allows us to
unconditionally use clocksource_mark_unstable(), regardless of the
registration state.
Fixes: aa83c45762 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Diego Viola <diego.viola@gmail.com>
Cc: len.brown@intel.com
Cc: rjw@rjwysocki.net
Cc: diego.viola@gmail.com
Cc: rui.zhang@intel.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180502135312.GS12217@hirez.programming.kicks-ass.net
tracepoints to bpf core were added as a way to provide introspection
to bpf programs and maps, but after some time it became clear that
this approach is inadequate, so prog_id, map_id and corresponding
get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
introduced and fully adopted by bpftool and other applications.
The tracepoints in bpf core started to rot and causing syzbot warnings:
WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
Kernel panic - not syncing: panic_on_warn set ...
perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
map_update_elem kernel/bpf/syscall.c:597 [inline]
SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
Hence this patch deletes tracepoints in bpf core.
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: syzbot <bot+a9dbb3c3e64b62536a4bc5ee7bbd4ca627566188@syzkaller.appspotmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, the bpf_current_task_under_cgroup helper has a check where if
the BPF program is running in_interrupt(), it will return -EINVAL. This
prevents the helper to be used in many useful scenarios, particularly
BPF programs attached to Perf Events.
This commit removes the check. Tested a few NMI (Perf Event) and some
softirq context, the helper returns the correct result.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull timer fixes from Thomas Gleixner:
"Two fixes from the timer departement:
- Fix a long standing issue in the NOHZ tick code which causes RB
tree corruption, delayed timers and other malfunctions. The cause
for this is code which modifies the expiry time of an enqueued
hrtimer.
- Revert the CLOCK_MONOTONIC/CLOCK_BOOTTIME unification due to
regression reports. Seems userspace _is_ relying on the documented
behaviour despite our hope that it wont"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
tick/sched: Do not mess with an enqueued hrtimer
When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0)
With this patch:
54: R0=inv(id=0,umax_value=800)
54: (bf) r8 = r0
55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
55: (67) r8 <<= 32
56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
56: (c7) r8 s>>= 32
57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.
In our later example,
......
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0)
return 0;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In verifier function adjust_scalar_min_max_vals,
when src_known is false and the opcode is BPF_LSH/BPF_RSH,
early return will happen in the function. So remove
the branch in handling BPF_LSH/BPF_RSH when src_known is false.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
if (usize < 0 || usize > max_len)
return 0;
The verifier may have the following errors:
52: (85) call bpf_get_stack#65
R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R9_w=inv800 R10=fp0,call_-1
53: (bf) r8 = r0
54: (bf) r1 = r8
55: (67) r1 <<= 32
56: (bf) r2 = r1
57: (77) r2 >>= 32
58: (25) if r2 > 0x31f goto pc+33
R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
umax_value=18446744069414584320,
var_off=(0x0; 0xffffffff00000000))
R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8=inv(id=0) R9=inv800 R10=fp0,call_-1
59: (1f) r9 -= r8
60: (c7) r1 s>>= 32
61: (bf) r2 = r7
62: (0f) r2 += r1
math between map_value pointer and register with unbounded
min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.
Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
52: (85) call bpf_get_stack#65
R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R9_w=inv800 R10=fp0,call_-1
53: (b7) r1 = 0
54: (bf) r8 = r0
55: (67) r8 <<= 32
56: (c7) r8 s>>= 32
57: (6d) if r1 s> r8 goto pc+24
R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
R1=inv0 R6=ctx(id=0,off=0,imm=0)
R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
R10=fp0,call_-1
58: (bf) r2 = r7
59: (0f) r2 += r8
60: (1f) r9 -= r8
61: (bf) r1 = r6
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.
This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull modules fix from Jessica Yu:
"Fix display of module section addresses in sysfs, which were getting
hashed with %pK and breaking tools like perf"
* tag 'modules-for-v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Fix display of wrong module .text address
Pull staging fixes from Greg KH:
"Here are two staging driver fixups for 4.17-rc3.
The first is the remaining stragglers of the irda code removal that
you pointed out during the merge window. The second is a fix for the
wilc1000 driver due to a patch that got merged in 4.17-rc1.
Both of these have been in linux-next for a while with no reported
issues"
* tag 'staging-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: wilc1000: fix NULL pointer exception in host_int_parse_assoc_resp_info()
staging: irda: remove remaining remants of irda code removal
try_to_wake_up() might invoke delayacct_blkio_end() while holding the
pi_lock (which is a raw_spinlock_t). delayacct_blkio_end() acquires
task_delay_info.lock which is a spinlock_t. This causes a might sleep splat
on -RT where non raw spinlocks are converted to 'sleeping' spinlocks.
task_delay_info.lock is only held for a short amount of time so it's not a
problem latency wise to make convert it to a raw spinlock.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: https://lkml.kernel.org/r/20180423161024.6710-1-bigeasy@linutronix.de
The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another other CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue, since the control dependency off xchg_tail
is insufficient to enforce sufficient ordering, see:
95bcade33a ("locking/qspinlock: Ensure node is initialised before updating prev->next")
Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-12-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg() loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg() loop without queuing and pass it
around amongst themselves.
This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.
We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.
Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve in practice, it means that it is not
possible to prove liveness properties about the lock and means that lock
acquisition time is unbounded.
Rather than removing the pending->locked spinning from the slowpath
altogether (which has been shown to heavily penalise a 2-threaded
locking stress test on x86), this patch replaces the explicit spinning
with a call to atomic_cond_read_relaxed and allows the architecture to
provide a bound on the number of spins. For architectures that can
respond to changes in cacheline state in their smp_cond_load implementation,
it should be sufficient to use the default bound of 1.
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the user specifies an invalid field modifier for a hist trigger,
the current code correctly flags that as an error, but doesn't tell
the user what happened.
Fix this by invoking hist_err() with an appropriate message when
invalid modifiers are specified.
Before:
# echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
After:
# echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
ERROR: Invalid field modifier: junkusecs
Last command: keys=pid:ts0=common_timestamp.junkusecs
Link: http://lkml.kernel.org/r/b043c59fa79acd06a5f14a1d44dee9e5a3cd1248.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If the user specifies a nonexistent field for a hist trigger, the
current code correctly flags that as an error, but doesn't tell the
user what happened.
Fix this by invoking hist_err() with an appropriate message when
nonexistent fields are specified.
Before:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
After:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
ERROR: Couldn't find field: pid
Last command: keys=pid:ts0=common_timestamp.usecs
Link: http://lkml.kernel.org/r/fdc8746969d16906120f162b99dd71c741e0b62c.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The flag-printing code used when displaying hist triggers somehow got
dropped during refactoring of the inter-event patchset. This restores
it.
Below are a couple examples - in the first case, .usecs wasn't being
displayed properly for common_timestamps and the second illustrates
the same for other flags such as .execname.
Before:
# echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
hist:keys=common_pid:vals=hitcount,count:sort=count:size=2048 [active]
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
hist:keys=pid:vals=hitcount:ts0=common_timestamp:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]
After:
# echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
hist:keys=common_pid.execname:vals=hitcount,count:sort=count:size=2048 [active]
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]
Link: http://lkml.kernel.org/r/492bab42ff21806600af98a8ea901af10efbee0c.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-27
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add extensive BPF helper description into include/uapi/linux/bpf.h
and a new script bpf_helpers_doc.py which allows for generating a
man page out of it. Thus, every helper in BPF now comes with proper
function signature, detailed description and return code explanation,
from Quentin.
2) Migrate the BPF collect metadata tunnel tests from BPF samples over
to the BPF selftests and further extend them with v6 vxlan, geneve
and ipip tests, simplify the ipip tests, improve documentation and
convert to bpf_ntoh*() / bpf_hton*() api, from William.
3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
access stack and packet memory. Extend this to allow such helpers
to also use map values, which enabled use cases where value from
a first lookup can be directly used as a key for a second lookup,
from Paul.
4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
order to retrieve XFRM state information containing SPI, peer
address and reqid values, from Eyal.
5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
and SUB instructions with negative immediate into the opposite
operation with a positive immediate such that nfp can better fit
small immediates into instructions. Savings in instruction count
up to 4% have been observed, from Jakub.
6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
and add support for dumping this through bpftool, from Jiri.
7) Move the BPF sockmap samples over into BPF selftests instead since
sockmap was rather a series of tests than sample anyway and this way
this can be run from automated bots, from John.
8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
with generic XDP, from Nikita.
9) Some follow-up cleanups to BTF, namely, removing unused defines from
BTF uapi header and renaming 'name' struct btf_* members into name_off
to make it more clear they are offsets into string section, from Martin.
10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
not run directly but invoked from test_sock_addr.sh, from Yonghong.
11) Remove redundant ret assignment in sample BPF loader, from Wang.
12) Add couple of missing files to BPF selftest's gitignore, from Anders.
There are two trivial merge conflicts while pulling:
1) Remove samples/sockmap/Makefile since all sockmap tests have been
moved to selftests.
2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
file since git should ignore all of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the siginfo_layout function and enum siginfo_layout to represent
all of the possible field layouts of struct siginfo.
This allows the uses of siginfo_layout in um and arm64 where they are testing
for SIL_FAULT to be more accurate as this rules out the other cases.
Further this allows the switch statements on siginfo_layout to be simpler
if perhaps a little more wordy. Making it easier to understand what is
actually going on.
As SIL_FAULT_BNDERR and SIL_FAULT_PKUERR are never expected to appear
in signalfd just treat them as SIL_FAULT. To include them would take
20 extra bytes an pretty much fill up what is left of
signalfd_siginfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only architecture that does not support SEGV_PKUERR is ia64 and
ia64 has not had 32bit support since some time in 2008. Therefore
copy_siginfo_to_user32 and copy_siginfo_from_user32 do not need to
include support for a missing SEGV_PKUERR.
Compile test on ia64.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>