Anatoly has been fuzzing with kBdysch harness and reported a hang in one
of the outcomes:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (85) call bpf_get_socket_cookie#46
1: R0_w=invP(id=0) R10=fp0
1: (57) r0 &= 808464432
2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
2: (14) w0 -= 810299440
3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
3: (c4) w0 s>>= 1
4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
4: (76) if w0 s>= 0x30303030 goto pc+216
221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
221: (95) exit
processed 6 insns (limit 1000000) [...]
Taking a closer look, the program was xlated as follows:
# ./bpftool p d x i 12
0: (85) call bpf_get_socket_cookie#7800896
1: (bf) r6 = r0
2: (57) r6 &= 808464432
3: (14) w6 -= 810299440
4: (c4) w6 s>>= 1
5: (76) if w6 s>= 0x30303030 goto pc+216
6: (05) goto pc-1
7: (05) goto pc-1
8: (05) goto pc-1
[...]
220: (05) goto pc-1
221: (05) goto pc-1
222: (95) exit
Meaning, the visible effect is very similar to f54c7898ed ("bpf: Fix
precision tracking for unbounded scalars"), that is, the fall-through
branch in the instruction 5 is considered to be never taken given the
conclusion from the min/max bounds tracking in w6, and therefore the
dead-code sanitation rewrites it as goto pc-1. However, real-life input
disagrees with verification analysis since a soft-lockup was observed.
The bug sits in the analysis of the ARSH. The definition is that we shift
the target register value right by K bits through shifting in copies of
its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
register into 32 bit mode, same happens after simulating the operation.
However, for the case of simulating the actual ARSH, we don't take the
mode into account and act as if it's always 64 bit, but location of sign
bit is different:
dst_reg->smin_value >>= umin_val;
dst_reg->smax_value >>= umin_val;
dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
for example return 0xffff. With the above ARSH simulation, we'd see the
following results:
[...]
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (57) r0 &= 808464432
-> R0_runtime = 0x3030
3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
3: (14) w0 -= 810299440
-> R0_runtime = 0xcfb40000
4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
(0xffffffff)
4: (c4) w0 s>>= 1
-> R0_runtime = 0xe7da0000
5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0x7ffbfff8)
[...]
In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
and means here that the simulation didn't retain the sign bit. With above
logic, the updates happen on the 64 bit min/max bounds and given we coerced
the register, the sign bits of the bounds are cleared as well, meaning, we
need to force the simulation into s32 space for 32 bit alu mode.
Verification after the fix below. We're first analyzing the fall-through branch
on 32 bit signed >= test eventually leading to rejection of the program in this
specific case:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r2 = 808464432
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (bf) r6 = r0
3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
3: (57) r6 &= 808464432
4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
4: (14) w6 -= 810299440
5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
5: (c4) w6 s>>= 1
6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0xfffbfff8)
6: (76) if w6 s>= 0x30303030 goto pc+216
7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
7: (30) r0 = *(u8 *)skb[808464432]
BPF_LD_[ABS|IND] uses reserved fields
processed 8 insns (limit 1000000) [...]
Fixes: 9cbe1f5a32 ("bpf/verifier: improve register value range tracking with ARSH")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
Suspend to IDLE invokes tick_unfreeze() on resume. tick_unfreeze() on the
first resuming CPU resumes timekeeping, which also has the side effect of
resetting the softlockup watchdog on this CPU.
But on the secondary CPUs the watchdog is not reset in the resume /
unfreeze() path, which can result in false softlockup warnings on those
CPUs depending on the time spent in suspend.
Prevent this by clearing the softlock watchdog in the unfreeze path also
on the secondary resuming CPUs.
[ tglx: Massaged changelog ]
Signed-off-by: Chunyan Zhang <chunyan.zhang@unisoc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200110083902.27276-1-chunyan.zhang@unisoc.com
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
added helper bpf_send_signal() which permits bpf program to
send a signal to the current process. The signal may be
delivered to any threads in the process.
We found a use case where sending the signal to the current
thread is more preferable.
- A bpf program will collect the stack trace and then
send signal to the user application.
- The user application will add some thread specific
information to the just collected stack trace for
later analysis.
If bpf_send_signal() is used, user application will need
to check whether the thread receiving the signal matches
the thread collecting the stack by checking thread id.
If not, it will need to send signal to another thread
through pthread_kill().
This patch proposed a new helper bpf_send_signal_thread(),
which sends the signal to the thread corresponding to
the current kernel task. This way, user space is guaranteed that
bpf_program execution context and user space signal handling
context are the same thread.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115035002.602336-1-yhs@fb.com
The test_cgcore_no_internal_process_constraint_on_threads selftest when
running with subsystem controlling noise triggers two warnings:
> [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
> [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160
Both stem from a call to cgroup_type_write. The first warning was also
triggered by syzkaller.
When we're switching cgroup to threaded mode shortly after a subsystem
was disabled on it, we can see the respective subsystem css dying there.
The warning in cgroup_apply_control_enable is harmless in this case
since we're not adding new subsys anyway.
The warning in cgroup_apply_control_disable indicates an attempt to kill
css of recently disabled subsystem repeatedly.
The commit prevents these situations by making cgroup_type_write wait
for all dying csses to go away before re-applying subtree controls.
When at it, the locations of WARN_ON_ONCE calls are moved so that
warning is triggered only when we are about to misuse the dying css.
Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.
You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.
Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Function name cgroup_rstat_cpu_pop_upated() in comment should be
cgroup_rstat_cpu_pop_updated().
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is useful to know which module failed signature verification, so
print the module name along with the error message.
Signed-off-by: Jessica Yu <jeyu@kernel.org>
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.
Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.
Fixes: 51218298a2 ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Merge misc fixes from David Howells.
Two afs fixes and a key refcounting fix.
* dhowells:
afs: Fix afs_lookup() to not clobber the version on a new dentry
afs: Fix use-after-loss-of-ref
keys: Fix request_key() cache
Instead of using bpf_struct_ops_map_lookup_elem() which is
not implemented, bpf_struct_ops_map_seq_show_elem() should
also use bpf_struct_ops_map_sys_lookup_elem() which does
an inplace update to the value. The change allocates
a value to pass to bpf_struct_ops_map_sys_lookup_elem().
[root@arch-fb-vm1 bpf]# cat /sys/fs/bpf/dctcp
{{{1}},BPF_STRUCT_OPS_STATE_INUSE,{{00000000df93eebc,00000000df93eebc},0,2, ...
Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200114072647.3188298-1-kafai@fb.com
When the key cached by request_key() and co. is cleaned up on exit(),
the code looks in the wrong task_struct, and so clears the wrong cache.
This leads to anomalies in key refcounting when doing, say, a kernel
build on an afs volume, that then trigger kasan to report a
use-after-free when the key is viewed in /proc/keys.
Fix this by making exit_creds() look in the passed-in task_struct rather
than in current (the task_struct cleanup code is deferred by RCU and
potentially run in another task).
Fixes: 7743c48e54 ("keys: Cache result of request_key*() temporarily in task_struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.
Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
VDSO support for Time namespace needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page contains time namespace clock offsets and it
has vdso_data->seq set to 1 to enforce the slow path and
vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace
handling path.
Allocate the timens page during namespace creation. Setup the offsets
when the first task enters the ns and freeze them to guarantee the pace
of monotonic/boottime clocks and to avoid breakage of applications.
The design decision is to have a global offset_lock which is used during
namespace offsets setup and to freeze offsets when the first task joins the
new time namespace. That is better in terms of memory usage compared to
having a per namespace mutex that's used only during the setup period.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Based-on-work-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-24-dima@arista.com
clock_nanosleep() accepts absolute values of expiration time when
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace, and has to be converted to the host's time.
There is timens_ktime_to_host() helper for converting time, but
it accepts ktime argument.
As a preparation, make hrtimer_nanosleep() accept a clock value in ktime
instead of timespec64.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-17-dima@arista.com
Wire timer_settime() syscall into time namespace virtualization.
sys_timer_settime() calls the ktime->timer_set() callback. Right now,
common_timer_set() is the only implementation for the callback.
The user-supplied expiry value is converted from timespec64 to ktime and
then timens_ktime_to_host() can be used to convert namespace's time to the
host time.
Inside a time namespace kernel's time differs by a fixed offset from a
user-supplied time, but only absolute values (TIMER_ABSTIME) must be
converted.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-15-dima@arista.com
Adjust monotonic and boottime clocks with per-timens offsets. As the
result a process inside time namespace will see timers and clocks corrected
to offsets that were set when the namespace was created
Note that applications usually go through vDSO to get time, which is not
yet adjusted. Further changes will complete time namespace virtualisation
with vDSO support.
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-12-dima@arista.com
The callsite in common_timer_get() has already a comment:
/*
* The timespec64 based conversion is suboptimal, but it's not
* worth to implement yet another callback.
*/
kc->clock_get(timr->it_clock, &ts64);
now = timespec64_to_ktime(ts64);
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-10-dima@arista.com
The upcoming support for time namespaces requires to have access to:
- The time in a tasks time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
struct alarm_base needs to follow the same naming convention, so rename
.gettime() callback into get_ktime() as a preparation for introducing
get_timespec().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-8-dima@arista.com
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format in (struct k_clock).
As a preparation ground for introducing clock_get_ktime(), the original
callback clock_get() was renamed into clock_get_timespec().
Reflect the renaming into the callback implementations.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-7-dima@arista.com
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format, rather than in (struct timespec).
Rename the clock_get() callback to clock_get_timespec() as a preparation
for introducing clock_get_ktime().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-6-dima@arista.com
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.
One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.
/* prototype */
/* flags is currently reserved and should be set to 0 */
int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
/* testing */
Ran self-test suite on x86_64
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Add below function-tracer filter options to boot-time tracing.
- ftrace.[instance.INSTANCE.]ftrace.filters
This will take an array of tracing function filter rules
- ftrace.[instance.INSTANCE.]ftrace.notraces
This will take an array of NON-tracing function filter rules
Link: http://lkml.kernel.org/r/157867244841.17873.10933616628243103561.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add synthetic event node support to boot time tracing.
The synthetic event is a kind of event node, but the group
name is "synthetic".
- ftrace.event.synthetic.EVENT.fields = FIELD[, FIELD2...]
Defines new synthetic event with FIELDs. Each field should be
"type varname".
The synthetic node requires "fields" string arraies, which defines
the fields as same as tracing/synth_events interface.
Link: http://lkml.kernel.org/r/157867241236.17873.12411615143321557709.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add kprobe event support on event node to boot-time tracing.
If the group name of event is "kprobes", the boot-time tracing
defines new probe event according to "probes" values.
- ftrace.event.kprobes.EVENT.probes = PROBE[, PROBE2...]
Defines new kprobe event based on PROBEs. It is able to define
multiple probes on one event, but those must have same type of
arguments.
For example,
ftrace.events.kprobes.myevent {
probes = "vfs_read $arg1 $arg2";
enable;
}
This will add kprobes:myevent on vfs_read with the 1st and the 2nd
arguments.
Link: http://lkml.kernel.org/r/157867240104.17873.9712052065426433111.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add per-event settings for boottime tracing. User can set filter,
actions and enable on each event on boot. The event entries are
under ftrace.event.GROUP.EVENT node (note that the option key
includes event's group name and event name.) This supports below
configs.
- ftrace.event.GROUP.EVENT.enable
Enables GROUP:EVENT tracing.
- ftrace.event.GROUP.EVENT.filter = FILTER
Set FILTER rule to the GROUP:EVENT.
- ftrace.event.GROUP.EVENT.actions = ACTION[, ACTION2...]
Set ACTIONs to the GROUP:EVENT.
For example,
ftrace.event.sched.sched_process_exec {
filter = "pid < 128"
enable
}
this will enable tracing "sched:sched_process_exec" event
with "pid < 128" filter.
Link: http://lkml.kernel.org/r/157867238942.17873.11177628789184546198.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Setup tracing options via extra boot config in addition to kernel
command line.
This adds following commands support. These are applied to
the global trace instance.
- ftrace.options = OPT1[,OPT2...]
Enable given ftrace options.
- ftrace.trace_clock = CLOCK
Set given CLOCK to ftrace's trace_clock.
- ftrace.buffer_size = SIZE
Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
for that SIZE.
- ftrace.events = EVENT[, EVENT2...]
Enable given events on boot. You can use a wild card in EVENT.
- ftrace.tracer = TRACER
Set TRACER to current tracer on boot. (e.g. function)
Note that this is NOT replacing the kernel parameters, because
this boot config based setting is later than that. If you want to
trace earlier boot events, you still need kernel parameters.
Link: http://lkml.kernel.org/r/157867237723.17873.17494943526320587488.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since kprobe-events use event_trigger_unlock_commit_regs() directly,
that events doesn't show up in printk buffer if "tp_printk" is set.
Use trace_event_buffer_commit() in kprobe events so that it can
invoke output_printk() as same as other trace events.
Link: http://lkml.kernel.org/r/157867233085.17873.5210928676787339604.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Adjusted data var declaration placement in __kretprobe_trace_func() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As there's two struct ring_buffers in the kernel, it causes some confusion.
The other one being the perf ring buffer. It was agreed upon that as neither
of the ring buffers are generic enough to be used globally, they should be
renamed as:
perf's ring_buffer -> perf_buffer
ftrace's ring_buffer -> trace_buffer
This implements the changes to the ring buffer that ftrace uses.
Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>