When it is not possible for a non-privilege perf command to monitor at
the kernel level (:k), the fallback code forces a :u. That works if the
event was previously monitoring both levels. But if the event was
already constrained to kernel only, then it does not make sense to
restrict it to user only.
Given the code works by exclusion, a kernel only event would have:
attr->exclude_user = 1
The fallback code would add:
attr->exclude_kernel = 1
In the end the end would not monitor in either the user level or kernel
level. In other words, it would count nothing.
An event programmed to monitor kernel only cannot be switched to user
only without seriously warning the user.
This patch forces an error in this case to make it clear the request
cannot really be satisfied.
Behavior with paranoid 1:
$ sudo bash -c "echo 1 > /proc/sys/kernel/perf_event_paranoid"
$ perf stat -e cycles:k sleep 1
Performance counter stats for 'sleep 1':
1,520,413 cycles:k
1.002361664 seconds time elapsed
0.002480000 seconds user
0.000000000 seconds sys
Old behavior with paranoid 2:
$ sudo bash -c "echo 2 > /proc/sys/kernel/perf_event_paranoid"
$ perf stat -e cycles:k sleep 1
Performance counter stats for 'sleep 1':
0 cycles:ku
1.002358127 seconds time elapsed
0.002384000 seconds user
0.000000000 seconds sys
New behavior with paranoid 2:
$ sudo bash -c "echo 2 > /proc/sys/kernel/perf_event_paranoid"
$ perf stat -e cycles:k sleep 1
Error:
You may not have permission to collect stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).
The current value is 2:
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
>= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN
To make this setting permanent, edit /etc/sysctl.conf too, e.g.:
kernel.perf_event_paranoid = -1
v2 of this patch addresses the review feedback from jolsa@redhat.com.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200414161550.225588-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When AUX area events are used in sampling mode, they must be the group
leader, but the group leader is also used for leader-sampling. However,
it is not desirable to use an AUX area event as the leader for
leader-sampling, because it doesn't have any samples of its own. To support
leader-sampling with AUX area events, use the 2nd event of the group as the
"leader" for the purposes of leader-sampling.
Example:
# perf record --kcore --aux-sample -e '{intel_pt//,cycles,instructions}:S' -c 10000 uname
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.786 MB perf.data ]
# perf report
Samples: 380 of events 'anon group { cycles, instructions }', Event count (approx.): 3026164
Children Self Command Shared Object Symbol
+ 38.76% 42.65% 0.00% 0.00% uname [kernel.kallsyms] [k] __x86_indirect_thunk_rax
+ 35.82% 31.33% 0.00% 0.00% uname ld-2.28.so [.] _dl_start_user
+ 34.29% 29.74% 0.55% 0.47% uname ld-2.28.so [.] _dl_start
+ 33.73% 28.62% 1.60% 0.97% uname ld-2.28.so [.] dl_main
+ 33.19% 29.04% 0.52% 0.32% uname ld-2.28.so [.] _dl_sysdep_start
+ 27.83% 33.74% 0.00% 0.00% uname [kernel.kallsyms] [k] do_syscall_64
+ 26.76% 33.29% 0.00% 0.00% uname [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 23.78% 20.33% 5.97% 5.25% uname [kernel.kallsyms] [k] page_fault
+ 23.18% 24.60% 0.00% 0.00% uname libc-2.28.so [.] __libc_start_main
+ 22.64% 24.37% 0.00% 0.00% uname uname [.] _start
+ 21.04% 23.27% 0.00% 0.00% uname uname [.] main
+ 19.48% 18.08% 3.72% 3.64% uname ld-2.28.so [.] _dl_relocate_object
+ 19.47% 21.81% 0.00% 0.00% uname libc-2.28.so [.] setlocale
+ 19.44% 21.56% 0.52% 0.61% uname libc-2.28.so [.] _nl_find_locale
+ 17.87% 19.66% 0.00% 0.00% uname libc-2.28.so [.] _nl_load_locale_from_archive
+ 15.71% 13.73% 0.53% 0.52% uname [kernel.kallsyms] [k] do_page_fault
+ 15.18% 13.21% 1.03% 0.68% uname [kernel.kallsyms] [k] handle_mm_fault
+ 14.15% 12.53% 1.01% 1.12% uname [kernel.kallsyms] [k] __handle_mm_fault
+ 12.03% 9.67% 0.54% 0.32% uname ld-2.28.so [.] _dl_map_object
+ 10.55% 8.48% 0.00% 0.00% uname ld-2.28.so [.] openaux
+ 10.55% 20.20% 0.52% 0.61% uname libc-2.28.so [.] __run_exit_handlers
Comnmitter notes:
Fixed up this problem:
util/record.c: In function ‘perf_evlist__config’:
util/record.c:256:3: error: too few arguments to function ‘perf_evsel__config_leader_sampling’
256 | perf_evsel__config_leader_sampling(evsel);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
util/record.c:190:13: note: declared here
190 | static void perf_evsel__config_leader_sampling(struct evsel *evsel,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lore.kernel.org/lkml/20200401101613.6201-17-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Move leader-sampling configuration in preparation for adding support for
leader sampling with AUX area events.
Committer notes:
It only makes sense when configuring an evsel that is part of an evlist,
so the only case where it is called outside perf_evlist__config(), in
some 'perf test' entry, is safe, and even there we should just use
perf_evlist__config(), but since in that case we have just one evsel in
the evlist, it is equivalent.
Also fixed up this problem:
util/record.c: In function ‘perf_evlist__config’:
util/record.c:223:3: error: too many arguments to function ‘perf_evsel__config_leader_sampling’
223 | perf_evsel__config_leader_sampling(evsel, evlist);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
util/record.c:170:13: note: declared here
170 | static void perf_evsel__config_leader_sampling(struct evsel *evsel)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lore.kernel.org/lkml/20200401101613.6201-14-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Allow building vringh without IOTLB (that's the case for userspace
builds, will be useful for CAIF/VOD down the road too).
Update for API tweaks.
Don't include vringh with userspace builds.
Cc: Jason Wang <jasowang@redhat.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Pull networking fixes from David Miller:
1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.
2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.
3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.
4) Adjust icmp6 message source address selection when routes have a
preferred source address set, from Tim Stallard.
5) Be sure to validate HSR protocol version when creating new links,
from Taehee Yoo.
6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
non-initial namespaces, from Michael Weiß.
7) Missing release firmware call in mlx5, from Eran Ben Elisha.
8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
Taehee Yoo.
9) Fix pause frame negotiation in marvell phy driver, from Clemens
Gruber.
10) Record RX queue early enough in tun packet paths such that XDP
programs will see the correct RX queue index, from Gilberto Bertin.
11) Fix double unlock in mptcp, from Florian Westphal.
12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.
13) marvell10g needs to soft reset PHY when coming out of low power
mode, from Russell King.
14) Fix MTU setting regression in stmmac for some chip types, from
Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
amd-xgbe: Use __napi_schedule() in BH context
mISDN: make dmril and dmrim static
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
tipc: fix incorrect increasing of link window
Documentation: Fix tcp_challenge_ack_limit default value
net: tulip: make early_486_chipsets static
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
ipv6: remove redundant assignment to variable err
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
selftests/bpf: Check for correct program attach/detach in xdp_attach test
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
libbpf: Always specify expected_attach_type on program load if supported
xsk: Add missing check on user supplied headroom size
mac80211: fix channel switch trigger from unknown mesh peer
mac80211: fix race in ieee80211_register_hw()
net: marvell10g: soft-reset the PHY when coming out of low power
net: marvell10g: report firmware version
net/cxgb4: Check the return from t4_query_params properly
...
This script works in tandem with d3-flame-graph to generate flame graphs
from perf. It supports two output formats: JSON and HTML (the default).
The HTML format will look for a standalone d3-flame-graph template file
in /usr/share/d3-flame-graph/d3-flamegraph-base.html and fill in the
collected stacks.
Usage:
perf record -a -g -F 99 sleep 60
perf script report flamegraph
Combined:
perf script flamegraph -a -F 99 sleep 60
Committer testing:
Tested both with "PYTHON=python3" and with the default, that uses
python2-devel:
Complete set of instructions:
$ mkdir /tmp/build/perf
$ make PYTHON=python3 -C tools/perf O=/tmp/build/perf install-bin
$ export PATH=~/bin:$PATH
$ perf record -a -g -F 99 sleep 60
$ perf script report flamegraph
Now go and open the generated flamegraph.html file in a browser.
At first this required building with PYTHON=python3, but after I
reported this Andreas was kind enough to send a patch making it work
with both python and python3.
Signed-off-by: Andreas Gerstmayr <agerstmayr@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Martin Spier <mspier@netflix.com>
Link: http://lore.kernel.org/lkml/20200320151355.66302-1-agerstmayr@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The xxx_mountpoint() interface provided by fs.c finds mount points for
common pseudo filesystems. The first time xxx_mountpoint() is invoked,
it scans the mount table (/proc/mounts) looking for a match. If found,
it is cached. The price to scan /proc/mounts is paid once if the mount
is found.
When the mount point is not found, subsequent calls to xxx_mountpoint()
scan /proc/mounts over and over again. There is no caching.
This causes a scaling issue in perf record with hugeltbfs__mountpoint().
The function is called for each process found in
synthesize__mmap_events(). If the machine has thousands of processes
and if the /proc/mounts has many entries this could cause major overhead
in perf record. We have observed multi-second slowdowns on some
configurations.
As an example on a laptop:
Before:
$ sudo umount /dev/hugepages
$ strace -e trace=openat -o /tmp/tt perf record -a ls
$ fgrep mounts /tmp/tt
285
After:
$ sudo umount /dev/hugepages
$ strace -e trace=openat -o /tmp/tt perf record -a ls
$ fgrep mounts /tmp/tt
1
One could argue that the non-caching in case the moint point is not
found is intentional. That way subsequent calls may discover a moint
point if the sysadmin mounts the filesystem. But the same argument could
be made against caching the mount point. It could be unmounted causing
errors. It all depends on the intent of the interface. This patch
assumes it is expected to scan /proc/mounts once. The patch documents
the caching behavior in the fs.h header file.
An alternative would be to just fix perf record. But it would solve the
problem with hugetlbs__mountpoint() but there could be similar issues
(possibly down the line) with other xxx_mountpoint() calls in perf or
other tools.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrey Zhizhikin <andrey.z@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/20200402154357.107873-3-irogers@google.com
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Extend error messages to mention CAP_PERFMON capability as an option to
substitute CAP_SYS_ADMIN capability for secure system performance
monitoring and observability operations. Make
perf_event_paranoid_check() and __cmd_ftrace() to be aware of
CAP_PERFMON capability.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons access to perf_events subsystem remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure perf_events monitoring is discouraged with respect to CAP_PERFMON
capability.
Committer testing:
Using a libcap with this patch:
diff --git a/libcap/include/uapi/linux/capability.h b/libcap/include/uapi/linux/capability.h
index 78b2fd4c8a95..89b5b0279b60 100644
--- a/libcap/include/uapi/linux/capability.h
+++ b/libcap/include/uapi/linux/capability.h
@@ -366,8 +366,9 @@ struct vfs_ns_cap_data {
#define CAP_AUDIT_READ 37
+#define CAP_PERFMON 38
-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_PERFMON
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
Note that using '38' in place of 'cap_perfmon' works to some degree with
an old libcap, its only when cap_get_flag() is called that libcap
performs an error check based on the maximum value known for
capabilities that it will fail.
This makes determining the default of perf_event_attr.exclude_kernel to
fail, as it can't determine if CAP_PERFMON is in place.
Using 'perf top -e cycles' avoids the default check and sets
perf_event_attr.exclude_kernel to 1.
As root, with a libcap supporting CAP_PERFMON:
# groupadd perf_users
# adduser perf -g perf_users
# mkdir ~perf/bin
# cp ~acme/bin/perf ~perf/bin/
# chgrp perf_users ~perf/bin/perf
# setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" ~perf/bin/perf
# getcap ~perf/bin/perf
/home/perf/bin/perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
# ls -la ~perf/bin/perf
-rwxr-xr-x. 1 root perf_users 16968552 Apr 9 13:10 /home/perf/bin/perf
As the 'perf' user in the 'perf_users' group:
$ perf top -a --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
$
Either add the cap_ipc_lock capability to the perf binary or reduce the
ring buffer size to some smaller value:
$ perf top -m10 -a --stdio
rounding mmap pages size to 64K (16 pages)
Error:
Failed to mmap with 1 (Operation not permitted)
$ perf top -m4 -a --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
$ perf top -m2 -a --stdio
PerfTop: 762 irqs/sec kernel:49.7% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 4 CPUs)
------------------------------------------------------------------------------------------------------
9.83% perf [.] __symbols__insert
8.58% perf [.] rb_next
5.91% [kernel] [k] module_get_kallsym
5.66% [kernel] [k] kallsyms_expand_symbol.constprop.0
3.98% libc-2.29.so [.] __GI_____strtoull_l_internal
3.66% perf [.] rb_insert_color
2.34% [kernel] [k] vsnprintf
2.30% [kernel] [k] string_nocheck
2.16% libc-2.29.so [.] _IO_getdelim
2.15% [kernel] [k] number
2.13% [kernel] [k] format_decode
1.58% libc-2.29.so [.] _IO_feof
1.52% libc-2.29.so [.] __strcmp_avx2
1.50% perf [.] rb_set_parent_color
1.47% libc-2.29.so [.] __libc_calloc
1.24% [kernel] [k] do_syscall_64
1.17% [kernel] [k] __x86_indirect_thunk_rax
$ perf record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.552 MB perf.data (74 samples) ]
$ perf evlist
cycles
$ perf evlist -v
cycles: size: 120, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
$ perf report | head -20
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 74 of event 'cycles'
# Event count (approx.): 15694834
#
# Overhead Command Shared Object Symbol
# ........ ............... .......................... ......................................
#
19.62% perf [kernel.vmlinux] [k] strnlen_user
13.88% swapper [kernel.vmlinux] [k] intel_idle
13.83% ksoftirqd/0 [kernel.vmlinux] [k] pfifo_fast_dequeue
13.51% swapper [kernel.vmlinux] [k] kmem_cache_free
6.31% gnome-shell [kernel.vmlinux] [k] kmem_cache_free
5.66% kworker/u8:3+ix [kernel.vmlinux] [k] delay_tsc
4.42% perf [kernel.vmlinux] [k] __set_cpus_allowed_ptr
3.45% kworker/2:1-eve [kernel.vmlinux] [k] shmem_truncate_range
2.29% gnome-shell libgobject-2.0.so.0.6000.7 [.] g_closure_ref
$
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/a66d5648-2b8e-577e-e1f2-1d56c017ab5e@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Synthesize bpf images (trampolines/dispatchers) on start, as ksymbol
events from /proc/kallsyms. Having this perf can recognize samples from
those images and perf report and top shows them correctly.
The rest of the ksymbol handling is already in place from for the bpf
programs monitoring, so only the initial state was needed.
perf report output:
# Overhead Command Shared Object Symbol
12.37% test_progs [kernel.vmlinux] [k] entry_SYSCALL_64
11.80% test_progs [kernel.vmlinux] [k] syscall_return_via_sysret
9.63% test_progs bpf_prog_bcf7977d3b93787c_prog2 [k] bpf_prog_bcf7977d3b93787c_prog2
6.90% test_progs bpf_trampoline_24456 [k] bpf_trampoline_24456
6.36% test_progs [kernel.vmlinux] [k] memcpy_erms
Committer notes:
Use scnprintf() instead of strncpy() to overcome this on fedora:32,
rawhide and OpenMandriva Cooker:
CC /tmp/build/perf/util/bpf-event.o
In file included from /usr/include/string.h:495,
from /git/linux/tools/lib/bpf/libbpf_common.h:12,
from /git/linux/tools/lib/bpf/bpf.h:31,
from util/bpf-event.c:4:
In function 'strncpy',
inlined from 'process_bpf_image' at util/bpf-event.c:323:2,
inlined from 'kallsyms_process_symbol' at util/bpf-event.c:358:9:
/usr/include/bits/string_fortified.h:106:10: error: '__builtin_strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
106 | return __builtin___strncpy_chk (__dest, __src, __len, __bos (__dest));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-14-jolsa@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When --timeout is used and a workload is specified to be started by
'perf stat', i.e.
$ perf stat --timeout 1000 sleep 1h
The --timeout wasn't being honoured, i.e. the workload, 'sleep 1h' in
the above example, should be terminated after 1000ms, but it wasn't,
'perf stat' was waiting for it to finish.
Fix it by sending a SIGTERM when the timeout expires.
Now it works:
# perf stat -e cycles --timeout 1234 sleep 1h
sleep: Terminated
Performance counter stats for 'sleep 1h':
1,066,692 cycles
1.234314838 seconds time elapsed
0.000750000 seconds user
0.000000000 seconds sys
#
Fixes: f1f8ad52f8 ("perf stat: Add support to print counts after a period of time")
Reported-by: Konstantin Kharlamov <hi-angel@yandex.ru>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207243
Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: yuzhoujian <yuzhoujian@didichuxing.com>
Link: https://lore.kernel.org/lkml/20200415153803.GB20324@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern noticed that there was a bug in the EXPECTED_FD code so
programs did not get detached properly when that parameter was supplied.
This case was not included in the xdp_attach tests; so let's add it to be
sure that such a bug does not sneak back in down.
Fixes: 87854a0b57 ("selftests/bpf: Add tests for attaching XDP programs")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200414145025.182163-2-toke@redhat.com
The 'old_fd' parameter used for atomic replacement of XDP programs is
supposed to be an FD, but was left as a u32 from an earlier iteration of
the patch that added it. It was converted to an int when read, so things
worked correctly even with negative values, but better change the
definition to correctly reflect the intention.
Fixes: bd5ca3ef93 ("libbpf: Add function to set link XDP fd while specifying old program")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200414145025.182163-1-toke@redhat.com
For some types of BPF programs that utilize expected_attach_type, libbpf won't
set load_attr.expected_attach_type, even if expected_attach_type is known from
section definition. This was done to preserve backwards compatibility with old
kernels that didn't recognize expected_attach_type attribute yet (which was
added in 5e43f899b0 ("bpf: Check attach type at prog load time"). But this
is problematic for some BPF programs that utilize newer features that require
kernel to know specific expected_attach_type (e.g., extended set of return
codes for cgroup_skb/egress programs).
This patch makes libbpf specify expected_attach_type by default, but also
detect support for this field in kernel and not set it during program load.
This allows to have a good metadata for bpf_program
(e.g., bpf_program__get_extected_attach_type()), but still work with old
kernels (for cases where it can work at all).
Additionally, due to expected_attach_type being always set for recognized
program types, bpf_program__attach_cgroup doesn't have to do extra checks to
determine correct attach type, so remove that additional logic.
Also adjust section_names selftest to account for this change.
More detailed discussion can be found in [0].
[0] https://lore.kernel.org/bpf/20200412003604.GA15986@rdna-mbp.dhcp.thefacebook.com/
Fixes: 5cf1e91456 ("bpf: cgroup inet skb programs can return 0 to 3")
Fixes: 5e43f899b0 ("bpf: Check attach type at prog load time")
Reported-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20200414182645.1368174-1-andriin@fb.com
In commit 65c9362859 ("bpftool: Add struct_ops support") a new
type of command named struct_ops has been added. This command requires
a kernel with CONFIG_DEBUG_INFO_BTF=y set and for retrieving BTF info
in bpftool, the helper get_btf_vmlinux() is used.
When running this command on kernel without BTF debug info, this will
lead to 'btf_vmlinux' variable being an invalid(error) pointer. And by
this, btf_free() causes a segfault when executing 'bpftool struct_ops'.
This commit adds pointer validation with IS_ERR not to free invalid
pointer, and this will fix the segfault issue.
Fixes: 65c9362859 ("bpftool: Add struct_ops support")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200410020612.2930667-1-danieltimlee@gmail.com
Test that frozen and mmap()'ed BPF map can't be mprotect()'ed as writable or
executable memory. Also validate that "downgrading" from writable to read-only
doesn't screw up internal writable count accounting for the purposes of map
freezing.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200410202613.3679837-2-andriin@fb.com
After successfully running the IPC msgque test once, subsequent runs
result in a test failure:
$ sudo ./run_kselftest.sh
TAP version 13
1..1
# selftests: ipc: msgque
# Failed to get stats for IPC queue with id 0
# Failed to dump queue: -22
# Bail out!
# # Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
not ok 1 selftests: ipc: msgque # exit=1
The dump_queue() function loops through the possible message queue index
values using calls to msgctl(kern_id, MSG_STAT, ...) where kern_id
represents the index value. The first time the test is ran, the initial
index value of 0 is valid and the test is able to complete. The index
value of 0 is not valid in subsequent test runs and the loop attempts to
try index values of 1, 2, 3, and so on until a valid index value is
found that corresponds to the message queue created earlier in the test.
The msgctl() syscall returns -1 and sets errno to EINVAL when invalid
index values are used. The test failure is caused by incorrectly
comparing errno to -EINVAL when cycling through possible index values.
Fix invalid test failures on subsequent runs of the msgque test by
correctly comparing errno values to a non-negated EINVAL.
Fixes: 3a665531a3 ("selftests: IPC message queue copy feature test")
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
ftrace-direct.tc and kprobe-direct.tc require CONFIG_SAMPLE_FTRACE_DIRECT=m
so add it to config file which is used by merge_config.sh.
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
glibc 2.31 calls clock_nanosleep when its nanosleep function is used. So
the restart_syscall fails after that. In order to deal with it, we trace
clock_nanosleep and nanosleep. Then we check for either.
This works just fine on systems with both glibc 2.30 and glibc 2.31,
whereas it failed before on a system with glibc 2.31.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
While running seccomp_bpf, kill_after_ptrace() gets stuck if we run it
via /usr/bin/timeout (that is the default), until the timeout expires.
This is because /usr/bin/timeout is preventing to properly deliver
signals to ptrace'd children (SIGSYS in this case).
This problem can be easily reproduced by running:
$ sudo make TARGETS=seccomp kselftest
...
# [ RUN ] TRACE_syscall.skip_a#
not ok 1 selftests: seccomp: seccomp_bpf # TIMEOUT
The test is hanging at this point until the timeout expires and then it
reports the timeout error.
Prevent this problem by passing --foreground to /usr/bin/timeout,
allowing to properly deliver signals to children processes.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
To pick up the changes in these csets:
295bcca849 ("linux/bits.h: add compile time sanity check of GENMASK inputs")
3945ff37d2 ("linux/bits.h: Extract common header for vDSO")
To address this tools/perf build warning:
Warning: Kernel ABI header at 'tools/include/linux/bits.h' differs from latest version at 'include/linux/bits.h'
diff -u tools/include/linux/bits.h include/linux/bits.h
This clashes with usage of userspace's static_assert(), that, at least
on glibc, is guarded by a ifnded/endif pair, do the same to our copy of
build_bug.h and avoid that diff in check_headers.sh so that we continue
checking for drifts with the kernel sources master copy.
This will all be tested with the set of build containers that includes
uCLibc, musl libc, lots of glibc versions in lots of distros and cross
build environments.
The tools/objtool, tools/bpf, etc were tested as well.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes from:
d3b1b776ee ("x86/entry/64: Remove ptregs qualifier from syscall table")
cab56d3484 ("x86/entry: Remove ABI prefixes from functions in syscall tables")
27dd84fafc ("x86/entry/64: Use syscall wrappers for x32_rt_sigreturn")
Addressing this tools/perf build warning:
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
That didn't result in any tooling changes, as what is extracted are just
the first two columns, and these patches touched only the third.
$ cp /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c /tmp
$ cp arch/x86/entry/syscalls/syscall_64.tbl tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
$ make -C tools/perf O=/tmp/build/perf install-bin
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j12' parallel build
DESCEND plugins
CC /tmp/build/perf/util/syscalltbl.o
INSTALL trace_plugins
LD /tmp/build/perf/util/perf-in.o
LD /tmp/build/perf/perf-in.o
LINK /tmp/build/perf/perf
$ diff -u /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c /tmp/syscalls_64.c
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the change in:
88be76cdaf ("drm/i915: Allow userspace to specify ringsize on construction")
That don't result in any changes in tooling, just silences this perf
build warning:
Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>