Enable perf_evsel::ignore_missing_thread for -u option to ignore
complete failure if any of the user's processes die between its
enumeration and time we open the event.
Committer notes:
While doing a 'make -j4 allmodconfig' we sometimes get into the race:
Before:
# perf record -u acme
Error:
The sys_perf_event_open() syscall returned with 3 (No such process) for event (cycles:ppp).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?
#
After:
[root@jouet ~]# perf record -u acme
WARNING: Ignored open failure for pid 9888
WARNING: Ignored open failure for pid 18059
[root@jouet ~]#
Which is an improvement, with the races not preventing the remaining threads
for the specified user from being monitored, but the message probably needs
further clarification.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1481538943-21874-6-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Perf hooks allow hooking user code at perf events. They can be used for
manipulation of BPF maps, taking snapshot and reporting results. In this
patch two perf hook points are introduced: record_start and record_end.
To avoid buggy user actions, a SIGSEGV signal handler is introduced into
'perf record'. It turns off perf hook if it causes a segfault and report
an error to help debugging.
A test case for perf hook is introduced.
Test result:
$ ./buildperf/perf test -v hook
50: Test perf hooks :
--- start ---
test child forked, pid 10311
SIGSEGV is observed as expected, try to recover.
Fatal error (SEGFAULT) in perf hook 'test'
test child finished with 0
---- end ----
Test perf hooks: Ok
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Joe Stringer <joe@ovn.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/20161126070354.141764-5-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When working with overwritable ring buffer there's a inconvenience
problem: if perf dumps data after a long period after it starts,
non-sample events may lost, which makes following 'perf report' unable
to identify proc name and mmap layout. For example:
# perf record -m 4 -e raw_syscalls:* -g --overwrite --switch-output \
dd if=/dev/zero of=/dev/null
send SIGUSR2 after dd runs long enough. The resuling perf.data lost
correct comm and mmap events:
# perf script -i perf.data.2016061522374354
perf 24478 [004] 2581325.601789: raw_syscalls:sys_exit: NR 0 = 512
^^^^
Should be 'dd'
27b2e8 syscall_slow_exit_work+0xfe2000e3 (/lib/modules/4.6.0-rc3+/build/vmlinux)
203cc7 do_syscall_64+0xfe200117 (/lib/modules/4.6.0-rc3+/build/vmlinux)
b18d83 return_from_SYSCALL_64+0xfe200000 (/lib/modules/4.6.0-rc3+/build/vmlinux)
7f47c417edf0 [unknown] ([unknown])
^^^^^^^^^^^^
Fail to unwind
This patch provides a '--tail-synthesize' option, allows perf to collect
system status when finalizing output file. In resuling output file, the
non-sample events reflect system status when dumping data.
After this patch:
# perf record -m 4 -e raw_syscalls:* -g --overwrite --switch-output --tail-synthesize \
dd if=/dev/zero of=/dev/null
# perf script -i perf.data.2016061600544998
dd 27364 [004] 2583244.994464: raw_syscalls:sys_enter: NR 1 (1, ...
^^
Correct comm
203a18 syscall_trace_enter_phase2+0xfe2001a8 ([kernel.kallsyms])
203aa5 syscall_trace_enter+0xfe200055 ([kernel.kallsyms])
203caa do_syscall_64+0xfe2000fa ([kernel.kallsyms])
b18d83 return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
d8e50 __GI___libc_write+0xffff01d9639f4010 (/tmp/oxygen_root-w00229757/lib64/libc-2.18.so)
^^^^^
Correct unwind
This option doesn't aim to solve this problem completely. If a process
terminates before SIGUSR2, we still lost its COMM and MMAP events. For
example, we can't unwind correctly from the final perf.data we get from
the previous example, because when perf collects the final output file
(when we press C-c), 'dd' has been terminated so its '/proc/<pid>/mmap'
becomes empty.
However, this is a cheaper choice. To completely solve this problem we
need to continously output non-sample events. To satisify the
requirement of daemonization, we need to merge them periodically. It is
possible but requires much more code and cycles.
Automatically select --tail-synthesize when --overwrite is provided.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1468485287-33422-16-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch allows following config terms and option:
Globally setting events to overwrite;
# perf record --overwrite ...
Set specific events to be overwrite or no-overwrite.
# perf record --event cycles/overwrite/ ...
# perf record --event cycles/no-overwrite/ ...
Add missing config terms and update the config term array size because
the longest string length has changed.
For overwritable events, it automatically selects attr.write_backward
since perf requires it to be backward for reading.
Test result:
# perf record --overwrite -e syscalls:*enter_nanosleep* usleep 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (1 samples) ]
# perf evlist -v
syscalls:sys_enter_nanosleep: type: 2, size: 112, config: 0x134, { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|PERIOD|RAW, disabled: 1, inherit: 1, mmap: 1, comm: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, write_backward: 1
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1468485287-33422-14-git-send-email-wangnan0@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introduce a bkw_mmap_state state machine to evlist:
.________________(forbid)_____________.
| V
NOTREADY --(0)--> RUNNING --(1)--> DATA_PENDING --(2)--> EMPTY
^ ^ | ^ |
| |__(forbid)____/ |___(forbid)___/|
| |
\_________________(3)_______________/
NOTREADY : Backward ring buffers are not ready
RUNNING : Backward ring buffers are recording
DATA_PENDING : We are required to collect data from backward ring buffers
EMPTY : We have collected data from backward ring buffers.
(0): Setup backward ring buffer
(1): Pause ring buffers for reading
(2): Read from ring buffers
(3): Resume ring buffers for recording
We can't avoid this complexity. Since we deliberately drop records from
overwritable ring buffer, there's no way for us to check remaining from
ring buffer itself (by checking head and old pointers). Therefore, we
need DATA_PENDING and EMPTY state to help us recording what we have done
to the ring buffer.
In record__mmap_read_evlist(), drive this state machine from DATA_PENDING
to EMPTY.
In perf_evlist__mmap_per_evsel(), drive this state machine from NOTREADY
to RUNNING when creating backward mmap.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: He Kuang <hekuang@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1468485287-33422-11-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The tools so far have been using the strerror_r() GNU variant, that
returns a string, be it the buffer passed or something else.
But that, besides being tricky in cases where we expect that the
function using strerror_r() returns the error formatted in a provided
buffer (we have to check if it returned something else and copy that
instead), breaks the build on systems not using glibc, like Alpine
Linux, where musl libc is used.
So, introduce yet another wrapper, str_error_r(), that has the GNU
interface, but uses the portable XSI variant of strerror_r(), so that
users rest asured that the provided buffer is used and it is what is
returned.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-d4t42fnf48ytlk8rjxs822tf@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
record__mmap_read() writes data from ring buffer into perf.data. 'head'
is maintained by the kernel, points to the last written record.
'old' is maintained by perf, points to the record read in previous
round. record__mmap_read() saves data from 'old' to 'head' to
perf.data.
The names of these variables are not very intutive. In addition,
when dealing with backward writing ring buffer, the md->prev pointer
should point to 'head' instead of the last byte it got.
Add 'start' and 'end' pointer to make code clear and set md->prev to
'head' instead of the moved 'old' pointer. This patch doesn't change
behavior since:
buf = &data[old & md->mask];
size = head - old;
old += size; <--- Here, old == head
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1463987628-163563-4-git-send-email-wangnan0@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Without this patch, the last output doesn't have timestamp appended if
--timestamp-filename is not explicitly provided. For example:
# perf record -a --switch-output &
[1] 11224
# kill -s SIGUSR2 11224
[ perf record: dump data: Woken up 1 times ]
# [ perf record: Dump perf.data.2015122622372823 ]
# fg
perf record -a --switch-output
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.027 MB perf.data (540 samples) ]
# ls -l
total 836
-rw------- 1 root root 33256 Dec 26 22:37 perf.data <---- *Odd*
-rw------- 1 root root 817156 Dec 26 22:37 perf.data.2015122622372823
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1461178794-40467-5-git-send-email-wangnan0@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
[ Updated man page, that also got an entry for --timestamp-filename ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch allows cloning bpf-output event configuration among multiple
bpf scripts. If there exist a map named '__bpf_output__' and not
configured using 'map:__bpf_output__.event=', this patch clones the
configuration of another '__bpf_stdout__' map. For example, following
command:
# perf trace --ev bpf-output/no-inherit,name=evt/ \
--ev ./test_bpf_trace.c/map:__bpf_stdout__.event=evt/ \
--ev ./test_bpf_trace2.c usleep 100000
equals to:
# perf trace --ev bpf-output/no-inherit,name=evt/ \
--ev ./test_bpf_trace.c/map:__bpf_stdout__.event=evt/ \
--ev ./test_bpf_trace2.c/map:__bpf_stdout__.event=evt/ \
usleep 100000
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1460128045-97310-4-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Intel PT uses the time members from the perf_event_mmap_page to convert
between TSC and perf time.
Due to a lack of foresight when Intel PT was implemented, those time
members were recorded in the (implementation dependent) AUXTRACE_INFO
event, the structure of which is generally inaccessible outside of the
Intel PT decoder. However now the conversion between TSC and perf time
is needed when processing a jitdump file when Intel PT has been used for
tracing.
So add a user event to record the time members. 'perf record' will
synthesize the event if the information is available. And session
processing will put a copy of the event on the session so that tools
like 'perf inject' can easily access it.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1457426324-30158-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
bpf__apply_obj_config() is introduced as the core API to apply object
config options to all BPF objects. This patch also does the real work
for setting values for BPF_MAP_TYPE_PERF_ARRAY maps by inserting value
stored in map's private field into the BPF map.
This patch is required because we are not always able to set all BPF
config during parsing. Further patch will set events created by perf to
BPF_MAP_TYPE_PERF_EVENT_ARRAY maps, which is not exist until
perf_evsel__open().
bpf_map_foreach_key() is introduced to iterate over each key needs to be
configured. This function would be extended to support more map types
and different key settings.
In perf record, before start recording, call bpf__apply_config() to turn
on all BPF config options.
Test result:
# cat ./test_bpf_map_1.c
/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#define SEC(NAME) __attribute__((section(NAME), used))
struct bpf_map_def {
unsigned int type;
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
};
static void *(*map_lookup_elem)(struct bpf_map_def *, void *) =
(void *)BPF_FUNC_map_lookup_elem;
static int (*trace_printk)(const char *fmt, int fmt_size, ...) =
(void *)BPF_FUNC_trace_printk;
struct bpf_map_def SEC("maps") channel = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1,
};
SEC("func=sys_nanosleep")
int func(void *ctx)
{
int key = 0;
char fmt[] = "%d\n";
int *pval = map_lookup_elem(&channel, &key);
if (!pval)
return 0;
trace_printk(fmt, sizeof(fmt), *pval);
return 0;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/
# echo "" > /sys/kernel/debug/tracing/trace
# ./perf record -e './test_bpf_map_1.c/map:channel.value=11/' usleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB perf.data ]
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:8
[SNIP]
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
usleep-18593 [007] d... 2394714.395539: : 11
# ./perf record -e './test_bpf_map_1.c/map:channel.value=101/' usleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB perf.data ]
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:8
[SNIP]
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
usleep-18593 [007] d... 2394714.395539: : 11
usleep-19000 [006] d... 2394831.057840: : 101
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Cody P Schafer <dev@codyps.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jeremie Galarneau <jeremie.galarneau@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1456132275-98875-6-git-send-email-wangnan0@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Allow user to easily switch all events to user or kernel space with simple
--all-user or --all-kernel options.
This will be handy within perf mem/c2c wrappers to switch easily monitoring
modes.
Committer note:
Testing it:
# perf record --all-kernel --all-user -a sleep 2
Error: option `all-user' cannot be used with all-kernel
Usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]
--all-user Configure all used events to run in user space.
--all-kernel Configure all used events to run in kernel space.
# perf record --all-user --all-kernel -a sleep 2
Error: option `all-kernel' cannot be used with all-user
Usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]
--all-kernel Configure all used events to run in kernel space.
--all-user Configure all used events to run in user space.
# perf record --all-user -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.416 MB perf.data (162 samples) ]
# perf report | grep '\[k\]'
# perf record --all-kernel -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.423 MB perf.data (296 samples) ]
# perf report | grep '\[\.\]'
#
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1455525293-8671-2-git-send-email-jolsa@kernel.org
[ Made those options to be mutually exclusive ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>