Perf stat doesn't count the uncore event aliases from the same uncore
block in a group, for example:
perf stat -e '{unc_m_cas_count.all,unc_m_clockticks}' -a -I 1000
# time counts unit events
1.000447342 <not counted> unc_m_cas_count.all
1.000447342 <not counted> unc_m_clockticks
2.000740654 <not counted> unc_m_cas_count.all
2.000740654 <not counted> unc_m_clockticks
The output is very misleading. It gives a wrong impression that the
uncore event doesn't work.
An uncore block could be composed by several PMUs. An uncore event alias
is a joint name which means the same event runs on all PMUs of a block.
Perf doesn't support mixed events from different PMUs in the same group.
It is wrong to put uncore event aliases in a big group.
The right way is to split the big group into multiple small groups which
only include the events from the same PMU.
Only uncore event aliases from the same uncore block should be specially
handled here. It doesn't make sense to mix the uncore events with other
uncore events from different blocks or even core events in a group.
With the patch:
# time counts unit events
1.001557653 140,833 unc_m_cas_count.all
1.001557653 1,330,231,332 unc_m_clockticks
2.002709483 85,007 unc_m_cas_count.all
2.002709483 1,429,494,563 unc_m_clockticks
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Agustin Vega-Frias <agustinv@codeaurora.org>
Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1525727623-19768-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since the ip shown for a symbol is now always a virtual address, it
becomes difficult to correlate this with objdump output and determine
the exact instruction address. So, we always show the offset from the
start of the symbol.
This can be verified on a powerpc64le system running Fedora 27 as
follows:
# perf probe -a sys_write
# perf record -e probe:sys_write -g ~/test
Before applying this patch:
# perf script
test 9710 [013] 95614.332431: probe:sys_write: (c0000000004025b0)
c0000000004025b0 sys_write (/lib/modules/4.17.0-rc4+/build/vmlinux)
c00000000000b9e0 system_call (/lib/modules/4.17.0-rc4+/build/vmlinux)
7fffb70d8234 __GI___libc_write (/usr/lib64/libc-2.26.so)
7fffb7052c74 _IO_file_write@@GLIBC_2.17 (/usr/lib64/libc-2.26.so)
5afc1818 [unknown] ([unknown])
7fffb7051a60 new_do_write (/usr/lib64/libc-2.26.so)
7fffb7054638 _IO_do_write@@GLIBC_2.17 (/usr/lib64/libc-2.26.so)
7fffb7054bbc _IO_file_overflow@@GLIBC_2.17 (/usr/lib64/libc-2.26.so)
7fffb7055a24 __overflow (/usr/lib64/libc-2.26.so)
7fffb7044548 _IO_puts (/usr/lib64/libc-2.26.so)
10000440 main (/home/sandipan/test)
7fffb6fe36a0 generic_start_main.isra.0 (/usr/lib64/libc-2.26.so)
7fffb6fe3898 __libc_start_main (/usr/lib64/libc-2.26.so)
0 [unknown] ([unknown])
...
After applying this patch:
# perf script
test 9710 [013] 95614.332431: probe:sys_write: (c0000000004025b0)
c0000000004025b0 sys_write+0x10 (/lib/modules/4.17.0-rc4+/build/vmlinux)
c00000000000b9e0 system_call+0x58 (/lib/modules/4.17.0-rc4+/build/vmlinux)
7fffb70d8234 __GI___libc_write+0x24 (/usr/lib64/libc-2.26.so)
7fffb7052c74 _IO_file_write@@GLIBC_2.17+0x44 (/usr/lib64/libc-2.26.so)
5afc1818 [unknown] ([unknown])
7fffb7051a60 new_do_write+0x90 (/usr/lib64/libc-2.26.so)
7fffb7054638 _IO_do_write@@GLIBC_2.17+0x38 (/usr/lib64/libc-2.26.so)
7fffb7054bbc _IO_file_overflow@@GLIBC_2.17+0x14c (/usr/lib64/libc-2.26.so)
7fffb7055a24 __overflow+0x64 (/usr/lib64/libc-2.26.so)
7fffb7044548 _IO_puts+0x218 (/usr/lib64/libc-2.26.so)
10000440 main+0x20 (/home/sandipan/test)
7fffb6fe36a0 generic_start_main.isra.0+0x140 (/usr/lib64/libc-2.26.so)
7fffb6fe3898 __libc_start_main+0xb8 (/usr/lib64/libc-2.26.so)
0 [unknown] ([unknown])
...
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Link: http://lkml.kernel.org/r/20180517063326.6319-2-sandipan@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When perf data is recorded with the call-graph option enabled, the
callchain shown by perf script shows the binary offsets of the symbols
as the ip. This is incorrect for kernel symbols as the ip values are
always off by a fixed offset depending on the architecture. If the
offsets from the start of the symbols are printed, they are also
incorrect for both kernel and userspace symbols.
Without the call-graph option, the callchain shows the virtual addresses
of the symbols rather than their binary offsets. The offsets printed in
this case are also correct.
This fixes the inconsistency in perf script's output.
This can be verified on a powerpc64le system running Fedora 27 as
follows:
# cat /proc/kallsyms | grep sys_write
...
c0000000004025a0 T sys_write
c0000000004025a0 T __se_sys_write
...
# perf probe -a sys_write
Before applying this patch:
# perf record -e probe:sys_write -g ~/test
# perf script -F ip,sym,symoff
4125b0 sys_write+0x8000000000008010
1b9e0 system_call+0x8000000000008058
118234 __GI___libc_write+0xffff0000f52c0024
92c74 _IO_file_write@@GLIBC_2.17+0xffff0000f52c0044
5afbfd8a [unknown]
91a60 new_do_write+0xffff0000f52c0090
94638 _IO_do_write@@GLIBC_2.17+0xffff0000f52c0038
94bbc _IO_file_overflow@@GLIBC_2.17+0xffff0000f52c014c
95a24 __overflow+0xffff0000f52c0064
84548 _IO_puts+0xffff0000f52c0218
440 main+0xffffffffe0000020
236a0 generic_start_main.isra.0+0xffff0000f52c0140
23898 __libc_start_main+0xffff0000f52c00b8
0 [unknown]
...
# perf record -e probe:sys_write ~/test
# perf script -F ip,sym,symoff
c0000000004025b0 sys_write+0x10
...
After applying this patch:
# perf record -e probe:sys_write -g ~/test
# perf script -F ip,sym,symoff
c0000000004025b0 sys_write+0x10
c00000000000b9e0 system_call+0x58
7fffb70d8234 __GI___libc_write+0x24
7fffb7052c74 _IO_file_write@@GLIBC_2.17+0x44
5afc1818 [unknown]
7fffb7051a60 new_do_write+0x90
7fffb7054638 _IO_do_write@@GLIBC_2.17+0x38
7fffb7054bbc _IO_file_overflow@@GLIBC_2.17+0x14c
7fffb7055a24 __overflow+0x64
7fffb7044548 _IO_puts+0x218
10000440 main+0x20
7fffb6fe36a0 generic_start_main.isra.0+0x140
7fffb6fe3898 __libc_start_main+0xb8
0 [unknown]
...
# perf record -e probe:sys_write ~/test
# perf script -F ip,sym,symoff
c0000000004025b0 sys_write+0x10
...
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Link: http://lkml.kernel.org/r/20180517063326.6319-1-sandipan@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Perf stat doesn't count the uncore event aliases from the same uncore
block in a group, for example:
perf stat -e '{unc_m_cas_count.all,unc_m_clockticks}' -a -I 1000
# time counts unit events
1.000447342 <not counted> unc_m_cas_count.all
1.000447342 <not counted> unc_m_clockticks
2.000740654 <not counted> unc_m_cas_count.all
2.000740654 <not counted> unc_m_clockticks
The output is very misleading. It gives a wrong impression that the
uncore event doesn't work.
An uncore block could be composed by several PMUs. An uncore event alias
is a joint name which means the same event runs on all PMUs of a block.
Perf doesn't support mixed events from different PMUs in the same group.
It is wrong to put uncore event aliases in a big group.
The right way is to split the big group into multiple small groups which
only include the events from the same PMU.
Only uncore event aliases from the same uncore block should be specially
handled here. It doesn't make sense to mix the uncore events with other
uncore events from different blocks or even core events in a group.
With the patch:
# time counts unit events
1.001557653 140,833 unc_m_cas_count.all
1.001557653 1,330,231,332 unc_m_clockticks
2.002709483 85,007 unc_m_cas_count.all
2.002709483 1,429,494,563 unc_m_clockticks
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Agustin Vega-Frias <agustinv@codeaurora.org>
Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1525727623-19768-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So that kprobe definitions become:
int probe(function, variables)(void *ctx, int err, var1, var2, ...)
The existing 5sec.c, got converted and goes from:
SEC("func=hrtimer_nanosleep rqtp->tv_sec")
int func(void *ctx, int err, long sec)
{
}
To:
int probe(hrtimer_nanosleep, rqtp->tv_sec)(void *ctx, int err, long sec)
{
}
If we decide to add tv_nsec as well, then it becomes:
$ cat tools/perf/examples/bpf/5sec.c
#include <bpf.h>
int probe(hrtimer_nanosleep, rqtp->tv_sec rqtp->tv_nsec)(void *ctx, int err, long sec, long nsec)
{
return sec == 5;
}
license(GPL);
$
And if we run it, system wide as before and run some 'sleep' with values
for the tv_nsec field, we get:
# perf trace --no-syscalls -e tools/perf/examples/bpf/5sec.c
0.000 perf_bpf_probe:hrtimer_nanosleep:(ffffffff9811b5f0) tv_sec=5 tv_nsec=100000000
9641.650 perf_bpf_probe:hrtimer_nanosleep:(ffffffff9811b5f0) tv_sec=5 tv_nsec=123450001
^C#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-1v9r8f6ds5av0w9pcwpeknyl@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Description:
. Disable strace like syscall tracing (--no-syscalls), or try tracing
just some (-e *sleep).
. Attach a filter function to a kernel function, returning when it should
be considered, i.e. appear on the output:
$ cat tools/perf/examples/bpf/5sec.c
#include <bpf.h>
SEC("func=hrtimer_nanosleep rqtp->tv_sec")
int func(void *ctx, int err, long sec)
{
return sec == 5;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
$
. Run it system wide, so that any sleep of >= 5 seconds and < than 6
seconds gets caught.
. Ask for callgraphs using DWARF info, so that userspace can be unwound
. While this is running, run something like "sleep 5s".
# perf trace --no-syscalls -e tools/perf/examples/bpf/5sec.c/call-graph=dwarf/
0.000 perf_bpf_probe:func:(ffffffff9811b5f0) tv_sec=5
hrtimer_nanosleep ([kernel.kallsyms])
__x64_sys_nanosleep ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64 ([kernel.kallsyms])
__GI___nanosleep (/usr/lib64/libc-2.26.so)
rpl_nanosleep (/usr/bin/sleep)
xnanosleep (/usr/bin/sleep)
main (/usr/bin/sleep)
__libc_start_main (/usr/lib64/libc-2.26.so)
_start (/usr/bin/sleep)
^C#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-2nmxth2l2h09f9gy85lyexcq@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So, the first helper is the one shortening a variable/function section
attribute, from, for instance:
char _license[] __attribute__((section("license"), used)) = "GPL";
to:
char _license[] SEC("license") = "GPL";
Convert empty.c to that and it becomes:
# cat ~acme/lib/examples/perf/bpf/empty.c
#include <bpf.h>
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-zmeg52dlvy51rdlhyumfl5yf@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The first one is the bare minimum that bpf infrastructure accepts before
it expects actual events to be set up:
$ cat tools/perf/examples/bpf/empty.c
char _license[] __attribute__((section("license"), used)) = "GPL";
int _version __attribute__((section("version"), used)) = LINUX_VERSION_CODE;
$
If you remove that "version" line, then it will be refused with:
# perf trace -e tools/perf/examples/bpf/empty.c
event syntax error: 'tools/perf/examples/bpf/empty.c'
\___ Failed to load tools/perf/examples/bpf/empty.c from source: 'version' section incorrect or lost
(add -v to see detail)
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
#
The next ones will, step by step, show simple filters, then the needs
for headers will be made clear, it will be put in place and tested with
new examples, rinse, repeat.
Back to using this first one to test the perf+bpf infrastructure:
If we run it will fail, as no functions are present connecting with,
say, a tracepoint or a function using the kprobes or uprobes
infrastructure:
# perf trace -e tools/perf/examples/bpf/empty.c
WARNING: event parser found nothing
invalid or unsupported event: 'tools/perf/examples/bpf/empty.c'
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
#
But, if we set things up to dump the generated object file to a file,
and do this after having run 'make install', still on the developer's
$HOME directory:
# cat ~/.perfconfig
[llvm]
dump-obj = true
#
# perf trace -e ~acme/lib/examples/perf/bpf/empty.c
LLVM: dumping /home/acme/lib/examples/perf/bpf/empty.o
WARNING: event parser found nothing
invalid or unsupported event: '/home/acme/lib/examples/perf/bpf/empty.c'
<SNIP>
#
We can look at the dumped object file:
# ls -la ~acme/lib/examples/perf/bpf/empty.o
-rw-r--r--. 1 root root 576 May 4 12:10 /home/acme/lib/examples/perf/bpf/empty.o
# file ~acme/lib/examples/perf/bpf/empty.o
/home/acme/lib/examples/perf/bpf/empty.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped
# readelf -sw ~acme/lib/examples/perf/bpf/empty.o
Symbol table '.symtab' contains 3 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 _license
2: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 4 _version
#
# tools/bpf/bpftool/bpftool --pretty ~acme/lib/examples/perf/bpf/empty.o
null
#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-y7dkhakejz3013o0w21n98xd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This is not specific to BPF but was found when parsing a .c BPF proggie
that while valid, had no events attached to tracepoints, kprobes, etc:
Very minimal file that perf's BPF code can compile:
# cat empty.c
char _license[] __attribute__((section("license"), used)) = "GPL";
int _version __attribute__((section("version"), used)) = LINUX_VERSION_CODE;
#
Before this patch:
# perf trace -e empty.c
WARNING: event parser found nothinginvalid or unsupported event: 'empty.c'
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
#
After:
# perf trace -e empty.c
WARNING: event parser found nothing
invalid or unsupported event: 'empty.c'
Run 'perf list' for a list of valid events
Usage: perf trace [<options>] [<command>]
or: perf trace [<options>] -- <command> [<options>]
or: perf trace record [<options>] [<command>]
or: perf trace record [<options>] -- <command> [<options>]
-e, --event <event> event/syscall selector. use 'perf list' to list available events
#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-8ysughiz00h6mjpcot04qyjj@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
CoreSight doesn't allocate thread structure for unknown_thread in ETM
auxtrace, so unknown_thread is NULL pointer. If the perf data doesn't
contain valid tid and then cs_etm__mem_access() uses unknown_thread
instead as thread handler, this results in a segmentation fault when
thread__find_addr_map() accesses the thread handler.
This commit creates a new thread data which is used by unknown_thread, so
CoreSight tracing can roll back to use unknown_thread if perf data
doesn't include valid thread info. This commit also releases thread
data for initialization failure case and for normal auxtrace free flow.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1525924920-4381-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When we perform the following command lines:
$ perf record -e "{cycles,branches}" ./div
$ perf annotate main --stdio
The output shows only the first event, "cycles" and the displaying
format is not correct.
Percent | Source code & Disassembly of div for cycles (44550 samples)
-----------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000004004b0 <main>:
: main():
:
: return i;
: }
:
: int main(void)
: {
0.00 : 4004b0: push %rbx
: int i;
: int flag;
: volatile double x = 1212121212, y = 121212;
:
: s_randseed = time(0);
0.00 : 4004b1: xor %edi,%edi
: srand(s_randseed);
0.00 : 4004b3: mov $0x77359400,%ebx
:
: return i;
: }
The issue is that the value of the 'nr_percent' variable is hardcoded to
1. This patch fixes it.
With this patch, the output is:
Percent | Source code & Disassembly of div for cycles (44550 samples)
-----------------------------------------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000004004b0 <main>:
: main():
:
: return i;
: }
:
: int main(void)
: {
0.00 0.00 : 4004b0: push %rbx
: int i;
: int flag;
: volatile double x = 1212121212, y = 121212;
:
: s_randseed = time(0);
0.00 0.00 : 4004b1: xor %edi,%edi
: srand(s_randseed);
0.00 0.00 : 4004b3: mov $0x77359400,%ebx
:
: return i;
: }
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: f681d593d1 ("perf annotate: Remove disasm__calc_percent() from disasm_line__print()")
Link: http://lkml.kernel.org/r/1525881435-4092-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>