The branch history code has a loop detection function. With this, we can
get the number of iterations by calculating the removed loops.
While it would be nice for knowing the average cycles of iterations.
This patch adds up the cycles in branch entries of removed loops and
save the result to the next branch entry (e.g. branch entry A).
Finally it will display the iteration number and average cycles at the
"from" of branch entry A.
For example:
perf record -g -j any,save_type ./div
perf report --branch-history --no-children --stdio
--22.63%--main div.c:42 (RET CROSS_2M)
compute_flag div.c:28 (cycles:2 iter:173115 avg_cycles:2)
|
--10.73%--compute_flag div.c:27 (RET CROSS_2M)
rand rand.c:28 (cycles:1)
rand rand.c:28 (RET CROSS_2M)
__random random.c:298 (cycles:1)
__random random.c:297 (COND_BWD CROSS_2M)
__random random.c:295 (cycles:1)
__random random.c:295 (COND_BWD CROSS_2M)
__random random.c:295 (cycles:1)
__random random.c:295 (RET CROSS_2M)
Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1502111115-18305-1-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
In commit f8f4aaead5 ("perf annotate: Finally display IPC and cycle
accounting") the 'pcnt_width' variable was abused in a few places to
also include the optional width of the "IPC" and "cycles" columns, while
in other places we stopped using 'pcnt_width' and instead its previous
equation...
Now that we need to tap into annotate_browser__pcnt_width() to consider
if --show-total-period is being used and instead of that hardcoded 7
(strlen("Percent")) we need to use it or strlen("Event count") we need
this properly clarified to avoid having to touch all the (7 * nr_events)
places.
Clarify this by introducing a separate annotate_browser__cycles_width()
to leave the pcnt_width calculate just what its name implies.
Cc: Taeung Song <treeze.taeung@gmail.com>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-szgb07t4k5wtvks8nzwkg710@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
struct sym_hist has addr[] but it should have not only number of samples
but also the sample period. So use new struct symhist_entry to pave the
way to have that.
Committer notes:
This initial patch will only introduce the struct sym_hist_entry and use
only the nr_samples member, which makes the code clearer and paves the
way to save the period as well.
Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/1500500205-16553-1-git-send-email-treeze.taeung@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
For marking fused instructions clearly this patch adds a line before the
first instruction of pair and joins it with the arrow of the jump to its
target.
For example, when "je" is selected in annotate view, the line before
cmpl is displayed and joins the arrow of "je".
│ ┌──cmpl $0x0,argp_program_version_hook
81.93 │ ├──je 20
│ │ lock cmpxchg %esi,0x38a9a4(%rip)
│ │↓ jne 29
│ │↓ jmp 43
11.47 │20:└─→cmpxch %esi,0x38a999(%rip)
That means the cmpl+je is a fused instruction pair and they should be
considered together.
Changelog:
v3: Use Arnaldo's fix to improve the arrow origin rendering. To get the
evsel->evlist->env->cpuid, save the evsel in annotate_browser.
v2: new function "ins__is_fused" to check if the instructions are fused.
Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1499403995-19857-3-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Macro fusion merges two instructions to a single micro-op. Intel core
platform performs this hardware optimization under limited
circumstances.
For example, CMP + JCC can be "fused" and executed /retired together.
While with sampling this can result in the sample sometimes being on the
JCC and sometimes on the CMP. So for the fused instruction pair, they
could be considered together.
On Nehalem, fused instruction pairs:
cmp/test + jcc.
On other new CPU:
cmp/test/add/sub/and/inc/dec + jcc.
This patch adds an x86-specific function which checks if 2 instructions
are in a "fused" pair. For non-x86 arch, the function is just NULL.
Changelog:
v4: Move the CPU model checking to symbol__disassemble and save the CPU
family/model in arch structure.
It avoids checking every time when jump arrow printed.
v3: Add checking for Nehalem (CMP, TEST). For other newer Intel CPUs
just check it by default (CMP, TEST, ADD, SUB, AND, INC, DEC).
v2: Remove the original weak function. Arnaldo points out that doing it
as a weak function that will be overridden by the host arch doesn't
work. So now it's implemented as an arch-specific function.
Committer fix:
Do not access evsel->evlist->env->cpuid, ->env can be null, introduce
perf_evsel__env_cpuid(), just like perf_evsel__env_arch(), also used in
this function call.
The original patch was segfaulting 'perf top' + annotation.
But this essentially disables this fused instructions augmentation in
'perf top', the right thing is to get the cpuid from the running kernel,
left for a later patch tho.
Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1499403995-19857-2-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The annotate browser is divided into 2 frames. Left frame contains 3
columns (some platforms only have one column).
For example:
│26 int compute_flag()
│27 {
22.80 1.20 │ sub $0x8,%rsp
│25 int i;
│
│27 i = rand() % 2;
22.78 1.20 1 │ → callq rand@plt
While it's hard for user to understand what the data is.
This patch adds the titles "Percent", "IPC" and "Cycle" on columns.
Percent IPC Cycle │
│25 __attribute__((noinline))
│26 int compute_flag()
│27 {
22.80 1.20 │ sub $0x8,%rsp
│25 int i;
│
│27 i = rand() % 2;
22.78 1.20 1 │ → callq rand@plt
The titles are displayed at row 0 of annotate browser if row 0 doesn't
have values of percent, ipc and cycle.
Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Acked-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yao Jin <yao.jin@linux.intel.com>
Link: http://lkml.kernel.org/r/1493909895-9668-3-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Often it is interesting to know how costly a given source line is in
total. Previously, one had to build these sums manually based on all
addresses that pointed to the same source line. This patch introduces
srcline as a sort key, which will do the aggregation for us.
Paired with the recent addition of showing inline frames, this makes
perf report much more useful for many C++ work loads.
The following shows the new feature in action. First, let's show the
status quo output when we sort by address. The result contains many hist
entries that generate the same output:
~~~~~~~~~~~~~~~~
$ perf report --stdio --inline -g address
# Children Self Command Shared Object Symbol
# ........ ........ ............ ................... .........................................
#
99.89% 35.34% cpp-inlining cpp-inlining [.] main
|
|--64.55%--main complex:655
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/complex:664 (inline)
| |
| |--60.31%--hypot +20
| | |
| | |--8.52%--__hypot_finite +273
| | |
| | |--7.32%--__hypot_finite +411
...
--35.34%--_start +4194346
__libc_start_main +241
|
|--6.65%--main random.tcc:3326
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1809 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1818 (inline)
| /usr/include/c++/6.3.1/bits/random.h:185 (inline)
|
|--2.70%--main random.tcc:3326
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1809 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1818 (inline)
| /usr/include/c++/6.3.1/bits/random.h:185 (inline)
|
|--1.69%--main random.tcc:3326
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1809 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1818 (inline)
| /usr/include/c++/6.3.1/bits/random.h:185 (inline)
...
~~~~~~~~~~~~~~~~
With this patch and `-g srcline` we instead get the following output:
~~~~~~~~~~~~~~~~
$ perf report --stdio --inline -g srcline
# Children Self Command Shared Object Symbol
# ........ ........ ............ ................... .........................................
#
99.89% 35.34% cpp-inlining cpp-inlining [.] main
|
|--64.55%--main complex:655
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/complex:664 (inline)
| |
| |--64.02%--hypot
| | |
| | --59.81%--__hypot_finite
| |
| --0.53%--cabs
|
--35.34%--_start
__libc_start_main
|
|--12.48%--main random.tcc:3326
| /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1809 (inline)
| /usr/include/c++/6.3.1/bits/random.h:1818 (inline)
| /usr/include/c++/6.3.1/bits/random.h:185 (inline)
...
~~~~~~~~~~~~~~~~
Signed-off-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Yao Jin <yao.jin@linux.intel.com>
Link: http://lkml.kernel.org/r/20170318214928.9047-1-milian.wolff@kdab.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
If the address belongs to an inlined function, the source information
back to the first non-inlined function will be printed.
For example:
1. Show inlined function name
perf report -g function --inline
- 0.69% 0.00% inline ld-2.23.so [.] dl_main
- dl_main
0.56% _dl_relocate_object
_dl_relocate_object (inline)
elf_dynamic_do_Rela (inline)
2. Show the file/line information
perf report -g address --inline
- 0.69% 0.00% inline ld-2.23.so [.] _dl_start
_dl_start rtld.c:307
/build/glibc-GKVZIf/glibc-2.23/elf/rtld.c:413 (inline)
+ _dl_sysdep_start dl-sysdep.c:250
Signed-off-by: Yao Jin <yao.jin@linux.intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Link: http://lkml.kernel.org/r/1490474069-15823-6-git-send-email-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
If jump target is outside of function range, perf is not handling it
correctly. Especially when target address is lesser than function start
address, target offset will be negative. But, target address declared to
be unsigned, converts negative number into 2's complement. See below
example. Here target of 'jumpq' instruction at 34cf8 is 34ac0 which is
lesser than function start address(34cf0).
34ac0 - 34cf0 = -0x230 = 0xfffffffffffffdd0
Objdump output:
0000000000034cf0 <__sigaction>:
__GI___sigaction():
34cf0: lea -0x20(%rdi),%eax
34cf3: cmp -bashx1,%eax
34cf6: jbe 34d00 <__sigaction+0x10>
34cf8: jmpq 34ac0 <__GI___libc_sigaction>
34cfd: nopl (%rax)
34d00: mov 0x386161(%rip),%rax # 3bae68 <_DYNAMIC+0x2e8>
34d07: movl -bashx16,%fs:(%rax)
34d0e: mov -bashxffffffff,%eax
34d13: retq
perf annotate before applying patch:
__GI___sigaction /usr/lib64/libc-2.22.so
lea -0x20(%rdi),%eax
cmp -bashx1,%eax
v jbe 10
v jmpq fffffffffffffdd0
nop
10: mov _DYNAMIC+0x2e8,%rax
movl -bashx16,%fs:(%rax)
mov -bashxffffffff,%eax
retq
perf annotate after applying patch:
__GI___sigaction /usr/lib64/libc-2.22.so
lea -0x20(%rdi),%eax
cmp -bashx1,%eax
v jbe 10
^ jmpq 34ac0 <__GI___libc_sigaction>
nop
10: mov _DYNAMIC+0x2e8,%rax
movl -bashx16,%fs:(%rax)
mov -bashxffffffff,%eax
retq
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Chris Riyder <chris.ryder@arm.com>
Cc: Kim Phillips <kim.phillips@arm.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Taeung Song <treeze.taeung@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1480953407-7605-3-git-send-email-ravi.bangoria@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
New tool:
- 'perf sched timehist' provides an analysis of scheduling events.
Example usage:
perf sched record -- sleep 1
perf sched timehist
By default it shows the individual schedule events, including the wait
time (time between sched-out and next sched-in events for the task), the
task scheduling delay (time between wakeup and actually running) and run
time for the task:
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
-------- ------ ---------------- --------- --------- --------
1.874569 [0011] gcc[31949] 0.014 0.000 1.148
1.874591 [0010] gcc[31951] 0.000 0.000 0.024
1.874603 [0010] migration/10[59] 3.350 0.004 0.011
1.874604 [0011] <idle> 1.148 0.000 0.035
1.874723 [0005] <idle> 0.016 0.000 1.383
1.874746 [0005] gcc[31949] 0.153 0.078 0.022
...
Times are in msec.usec. (David Ahern, Namhyung Kim)
Improvements:
- Make 'perf c2c report' support -f/--force, to allow skipping the
ownership check for root users, for instance, just like the other
tools (Jiri Olsa)
- Allow sorting cachelines by total number of HITMs, in addition to
local and remote numbers (Jiri Olsa)
Fixes:
- Make sure errors aren't suppressed by the TUI reset at the end of
a 'perf c2c report' session (Jiri Olsa)
Infrastructure changes:
- Initial work on having the annotate code better support multiple
architectures, including the ability to cross-annotate, i.e. to
annotate perf.data files collected on an ARM system on a x86_64
workstation (Arnaldo Carvalho de Melo, Ravi Bangoria, Kim Phillips)
- Use USECS_PER_SEC instead of hard coded number in libtraceevent (Steven Rostedt)
- Add retrieval of preempt count and latency flags in libtraceevent (Steven Rostedt)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When horizontal scrolling is used in hierarchy mode, the folded signed
disappears at the right most column.
Committer note:
To test it, run 'perf top --hierarchy, see the '+' symbol at the first
column, then press the right arrow key, the '+' symbol will disappear,
this patch fixes that.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161108130833.9263-3-namhyung@kernel.org
[ Move 'width -= 2' invariant to right after the if/else ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The perf report/top on TUI supports horizontal scrolling using LEFT and
RIGHT keys.
But it calculate the number of columns incorrectly when hierarchy mode
is enabled so that keep pressing RIGHT key can make the output
disappeared.
In the hierarchy mode, all sort keys are collapsed into a single column,
so it needs to be applied when calculating column numbers.
Reported-and-Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161024162110.17918-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The perf report/top on TUI supports horizontal scrolling using LEFT and
RIGHT keys.
But it calculate the number of columns incorrectly when hierarchy mode
is enabled so that keep pressing RIGHT key can make the output
disappeared.
In the hierarchy mode, all sort keys are collapsed into a single column,
so it needs to be applied when calculating column numbers.
Reported-and-Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161024162110.17918-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Here is a small patch which tries to fulfill a point in the perf todo
list:
* Make pressing 'V' multiple times to go on cycling thru various
verbosity levels in 'perf top', so that info that is present in
'perf top -v' can be obtained without having to restart the tool
(acme).
After a small grep in the code, the max verbosity level seems 3; so,
we cycle at 4; I did not dare define a MAX_VERBOSE_LEVEL constant.
Signed-off-by: Alexis Berlemont <alexis.berlemont@gmail.com>
Suggested-and-Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20161012214823.14324-2-alexis.berlemont@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>