When processing events the session code has an ordered samples queue
which is used to time-sort events coming in across multiple mmaps. At a
later point in time samples on the queue are flushed up to some
timestamp at which point the event is actually processed.
When analyzing events live (ie., record/analysis path in the same
command) there is a race that leads to corrupted events and parse errors
which cause perf to terminate. The problem is that when the event is
placed in the ordered samples queue it is only a reference to the event
which is really sitting in the mmap buffer. Even though the event is
queued for later processing the mmap tail pointer is updated which
indicates to the kernel that the event has been processed. The race is
flushing the event from the queue before it gets overwritten by some
other event. For commands trying to process events live (versus just
writing to a file) and processing a high rate of events this leads to
parse failures and perf terminates.
Examples hitting this problem are 'perf kvm stat live', especially with
nested VMs which generate 100,000+ traces per second, and a command
processing scheduling events with a high rate of context switching --
e.g., running 'perf bench sched pipe'.
This patch offers live commands an option to copy the event when it is
placed in the ordered samples queue.
Based on a patch from David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1412347212-28237-2-git-send-email-yarygin@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The unw_addr_space_t in libunwind represents an address space to be used
for stack unwinding. It doesn't need to be create/destory everytime to
unwind callchain (as in get_entries) and can have a same lifetime as
thread (unless exec called).
So move the address space construction/destruction logic to the thread
lifetime handling functions. This is a preparation to enable caching in
the unwind library.
Note that it saves unw_addr_space_t object using thread__set_priv(). It
seems currently only used by perf trace and perf kvm stat commands which
don't use callchain.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jean Pihet <jean.pihet@linaro.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1412556363-26229-3-git-send-email-namhyung@kernel.org
[ Fixup unwind-libunwind.c missing CALLCHAIN_DWARF definition, added
missing __maybe_unused on unused parameters in stubs at util/unwind.h ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add test case in automated tests suite. It checks not only the two types
of pmu event stytle formats "pmu_event_name" and "cpu/pmu_event_name/",
but also the different formats mixtures which are more likely to trigger
parse issue.
The patch set including this one has been tested by the perf automated
test:
./perf test parse -v"
On haswell, ivybridge and Romley platform.
The patch set also has been tested on haswell by the following script.
Note: please make sure that your test system support TSX and
L1-dcache-loads events. Otherwise, you may want to change the events to
other pmu events.
[lk@localhost ~]$ cat perf_style_test.sh
# hardware events + kernel pmu event with different style
perf stat -x, -e cycles,mem-stores,tx-start sleep 2
perf stat -x, -e cpu-cycles,cycles-ct,cycles-t sleep 2
perf stat -x, -e cycles,cpu/cycles-ct/,cpu/cycles-t/ sleep 2
perf stat -x, -e instructions,cpu/tx-start/ sleep 2
perf stat -x, -e '{cycles,tx-start}' sleep 2
perf stat -x, -e '{cycles,cpu/tx-start/}' sleep 2
# HW Cache event + kernel pmu event with different style
perf stat -x, -e L1-dcache-loads,cpu/mem-stores/,tx-start sleep 2
perf stat -x, -e L1-dcache-loads,mem-stores,cpu/tx-start/ sleep 2
perf stat -x, -e '{L1-dcache-loads,mem-stores}' sleep 2
perf stat -x, -e '{L1-dcache-loads,cpu/tx-start/}' sleep 2
# Raw event + kernel pmu event with different style:
perf stat -x, -e cpu/event=0xc0,umask=0x00/,mem-loads,cpu/mem-stores/ sleep 2
perf stat -x, -e cpu/event=0xc0,umask=0x00/,tx-start,cpu/el-start/ sleep 2
perf stat -x, -e '{cpu/event=0xc0,umask=0x00/,tx-start}' sleep 2
Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1412694532-23391-5-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add new rules for kernel PMU event.
Currently, the patch only want to handle the PMU event name as "a-b" and
"a".
event_pmu:
PE_KERNEL_PMU_EVENT sep_dc
|
PE_PMU_EVENT_PRE '-' PE_PMU_EVENT_SUF sep_dc
PE_KERNEL_PMU_EVENT token is for
cycles-ct/cycles-t/mem-loads/mem-stores.
The prefix cycles is mixed up with cpu-cycles. loads and stores are
mixed up with cache event So they have to be hardcode in lex.
PE_PMU_EVENT_PRE and PE_PMU_EVENT_SUF tokens are for other PMU events.
The lex looks generic identifier up in the table and return the matched
token. If there is no match, generic PE_NAME token will be return.
Using the rules, kernel PMU event could use new style format without //
so you can use:
perf record -e mem-loads ...
instead of:
perf record -e cpu/mem-loads/
Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1412694532-23391-4-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
There are two types of event formats for PMU events. E.g. el-abort OR
cpu/el-abort/. However, the lexer mistakenly recognizes the simple style
format as two events.
The parse_events_pmu_check function uses bsearch to search the name in
known pmu event list. It can tell the lexer that the name is a PE_NAME
or a PMU event name prefix or a PMU event name suffix. All these
information will be used for accurately parsing kernel PMU events.
The pmu events list will be read from sysfs at runtime.
Note: Currently, the patch only want to handle the PMU event name as
"a-b" and "a". The only exception, "stalled-cycles-frontend" and
"stalled-cycles-fronted", are already hardcoded in lexer.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1412694532-23391-3-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This reverts commit 50e200f079 ("perf tools: Default to cpu// for
events v5")
The fixup cannot handle the case that
new style format(which without //) mixed with
other different formats.
For example,
group events with new style format: {mem-stores,mem-loads}
some hardware event + new style event: cycles,mem-loads
Cache event + new style event: LLC-loads,mem-loads
Raw event + new style event:
cpu/event=0xc8,umask=0x08/,mem-loads
old style event and new stytle mixture: mem-stores,cpu/mem-loads/
Signed-off-by: Kan Liang <kan.liang@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1412694532-23391-2-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When 'perf top' is run, one can't easily find a difference
between -z option and normal output.
So I added a visual cue to know whether it is the zeroing or not.
Output is as below.
Before:
$ perf top
Samples: 61K of event 'cycles', Event count (approx.): 3908136933
Overhead Shared Object Symbol
1.42% firefox [.] 0x0000000000011e76
1.32% libpthread-2.17.so [.] pthread_mutex_lock
If you press key 'z' or run with zero option like '$ perf top --zero', it is as below.
After:
Samples: 61K of event 'cycles', Event count (approx.): 3908136933 [z]
Overhead Shared Object Symbol
1.42% firefox [.] 0x0000000000011e76
1.32% libpthread-2.17.so [.] pthread_mutex_lock
Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1412665995-26359-1-git-send-email-treeze.taeung@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
Infrastructure fixes and changes:
* Fix off-by-one bugs in map->end handling (Stephane Eranian)
* Fix off-by-one bug in maps__find(), also related to map->end handling (Namhyung Kim)
* Make struct symbol->end be the first addr after the symbol range, to make it
match the convention used for struct map->end. (Arnaldo Carvalho de Melo)
* Fix perf_evlist__add_pollfd() error handling in 'perf kvm stat live' (Jiri Olsa)
* Fix python test build by moving callchain_param to an object linked into the
python binding (Jiri Olsa)
* Do not include a struct hists per perf_evsel, untangling the histogram code
from perf_evsel, to pave the way for exporting a minimalistic
tools/lib/api/perf/ library usable by tools/perf and initially by the rasd
daemon being developed by Borislav Petkov, Robert Richter and Jean Pihet.
(Arnaldo Carvalho de Melo)
* Make perf_evlist__open(evlist, NULL, NULL), i.e. without cpu and thread
maps mean syswide monitoring, reducing the boilerplate for tools that
only want system wide mode. (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes off-by-one errors in the management of maps.
A map is defined by start address and length as implemented by
map__new():
map__init(map, type, start, start + len, pgoff, dso);
map->start = addr;
map->end = end;
Consequently, the actual address range is [start; end[ map->end is the
first byte outside the range.
This patch fixes two bugs where upper bound checking was off-by-one.
In V2, we fix map_groups__fixup_overlappings() some more where
map->start was off-by-one as reported by Jiri.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20141006083532.GA4850@quad
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
A segfault happens on 'perf test hists_link' because we end up using a
struct machines on the stack, and then machines__init() was not
initializing the newly introduced rb_root, just the existing list_head.
When we introduced struct dsos, to group the two ways to store dsos,
i.e. the linked list and the rbtree, we didn't turned the initialization
done in:
machines__init(machines->host) ->
machine__init() ->
INIT_LIST_HEAD
into a dsos__init() to keep on initializing the list_head but _as well_
initializing the rb_root, oops.
All worked because outside perf-test we probably zalloc the whole thing
which ends up initializing it in to NULL.
So the problem looks contained to 'perf test' that uses it on stack,
etc.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>,
Cc: Adrian Hunter <adrian.hunter@intel.com>,
Cc: Don Zickus <dzickus@redhat.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Waiman Long <Waiman.Long@hp.com>,
Link: http://lkml.kernel.org/r/20141014180353.GF3198@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf updates from Ingo Molnar:
"Kernel side updates:
- Fix and enhance poll support (Jiri Olsa)
- Re-enable inheritance optimization (Jiri Olsa)
- Enhance Intel memory events support (Stephane Eranian)
- Refactor the Intel uncore driver to be more maintainable (Zheng
Yan)
- Enhance and fix Intel CPU and uncore PMU drivers (Peter Zijlstra,
Andi Kleen)
- [ plus various smaller fixes/cleanups ]
User visible tooling updates:
- Add +field argument support for --field option, so that one can add
fields to the default list of fields to show, ie now one can just
do:
perf report --fields +pid
And the pid will appear in addition to the default fields (Jiri
Olsa)
- Add +field argument support for --sort option (Jiri Olsa)
- Honour -w in the report tools (report, top), allowing to specify
the widths for the histogram entries columns (Namhyung Kim)
- Properly show submicrosecond times in 'perf kvm stat' (Christian
Borntraeger)
- Add beautifier for mremap flags param in 'trace' (Alex Snast)
- perf script: Allow callchains if any event samples them
- Don't truncate Intel style addresses in 'annotate' (Alex Converse)
- Allow profiling when kptr_restrict == 1 for non root users, kernel
samples will just remain unresolved (Andi Kleen)
- Allow configuring default options for callchains in config file
(Namhyung Kim)
- Support operations for shared futexes. (Davidlohr Bueso)
- "perf kvm stat report" improvements by Alexander Yarygin:
- Save pid string in opts.target.pid
- Enable the target.system_wide flag
- Unify the title bar output
- [ plus lots of other fixes and small improvements. ]
Tooling infrastructure changes:
- Refactor unit and scale function parameters for PMU parsing
routines (Matt Fleming)
- Improve DSO long names lookup with rbtree, resulting in great
speedup for workloads with lots of DSOs (Waiman Long)
- We were not handling POLLHUP notifications for event file
descriptors
Fix it by filtering entries in the events file descriptor array
after poll() returns, refcounting mmaps so that when the last fd
pointing to a perf mmap goes away we do the unmap (Arnaldo Carvalho
de Melo)
- Intel PT prep work, from Adrian Hunter, including:
- Let a user specify a PMU event without any config terms
- Add perf-with-kcore script
- Let default config be defined for a PMU
- Add perf_pmu__scan_file()
- Add a 'perf test' for tracking with sched_switch
- Add 'flush' callback to scripting API
- Use ring buffer consume method to look like other tools (Arnaldo
Carvalho de Melo)
- hists browser (used in top and report) refactorings, getting rid of
unused variables and reducing source code size by handling similar
cases in a fewer functions (Namhyung Kim).
- Replace thread unsafe strerror() with strerror_r() accross the
whole tools/perf/ tree (Masami Hiramatsu)
- Rename ordered_samples to ordered_events and allow setting a queue
size for ordering events (Jiri Olsa)
- [ plus lots of fixes, cleanups and other improvements ]"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (198 commits)
perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment
perf/x86/intel/uncore: Fix minor race in box set up
perf record: Fix error message for --filter option not coming after tracepoint
perf tools: Fix build breakage on arm64 targets
perf symbols: Improve DSO long names lookup speed with rbtree
perf symbols: Encapsulate dsos list head into struct dsos
perf bench futex: Sanitize -q option in requeue
perf bench futex: Support operations for shared futexes
perf trace: Fix mmap return address truncation to 32-bit
perf tools: Refactor unit and scale function parameters
perf tools: Fix line number in the config file error message
perf tools: Convert {record,top}.call-graph option to call-graph.record-mode
perf tools: Introduce perf_callchain_config()
perf callchain: Move some parser functions to callchain.c
perf tools: Move callchain config from record_opts to callchain_param
perf hists browser: Fix callchain print bug on TUI
perf tools: Use ACCESS_ONCE() instead of volatile cast
perf tools: Modify error code for when perf_session__new() fails
perf tools: Fix perf record as non root with kptr_restrict == 1
perf stat: Fix --per-core on multi socket systems
...
That loop in there is both anti-idiomatic *and* completely pointless.
strtoll() is there for purpose; use it and compare what's left with
acceptable suffices.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Attempting to build the perf tool for an arm64 target results in the
following failure:
arch/arm64/util/unwind-libunwind.c: In function 'libunwind__arch_reg_id':
arch/arm64/util/unwind-libunwind.c:77:3: error: implicit declaration of function 'pr_err'
pr_err("unwind: invalid reg id %d\n", regnum);
^
arch/arm64/util/unwind-libunwind.c:77:3: error: nested extern declaration of 'pr_err'
This is due to commit 84f5d36f48 ("perf tools: Move pr_* debug macros
into debug object") moving the pr_* macros into a new header file, but
failing to update architectures other than x86.
This patch adds the missing include, and fixes the build again.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1412076432-22045-1-git-send-email-will.deacon@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
With workload that spawns and destroys many threads and processes, it
was found that perf-mem could took a long time to post-process the perf
data after the target workload had completed its operation.
The performance bottleneck was found to be the lookup and insertion of
the new DSO structures (thousands of them in this case).
In a dual-socket Ivy-Bridge E7-4890 v2 machine (30-core, 60-thread), the
perf profile below shows what perf was doing after the profiled AIM7
shared workload completed:
- 83.94% perf libc-2.11.3.so [.] __strcmp_sse42
- __strcmp_sse42
- 99.82% map__new
machine__process_mmap_event
perf_session_deliver_event
perf_session__process_event
__perf_session__process_events
cmd_record
cmd_mem
run_builtin
main
__libc_start_main
- 13.17% perf perf [.] __dsos__findnew
__dsos__findnew
map__new
machine__process_mmap_event
perf_session_deliver_event
perf_session__process_event
__perf_session__process_events
cmd_record
cmd_mem
run_builtin
main
__libc_start_main
So about 97% of CPU times were spent in the map__new() function trying
to insert new DSO entry into the DSO linked list. The whole
post-processing step took about 9 minutes.
The DSO structures are currently searched linearly. So the total
processing time will be proportional to n^2.
To overcome this performance problem, the DSO code is modified to also
put the DSO structures in a RB tree sorted by its long name in
additional to being in a simple linked list. With this change, the
processing time will become proportional to n*log(n) which will be much
quicker for large n. However, the short name will still be searched
using the old linear searching method. With that patch in place, the
same perf-mem post-processing step took less than 30 seconds to
complete.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Link: http://lkml.kernel.org/r/1412098575-27863-3-git-send-email-Waiman.Long@hp.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When given the number of threads to requeue at once by user input,
there's always the risk of this value being larger than the total number
of threads. This doesn't make any sense, and the kernel can easily deal
with such sort of situations, hence no big deal. We should however
prevent bogus output such as:
./perf bench --repeat 2 futex requeue -q 10
Run summary [PID 22210]: Requeuing 4 threads (from [private] 0x99ef3c to 0x99ef38), 10 at a time.
[Run 1]: Requeued 10 of 4 threads in 0.0040 ms
[Run 2]: Requeued 10 of 4 threads in 0.0030 ms
Requeued 10 of 4 threads in 0.0035 ms (+-14.29%)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Link: http://lkml.kernel.org/r/1412008868-22328-2-git-send-email-dave@stgolabs.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>