If /proc/sys/kernel/perf_event_mlock_kb is not (power of 2 + PAGE_SIZE_in_kb)
and we let the perf tools do mmap length autosizing based on that, then, for
non-CAP_IPC_LOCK users when /proc/sys/kernel/perf_event_paranoid is > -1, then
we get an -EINVAL that ends up in:
[acme@ssdandy linux]$ trace usleep 1
Invalid argument
[acme@ssdandy linux]$ perf record usleep 1
failed to mmap with 22 (Invalid argument)
After this fix:
[acme@ssdandy linux]$ trace usleep 1
<SNIP>
0.806 ( 0.006 ms): munmap(addr: 0x7f7e4740a000, len: 66467) = 0
0.869 ( 0.002 ms): brk( ) = 0x7bb000
0.873 ( 0.003 ms): brk(brk: 0x7dc000 ) = 0x7dc000
0.877 ( 0.001 ms): brk( ) = 0x7dc000
0.953 ( 0.058 ms): nanosleep(rqtp: 0x7fff26ab9420 ) = 0
0.959 ( 0.000 ms): exit_group(
[acme@ssdandy linux]$ perf record usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~759 samples) ]
[acme@ssdandy linux]$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-6p6l5ou6jev6o7ymc4nn1n2a@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Considering the per user locked pages limit, improve the message when a
user uses multiple simultaneous perf mmap calls:
When the request is more than the current maximum:
[acme@ssdandy linux]$ trace -m 128 usleep 1
Error: Operation not permitted.
Hint: Check /proc/sys/kernel/perf_event_mlock_kb (516 kB) setting.
Hint: Tried using 516 kB.
Hint: Try 'sudo sh -c "echo 1032 > /proc/sys/kernel/perf_event_mlock_kb"', or
Hint: Try using a smaller -m/--mmap-pages value.
[acme@ssdandy linux]$
And when the limit is less than that:
[acme@ssdandy linux]$ trace -m 512 usleep 1
Error: Operation not permitted.
Hint: Check /proc/sys/kernel/perf_event_mlock_kb (2056 kB) setting.
Hint: Tried using 2052 kB.
Hint: Try using a smaller -m/--mmap-pages value.
[acme@ssdandy linux]$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-yqdie3c8qvdgenwleri267d4@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When failing due to asking for a number of mmap pages that is more than
the max, it was suggesting that an even bigger number of mmap pages
should be specified, doh, au contraire!
Before:
[acme@ssdandy linux]$ trace -m 128 usleep 1
Error: Operation not permitted.
Hint: Check /proc/sys/kernel/perf_event_mlock_kb (516 kB) setting.
Hint: Tried using 516 kB.
Hint: Try using a bigger -m/--mmap-pages value.
[acme@ssdandy linux]$
After:
[acme@ssdandy linux]$ trace -m 128 usleep 1
Error: Operation not permitted.
Hint: Check /proc/sys/kernel/perf_event_mlock_kb (516 kB) setting.
Hint: Tried using 516 kB.
Hint: Try using a smaller -m/--mmap-pages value.
[acme@ssdandy linux]$
And to (really) clarify what happens above, when what the user requests
is <= max and even then it fails, a changeset is being made to tell that
this is a per user limit, not per process (in the above example there
was another 'perf trace' running for this user, which was using all the
pages it could use).
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-8qope8lxb898narnq5kmu2gf@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add an index of the event identifiers, in preparation for Intel PT.
The event id (also called the sample id) is a unique number
allocated by the kernel to the event created by perf_event_open(). Events
can include the event id by having a sample type including PERF_SAMPLE_ID or
PERF_SAMPLE_IDENTIFIER.
Currently the main use of the event id is to match an event back to the
evsel to which it belongs i.e. perf_evlist__id2evsel()
The purpose of this patch is to make it possible to match an event back to
the mmap from which it was read. The reason that is useful is because the
mmap represents a time-ordered context (either for a cpu or for a thread).
Intel PT decodes trace information on that basis. In full-trace mode, that
information can be recorded when the Intel PT trace is read, but in
sample-mode the Intel PT trace data is embedded in a sample and it is in
that case that the "id index" is needed.
So the mmaps are numbered (idx) and the cpu and tid recorded against the id
by perf_evlist__set_sid_idx() which is called by perf_evlist__mmap_per_evsel().
That information is recorded on the perf.data file in the new "id index".
idx, cpu and tid are added to struct perf_sample_id (which is the node of
evlist's hash table to match ids to evsels). The information can be
retrieved using perf_evlist__id2sid(). Note however this all depends on
having a sample type including PERF_SAMPLE_ID or PERF_SAMPLE_IDENTIFIER,
otherwise ids are not recorded.
The "id index" is a synthesized event record which will be created when
Intel PT sampling is used by calling perf_event__synthesize_id_index().
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1414417770-18602-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We will use this in perf's evlist class so that it can, at
fdarray__filter() time, to unmap the associated ring buffer.
We may need to have further info associated with each fdarray entry, in
that case we'll make that int array a 'union fdarray_priv' one and put a
pointer there so that users can stash whatever they want there. For now,
an int is enough tho.
v2: Add clarification to the per array entry priv area, as well as make
it a union, which makes usage a bit longer, but if/when we make it
use more space by allowing per entry pointers existing users source
code will not have to be changed, just rebuilt.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/n/tip-0p00bn83quck3fio3kcs9vca@git.kernel.org
The extensible file description array that grew in the perf_evlist class
can be useful for other tools, as it is not something that only evlists
need, so move it to tools/lib/api/fd to ease sharing it.
v2: Don't use {} like in:
libapi_dirs:
$(QUIET_MKDIR)mkdir -p $(OUTPUT){fs,fd}/
in Makefiles, as it will not work in some systems, as in ubuntu13.10.
v3: Add fd/*.[ch] to LIBAPIKFS_SOURCES (Fix from Jiri Olsa)
v4: Leave the fcntl(fd, O_NONBLOCK) in the evlist layer, remains to
be checked if it is really needed there, but has no place in the
fdarray class (Fix from Jiri Olsa)
v5: Remove evlist details from fdarray grow/filter tests. Improve it a
bit doing more tests about expected internal state.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kleuni3hckbc3s0lu6yb9x40@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We want to know when the fd went away, like when a monitored thread
exits.
If we do not monitor such events, then the tools will wait forever on
events from a vanished thread, like when running:
$ sleep 5s &
$ perf record -p `pidof sleep`
This builds upon the kernel patch by Jiri Olsa that actually makes a
poll on those file descriptors to return POLLHUP.
It is also needed to change the tools to use
perf_evlist__filter_pollfd() to check if there are remainings fds to
monitor or if all are gone, in which case they will exit the
poll/mmap/read loop.
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-a4fslwspov0bs69nj825hqpq@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The perf_evlist__prepare_workload() method works by forking and then
waiting on a fd that must be written to to allow the workload to be
exec()ed.
But if the tool calling it fails to, say, set up the events with which
it wants to sample the workload for, it will not call
perf_evlist__start_workload(), but even in this case the workload ended
up running:
[acme@zoo linux]$ trace /bin/echo workload ends up running, it should not...
Couldn't mmap the events: Operation not permitted
workload ends up running, it should not...
[acme@zoo linux]$
So check if at least one byte was written before letting exec() be
called.
Now the expected behaviour:
[acme@zoo linux]$ trace /bin/echo workload ends up running, it should not...
Couldn't mmap the events: Operation not permitted
[acme@zoo linux]$
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-oh1ixo8m74rf295a05gfjw8b@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch fixes a memory corruption problem with the xyarray when the
evsel fds get closed at the end of the run_perf_stat() call.
It could be triggered with:
# perf stat -a -e power/energy-cores/ ls
When cpumask are used by events (.e.g, RAPL or uncores) then the evsel
fds are allocated based on the actual number of CPUs monitored. That
number can be smaller than the total number of CPUs on the system.
The problem arises at the end by perf stat closes the fds twice. When
fds are closed, their entry in the xyarray are set to -1.
The first close() on the evsel is made from __run_perf_stat() and it
uses the actual number of CPUS for the event which is how the xyarray
was allocated for.
The second is from perf_evlist_close() but that one is on the total
number of CPUs in the system, so it assume the xyarray was allocated to
cover it. However it was not, and some writes corrupt memory.
The fix is in perf_evlist_close() is to first try with the evsel->cpus
if present, if not use the evlist->cpus. That fixes the problem.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389972846-6566-3-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>