The proc files which is sorted with alphabetical order are evenly
assigned to several synthesize threads to be processed in parallel.
For 'perf top', the threads number hard code to online CPU number. The
following patch will introduce an option to set it.
For other perf tools, the thread number is 1. Because the process
function is not ready for multithreading, e.g.
process_synthesized_event.
This patch series only support event synthesize multithreading for 'perf
top'. For other tools, it can be done separately later.
With multithread applied, the total processing time can get up to 1.56x
speedup on Knights Mill for 'perf top'.
For specific single event processing, the processing time could increase
because of the lock contention. So proc_map_timeout may need to be
increased. Otherwise some proc maps will be truncated.
Based on my test, increasing the proc_map_timeout has small impact
on the total processing time. The total processing time still get 1.49x
speedup on Knights Mill after increasing the proc_map_timeout.
The patch itself doesn't increase the proc_map_timeout.
Doesn't need to implement multithreading for per task monitoring,
perf_event__synthesize_thread_map. It doesn't have performance issue.
Committer testing:
# getconf _NPROCESSORS_ONLN
4
# perf trace --no-inherit -e clone -o /tmp/output perf top
# tail -4 /tmp/bla
0.124 ( 0.041 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3eb3a8f30, parent_tidptr: 0x7fc3eb3a99d0, child_tidptr: 0x7fc3eb3a99d0, tls: 0x7fc3eb3a9700) = 9548 (perf)
0.246 ( 0.023 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3eaba7f30, parent_tidptr: 0x7fc3eaba89d0, child_tidptr: 0x7fc3eaba89d0, tls: 0x7fc3eaba8700) = 9549 (perf)
0.286 ( 0.019 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3ea3a6f30, parent_tidptr: 0x7fc3ea3a79d0, child_tidptr: 0x7fc3ea3a79d0, tls: 0x7fc3ea3a7700) = 9550 (perf)
246.540 ( 0.047 ms): clone(flags: VM|FS|FILES|SIGHAND|THREAD|SYSVSEM|SETTLS|PARENT_SETTID|CHILD_CLEARTID, child_stack: 0x7fc3ea3a6f30, parent_tidptr: 0x7fc3ea3a79d0, child_tidptr: 0x7fc3ea3a79d0, tls: 0x7fc3ea3a7700) = 9551 (perf)
#
Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1506696477-146932-4-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To process any events, it needs to find the thread in the machine first.
The machine maintains a rb tree to store all threads. The rb tree is
protected by a rw lock.
It is not a problem for current perf which serially processing events.
However, it will have scalability performance issue to process events in
parallel, especially on a heavy load system which have many threads.
Introduce a hashtable to divide the big rb tree into many samll rb tree
for threads. The index is thread id % hashtable size. It can reduce the
lock contention.
Committer notes:
Renamed some variables and function names to reduce semantic confusion:
'struct threads' pointers: thread -> threads
threads hastable index: tid -> hash_bucket
struct threads *machine__thread() -> machine__threads()
Cast tid to (unsigned int) to handle -1 in machine__threads() (Kan Liang)
Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Lukasz Odzioba <lukasz.odzioba@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1505096603-215017-2-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reuse 'mprotect' beautifiers for 'pkey_mprotect'.
System wide tracing pkey_alloc, pkey_free and pkey_mprotect calls, with
backtraces:
# perf trace -e pkey_alloc,pkey_mprotect,pkey_free --max-stack=5
0.000 ( 0.011 ms): pkey/7818 pkey_alloc(init_val: DISABLE_ACCESS|DISABLE_WRITE) = -1 EINVAL Invalid argument
syscall (/usr/lib64/libc-2.25.so)
pkey_alloc (/home/acme/c/pkey)
0.022 ( 0.003 ms): pkey/7818 pkey_mprotect(start: 0x7f28c3890000, len: 4096, prot: READ|WRITE, pkey: -1) = 0
syscall (/usr/lib64/libc-2.25.so)
pkey_mprotect (/home/acme/c/pkey)
0.030 ( 0.002 ms): pkey/7818 pkey_free(pkey: -1 ) = -1 EINVAL Invalid argument
syscall (/usr/lib64/libc-2.25.so)
pkey_free (/home/acme/c/pkey)
The tools/include/uapi/asm-generic/mman-common.h file is used to find
the access rights defines for the pkey_alloc syscall second argument.
Since we have the detector of changes for the tools/include header files
versus its kernel origin (include/uapi/asm-generic/mman-common.h), we'll
get whatever new flag appears for that argument automatically.
This method should be used in other cases where it is easy to generate
those flags tables because the header has properly namespaced defines
like PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-3xq5312qlks7wtfzv2sk3nct@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
By using the _IOC_(DIR,NR,TYPE,SIZE) macros to lookup a 'type' keyed
table that then gets indexed by 'nr', falling back to a notation similar
to the one used by 'strace', only more compact, i.e.:
474.356 ( 0.007 ms): gnome-shell/22401 ioctl(fd: 8</dev/dri/card0>, cmd: (READ|WRITE, 0x64, 0xae, 0x1c), arg: 0x7ffc934f7880) = 0
474.369 ( 0.053 ms): gnome-shell/22401 ioctl(fd: 8</dev/dri/card0>, cmd: (READ|WRITE, 0x64, 0xb0, 0x18), arg: 0x7ffc934f77d0) = 0
505.055 ( 0.014 ms): gnome-shell/22401 ioctl(fd: 8</dev/dri/card0>, cmd: (READ|WRITE, 0x64, 0xaf, 0x4), arg: 0x7ffc934f741c) = 0
This also moves it out of builtin-trace.c and into trace/beauty/ioctl.c
to better compartimentalize all these formatters.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-s3enursdxsvnhdomh6qlte4g@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Avoiding a loop, so now its quite convenient to ssh to a machine and
then simply do:
# perf trace
To trace all syscalls without causing a loop.
This was possible using --filter-pids, i.e. once you noticed the loop,
get the sshd pid and add it to --filter-pids, restarting the 'perf
trace'.
Now to figure out how to do that in a X terminal, the other common
scenario, which is way more involved, as there are multiple processes
communicating to process terminal activity...
Using --filter-pids + '-e \!syscall,names,you,dont,need' may be a good
approximation when having to do syswide tracing on your workstation.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-68rjeao9wnpylla41htk7xps@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Previously we only used the syscall_fmt when we had sc->tp_format set,
i.e. when we found the (enter, exit) pair in tracefs/events/syscalls/.
But we really only need to use what is in sc->arg_fmt to apply the arg
beautifiers to the syscall argument values, so do it.
With this we will be able to provide formatters to the "clone" syscall,
which doesn't have entries in tracefs/events/syscalls/.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-y41nl41jrayjo5ucnde2peix@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The 'perf trace' tool is suppressing args set to zero, with the
exception of string tables (strarrays), which are kinda like enums, i.e.
we have maps to go from numbers to strings.
But the 'cmd' fcntl arg requires more specialized treatment, as its
value will regulate if the next fcntl syscall arg, 'arg', should be
ignored (not used) and also how to format the syscall return (fd, file
flags, etc), so add a 'show_zero" bool to struct syscall_arg_fmt, to
regulate this more explicitely.
Will be used in a following patch with fcntl, here is just the
mechanism.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-all738jctxets8ffyizp5lzo@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Instead of having syscall_fmt.{arg_scnprintf,arg_parm}, introduce
struct syscall_arg_fmt and have these two, paving the way for more
state to change the formatting algorithms.
For instance, in the 'fcntl' 'cmd' case it is better not to suppress
it when being zero, showing instead its name "DUPFD".
We had that in an ad-hoc way just for strarrays, but with more involved
cases like fcntl, that can't be done with just a strarray, we'll need
a ".show_zero" arg in the 'cmd' syscall_arg_fmt.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-ch06o2j72zbjx5xww4qp67au@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The s390 architecture maps sys_mmap (nr 90) into sys_old_mmap. For this
reason perf trace can't find the proper syscall event to get args format
from and displays it wrongly as 'continued'.
To fix that fill the "alias" field with "old_mmap" for trace's mmap record
to get the correct translation.
Before:
0.042 ( 0.011 ms): vest/43052 fstat(statbuf: 0x3ffff89fd90 ) = 0
0.042 ( 0.028 ms): vest/43052 ... [continued]: mmap()) = 0x3fffd6e2000
0.072 ( 0.025 ms): vest/43052 read(buf: 0x3fffd6e2000, count: 4096 ) = 6
After:
0.045 ( 0.011 ms): fstat(statbuf: 0x3ffff8a0930 ) = 0
0.057 ( 0.018 ms): mmap(arg: 0x3ffff8a0858 ) = 0x3fffd14a000
0.076 ( 0.025 ms): read(buf: 0x3fffd14a000, count: 4096 ) = 6
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20170531113557.19175-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To test it, build samples/statx/test_statx, which I did as:
$ make headers_install
$ cc -I ~/git/linux/usr/include samples/statx/test-statx.c -o /tmp/statx
And then use perf trace on it:
# perf trace -e statx /tmp/statx /etc/passwd
statx(/etc/passwd) = 0
results=7ff
Size: 3496 Blocks: 8 IO Block: 4096 regular file
Device: fd:00 Inode: 280156 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2017-03-29 16:01:01.650073438-0300
Modify: 2017-03-10 16:25:14.156479354-0300
Change: 2017-03-10 16:25:14.171479328-0300
0.000 ( 0.007 ms): statx/30648 statx(dfd: CWD, filename: 0x7ef503f4, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff7ef4eb10) = 0
#
Using the test-stat.c options to change the mask:
# perf trace -e statx /tmp/statx -O /etc/passwd > /dev/null
0.000 ( 0.008 ms): statx/30745 statx(dfd: CWD, filename: 0x3a0753f4, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffd3a0735c0) = 0
#
# perf trace -e statx /tmp/statx -A /etc/passwd > /dev/null
0.000 ( 0.010 ms): statx/30757 statx(dfd: CWD, filename: 0xa94e63f4, flags: SYMLINK_NOFOLLOW|NO_AUTOMOUNT, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffea94e49d0) = 0
#
# trace --no-inherit -e statx /tmp/statx -F /etc/passwd > /dev/null
0.000 ( 0.011 ms): statx(dfd: CWD, filename: 0x3b02d3f3, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffd3b02c850) = 0
#
# trace --no-inherit -e statx /tmp/statx -F -L /etc/passwd > /dev/null
0.000 ( 0.008 ms): statx(dfd: CWD, filename: 0x15cff3f3, flags: STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff15cfdda0) = 0
#
# trace --no-inherit -e statx /tmp/statx -D -O /etc/passwd > /dev/null
0.000 ( 0.009 ms): statx(dfd: CWD, filename: 0xfa37f3f3, flags: SYMLINK_NOFOLLOW|STATX_DONT_SYNC, mask: BTIME, buffer: 0x7ffffa37da20) = 0
#
Adding a probe to get the filename collected as well:
# perf probe 'vfs_getname=getname_flags:72 pathname=result->name:string'
Added new event:
probe:vfs_getname (on getname_flags:72 with pathname=result->name:string)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_getname -aR sleep 1
# trace --no-inherit -e statx /tmp/statx -D -O /etc/passwd > /dev/null
0.169 ( 0.007 ms): statx(dfd: CWD, filename: /etc/passwd, flags: SYMLINK_NOFOLLOW|STATX_DONT_SYNC, mask: BTIME, buffer: 0x7ffda9bf50f0) = 0
#
Same technique could be used to collect and beautify the result put in
the 'buffer' argument.
Finally do a system wide 'perf trace' session looking for any use of statx,
then run the test proggie with various flags:
# trace -e statx
16612.967 ( 0.028 ms): statx/4562 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffef195d660) = 0
33064.447 ( 0.011 ms): statx/4569 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW|STATX_FORCE_SYNC, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7ffc5484c790) = 0
36050.891 ( 0.023 ms): statx/4576 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW, mask: BTIME, buffer: 0x7ffeb18b66e0) = 0
38039.889 ( 0.023 ms): statx/4584 statx(dfd: CWD, filename: /tmp/statx, flags: SYMLINK_NOFOLLOW, mask: TYPE|MODE|NLINK|UID|GID|ATIME|MTIME|CTIME|INO|SIZE|BLOCKS|BTIME, buffer: 0x7fff1db0ea90) = 0
^C#
This one also starts moving the beautifiers from files directly included
in builtin-trace.c to separate objects + a beauty.h header with
prototypes, so that we can add test cases in tools/perf/tests/ to fire
syscalls with various arguments and then get them intercepted as
syscalls:sys_enter_foo or raw_syscalls:sys_enter + sys_exit to then
format and check that the formatted output is the one we expect.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Ahern <dsahern@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-xvzw8eynffvez5czyzidhrno@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Which may happen when we start a tracing session and a thread is waiting
for something like "poll" to return, in which case we better print "?"
both for the syscall entry timestamp and for the duration.
E.g.:
Tracing existing mutt session:
# perf trace -p `pidof mutt`
? ( ? ): mutt/17135 ... [continued]: poll()) = 1
0.027 ( 0.013 ms): mutt/17135 read(buf: 0x7ffcb3c42cef, count: 1) = 1
0.047 ( 0.008 ms): mutt/17135 poll(ufds: 0x7ffcb3c42c50, nfds: 1, timeout_msecs: 1000) = 1
0.059 ( 0.008 ms): mutt/17135 read(buf: 0x7ffcb3c42cef, count: 1) = 1
<SNIP>
Before it would print a large number because we'd do:
ttrace->entry_time - trace->base_time
And entry_time would be 0, while base_time would be the timestamp for
the first event 'perf trace' reads, oops.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Luis Claudio Gonçalves <lclaudio@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-wbcb93ofva2qdjd5ltn5eeqq@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>