Pull 'objtool' stack frame validation from Ingo Molnar:
"This tree adds a new kernel build-time object file validation feature
(ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
It was written by and is maintained by Josh Poimboeuf.
The motivation: there's a category of hard to find kernel bugs, most
of them in assembly code (but also occasionally in C code), that
degrades the quality of kernel stack dumps/backtraces. These bugs are
hard to detect at the source code level. Such bugs result in
incorrect/incomplete backtraces most of time - but can also in some
rare cases result in crashes or other undefined behavior.
The build time correctness checking is done via the new 'objtool'
user-space utility that was written for this purpose and which is
hosted in the kernel repository in tools/objtool/. The tool's (very
simple) UI and source code design is shaped after Git and perf and
shares quite a bit of infrastructure with tools/perf (which tooling
infrastructure sharing effort got merged via perf and is already
upstream). Objtool follows the well-known kernel coding style.
Objtool does not try to check .c or .S files, it instead analyzes the
resulting .o generated machine code from first principles: it decodes
the instruction stream and interprets it. (Right now objtool supports
the x86-64 architecture.)
From tools/objtool/Documentation/stack-validation.txt:
"The kernel CONFIG_STACK_VALIDATION option enables a host tool named
objtool which runs at compile time. It has a "check" subcommand
which analyzes every .o file and ensures the validity of its stack
metadata. It enforces a set of rules on asm code and C inline
assembly code so that stack traces can be reliable.
Currently it only checks frame pointer usage, but there are plans to
add CFI validation for C files and CFI generation for asm files.
For each function, it recursively follows all possible code paths
and validates the correct frame pointer state at each instruction.
It also follows code paths involving special sections, like
.altinstructions, __jump_table, and __ex_table, which can add
alternative execution paths to a given instruction (or set of
instructions). Similarly, it knows how to follow switch statements,
for which gcc sometimes uses jump tables."
When this new kernel option is enabled (it's disabled by default), the
tool, if it finds any suspicious assembly code pattern, outputs
warnings in compiler warning format:
warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer
... so that scripts that pick up compiler warnings will notice them.
All known warnings triggered by the tool are fixed by the tree, most
of the commits in fact prepare the kernel to be warning-free. Most of
them are bugfixes or cleanups that stand on their own, but there are
also some annotations of 'special' stack frames for justified cases
such entries to JIT-ed code (BPF) or really special boot time code.
There are two other long-term motivations behind this tool as well:
- To improve the quality and reliability of kernel stack frames, so
that they can be used for optimized live patching.
- To create independent infrastructure to check the correctness of
CFI stack frames at build time. CFI debuginfo is notoriously
unreliable and we cannot use it in the kernel as-is without extra
checking done both on the kernel side and on the build side.
The quality of kernel stack frames matters to debuggability as well,
so IMO we can merge this without having to consider the live patching
or CFI debuginfo angle"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
objtool: Only print one warning per function
objtool: Add several performance improvements
tools: Copy hashtable.h into tools directory
objtool: Fix false positive warnings for functions with multiple switch statements
objtool: Rename some variables and functions
objtool: Remove superflous INIT_LIST_HEAD
objtool: Add helper macros for traversing instructions
objtool: Fix false positive warnings related to sibling calls
objtool: Compile with debugging symbols
objtool: Detect infinite recursion
objtool: Prevent infinite recursion in noreturn detection
objtool: Detect and warn if libelf is missing and don't break the build
tools: Support relative directory path for 'O='
objtool: Support CROSS_COMPILE
x86/asm/decoder: Use explicitly signed chars
objtool: Enable stack metadata validation on 64-bit x86
objtool: Add CONFIG_STACK_VALIDATION option
objtool: Add tool to perform compile-time stack metadata validation
x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
sched: Always inline context_switch()
...
Pull perf updates from Ingo Molnar:
"Main kernel side changes:
- Big reorganization of the x86 perf support code. The old code grew
organically deep inside arch/x86/kernel/cpu/perf* and its naming
became somewhat messy.
The new location is under arch/x86/events/, using the following
cleaner hierarchy of source code files:
perf/x86: Move perf_event.c .................. => x86/events/core.c
perf/x86: Move perf_event_amd.c .............. => x86/events/amd/core.c
perf/x86: Move perf_event_amd_ibs.c .......... => x86/events/amd/ibs.c
perf/x86: Move perf_event_amd_iommu.[ch] ..... => x86/events/amd/iommu.[ch]
perf/x86: Move perf_event_amd_uncore.c ....... => x86/events/amd/uncore.c
perf/x86: Move perf_event_intel_bts.c ........ => x86/events/intel/bts.c
perf/x86: Move perf_event_intel.c ............ => x86/events/intel/core.c
perf/x86: Move perf_event_intel_cqm.c ........ => x86/events/intel/cqm.c
perf/x86: Move perf_event_intel_cstate.c ..... => x86/events/intel/cstate.c
perf/x86: Move perf_event_intel_ds.c ......... => x86/events/intel/ds.c
perf/x86: Move perf_event_intel_lbr.c ........ => x86/events/intel/lbr.c
perf/x86: Move perf_event_intel_pt.[ch] ...... => x86/events/intel/pt.[ch]
perf/x86: Move perf_event_intel_rapl.c ....... => x86/events/intel/rapl.c
perf/x86: Move perf_event_intel_uncore.[ch] .. => x86/events/intel/uncore.[ch]
perf/x86: Move perf_event_intel_uncore_nhmex.c => x86/events/intel/uncore_nmhex.c
perf/x86: Move perf_event_intel_uncore_snb.c => x86/events/intel/uncore_snb.c
perf/x86: Move perf_event_intel_uncore_snbep.c => x86/events/intel/uncore_snbep.c
perf/x86: Move perf_event_knc.c .............. => x86/events/intel/knc.c
perf/x86: Move perf_event_p4.c ............... => x86/events/intel/p4.c
perf/x86: Move perf_event_p6.c ............... => x86/events/intel/p6.c
perf/x86: Move perf_event_msr.c .............. => x86/events/msr.c
(Borislav Petkov)
- Update various x86 PMU constraint and hw support details (Stephane
Eranian)
- Optimize kprobes for BPF execution (Martin KaFai Lau)
- Rewrite, refactor and fix the Intel uncore PMU driver code (Thomas
Gleixner)
- Rewrite, refactor and fix the Intel RAPL PMU code (Thomas Gleixner)
- Various fixes and smaller cleanups.
There are lots of perf tooling updates as well. A few highlights:
perf report/top:
- Hierarchy histogram mode for 'perf top' and 'perf report',
showing multiple levels, one per --sort entry: (Namhyung Kim)
On a mostly idle system:
# perf top --hierarchy -s comm,dso
Then expand some levels and use 'P' to take a snapshot:
# cat perf.hist.0
- 92.32% perf
58.20% perf
22.29% libc-2.22.so
5.97% [kernel]
4.18% libelf-0.165.so
1.69% [unknown]
- 4.71% qemu-system-x86
3.10% [kernel]
1.60% qemu-system-x86_64 (deleted)
+ 2.97% swapper
#
- Add 'L' hotkey to dynamicly set the percent threshold for
histogram entries and callchains, i.e. dynamicly do what the
--percent-limit command line option to 'top' and 'report' does.
(Namhyung Kim)
perf mem:
- Allow specifying events via -e in 'perf mem record', also listing
what events can be specified via 'perf mem record -e list' (Jiri
Olsa)
perf record:
- Add 'perf record' --all-user/--all-kernel options, so that one
can tell that all the events in the command line should be
restricted to the user or kernel levels (Jiri Olsa), i.e.:
perf record -e cycles:u,instructions:u
is equivalent to:
perf record --all-user -e cycles,instructions
- Make 'perf record' collect CPU cache info in the perf.data file header:
$ perf record usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (7 samples) ]
$ perf report --header-only -I | tail -10 | head -8
# CPU cache info:
# L1 Data 32K [0-1]
# L1 Instruction 32K [0-1]
# L1 Data 32K [2-3]
# L1 Instruction 32K [2-3]
# L2 Unified 256K [0-1]
# L2 Unified 256K [2-3]
# L3 Unified 4096K [0-3]
Will be used in 'perf c2c' and eventually in 'perf diff' to
allow, for instance running the same workload in multiple
machines and then when using 'diff' show the hardware difference.
(Jiri Olsa)
- Improved support for Java, using the JVMTI agent library to do
jitdumps that then will be inserted in synthesized
PERF_RECORD_MMAP2 events via 'perf inject' pointed to synthesized
ELF files stored in ~/.debug and keyed with build-ids, to allow
symbol resolution and even annotation with source line info, see
the changeset comments to see how to use it (Stephane Eranian)
perf script/trace:
- Decode data_src values (e.g. perf.data files generated by 'perf
mem record') in 'perf script': (Jiri Olsa)
# perf script
perf 693 [1] 4.088652: 1 cpu/mem-loads,ldlat=30/P: ffff88007d0b0f40 68100142 L1 hit|SNP None|TLB L1 or L2 hit|LCK No <SNIP>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Improve support to 'data_src', 'weight' and 'addr' fields in
'perf script' (Jiri Olsa)
- Handle empty print fmts in 'perf script -s' i.e. when running
python or perl scripts (Taeung Song)
perf stat:
- 'perf stat' now shows shadow metrics (insn per cycle, etc) in
interval mode too. E.g:
# perf stat -I 1000 -e instructions,cycles sleep 1
# time counts unit events
1.000215928 519,620 instructions # 0.69 insn per cycle
1.000215928 752,003 cycles
<SNIP>
- Port 'perf kvm stat' to PowerPC (Hemant Kumar)
- Implement CSV metrics output in 'perf stat' (Andi Kleen)
perf BPF support:
- Support converting data from bpf events in 'perf data' (Wang Nan)
- Print bpf-output events in 'perf script': (Wang Nan).
# perf record -e bpf-output/no-inherit,name=evt/ -e ./test_bpf_output_3.c/map:channel.event=evt/ usleep 1000
# perf script
usleep 4882 21384.532523: evt: ffffffff810e97d1 sys_nanosleep ([kernel.kallsyms])
BPF output: 0000: 52 61 69 73 65 20 61 20 Raise a
0008: 42 50 46 20 65 76 65 6e BPF even
0010: 74 21 00 00 t!..
BPF string: "Raise a BPF event!"
#
- Add API to set values of map entries in a BPF object, be it
individual map slots or ranges (Wang Nan)
- Introduce support for the 'bpf-output' event (Wang Nan)
- Add glue to read perf events in a BPF program (Wang Nan)
- Improve support for bpf-output events in 'perf trace' (Wang Nan)
... and tons of other changes as well - see the shortlog and git log
for details!"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (342 commits)
perf stat: Add --metric-only support for -A
perf stat: Implement --metric-only mode
perf stat: Document CSV format in manpage
perf hists browser: Check sort keys before hot key actions
perf hists browser: Allow thread filtering for comm sort key
perf tools: Add sort__has_comm variable
perf tools: Recalc total periods using top-level entries in hierarchy
perf tools: Remove nr_sort_keys field
perf hists browser: Cleanup hist_browser__fprintf_hierarchy_entry()
perf tools: Remove hist_entry->fmt field
perf tools: Fix command line filters in hierarchy mode
perf tools: Add more sort entry check functions
perf tools: Fix hist_entry__filter() for hierarchy
perf jitdump: Build only on supported archs
tools lib traceevent: Add '~' operation within arg_num_eval()
perf tools: Omit unnecessary cast in perf_pmu__parse_scale
perf tools: Pass perf_hpp_list all the way through setup_sort_list
perf tools: Fix perf script python database export crash
perf jitdump: DWARF is also needed
perf bench mem: Prepare the x86-64 build for upstream memcpy_mcsafe() changes
...
When building with CONFIG_STACK_VALIDATION on a ppc64le host with an x86
cross-compiler, Stephen Rothwell saw the following objtool build errors:
DESCEND objtool
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/builtin-check.o
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/special.o
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/elf.o
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool.o
MKDIR /home/sfr/next/x86_64_allmodconfig/tools/objtool/arch/x86/insn/
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/libstring.o
elf.c:22:23: fatal error: sys/types.h: No such file or directory
compilation terminated.
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/exec-cmd.o
CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/help.o
builtin-check.c:28:20: fatal error: string.h: No such file or directory
compilation terminated.
objtool.c:28:19: fatal error: stdio.h: No such file or directory
compilation terminated.
It fails to build because it tries to compile objtool with the
cross-compiler instead of the host compiler.
Ensure that it always uses the host compiler by ignoring CROSS_COMPILE.
In order to do that properly, the libsubcmd.a library needs to be built
in tools/objtool/ rather than tools/lib/subcmd/. The latter directory
contains the cross-compiled version which is needed for perf and
possibly other tools.
Note that cross-compiling for x86 on a _big_ endian system would result
in a bunch of false positive objtool warnings during the kernel build
because it isn't endian-aware. But that's generally a rare edge case
and there haven't been any reports of anybody needing that.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55b63eefc347f1bb28573f972d8d1adbf1f1c31d.1456962210.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Had a bug where on error of parsing __print_array() where the fields are
freed after they were allocated, but since they were not set to NULL,
the freeing of the arg also tried to free the already freed fields
causing a double free.
Fix process_hex() while at it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20160209204237.188327674@goodmis.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When rounding to microseconds, if the timestamp subsecond is between
.999999500 and .999999999, it is rounded to .1000000, when it should
instead increment the second counter due to the overflow.
For example, if the timestamp is 1234.999999501 instead of seeing:
1235.000000
we see:
1234.1000000
Signed-off-by: Chaos.Chen <rainboy1215@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20160209204236.824426460@goodmis.org
[ fixed incrementing "secs" instead of decrementing it ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently there's a single function that is used to display a record's
data in human readable format. That's pevent_print_event().
Unfortunately, this gives little room for adding other output within the
line without updating that function call.
I've decided to split that function into 3 parts.
pevent_print_event_task() which prints the task comm, pid and the CPU
pevent_print_event_time() which outputs the record's timestamp
pevent_print_event_data() which outputs the rest of the event data.
pevent_print_event() now simply calls these three functions.
To save time from doing the search for event from the record's type, I
created a new helper function called pevent_find_event_by_record(),
which returns the record's event, and this event has to be passed to the
above functions.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20160229090128.43a56704@gandalf.local.home
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The operation '%' is not implemented on event-parse.c, causing an error
when parsing events with '%' the operation in its printk format. For
example,
# perf record -e sched:sched_deadline_yield ~/yield-test
Warning: [sched:sched_deadline_yield] unknown op '%'
....
# perf script
Warning: [sched:sched_deadline_yield] unknown op '%'
test 1641 [006] 3364.109319: sched:sched_deadline_yield: \
[FAILED TO PARSE] now=3364109314595 \
deadline=3364139295135 runtime=19975597
This patch implements the '%' operation. With this patch, we see the
correct output:
# perf record -e sched:sched_deadline_yield ~/yield-test
No Warning
# perf script
yield-test 4005 [001] 4623.650978: sched:sched_deadline_yield: \
now=4623.650974050 \
deadline=4623.680957364 remaining_runtime=19979611
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/5c96a395c56cea6d3d13d949051bdece86cc26e0.1456157869.git.bristot@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding support for warning/info/debug output within libapi code. Adding
following macros:
pr_warning(fmt, ...)
pr_info(fmt, ...)
pr_debug(fmt, ...)
Also adding libapi_set_print function to set above functions. This will
be used in perf to set standard debug handlers for libapi.
Adding 2 header files:
debug.h
- to be used outside libapi, contains
libapi_set_print interface
debug-internal.h
- to be used within libapi, contains
pr_warning/pr_info/pr_debug definitions
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1455465826-8426-2-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
User visible changes:
- Rename the "colors.code" ~/.perfconfig variable to "colors.jump_arrows",
as it controls just the that UI element in the annotate browser (Taeung Song)
- Avoid trying to read ELF symtabs from device files, noticed while doing
memory profiling work (Jiri Olsa)
- Improve context detection when offering options in the hists browser,
i.e. some options don't make sense when the browser is not working with
a perf.data file ('perf top' mode), only in 'perf report' mode, like
scripting (Namhyung Kim)
Infrastructure changes:
- Elliminate duplication in the hists browser filter functions, getting the
common part into a function that receives callbacks for filtering by
DSO, thread, etc. (Namhyung Kim)
- Fix misleadingly indented assignment, found using
gcc6 -Wmisleading-indentation (Markus Trippelsdorf)
- Handle LLVM relocation oddities in libbpf, introducing a 'perf test' that
detects such problems and then fixing the problem, so that the test now
passes (Wang Nan)
- More improvements to the build infrastructure to allow reusing the
feature detection facilities (Wang Nan)
- Auto initialize the globals needed by cpu__max_{cpu,node}() routines
(Arnaldo Carvalho de Melo)
Documentation changes:
- Document the perf sysctls in Documentation/sysctl/kernel.txt (Ben Hutchings)
- Document a bunch more ~/.perfconfig knobs (Taeung Song)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Libbpf should check the target section before doing relocation to ensure
the relocation is correct. If not, a bug in LLVM causes an error. See
[1]. Also, if an incorrect BPF script uses both global variable and
map, global variable whould be treated as map and be relocated without
error.
This patch saves the id of the map section into obj->efile and compare
target section of a relocation symbol against it during relocation.
Previous patch introduces a test case about this problem. After this
patch:
# ~/perf test BPF
37: Test BPF filter :
37.1: Test basic BPF filtering : Ok
37.2: Test BPF prologue generation : Ok
37.3: Test BPF relocation checker : Ok
# perf test -v BPF
...
37.3: Test BPF relocation checker :
...
libbpf: loading object '[bpf_relocation_test]' from buffer
libbpf: section .strtab, size 126, link 0, flags 0, type=3
libbpf: section .text, size 0, link 0, flags 6, type=1
libbpf: section .data, size 0, link 0, flags 3, type=1
libbpf: section .bss, size 0, link 0, flags 3, type=8
libbpf: section func=sys_write, size 104, link 0, flags 6, type=1
libbpf: found program func=sys_write
libbpf: section .relfunc=sys_write, size 16, link 10, flags 0, type=9
libbpf: section maps, size 16, link 0, flags 3, type=1
libbpf: maps in [bpf_relocation_test]: 16 bytes
libbpf: section license, size 4, link 0, flags 3, type=1
libbpf: license of [bpf_relocation_test] is GPL
libbpf: section version, size 4, link 0, flags 3, type=1
libbpf: kernel version of [bpf_relocation_test] is 40400
libbpf: section .symtab, size 144, link 1, flags 0, type=2
libbpf: map 0 is "my_table"
libbpf: collecting relocating info for: 'func=sys_write'
libbpf: Program 'func=sys_write' contains non-map related relo data pointing to section 65522
bpf: failed to load buffer
Compile BPF program failed.
test child finished with 0
---- end ----
Test BPF filter subtest 2: Ok
[1] https://llvm.org/bugs/show_bug.cgi?id=26243
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1453715801-7732-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull rdma updates from Doug Ledford:
"Initial roundup of 4.5 merge window patches
- Remove usage of ib_query_device and instead store attributes in
ib_device struct
- Move iopoll out of block and into lib, rename to irqpoll, and use
in several places in the rdma stack as our new completion queue
polling library mechanism. Update the other block drivers that
already used iopoll to use the new mechanism too.
- Replace the per-entry GID table locks with a single GID table lock
- IPoIB multicast cleanup
- Cleanups to the IB MR facility
- Add support for 64bit extended IB counters
- Fix for netlink oops while parsing RDMA nl messages
- RoCEv2 support for the core IB code
- mlx4 RoCEv2 support
- mlx5 RoCEv2 support
- Cross Channel support for mlx5
- Timestamp support for mlx5
- Atomic support for mlx5
- Raw QP support for mlx5
- MAINTAINERS update for mlx4/mlx5
- Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates
- Add support for remote invalidate to the iSER driver (pushed
through the RDMA tree due to dependencies, acknowledged by nab)
- Update to NFSoRDMA (pushed through the RDMA tree due to
dependencies, acknowledged by Bruce)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits)
IB/mlx5: Unify CQ create flags check
IB/mlx5: Expose Raw Packet QP to user space consumers
{IB, net}/mlx5: Move the modify QP operation table to mlx5_ib
IB/mlx5: Support setting Ethernet priority for Raw Packet QPs
IB/mlx5: Add Raw Packet QP query functionality
IB/mlx5: Add create and destroy functionality for Raw Packet QP
IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types
IB/mlx5: Allocate a Transport Domain for each ucontext
net/mlx5_core: Warn on unsupported events of QP/RQ/SQ
net/mlx5_core: Add RQ and SQ event handling
net/mlx5_core: Export transport objects
IB/mlx5: Expose CQE version to user-space
IB/mlx5: Add CQE version 1 support to user QPs and SRQs
IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext
IB/sa: Fix netlink local service GFP crash
IB/srpt: Remove redundant wc array
IB/qib: Improve ipoib UD performance
IB/mlx4: Advertise RoCE v2 support
IB/mlx4: Create and use another QP1 for RoCEv2
IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers
...
The new name is irq_poll as iopoll is already taken. Better suggestions
welcome.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Namhyung Kim pointed out a potential problem in original code that it
fetches names of maps from section header string table, which is used
to store section names.
Original code doesn't cause error because of a LLVM behavior that, it
combines shstrtab into strtab. For example:
$ echo 'int func() {return 0;}' | x86_64-oe-linux-clang -x c -o temp.o -c -
$ readelf -h ./temp.o
ELF Header:
Magic: 7f 45 4c 46 02 01 01 03 00 00 00 00 00 00 00 00
...
Section header string table index: 1
$ readelf -S ./temp.o
There are 10 section headers, starting at offset 0x288:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .strtab STRTAB 0000000000000000 00000230
0000000000000051 0000000000000000 0 0 1
...
$ readelf -p .strtab ./temp.o
String dump of section '.strtab':
[ 1] .text
[ 7] .comment
[ 10] .bss
[ 15] .note.GNU-stack
[ 25] .rela.eh_frame
[ 34] func
[ 39] .strtab
[ 41] .symtab
[ 49] .data
[ 4f] -
$ readelf -p .shstrtab ./temp.o
readelf: Warning: Section '.shstrtab' was not dumped because it does not exist!
Where, 'section header string table index' points to '.strtab', and
symbol names are also stored there.
However, in case of gcc:
$ echo 'int func() {return 0;}' | gcc -x c -o temp.o -c -
$ readelf -p .shstrtab ./temp.o
String dump of section '.shstrtab':
[ 1] .symtab
[ 9] .strtab
[ 11] .shstrtab
[ 1b] .text
[ 21] .data
[ 27] .bss
[ 2c] .comment
[ 35] .note.GNU-stack
[ 45] .rela.eh_frame
$ readelf -p .strtab ./temp.o
String dump of section '.strtab':
[ 1] func
They are separated sections.
Although original code doesn't cause error, we'd better use canonical
method for fetching symbol names to avoid potential behavior changing.
This patch learns from readelf's code, fetches string from sh_link
of .symbol section.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Reported-and-Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1449541544-67621-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Before this patch libbpf always do feature check even when cleaning.
For example:
$ cd kernel/tools/lib/bpf
$ make
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CC libbpf.o
CC bpf.o
LD libbpf-in.o
LINK libbpf.a
LINK libbpf.so
$ make clean
CLEAN libbpf
CLEAN core-gen
$ make clean
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CLEAN libbpf
CLEAN core-gen
$
Although the first 'make clean' doesn't show feature check result, it
still does the check. No output because check result is similar to
FEATURE-DUMP.libbpf.
This patch uses same method as perf to turn off feature checking when
'make clean'.
Reported-and-Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1448372181-151723-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
User visible changes:
- Allows BPF scriptlets specify arguments to be fetched using
DWARF info, using a prologue generated at compile/build time (He Kuang, Wang Nan)
- Allow attaching BPF scriptlets to module symbols (Wang Nan)
- Allow attaching BPF scriptlets to userspace code using uprobe (Wang Nan)
- BPF programs now can specify 'perf probe' tunables via its section name,
separating key=val values using semicolons (Wang Nan)
Testing some of these new BPF features:
Use case: get callchains when receiving SSL packets, filter then in the
kernel, at arbitrary place.
# cat ssl.bpf.c
#define SEC(NAME) __attribute__((section(NAME), used))
struct pt_regs;
SEC("func=__inet_lookup_established hnum")
int func(struct pt_regs *ctx, int err, unsigned short port)
{
return err == 0 && port == 443;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
#
# perf record -a -g -e ssl.bpf.c
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.787 MB perf.data (3 samples) ]
# perf script | head -30
swapper 0 [000] 58783.268118: perf_bpf_probe:func: (ffffffff816a0f60) hnum=0x1bb
8a0f61 __inet_lookup_established (/lib/modules/4.3.0+/build/vmlinux)
896def ip_rcv_finish (/lib/modules/4.3.0+/build/vmlinux)
8976c2 ip_rcv (/lib/modules/4.3.0+/build/vmlinux)
855eba __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
8572a8 process_backlog (/lib/modules/4.3.0+/build/vmlinux)
856b11 net_rx_action (/lib/modules/4.3.0+/build/vmlinux)
2a284b __do_softirq (/lib/modules/4.3.0+/build/vmlinux)
2a2ba3 irq_exit (/lib/modules/4.3.0+/build/vmlinux)
96b7a4 do_IRQ (/lib/modules/4.3.0+/build/vmlinux)
969807 ret_from_intr (/lib/modules/4.3.0+/build/vmlinux)
2dede5 cpu_startup_entry (/lib/modules/4.3.0+/build/vmlinux)
95d5bc rest_init (/lib/modules/4.3.0+/build/vmlinux)
1163ffa start_kernel ([kernel.vmlinux].init.text)
11634d7 x86_64_start_reservations ([kernel.vmlinux].init.text)
1163623 x86_64_start_kernel ([kernel.vmlinux].init.text)
qemu-system-x86 9178 [003] 58785.792417: perf_bpf_probe:func: (ffffffff816a0f60) hnum=0x1bb
8a0f61 __inet_lookup_established (/lib/modules/4.3.0+/build/vmlinux)
896def ip_rcv_finish (/lib/modules/4.3.0+/build/vmlinux)
8976c2 ip_rcv (/lib/modules/4.3.0+/build/vmlinux)
855eba __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
856660 netif_receive_skb_internal (/lib/modules/4.3.0+/build/vmlinux)
8566ec netif_receive_skb_sk (/lib/modules/4.3.0+/build/vmlinux)
430a br_handle_frame_finish ([bridge])
48bc br_handle_frame ([bridge])
855f44 __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
#
Use 'perf probe' various options to list functions, see what variables can
be collected at any given point, experiment first collecting without a filter,
then filter, use it together with 'perf trace', 'perf top', with or without
callchains, if it explodes, please tell us!
- Introduce a new callchain mode: "folded", that will list per line
representations of all callchains for a give histogram entry, facilitating
'perf report' output processing by other tools, such as Brendan Gregg's
flamegraph tools (Namhyung Kim)
E.g:
# perf report | grep -v ^# | head
18.37% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
|
---cpu_startup_entry
|
|--12.07%--start_secondary
|
--6.30%--rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
#
Becomes, in "folded" mode:
# perf report -g folded | grep -v ^# | head -5
18.37% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
12.07% cpu_startup_entry;start_secondary
6.30% cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
16.90% 0.00% swapper [kernel.kallsyms] [k] call_cpuidle
11.23% call_cpuidle;cpu_startup_entry;start_secondary
5.67% call_cpuidle;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
16.90% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter
11.23% cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
5.67% cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
15.12% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter_state
#
The user can also select one of "count", "period" or "percent" as the first column.
Infrastructure changes:
- Fix multiple leaks found with Valgrind and a refcount
debugger (Masami Hiramatsu)
- Add further 'perf test' entries for BPF and LLVM (Wang Nan)
- Improve 'perf test' to suport subtests, so that the series of tests
performed in the LLVM and BPF main tests appear in the default 'perf test'
output (Wang Nan)
- Move memdup() from tools/perf to tools/lib/string.c (Arnaldo Carvalho de Melo)
- Adopt strtobool() from the kernel into tools/lib/ (Wang Nan)
- Fix selftests_install tools/ Makefile rule (Kevin Hilman)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch is a preparation for BPF prologue support which allows
generating a series of BPF bytecode for fetching kernel data before
calling program code. With the newly introduced multiple instances
support, perf is able to create different prologues for different kprobe
points.
Before this patch, a bpf_program can be loaded into kernel only once,
and get the only resulting fd. What this patch does is to allow creating
and loading different variants of one bpf_program, then fetching their
fds.
Here we describe the basic idea in this patch. The detailed description
of the newly introduced APIs can be found in comments in the patch body.
The key of this patch is the new mechanism in bpf_program__load().
Instead of loading BPF program into kernel directly, it calls a
'pre-processor' to generate program instances which would be finally
loaded into the kernel based on the original code. To enable the
generation of multiple instances, libbpf passes an index to the
pre-processor so it know which instance is being loaded.
Pre-processor should be called from libbpf's user (perf) using
bpf_program__set_prep(). The number of instances and the relationship
between indices and the target instance should be clear when calling
bpf_program__set_prep().
To retrieve a fd for a specific instance of a program,
bpf_program__nth_fd() is introduced. It returns the resulting fd
according to index.
Signed-off-by: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: He Kuang <hekuang@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1447675815-166222-8-git-send-email-wangnan0@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
[ Enclosed multi-line if/else blocks with {}, (*func_ptr)() -> func_ptr() ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf updates from Thomas Gleixner:
"Mostly updates to the perf tool plus two fixes to the kernel core code:
- Handle tracepoint filters correctly for inherited events (Peter
Zijlstra)
- Prevent a deadlock in perf_lock_task_context (Paul McKenney)
- Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
Melo)
- Print full source file paths when using 'perf annotate --print-line
--full-paths' (Michael Petlan)
- Fix 'perf probe -d' when just one out of uprobes and kprobes is
enabled (Wang Nan)
- Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)
- Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
the 'perf test' LLVM entries, when running it in-tree, to
.gitignore (Yunlong Song)
- libbpf error reporting improvements, using a strerror interface to
more precisely tell the user about problems with the provided
scriptlet, be it in C or as a ready made object file (Wang Nan)
- Do not be case sensitive when searching for matching 'perf test'
entries (Arnaldo Carvalho de Melo)
- Inform the user about objdump failures in 'perf annotate' (Andi
Kleen)
- Improve the LLVM 'perf test' entry, introduce a new ones for BPF
and kbuild tests to check the environment used by clang to compile
.c scriptlets (Wang Nan)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
tools include: Add compiler.h to list.h
perf probe: Verify parameters in two functions
perf session: Add missing newlines to some pr_err() calls
perf annotate: Support full source file paths for srcline fix
perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
perf: Fix inherited events vs. tracepoint filters
perf: Disable IRQs across RCU RS CS that acquires scheduler lock
perf test: Do not be case sensitive when searching for matching tests
perf test: Add 'perf test BPF'
perf test: Enhance the LLVM tests: add kbuild test
perf test: Enhance the LLVM test: update basic BPF test program
perf bpf: Improve BPF related error messages
perf tools: Make fetch_kernel_version() publicly available
bpf tools: Add new API bpf_object__get_kversion()
bpf tools: Improve libbpf error reporting
perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
perf annotate: Inform the user about objdump failures in --stdio
perf stat: Make stat options global
perf sched latency: Fix thread pid reuse issue
...
Pull liblockdep fixes from Sasha Levin:
" ... three fixes for liblockdep. Just keeping up with kernel
code changes and new gcc versions."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It seems that newer gcc complains about lack of explicit declaration for some
of the API we use, add it in.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
These were added to the kernel code in cee34d88c ("lockdep: Fix a race between
/proc/lock_stat and module unload"). There's nothing special we need to do
about them in userspace.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>