There is some reason to believe that Documentation/memory-barriers.txt
could use some help, and a major purpose of this patch is to provide
that help in the form of a design-time tool that can produce all valid
executions of a small fragment of concurrent Linux-kernel code, which is
called a "litmus test". This tool's functionality is roughly similar to
a full state-space search. Please note that this is a design-time tool,
not useful for regression testing. However, we hope that the underlying
Linux-kernel memory model will be incorporated into other tools capable
of analyzing large bodies of code for regression-testing purposes.
The main tool is herd7, together with the linux-kernel.bell,
linux-kernel.cat, linux-kernel.cfg, linux-kernel.def, and lock.cat files
added by this patch. The herd7 executable takes the other files as input,
and all of these files collectively define the Linux-kernel memory memory
model. A brief description of each of these other files is provided
in the README file. Although this tool does have its limitations,
which are documented in the README file, it does improve on the version
reported on in the LWN series (https://lwn.net/Articles/718628/ and
https://lwn.net/Articles/720550/) by supporting locking and arithmetic,
including a much wider variety of read-modify-write atomic operations.
Please note that herd7 is not part of this submission, but is freely
available from http://diy.inria.fr/sources/index.html (and via "git"
at https://github.com/herd/herdtools7).
A second tool is klitmus7, which converts litmus tests to loadable
kernel modules for direct testing. As with herd7, the klitmus7
code is freely available from http://diy.inria.fr/sources/index.html
(and via "git" at https://github.com/herd/herdtools7).
Of course, litmus tests are not always the best way to fully understand a
memory model, so this patch also includes Documentation/explanation.txt,
which describes the memory model in detail. In addition,
Documentation/recipes.txt provides example known-good and known-bad use
cases for those who prefer working by example.
This patch also includes a few sample litmus tests, and a great many
more litmus tests are available at https://github.com/paulmckrcu/litmus.
This patch was the result of a most excellent collaboration founded
by Jade Alglave and also including Alan Stern, Andrea Parri, and Luc
Maranget. For more details on the history of this collaboration, please
refer to the Linux-kernel memory model presentations at 2016 LinuxCon EU,
2016 Kernel Summit, 2016 Linux Plumbers Conference, 2017 linux.conf.au,
or 2017 Linux Plumbers Conference microconference. However, one aspect
of the history does bear repeating due to weak copyright tracking earlier
in this project, which extends back to early 2015. This weakness came
to light in late 2017 after an LKMM presentation by Paul in which an
audience member noted the similarity of some LKMM code to code in early
published papers. This prompted a copyright review.
From Alan Stern:
To say that the model was mine is not entirely accurate.
Pieces of it (especially the Scpv and Atomic axioms) were taken
directly from Jade's models. And of course the Happens-before
and Propagation relations and axioms were heavily based on
Jade and Luc's work, even though they weren't identical to the
earlier versions. Only the RCU portion was completely original.
. . .
One can make a much better case that I wrote the bulk of lock.cat.
However, it was inspired by Luc's earlier version (and still
shares some elements in common), and of course it benefited from
feedback and testing from all members of our group.
The model prior to Alan's was Luc Maranget's. From Luc:
I totally agree on Alan Stern's account of the linux kernel model
genesis. I thank him for his acknowledgments of my participation
to previous model drafts. I'd like to complete Alan Stern's
statement: any bell cat code I have written has its roots in
discussions with Jade Alglave and Paul McKenney. Moreover I
have borrowed cat and bell code written by Jade Alglave freely.
This copyright review therefore resulted in late adds to the copyright
statements of several files.
Discussion of v1 has raised several issues, which we do not believe should
block acceptance given that this level of change will be ongoing, just
as it has been with memory-barriers.txt:
o Under what conditions should ordering provided by pure locking
be seen by CPUs not holding the relevant lock(s)? In particular,
should the message-passing pattern be forbidden?
o Should examples involving C11 release sequences be forbidden?
Note that this C11 is still a moving target for this issue:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html
o Some details of the handling of internal dependencies for atomic
read-modify-write atomic operations are still subject to debate.
o Changes recently accepted into mainline greatly reduce the need
to handle DEC Alpha as a special case. These changes add an
smp_read_barrier_depends() to READ_ONCE(), thus causing Alpha
to respect ordering of dependent reads. If these changes stick,
the memory model can be simplified accordingly.
o Will changes be required to accommodate RISC-V?
Differences from v1:
(http://lkml.kernel.org/r/20171113184031.GA26302@linux.vnet.ibm.com)
o Add SPDX notations to .bell and .cat files, replacing
textual license statements.
o Add reference to upcoming ASPLOS paper to .bell and .cat files.
o Updated identifier names in .bell and .cat files to match those
used in the ASPLOS paper.
o Updates to READMEs and other documentation based on review
feedback.
o Added a memory-ordering cheatsheet.
o Update sigs to new Co-Developed-by and add acks and
reviewed-bys.
o Simplify rules detecting nested RCU read-side critical sections.
o Update copyright statements as noted above.
Co-Developed-by: Alan Stern <stern@rowland.harvard.edu>
Co-Developed-by: Andrea Parri <parri.andrea@gmail.com>
Co-Developed-by: Jade Alglave <j.alglave@ucl.ac.uk>
Co-Developed-by: Luc Maranget <luc.maranget@inria.fr>
Co-Developed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Jade Alglave <j.alglave@ucl.ac.uk>
Signed-off-by: Luc Maranget <luc.maranget@inria.fr>
Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: "Reshetova, Elena" <elena.reshetova@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Akira Yokosawa <akiyks@gmail.com>
Cc: <linux-arch@vger.kernel.org>
mostly revert the previous workaround and make
'dubious pointer arithmetic' test useful again.
Use (ptr - ptr) << const instead of ptr << const to generate large scalar.
The rest stays as before commit 2b36047e78.
Fixes: 2b36047e78 ("selftests/bpf: fix test_align")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel discovered recently I broke TC filter replace (and fixed
it in commit ad9294dbc2 ("bpf: fix cls_bpf on filter replace")).
Add a test to make sure it never happens again.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make netdevsim print a message to the BPF verifier log buffer when a
program is offloaded.
Then use this message in hardware offload selftests to make sure that
using this buffer actually prints the message to the console for
eBPF hardware offload.
The message is appended after the last instruction is processed with the
verifying function from netdevsim. Output looks like the following:
$ tc filter add dev foo ingress bpf obj sample_ret0.o \
sec .text verbose skip_sw
Prog section '.text' loaded (5)!
- Type: 3
- Instructions: 2 (0 over limit)
- License:
Verifier analysis:
0: (b7) r0 = 0
1: (95) exit
[netdevsim] Hello from netdevsim!
processed 2 insns, stack depth 0
"verbose" flag is required to see it in the console since netdevsim does
not throw an error after printing the message.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add checks to test that netlink extack messages are correctly displayed
in some expected error cases for eBPF offload to netdevsim with TC and
XDP.
iproute2 may be built without libmnl support, in which case the extack
messages will not be reported. Try to detect this condition, and when
enountered print a mild warning to the user and skip the extack validation.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug: BPF programs and maps related to sockmaps test exist
in memory even after test_maps ends.
This patch fixes it as a short term workaround (sockmap
kernel side needs real fixing) by empyting sockmaps when
test ends.
Fixes: 6f6d33f3b3 ("bpf: selftests add sockmap tests")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
[ daniel: Note on workaround. ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The test incorrectly doing
mkdir /mnt/cgroup-test-work-dirtest-bpf-based-device-cgroup
instead of
mkdir /mnt/cgroup-test-work-dir/test-bpf-based-device-cgroup
somehow such mkdir succeeds and new directory appears:
/mnt/cgroup-test-work-dir/cgroup-test-work-dirtest-bpf-based-device-cgroup
Later cleanup via nftw("/mnt/cgroup-test-work-dir", ...);
doesn't walk this directory.
"rmdir /mnt/cgroup-test-work-dir" succeeds, but bpf program and
dangling cgroup stays in memory.
That's a separate issue on a cgroup side.
For now fix the test.
Fixes: 37f1ba0909 ("selftests/bpf: add a test for device cgroup controller")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
test_hashmap_walk takes very long time on debug kernel with kasan on.
Reduce the number of iterations in this test without sacrificing
test coverage.
Also add printfs as progress indicator.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 111e6b4531 ("selftests/bpf: make test_verifier run most programs")
enables tools/testing/selftests/bpf/test_verifier unit cases to run
via bpf_prog_test_run command. With the latest code base,
test_verifier had one test case failure:
...
#473/p check deducing bounds from const, 2 FAIL retval 1 != 0
0: (b7) r0 = 1
1: (75) if r0 s>= 0x1 goto pc+1
R0=inv1 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
2: (95) exit
from 1 to 3: R0=inv1 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
3: (d5) if r0 s<= 0x1 goto pc+1
R0=inv1 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
4: (95) exit
from 3 to 5: R0=inv1 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
5: (1f) r1 -= r0
6: (95) exit
processed 7 insns (limit 131072), stack depth 0
...
The test case does not set return value in the test
structure and hence the return value from the prog run
is assumed to be 0. However, the actual return value is 1.
As a result, the test failed. The fix is to correctly set
the return value in the test structure.
Fixes: 111e6b4531 ("selftests/bpf: make test_verifier run most programs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Display the state of the rest of the features (FEATURE_TESTS_EXTRA) on a
'make VF=1' build. These features are detected manually by perf's
Makefile.config so they can't be displayed with the main list, but only
after we're done in Makefile.config.
$ make VF=1
BUILD: Doing 'make -j4' parallel build
Auto-detecting system features:
... dwarf: [ on ]
... dwarf_getlocations: [ on ]
... glibc: [ on ]
... gtk2: [ on ]
SNIP
... timerfd: [ on ]
... sched_getcpu: [ on ]
... sdt: [ on ]
... setns: [ on ]
extra features:
... bionic: [ OFF ]
... compile-32: [ on ]
... compile-x32: [ OFF ]
... cplus-demangle: [ on ]
... hello: [ OFF ]
... libbabeltrace: [ on ]
... liberty: [ on ]
... liberty-z: [ on ]
... libunwind-debug-frame: [ OFF ]
... libunwind-debug-frame-arm: [ OFF ]
... libunwind-debug-frame-aarch64: [ OFF ]
SNIP
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180109092646.GB11520@krava
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
I've meet a strange behavior with these commands on my gentoo box:
1: perf kmem record
2: CTRL-C to stop 1
3: perf report
4: "Enter", "Enter", "Run scripts for all samples",
"event_analyzing_sample".
Then 'perf report' says:
"
No kallsyms or vmlinux with build-id xxxx was found
/lib/modules/4.10.0+/build/vmlinux with build id xxxx not found,
continuing without symbols
".
It is strange because I am sure /lib/modules/4.10.0+/build/vmlinux is
right for perf.data.
After digging, I found out the reason is that "perf report" generates
many open fds, then "script_browse" uses popen to run "perf script"
which run out of open files.
The gentoo box has a small default value for "max open files", 1024.
Yes, "ulimit -n " with a bigger number could fix it, but I think that
using O_CLOEXEC in do_open is a better way.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180115050448.GA20759@udknight
[ Make sure O_CLOEXEC is available in old systems by adding a patch
just before this one, to keep this bisectable in such systems ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
usbip host lists devices attached to vhci_hcd on the same server
when user does attach over localhost or specifies the server as the
remote.
usbip attach -r localhost -b busid
or
usbip attach -r servername (or server IP)
Fix it to check and not list devices that are attached to vhci_hcd.
Cc: stable@vger.kernel.org
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
usbip host binds to devices attached to vhci_hcd on the same server
when user does attach over localhost or specifies the server as the
remote.
usbip attach -r localhost -b busid
or
usbip attach -r servername (or server IP)
Unbind followed by bind works, however device is left in a bad state with
accesses via the attached busid result in errors and system hangs during
shutdown.
Fix it to check and bail out if the device is already attached to vhci_hcd.
Cc: stable@vger.kernel.org
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a selftest to check if endianness is flipped inadvertently to BE
(MSR.LE set to zero) on BE and LE machines when a trap is caught in
transactional mode and load_fp and load_vec are zero, i.e. when MSR.FP
and MSR.VEC are zeroed (disabled).
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a selftest to exercise the powerpc alignment fault handler.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
[mpe: Add 32-bit support to the signal handler]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds --pgfault and --iterations options to mmap_bench test. With
--pgfault we touch every page mapped. This helps in measuring impact in the
page fault path with a patch series.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-01-19
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) bpf array map HW offload, from Jakub.
2) support for bpf_get_next_key() for LPM map, from Yonghong.
3) test_verifier now runs loaded programs, from Alexei.
4) xdp cpumap monitoring, from Jesper.
5) variety of tests, cleanups and small x64 JIT optimization, from Daniel.
6) user space can now retrieve HW JITed program, from Jiong.
Note there is a minor conflict between Russell's arm32 JIT fixes
and removal of bpf_jit_enable variable by Daniel which should
be resolved by keeping Russell's comment and removing that variable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The BPF verifier conflict was some minor contextual issue.
The TUN conflict was less trivial. Cong Wang fixed a memory leak of
tfile->tx_array in 'net'. This is an skb_array. But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add couple of missing test cases for eBPF div/mod by zero to the
new test_verifier prog runtime feature. Also one for an empty prog
and only exit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A test case is added in tools/testing/selftests/bpf/test_lpm_map.c
for MAP_GET_NEXT_KEY command. A four node trie, which
is described in kernel/bpf/lpm_trie.c, is built and the
MAP_GET_NEXT_KEY results are checked.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull networking fixes from David Miller:
1) Fix BPF divides by zero, from Eric Dumazet and Alexei Starovoitov.
2) Reject stores into bpf context via st and xadd, from Daniel
Borkmann.
3) Fix a memory leak in TUN, from Cong Wang.
4) Disable RX aggregation on a specific troublesome configuration of
r8152 in a Dell TB16b dock.
5) Fix sw_ctx leak in tls, from Sabrina Dubroca.
6) Fix program replacement in cls_bpf, from Daniel Borkmann.
7) Fix uninitialized station_info structures in cfg80211, from Johannes
Berg.
8) Fix miscalculation of transport header offset field in flow
dissector, from Eric Dumazet.
9) Fix LPM tree leak on failure in mlxsw driver, from Ido Schimmel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
ibmvnic: Fix IPv6 packet descriptors
ibmvnic: Fix IP offload control buffer
ipv6: don't let tb6_root node share routes with other node
ip6_gre: init dev->mtu and dev->hard_header_len correctly
mlxsw: spectrum_router: Free LPM tree upon failure
flow_dissector: properly cap thoff field
fm10k: mark PM functions as __maybe_unused
cfg80211: fix station info handling bugs
netlink: reset extack earlier in netlink_rcv_skb
can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
bpf: mark dst unknown on inconsistent {s, u}bounds adjustments
bpf: fix cls_bpf on filter replace
Net: ethernet: ti: netcp: Fix inbound ping crash if MTU size is greater than 1500
tls: reset crypto_info when do_tls_setsockopt_tx fails
tls: return -EBUSY if crypto_info is already set
tls: fix sw_ctx leak
net/tls: Only attach to sockets in ESTABLISHED state
net: fs_enet: do not call phy_stop() in interrupts
r8152: disable RX aggregation on Dell TB16 dock
...
Check map device information is reported correctly, and perform
basic map operations. Check device destruction gets rid of the
maps and map allocation failure path by telling netdevsim to
reject map offload via DebugFS.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
to improve test coverage make test_verifier run all successfully loaded
programs on 64-byte zero initialized data.
For clsbpf and xdp it means empty 64-byte packet.
For lwt and socket_filters it's 64-byte packet where skb->data
points after L2.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update tools/include/uapi/linux/bpf.h to bring it in sync with
include/uapi/linux/bpf.h. The listed commits forgot to update it.
Fixes: 02dd3291b2 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
Fixes: f19397a5c6 ("bpf: Add access to snd_cwnd and others in sock_ops")
Fixes: 06ef0ccb5a ("bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There is a static checker warning that "proglen" has an upper bound but
no lower bound. The allocation will just fail harmlessly so it's not a
big deal.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>