Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-03-26
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) introduce bpf_tcp_check_syncookie() helper for XDP and tc, from Lorenz.
2) allow bpf_skb_ecn_set_ce() in tc, from Peter.
3) numerous bpf tc tunneling improvements, from Willem.
4) and other miscellaneous improvements from Adrian, Alan, Daniel, Ivan, Stanislav.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It is hard to say what KEY_SCREEN and KEY_ZOOM mean, but historically DVB
folks have used them to indicate switch to full screen mode. Later, they
converged on using KEY_ZOOM to switch into full screen mode and KEY)SCREEN
to control aspect ratio (see Documentation/media/uapi/rc/rc-tables.rst).
Let's commit to these uses, and define:
- KEY_FULL_SCREEN (and make KEY_ZOOM its alias)
- KEY_ASPECT_RATIO (and make KEY_SCREEN its alias)
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Add structs and definitions needed to implement stateless
decoder for fwht and add I/P-frames QP controls to the
public api.
Signed-off-by: Dafna Hirschfeld <dafna3@gmail.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
With this patch multicast packets with a limited number of destinations
(current default: 16) will be split and transmitted by the originator as
individual unicast transmissions.
Wifi broadcasts with their low bitrate are still a costly undertaking.
In a mesh network this cost multiplies with the overall size of the mesh
network. Therefore using multiple unicast transmissions instead of
broadcast flooding is almost always less burdensome for the mesh
network.
The maximum amount of unicast packets can be configured via the newly
introduced multicast_fanout parameter. If this limit is exceeded
distribution will fall back to classic broadcast flooding.
The multicast-to-unicast conversion is performed on the initial
multicast sender node and counts on a final destination node, mesh-wide
basis (and not next hop, neighbor node basis).
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
All files got a SPDX-License-Identifier with commit 7db7d9f369
("batman-adv: Add SPDX license identifier above copyright header"). All the
required information about the license conditions can be found in
LICENSES/.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.
For GSO packets, the network stack requires certain fields set to
segment packets with tunnel headers. gro_gse_segment depends on
transport and inner mac header, for instance.
Add an option to pass this information.
Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.
Changes
v1->v2:
- document new flags
- BPF_F_ADJ_ROOM_MASK moved
v2->v3:
- BPF_F_ADJ_ROOM_ENCAP_L3_MASK moved
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_skb_adjust_room adjusts gso_size of gso packets to account for the
pushed or popped header room.
This is not allowed with UDP, where gso_size delineates datagrams. Add
an option to avoid these updates and allow this call for datagrams.
It can also be used with TCP, when MSS is known to allow headroom,
e.g., through MSS clamping or route MTU.
Changes v1->v2:
- document flag BPF_F_ADJ_ROOM_FIXED_GSO
- do not expose BPF_F_ADJ_ROOM_MASK through uapi, as it may change.
Link: https://patchwork.ozlabs.org/patch/1052497/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_skb_adjust_room net allows inserting room in an skb.
Existing mode BPF_ADJ_ROOM_NET inserts room after the network header
by pulling the skb, moving the network header forward and zeroing the
new space.
Add new mode BPF_ADJUST_ROOM_MAC that inserts room after the mac
header. This allows inserting tunnel headers in front of the network
header without having to recreate the network header in the original
space, avoiding two copies.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Follow the file conventions of:
- register offsets not indented
- fields within a register indented one space
- field masks use same width as register
- register field values indented an additional space
No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Using bpf_skc_lookup_tcp it's possible to ascertain whether a packet
belongs to a known connection. However, there is one corner case: no
sockets are created if SYN cookies are active. This means that the final
ACK in the 3WHS is misclassified.
Using the helper, we can look up the listening socket via
bpf_skc_lookup_tcp and then check whether a packet is a valid SYN
cookie ACK.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow looking up a sock_common. This gives eBPF programs
access to timewait and request sockets.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In commit f2780d6d74 "tun: Add ioctl() SIOCGSKNS cmd to allow
obtaining net ns of tun device" it was missed that tun may change
its net ns, while net ns of socket remains the same as it was
created initially. SIOCGSKNS returns net ns of socket, so it is
not suitable for obtaining net ns of device.
We may have two tun devices with the same names in two net ns,
and in this case it's not possible to determ, which of them
fd refers to (TUNGETIFF will return the same name).
This patch adds new ioctl() cmd for obtaining net ns of a device.
Reported-by: Harald Albrecht <harald.albrecht@gmx.net>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syscall_get_arch() is required to be implemented on all architectures
in addition to already implemented syscall_get_nr(),
syscall_get_arguments(), syscall_get_error(), and
syscall_get_return_value() functions in order to extend the generic
ptrace API with PTRACE_GET_SYSCALL_INFO request.
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Elvira Khabirova <lineprinter@altlinux.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linux-audit@redhat.com
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This should never have been defined in the arch tree to begin with,
and now uapi/linux/audit.h header is going to use EM_UNICORE
in order to define AUDIT_ARCH_UNICORE which is needed to implement
syscall_get_arch() which in turn is required to extend
the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Elvira Khabirova <lineprinter@altlinux.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linux-audit@redhat.com
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).
This looks like:
int fd = fspick(AT_FDCWD, "/mnt",
FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
At the point of fspick being called, the file descriptor referring to the
filesystem context is in exactly the same state as the one that was created
by fsopen() after fsmount() has been successfully called.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Provide a system call by which a filesystem opened with fsopen() and
configured by a series of fsconfig() calls can have a detached mount object
created for it. This mount object can then be attached to the VFS mount
hierarchy using move_mount() by passing the returned file descriptor as the
from directory fd.
The system call looks like:
int mfd = fsmount(int fsfd, unsigned int flags,
unsigned int attr_flags);
where fsfd is the file descriptor returned by fsopen(). flags can be 0 or
FSMOUNT_CLOEXEC. attr_flags is a bitwise-OR of the following flags:
MOUNT_ATTR_RDONLY Mount read-only
MOUNT_ATTR_NOSUID Ignore suid and sgid bits
MOUNT_ATTR_NODEV Disallow access to device special files
MOUNT_ATTR_NOEXEC Disallow program execution
MOUNT_ATTR__ATIME Setting on how atime should be updated
MOUNT_ATTR_RELATIME - Update atime relative to mtime/ctime
MOUNT_ATTR_NOATIME - Do not update access times
MOUNT_ATTR_STRICTATIME - Always perform atime updates
MOUNT_ATTR_NODIRATIME Do not update directory access times
In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd. If no message is available, ENODATA
will be reported.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
long fsconfig(int fs_fd, unsigned int cmd, const char *key,
const void *value, int aux);
Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.
The following command IDs are proposed:
(*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be
boolean in nature. The key may be prefixed with "no" to invert the
setting. value must be NULL and aux must be 0.
(*) FSCONFIG_SET_STRING: A string value is specified. The parameter can
be expecting boolean, integer, string or take a path. A conversion to
an appropriate type will be attempted (which may include looking up as
a path). value points to a NUL-terminated string and aux must be 0.
(*) FSCONFIG_SET_BINARY: A binary blob is specified. value points to
the blob and aux indicates its size. The parameter must be expecting
a blob.
(*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must
be expecting a path object. value points to a NUL-terminated string
that is the path and aux is a file descriptor at which to start a
relative lookup or AT_FDCWD.
(*) FSCONFIG_SET_PATH_EMPTY: As fsconfig_set_path, but with AT_EMPTY_PATH
implied.
(*) FSCONFIG_SET_FD: An open file descriptor is specified. value must
be NULL and aux indicates the file descriptor.
(*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
(*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take. The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.
Source specification is also done the same way same way, using special keys
"source", "source1", "source2", etc..
[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem. Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.
Example usage:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
fsconfig(fd, fsconfig_set_fd, "journal_fd", "", journal_fd);
fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("afs", FSOPEN_CLOEXEC);
fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("jffs2", FSOPEN_CLOEXEC);
fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle. fsopen() is given the name of the filesystem that will be used:
int mfd = fsopen(const char *fsname, unsigned int flags);
where flags can be 0 or FSOPEN_CLOEXEC.
For example:
sfd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsinfo(sfd, NULL, ...); // query new superblock attributes
mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
sfd = fsopen("afs", -1);
fsconfig(fd, FSCONFIG_SET_STRING, "source",
"#grand.central.org:root.cell", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(sfd, 0, MS_NODEV);
move_mount(mfd, "", sfd, AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:
"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"
Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
The fsopen() syscall creates a mount context and hangs it of the fd that it
returns.
Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.
Note that, for the moment, the caller must have SYS_CAP_ADMIN to use
fsopen().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow to attach an unattached mount tree.
The new system call looks like the following:
int move_mount(int from_dfd, const char *from_path,
int to_dfd, const char *to_path,
unsigned int flags);
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
open_tree(dfd, pathname, flags)
Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within in the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Added support for AES128-CCM based record encryption. AES128-CCM is
similar to AES128-GCM. Both of them have same salt/iv/mac size. The
notable difference between the two is that while invoking AES128-CCM
operation, the salt||nonce (which is passed as IV) has to be prefixed
with a hardcoded value '2'. Further, CCM implementation in kernel
requires IV passed in crypto_aead_request() to be full '16' bytes.
Therefore, the record structure 'struct tls_rec' has been modified to
reserve '16' bytes for IV. This works for both GCM and CCM based cipher.
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, a multicast stream uses either broadcast or replicast as
transmission method, based on the ratio between number of actual
destinations nodes and cluster size.
However, when an L2 interface (e.g., VXLAN) provides pseudo
broadcast support, this becomes very inefficient, as it blindly
replicates multicast packets to all cluster/subnet nodes,
irrespective of whether they host actual target sockets or not.
The TIPC multicast algorithm is able to distinguish real destination
nodes from other nodes, and hence provides a smarter and more
efficient method for transferring multicast messages than
pseudo broadcast can do.
Because of this, we now make it possible for users to force
the broadcast link to permanently switch to using replicast,
irrespective of which capabilities the bearer provides,
or pretend to provide.
Conversely, we also make it possible to force the broadcast link
to always use true broadcast. While maybe less useful in
deployed systems, this may at least be useful for testing the
broadcast algorithm in small clusters.
We retain the current AUTOSELECT ability, i.e., to let the broadcast link
automatically select which algorithm to use, and to switch back and forth
between broadcast and replicast as the ratio between destination
node number and cluster size changes. This remains the default method.
Furthermore, we make it possible to configure the threshold ratio for
such switches. The default ratio is now set to 10%, down from 25% in the
earlier implementation.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
RAS ECC event will combine with GPU reset event, due to
ECC interrupts are caused by uncorrectable error that triggers
GPU reset.
v2: Fix misleading-indentation warning
v3: fix build with CONFIG_HSA_AMD disabled
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Commit b7bb367afa added support for inserting delays in between
individual words within a single SPI transaction. This makes it
accessible from userspace.
WARNING: This delay is silently ignored unless the SPI controller
implements extra support for it. This is similar to how the in-kernel
users handle the other existing property, spi_transfer->word_delay.
Signed-off-by: Jan Kundrát <jan.kundrat@cesnet.cz>
Signed-off-by: Mark Brown <broonie@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2019-03-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.
2) Fix BTF to properly resolve forward-declared enums into their corresponding
full enum definition types during deduplication, from Andrii.
3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.
4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.
5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.
6) Various fixes in BPF helper function documentation in bpf.h UAPI header
used to bpf-helpers(7) man page, from Quentin.
7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation for the BPF spinlock-related helpers to the doc in
bpf.h. I added the constraints and restrictions coming with the use of
spinlocks for BPF: not all of it is directly related to the use of the
helper, but I thought it would be nice for users to find them in the man
page.
This list of restrictions is nearly a verbatim copy of the list in
Alexei's commit log for those helpers.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Another round of minor fixes for the documentation of the BPF helpers
located in the UAPI bpf.h header file. Changes include:
- Moving around description of some helpers, to keep the descriptions in
the same order as helpers are declared (bpf_map_push_elem(), leftover
from commit 90b1023f68 ("bpf: fix documentation for eBPF helpers"),
bpf_rc_keydown(), and bpf_skb_ancestor_cgroup_id()).
- Fixing typos ("contex" -> "context").
- Harmonising return types ("void* " -> "void *", "uint64_t" -> "u64").
- Addition of the "bpf_" prefix to bpf_get_storage().
- Light additions of RST markup on some keywords.
- Empty line deletion between description and return value for
bpf_tcp_sock().
- Edit for the description for bpf_skb_ecn_set_ce() (capital letters,
acronym expansion, no effect if ECT not set, more details on return
value).
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
which returns a bpf_sock in TCP_LISTEN state. It will trace back to
the listener sk from a request_sock if possible. It returns NULL
for all other cases.
No reference is taken because the helper ensures the sk is
in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
Hence, bpf_sk_release() is unnecessary and the verifier does not
allow bpf_sk_release(listen_sk) to be called either.
The following is also allowed because the bpf_prog is run under
rcu_read_lock():
sk = bpf_sk_lookup_tcp();
/* if (!sk) { ... } */
listen_sk = bpf_get_listener_sock(sk);
/* if (!listen_sk) { ... } */
bpf_sk_release(sk);
src_port = listen_sk->src_port; /* Allowed */
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in -next since before the merge window
opened, with no known collisions / issues reported.
The only detail worth noting, outside the summary below, is that the
"libnvdimm-start-pad" topic has been truncated to just cleanups and
small fixes. The full topic branch would have doubled down on hacks
around the "section alignment" limitation of the core-mm, instead
effort is now being spent to address that root issue in the memory
hotplug implementation for v5.2.
- Fix nfit-bus command submission regression
- Support retrieval of short-ARS results if the ARS state is
"requires continuation", and even if the "no_init_ars" module
parameter is specified
- Allow busy-polling of the kernel ARS state by allowing root to
reset the exponential back-off timer
- Filter potentially stale ARS results by tracking query-ARS relative
to the previous start-ARS
- Enhance dax_device alignment checks
- Add support for the Hyper-V family of device-specific-methods
(DSMs)
- Add several fixes and workarounds for Hyper-V compatibility
- Fix support to cache the dirty-shutdown-count at init"
* tag 'libnvdimm-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (25 commits)
libnvdimm/namespace: Clean up holder_class_store()
libnvdimm/of_pmem: Fix platform_no_drv_owner.cocci warnings
acpi/nfit: Update NFIT flags error message
libnvdimm/btt: Fix LBA masking during 'free list' population
libnvdimm/btt: Remove unnecessary code in btt_freelist_init
libnvdimm/pfn: Remove dax_label_reserve
dax: Check the end of the block-device capacity with dax_direct_access()
nfit/ars: Avoid stale ARS results
nfit/ars: Allow root to busy-poll the ARS state machine
nfit/ars: Introduce scrub_flags
nfit/ars: Remove ars_start_flags
nfit/ars: Attempt short-ARS even in the no_init_ars case
nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
acpi/nfit: Require opt-in for read-only label configurations
libnvdimm/pmem: Honor force_raw for legacy pmem regions
libnvdimm/pfn: Account for PAGE_SIZE > info-block-size in nd_pfn_init()
libnvdimm: Fix altmap reservation size calculation
libnvdimm, pfn: Fix over-trim in trim_pfn_device()
acpi/nfit: Fix bus command validation
libnvdimm/dimm: Add a no-BLK quirk based on NVDIMM family
...
Pull fuse updates from Miklos Szeredi:
"Scalability and performance improvements, as well as minor bug fixes
and cleanups"
* tag 'fuse-update-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: cache readdir calls if filesystem opts out of opendir
fuse: support clients that don't implement 'opendir'
fuse: lift bad inode checks into callers
fuse: multiplex cached/direct_io file operations
fuse add copy_file_range to direct io fops
fuse: use iov_iter based generic splice helpers
fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
fuse: use atomic64_t for khctr
fuse: clean up aborted
fuse: Protect ff->reserved_req via corresponding fi->lock
fuse: Protect fi->nlookup with fi->lock
fuse: Introduce fi->lock to protect write related fields
fuse: Convert fc->attr_version into atomic64_t
fuse: Add fuse_inode argument to fuse_prepare_release()
fuse: Verify userspace asks to requeue interrupt that we really sent
fuse: Do some refactoring in fuse_dev_do_write()
fuse: Wake up req->waitq of only if not background
fuse: Optimize request_end() by not taking fiq->waitq.lock
fuse: Kill fasync only if interrupt is queued in queue_interrupt()
fuse: Remove stale comment in end_requests()
...
Merge several updates to the ARS implementation. Highlights include:
* Support retrieval of short-ARS results if the ARS state is "requires
continuation", and even if the "no_init_ars" module parameter is
specified.
* Allow busy-polling of the kernel ARS state by allowing root to reset
the exponential back-off timer.
* Filter potentially stale ARS results by tracking query-ARS relative to
the previous start-ARS.
Merge miscellaneous libnvdimm sub-system updates for v5.1. Highlights
include:
* Support for the Hyper-V family of device-specific-methods (DSMs)
* Several fixes and workarounds for Hyper-V compatibility.
* Fix for the support to cache the dirty-shutdown-count at init.
Pull IOMMU updates from Joerg Roedel:
- A big cleanup and optimization patch-set for the Tegra GART driver
- Documentation updates and fixes for the IOMMU-API
- Support for page request in Intel VT-d scalable mode
- Intel VT-d dma_[un]map_resource() support
- Updates to the ATS enabling code for PCI (acked by Bjorn) and Intel
VT-d to align with the latest version of the ATS spec
- Relaxed IRQ source checking in the Intel VT-d driver for some aliased
devices, needed for future devices which send IRQ messages from more
than on request-ID
- IRQ remapping driver for Hyper-V
- Patches to make generic IOVA and IO-Page-Table code usable outside of
the IOMMU code
- Various other small fixes and cleanups
* tag 'iommu-updates-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (60 commits)
iommu/vt-d: Get domain ID before clear pasid entry
iommu/vt-d: Fix NULL pointer reference in intel_svm_bind_mm()
iommu/vt-d: Set context field after value initialized
iommu/vt-d: Disable ATS support on untrusted devices
iommu/mediatek: Fix semicolon code style issue
MAINTAINERS: Add Hyper-V IOMMU driver into Hyper-V CORE AND DRIVERS scope
iommu/hyper-v: Add Hyper-V stub IOMMU driver
x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available
PCI/ATS: Add inline to pci_prg_resp_pasid_required()
iommu/vt-d: Check identity map for hot-added devices
iommu: Fix IOMMU debugfs fallout
iommu: Document iommu_ops.is_attach_deferred()
iommu: Document iommu_ops.iotlb_sync_map()
iommu/vt-d: Enable ATS only if the device uses page aligned address.
PCI/ATS: Add pci_ats_page_aligned() interface
iommu/vt-d: Fix PRI/PASID dependency issue.
PCI/ATS: Add pci_prg_resp_pasid_required() interface.
iommu/vt-d: Allow interrupts from the entire bus for aliased devices
iommu/vt-d: Add helper to set an IRTE to verify only the bus number
iommu: Fix flush_tlb_all typo
...
Pull arm64 updates from Catalin Marinas:
- Pseudo NMI support for arm64 using GICv3 interrupt priorities
- uaccess macros clean-up (unsafe user accessors also merged but
reverted, waiting for objtool support on arm64)
- ptrace regsets for Pointer Authentication (ARMv8.3) key management
- inX() ordering w.r.t. delay() on arm64 and riscv (acks in place by
the riscv maintainers)
- arm64/perf updates: PMU bindings converted to json-schema, unused
variable and misleading comment removed
- arm64/debug fixes to ensure checking of the triggering exception
level and to avoid the propagation of the UNKNOWN FAR value into the
si_code for debug signals
- Workaround for Fujitsu A64FX erratum 010001
- lib/raid6 ARM NEON optimisations
- NR_CPUS now defaults to 256 on arm64
- Minor clean-ups (documentation/comments, Kconfig warning, unused
asm-offsets, clang warnings)
- MAINTAINERS update for list information to the ARM64 ACPI entry
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
arm64: mmu: drop paging_init comments
arm64: debug: Ensure debug handlers check triggering exception level
arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
Revert "arm64: uaccess: Implement unsafe accessors"
arm64: avoid clang warning about self-assignment
arm64: Kconfig.platforms: fix warning unmet direct dependencies
lib/raid6: arm: optimize away a mask operation in NEON recovery routine
lib/raid6: use vdupq_n_u8 to avoid endianness warnings
arm64: io: Hook up __io_par() for inX() ordering
riscv: io: Update __io_[p]ar() macros to take an argument
asm-generic/io: Pass result of I/O accessor to __io_[p]ar()
arm64: Add workaround for Fujitsu A64FX erratum 010001
arm64: Rename get_thread_info()
arm64: Remove documentation about TIF_USEDFPU
arm64: irqflags: Fix clang build warnings
arm64: Enable the support of pseudo-NMIs
arm64: Skip irqflags tracing for NMI in IRQs disabled context
arm64: Skip preemption when exiting an NMI
arm64: Handle serror in NMI context
irqchip/gic-v3: Allow interrupts to be set as pseudo-NMI
...
Pull device mapper updates from Mike Snitzer:
- Update bio-based DM core to always call blk_queue_split() and update
DM targets to properly advertise discard limits that
blk_queue_split() looks at when dtermining to split discard. Whereby
allowing DM core's own 'split_discard_bios' to be removed.
- Improve DM cache target to provide support for discard passdown to
the origin device.
- Introduce support to directly boot to a DM mapped device from init by
using dm-mod.create= module param. This eliminates the need for an
elaborate initramfs that is otherwise needed to create DM devices.
This feature's implementation has been worked on for quite some time
(got up to v12) and is of particular interest to Android and other
more embedded platforms (e.g. ARM).
- Rate limit errors from the DM integrity target that were identified
as the cause for recent NMI hangs due to console limitations.
- Add sanity checks for user input to thin-pool and external snapshot
creation.
- Remove some unused leftover kmem caches from when old .request_fn
request-based support was removed.
- Various small cleanups and fixes to targets (e.g. typos, needless
unlikely() annotations, use struct_size(), remove needless
.direct_access method from dm-snapshot)
* tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm integrity: limit the rate of error messages
dm snapshot: don't define direct_access if we don't support it
dm cache: add support for discard passdown to the origin device
dm writecache: fix typo in name for writeback_wq
dm: add support to directly boot to a mapped device
dm thin: add sanity checks to thin-pool and external snapshot creation
dm block manager: remove redundant unlikely annotation
dm verity fec: remove redundant unlikely annotation
dm integrity: remove redundant unlikely annotation
dm: always call blk_queue_split() in dm_process_bio()
dm: fix to_sector() for 32bit
dm switch: use struct_size() in kzalloc()
dm: remove unused _rq_tio_cache and _rq_cache
dm: eliminate 'split_discard_bios' flag from DM target interface
dm: update dm_process_bio() to split bio if in ->make_request_fn()
Pull media updates from Mauro Carvalho Chehab:
- remove sensor drivers that got converted from soc_camera
- remaining soc_camera drivers got moved to staging
- some documentation cleanups and improvements
- the imx staging driver now supports imx7
- the ov9640, mt9m001 and mt9m111 got converted from soc_camera
- the vim2m driver now does what a m2m convert driver expects to do
- epoll() fixes on media subsystems
- several drivers fixes, typos, cleanups and improvements
* tag 'media/v5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (346 commits)
media: dvb/earth-pt1: fix wrong initialization for demod blocks
media: vim2m: Address some coding style issues
media: vim2m: don't use BUG()
media: vim2m: speedup passthrough copy
media: vim2m: add an horizontal scaler
media: vim2m: don't accept YUYV anymore as output format
media: vim2m: add vertical linear scaler
media: vim2m: better handle cap/out buffers with different sizes
media: vim2m: use different framesizes for bayer formats
media: vim2m: add support for VIDIOC_ENUM_FRAMESIZES
media: vim2m: ensure that width is multiple of two
media: vim2m: improve debug messages
media: vim2m: add bayer capture formats
media: a few more typos at staging, pci, platform, radio and usb
media: Documentation: fix several typos
media: staging: fix several typos
media: include: fix several typos
media: common: fix several typos
media: v4l2-core: fix several typos
media: usb: fix several typos
...
Pull io_uring IO interface from Jens Axboe:
"Second attempt at adding the io_uring interface.
Since the first one, we've added basic unit testing of the three
system calls, that resides in liburing like the other unit tests that
we have so far. It'll take a while to get full coverage of it, but
we're working towards it. I've also added two basic test programs to
tools/io_uring. One uses the raw interface and has support for all the
various features that io_uring supports outside of standard IO, like
fixed files, fixed IO buffers, and polled IO. The other uses the
liburing API, and is a simplified version of cp(1).
This adds support for a new IO interface, io_uring.
io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:
https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/
Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.
Outside of basic IO features, it supports async polled IO as well.
This particular feature has already been tested at Facebook months ago
for flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees
a nice win in terms of efficiency, latency, and performance. These
boxes were IOPS bound before, now they are not.
This series adds three new system calls. One for setting up an
io_uring instance (io_uring_setup(2)), one for submitting/completing
IO (io_uring_enter(2)), and one for aux functions like registrating
file sets, buffers, etc (io_uring_register(2)). Through the help of
Arnd, I've coordinated the syscall numbers so merge on that front
should be painless.
Jon did a writeup of the interface a while back, which (except for
minor details that have been tweaked) is still accurate. Find that
here:
https://lwn.net/Articles/776703/
Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.
There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:
git://git.kernel.dk/liburing
Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface"
* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
io_uring: add a few test tools
io_uring: allow workqueue item to handle multiple buffered requests
io_uring: add support for IORING_OP_POLL
io_uring: add io_kiocb ref count
io_uring: add submission polling
io_uring: add file set registration
net: split out functions related to registering inflight socket files
io_uring: add support for pre-mapped user IO buffers
block: implement bio helper to add iter bvec pages to bio
io_uring: batch io_kiocb allocation
io_uring: use fget/fput_many() for file references
fs: add fget_many() and fput_many()
io_uring: support for IO polling
io_uring: add fsync support
Add io_uring IO interface
Pull block layer updates from Jens Axboe:
"Not a huge amount of changes in this round, the biggest one is that we
finally have Mings multi-page bvec support merged. Apart from that,
this pull request contains:
- Small series that avoids quiescing the queue for sysfs changes that
match what we currently have (Aleksei)
- Series of bcache fixes (via Coly)
- Series of lightnvm fixes (via Mathias)
- NVMe pull request from Christoph. Nothing major, just SPDX/license
cleanups, RR mp policy (Hannes), and little fixes (Bart,
Chaitanya).
- BFQ series (Paolo)
- Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
for the fast path (Jianchao)
- fops->iopoll() added for async IO polling, this is a feature that
the upcoming io_uring interface will use (Christoph, me)
- Partition scan loop fixes (Dongli)
- mtip32xx conversion from managed resource API (Christoph)
- cdrom registration race fix (Guenter)
- MD pull from Song, two minor fixes.
- Various documentation fixes (Marcos)
- Multi-page bvec feature. This brings a lot of nice improvements
with it, like more efficient splitting, larger IOs can be supported
without growing the bvec table size, and so on. (Ming)
- Various little fixes to core and drivers"
* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
block: fix updating bio's front segment size
block: Replace function name in string with __func__
nbd: propagate genlmsg_reply return code
floppy: remove set but not used variable 'q'
null_blk: fix checking for REQ_FUA
block: fix NULL pointer dereference in register_disk
fs: fix guard_bio_eod to check for real EOD errors
blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
block: optimize bvec iteration in bvec_iter_advance
block: introduce mp_bvec_for_each_page() for iterating over page
block: optimize blk_bio_segment_split for single-page bvec
block: optimize __blk_segment_map_sg() for single-page bvec
block: introduce bvec_nth_page()
iomap: wire up the iopoll method
block: add bio_set_polled() helper
block: wire up block device iopoll method
fs: add an iopoll method to struct file_operations
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
loop: do not print warn message if partition scan is successful
block: bounce: make sure that bvec table is updated
...
Merge more updates from Andrew Morton:
- some of the rest of MM
- various misc things
- dynamic-debug updates
- checkpatch
- some epoll speedups
- autofs
- rapidio
- lib/, lib/lzo/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...