Currently, etf expects a socket with SO_TXTIME option set for each packet
it encounters. So, it will drop all other packets. But, in the future
commits we are planning to add functionality where tstamp value will be set
by another qdisc. Also, some packets which are generated from within the
kernel (e.g. ICMP packets) do not have any socket associated with them.
So, this commit adds support for skip_sock_check. When this option is set,
etf will skip checking for a socket and other associated options for all
skbs.
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC hooks allow the application of filters and actions to packets at both
ingress and egress of the network stack. It is possible, with poor
configuration, that this can produce loops whereby an ingress hook calls
a mirred egress action that has an egress hook that redirects back to
the first ingress etc. The TC core classifier protects against loops when
doing reclassifies but there is no protection against a packet looping
between multiple hooks and recursively calling act_mirred. This can lead
to stack overflow panics.
Add a per CPU counter to act_mirred that is incremented for each recursive
call of the action function when processing a packet. If a limit is passed
then the packet is dropped and CPU counter reset.
Note that this patch does not protect against loops in TC datapaths. Its
aim is to prevent stack overflow kernel panics that can be a consequence
of such loops.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TC_ACT_REINSERT return type was added as an in-kernel only option to
allow a packet ingress or egress redirect. This is used to avoid
unnecessary skb clones in situations where they are not required. If a TC
hook returns this code then the packet is 'reinserted' and no skb consume
is carried out as no clone took place.
This return type is only used in act_mirred. Rather than have the reinsert
called from the main datapath, call it directly in act_mirred. Instead of
returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
which tells the caller that the packet has been stolen by another process
and that no consume call is required.
Moving all redirect calls to the act_mirred code is in preparation for
tracking recursion created by act_mirred.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tools such as vpnc try to flush routes when run inside network
namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This
currently does not work because flush is not enabled in non-initial
network namespaces.
Since routes are per network namespace it is safe to enable
/proc/sys/net/ipv4/route/flush in there.
Link: https://github.com/lxc/lxd/issues/4257
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix memleak reported by syzkaller when registering IPVS hooks,
patch from Julian Anastasov.
2) Fix memory leak in start_sync_thread, also from Julian.
3) Fix conntrack deletion via ctnetlink, from Felix Kaechele.
4) Fix reject for ICMP due to incorrect checksum handling, from
He Zhe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the leds documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Since v5.1-rc1, some types of packets do not get unreachable reply with the
following iptables setting. Fox example,
$ iptables -A INPUT -p icmp --icmp-type 8 -j REJECT
$ ping 127.0.0.1 -c 1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
— 127.0.0.1 ping statistics —
1 packets transmitted, 0 received, 100% packet loss, time 0ms
We should have got the following reply from command line, but we did not.
From 127.0.0.1 icmp_seq=1 Destination Port Unreachable
Yi Zhao reported it and narrowed it down to:
7fc3822536 ("netfilter: reject: skip csum verification for protocols that don't support it"),
This is because nf_ip_checksum still expects pseudo-header protocol type 0 for
packets that are of neither TCP or UDP, and thus ICMP packets are mistakenly
treated as TCP/UDP.
This patch corrects the conditions in nf_ip_checksum and all other places that
still call it with protocol 0.
Fixes: 7fc3822536 ("netfilter: reject: skip csum verification for protocols that don't support it")
Reported-by: Yi Zhao <yi.zhao@windriver.com>
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- fix includes for _MAX constants, atomic functions and fwdecls,
by Sven Eckelmann (3 patches)
- shorten multicast tt/tvlv worker spinlock section, by Linus Luessing
- routeable multicast preparations: implement MAC multicast filtering,
by Linus Luessing (2 patches, David Millers comments integrated)
- remove return value checks for debugfs_create, by Greg Kroah-Hartman
- add routable multicast optimizations, by Linus Luessing (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- fix a leaked TVLV handler which wasn't unregistered, by Jeremy Sowden
- fix duplicated OGMs when interfaces are set UP, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case where a record marker was used, xs_sendpages() needs
to return the length of the payload + record marker so that we
operate correctly in the case of a partial transmission.
When the callers check return value, they therefore need to
take into account the record marker length.
Fixes: 06b5fc3ad9 ("Merge tag 'nfs-rdma-for-5.1-1'...")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
As other udp/ip tunnels do, tipc udp media should also have a
lockless dst_cache supported on its tx path.
Here we add dst_cache into udp_replicast to support dst cache
for both rmcast and rcast, and rmcast uses ub->rcast and each
rcast uses its own node in ub->rcast.list.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.
In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.
The aquantia driver conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix ppp_mppe crypto soft dependencies, from Takashi Iawi.
2) Fix TX completion to be finite, from Sergej Benilov.
3) Use register_pernet_device to avoid a dst leak in tipc, from Xin
Long.
4) Double free of TX cleanup in Dirk van der Merwe.
5) Memory leak in packet_set_ring(), from Eric Dumazet.
6) Out of bounds read in qmi_wwan, from Bjørn Mork.
7) Fix iif used in mcast/bcast looped back packets, from Stephen
Suryaputra.
8) Fix neighbour resolution on raw ipv6 sockets, from Nicolas Dichtel.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (25 commits)
af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
sctp: change to hold sk after auth shkey is created successfully
ipv6: fix neighbour resolution with raw socket
ipv6: constify rt6_nexthop()
net: dsa: microchip: Use gpiod_set_value_cansleep()
net: aquantia: fix vlans not working over bridged network
ipv4: reset rt_iif for recirculated mcast/bcast out pkts
team: Always enable vlan tx offload
net/smc: Fix error path in smc_init
net/smc: hold conns_lock before calling smc_lgr_register_conn()
bonding: Always enable vlan tx offload
net/ipv6: Fix misuse of proc_dointvec "skip_notify_on_dev_down"
ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop
qmi_wwan: Fix out-of-bounds read
tipc: check msg->req data len in tipc_nl_compat_bearer_disable
net: macb: do not copy the mac address if NULL
net/packet: fix memory leak in packet_set_ring()
net/tls: fix page double free on TX cleanup
net/sched: cbs: Fix error path of cbs_module_init
tipc: change to use register_pernet_device
...
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
passing them down to the kernel or bypass kernel completely.
BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
kernel returns.
Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, that's why I've added
__cgroup_bpf_prog_array_is_empty that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsocktop and user adding/removing bpf progs from a cgroup.
The return code of the BPF program is handled as follows:
* 0: EPERM
* 1: success, continue with next BPF program in the cgroup chain
v9:
* allow overwriting setsockopt arguments (Alexei Starovoitov):
* use set_fs (same as kernel_setsockopt)
* buffer is always kzalloc'd (no small on-stack buffer)
v8:
* use s32 for optlen (Andrii Nakryiko)
v7:
* return only 0 or 1 (Alexei Starovoitov)
* always run all progs (Alexei Starovoitov)
* use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
(decided to use optval=-1 instead, optval=0 might be a valid input)
* call getsockopt hook after kernel handlers (Alexei Starovoitov)
v6:
* rework cgroup chaining; stop as soon as bpf program returns
0 or 2; see patch with the documentation for the details
* drop Andrii's and Martin's Acked-by (not sure they are comfortable
with the new state of things)
v5:
* skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
v4:
* don't export bpf_sk_fullsock helper (Martin Lau)
* size != sizeof(__u64) for uapi pointers (Martin Lau)
* offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
v3:
* typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
* reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
Nakryiko)
* use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
* use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
* new CG_SOCKOPT_ACCESS macro to wrap repeated parts
v2:
* moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
* aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
* bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
* added [0,2] return code check to verifier (Martin Lau)
* dropped unused buf[64] from the stack (Martin Lau)
* use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
* dropped bpf_target_off from ctx rewrites (Martin Lau)
* use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split. This will also allow a
greater range of subjects to represented.
============
WHY DO THIS?
============
The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.
For SETATTR, this includes actions that are about controlling access to a
key:
(1) Changing a key's ownership.
(2) Changing a key's security information.
(3) Setting a keyring's restriction.
And actions that are about managing a key's lifetime:
(4) Setting an expiry time.
(5) Revoking a key.
and (proposed) managing a key as part of a cache:
(6) Invalidating a key.
Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.
Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission. It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.
As for SEARCH permission, that currently covers:
(1) Finding keys in a keyring tree during a search.
(2) Permitting keyrings to be joined.
(3) Invalidation.
But these don't really belong together either, since these actions really
need to be controlled separately.
Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.
===============
WHAT IS CHANGED
===============
The SETATTR permission is split to create two new permissions:
(1) SET_SECURITY - which allows the key's owner, group and ACL to be
changed and a restriction to be placed on a keyring.
(2) REVOKE - which allows a key to be revoked.
The SEARCH permission is split to create:
(1) SEARCH - which allows a keyring to be search and a key to be found.
(2) JOIN - which allows a keyring to be joined as a session keyring.
(3) INVAL - which allows a key to be invalidated.
The WRITE permission is also split to create:
(1) WRITE - which allows a key's content to be altered and links to be
added, removed and replaced in a keyring.
(2) CLEAR - which allows a keyring to be cleared completely. This is
split out to make it possible to give just this to an administrator.
(3) REVOKE - see above.
Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together. An ACE specifies a subject, such as:
(*) Possessor - permitted to anyone who 'possesses' a key
(*) Owner - permitted to the key owner
(*) Group - permitted to the key group
(*) Everyone - permitted to everyone
Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.
Further subjects may be made available by later patches.
The ACE also specifies a permissions mask. The set of permissions is now:
VIEW Can view the key metadata
READ Can read the key content
WRITE Can update/modify the key content
SEARCH Can find the key by searching/requesting
LINK Can make a link to the key
SET_SECURITY Can change owner, ACL, expiry
INVAL Can invalidate
REVOKE Can revoke
JOIN Can join this keyring
CLEAR Can clear this keyring
The KEYCTL_SETPERM function is then deprecated.
The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.
The KEYCTL_INVALIDATE function then requires INVAL.
The KEYCTL_REVOKE function then requires REVOKE.
The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.
The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.
======================
BACKWARD COMPATIBILITY
======================
To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.
It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.
SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
on if a keyring is being altered.
The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.
It will make the following mappings:
(1) INVAL, JOIN -> SEARCH
(2) SET_SECURITY -> SETATTR
(3) REVOKE -> WRITE if SETATTR isn't already set
(4) CLEAR -> WRITE
Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.
=======
TESTING
=======
This passes the keyutils testsuite for all but a couple of tests:
(1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
if the type doesn't have ->read(). You still can't actually read the
key.
(2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
work as Other has been replaced with Everyone in the ACL.
Signed-off-by: David Howells <dhowells@redhat.com>
Some drivers want to access the data transmitted in order to implement
acceleration features of the NICs. It is also useful in AF_XDP TX flow.
Change the xsk_umem_consume_tx API to return the whole xdp_desc, that
contains the data pointer, length and DMA address, instead of only the
latter two. Adapt the implementation of i40e and ixgbe to this change.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make it possible for the application to determine whether the AF_XDP
socket is running in zero-copy mode. To achieve this, add a new
getsockopt option XDP_OPTIONS that returns flags. The only flag
supported for now is the zero-copy mode indicator.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a function that checks whether the Fill Ring has the specified
amount of descriptors available. It will be useful for mlx5e that wants
to check in advance, whether it can allocate a bulk of RX descriptors,
to get the best performance.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Gateway validation does not need a dst_entry, it only needs the fib
entry to validate the gateway resolution and egress device. So,
convert ip6_nh_lookup_table from ip6_pol_route to fib6_table_lookup
and ip6_route_check_nh to use fib6_lookup over rt6_lookup.
ip6_pol_route is a call to fib6_table_lookup and if successful a call
to fib6_select_path. From there the exception cache is searched for an
entry or a dst_entry is created to return to the caller. The exception
entry is not relevant for gateway validation, so what matters are the
calls to fib6_table_lookup and then fib6_select_path.
Similarly, rt6_lookup can be replaced with a call to fib6_lookup with
RT6_LOOKUP_F_IFACE set in flags. Again, the exception cache search is
not relevant, only the lookup with path selection. The primary difference
in the lookup paths is the use of rt6_select with fib6_lookup versus
rt6_device_match with rt6_lookup. When you remove complexities in the
rt6_select path, e.g.,
1. saddr is not set for gateway validation, so RT6_LOOKUP_F_HAS_SADDR
is not relevant
2. rt6_check_neigh is not called so that removes the RT6_NUD_FAIL_DO_RR
return and round-robin logic.
the code paths are believed to be equivalent for the given use case -
validate the gateway and optionally given the device. Furthermore, it
aligns the validation with onlink code path and the lookup path actually
used for rx and tx.
Adjust the users, ip6_route_check_nh_onlink and ip6_route_check_nh to
handle a fib6_info vs a rt6_info when performing validation checks.
Existing selftests fib-onlink-tests.sh and fib_tests.sh are used to
verify the changes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we not only track the presence of multicast listeners but also
multicast routers we can safely apply group-aware multicast-to-unicast
forwarding to packets with a destination address of scope greater than
link-local as well.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
To be able to apply our group aware multicast optimizations to packets
with a scope greater than link-local we need to not only keep track of
multicast listeners but also multicast routers.
With this patch a node detects the presence of multicast routers on
its segment by checking if
/proc/sys/net/ipv{4,6}/conf/<bat0|br0(bat)>/mc_forwarding is set for one
thing. This option is enabled by multicast routing daemons and needed
for the kernel's multicast routing tables to receive and route packets.
For another thing if a bridge is configured on top of bat0 then the
presence of an IPv6 multicast router behind this bridge is currently
detected by checking for an IPv6 multicast "All Routers Address"
(ff02::2). This should later be replaced by querying the bridge, which
performs proper, RFC4286 compliant Multicast Router Discovery (our
simplified approach includes more hosts than necessary, most notably
not just multicast routers but also unicast ones and is not applicable
for IPv4).
If no multicast router is detected then this is signalized via the new
BATADV_MCAST_WANT_NO_RTR4 and BATADV_MCAST_WANT_NO_RTR6
multicast tvlv flags.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Because we don't care if debugfs works or not, this trickles back a bit
so we can clean things up by making some functions return void instead
of an error value that is never going to fail.
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <a@unstable.cc>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[sven@narfation.org: drop unused variables]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
When a bridge is added on top of bat0 we set the WANT_ALL_UNSNOOPABLES
flag. Which means we sign up for all traffic for ff02::1/128 and
224.0.0.0/24.
When the node itself had IPv6 enabled or joined a group in 224.0.0.0/24
itself then so far this would result in a multicast TT entry which is
redundant to the WANT_ALL_UNSNOOPABLES.
With this patch such redundant TT entries are avoided.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Instead of collecting multicast MAC addresses from the netdev hw mc
list collect a node's multicast listeners from the IP lists and convert
those to MAC addresses.
This allows to exclude addresses of specific scope later. On a
multicast MAC address the IP destination scope is not visible anymore.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
There are common steps when releasing an accepted or unaccepted socket.
Move this code into a common routine.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
secondary address promotion causes infinite loop -- it arranges
for ifa->ifa_next to point back to itself.
Problem is that 'prev_prom' and 'last_prim' might point at the same entry,
so 'last_sec' pointer must be obtained after prev_prom->next update.
Fixes: 2638eb8b50 ("net: ipv4: provide __rcu annotation for ifa_list")
Reported-by: Ran Rozenstein <ranro@mellanox.com>
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When parsing an ethtool_rx_flow_spec, users can specify an ethernet flow
which could contain matches based on the ethernet header, such as the
MAC address, the VLAN tag or the ethertype.
ETHER_FLOW uses the src and dst ethernet addresses, along with the
ethertype as keys. Matches based on the vlan tag are also possible, but
they are specified using the special FLOW_EXT flag.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Pablo Neira Ayuso <pablo@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, its possible for the application to hang
forever in the kernel. This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).
However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.
We can fix this by converting the schedule call to a completion
mechanism. By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.
Tested by myself and the reporter, with good results
Change Notes:
V1->V2:
Enhance the sleep logic to support being interruptible and
allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
V2->V3:
Rearrage the point at which we wait for the completion queue, to
avoid needing to check for ph/skb being null at the end of the loop.
Also move the complete call to the skb destructor to avoid needing to
modify __packet_set_status. Also gate calling complete on
packet_read_pending returning zero to avoid multiple calls to complete.
(Willem de Bruijn)
Move timeo computation within loop, to re-fetch the socket
timeout since we also use the timeo variable to record the return code
from the wait_for_complete call (Neil Horman)
V3->V4:
Willem has requested that the control flow be restored to the
previous state. Doing so lets us eliminate the need for the
po->wait_on_complete flag variable, and lets us get rid of the
packet_next_frame function, but introduces another complexity.
Specifically, but using the packet pending count, we can, if an
applications calls sendmsg multiple times with MSG_DONTWAIT set, each
set of transmitted frames, when complete, will cause
tpacket_destruct_skb to issue a complete call, for which there will
never be a wait_on_completion call. This imbalance will lead to any
future call to wait_for_completion here to return early, when the frames
they sent may not have completed. To correct this, we need to re-init
the completion queue on every call to tpacket_snd before we enter the
loop so as to ensure we wait properly for the frames we send in this
iteration.
Change the timeout and interrupted gotos to out_put rather than
out_status so that we don't try to free a non-existant skb
Clean up some extra newlines (Willem de Bruijn)
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The scenario is the following: the user uses a raw socket to send an ipv6
packet, destinated to a not-connected network, and specify a connected nh.
Here is the corresponding python script to reproduce this scenario:
import socket
IPPROTO_RAW = 255
send_s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, IPPROTO_RAW)
# scapy
# p = IPv6(src='fd00💯:1', dst='fd00:200::fa')/ICMPv6EchoRequest()
# str(p)
req = b'`\x00\x00\x00\x00\x08:@\xfd\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xfd\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfa\x80\x00\x81\xc0\x00\x00\x00\x00'
send_s.sendto(req, ('fd00:175::2', 0, 0, 0))
fd00:175::/64 is a connected route and fd00:200::fa is not a connected
host.
With this scenario, the kernel starts by sending a NS to resolve
fd00:175::2. When it receives the NA, it flushes its queue and try to send
the initial packet. But instead of sending it, it sends another NS to
resolve fd00:200::fa, which obvioulsy fails, thus the packet is dropped. If
the user sends again the packet, it now uses the right nh (fd00:175::2).
The problem is that ip6_dst_lookup_neigh() uses the rt6i_gateway, which is
:: because the associated route is a connected route, thus it uses the dst
addr of the packet. Let's use rt6_nexthop() to choose the right nh.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no functional change in this patch, it only prepares the next one.
rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const
variables.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_default_metrics has all of the metrics initialized to 0, so nothing
will be added to the skb in rtnetlink_put_metrics. Avoid the loop if
metrics is from dst_default_metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create key domain tags for network namespaces and make it possible to
automatically tag keys that are used by networked services (e.g. AF_RXRPC,
AFS, DNS) with the default network namespace if not set by the caller.
This allows keys with the same description but in different namespaces to
coexist within a keyring.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-afs@lists.infradead.org
Add a 'recurse' flag for keyring searches so that the flag can be omitted
and recursion disabled, thereby allowing just the nominated keyring to be
searched and none of the children.
Signed-off-by: David Howells <dhowells@redhat.com>
Multicast or broadcast egress packets have rt_iif set to the oif. These
packets might be recirculated back as input and lookup to the raw
sockets may fail because they are bound to the incoming interface
(skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
returns rt_iif instead of skb_iif. Hence, the lookup fails.
v2: Make it non vrf specific (David Ahern). Reword the changelog to
reflect it.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If register_pernet_subsys success in smc_init,
we should cleanup it in case any other error.
Fixes: 64e28b52c7 (net/smc: add pnet table namespace support")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After smc_lgr_create(), the newly created link group is added
to smc_lgr_list, thus is accessible from other context.
Although link group creation is serialized by
smc_create_lgr_pending, the new link group may still be accessed
concurrently. For example, if ib_device is no longer active,
smc_ib_port_event_work() will call smc_port_terminate(), which
in turn will call __smc_lgr_terminate() on every link group of
this device. So conns_lock is required here.
Signed-off-by: Huaping Zhou <zhp@smail.nju.edu.cn>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clang warns:
In file included from net/xdp/xsk_queue.c:10:
net/xdp/xsk_queue.h:292:2: warning: expression result unused
[-Wunused-value]
WRITE_ONCE(q->ring->producer, q->prod_tail);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:284:6: note: expanded from macro 'WRITE_ONCE'
__u.__val; \
~~~ ^~~~~
1 warning generated.
The q->prod_tail assignment has a comma at the end, not a semi-colon.
Fix that so clang no longer warns and everything works as expected.
Fixes: c497176cb2 ("xsk: add Rx receive functions and poll support")
Link: https://github.com/ClangBuiltLinux/linux/issues/544
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit f8e6089820 ("netfilter: ctnetlink: Resolve conntrack
L3-protocol flush regression") introduced a regression in which deletion
of conntrack entries would fail because the L3 protocol information
is replaced by AF_UNSPEC. As a result the search for the entry to be
deleted would turn up empty due to the tuple used to perform the search
is now different from the tuple used to initially set up the entry.
For flushing the conntrack table we do however want to keep the option
for nfgenmsg->version to have a non-zero value to allow for newer
user-space tools to request treatment under the new behavior. With that
it is possible to independently flush tables for a defined L3 protocol.
This was introduced with the enhancements in in commit 59c08c69c2
("netfilter: ctnetlink: Support L3 protocol-filter on flush").
Older user-space tools will retain the behavior of flushing all tables
regardless of defined L3 protocol.
Fixes: f8e6089820 ("netfilter: ctnetlink: Resolve conntrack L3-protocol flush regression")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Felix Kaechele <felix@kaechele.ca>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We rename the inline function msg_get_wrapped() to the more
comprehensible msg_inner_hdr().
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We increase the allocated headroom for the buffer copies to be
retransmitted. This eliminates the need for the lower stack levels
(UDP/IP/L2) to expand the headroom in order to add their own headers.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit a4dc70d46c ("tipc: extend link reset criteria for stale
packet retransmission") we made link retransmission failure events
dependent on the link tolerance, and not only of the number of failed
retransmission attempts, as we did earlier. This works well. However,
keeping the original, additional criteria of 99 failed retransmissions
is now redundant, and may in some cases lead to failure detection
times in the order of minutes instead of the expected 1.5 sec link
tolerance value.
We now remove this criteria altogether.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/sys/net/ipv6/route/skip_notify_on_dev_down assumes given value to be
0 or 1. Use proc_dointvec_minmax instead of proc_dointvec.
Fixes: 7c6bb7d2fa ("net/ipv6: Add knob to skip DELROUTE message ondevice down")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 19e4e76806 ("ipv4: Fix raw socket lookup for local
traffic"), the dif argument to __raw_v4_lookup() is coming from the
returned value of inet_iif() but the change was done only for the first
lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex.
Fixes: 19e4e76806 ("ipv4: Fix raw socket lookup for local traffic")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve conflict between d2912cb15b ("treewide: Replace GPLv2
boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer
and fe03d47456 ("Update my email address") which updates Jozsef
Kadlecsik's email.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>