This patch is to implement sender-side procedures for the Outgoing
and Incoming SSN Reset Request Parameter described in rfc6525 section
5.1.2 and 5.1.3.
It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2
for users.
Note that the new asoc member strreset_outstanding is to make sure
only one reconf request chunk on the fly as rfc6525 section 5.1.1
demands.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set
strreset_enable to indicate which reconf request type it supports,
which is described in rfc6525 section 6.3.1.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add reconf_enable field in all of asoc ep and netns
to indicate if they support stream reset.
When initializing, asoc reconf_enable get the default value from ep
reconf_enable which is from netns netns reconf_enable by default.
It is also to add reconf_capable in asoc peer part to know if peer
supports reconf_enable, the value is set if ext params have reconf
chunk support when processing init chunk, just as rfc6525 section
5.1.1 demands.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add a primitive based on sctp primitive frame for
sending stream reconf request. It works as the other primitives,
and create a SCTP_CMD_REPLY command to send the request chunk out.
sctp_primitive_RECONF would be the api to send a reconf request
chunk.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add a per transport timer based on sctp timer frame
for stream reconf chunk retransmission. It would start after sending
a reconf request chunk, and stop after receiving the response chunk.
If the timer expires, besides retransmitting the reconf request chunk,
it would also do the same thing with data RTO timer. like to increase
the appropriate error counts, and perform threshold management, possibly
destroying the asoc if sctp retransmission thresholds are exceeded, just
as section 5.1.1 describes.
This patch is also to add asoc strreset_chunk, it is used to save the
reconf request chunk, so that it can be retransmitted, and to check if
the response is really for this request by comparing the information
inside with the response chunk as well.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add asoc strreset_outseq and strreset_inseq for
saving the reconf request sequence, initialize them when create
assoc and process init, and also to define Incoming and Outgoing
SSN Reset Request Parameter described in rfc6525 section 4.1 and
4.2, As they can be in one same chunk as section rfc6525 3.1-3
describes, it makes them in one function.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After adding the following nft rule, then ping 224.0.0.1:
# nft add rule netdev t c pkttype host counter
The warning complain message will be printed out again and again:
WARNING: CPU: 0 PID: 10182 at net/netfilter/nft_meta.c:163 \
nft_meta_get_eval+0x3fe/0x460 [nft_meta]
[...]
Call Trace:
<IRQ>
dump_stack+0x85/0xc2
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
nft_meta_get_eval+0x3fe/0x460 [nft_meta]
nft_do_chain+0xff/0x5e0 [nf_tables]
So we should deal with PACKET_LOOPBACK in netdev family too. For ipv4,
convert it to PACKET_BROADCAST/MULTICAST according to the destination
address's type; For ipv6, convert it to PACKET_MULTICAST directly.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since there's no broadcast address in IPV6, so in ipv6 family, the
PACKET_LOOPBACK must be multicast packets, there's no need to check
it again.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
never set it again. Which means that in the future if we end up adding a bunch
of reuseport sk's to that tb we'll have to do the expensive scan every time.
Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
so we know what comparison to make, and the ipv6 only setting so we can make
sure to compare with new sockets appropriately. Once one sk has made it onto
the list we know that there are no potential bind conflicts on the owners list
that match that sk's rcv_addr. So copy the sk's information into our bind
bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
an extra check for subsequent reuseport sockets and skip the expensive bind
conflict check.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_get_port does two different things, it either scans for an open port,
or it tries to see if the specified port is available for use. Since these two
operations have different rules and are basically independent lets split them
into two different functions to make them both more readable.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just wasted time, we've already found a tb that doesn't have a bind
conflict, and we don't drop the head lock so scanning again isn't going to give
us a different answer. Instead move the tb->reuse setting logic outside of the
found_tb path and put it in the success: path. Then make it so that we don't
goto again if we find a bind conflict in the found_tb path as we won't reach
this anymore when we are scanning for an ephemeral port.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's. However if we get to the logic
if (smallest_size != -1) {
port = smallest_port;
goto have_port;
}
we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to found_tb and succeeded. Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work just
delete this code and save us the time.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
is how they check the rcv_saddr, so delete this call back and simply
change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We pass these per-protocol equal functions around in various places, but
we can just have one function that checks the sk->sk_family and then do
the right comparison function. I've also changed the ipv4 version to
not cast to inet_sock since it is unneeded.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
... otherwise the crypto stack will align it for us with a GFP_ATOMIC
allocation and a memcpy() -- see skcipher_walk_first().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.
This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.
A packetdrill test demonstrating the behavior:
// testing mac osx rst behavior
// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4
// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768
// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768
// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The inet_num is u16, so use %hu instead of casting it to int. And
the sk_bound_dev_if is int actually, so it needn't cast to int.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed type of csum field in struct igmpv3_query from __be16 to
__sum16 to eliminate type warning, made same change in struct
igmpv3_report for consistency.
Fixed up an ntohs() where htons() should have been used instead.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
RFC2863 and RFC3813 are used for guidance for which stats to expose
and the semantics of them. In particular rx_noroutes maps to in
unknown protos in RFC2863. The stats are exposed to userspace via
AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
RTM_GETSTATS messages.
All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more expensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.
Only unicast and non-fragment are exposed at the moment, but other
counters can be exposed in the future either by adding to the end of
struct mpls_link_stats or by additional netlink attributes in the
AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the functionality for including address-family-specific per-link
stats in RTM_GETSTATS messages. This is done through adding a new
IFLA_STATS_AF_SPEC attribute under which address family attributes are
nested and then the AF-specific attributes can be further nested. This
follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
advantage of presenting an easily extended hierarchy. The rtnl_af_ops
structure is extended to provide AFs with the opportunity to fill and
provide the size of their stats attributes.
One alternative would have been to provide AFs with the ability to add
attributes directly into the RTM_GETSTATS message without a nested
hierarchy. I discounted this approach as it increases the rate at
which the 32 attribute number space is used up and it makes
implementation a little more tricky for stats dump resuming (at the
moment the order in which attributes are added to the message has to
match the numeric order of the attributes).
Another alternative would have been to register per-AF RTM_GETSTATS
handlers. I discounted this approach as I perceived a common use-case
to be getting all the stats for an interface and this approach would
necessitate multiple requests/dumps to retrieve them all.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Handle multicast packets properly in fast-RX path of mac80211, from
Johannes Berg.
2) Because of a logic bug, the user can't actually force SW
checksumming on r8152 devices. This makes diagnosis of hw
checksumming bugs really annoying. Fix from Hayes Wang.
3) VXLAN route lookup does not take the source and destination ports
into account, which means IPSEC policies cannot be matched properly.
Fix from Martynas Pumputis.
4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.
5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.
6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
construction properly in fib_dump_info(), from David Ahern.
7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
Schmidt.
8) Openvswitch conntack actions need to maintain a correct checksum,
fix from Lance Richardson.
9) ax25_disconnect() is missing a check for ax25->sk being NULL, in
fact it already checks this, but not in all of the necessary spots.
Fix from Basil Gunn.
10) Action GET operations in the packet scheduler can erroneously bump
the reference count of the entry, making it unreleasable. Fix from
Jamal Hadi Salim. Jamal gives a great set of example command lines
that trigger this in the commit message.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
net sched actions: fix refcnt when GETing of action after bind
net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
net/mlx4_core: Fix racy CQ (Completion Queue) free
net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
net/mlx5e: Fix a -Wmaybe-uninitialized warning
ax25: Fix segfault after sock connection timeout
bpf: rework prog_digest into prog_tag
tipc: allocate user memory with GFP_KERNEL flag
net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
ip6_tunnel: Account for tunnel header in tunnel MTU
mld: do not remove mld souce list info when set link down
be2net: fix MAC addr setting on privileged BE3 VFs
be2net: don't delete MAC on close on unprivileged BE3 VFs
be2net: fix status check in be_cmd_pmac_add()
cpmac: remove hopeless #warning
ravb: do not use zero-length alignment DMA descriptor
mlx4: do not call napi_schedule() without care
openvswitch: maintain correct checksum state in conntrack actions
tcp: fix tcp_fastopen unaligned access complaints on sparc
...
We need to setup the trailer in two different cases,
so add a helper to avoid code duplication.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch tries to avoid skb_cow_data on esp6.
On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.
On the decrypt side, we leave the buffer as it is
if it is not cloned.
With this, we can avoid a linearization of the buffer
in most of the cases.
Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch tries to avoid skb_cow_data on esp4.
On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.
On the decrypt side, we leave the buffer as it is
if it is not cloned.
With this, we can avoid a linearization of the buffer
in most of the cases.
Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Demonstrating the issue:
.. add a drop action
$sudo $TC actions add action drop index 10
.. retrieve it
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 0 installed 29 sec used 29 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
... bug 1 above: reference is two.
Reference is actually 1 but we forget to subtract 1.
... do a GET again and we see the same issue
try a few times and nothing changes
~$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 0 installed 31 sec used 31 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
... lets try to bind the action to a filter..
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
... and now a few GETs:
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 3 bind 1 installed 204 sec used 204 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 4 bind 1 installed 206 sec used 206 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 5 bind 1 installed 235 sec used 235 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
.... as can be observed the reference count keeps going up.
After the fix
$ sudo $TC actions add action drop index 10
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 1 bind 0 installed 4 sec used 4 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 1 bind 0 installed 6 sec used 6 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC qdisc add dev lo ingress
$ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 1 installed 32 sec used 32 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
$ sudo $TC -s actions get action gact index 10
action order 1: gact action drop
random type none pass val 0
index 10 ref 2 bind 1 installed 33 sec used 33 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Fixes: aecc5cefc3 ("net sched actions: fix GETing actions")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flower currently allows having the same filter twice with the same
priority. Actions (and statistics update) will always execute on the
first inserted rule leaving the second rule unused.
This patch disallows that.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ax.25 socket connection timed out & the sock struct has been
previously taken down ie. sock struct is now a NULL pointer. Checking
the sock_flag causes the segfault. Check if the socket struct pointer
is NULL before checking sock_flag. This segfault is seen in
timed out netrom connections.
Please submit to -stable.
Signed-off-by: Basil Gunn <basil@pacabunga.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7bd509e311 ("bpf: add prog_digest and expose it via
fdinfo/netlink") was recently discussed, partially due to
admittedly suboptimal name of "prog_digest" in combination
with sha1 hash usage, thus inevitably and rightfully concerns
about its security in terms of collision resistance were
raised with regards to use-cases.
The intended use cases are for debugging resp. introspection
only for providing a stable "tag" over the instruction sequence
that both kernel and user space can calculate independently.
It's not usable at all for making a security relevant decision.
So collisions where two different instruction sequences generate
the same tag can happen, but ideally at a rather low rate. The
"tag" will be dumped in hex and is short enough to introspect
in tracepoints or kallsyms output along with other data such
as stack trace, etc. Thus, this patch performs a rename into
prog_tag and truncates the tag to a short output (64 bits) to
make it obvious it's not collision-free.
Should in future a hash or facility be needed with a security
relevant focus, then we can think about requirements, constraints,
etc that would fit to that situation. For now, rework the exposed
parts for the current use cases as long as nothing has been
released yet. Tested on x86_64 and s390x.
Fixes: 7bd509e311 ("bpf: add prog_digest and expose it via fdinfo/netlink")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_frag_point() doesn't store anything, and thus just calling it
cannot do anything useful.
sctp_apply_peer_addr_params is only called by
sctp_setsockopt_peer_addr_params. When operating on an asoc,
sctp_setsockopt_peer_addr_params will call sctp_apply_peer_addr_params
once for the asoc, and then once for each transport this asoc has,
meaning that the frag_point will be recomputed when updating the
transports and calling it when updating the asoc is not necessary.
IOW, no action is needed here and we can remove this call.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
arp is being checked instead of arp_eth to see if the call to
__skb_header_pointer failed. Fix this by checking arp_eth is
null instead of arp. Also fix to use length hlen rather than
hlen - sizeof(_arp); thanks to Eric Dumazet for spotting
this latter issue.
CoverityScan CID#1396428 ("Logically dead code") on 2nd
arp comparison (which should be arp_eth instead).
Fixes: commit 55733350e5 ("flow disector: ARP support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit d35c99ff77 ("netlink: do not enter direct reclaim from
netlink_dump()") we made sure to not trigger expensive memory reclaim.
Problem is that a bit later, netlink_trim() might be called and
trigger memory reclaim.
netlink_trim() should be best effort, and really as fast as possible.
Under memory pressure, it is fine to not trim this skb.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, we allocate memory always with GFP_ATOMIC flag.
When the system is under memory pressure and a user tries to send,
the send fails due to low memory. However, the user application
can wait for free memory if we allocate it using GFP_KERNEL flag.
In this commit, we use allocate memory with GFP_KERNEL for all user
allocation.
Reported-by: Rune Torgersen <runet@innovsys.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With ip6gre we have a tunnel header which also makes the tunnel MTU
smaller. We need to reserve room for it. Previously we were using up
space reserved for the Tunnel Encapsulation Limit option
header (RFC 2473).
Also, after commit b05229f442 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") our contract with the caller has
changed. Now we check if the packet length exceeds the tunnel MTU after
the tunnel header has been pushed, unlike before.
This is reflected in the check where we look at the packet length minus
the size of the tunnel header, which is already accounted for in tunnel
MTU.
Fixes: b05229f442 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an IPv6 version of commit 24803f38a5 ("igmp: do not remove igmp
souce list..."). In mld_del_delrec(), we will restore back all source filter
info instead of flush them.
Move mld_clear_delrec() from ipv6_mc_down() to ipv6_mc_destroy_dev() since
we should not remove source list info when set link down. Remove
igmp6_group_dropped() in ipv6_mc_destroy_dev() since we have called it in
ipv6_mc_down().
Also clear all source info after igmp6_group_dropped() instead of in it
because ipv6_mc_down() will call igmp6_group_dropped().
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull nfsd fixes from Bruce Fields:
"Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
older bugs"
* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: avoid duplicate dma unmapping during error recovery
sunrpc: don't call sleeping functions from the notifier block callbacks
svcrpc: don't leak contexts on PROC_DESTROY
nfsd: fix supported attributes for acl & labels
The NF_CONNTRACK Kconfig option description makes an incorrect reference
to the "meta" expression where the "ct" expression would be correct.This
patch fixes the respective typographical error.
Fixes: d497c63527 ("netfilter: add help information to new nf_tables Kconfig options")
Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When dumping nft stateful objects, if NFTA_OBJ_TABLE and NFTA_OBJ_TYPE
attributes are not specified either, filter will become NULL, so oops
will happen(actually nft utility will always set NFTA_OBJ_TABLE attr,
so I write a test program to make this happen):
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: nf_tables_dump_obj+0x17c/0x330 [nf_tables]
[...]
Call Trace:
? nf_tables_dump_obj+0x5/0x330 [nf_tables]
? __kmalloc_reserve.isra.35+0x31/0x90
? __alloc_skb+0x5b/0x1e0
netlink_dump+0x124/0x2a0
__netlink_dump_start+0x161/0x190
nf_tables_getobj+0xe8/0x280 [nf_tables]
Fixes: a9fea2a3c3 ("netfilter: nf_tables: allow to filter stateful object dumps by type")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, we check the existing rtable in PREROUTING hook, if RTCF_LOCAL
is set, we assume that the packet is loopback.
But this assumption is incorrect, for example, a packet encapsulated
in ipsec transport mode was received and routed to local, after
decapsulation, it would be delivered to local again, and the rtable
was not dropped, so RTCF_LOCAL check would trigger. But actually, the
packet was not loopback.
So for these normal loopback packets, we can check whether the in device
is IFF_LOOPBACK or not. For these locally generated broadcast/multicast,
we can check whether the skb->pkt_type is PACKET_LOOPBACK or not.
Finally, there's a subtle difference between nft fib expr and xtables
rpfilter extension, user can add the following nft rule to do strict
rpfilter check:
# nft add rule x y meta iif eth0 fib saddr . iif oif != eth0 drop
So when the packet is loopback, it's better to store the in device
instead of the LOOPBACK_IFINDEX, otherwise, after adding the above
nft rule, locally generated broad/multicast packets will be dropped
incorrectly.
Fixes: f83a7ea207 ("netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too")
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ah6 input processing uses the asynchronous hash crypto API which
supplies an error code as part of the operation completion but
the error code was being ignored.
Treat a crypto API error indication as a verification failure.
While a crypto API reported error would almost certainly result
in a memcpy of the digest failing anyway and thus the security
risk seems minor, performing a memory compare on what might be
uninitialized memory is wrong.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ah4 input processing uses the asynchronous hash crypto API which
supplies an error code as part of the operation completion but
the error code was being ignored.
Treat a crypto API error indication as a verification failure.
While a crypto API reported error would almost certainly result
in a memcpy of the digest failing anyway and thus the security
risk seems minor, performing a memory compare on what might be
uninitialized memory is wrong.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Dan reports following smatch warning:
net/xfrm/xfrm_state.c:659
error: we previously assumed 'afinfo' could be null (see line 651)
649 struct xfrm_state_afinfo *afinfo = xfrm_state_afinfo_get_rcu(family);
651 if (afinfo)
...
658 }
659 afinfo->init_temprop(x, tmpl, daddr, saddr);
I am resonably sure afinfo cannot be NULL here.
xfrm_state4.c and state6.c are both part of ipv4/ipv6 (depends on
CONFIG_XFRM, a boolean) but even if ipv6 is a module state6.c can't
be removed (ipv6 lacks module_exit so it cannot be removed).
The only callers for xfrm6_fini that leads to state backend unregister
are error unwinding paths that can be called during ipv6 init function.
So after ipv6 module is loaded successfully the state backend cannot go
away anymore.
The family value from policy lookup path is taken from dst_entry, so
that should always be AF_INET(6).
However, since this silences the warning and avoids readers of this
code wondering about possible null deref it seems preferrable to
be defensive and just add the old check back.
Fixes: 711059b975 ("xfrm: add and use xfrm_state_afinfo_get_rcu")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Johannes Berg says:
====================
We have a number of fixes, in part because I was late
in actually sending them out - will try to do better in
the future:
* handle VHT opmode properly when hostapd is controlling
full station state
* two fixes for minimum channel width in mac80211
* don't leave SMPS set to junk in HT capabilities
* fix headroom when forwarding mesh packets, recently
broken by another fix that failed to take into account
frame encryption
* fix the TID in null-data packets indicating EOSP (end
of service period) in U-APSD
* prevent attempting to use (and then failing which
results in crashes) TXQs on stations that aren't added
to the driver yet
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
For 4.11, we seem to have more than in the past few releases:
* socket owner support for connections, so when the wifi
manager (e.g. wpa_supplicant) is killed, connections are
torn down - wpa_supplicant is critical to managing certain
operations, and can opt in to this where applicable
* minstrel & minstrel_ht updates to be more efficient (time and space)
* set wifi_acked/wifi_acked_valid for skb->destructor use in the
kernel, which was already available to userspace
* don't indicate new mesh peers that might be used if there's no
room to add them
* multicast-to-unicast support in mac80211, for better medium usage
(since unicast frames can use *much* higher rates, by ~3 orders of
magnitude)
* add API to read channel (frequency) limitations from DT
* add infrastructure to allow randomizing public action frames for
MAC address privacy (still requires driver support)
* many cleanups and small improvements/fixes across the board
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch disables FACK by default as RACK is the successor of FACK
(inspired by the insights behind FACK).
FACK[1] in Linux works as follows: a packet P is deemed lost,
if packet Q of higher sequence is s/acked and P and Q are distant
by at least dupthresh number of packets in sequence space.
FACK is more aggressive than the IETF recommened recovery for SACK
(RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
Recovery Algorithm for TCP), because a single SACK may trigger
fast recovery. This obviously won't work well with reordering so
FACK is dynamically disabled upon detecting reordering.
RACK supersedes FACK by using time distance instead of sequence
distance. On reordering, RACK waits for a quarter of RTT receiving
a single SACK before starting recovery. (the timer can be made more
adaptive in the future by measuring reordering distance in time,
but currently RTT/4 seem to work well.) Once the recovery starts,
RACK behaves almost like FACK because it reduces the reodering
window to 1ms, so it fast retransmits quickly. In addition RACK
can detect loss retransmission as it does not care about the packet
sequences (being repeated or not), which is extremely useful when
the connection is going through a traffic policer.
Google server experiments indicate that disabling FACK after enabling
RACK has negligible impact on the overall loss recovery performance
with more reordering events detected. But we still keep the FACK
implementation for backup if RACK has bugs that needs to be disabled.
[1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight). But
this older feature is now subsumed with RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon starts fast recovery instead of timeout if no further
ACKs are received.
The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>