Currently, chain 0 is implicitly created during block creation. However
that does not align with chain object exposure, creation and destruction
api introduced later on. So make the chain 0 behave the same way as any
other chain and only create it when it is needed. Since chain 0 is
somehow special as the qdiscs need to hold pointer to the first chain
tp, this requires to move the chain head change callback infra to the
block structure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Push all bits that take care of ops lookup, including module loading
outside tcf_proto_create() function, into tcf_proto_lookup_ops()
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shaochun Chen points out we leak dumper filter state allocations
stored in dump_control->data in case there is an error before netlink sets
cb_running (after which ->done will be called at some point).
In order to fix this, add .start functions and do the allocations
there.
->done is going to clean up, and in case error occurs before
->start invocation no cleanups need to be done anymore.
Reported-by: shaochun chen <cscnull@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.
I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fed ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb hash for locally generated ip[v6] fragments belonging
to the same datagram can vary in several circumstances:
* for connected UDP[v6] sockets, the first fragment get its hash
via set_owner_w()/skb_set_hash_from_sk()
* for unconnected IPv6 UDPv6 sockets, the first fragment can get
its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
auto_flowlabel is enabled
For the following frags the hash is usually computed via
skb_get_hash().
The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
scenario the egress tx queue can be selected on a per packet basis
via the skb hash.
It may also fool flow-oriented schedulers to place fragments belonging
to the same datagram in different flows.
Fix the issue by copying the skb hash from the head frag into
the others at fragmentation time.
Before this commit:
perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
perf script
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0
After this commit:
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
Fixes: b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf on xmit")
Fixes: 67800f9b1f ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The page map address is already stored in the RMB descriptor.
There is no need to derive it from the cpu_addr value.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Link group field tokens_used_mask is a bitmap. Use macro
DECLARE_BITMAP for its definition.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace a frequently used construct with a more readable variant,
reducing the code. Also might come handy when we start to support
more than a single per link group.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions to read and write cursors are exclusively used to copy
cursors. Therefore switch to a respective function instead.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename field diag_fallback into diag_mode and set the smc mode of a
connection explicitly.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
allows to set a backup port to be used for known unicast traffic if the
port has gone carrier down. The backup pointer is rcu protected and set
only under RTNL, a counter is maintained so when deleting a port we know
how many other ports reference it as a backup and we remove it from all.
Also the pointer is in the first cache line which is hot at the time of
the check and thus in the common case we only add one more test.
The backup port will be used only for the non-flooding case since
it's a part of the bridge and the flooded packets will be forwarded to it
anyway. To remove the forwarding just send a 0/non-existing backup port.
This is used to avoid numerous scalability problems when using MLAG most
notably if we have thousands of fdbs one would need to change all of them
on port carrier going down which takes too long and causes a storm of fdb
notifications (and again when the port comes back up). In a Multi-chassis
Link Aggregation setup usually hosts are connected to two different
switches which act as a single logical switch. Those switches usually have
a control and backup link between them called peerlink which might be used
for communication in case a host loses connectivity to one of them.
We need a fast way to failover in case a host port goes down and currently
none of the solutions (like bond) cannot fulfill the requirements because
the participating ports are actually the "master" devices and must have the
same peerlink as their backup interface and at the same time all of them
must participate in the bridge device. As Roopa noted it's normal practice
in routing called fast re-route where a precalculated backup path is used
when the main one is down.
Another use case of this is with EVPN, having a single vxlan device which
is backup of every port. Due to the nature of master devices it's not
currently possible to use one device as a backup for many and still have
all of them participate in the bridge (which is master itself).
More detailed information about MLAG is available at the link below.
https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
Further explanation and a diagram by Roopa:
Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new alternative store callback for port sysfs options
which takes a raw value (buf) and can use it directly. It is needed for the
backup port sysfs support since we have to pass the device by its name.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.
current call sequence for rtnl_configure_link
rtnetlink_newlink
rtnl_link_ops->newlink
rtnl_configure_link (unconditionally notifies userspace of
default and new dev flags)
If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.
This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.
Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.
makes the following call sequence work:
rtnetlink_newlink
rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
link and notifies
user-space of default
dev flags)
rtnl_configure_link (updates dev flags if requested by user ifm
and notifies user-space of new dev flags)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two scenarios that we will restore deleted records. The first is
when device down and up(or unmap/remap). In this scenario the new filter
mode is same with previous one. Because we get it from in_dev->mc_list and
we do not touch it during device down and up.
The other scenario is when a new socket join a group which was just delete
and not finish sending status reports. In this scenario, we should use the
current filter mode instead of restore old one. Here are 4 cases in total.
old_socket new_socket before_fix after_fix
IN(A) IN(A) ALLOW(A) ALLOW(A)
IN(A) EX( ) TO_IN( ) TO_EX( )
EX( ) IN(A) TO_EX( ) ALLOW(A)
EX( ) EX( ) TO_EX( ) TO_EX( )
Fixes: 24803f38a5 (igmp: do not remove igmp souce list info when set link down)
Fixes: 1666d49e1d (mld: do not remove mld souce list info when set link down)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the mode parameter for igmp/igmp6_group_added as we can get it
from first parameter.
Fixes: 6e2059b53f (ipv4/igmp: init group mode as INCLUDE when join source group)
Fixes: c7ea20c9da (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Example setup:
host: ip -6 addr add dev eth1 2001:db8:104::4
where eth1 is enslaved to a VRF
switch: ip -6 ro add 2001:db8:104::4/128 dev br1
where br1 only has an LLA
ping6 2001:db8:104::4
ssh 2001:db8:104::4
(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).
For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.
For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.
Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warnings:
net/tipc/link.c:376:5: warning: symbol 'link_bc_rcv_gap' was not declared. Should it be static?
net/tipc/link.c:823:6: warning: symbol 'link_prepare_wakeup' was not declared. Should it be static?
net/tipc/link.c:959:6: warning: symbol 'tipc_link_advance_backlog' was not declared. Should it be static?
net/tipc/link.c:1009:5: warning: symbol 'tipc_link_retrans' was not declared. Should it be static?
net/tipc/monitor.c:687:5: warning: symbol '__tipc_nl_add_monitor_peer' was not declared. Should it be static?
net/tipc/group.c:230:20: warning: symbol 'tipc_group_find_member' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This line makes up what macro PTR_ERR_OR_ZERO already does. So,
make use of PTR_ERR_OR_ZERO rather than an open-code version.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a missing rcu_read_unlock in the error path
Fixes: c95567c803 ("caif: added check for potential null return")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create the tcp_clamp_rto_to_user_timeout() helper routine. To calculate
the correct rto, so that the TCP_USER_TIMEOUT socket option is more
accurate. Taking suggestions and feedback into account from
Eric Dumazet, Neal Cardwell and David Laight. Due to the 1st commit we
can avoid the msecs_to_jiffies() and jiffies_to_msecs() dance.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a seperate helper routine as per Neal Cardwells suggestion. To
be used by the final commit in this series and retransmits_timed_out().
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preparatory commit. Part of this series that improves the
socket TCP_USER_TIMEOUT option accuracy. Implement Eric Dumazets idea
to convert icsk->icsk_user_timeout from jiffies to msecs. To eliminate
the msecs_to_jiffies() and jiffies_to_msecs() dance in future.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating various bridge objects in /sys/class/net/... make sure
that they belong to the container's owner instead of global root (if
they belong to a container/namespace).
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make net_ns_get_ownership() reusable by networking code outside of core.
This is useful, for example, to allow bridge related sysfs files to be
owned by container root.
Add a function comment since this is a potentially dangerous function to
use given the way that kobject_get_ownership() works by initializing uid
and gid before calling .get_ownership().
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating various objects in /sys/class/net/... make sure that they
belong to container's owner instead of global root (if they belong to a
container/namespace).
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An upcoming change will allow container root to open some /sys/class/net
files for writing. The tx_maxrate attribute can result in changes
to actual hardware devices so err on the side of caution by requiring
CAP_NET_ADMIN in the init namespace in the corresponding attribute store
operation.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a patch by Sean Tranchetti.
Fixes: d4546c2509 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree:
1) No need to set ttl from reject action for the bridge family, from
Taehee Yoo.
2) Use a fixed timeout for flow that are passed up from the flowtable
to conntrack, from Florian Westphal.
3) More preparation patches for tproxy support for nf_tables, from Mate
Eckl.
4) Remove unnecessary indirection in core IPv6 checksum function, from
Florian Westphal.
5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it.
From Florian Westphal.
6) socket match now selects socket infrastructure, instead of depending
on it. From Mate Eckl.
7) Patch series to simplify conntrack tuple building/parsing from packet
path and ctnetlink, from Florian Westphal.
8) Fetch timeout policy from protocol helpers, instead of doing it from
core, from Florian Westphal.
9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from
Florian Westphal.
10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES
respectively, instead of IPV6. Patch from Mate Eckl.
11) Add specific function for garbage collection in conncount,
from Yi-Hung Wei.
12) Catch number of elements in the connlimit list, from Yi-Hung Wei.
13) Move locking to nf_conncount, from Yi-Hung Wei.
14) Series of patches to add lockless tree traversal in nf_conncount,
from Yi-Hung Wei.
15) Resolve clash in matching conntracks when race happens, from
Martynas Pumputis.
16) If connection entry times out, remove template entry from the
ip_vs_conn_tab table to improve behaviour under flood, from
Julian Anastasov.
17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng.
18) Call abort from 2-phase commit protocol before requesting modules,
make sure this is done under the mutex, from Florian Westphal.
19) Grab module reference when starting transaction, also from Florian.
20) Dynamically allocate expression info array for pre-parsing, from
Florian.
21) Add per netns mutex for nf_tables, from Florian Westphal.
22) A couple of patches to simplify and refactor nf_osf code to prepare
for nft_osf support.
23) Break evaluation on missing socket, from Mate Eckl.
24) Allow to match socket mark from nft_socket, from Mate Eckl.
25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is
built-in into nf_conntrack. From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code does not check sk->sk_shutdown & RCV_SHUTDOWN.
tls_sw_recvmsg may return a positive value in the case where bytes have
already been copied when the socket is shutdown. sk->sk_err has been
cleared, causing the tls_wait_data to hang forever on a subsequent
invocation. Checking sk->sk_shutdown & RCV_SHUTDOWN, as in tcp_recvmsg,
fixes this problem.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:
""" When receiving packets, the CE codepoint MUST be processed as follows:
1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
true and send an immediate ACK.
2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
to false and send an immediate ACK.
"""
Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.
Tested with this packetdrill script provided by Larry Brakmo
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).
Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.
The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit referred to below introduced an update of the link
capabilities field that is not safe. Given the recently added
feature to remove idle node and link items after 5 minutes, there
is a small risk that the update will happen at the very moment the
targeted link is being removed. To avoid this we have to perform
the update inside the node item's write lock protection.
Fixes: 9012de5089 ("tipc: add sequence number check for link STATE messages")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that the proper structure to use in this particular
case is *skb_iter* instead of skb.
Addresses-Coverity-ID: 1471906 ("Copy-paste error")
Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When first DCCP packet is SYNC or SYNCACK, we insert a new conntrack
that has an un-initialized timeout value, i.e. such entry could be
reaped at any time.
Mark them as INVALID and only ignore SYNC/SYNCACK when connection had
an old state.
Reported-by: syzbot+6f18401420df260e37ed@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its possible to rename two chains to the same name in one
transaction:
nft add chain t c1
nft add chain t c2
nft 'rename chain t c1 c3;rename chain t c2 c3'
This creates two chains named 'c3'.
Appears to be harmless, both chains can still be deleted both
by name or handle, but, nevertheless, its a bug.
Walk transaction log and also compare vs. the pending renames.
Both chains can still be deleted, but nevertheless it is a bug as
we don't allow to create chains with identical names, so we should
prevent this from happening-by-rename too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The new name is stored in the transaction metadata, on commit,
the pointers to the old and new names are swapped.
Therefore in abort and commit case we have to free the
pointer in the chain_trans container.
In commit case, the pointer can be used by another cpu that
is currently dumping the renamed chain, thus kfree needs to
happen after waiting for rcu readers to complete.
Fixes: b7263e071a ("netfilter: nf_tables: Allow chain name of up to 255 chars")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
no need to store the name in separate area.
Furthermore, it uses kmalloc but not kfree and most accesses seem to treat
it as char[IFNAMSIZ] not char *.
Remove this and use dev->name instead.
In case event zeroed dev, just omit the name in the dump.
Fixes: d92191aa84 ("netfilter: nf_tables: cache device name in flowtable object")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow attaching an SA to an xfrm interface id after
the creation of the SA, so that tasks such as keying
which must be done as the SA is created, can remain
separate from the decision on how to route traffic
from an SA. This permits SA creation to be decomposed
in to three separate steps:
1) allocation of a SPI
2) algorithm and key negotiation
3) insertion into the data path
Signed-off-by: Nathan Harold <nharold@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In order to remove performance impact of having the extra u32 in every
single flowi, this change removes the flowi_xfrm struct, prefering to
take the if_id as a method parameter where needed.
In the inbound direction, if_id is only needed during the
__xfrm_check_policy() function, and the if_id can be determined at that
point based on the skb. As such, xfrmi_decode_session() is only called
with the skb in __xfrm_check_policy().
In the outbound direction, the only place where if_id is needed is the
xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
directly passed into the xfrm_lookup_with_ifid() call. All existing
callers can still call xfrm_lookup(), which uses a default if_id of 0.
This change does not change any behavior of XFRMIs except for improving
overall system performance via flowi size reduction.
This change has been tested against the Android Kernel Networking Tests:
https://android.googlesource.com/kernel/tests/+/master/net/test
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>