Different namespace application might require different time period in
second to disable Fastopen on active TCP sockets.
Tested:
Simulate following similar situation that the server's data gets dropped
after 3WHS.
C ---- syn-data ---> S
C <--- syn/ack ----- S
C ---- ack --------> S
S (accept & write)
C? X <- data ------ S
[retry and timeout]
And then print netstat of TCPFastOpenBlackhole, the counter increased as
expected when the firewall blackhole issue is detected and active TFO is
disabled.
# cat /proc/net/netstat | awk '{print $91}'
TCPFastOpenBlackhole
1
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different namespace application might require different tcp_fastopen_key
independently of the host.
David Miller pointed out there is a leak without releasing the context
of tcp_fastopen_key during netns teardown. So add the release action in
exit_batch path.
Tested:
1. Container namespace:
# cat /proc/sys/net/ipv4/tcp_fastopen_key:
2817fff2-f803cf97-eadfd1f3-78c0992b
cookie key in tcp syn packets:
Fast Open Cookie
Kind: TCP Fast Open Cookie (34)
Length: 10
Fast Open Cookie: 1e5dd82a8c492ca9
2. Host:
# cat /proc/sys/net/ipv4/tcp_fastopen_key:
107d7c5f-68eb2ac7-02fb06e6-ed341702
cookie key in tcp syn packets:
Fast Open Cookie
Kind: TCP Fast Open Cookie (34)
Length: 10
Fast Open Cookie: e213c02bf0afbc8a
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'publish' logic is not necessary after commit dfea2aa654 ("tcp:
Do not call tcp_fastopen_reset_cipher from interrupt context"), because
in tcp_fastopen_cookie_gen,it wouldn't call tcp_fastopen_init_key_once.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different namespace application might require enable TCP Fast Open
feature independently of the host.
This patch series continues making more of the TCP Fast Open related
sysctl knobs be per net-namespace.
Reported-by: Luca BRUNO <lucab@debian.org>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the dsa_ptr is a dsa_port instance, there is no need to keep
the tag operations in the dsa_switch_tree structure. Remove it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to make DSA master devices point to their corresponding
CPU port instead of the whole tree, add copies of dst and rcv in the
dsa_port structure so that we keep fast access in the receive hot path.
Also keep the copies at the beginning of the dsa_port structure in order
to ensure they are available in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA tagging protocol operations are specific to each CPU port,
thus the dsa_device_ops pointer belongs to the dsa_port structure.
>From now on assign a slave's xmit copy from its CPU port tagging
operations. This will ease the future support for multiple CPU ports.
Also keep the tag_ops at the beginning of the dsa_port structure so that
we ensure copies for hot path are in cacheline 1.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP early demux can leverate the rx dst cache even for
multicast unconnected sockets.
In such scenario the ipv4 source address is validated only on
the first packet in the given flow. After that, when we fetch
the dst entry from the socket rx cache, we stop enforcing
the rp_filter and we even start accepting any kind of martian
addresses.
Disabling the dst cache for unconnected multicast socket will
cause large performace regression, nearly reducing by half the
max ingress tput.
Instead we factor out a route helper to completely validate an
skb source address for multicast packets and we call it from
the UDP early demux for mcast packets landing on unconnected
sockets, after successful fetching the related cached dst entry.
This still gives a measurable, but limited performance
regression:
rp_filter = 0 rp_filter = 1
edmux disabled: 1182 Kpps 1127 Kpps
edmux before: 2238 Kpps 2238 Kpps
edmux after: 2037 Kpps 2019 Kpps
The above figures are on top of current net tree.
Applying the net-next commit 6e617de84e ("net: avoid a full
fib lookup when rp_filter is disabled.") the delta with
rp_filter == 0 will decrease even more.
Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently no error is emitted, but this infrastructure will
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_weight in fib_info is set but not used. Remove it and the
helpers for setting it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the ipmr module register as a FIB notifier. To do that, implement both
the ipmr_seq_read and ipmr_dump ops.
The ipmr_seq_read op returns a sequence counter that is incremented on
every notification related operation done by the ipmr. To implement that,
add a sequence counter in the netns_ipv4 struct and increment it whenever a
new MFC route or VIF are added or deleted. The sequence operations are
protected by the RTNL lock.
The ipmr_dump iterates the list of MFC routes and the list of VIF entries
and sends notifications about them. The entries dump is done under RCU
where the VIF dump uses the mrt_lock too, as the vif->dev field can change
under RCU.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order for an interface to forward packets according to the kernel
multicast routing table, it must be configured with a VIF index according
to the mroute user API. The VIF index is then used to refer to that
interface in the mroute user API, for example, to set the iif and oifs of
an MFC entry.
In order to allow drivers to be aware and offload multicast routes, they
have to be aware of the VIF add and delete notifications.
Due to the fact that a specific VIF can be deleted and re-added pointing to
another netdevice, and the MFC routes that point to it will forward the
matching packets to the new netdevice, a driver willing to offload MFC
cache entries must be aware of the VIF add and delete events in addition to
MFC routes notifications.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a helper called is_tcf_gact_pass which could be used to
tell if the action is gact pass or not.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Key length can't be negative.
Leave comparisons against nla_len() signed just in case truncated attribute
can sneak in there.
Space savings:
add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-7 (-7)
function old new delta
pneigh_delete 273 272 -1
mlx5e_rep_netevent_event 1415 1414 -1
mlx5e_create_encap_header_ipv6 1194 1193 -1
mlx5e_create_encap_header_ipv4 1071 1070 -1
cxgb4_l2t_get 1104 1103 -1
__pneigh_lookup 69 68 -1
__neigh_create 2452 2451 -1
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_KASAN is enabled, the "--param asan-stack=1" causes rather large
stack frames in some functions. This goes unnoticed normally because
CONFIG_FRAME_WARN is disabled with CONFIG_KASAN by default as of commit
3f181b4d86 ("lib/Kconfig.debug: disable -Wframe-larger-than warnings with
KASAN=y").
The kernelci.org build bot however has the warning enabled and that led
me to investigate it a little further, as every build produces these warnings:
net/wireless/nl80211.c:4389:1: warning: the frame size of 2240 bytes is larger than 2048 bytes [-Wframe-larger-than=]
net/wireless/nl80211.c:1895:1: warning: the frame size of 3776 bytes is larger than 2048 bytes [-Wframe-larger-than=]
net/wireless/nl80211.c:1410:1: warning: the frame size of 2208 bytes is larger than 2048 bytes [-Wframe-larger-than=]
net/bridge/br_netlink.c:1282:1: warning: the frame size of 2544 bytes is larger than 2048 bytes [-Wframe-larger-than=]
Most of this problem is now solved in gcc-8, which can consolidate
the stack slots for the inline function arguments. On older compilers
we can add a workaround by declaring a local variable in each function
to pass the inline function argument.
Cc: stable@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replay detection bitmaps can't have negative length.
Comparisons with nla_len() are left signed just in case negative value
can sneak in there.
Propagate unsignedness for code size savings:
add/remove: 0/0 grow/shrink: 0/5 up/down: 0/-38 (-38)
function old new delta
xfrm_state_construct 1802 1800 -2
xfrm_update_ae_params 295 289 -6
xfrm_state_migrate 1345 1339 -6
xfrm_replay_notify_esn 349 337 -12
xfrm_replay_notify_bmp 345 333 -12
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Key lengths can't be negative.
Comparison with nla_len() is left signed just in case negative value
can sneak in there.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Key lengths can't be negative.
Comparison with nla_len() is left signed just in case negative value
can sneak in there.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Key lengths can't be negative.
Comparison with nla_len() is left signed just in case negative value
can sneak in there.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In linux-4.13, Wei worked hard to convert dst to a traditional
refcounted model, removing GC.
We now want to make sure a dst refcount can not transition from 0 back
to 1.
The problem here is that input path attached a not refcounted dst to an
skb. Then later, because packet is forwarded and hits skb_dst_force()
before exiting RCU section, we might try to take a refcount on one dst
that is about to be freed, if another cpu saw 1 -> 0 transition in
dst_release() and queued the dst for freeing after one RCU grace period.
Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
always perform the complete check against dst refcount, and not assume
it is not zero.
Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
[ 989.919496] skb_dst_force+0x32/0x34
[ 989.919498] __dev_queue_xmit+0x1ad/0x482
[ 989.919501] ? eth_header+0x28/0xc6
[ 989.919502] dev_queue_xmit+0xb/0xd
[ 989.919504] neigh_connected_output+0x9b/0xb4
[ 989.919507] ip_finish_output2+0x234/0x294
[ 989.919509] ? ipt_do_table+0x369/0x388
[ 989.919510] ip_finish_output+0x12c/0x13f
[ 989.919512] ip_output+0x53/0x87
[ 989.919513] ip_forward_finish+0x53/0x5a
[ 989.919515] ip_forward+0x2cb/0x3e6
[ 989.919516] ? pskb_trim_rcsum.part.9+0x4b/0x4b
[ 989.919518] ip_rcv_finish+0x2e2/0x321
[ 989.919519] ip_rcv+0x26f/0x2eb
[ 989.919522] ? vlan_do_receive+0x4f/0x289
[ 989.919523] __netif_receive_skb_core+0x467/0x50b
[ 989.919526] ? tcp_gro_receive+0x239/0x239
[ 989.919529] ? inet_gro_receive+0x226/0x238
[ 989.919530] __netif_receive_skb+0x4d/0x5f
[ 989.919532] netif_receive_skb_internal+0x5c/0xaf
[ 989.919533] napi_gro_receive+0x45/0x81
[ 989.919536] ixgbe_poll+0xc8a/0xf09
[ 989.919539] ? kmem_cache_free_bulk+0x1b6/0x1f7
[ 989.919540] net_rx_action+0xf4/0x266
[ 989.919543] __do_softirq+0xa8/0x19d
[ 989.919545] irq_exit+0x5d/0x6b
[ 989.919546] do_IRQ+0x9c/0xb5
[ 989.919548] common_interrupt+0x93/0x93
[ 989.919548] </IRQ>
Similarly dst_clone() can use dst_hold() helper to have additional
debugging, as a follow up to commit 44ebe79149 ("net: add debug
atomic_inc_not_zero() in dst_hold()")
In net-next we will convert dst atomic_t to refcount_t for peace of
mind.
Fixes: a4c2fd7f78 ("net: remove DST_NOCACHE flag")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Bisected-by: Paweł Staszewski <pstaszewski@itcare.pl>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
> net/ipv4/fib_frontend.c: In function 'fib_validate_source':
> net/ipv4/fib_frontend.c:411:16: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> if (net->ipv4.fib_has_custom_local_routes)
> ^
> net/ipv4/fib_frontend.c: In function 'inet_rtm_newroute':
> net/ipv4/fib_frontend.c:773:12: error: 'struct netns_ipv4' has no member named 'fib_has_custom_local_routes'
> net->ipv4.fib_has_custom_local_routes = true;
> ^
Fixes: 6e617de84e ("net: avoid a full fib lookup when rp_filter is disabled.")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 1dced6a854 ("ipv4: Restore accept_local behaviour
in fib_validate_source()") a full fib lookup is needed even if
the rp_filter is disabled, if accept_local is false - which is
the default.
What we really need in the above scenario is just checking
that the source IP address is not local, and in most case we
can do that is a cheaper way looking up the ifaddr hash table.
This commit adds a helper for such lookup, and uses it to
validate the src address when rp_filter is disabled and no
'local' routes are created by the user space in the relevant
namespace.
A new ipv4 netns flag is added to account for such routes.
We need that to preserve the same behavior we had before this
patch.
It also drops the checks to bail early from __fib_validate_source,
added by the commit 1dced6a854 ("ipv4: Restore accept_local
behaviour in fib_validate_source()") they do not give any
measurable performance improvement: if we do the lookup with are
on a slower path.
This improves UDP performances for unconnected sockets
when rp_filter is disabled by 5% and also gives small but
measurable performance improvement for TCP flood scenarios.
v1 -> v2:
- use the ifaddr lookup helper in __ip_dev_find(), as suggested
by Eric
- fall-back to full lookup if custom local routes are present
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function hasn't been used since the removal of iwmc3200wifi
in 2012. It also appears to have a bug when qos=True, since then
it'll copy uninitialized stack memory to the SKB.
Just remove the function entirely.
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add documentation to ieee80211_rx_ba_offl() function and, while at it,
rename the bit argument to tid, for consistency.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Current ieee80211_ie_split() implementation doesn't
account for elements that are sub-elements of the
EXTENSION IE. To extend support to these IEs as well,
treat the WLAN_EID_EXTENSION ids in the %ids array
as indicating that the next id in the array is a
sub-element of the EXTENSION IE.
Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Having a global list of labels do not scale to thousands of
netns in the cloud era. This causes quadratic behavior on
netns creation and deletion.
This is time having a per netns list of ~10 labels.
Tested:
$ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]
real 0m20.837s # instead of 0m24.227s
user 0m0.328s
sys 0m20.338s # instead of 0m23.753s
16.17% ip [kernel.kallsyms] [k] netlink_broadcast_filtered
12.30% ip [kernel.kallsyms] [k] netlink_has_listeners
6.76% ip [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.78% ip [kernel.kallsyms] [k] memset_erms
5.77% ip [kernel.kallsyms] [k] kobject_uevent_env
5.18% ip [kernel.kallsyms] [k] refcount_sub_and_test
4.96% ip [kernel.kallsyms] [k] _raw_read_lock
3.82% ip [kernel.kallsyms] [k] refcount_inc_not_zero
3.33% ip [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
2.11% ip [kernel.kallsyms] [k] unmap_page_range
1.77% ip [kernel.kallsyms] [k] __wake_up
1.69% ip [kernel.kallsyms] [k] strlen
1.17% ip [kernel.kallsyms] [k] __wake_up_common
1.09% ip [kernel.kallsyms] [k] insert_header
1.04% ip [kernel.kallsyms] [k] page_remove_rmap
1.01% ip [kernel.kallsyms] [k] consume_skb
0.98% ip [kernel.kallsyms] [k] netlink_trim
0.51% ip [kernel.kallsyms] [k] kernfs_link_sibling
0.51% ip [kernel.kallsyms] [k] filemap_map_pages
0.46% ip [kernel.kallsyms] [k] memcpy_erms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller no longer needs to wait for a grace period. So this
patch gets rid of it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to store a copy of the master ethtool ops, storing the
original pointer in DSA and the new one in the master netdev itself is
enough.
In the meantime, set orig_ethtool_ops to NULL when restoring the master
ethtool ops and check the presence of the master original ethtool ops as
well as its needed functions before calling them.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).
Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.
This saves some overhead in both TCP and netem.
v2: removes the swtstamp field from struct tcp_skb_cb
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Global function ipv6_rcv_saddr_equal and static functions
ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int.
bool is slightly more descriptive for these functions so change
their return type from int to bool.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 86fdb3448c ("sctp: ensure ep is not destroyed before doing the
dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep
with holding sock and sock lock.
But Paolo noticed that endpoint could be destroyed in sctp_rcv without
sock lock protection. It means the use-after-free issue still could be
triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks
!ep, although it's pretty hard to reproduce.
I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close
and sctp_sock_dump long time.
This patch is to add another param cb_done to sctp_for_each_transport
and dump ep->assocs with holding tsp after jumping out of transport's
traversal in it to avoid this issue.
It can also improve sctp diag dump to make it run faster, as no need
to save sk into cb->args[5] and keep calling sctp_for_each_transport
any more.
This patch is also to use int * instead of int for the pos argument
in sctp_for_each_transport, which could make postion increment only
in sctp_for_each_transport and no need to keep changing cb->args[2]
in sctp_sock_filter and sctp_sock_dump any more.
Fixes: 86fdb3448c ("sctp: ensure ep is not destroyed before doing the dump")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This code causes a static checker warning because Smatch doesn't trust
anything that comes from skb->data. I've reviewed this code and I do
think skb->data can be controlled by the user here.
The sctp_event_subscribe struct has 13 __u8 fields and we want to see
if ours is non-zero. sn_type can be any value in the 0-USHRT_MAX range.
We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
either before the start of the struct or after the end.
This is a very old bug and it's surprising that it would go undetected
for so long but my theory is that it just doesn't have a big impact so
it would be hard to notice.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen estimator has been rewritten in commit 1c0d32fde5
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller is no longer needed to wait for a grace period.
So this patch gets rid of it.
This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().
Please see also the comments in code.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix SCTP connection setup when IPVS module is loaded and any scheduler
is registered, from Xin Long.
2) Don't create a SCTP connection from SCTP ABORT packets, also from
Xin Long.
3) WARN_ON() and drop packet, instead of BUG_ON() races when calling
nf_nat_setup_info(). This is specifically a longstanding problem
when br_netfilter with conntrack support is in place, patch from
Florian Westphal.
4) Avoid softlock splats via iptables-restore, also from Florian.
5) Revert NAT hashtable conversion to rhashtable, semantics of rhlist
are different from our simple NAT hashtable, this has been causing
problems in the recent Linux kernel releases. From Florian.
6) Add per-bucket spinlock for NAT hashtable, so at least we restore
one of the benefits we got from the previous rhashtable conversion.
7) Fix incorrect hashtable size in memory allocation in xt_hashlimit,
from Zhizhou Tian.
8) Fix build/link problems with hashlimit and 32-bit arches, to address
recent fallout from a new hashlimit mode, from Vishwanath Pai.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 870190a9ec.
It was not a good idea. The custom hash table was a much better
fit for this purpose.
A fast lookup is not essential, in fact for most cases there is no lookup
at all because original tuple is not taken and can be used as-is.
What needs to be fast is insertion and deletion.
rhlist removal however requires a rhlist walk.
We can have thousands of entries in such a list if source port/addresses
are reused for multiple flows, if this happens removal requests are so
expensive that deletions of a few thousand flows can take several
seconds(!).
The advantages that we got from rhashtable are:
1) table auto-sizing
2) multiple locks
1) would be nice to have, but it is not essential as we have at
most one lookup per new flow, so even a million flows in the bysource
table are not a problem compared to current deletion cost.
2) is easy to add to custom hash table.
I tried to add hlist_node to rhlist to speed up rhltable_remove but this
isn't doable without changing semantics. rhltable_remove_fast will
check that the to-be-deleted object is part of the table and that
requires a list walk that we want to avoid.
Furthermore, using hlist_node increases size of struct rhlist_head, which
in turn increases nf_conn size.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196821
Reported-by: Ivan Babrou <ibobrik@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johannes Berg says:
====================
Back from a long absence, so we have a number of things:
* a remain-on-channel fix from Avi
* hwsim TX power fix from Beni
* null-PTR dereference with iTXQ in some rare configurations (Chunho)
* 40 MHz custom regdomain fixes (Emmanuel)
* look at right place in HT/VHT capability parsing (Igor)
* complete A-MPDU teardown properly (Ilan)
* Mesh ID Element ordering fix (Liad)
* avoid tracing warning in ht_dbg() (Sharon)
* fix print of assoc/reassoc (Simon)
* fix encrypted VLAN with iTXQ (myself)
* fix calling context of TX queue wake (myself)
* fix a deadlock with ath10k aggregation (myself)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Let switch drivers indicate how many TX queues they support. Some
switches, such as Broadcom Starfighter 2 are designed with 8 egress
queues. Future changes will allow us to leverage the queue mapping and
direct the transmission towards a particular queue.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_flow_dissect is riddled with gotos that make discerning the flow,
debugging, and extending the capability difficult. This patch
reorganizes things so that we only perform goto's after the two main
switch statements (no gotos within the cases now). It also eliminates
several goto labels so that there are only two labels that can be target
for goto.
Reported-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We get a new link error in allmodconfig kernels after ftgmac100
started using the ncsi helpers:
ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
Related to that, we get another error when CONFIG_NET_NCSI is disabled:
drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 'ncsi_start_dev'?
drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 'ncsi_vlan_rx_add_vid'?
This fixes both problems at once, using a 'static inline' stub helper
for the disabled case, and exporting the functions when they are present.
Fixes: 51564585d8 ("ftgmac100: Support NCSI VLAN filtering when available")
Fixes: 21acf63013 ("net/ncsi: Configure VLAN tag filter")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
With TXQs, the AP_VLAN interfaces are resolved to their owner AP
interface when enqueuing the frame, which makes sense since the
frame really goes out on that as far as the driver is concerned.
However, this introduces a problem: frames to be encrypted with
a VLAN-specific GTK will now be encrypted with the AP GTK, since
the information about which virtual interface to use to select
the key is taken from the TXQ.
Fix this by preserving info->control.vif and using that in the
dequeue function. This now requires doing the driver-mapping
in the dequeue as well.
Since there's no way to filter the frames that are sitting on a
TXQ, drop all frames, which may affect other interfaces, when an
AP_VLAN is removed.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pablo Neira Ayuso says:
====================
Netfilter updates for next-net (part 2)
The following patchset contains Netfilter updates for net-next. This
patchset includes updates for nf_tables, removal of
CONFIG_NETFILTER_DEBUG and a new mode for xt_hashlimit. More
specifically, they:
1) Add new rate match mode for hashlimit, this introduces a new revision
for this match. The idea is to stop matching packets until ratelimit
criteria stands true. Patch from Vishwanath Pai.
2) Add ->select_ops indirection to nf_tables named objects, so we can
choose between different flavours of the same object type, patch from
Pablo M. Bermudo.
3) Shorter function names in nft_limit, basically:
s/nft_limit_pkt_bytes/nft_limit_bytes, also from Pablo M. Bermudo.
4) Add new stateful limit named object type, this allows us to create
limit policies that you can identify via name, also from Pablo.
5) Remove unused hooknum parameter in conntrack ->packet indirection.
From Florian Westphal.
6) Patches to remove CONFIG_NETFILTER_DEBUG and macros such as
IP_NF_ASSERT and IP_NF_ASSERT. From Varsha Rao.
7) Add nf_tables_updchain() helper function and use it from
nf_tables_newchain() to make it more maintainable. Similarly,
add nf_tables_addchain() and use it too.
8) Add new netlink NLM_F_NONREC flag, this flag should only be used for
deletion requests, specifically, to support non-recursive deletion.
Based on what we discussed during NFWS'17 in Faro.
9) Use NLM_F_NONREC from table and sets in nf_tables.
10) Support for recursive chain deletion. Table and set deletion
commands come with an implicit content flush on deletion, while
chains do not. This patch addresses this inconsistency by adding
the code to perform recursive chain deletions. This also comes with
the bits to deal with the new NLM_F_NONREC netlink flag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they
are no longer required. Replace _ASSERT() macros with WARN_ON().
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.
This change is needed for upcoming additions to the stateful objects
infrastructure.
Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>