Be there a platform with the following layout:
Regular NIC
|
+----> DSA master for switch port
|
+----> DSA master for another switch port
After changing DSA back to static lockdep class keys in commit
1a33e10e4a ("net: partially revert dynamic lockdep key changes"), this
kernel splat can be seen:
[ 13.361198] ============================================
[ 13.366524] WARNING: possible recursive locking detected
[ 13.371851] 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988 Not tainted
[ 13.377874] --------------------------------------------
[ 13.383201] swapper/0/0 is trying to acquire lock:
[ 13.388004] ffff0000668ff298 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.397879]
[ 13.397879] but task is already holding lock:
[ 13.403727] ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.413593]
[ 13.413593] other info that might help us debug this:
[ 13.420140] Possible unsafe locking scenario:
[ 13.420140]
[ 13.426075] CPU0
[ 13.428523] ----
[ 13.430969] lock(&dsa_slave_netdev_xmit_lock_key);
[ 13.435946] lock(&dsa_slave_netdev_xmit_lock_key);
[ 13.440924]
[ 13.440924] *** DEADLOCK ***
[ 13.440924]
[ 13.446860] May be due to missing lock nesting notation
[ 13.446860]
[ 13.453668] 6 locks held by swapper/0/0:
[ 13.457598] #0: ffff800010003de0 ((&idev->mc_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x0/0x400
[ 13.466593] #1: ffffd4d3fb478700 (rcu_read_lock){....}-{1:2}, at: mld_sendpack+0x0/0x560
[ 13.474803] #2: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: ip6_finish_output2+0x64/0xb10
[ 13.483886] #3: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[ 13.492793] #4: ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[ 13.503094] #5: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[ 13.512000]
[ 13.512000] stack backtrace:
[ 13.516369] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988
[ 13.530421] Call trace:
[ 13.532871] dump_backtrace+0x0/0x1d8
[ 13.536539] show_stack+0x24/0x30
[ 13.539862] dump_stack+0xe8/0x150
[ 13.543271] __lock_acquire+0x1030/0x1678
[ 13.547290] lock_acquire+0xf8/0x458
[ 13.550873] _raw_spin_lock+0x44/0x58
[ 13.554543] __dev_queue_xmit+0x84c/0xbe0
[ 13.558562] dev_queue_xmit+0x24/0x30
[ 13.562232] dsa_slave_xmit+0xe0/0x128
[ 13.565988] dev_hard_start_xmit+0xf4/0x448
[ 13.570182] __dev_queue_xmit+0x808/0xbe0
[ 13.574200] dev_queue_xmit+0x24/0x30
[ 13.577869] neigh_resolve_output+0x15c/0x220
[ 13.582237] ip6_finish_output2+0x244/0xb10
[ 13.586430] __ip6_finish_output+0x1dc/0x298
[ 13.590709] ip6_output+0x84/0x358
[ 13.594116] mld_sendpack+0x2bc/0x560
[ 13.597786] mld_ifc_timer_expire+0x210/0x390
[ 13.602153] call_timer_fn+0xcc/0x400
[ 13.605822] run_timer_softirq+0x588/0x6e0
[ 13.609927] __do_softirq+0x118/0x590
[ 13.613597] irq_exit+0x13c/0x148
[ 13.616918] __handle_domain_irq+0x6c/0xc0
[ 13.621023] gic_handle_irq+0x6c/0x160
[ 13.624779] el1_irq+0xbc/0x180
[ 13.627927] cpuidle_enter_state+0xb4/0x4d0
[ 13.632120] cpuidle_enter+0x3c/0x50
[ 13.635703] call_cpuidle+0x44/0x78
[ 13.639199] do_idle+0x228/0x2c8
[ 13.642433] cpu_startup_entry+0x2c/0x48
[ 13.646363] rest_init+0x1ac/0x280
[ 13.649773] arch_call_rest_init+0x14/0x1c
[ 13.653878] start_kernel+0x490/0x4bc
Lockdep keys themselves were added in commit ab92d68fc2 ("net: core:
add generic lockdep keys"), and it's very likely that this splat existed
since then, but I have no real way to check, since this stacked platform
wasn't supported by mainline back then.
>From Taehee's own words:
This patch was considered that all stackable devices have LLTX flag.
But the dsa doesn't have LLTX, so this splat happened.
After this patch, dsa shares the same lockdep class key.
On the nested dsa interface architecture, which you illustrated,
the same lockdep class key will be used in __dev_queue_xmit() because
dsa doesn't have LLTX.
So that lockdep detects deadlock because the same lockdep class key is
used recursively although actually the different locks are used.
There are some ways to fix this problem.
1. using NETIF_F_LLTX flag.
If possible, using the LLTX flag is a very clear way for it.
But I'm so sorry I don't know whether the dsa could have LLTX or not.
2. using dynamic lockdep again.
It means that each interface uses a separate lockdep class key.
So, lockdep will not detect recursive locking.
But this way has a problem that it could consume lockdep class key
too many.
Currently, lockdep can have 8192 lockdep class keys.
- you can see this number with the following command.
cat /proc/lockdep_stats
lock-classes: 1251 [max: 8192]
...
The [max: 8192] means that the maximum number of lockdep class keys.
If too many lockdep class keys are registered, lockdep stops to work.
So, using a dynamic(separated) lockdep class key should be considered
carefully.
In addition, updating lockdep class key routine might have to be existing.
(lockdep_register_key(), lockdep_set_class(), lockdep_unregister_key())
3. Using lockdep subclass.
A lockdep class key could have 8 subclasses.
The different subclass is considered different locks by lockdep
infrastructure.
But "lock-classes" is not counted by subclasses.
So, it could avoid stopping lockdep infrastructure by an overflow of
lockdep class keys.
This approach should also have an updating lockdep class key routine.
(lockdep_set_subclass())
4. Using nonvalidate lockdep class key.
The lockdep infrastructure supports nonvalidate lockdep class key type.
It means this lockdep is not validated by lockdep infrastructure.
So, the splat will not happen but lockdep couldn't detect real deadlock
case because lockdep really doesn't validate it.
I think this should be used for really special cases.
(lockdep_set_novalidate_class())
Further discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/20200503052220.4536-2-xiyou.wangcong@gmail.com/
There appears to be no negative side-effect to declaring lockless TX for
the DSA virtual interfaces, which means they handle their own locking.
So that's what we do to make the splat go away.
Patch tested in a wide variety of cases: unicast, multicast, PTP, etc.
Fixes: ab92d68fc2 ("net: core: add generic lockdep keys")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
change typo in function name "nofity" to "notify"
sctp_ulpevent_nofity_peer_addr_change ->
sctp_ulpevent_notify_peer_addr_change
Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a field to the tls rx offload context which enables
drivers to force a send_resync call.
This field can be used by drivers to request a resync at the next
possible tls record. It is beneficial for hardware that provides the
resync sequence number asynchronously. In such cases, the packet that
triggered the resync does not contain the information required for a
resync. Instead, the driver requests resync for all the following
TLS record until the asynchronous notification with the resync request
TCP sequence arrives.
A following series for mlx5e ConnectX-6DX TLS RX offload support will
use this mechanism.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang says:
====================
net_sched: reduce the number of qdisc resets
This patchset aims to reduce the number of qdisc resets during
qdisc tear down. Patch 1~3 are preparation for their following
patches, especially patch 2 and patch 3 add a few tracepoints
so that we can observe the whole lifetime of qdisc's. Patch 4
and patch 5 are the ones do the actual work. Please find more
details in each patch description.
Vaclav Zindulka tested this patchset and his large ruleset with
over 13k qdiscs defined got from 22s to 520ms.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting old qdisc on dev_queue->qdisc_sleeping in
dev_qdisc_reset() is redundant, because this qdisc,
even if not same with dev_queue->qdisc, is reset via
qdisc_put() right after calling dev_graft_qdisc() when
hitting refcnt 0.
This is very easy to observe with qdisc_reset() tracepoint
and stack traces.
Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Except for sch_mq and sch_mqprio, each dev queue points to the
same root qdisc, so when we reset the dev queues with
netdev_for_each_tx_queue() we end up resetting the same instance
of the root qdisc for multiple times.
Avoid this by checking the __QDISC_STATE_DEACTIVATED bit in
each iteration, so for sch_mq/sch_mqprio, we still reset all
of them like before, for the rest, we only reset it once.
Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_destroy() calls ops->reset() and cleans up qdisc->gso_skb
and qdisc->skb_bad_txq, these are nearly same with qdisc_reset(),
so just call it directly, and cosolidate the code for the next
patch.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The change to bprm->have_execfd was incomplete, leading
to a build failure:
fs/binfmt_elf_fdpic.c: In function 'create_elf_fdpic_tables':
fs/binfmt_elf_fdpic.c:591:27: error: 'BINPRM_FLAGS_EXECFD' undeclared
Change the last user of BINPRM_FLAGS_EXECFD in a corresponding
way.
Reported-by: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Fixes: b8a61c9e7b ("exec: Generic execfd support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
No element named "client" exists within "struct ipa_endpoint".
It might be a heritage forgotten to be removed. Delete it now.
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "transactio" -> "transaction". Also an alignment correction.
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
tcp: tcp_v4_err() cleanups
This series is a followup of patch 239174945d ("tcp: tcp_v4_err() icmp
skb is named icmp_skb").
Move the RFC 6069 code into a helper, and rename icmp_skb to standard
skb name so that tcp_v4_err() and tcp_v6_err() are using consistent names.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This essentially reverts 4d1a2d9ec1 ("Revert Backoff [v3]:
Rename skb to icmp_skb in tcp_v4_err()")
Now we have tcp_ld_RTO_revert() helper, we can use the usual
name for sk_buff parameter, so that tcp_v4_err() and
tcp_v6_err() use similar names.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 6069 logic has been implemented for IPv4 only so far,
right in the middle of tcp_v4_err() and was error prone.
Move this code to one helper, to make tcp_v4_err() more
readable and to eventually expand RFC 6069 to IPv6 in
the future.
Also perform sock_owned_by_user() check a bit sooner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan says:
====================
net: hns3: misc updates for -next
This patchset includes some misc updates for the HNS3 ethernet driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When initializing CMDQ fails because of reset pending,
there is no hint for debugging, so adds a log for it.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets will not pass through MAC during app loopback.
Therefore, it is meaningless to enable MAC while doing
app loopback. This patch removes this unnecessary action.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HNS RDMA driver will support VF device later, whose
re-initialization should be done after PF's. This patch
changes the order of hclge_reset_prepare_up() and
hclge_notify_roce_client(), so that PF's RoCE client
will be reinitialized before VF's.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To prevent from initializing VF NIC client in reset handling state,
this patch adds resetting check in hclgevf_init_nic_client_instance().
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart says:
====================
net: mscc: allow forwarding ioctl operations to attached PHYs
These two patches allow forwarding ioctl to the PHY MII implementation,
and support is added for offloading timestamping operations to
compatible attached PHYs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for offloading timestamping operations not only
to the Ocelot switch (as already supported) but to compatible PHYs.
When both the PHY and the Ocelot switch support timestamping operations,
the PHY implementation is chosen as the timestamp will happen closer to
the medium.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow ioctl to be implemented by the PHY, when a PHY is attached to the
Ocelot switch. In case the ioctl is a request to set or get the hardware
timestamp, use the Ocelot switch implementation for now.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The AMD Starship USB 3.0 host controller advertises Function Level Reset
support, but it apparently doesn't work. Add a quirk to prevent use of FLR
on this device.
Without this quirk, when attempting to assign (pass through) an AMD
Starship USB 3.0 host controller to a guest OS, the system becomes
increasingly unresponsive over the course of several minutes, eventually
requiring a hard reset. Shortly after attempting to start the guest, I see
these messages:
vfio-pci 0000:05:00.3: not ready 1023ms after FLR; waiting
vfio-pci 0000:05:00.3: not ready 2047ms after FLR; waiting
vfio-pci 0000:05:00.3: not ready 4095ms after FLR; waiting
vfio-pci 0000:05:00.3: not ready 8191ms after FLR; waiting
And then eventually:
vfio-pci 0000:05:00.3: not ready 65535ms after FLR; giving up
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.000 msecs
perf: interrupt took too long (642744 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 82.270 msecs
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 680.608 msecs
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 100.952 msecs
...
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [qemu-system-x86:7487]
Tested on a Micro-Star International Co., Ltd. MS-7C59/Creator TRX40
motherboard with an AMD Ryzen Threadripper 3970X.
Link: https://lore.kernel.org/r/20200524003529.598434ff@f31-4.lan
Signed-off-by: Kevin Buettner <kevinb@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The AMD Matisse HD Audio & USB 3.0 devices advertise Function Level Reset
support, but hang when an FLR is triggered.
To reproduce the problem, attach the device to a VM, then detach and try to
attach again.
Rename the existing quirk_intel_no_flr(), which was not Intel-specific, to
quirk_no_flr(), and apply it to prevent the use of FLR on these AMD
devices.
Link: https://lore.kernel.org/r/CAAri2DpkcuQZYbT6XsALhx2e6vRqPHwtbjHYeiH7MNp4zmt1RA@mail.gmail.com
Signed-off-by: Marcos Scriven <marcos@scriven.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
On device updates, the hooknum and priority attributes are not required.
This patch makes optional these two netlink attributes.
Moreover, bail out with EOPNOTSUPP if userspace tries to update the
hooknum and priority for existing flowtables.
While at this, turn EINVAL into EOPNOTSUPP in case the hooknum is not
ingress. EINVAL is reserved for missing netlink attribute / malformed
netlink messages.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A flowtable might be composed of dynamic interfaces only. Such dynamic
interfaces might show up at a later stage. This patch allows users to
register a flowtable with no devices. Once the dynamic interface becomes
available, the user adds the dynamic devices to the flowtable.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update the flowtable netlink notifier to take the list of hooks as input.
This allows to reuse this function in incremental flowtable hook updates.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update nft_flowtable_parse_hook() to take the flowtable hook list as
parameter. This allows to reuse this function to update the hooks.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)
It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).
This patch adds filtering in kernel side. To achieve this goal, we:
* Add a new CTA_FILTER netlink attributes, actually a flag list to
parametize filtering
* Convert some *nlattr_to_tuple() functions, to allow a partial parsing
of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
fully set)
Filtering is now possible on:
* IP SRC/DST values
* Ports for TCP and UDP flows
* IMCP(v6) codes types and IDs
Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.
Changes since v1:
Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered
Changes since v2:
Move several constants to nf_internals.h
Move a fix on netlink values check in a separate patch
Add a check on not-supported flags
Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
(not yet implemented)
Code style issues
Changes since v3:
Fix compilation warning reported by kbuild test robot
Changes since v4:
Fix a regression introduced in v3 (returned EINVAL for valid netlink
messages without CTA_MARK)
Changes since v5:
Change definition of CTA_FILTER_F_ALL
Fix a regression when CTA_TUPLE_ZONE is not set
Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The most common way to set ECE option will be during modify QP command in
INIT2RTR, RTR2RTS and RTS2RTS stages, so update mlx5 to support it.
The new bit in the comp_mask is needed to mark that kernel supports ECE
and can receive data instead of "reserved" field in the struct
mlx5_ib_modify_qp.
Link: https://lore.kernel.org/r/20200526115440.205922-8-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IBTA declares "vendor option not supported" reject reason in REJ messages
if passive side doesn't want to accept proposed ECE options.
Due to the fact that ECE is managed by userspace, there is a need to let
users to provide such rejected reason.
Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rdma_accept() is called by both passive and active sides of CMID
connection to mark readiness to start data transfer. For passive side,
this is called explicitly, for active side, it is called implicitly while
receiving REP message.
Provide ECE data to rdma_accept function needed for passive side to send
that REP message.
Link: https://lore.kernel.org/r/20200526103304.196371-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Passive side of CMID connection receives ECE request through REQ message
and needs to respond with relevant REP message which will be forwarded to
active side.
The UCMA events interface is responsible for such communication with the
user space (librdmacm). Extend it to provide ECE wire data.
Link: https://lore.kernel.org/r/20200526103304.196371-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>