Commit Graph

935158 Commits

Author SHA1 Message Date
Vladimir Oltean
2b86cb8299 net: dsa: declare lockless TX feature for slave ports
Be there a platform with the following layout:

      Regular NIC
       |
       +----> DSA master for switch port
               |
               +----> DSA master for another switch port

After changing DSA back to static lockdep class keys in commit
1a33e10e4a ("net: partially revert dynamic lockdep key changes"), this
kernel splat can be seen:

[   13.361198] ============================================
[   13.366524] WARNING: possible recursive locking detected
[   13.371851] 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988 Not tainted
[   13.377874] --------------------------------------------
[   13.383201] swapper/0/0 is trying to acquire lock:
[   13.388004] ffff0000668ff298 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.397879]
[   13.397879] but task is already holding lock:
[   13.403727] ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.413593]
[   13.413593] other info that might help us debug this:
[   13.420140]  Possible unsafe locking scenario:
[   13.420140]
[   13.426075]        CPU0
[   13.428523]        ----
[   13.430969]   lock(&dsa_slave_netdev_xmit_lock_key);
[   13.435946]   lock(&dsa_slave_netdev_xmit_lock_key);
[   13.440924]
[   13.440924]  *** DEADLOCK ***
[   13.440924]
[   13.446860]  May be due to missing lock nesting notation
[   13.446860]
[   13.453668] 6 locks held by swapper/0/0:
[   13.457598]  #0: ffff800010003de0 ((&idev->mc_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x0/0x400
[   13.466593]  #1: ffffd4d3fb478700 (rcu_read_lock){....}-{1:2}, at: mld_sendpack+0x0/0x560
[   13.474803]  #2: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: ip6_finish_output2+0x64/0xb10
[   13.483886]  #3: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[   13.492793]  #4: ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.503094]  #5: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[   13.512000]
[   13.512000] stack backtrace:
[   13.516369] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988
[   13.530421] Call trace:
[   13.532871]  dump_backtrace+0x0/0x1d8
[   13.536539]  show_stack+0x24/0x30
[   13.539862]  dump_stack+0xe8/0x150
[   13.543271]  __lock_acquire+0x1030/0x1678
[   13.547290]  lock_acquire+0xf8/0x458
[   13.550873]  _raw_spin_lock+0x44/0x58
[   13.554543]  __dev_queue_xmit+0x84c/0xbe0
[   13.558562]  dev_queue_xmit+0x24/0x30
[   13.562232]  dsa_slave_xmit+0xe0/0x128
[   13.565988]  dev_hard_start_xmit+0xf4/0x448
[   13.570182]  __dev_queue_xmit+0x808/0xbe0
[   13.574200]  dev_queue_xmit+0x24/0x30
[   13.577869]  neigh_resolve_output+0x15c/0x220
[   13.582237]  ip6_finish_output2+0x244/0xb10
[   13.586430]  __ip6_finish_output+0x1dc/0x298
[   13.590709]  ip6_output+0x84/0x358
[   13.594116]  mld_sendpack+0x2bc/0x560
[   13.597786]  mld_ifc_timer_expire+0x210/0x390
[   13.602153]  call_timer_fn+0xcc/0x400
[   13.605822]  run_timer_softirq+0x588/0x6e0
[   13.609927]  __do_softirq+0x118/0x590
[   13.613597]  irq_exit+0x13c/0x148
[   13.616918]  __handle_domain_irq+0x6c/0xc0
[   13.621023]  gic_handle_irq+0x6c/0x160
[   13.624779]  el1_irq+0xbc/0x180
[   13.627927]  cpuidle_enter_state+0xb4/0x4d0
[   13.632120]  cpuidle_enter+0x3c/0x50
[   13.635703]  call_cpuidle+0x44/0x78
[   13.639199]  do_idle+0x228/0x2c8
[   13.642433]  cpu_startup_entry+0x2c/0x48
[   13.646363]  rest_init+0x1ac/0x280
[   13.649773]  arch_call_rest_init+0x14/0x1c
[   13.653878]  start_kernel+0x490/0x4bc

Lockdep keys themselves were added in commit ab92d68fc2 ("net: core:
add generic lockdep keys"), and it's very likely that this splat existed
since then, but I have no real way to check, since this stacked platform
wasn't supported by mainline back then.

>From Taehee's own words:

  This patch was considered that all stackable devices have LLTX flag.
  But the dsa doesn't have LLTX, so this splat happened.
  After this patch, dsa shares the same lockdep class key.
  On the nested dsa interface architecture, which you illustrated,
  the same lockdep class key will be used in __dev_queue_xmit() because
  dsa doesn't have LLTX.
  So that lockdep detects deadlock because the same lockdep class key is
  used recursively although actually the different locks are used.
  There are some ways to fix this problem.

  1. using NETIF_F_LLTX flag.
  If possible, using the LLTX flag is a very clear way for it.
  But I'm so sorry I don't know whether the dsa could have LLTX or not.

  2. using dynamic lockdep again.
  It means that each interface uses a separate lockdep class key.
  So, lockdep will not detect recursive locking.
  But this way has a problem that it could consume lockdep class key
  too many.
  Currently, lockdep can have 8192 lockdep class keys.
   - you can see this number with the following command.
     cat /proc/lockdep_stats
     lock-classes:                         1251 [max: 8192]
     ...
     The [max: 8192] means that the maximum number of lockdep class keys.
  If too many lockdep class keys are registered, lockdep stops to work.
  So, using a dynamic(separated) lockdep class key should be considered
  carefully.
  In addition, updating lockdep class key routine might have to be existing.
  (lockdep_register_key(), lockdep_set_class(), lockdep_unregister_key())

  3. Using lockdep subclass.
  A lockdep class key could have 8 subclasses.
  The different subclass is considered different locks by lockdep
  infrastructure.
  But "lock-classes" is not counted by subclasses.
  So, it could avoid stopping lockdep infrastructure by an overflow of
  lockdep class keys.
  This approach should also have an updating lockdep class key routine.
  (lockdep_set_subclass())

  4. Using nonvalidate lockdep class key.
  The lockdep infrastructure supports nonvalidate lockdep class key type.
  It means this lockdep is not validated by lockdep infrastructure.
  So, the splat will not happen but lockdep couldn't detect real deadlock
  case because lockdep really doesn't validate it.
  I think this should be used for really special cases.
  (lockdep_set_novalidate_class())

Further discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/20200503052220.4536-2-xiyou.wangcong@gmail.com/

There appears to be no negative side-effect to declaring lockless TX for
the DSA virtual interfaces, which means they handle their own locking.
So that's what we do to make the splat go away.

Patch tested in a wide variety of cases: unicast, multicast, PTP, etc.

Fixes: ab92d68fc2 ("net: core: add generic lockdep keys")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:09:28 -07:00
Jonas Falkevik
50ce4c099b sctp: fix typo sctp_ulpevent_nofity_peer_addr_change
change typo in function name "nofity" to "notify"
sctp_ulpevent_nofity_peer_addr_change ->
sctp_ulpevent_notify_peer_addr_change

Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:08:02 -07:00
Tariq Toukan
b3ae2459f8 net/tls: Add force_resync for driver resync
This patch adds a field to the tls rx offload context which enables
drivers to force a send_resync call.

This field can be used by drivers to request a resync at the next
possible tls record. It is beneficial for hardware that provides the
resync sequence number asynchronously. In such cases, the packet that
triggered the resync does not contain the information required for a
resync. Instead, the driver requests resync for all the following
TLS record until the asynchronous notification with the resync request
TCP sequence arrives.

A following series for mlx5e ConnectX-6DX TLS RX offload support will
use this mechanism.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:07:13 -07:00
David S. Miller
bdad7f9441 Merge branch 'net_sched-reduce-the-number-of-qdisc-resets'
Cong Wang says:

====================
net_sched: reduce the number of qdisc resets

This patchset aims to reduce the number of qdisc resets during
qdisc tear down. Patch 1~3 are preparation for their following
patches, especially patch 2 and patch 3 add a few tracepoints
so that we can observe the whole lifetime of qdisc's. Patch 4
and patch 5 are the ones do the actual work. Please find more
details in each patch description.

Vaclav Zindulka tested this patchset and his large ruleset with
over 13k qdiscs defined got from 22s to 520ms.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:50 -07:00
Cong Wang
759ae57f1b net_sched: get rid of unnecessary dev_qdisc_reset()
Resetting old qdisc on dev_queue->qdisc_sleeping in
dev_qdisc_reset() is redundant, because this qdisc,
even if not same with dev_queue->qdisc, is reset via
qdisc_put() right after calling dev_graft_qdisc() when
hitting refcnt 0.

This is very easy to observe with qdisc_reset() tracepoint
and stack traces.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
70f5096533 net_sched: avoid resetting active qdisc for multiple times
Except for sch_mq and sch_mqprio, each dev queue points to the
same root qdisc, so when we reset the dev queues with
netdev_for_each_tx_queue() we end up resetting the same instance
of the root qdisc for multiple times.

Avoid this by checking the __QDISC_STATE_DEACTIVATED bit in
each iteration, so for sch_mq/sch_mqprio, we still reset all
of them like before, for the rest, we only reset it once.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
f5a7833e83 net_sched: add a tracepoint for qdisc creation
With this tracepoint, we could know when qdisc's are created,
especially those default qdisc's.

Sample output:

  tc-736   [001] ...1    56.230107: qdisc_create: dev=ens3 kind=pfifo parent=1:0
  tc-736   [001] ...1    56.230113: qdisc_create: dev=ens3 kind=hfsc parent=ffff:ffff
  tc-738   [001] ...1    56.256816: qdisc_create: dev=ens3 kind=pfifo parent=1:100
  tc-739   [001] ...1    56.267584: qdisc_create: dev=ens3 kind=pfifo parent=1:200
  tc-740   [001] ...1    56.279649: qdisc_create: dev=ens3 kind=fq_codel parent=1:100
  tc-741   [001] ...1    56.289996: qdisc_create: dev=ens3 kind=pfifo_fast parent=1:200
  tc-745   [000] .N.1   111.687483: qdisc_create: dev=ens3 kind=ingress parent=ffff:fff1

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
a34dac0b90 net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()
Add two tracepoints for qdisc_reset() and qdisc_destroy() to track
qdisc resetting and destroying.

Sample output:

  tc-756   [000] ...3   138.355662: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355720: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355867: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355930: qdisc_destroy: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-757   [000] ...2   143.073780: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.073878: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074114: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074228: qdisc_destroy: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
4909daba37 net_sched: use qdisc_reset() in qdisc_destroy()
qdisc_destroy() calls ops->reset() and cleans up qdisc->gso_skb
and qdisc->skb_bad_txq, these are nearly same with qdisc_reset(),
so just call it directly, and cosolidate the code for the next
patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Arnd Bergmann
011593480d binfmt_elf_fdpic: fix execfd build regression
The change to bprm->have_execfd was incomplete, leading
to a build failure:

fs/binfmt_elf_fdpic.c: In function 'create_elf_fdpic_tables':
fs/binfmt_elf_fdpic.c:591:27: error: 'BINPRM_FLAGS_EXECFD' undeclared

Change the last user of BINPRM_FLAGS_EXECFD in a corresponding
way.

Reported-by: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Fixes: b8a61c9e7b ("exec: Generic execfd support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-05-27 16:59:53 -05:00
Wang Wenhu
b3037ac501 drivers: ipa: remove discription of nonexistent element
No element named "client" exists within "struct ipa_endpoint".
It might be a heritage forgotten to be removed. Delete it now.

Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:58:28 -07:00
Wang Wenhu
8fa54b1160 drivers: ipa: fix typoes for ipa
Change "transactio" -> "transaction". Also an alignment correction.

Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:58:28 -07:00
David S. Miller
be372e919c Merge branch 'tcp-tcp_v4_err-cleanups'
Eric Dumazet says:

====================
tcp: tcp_v4_err() cleanups

This series is a followup of patch 239174945d ("tcp: tcp_v4_err() icmp
skb is named icmp_skb").

Move the RFC 6069 code into a helper, and rename icmp_skb to standard
skb name so that tcp_v4_err() and tcp_v6_err() are using consistent names.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:57:27 -07:00
Eric Dumazet
a12daf13a4 tcp: rename tcp_v4_err() skb parameter
This essentially reverts 4d1a2d9ec1 ("Revert Backoff [v3]:
Rename skb to icmp_skb in tcp_v4_err()")

Now we have tcp_ld_RTO_revert() helper, we can use the usual
name for sk_buff parameter, so that tcp_v4_err() and
tcp_v6_err() use similar names.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:57:27 -07:00
Eric Dumazet
f745664257 tcp: add tcp_ld_RTO_revert() helper
RFC 6069 logic has been implemented for IPv4 only so far,
right in the middle of tcp_v4_err() and was error prone.

Move this code to one helper, to make tcp_v4_err() more
readable and to eventually expand RFC 6069 to IPv6 in
the future.

Also perform sock_owned_by_user() check a bit sooner.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:57:27 -07:00
David S. Miller
d6a3511b60 Merge branch 'hns3-next'
Huazhong Tan says:

====================
net: hns3: misc updates for -next

This patchset includes some misc updates for the HNS3 ethernet driver.

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:56:08 -07:00
Huazhong Tan
6f45a9bdd2 net: hns3: add a print for initializing CMDQ when reset pending
When initializing CMDQ fails because of reset pending,
there is no hint for debugging, so adds a log for it.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:56:08 -07:00
Yufeng Mo
01952206e1 net: hns3: remove unnecessary MAC enable in app loopback
Packets will not pass through MAC during app loopback.
Therefore, it is meaningless to enable MAC while doing
app loopback. This patch removes this unnecessary action.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:56:08 -07:00
Yufeng Mo
60c800c64d net: hns3: change the order of reinitializing RoCE and NIC client during reset
The HNS RDMA driver will support VF device later, whose
re-initialization should be done after PF's. This patch
changes the order of hclge_reset_prepare_up() and
hclge_notify_roce_client(), so that PF's RoCE client
will be reinitialized before VF's.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:56:08 -07:00
Guangbin Huang
4cd5beaa89 net: hns3: add a resetting check in hclgevf_init_nic_client_instance()
To prevent from initializing VF NIC client in reset handling state,
this patch adds resetting check in hclgevf_init_nic_client_instance().

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:56:08 -07:00
David S. Miller
4a11c551a6 Merge branch 'net-mscc-allow-forwarding-ioctl-operations-to-attached-PHYs'
Antoine Tenart says:

====================
net: mscc: allow forwarding ioctl operations to attached PHYs

These two patches allow forwarding ioctl to the PHY MII implementation,
and support is added for offloading timestamping operations to
compatible attached PHYs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:54:32 -07:00
Antoine Tenart
b2e118f638 net: mscc: allow offloading timestamping operations to the PHY
This patch adds support for offloading timestamping operations not only
to the Ocelot switch (as already supported) but to compatible PHYs.
When both the PHY and the Ocelot switch support timestamping operations,
the PHY implementation is chosen as the timestamp will happen closer to
the medium.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:54:31 -07:00
Antoine Tenart
7ff4f3f315 net: mscc: use the PHY MII ioctl interface when possible
Allow ioctl to be implemented by the PHY, when a PHY is attached to the
Ocelot switch. In case the ioctl is a request to set or get the hardware
timestamp, use the Ocelot switch implementation for now.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:54:31 -07:00
Krzysztof Wilczyński
3910ebaca8 PCI: Rename _DSM constants to align with spec
Rename PCI-related _DSM constants to align them with the PCI Firmware Spec,
r3.2, sec 4.6.  No functional change intended.

Link: https://lore.kernel.org/r/20200526213905.2479381-1-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-05-27 16:48:21 -05:00
Kevin Buettner
5727043c73 PCI: Avoid FLR for AMD Starship USB 3.0
The AMD Starship USB 3.0 host controller advertises Function Level Reset
support, but it apparently doesn't work.  Add a quirk to prevent use of FLR
on this device.

Without this quirk, when attempting to assign (pass through) an AMD
Starship USB 3.0 host controller to a guest OS, the system becomes
increasingly unresponsive over the course of several minutes, eventually
requiring a hard reset.  Shortly after attempting to start the guest, I see
these messages:

  vfio-pci 0000:05:00.3: not ready 1023ms after FLR; waiting
  vfio-pci 0000:05:00.3: not ready 2047ms after FLR; waiting
  vfio-pci 0000:05:00.3: not ready 4095ms after FLR; waiting
  vfio-pci 0000:05:00.3: not ready 8191ms after FLR; waiting

And then eventually:

  vfio-pci 0000:05:00.3: not ready 65535ms after FLR; giving up
  INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.000 msecs
  perf: interrupt took too long (642744 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
  INFO: NMI handler (perf_event_nmi_handler) took too long to run: 82.270 msecs
  INFO: NMI handler (perf_event_nmi_handler) took too long to run: 680.608 msecs
  INFO: NMI handler (perf_event_nmi_handler) took too long to run: 100.952 msecs
  ...
  watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [qemu-system-x86:7487]

Tested on a Micro-Star International Co., Ltd. MS-7C59/Creator TRX40
motherboard with an AMD Ryzen Threadripper 3970X.

Link: https://lore.kernel.org/r/20200524003529.598434ff@f31-4.lan
Signed-off-by: Kevin Buettner <kevinb@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-05-27 16:20:24 -05:00
Marcos Scriven
0d14f06cd6 PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
The AMD Matisse HD Audio & USB 3.0 devices advertise Function Level Reset
support, but hang when an FLR is triggered.

To reproduce the problem, attach the device to a VM, then detach and try to
attach again.

Rename the existing quirk_intel_no_flr(), which was not Intel-specific, to
quirk_no_flr(), and apply it to prevent the use of FLR on these AMD
devices.

Link: https://lore.kernel.org/r/CAAri2DpkcuQZYbT6XsALhx2e6vRqPHwtbjHYeiH7MNp4zmt1RA@mail.gmail.com
Signed-off-by: Marcos Scriven <marcos@scriven.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-05-27 16:18:43 -05:00
Wei Liu
60369a4f8d x86/PCI: Drop unused xen_register_pirq() gsi_override parameter
All callers of xen_register_pirq() pass -1 (no override) for the
gsi_override parameter.  Remove it and related code.

Link: https://lore.kernel.org/r/20200428153640.76476-1-wei.liu@kernel.org
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2020-05-27 16:18:43 -05:00
Al Viro
9e46365459 copy_xstate_to_kernel(): don't leave parts of destination uninitialized
copy the corresponding pieces of init_fpstate into the gaps instead.

Cc: stable@kernel.org
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-27 17:06:31 -04:00
Pablo Neira Ayuso
5b6743fb2c netfilter: nf_tables: skip flowtable hooknum and priority on device updates
On device updates, the hooknum and priority attributes are not required.
This patch makes optional these two netlink attributes.

Moreover, bail out with EOPNOTSUPP if userspace tries to update the
hooknum and priority for existing flowtables.

While at this, turn EINVAL into EOPNOTSUPP in case the hooknum is not
ingress. EINVAL is reserved for missing netlink attribute / malformed
netlink messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:35 +02:00
Pablo Neira Ayuso
05abe4456f netfilter: nf_tables: allow to register flowtable with no devices
A flowtable might be composed of dynamic interfaces only. Such dynamic
interfaces might show up at a later stage. This patch allows users to
register a flowtable with no devices. Once the dynamic interface becomes
available, the user adds the dynamic devices to the flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
abadb2f865 netfilter: nf_tables: delete devices from flowtable
This patch allows users to delete devices from existing flowtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
78d9f48f7f netfilter: nf_tables: add devices to existing flowtable
This patch allows users to add devices to an existing flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
c42d8bda69 netfilter: nf_tables: pass hook list to flowtable event notifier
Update the flowtable netlink notifier to take the list of hooks as input.
This allows to reuse this function in incremental flowtable hook updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
389a2cbcb7 netfilter: nf_tables: add nft_flowtable_hooks_destroy()
This patch adds a helper function destroy the flowtable hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
f9382669cf netfilter: nf_tables: pass hook list to nft_{un,}register_flowtable_net_hooks()
This patch prepares for incremental flowtable hook updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
d9246a5375 netfilter: nf_tables: generalise flowtable hook parsing
Update nft_flowtable_parse_hook() to take the flowtable hook list as
parameter. This allows to reuse this function to update the hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Romain Bellan
cb8aa9a3af netfilter: ctnetlink: add kernel side filtering for dump
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)

It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).

This patch adds filtering in kernel side. To achieve this goal, we:

 * Add a new CTA_FILTER netlink attributes, actually a flag list to
   parametize filtering
 * Convert some *nlattr_to_tuple() functions, to allow a partial parsing
   of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
   fully set)

Filtering is now possible on:
 * IP SRC/DST values
 * Ports for TCP and UDP flows
 * IMCP(v6) codes types and IDs

Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.

Changes since v1:
  Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

Changes since v2:
  Move several constants to nf_internals.h
  Move a fix on netlink values check in a separate patch
  Add a check on not-supported flags
  Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
  (not yet implemented)
  Code style issues

Changes since v3:
  Fix compilation warning reported by kbuild test robot

Changes since v4:
  Fix a regression introduced in v3 (returned EINVAL for valid netlink
  messages without CTA_MARK)

Changes since v5:
  Change definition of CTA_FILTER_F_ALL
  Fix a regression when CTA_TUPLE_ZONE is not set

Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Leon Romanovsky
50aec2c313 RDMA/mlx5: Return ECE data after modify QP
After users sets the ECE option, FW will return the agreed/supported bits
through an output structures of modify QP stages for regular QPs or
through create QP for the DCT.

Link: https://lore.kernel.org/r/20200526115440.205922-9-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
5f62a521ff RDMA/mlx5: Set ECE options during modify QP
The most common way to set ECE option will be during modify QP command in
INIT2RTR, RTR2RTS and RTS2RTS stages, so update mlx5 to support it.

The new bit in the comp_mask is needed to mark that kernel supports ECE
and can receive data instead of "reserved" field in the struct
mlx5_ib_modify_qp.

Link: https://lore.kernel.org/r/20200526115440.205922-8-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
f18e26af6a RDMA/mlx5: Convert modify QP to use MLX5_SET macros
Instead of hand crafted mlx5_qp_context and mlx5_qp_path use common
MLX5_SET() macros.

Link: https://lore.kernel.org/r/20200526115440.205922-7-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
70bd7fb876 RDMA/mlx5: Remove manually crafted QP context the query call
As a preparation to removal hand crafted mlx5_qp_context, convert
query_qp_attr() to use proper MLX5_GET() macros.

Link: https://lore.kernel.org/r/20200526115440.205922-6-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
64bae2d455 RDMA/mlx5: Use direct modify QP implementation
As a preparation to removal hand crafted mlx5_qp_context, convert counter
code to use mlx5_cmd_exec_in() directly.

Link: https://lore.kernel.org/r/20200526115440.205922-5-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
e383085c24 RDMA/mlx5: Set ECE options during QP create
Allow users to ask creation of QPs with specific ECE options.  Such early
set even before RDMA-CM connection is established is useful if user knows
exactly which option he needs.

Link: https://lore.kernel.org/r/20200526115440.205922-4-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
3e09a427ae RDMA/mlx5: Get ECE options from FW during create QP
Supported ECE options are returned from FW in the create_qp phase and zero
means that field is not valid. Such default value allows us to reuse
reserved field without worries about comp_mask.

Update create QP API to return ECE options.

Link: https://lore.kernel.org/r/20200526115440.205922-3-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
8094ba0ace RDMA/cma: Provide ECE reject reason
IBTA declares "vendor option not supported" reject reason in REJ messages
if passive side doesn't want to accept proposed ECE options.

Due to the fact that ECE is managed by userspace, there is a need to let
users to provide such rejected reason.

Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
0cb15372a6 RDMA/cma: Connect ECE to rdma_accept
The rdma_accept() is called by both passive and active sides of CMID
connection to mark readiness to start data transfer. For passive side,
this is called explicitly, for active side, it is called implicitly while
receiving REP message.

Provide ECE data to rdma_accept function needed for passive side to send
that REP message.

Link: https://lore.kernel.org/r/20200526103304.196371-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
a20652e175 RDMA/cm: Send and receive ECE parameter over the wire
ECE parameters are exchanged through REQ->REP/SIDR_REP messages, this
patch adds the data to provide to other side of CMID communication
channel.

Link: https://lore.kernel.org/r/20200526103304.196371-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
93531ee7b9 RDMA/ucma: Deliver ECE parameters through UCMA events
Passive side of CMID connection receives ECE request through REQ message
and needs to respond with relevant REP message which will be forwarded to
active side.

The UCMA events interface is responsible for such communication with the
user space (librdmacm). Extend it to provide ECE wire data.

Link: https://lore.kernel.org/r/20200526103304.196371-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
34e2ab57a9 RDMA/ucma: Extend ucma_connect to receive ECE parameters
Active side of CMID initiates connection through librdmacm's
rdma_connect() and kernel's ucma_connect(). Extend UCMA interface to
handle those new parameters.

Link: https://lore.kernel.org/r/20200526103304.196371-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
fef17f91da RDMA/cm: Add Enhanced Connection Establishment (ECE) bits
Extend REQ (request for communications), REP (reply to request for
communication), rejected reason and SIDR_REP (service ID resolution
response) structures with hardware vendor ID bits according to IBTA v1.4.

Link: https://lore.kernel.org/r/20200526103304.196371-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00