There are number of counters types supported in mlx5_ib: HW counters,
congestion counters, Q-counters and flow counters. Almost all supporting
code was placed in main.c that made almost impossible to maintain the code
anymore. Let's create separate code namespace for the counters to easy
future generalization effort.
Link: https://lore.kernel.org/r/20200702081809.423482-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the loopback tests, the following call trace occurs.
Call Trace:
__rxe_do_task+0x1a/0x30 [rdma_rxe]
rxe_qp_destroy+0x61/0xa0 [rdma_rxe]
rxe_destroy_qp+0x20/0x60 [rdma_rxe]
ib_destroy_qp_user+0xcc/0x220 [ib_core]
uverbs_free_qp+0x3c/0xc0 [ib_uverbs]
destroy_hw_idr_uobject+0x24/0x70 [ib_uverbs]
uverbs_destroy_uobject+0x43/0x1b0 [ib_uverbs]
uobj_destroy+0x41/0x70 [ib_uverbs]
__uobj_get_destroy+0x39/0x70 [ib_uverbs]
ib_uverbs_destroy_qp+0x88/0xc0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb9/0xf0 [ib_uverbs]
ib_uverbs_cmd_verbs+0xb16/0xc30 [ib_uverbs]
The root cause is that the actual RDMA connection is not created in the
loopback tests and the rxe_match_dgid will fail randomly.
To fix this call trace which appear in the loopback tests, skip check of
the dgid.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200630123605.446959-1-leon@kernel.org
Signed-off-by: Zhu Yanjun <yanjunz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Refactor mlx5_ib_alloc_ucontext() to set its response fields in a
cleaner way.
It includes,
- Move the relevant code to a self contained function.
- Calculate the response length once and drop redundant code all around.
- Reuse previously set ucontext fields once preparing the response.
The self contained function will be used in next patch as part of
implementing the query ucontext functionality.
Link: https://lore.kernel.org/r/20200630093916.332097-5-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Set IOVA on IB MR in uverbs layer to let all drivers have it, this
includes both reg/rereg MR flows.
As part of this change cleaned-up this setting from the drivers that
already did it by themselves in their user flows.
Fixes: e6f0330106 ("mlx4_ib: set user mr attributes in struct ib_mr")
Link: https://lore.kernel.org/r/20200630093916.332097-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Michael Guralnik says:
====================
This series handles IPoIB child interface creation with setting
interface's HW address.
In current implementation, lladdr requested by user is ignored and
overwritten. Child interface gets the same GID as the parent interface and
a QP number which is assigned by the underlying drivers.
In this series we fix this behavior so that user's requested address is
assigned to the newly created interface.
As specific QP number request is not supported for all vendors, QP number
requested by user will still be overwritten when this is not supported.
Behavior of creation of child interfaces through the sysfs mechanism or
without specifying a requested address, stays the same.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_ipoib_qpn':
RDMA/ipoib: Handle user-supplied address when creating child
net/mlx5: Enable QP number request when creating IPoIB underlay QP
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For debugging purpose it will be easier to understand if prefetch works
okay if it has its own counter. Introduce ODP prefetch counter and count
per MR the total number of prefetched pages.
In addition remove comment which is not relevant anymore and anyway not in
the correct place.
Link: https://lore.kernel.org/r/20200621104147.53795-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a race condition where ib_nl_make_request() inserts the request
data into the linked list but the timer in ib_nl_request_timeout() can see
it and destroy it before ib_nl_send_msg() is done touching it. This could
happen, for instance, if there is a long delay allocating memory during
nlmsg_new()
This causes a use-after-free in the send_mad() thread:
[<ffffffffa02f43cb>] ? ib_pack+0x17b/0x240 [ib_core]
[ <ffffffffa032aef1>] ib_sa_path_rec_get+0x181/0x200 [ib_sa]
[<ffffffffa0379db0>] rdma_resolve_route+0x3c0/0x8d0 [rdma_cm]
[<ffffffffa0374450>] ? cma_bind_port+0xa0/0xa0 [rdma_cm]
[<ffffffffa040f850>] ? rds_rdma_cm_event_handler_cmn+0x850/0x850 [rds_rdma]
[<ffffffffa040f22c>] rds_rdma_cm_event_handler_cmn+0x22c/0x850 [rds_rdma]
[<ffffffffa040f860>] rds_rdma_cm_event_handler+0x10/0x20 [rds_rdma]
[<ffffffffa037778e>] addr_handler+0x9e/0x140 [rdma_cm]
[<ffffffffa026cdb4>] process_req+0x134/0x190 [ib_addr]
[<ffffffff810a02f9>] process_one_work+0x169/0x4a0
[<ffffffff810a0b2b>] worker_thread+0x5b/0x560
[<ffffffff810a0ad0>] ? flush_delayed_work+0x50/0x50
[<ffffffff810a68fb>] kthread+0xcb/0xf0
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
[<ffffffff816f25a7>] ret_from_fork+0x47/0x90
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
The ownership rule is once the request is on the list, ownership transfers
to the list and the local thread can't touch it any more, just like for
the normal MAD case in send_mad().
Thus, instead of adding before send and then trying to delete after on
errors, move the entire thing under the spinlock so that the send and
update of the lists are atomic to the conurrent threads. Lightly reoganize
things so spinlock safe memory allocations are done in the final NL send
path and the rest of the setup work is done before and outside the lock.
Fixes: 3ebd2fd0d0 ("IB/sa: Put netlink request into the request list before sending")
Link: https://lore.kernel.org/r/1592964789-14533-1-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_unregister_device_queued() can only be used by drivers using the new
dealloc_device callback flow, and it has a safety WARN_ON to ensure
drivers are using it properly.
However, if unregister and register are raced there is a special
destruction path that maintains the uniform error handling semantic of
'caller does ib_dealloc_device() on failure'. This requires disabling the
dealloc_device callback which triggers the WARN_ON.
Instead of using NULL to disable the callback use a special function
pointer so the WARN_ON does not trigger.
Fixes: d0899892ed ("RDMA/device: Provide APIs from the core code to help unregistration")
Link: https://lore.kernel.org/r/0-v1-a36d512e0a99+762-syz_dealloc_driver_jgg@nvidia.com
Reported-by: syzbot+4088ed905e4ae2b0e13b@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ipoib_mcast_carrier_on_task() insanely open codes a rtnl_lock() such that
the only time flush_workqueue() can be called is if it also clears
IPOIB_FLAG_OPER_UP.
Thus the flush inside ipoib_flush_ah() will deadlock if it gets unlucky
enough, and lockdep doesn't help us to find it early:
CPU0 CPU1 CPU2
__ipoib_ib_dev_flush()
down_read(vlan_rwsem)
ipoib_vlan_add()
rtnl_trylock()
down_write(vlan_rwsem)
ipoib_mcast_carrier_on_task()
while (!rtnl_trylock())
msleep(20);
ipoib_flush_ah()
flush_workqueue(priv->wq)
Clean up the ah_reaper related functions and lifecycle to make sense:
- Start/Stop of the reaper should only be done in open/stop NDOs, not in
any other places
- cancel and flush of the reaper should only happen in the stop NDO.
cancel is only functional when combined with IPOIB_STOP_REAPER.
- Non-stop places were flushing the AH's just need to flush out dead AH's
synchronously and ignore the background task completely. It is fully
locked and harmless to leave running.
Which ultimately fixes the ABBA deadlock by removing the unnecessary
flush_workqueue() from the problematic place under the vlan_rwsem.
Fixes: efc82eeeae ("IB/ipoib: No longer use flush as a parameter")
Link: https://lore.kernel.org/r/20200625174219.290842-1-kamalheib1@gmail.com
Reported-by: Kamal Heib <kheib@redhat.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
pcie_speeds() and restore_pci_variables() returns PCIBIOS_ error codes
from PCIe capability accessors.
PCIBIOS_ error codes have positive values. Passing on these values is
inconsistent with functions which return only a negative value on failure.
Before passing on the return value of PCIe capability accessors, call
pcibios_err_to_errno() to convert any positive PCIBIOS_ error codes to
negative generic error values.
Link: https://lore.kernel.org/r/20200615073225.24061-3-refactormyself@gmail.com
Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2020-06-25
This series contains updates to i40e driver and removes the individual
driver versions from all of the Intel wired LAN drivers.
Shiraz moves the client header so that it can easily be shared between
the i40e LAN driver and i40iw RDMA driver.
Jesse cleans up the unused defines, since they are just dead weight.
Alek reduces the unreasonably long wait time for a PF reset after reboot
by using jiffies to limit the maximum wait time for the PF reset to
succeed. Added additional logging to let the user know when the driver
transitions into recovery mode. Adds new device support for our 5 Gbps
NICs.
Todd adds a check to see if MFS is set after warm reboot and notifies
the user when MFS is set to anything lower than the default value.
Arkadiusz fixes a possible race condition, where were holding a
spin-lock while in atomic context.
v2: removed code comments that were no longer applicable in patch 2 of
the series. Also removed 'inline' from patch 4 and patch 8 of the
series. Also re-arranged code to be able to remove the forward
function declarations. Dropped patch 9 of the series, while the
author works on cleaning up the commit message.
v3: Updated patch 8 description to answer Jakub's questions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls of atomic_dec() to mad_agent_priv->refcount with calls to
deref_mad_agent() in order to issue complete. Most likely the refcount is
> 1 at these points, but it is difficult to prove. Performance is not
important on these paths, so be obviously correct.
Link: https://lore.kernel.org/r/20200621104738.54850-3-leon@kernel.org
Signed-off-by: Shay Drory <shayd@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When running iperf in a two host configuration the following trace can
occur:
[ 319.728730] NETDEV WATCHDOG: ib0 (hfi1): transmit queue 0 timed out
The issue happens because the current implementation relies on the netif
txq being stopped to control the flushing of the tx list.
There are two resources that the transmit logic can wait on and stop the
txq:
- SDMA descriptors
- Ring space to hold completions
The ring space is tested on the sending side and relieved when the ring is
consumed in the napi tx reaping.
Unfortunately, that reaping can run conncurrently with the workqueue
flushing of the txlist. If the txq is started just before the workitem
executes, the txlist will never be flushed, leading to the txq being
stuck.
Fix by:
- Adding sleep/wakeup wrappers
* Use an atomic to control the call to the netif routines inside the
wrappers
- Use another atomic to record ring space exhaustion
* Only wakeup when the a ring space exhaustion has happened and it
relieved
Add additional wrappers to clarify the ring space resource handling.
Fixes: d99dc602e2 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Link: https://lore.kernel.org/r/20200623204327.108092.4024.stgit@awfm-01.aw.intel.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The current code mishandles -EBUSY in two ways:
- The flow change doesn't test the return from the flush and runs on to
process the current packet racing with the wakeup processing
- The -EBUSY handling for a single packet inserts the tx into the txlist
after the submit call, racing with the same wakeup processing
Fix the first by dropping the skb and returning NETDEV_TX_OK.
Fix the second by insuring the the list entry within the txreq is inited
when allocated. This enables the sleep routine to detect that the txreq
has used the non-list api and queue the packet to the txlist.
Both flaws can lead to having the flushing thread executing in causing two
threads to manipulate the txlist.
Fixes: d99dc602e2 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Link: https://lore.kernel.org/r/20200623204321.108092.83898.stgit@awfm-01.aw.intel.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Maor Gottlieb says:
====================
The following series adds support to get the RDMA resource data in RAW
format. The main motivation for doing this is to enable vendors to return
the entire QP/CQ/MR data without a need from the vendor to set each
field separately.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies
* branch 'raw_dumps':
RDMA/mlx5: Add support to get MR resource in RAW format
RDMA/mlx5: Add support to get CQ resource in RAW format
RDMA/mlx5: Add support to get QP resource in RAW format
RDMA: Add support to dump resource tracker in RAW format
RDMA: Add dedicated CM_ID resource tracker function
RDMA: Add dedicated QP resource tracker function
RDMA: Add a dedicated CQ resource tracker function
RDMA: Add dedicated MR resource tracker function
RDMA/core: Don't call fill_res_entry for PD
net/mlx5: Add support in query QP, CQ and MKEY segments
net/mlx5: Export resource dump interface
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>