There is a a bug in defered ack stuff that causes a race with the
destroy of a QP.
A packet causes a defered ack to be pended by putting the QP
into an rcd queue.
A return from the driver interrupt processing will process that rcd
queue of QPs and attempt to do a direct send of the ack. At this
point no locks are held and the above QP could now be put in the reset
state in the qp destroy logic. A refcount protects the QP while it
is in the rcd queue so it isn't going anywhere yet.
If the direct send fails to allocate a pio buffer,
hfi1_schedule_send() is called to trigger sending an ack from the
send engine. There is no state test in that code path.
The refcount is then dropped from the driver.c caller
potentially allowing the qp destroy to continue from its
refcount wait in parallel with the workqueue scheduling of the qp.
Cc: stable@vger.kernel.org
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In certain cases, if the tail of an SGE is not
8-byte aligned, bytes beyond the end to an 8-byte
alignment can be read. Change the copy routine
to avoid the over-read. Instead, stop on the final
whole quad-word, then read the remaining bytes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ib_write_bw test allows using up to 16384 QPs. When a relatively
large number of QPs (within that range) is used, the test can fail
because the number of CQ entries needed exceeds the limit set by the
driver.
This patch increases the default setting of max_cqes from 0x2FFFF
(196607) to 0x2FFFFF(3145727), which is sufficient to cover the
maximum number needed by the ib_write_bw test (2097152). The default
setting of max_qps is also increased from 16384 to 32768 to allow
the test to run successfully with 16383 or 16384 QPs.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no need to have a global qpt_mask as that does not support the
multiple chip model which qib has. Instead rely on the value which
exists already in the device data (dd).
Fixes: 898fa52b4a "IB/qib: Remove qpn, qp tables and related variables from qib"
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Existing locking scheme in affinity.c file using the
&node_affinity.lock spinlock is not very elegant.
We acquire the lock to get hfi1_affinity_node entry,
unlock, and then use the entry without the lock held.
With more functions being added, which access and
modify the entries, this can lead to race conditions.
This patch makes this locking scheme more consistent.
It changes the spinlock to mutex. Since all the code
is executed in a user process context there is no need
for a spinlock. This also allows to keep the lock
not only while we look up for the node affinity entry,
but over the whole section where the entry is being used.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The dma_XXX API functions return bus addresses which are
physical addresses when IOMMU is disabled. Buffer
mapping to user-space is done via remap_pfn_range() with PFN
based on bus address instead of physical. This results in
wrong pages being mapped to user-space when IOMMU is enabled.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tymoteusz Kielan <tymoteusz.kielan@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Each user SDMA request coming into the driver may contain multiple packets.
Each user packet may use multiple SDMA descriptors to fill the send buffer.
The field seqsubmitted in struct user_sdma_request counts the number of
user packets submitted to an SDMA engine. Sometimes, the intermediate count
may not be updated properly. However, once all the packets' descriptors
are successfully submitted to the SDMA engine, the final count is updated
correctly. But, if only some of the packets are submitted to the engine due
to an error, the intermediate count doesn't reflect the partial number of
packets submitted to the SDMA engine. This can cause a hang later in the
code as the count of packets submitted to the SDMA engine doesn't match the
the count of packets processed by the SDMA engine.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use common header file structs, defines, and accessors
in the drivers. The old declarations are removed.
The repositioning of the includes allows for the removal
of hfi1_message_header and replaces its use with ib_header.
Also corrected are two issues with set_armed_to_active():
- The "packet" parameter is now a pointer as it should have been
- The etype is validated to insure that the header is correct
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited. It also prints a warning
everytime this feature is used as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allocate resources dynamically to cxgb4's Upper layer driver's(ULD) like
cxgbit, iw_cxgb4 and cxgb4i. Allocate resources when they register with
cxgb4 driver and free them while unregistering. All the queues and the
interrupts for them will be allocated during ULD probe only and freed
during remove.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull rdma fixes from Doug Ledford:
"Round three of 4.8 rc fixes.
This is likely the last rdma pull request this cycle. The new rxe
driver had a few issues (you probably saw the boot bot bug report) and
they should be addressed now. There are a couple other fixes here,
mainly mlx4. There are still two outstanding issues that need
resolved but I don't think their fix will make this kernel cycle.
Summary:
- Various fixes to rdmavt, ipoib, mlx5, mlx4, rxe"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/rdmavt: Don't vfree a kzalloc'ed memory region
IB/rxe: Fix kmem_cache leak
IB/rxe: Fix race condition between requester and completer
IB/rxe: Fix duplicate atomic request handling
IB/rxe: Fix kernel panic in udp_setup_tunnel
IB/mlx5: Set source mac address in FTE
IB/mlx5: Enable MAD_IFC commands for IB ports only
IB/mlx4: Diagnostic HW counters are not supported in slave mode
IB/mlx4: Use correct subnet-prefix in QP1 mads under SR-IOV
IB/mlx4: Fix code indentation in QP1 MAD flow
IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV
IB/ipoib: Don't allow MC joins during light MC flush
IB/rxe: fix GFP_KERNEL in spinlock context
Set the source mac address in the FTE when L2 specification
is provided.
Fixes: 038d2ef875 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
MAD_IFC command is supported only for physical functions (PF)
and when physical port is IB. The proposed fix enforces it.
Fixes: d603c809ef ("IB/mlx5: Fix decision on using MAD_IFC")
Reported-by: David Chang <dchang@suse.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Modify the mlx4_ib_diag_counters() to avoid the following error in the
hypervisor when the slave tries to query the hardware counters in SR-IOV
mode.
mlx4_core 0000:81:00.0: Unknown command:0x30 accepted from slave:1
Fixes: 3f85f2aaab ("IB/mlx4: Add diagnostic hardware counters")
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When sending QP1 MAD packets which use a GRH, the source GID
(which consists of the 64-bit subnet prefix, and the 64 bit port GUID)
must be included in the packet GRH.
For SR-IOV, a GID cache is used, since the source GID needs to be the
slave's source GID, and not the Hypervisor's GID. This cache also
included a subnet_prefix. Unfortunately, the subnet_prefix field in
the cache was never initialized (to the default subnet prefix 0xfe80::0).
As a result, this field remained all zeroes. Therefore, when SR-IOV
was active, all QP1 packets which included a GRH had a source GID
subnet prefix of all-zeroes.
However, the subnet-prefix should initially be 0xfe80::0 (the default
subnet prefix). In addition, if OpenSM modifies a port's subnet prefix,
the new subnet prefix must be used in the GRH when sending QP1 packets.
To fix this we now initialize the subnet prefix in the SR-IOV GID cache
to the default subnet prefix. We update the cached value if/when OpenSM
modifies the port's subnet prefix. We take this cached value when sending
QP1 packets when SR-IOV is active.
Note that the value is stored as an atomic64. This eliminates any need
for locking when the subnet prefix is being updated.
Note also that we depend on the FW generating the "port management change"
event for tracking subnet-prefix changes performed by OpenSM. If running
early FW (before 2.9.4630), subnet prefix changes will not be tracked (but
the default subnet prefix still will be stored in the cache; therefore
users who do not modify the subnet prefix will not have a problem).
IF there is a need for such tracking also for early FW, we will add that
capability in a subsequent patch.
Fixes: 1ffeb2eb8b ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The indentation in the QP1 GRH flow in procedure build_mlx_header is
really confusing. Fix it, in preparation for a commit which touches
this code.
Fixes: 1ffeb2eb8b ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Because of an incorrect bit-masking done on the join state bits, when
handling a join request we failed to detect a difference between the
group join state and the request join state when joining as send only
full member (0x8). This caused the MC join request not to be sent.
This issue is relevant only when SRIOV is enabled and SM supports
send only full member.
This fix separates scope bits and join states bits a nibble each.
Fixes: b9c5d6a643 ('IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV')
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add cxgb_mk_rx_data_ack() to remove duplicate
code to form CPL_RX_DATA_ACK hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_abort_rpl() to remove duplicate
code to form CPL_ABORT_RPL hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_abort_req() to remove duplicate code
to form CPL_ABORT_REQ hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_close_con_req() to remove duplicate
code to form CPL_CLOSE_CON_REQ hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_tid_release() to remove duplicate code
to form CPL_TID_RELEASE hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_compute_wscale() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_best_mtu() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_is_neg_adv() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_find_route6() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_find_route() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_get_4tuple() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"A set of fixes for the current series in the realm of block.
Like the previous pull request, the meat of it are fixes for the nvme
fabrics/target code. Outside of that, just one fix from Gabriel for
not doing a queue suspend if we didn't get the admin queue setup in
the first place"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme-rdma: add back dependency on CONFIG_BLOCK
nvme-rdma: fix null pointer dereference on req->mr
nvme-rdma: use ib_client API to detect device removal
nvme-rdma: add DELETING queue flag
nvme/quirk: Add a delay before checking device ready for memblaze device
nvme: Don't suspend admin queue that wasn't created
nvme-rdma: destroy nvme queue rdma resources on connect failure
nvme_rdma: keep a ref on the ctrl during delete/flush
iw_cxgb4: block module unload until all ep resources are released
iw_cxgb4: call dev_put() on l2t allocation failure
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise an endpoint can be still closing down causing a touch
after free crash. Also WARN_ON if ulps have failed to destroy
various resources during device removal.
Fixes: ad61a4c7a9 ("iw_cxgb4: don't block in destroy_qp awaiting the last deref")
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
The debugfs RCU trips many debug kernel warnings because of potential
sleeps with an RCU read lock held. This includes both user copy calls
and slab allocations throughout the file.
This patch switches the RCU to use SRCU for file remove/access
race protection.
In one case, the SRCU is implicit in the use of the raw debugfs file
object and just works.
In the seq_file case, a wrapper around seq_read() and seq_lseek() is
used to enforce the SRCU using the debugfs supplied functions
debugfs_use_file_start() and debugfs_use_file_stop().
The sychronize_rcu() is deleted since the SRCU prevents the remove
access race.
The RCU locking is kept for qp_stats since the QP hash list is
protected using the non-sleepable RCU.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The global variable n_krcvqs stores the sum of the number of kernel
receive queues of VLs 0-7 which the user can pass to the driver through
the module parameter array krcvqs which is of type unsigned integer. If
the user passes large value(s) into krcvqs parameter array, it can cause
an arithmetic overflow while calculating n_krcvqs which is also of type
unsigned int. The overflow results in an incorrect value of n_krcvqs
which can lead to kernel crash while loading the driver.
Fix by changing the data type of n_krcvqs to unsigned long. This patch
also changes the data type of other variables that get their values from
n_krcvqs.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the set_txreq_header_ahg(), The KDETH Intr bit is obtained from the
header in the user sdma request using a KDETH_GET shift and mask macro.
This value is then futher right shifted by 16 causing us to lose the
value i.e it is shifted to zero, leading to the following
smatch warning:
drivers/infiniband/hw/hfi1/user_sdma.c:1482 set_txreq_header_ahg()
warn: mask and shift to zero
The Intr bit should be left shifted into its correct position in the
KDETH header before the AHG update.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When trying to align the source pointer and there's a byte carry
in an SGE copy, bytes are borrowed from the next quad-word X to
complete the required quad-word copy. Then, the SGE length is
reduced by the number of borrowed bytes. After this, if the
remaining number of bytes from quad-word X (extra bytes) is
greater than the new SGE length, the number of extra bytes needs
to be updated to the new SGE length. Otherwise, when the
SGE length gets updated again after the extra bytes are read to
create the new byte carry, it goes negative, which then becomes
a very large number as the SGE length is an unsigned integer.
This causes SGE buffer to be over-read.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove returning errors from mlx5 poll_cq function. Polling CQ
operation in kernel never fails by Mellanox HCA architecture and
respective driver design.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>