Heavy contention of the sde flushlist_lock can cause hard lockups at
extreme scale when the flushing logic is under stress.
Mitigate by replacing the item at a time copy to the local list with
an O(1) list_splice_init() and using the high priority work queue to
do the flushes.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, EQ joined the chain notifier on creation.
This forced the caller to be ready to handle events before creating
the EQ through eq_create_generic interface.
To help the caller control when the created EQ will be attached to the
IRQ, add enable/disable API.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The patch modifies the IRQ allocation so that all async EQs are
assigned to the same IRQ resulting in more available IRQs for
completion EQs.
The changes are using the support for IRQ sharing and EQ polling budget
that was introduced in previous patches so when the shared interrupt is
triggered, the kernel will serially call the handler of each of the
sharing EQs with a certain budget of EQEs to poll in order to prevent
starvation.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Instead of requesting IRQ with eq creation, IRQs will be requested
before EQ table creation.
Instead of freeing the IRQs after EQ destroy, free IRQs after eq
table destroy.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Multiple EQs may share the same IRQ in subsequent patches.
Instead of calling the IRQ handler directly, the EQ will register
to an atomic chain notfier.
The Linux built-in shared IRQ is not used because it forces the caller
to disable the IRQ and clear affinity before free_irq() can be called.
This patch is the first step in the separation of IRQ and EQ logic.
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This driver was first merged over 10 years ago and has not seen major
activity by the authors in the last 7 years. However, in that time it has
been patched 150 times to adapt it to changing kernel APIs.
Further, the hardware has several issues, like not supporting 64 bit DMA,
that make it rather uninteresting for use with modern systems and RDMA.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Like all other destroy commands, .destroy_cq() call is not supposed
to fail. In all flows, the attempt to return earlier caused to memory
leaks.
This patch converts .destroy_cq() to do not return any errors.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The memory allocation call can fail and cause to early return
from nes_desotroy_cq() function. This situation will cause to
memory leak of struct nes_cq. Rewrite function to avoid memory
allocation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The call to sdma_progress() is called outside the wait lock.
In this case, there is a race condition where sdma_progress() can return
false and the sdma_engine can idle. If that happens, there will be no
more sdma interrupts to cause the wakeup and the user_sdma xmit will hang.
Fix by moving the lock to enclose the sdma_progress() call.
Also, delete busycount. The need for this was removed by:
commit bcad29137a ("IB/hfi1: Serve the most starved iowait entry first")
Cc: <stable@vger.kernel.org>
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This more closely follows how other subsytems work, with owner being a
member of the structure containing the function pointers.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When user post recv a srq with multiple sges, the hardware will get the
last correct sge and count the sge numbers according to the specific
identifier with lkey. For example, when the driver fills the sges with
every wr less than the max sge that the user configured when creating srq,
the hardware will stop getting the sge according to the specific lkey in
the sge. However, it will always end with the first sge in the current
post srq recv interface implementation.
Fixes: c7bcb13442 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
A previous change incorrectly changed the inverted logic and logically
negated the readl rather than the shifted readl result. Fix this by
adding in missing parentheses around the expression that needs to be
logically negated.
Addresses-Coverity: ("Logically dead code")
Fixes: 669cefb654 ("RDMA/hns: Remove jiffies operation in disable interrupt context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull rdma fixes from Jason Gunthorpe:
"Things are looking pretty quiet here in RDMA, not too many bug fixes
rolling in right now. The usual driver bug fixes and fixes for a
couple of regressions introduced in 5.2:
- Fix a race on bootup with RDMA device renaming and srp. SRP also
needs to rename its internal sys files
- Fix a memory leak in hns
- Don't leak resources in efa on certain error unwinds
- Don't panic in certain error unwinds in ib_register_device
- Various small user visible bug fix patches for the hfi and efa
drivers
- Fix the 32 bit compilation break"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/efa: Remove MAYEXEC flag check from mmap flow
mlx5: avoid 64-bit division
IB/hfi1: Validate page aligned for a given virtual address
IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
IB/rdmavt: Fix alloc_qpn() WARN_ON()
RDMA/core: Fix panic when port_data isn't initialized
RDMA/uverbs: Pass udata on uverbs error unwind
RDMA/core: Clear out the udata before error unwind
RDMA/hns: Fix PD memory leak for internal allocation
RDMA/srp: Rename SRP sysfs name after IB device rename trigger
Saeed Mahameed says:
====================
mlx5-updates-2019-05-31
This series provides some updates to mlx5 core and netdevice driver.
1) use __netdev_tx_sent_queue() to improve performance under GSO workload
2) Allow matching only enc_key_id/enc_dst_port for decapsulation action
3) Geneve support:
This patchset adds support for GENEVE tunnel encap/decap flows offload:
encapsulating layer 2 Ethernet frames within layer 4 UDP datagrams.
The driver supports 6081 destination UDP port number, which is the
default IANA-assigned port.
Encap:
ConnectX-5 inserts the header (w/ or w/o Geneve TLV options) that is
provided by the mlx5 driver to the outgoing packet.
Decap:
Geneve header is matched and the packet is decapsulated.
Notes about decap flows with Geneve TLV Options:
- Support offloading of 32-bit options data only
- At any given time, only one combination of class/type parameters
can be offloaded, but the same class/type combination can have
many different flows offloaded with different 32-bit option data
- Options with value of 0 can't be offloaded
Managing Geneve TLV options:
Matching (on receive) is done by ConnectX-5 flex parser.
Geneve TLV options are managed using General Object of type
“Geneve TLV Options”.
When the first flow with a certain class/type values is requested
to be offloaded, the driver creates a FW object with FW command
(Geneve TLV Options general object) and starts counting the number
of flows using this object.
During this time, any request with a different class/type values
will fail to be offloaded.
Once the refcount reaches 0, the driver destroys the TLV options
general object, and can now offload a flow with any class/type parameters.
Geneve TLV Options object is added to core device.
It is currently used to manage Geneve TLV options general
object allocation in FW and its reference counting only.
In the future it will also be used for managing geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.
TC tunnel code refactoring:
As a preparation for Geneve code, the TC tunnel code in mlx5
was rearranged in a modular way, so that it would be easier
to add future tunnels:
- Defined tc tunnel object with the fields and callbacks that
any tunnel must implement.
- Define tc UDP tunnel object for UDP tunnels, such as VXLAN
- Move each tunnel code (GRE, VXLAN) to its own separate file
- Rewrite tc tunnel implementation in a general way – using
only the objects and their callbacks.
4) Termination tables:
Actions in tables set with the termination flag are guaranteed to terminate
the action list. Thus, potential looping functionality (e.g. haripin) can safely be
executed without potential loops.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit:
4b53a3412d ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")
the tsk_nr_cpus_allowed() wrapper was removed. There was not
much difference in !RT but in RT we used this to implement
migrate_disable(). Within a migrate_disable() section the CPU mask is
restricted to single CPU while the "normal" CPU mask remains untouched.
As an alternative implementation Ingo suggested to use:
struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
};
with
t->cpus_ptr = &t->cpus_mask;
In -RT we then can switch the cpus_ptr to:
t->cpus_ptr = &cpumask_of(task_cpu(p));
in a migration disabled region. The rules are simple:
- Code that 'uses' ->cpus_allowed would use the pointer.
- Code that 'modifies' ->cpus_allowed would use the direct mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ifa_list is protected by rcu, yet code doesn't reflect this.
Add the __rcu annotations and fix up all places that are now reported by
sparse.
I've done this in the same commit to not add intermediate patches that
result in new warnings.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like previous patches, use the new iterator macros to avoid sparse
warnings once proper __rcu annotations are added.
Compile tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series provides some low level updates for mlx5 driver needed for
both rdma and netdev trees.
1) Termination flow steering table bits and hardware definitions.
2) Introduce the core dump HW access registers definitions.
3) Refactor and cleans-up VF representors functions handlers.
4) Renames host_params bits to function_changed bits and add the
support for eswitch functions change event in the eswitch general case.
(for both legacy and switchdev modes).
5) Potential error pointer dereference in error handling
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently for every representor type and for every single vport,
representer function pointers copy is stored even though they don't
change from one to other vport.
Additionally priv data entry for the rep is not passed during
registration, but its copied. It is used (set and cleared) by the user
of the reps.
As we want to scale vports, to simplify and also to split constants
from data,
1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops as to match _ops
prefix with other standard netdev, ibdev ops.
2. Constify the IB and Ethernet rep ops structure.
3. Instead of storing copy of all rep function pointers, store copy
per eswitch rep type.
4. Split data and function pointers to mlx5_eswitch_rep_ops and
mlx5_eswitch_rep_data.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Avoid typecasting from void* to mlx5_ib_dev* or mlx5e_rep_priv*
as it is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When the user submits more than 32 work request to a srq queue
at a time, it needs to find the corresponding number of entries
in the bitmap in the idx queue. However, the original lookup
function named ffs only processes 32 bits of the array element,
When the number of srq wqe issued exceeds 32, the ffs will only
process the lower 32 bits of the elements, it will not be able
to get the correct wqe index for srq wqe.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
So, replace the following form:
sizeof(struct opa_port_status_rsp) + num_vls * sizeof(struct _vls_pctrs)
with:
struct_size(rsp, vls, num_vls)
and so on...
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
So, replace the following form:
sizeof(*pkt) + sizeof(pkt->addr[0])*n
with:
struct_size(pkt, addr, n)
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When creating the chunks list the rdma_for_each_block() iterator is used
in order to iterate over the payload in EFA_CHUNK_PAYLOAD_SIZE (device
defined) strides.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The admin commands abort flow is buggy (use-after-free) and not really
necessary as it is guaranteed that after ib_unregister_device() is called
there are no user verbs threads running in parallel, delete it.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use kvzalloc which attempts to allocate a physically continuous buffer and
fallbacks to virtually continuous on failure instead of open coding it in
the driver.
The is_vmalloc_addr function is used to determine whether the buffer is
physically continuous or not (which determines direct vs indirect MR
registration mode).
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
MAYEXEC test was mistakenly added, remove it. Checking MAYEXEC in the
driver prevents it from working with userspace that uses things like EXEC
STACK. (ie some Fortran and other runtimes)
Fixes: 40909f664d ("RDMA/efa: Add EFA verbs implementation")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Commit 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
breaks i386 build by introducing three 64-bit divisions. As the divisor is
MLX5_SW_ICM_BLOCK_SIZE() which is always a power of 2, we can replace the
division with bit operations.
Fixes: 25c13324d0 ("IB/mlx5: Add steering SW ICM device memory type")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A recent patch to hfi1 left behind a checkpatch error.
Fixes: fb24ea52f7 ("drivers: Remove explicit invocations of mmiowb()")
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
User applications can register memory regions for TID buffers that are not
aligned on page boundaries. Hfi1 is expected to pin those pages in memory
and cache the pages with mmu_rb. The rb tree will fail to insert pages
that are not aligned correctly.
Validate whether a given virtual address is page aligned before pinning.
Fixes: 7e7a436ecb ("staging/hfi1: Add TID entry program function body")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The command 'ibv_devinfo -v' reports 0 for max_mr.
Fix by assigning the query values after the mr lkey_table has been built
rather than early on in the driver.
Fixes: 7b1e2099ad ("IB/rdmavt: Move memory registration into rdmavt")
Reviewed-by: Josh Collier <josh.d.collier@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
For infiniband code that retains pages via get_user_pages*(), release
those pages via the new put_user_page(), or put_user_pages*(), instead of
put_page()
This is a tiny part of the second step of fixing the problem described in
[1]. The steps are:
1) Provide put_user_page*() routines, intended to be used for releasing
pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to invoke
put_user_page*(), instead of put_page(). This involves dozens of call
sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.
[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hfi1/tid_rdma.c: In function tid_rdma_rcv_error:
drivers/infiniband/hw/hfi1/tid_rdma.c:2029:7: warning: variable offset set but not used [-Wunused-but-set-variable]
drivers/infiniband/hw/hfi1/tid_rdma.c: In function hfi1_rc_rcv_tid_rdma_ack:
drivers/infiniband/hw/hfi1/tid_rdma.c:4555:35: warning: variable fspsn set but not used [-Wunused-but-set-variable]
'offset' is never used since introduction in
commit d0d564a1ca ("IB/hfi1: Add functions to receive TID RDMA READ request")
'fspsn' is never used since introduciotn in
commit 9e93e967f7 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In some functions, the jiffies operation is unnecessary, and we can
control delay using mdelay and udelay functions only. Especially, in
hns_roce_v1_clear_hem, the function calls spin_lock_irqsave, the context
disables interrupt, so we can not use jiffies and msleep functions.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When hip08 set gid, it will call spin_unlock_bh when send cmq. if main.ko
call spin_lock_irqsave firstly, and the kernel is before commit
f71b74bca6 ("irq/softirqs: Use lockdep to assert IRQs are
disabled/enabled"), it will cause WARN_ON_ONCE because of calling
spin_unlock_bh in disable context.
In fact, the spin_lock_irqsave in main.ko is only used for hip06, and
should be placed in hns_roce_hw_v1.c. hns_roce_hw_v2.c uses its own
spin_unlock_bh and does not need main.ko manage spin_lock.
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
According to hip08 UM, the maximum number of CQEs supported by each CQ is
4M.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to print when communication is established, especially
while lots of qp used by application.
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add await in destroy_qp() so that all references to qp are dereferenced
and qp is freed in destroy_qp() itself. This ensures freeing of all QPs
before invocation of dealloc_ucontext(), which prevents loss of in use
qpids stored in the ucontext.
Signed-off-by: Nirranjan Kirubaharan <nirranjan@chelsio.com>
Reviewed-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>