Conflicts:
drivers/infiniband/hw/mlx5/main.c - Both add new code
include/rdma/ib_verbs.h - Both add new code
Signed-off-by: Doug Ledford <dledford@redhat.com>
The slid field in struct ib_wc is increased to 32 bits.
This enables core components to use larger LIDs if needed.
The user ABI is unchanged and returns 16-bit values when queried.
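A minimal sketch of the idea (illustrative structs, not the full
definitions; the in-tree change widens the field in struct ib_wc in
include/rdma/ib_verbs.h while the uverbs ABI struct keeps 16 bits):

	struct example_wc {		/* kernel-internal completion */
		u32 slid;		/* widened from u16 */
	};

	struct example_uverbs_wc {	/* user ABI: unchanged */
		__u16 slid;		/* userspace still sees 16 bits */
	};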
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The logic of retrieving the netdev speed from a net_device and
translating it to an IB speed is duplicated in the rxe, usnic and bnxt
drivers. Define a new helper function that consolidates it.
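A hedged sketch of the shared translation step (the mapping mirrors
what the per-driver copies did; SPEED_* constants come from
uapi/linux/ethtool.h and the IB enums from rdma/ib_verbs.h, and only a
few cases are shown):

	static void netdev_speed_to_ib(u32 netdev_speed, u8 *speed, u8 *width)
	{
		switch (netdev_speed) {
		case SPEED_1000:
			*width = IB_WIDTH_1X;
			*speed = IB_SPEED_SDR;
			break;
		case SPEED_10000:
			*width = IB_WIDTH_1X;
			*speed = IB_SPEED_FDR10;
			break;
		case SPEED_40000:
			*width = IB_WIDTH_4X;
			*speed = IB_SPEED_FDR10;
			break;
		case SPEED_100000:
			*width = IB_WIDTH_4X;
			*speed = IB_SPEED_EDR;
			break;
		default:	/* fall back to the slowest setting */
			*width = IB_WIDTH_1X;
			*speed = IB_SPEED_SDR;
			break;
		}
	}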
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Christian Benvenuti <benve@cisco.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It's better to use __func__ to print a function's name instead of
hard-coding the name in the print statement.
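For example (illustrative message text):

	pr_err("rxe_qp_from_init: unable to init qp\n");	/* before */
	pr_err("%s: unable to init qp\n", __func__);		/* after */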
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the DEVICE_ATTR_RO() macro and rename the show function accordingly.
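The pattern, with a hypothetical attribute named "foo": DEVICE_ATTR_RO()
requires the show routine to be named foo_show and generates dev_attr_foo
with 0444 permissions:

	static ssize_t foo_show(struct device *dev,
				struct device_attribute *attr, char *buf)
	{
		return sprintf(buf, "%d\n", 42);
	}
	static DEVICE_ATTR_RO(foo);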
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current computation of qp->timeout_jiffies in rvt_modify_qp() can
overflow because the input to usecs_to_jiffies() is only 32 bits
(unsigned int). Overflow occurs when attr->timeout is equal to or
greater than 30. The consequence is unnecessarily excessive retries and
thus degraded system performance.
This patch fixes the problem by limiting the input to 5 bits and calling
usecs_to_jiffies() before applying the scaling factor.
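A minimal sketch of the fixed computation (the IB timeout field encodes
a delay of 4.096 us * 2^timeout; the helper name is illustrative):

	static inline unsigned long ib_timeout_to_jiffies(u8 timeout)
	{
		/* 5-bit limit: 1U << timeout can no longer overflow */
		timeout &= 0x1f;
		/* convert first, scale after, so usecs_to_jiffies()
		 * never sees more than its 32-bit argument can hold */
		return usecs_to_jiffies(1U << timeout) * 4096UL / 1000UL;
	}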
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Callers of the driver now mark GFP_NOIO allocations with the
memalloc_noio_*() helpers, which makes it redundant to pass gfp flags
down to the driver; the flags can only ever be GFP_KERNEL.
The patch removes the gfp flags argument and updates all driver paths.
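The caller-side pattern that replaces per-call gfp flags (the helpers
live in include/linux/sched/mm.h; the driver entry point here is
hypothetical):

	unsigned int noio_flag = memalloc_noio_save();
	/* all allocations below, even plain GFP_KERNEL ones inside the
	 * driver, are implicitly degraded to GFP_NOIO */
	ret = driver_create_qp(dev);
	memalloc_noio_restore(noio_flag);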
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fixes an over-read condition detected by FORTIFY_SOURCE for this
line:
memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(skb->cb));
The error was:
In file included from ./include/linux/bitmap.h:8:0,
from ./include/linux/cpumask.h:11,
from ./include/linux/mm_types_task.h:13,
from ./include/linux/mm_types.h:4,
from ./include/linux/kmemcheck.h:4,
from ./include/linux/skbuff.h:18,
from drivers/infiniband/sw/rxe/rxe_resp.c:34:
In function 'memcpy',
inlined from 'send_atomic_ack.constprop' at drivers/infiniband/sw/rxe/rxe_resp.c:998:2,
inlined from 'acknowledge' at drivers/infiniband/sw/rxe/rxe_resp.c:1026:3,
inlined from 'rxe_responder' at drivers/infiniband/sw/rxe/rxe_resp.c:1286:10:
./include/linux/string.h:309:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
__read_overflow2();
Daniel Micay noted that struct rxe_pkt_info is 32 bytes on 32-bit
architectures, but skb->cb is still 64 bytes, so the memcpy() over-reads
32 bytes. Fix this by copying only the packet info and zeroing the
unused bytes in skb->cb.
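A sketch of the fix: copy only the packet info itself and zero the
remainder of skb->cb so nothing past ack_pkt is read:

	memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(ack_pkt));
	memset((unsigned char *)SKB_TO_PKT(skb) + sizeof(ack_pkt), 0,
	       sizeof(skb->cb) - sizeof(ack_pkt));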
Link: http://lkml.kernel.org/r/1497903987-21002-5-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure we can't come up with an array index that is bigger than the array
by applying the QPN mask before the divide in the free_qpn() function.
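Sketch of the corrected computation (names as in rdmavt; mask first,
then divide, so the map index stays in range):

	map = qpt->map + (qpn & RVT_QPN_MASK) / RVT_BITS_PER_PAGE;
	if (map->page)
		clear_bit(qpn & RVT_BITS_PER_PAGE_MASK, map->page);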
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The free_qpn() function from the hfi1/qib driver, which was the basis
for the rdmavt_free_qpn() function, was accidentally left in the code.
Remove it.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
SGEs that are contiguous needlessly consume driver-dependent TX
resources.
The lkey validation logic is enhanced to compress the SGE that ends
up in the send wqe when consecutive addresses are detected.
The lkey validation API used to return 1 (success) or 0 (fail). The
return value is now an -errno, 0 (compressed), or 1 (uncompressed). An
additional argument is added to pass in the last SGE for the
compression.
Loopback callers always pass a NULL last_sge since the optimization is
of little benefit in that situation.
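A minimal sketch of the contiguity test (modeled on the rdmavt helper;
names abridged): a new SGE folds into the previous one when the lkey
matches and the addresses abut:

	static inline bool sge_adjacent(struct rvt_sge *last_sge,
					struct ib_sge *sge)
	{
		return last_sge && sge->lkey == last_sge->mr->lkey &&
		       (u64)(last_sge->vaddr + last_sge->length) == sge->addr;
	}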
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull rdma fixes from Doug Ledford:
"I had thought at the time of the last pull request that there wouldn't
be much more to go, but several things just kept trickling in over the
last week.
Instead of just the six patches to bnxt_re that I had anticipated,
there are another five IPoIB patches, two qedr patches, and a few
other miscellaneous patches.
The bnxt_re patches are more lines of diff than I like to submit this
late in the game. That's mostly because of the first two patches in
the series of six. I almost dropped them just because of the lines of
churn, but on a close review, a lot of the churn came from removing
duplicated code sections and consolidating them into callable
routines. I felt like this made the number of lines of change more
acceptable, and they address problems, so I left them. The remainder
of the patches are all small, well contained, and well understood.
These have passed 0day testing, but have not been submitted to
linux-next (though a local merge test against your current master ran
without any conflicts).
Summary:
- A fix for commit eea40b8f62 ("infiniband: call ipv6 route lookup via
the stub interface")
- Six patches against bnxt_re...the first two are considerably larger
than I would like, but as they address real issues I went ahead and
submitted them (it also helped that a good deal of the churn was
removing code repeated in multiple places and consolidating it to
one common function)
- Two fixes against qedr that just came in
- One fix against rxe that took a few revisions to get right plus
time to get the proper reviews
- Five late breaking IPoIB fixes
- One late cxgb4 fix"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
rdma/cxgb4: Fix memory leaks during module exit
IB/ipoib: Fix memory leak in create child syscall
IB/ipoib: Fix access to un-initialized napi struct
IB/ipoib: Delete napi in device uninit default
IB/ipoib: Limit call to free rdma_netdev for capable devices
IB/ipoib: Fix memory leaks for child interfaces priv
rxe: Fix a sleep-in-atomic bug in post_one_send
RDMA/qedr: Add 64KB PAGE_SIZE support to user-space queues
RDMA/qedr: Initialize byte_len in WC of READ and SEND commands
RDMA/bnxt_re: Remove FMR support
RDMA/bnxt_re: Fix RQE posting logic
RDMA/bnxt_re: Add HW workaround for avoiding stall for UD QPs
RDMA/bnxt_re: Dereg MR in FW before freeing the fast_reg_page_list
RDMA/bnxt_re: HW workarounds for handling specific conditions
RDMA/bnxt_re: Fixing the Control path command and response handling
IB/addr: Fix setting source address in addr6_resolve()
The driver may sleep under a spin lock, and the function call path is:
post_one_send (acquire the lock by spin_lock_irqsave)
init_send_wqe
copy_from_user --> may sleep
There is no flow that makes "qp->is_user" true, and copy_from_user()
may cause a bug when a non-user pointer is used. So the copy_from_user()
call and the "qp->is_user" check are removed.
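The bug pattern in miniature (copy_from_user() may fault and sleep,
which is illegal under a held spinlock; the fix deletes the copy since
qp->is_user is never true):

	spin_lock_irqsave(&qp->sq.sq_lock, flags);
	if (qp->is_user)	/* dead branch in practice */
		err = copy_from_user(payload, user_ptr, length); /* may sleep */
	spin_unlock_irqrestore(&qp->sq.sq_lock, flags);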
Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory. The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.
It only occurs if the function ends returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.
For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:
return %i7+8
lduw [%o5+16], %o0 ! MEM[(u32 *)__shash_desc.1_10 + 16B],
%o5 holds the base of the on-stack area allocated for the shash
descriptor. But the return released the stack frame and the
register window.
So if an interrupt arrives between 'return' and 'lduw', the value
read at %o5+16 can be corrupted.
Add a data compiler barrier to work around this problem. This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)
With crucial insight from Eric Sandeen.
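The workaround as applied in lib/libcrc32c.c: copy the result out of
the on-stack shash area, then pin that memory with barrier_data() so
gcc cannot sink the load past the frame teardown:

	ret = *ctx;
	barrier_data(ctx);
	return ret;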
Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Callers of rxe_mem_copy() provide a pointer to store the updated CRC
value. That pointer was supposed to be updated, but
commit cee2688e3c ("IB/rxe: Offload CRC calculation when possible")
mistakenly removed that assignment for the RXE_MEM_TYPE_DMA memory type.
The code worked because there are no actual callers with
RXE_MEM_TYPE_DMA that are interested in the returned value of crcp.
The one caller in read_reply() that uses the returned crcp doesn't
set RXE_MEM_TYPE_DMA as mem->type.
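A sketch of the restored behavior in the RXE_MEM_TYPE_DMA branch of
rxe_mem_copy() (copy setup elided; exact ordering per rxe_mr.c):

	memcpy(dest, src, length);
	if (crcp)
		*crcp = rxe_crc32(rxe, *crcp, dest, length);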
Fixes: cee2688e3c ("IB/rxe: Offload CRC calculation when possible")
Reported-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When reading an RDMA WRITE FIRST packet we copy the DMA length from the
RDMA header into the qp->resp.resid variable for later use. Later, in
check_rkey(), we clamp it to the MTU if the packet is an RDMA WRITE
packet and has a residual length bigger than the MTU. Then, in
write_data_in(), we subtract the payload of the packet from the residual
length. If the packet happens to have a payload of exactly the MTU size,
we end up with a residual length of 0 even though the packet is not the
last one in the conversation. When the next packet in the conversation
arrives, we don't have any residual length left and thus put the QP into
an error state.
This broke NVMe over Fabrics functionality over rdma_rxe.ko.
The patch was verified using the following test.
# echo eth0 > /sys/module/rdma_rxe/parameters/add
# nvme connect -t rdma -a 192.168.155.101 -s 1023 -n nvmf-test
# mkfs.xfs -fK /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=256    agcount=4, agsize=65536 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/nvme0n1 /tmp/
[ 148.923263] XFS (nvme0n1): Mounting V4 Filesystem
[ 148.961196] XFS (nvme0n1): Ending clean mount
# dd if=/dev/urandom of=test.bin bs=1M count=128
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.437991 s, 306 MB/s
# sha256sum test.bin
cde42941f045efa8c4f0f157ab6f29741753cdd8d1cff93a6b03649d83c4129a test.bin
# cp test.bin /tmp/
# sha256sum /tmp/test.bin
cde42941f045efa8c4f0f157ab6f29741753cdd8d1cff93a6b03649d83c4129a /tmp/test.bin
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma_ah_attr can now be either IB or RoCE, allowing core components to
use one type or the other and also to define attributes unique to a
specific type. struct ib_ah is also initialized with the type when it's
first created. This ensures that calls such as modify_ah don't modify
the type of the address handle attribute.
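The shape of the typed attribute (per include/rdma/ib_verbs.h after
this change; some enumerators and fields elided):

	enum rdma_ah_attr_type {
		RDMA_AH_ATTR_TYPE_IB,
		RDMA_AH_ATTR_TYPE_ROCE,
	};

	struct rdma_ah_attr {
		struct ib_global_route grh;
		u8 sl;
		u8 static_rate;
		u8 port_num;
		enum rdma_ah_attr_type type;
		union {
			struct ib_ah_attr ib;	/* IB-specific fields */
			struct roce_ah_attr roce; /* RoCE-specific fields */
		};
	};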
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The InfiniBand spec states that "A multicast address is defined by a
MGID and a MLID" (section 10.5).
The current code uses only the MGID to identify multicast groups.
Update the driver to be compliant with this definition.
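A sketch of the compliant lookup (names illustrative of the driver's
mcast code): a group now matches on both fields, not the MGID alone:

	if (!memcmp(mcast->mgid.raw, mgid->raw, sizeof(mcast->mgid)) &&
	    mcast->lid == lid)
		return mcast;	/* found the multicast group */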
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This logic seems to be duplicated in (at least) three separate files.
Move it to one place so the code can be re-used.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
For an RC QP there is no need to resolve the outgoing interface for
each packet, as it does not change during the QP's life cycle. Instead,
cache the interface on the socket and use that one.
This improves performance by 12% by sparing redundant calls to
rxe_find_route (see the sketch after the benchmark below).
ib_send_bw -d rxe0 -x 1 -n 9000 -e -s $((1024 * 1024 )) -l 100
----------------------------------------------------------------------------------------
|        |  bytes  | iterations | BW peak[MB/sec] | BW average[MB/sec] | MsgRate[Mpps] |
----------------------------------------------------------------------------------------
| before | 1048576 |    9000    |       inf       |       551.21       |   0.000551    |
| after  | 1048576 |    9000    |       inf       |       615.54       |   0.000616    |
----------------------------------------------------------------------------------------
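The caching scheme in miniature (modeled on rxe_net.c; the resolver
call and its argument list are illustrative of the old per-packet
lookup):

	if (qp_type(qp) == IB_QPT_RC)
		dst = sk_dst_get(qp->sk->sk);	/* reuse cached route */

	if (!dst || !dst_check(dst, qp->dst_cookie)) {
		dst = rxe_find_route(rxe, qp, av);
		if (dst && qp_type(qp) == IB_QPT_RC)
			sk_dst_set(qp->sk->sk, dst);	/* cache it */
	}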
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The rxe_rcv() function is used internally in RXE and doesn't need to be
exported. This patch removes the export declaration.
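The dropped export, in essence:

	EXPORT_SYMBOL(rxe_rcv);	/* removed: rxe_rcv has no external users */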
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose new counters using the get_hw_stats callback.
We expose the following counters:
+---------------------+----------------------------------------+
| Name                | Description                            |
|---------------------+----------------------------------------|
|sent_pkts            | number of sent packets                 |
|---------------------+----------------------------------------|
|rcvd_pkts            | number of received packets             |
|---------------------+----------------------------------------|
|out_of_sequence      | number of errors due to packet         |
|                     | transport sequence numbers             |
|---------------------+----------------------------------------|
|duplicate_request    | number of received duplicated packets; |
|                     | a request that was previously executed |
|                     | is considered duplicated               |
|---------------------+----------------------------------------|
|rcvd_rnr_err         | number of RNR NAKs received by the     |
|                     | completer                              |
|---------------------+----------------------------------------|
|send_rnr_err         | number of RNR NAKs sent by the         |
|                     | responder                              |
|---------------------+----------------------------------------|
|rcvd_seq_err         | number of out-of-sequence packets      |
|                     | received                               |
|---------------------+----------------------------------------|
|ack_deffered         | number of ack packets whose handling   |
|                     | was deferred                           |
|---------------------+----------------------------------------|
|retry_exceeded_err   | number of times the retry count was    |
|                     | exceeded                               |
|---------------------+----------------------------------------|
|completer_retry_err  | number of times the completer decided  |
|                     | to retry                               |
|---------------------+----------------------------------------|
|send_err             | number of failed packet sends          |
+---------------------+----------------------------------------+
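Sketch of wiring the counters into the get_hw_stats interface (the rxe
implementation lives in rxe_hw_counters.c; names abridged):

	static struct rdma_hw_stats *rxe_ib_alloc_hw_stats(struct ib_device *ibdev,
							   u8 port_num)
	{
		/* counter name strings from the table above, in order */
		return rdma_alloc_hw_stats_struct(rxe_counter_name,
						  ARRAY_SIZE(rxe_counter_name),
						  RDMA_HW_STATS_DEFAULT_LIFESPAN);
	}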
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The synchronize_rcu() call can be eliminated to improve memory
deregistration performance.
There are two key fields involved:
- the RCU pointer itself
- the lkey_published field
To close the window between the RCU read of the mregion pointer and
taking the reference count, the code should (see the sketch after this
list):
1. In lkey/rkey validation (the reader):
Read the RCU pointer. If the pointer is non-NULL, get a reference.
For the current validation tests, use READ_ONCE() on lkey_published.
Upon any failure, release the reference.
2. In the remove logic (delete):
Ensure lkey_published is zeroed prior to setting the pointer to NULL.
This requires using rcu_assign_pointer() to ensure lkey_published
is written prior to the NULL.
3. In the insert logic (add):
Ensure lkey_published is set, and use rcu_assign_pointer() so the
pointer write is ordered after all other MR fields.
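A minimal sketch of the reader side (field and helper names modeled on
rdmavt's lkey table):

	rcu_read_lock();
	mr = rcu_dereference(rkt->table[idx]);
	if (!mr)
		goto bail;
	rvt_get_mr(mr);			/* reference taken under RCU */
	if (!READ_ONCE(mr->lkey_published))
		goto bail_unref;	/* raced with remove: drop the ref */
	rcu_read_unlock();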
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The wqe should be read-only, and in fact the superfluous reset of the
RVT_SEND_RESERVE_USED flag causes an issue where reserved operations
elicit a bad completion to the ULP.
The maintenance of the flag is now entirely within rvt_post_one_wr(),
where a reserved operation will set the flag and a non-reserved
operation will ensure the flag is reset on the operation about to be
posted.
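Sketch of the flag maintenance now confined to rvt_post_one_wr() (the
reserved-operation test is modeled on rdmavt's per-opcode flags table):

	if (rdi->post_parms[wr->opcode].flags & RVT_OPERATION_USE_RESERVE)
		wqe->wr.send_flags |= RVT_SEND_RESERVE_USED;
	else
		wqe->wr.send_flags &= ~RVT_SEND_RESERVE_USED;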
Fixes: 856cc4c237 ("IB/hfi1: Add the capability for reserved operations")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>