There is a four byte hole at the end of the "uresp" struct after the
->qid_mask member.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
These sizes should be unsigned so we don't allow negative values and
have underflow bugs. These can come from the user so there may be
security implications, but I have not tested this.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The current logic suffers from a slow response time to disable user DB
usage, and also fails to avoid DB FIFO drops under heavy load. This commit
fixes these deficiencies and makes the avoidance logic more optimal.
This is done by more efficiently notifying the ULDs of potential DB
problems, and implements a smoother flow control algorithm in iw_cxgb4,
which is the ULD that puts the most load on the DB fifo.
Design:
cxgb4:
Direct ULD callback from the DB FULL/DROP interrupt handler. This allows
the ULD to stop doing user DB writes as quickly as possible.
While user DB usage is disabled, the LLD will accumulate DB write events
for its queues. Then once DB usage is reenabled, a single DB write is
done for each queue with its accumulated write count. This reduces the
load put on the DB fifo when reenabling.
iw_cxgb4:
Instead of marking each qp to indicate DB writes are disabled, we create
a device-global status page that each user process maps. This allows
iw_cxgb4 to only set this single bit to disable all DB writes for all
user QPs vs traversing the idr of all the active QPs. If the libcxgb4
doesn't support this, then we fall back to the old approach of marking
each QP. Thus we allow the new driver to work with an older libcxgb4.
When the LLD upcalls iw_cxgb4 indicating DB FULL, we disable all DB writes
via the status page and transition the DB state to STOPPED. As user
processes see that DB writes are disabled, they call into iw_cxgb4
to submit their DB write events. Since the DB state is in STOPPED,
the QP trying to write gets enqueued on a new DB "flow control" list.
As subsequent DB writes are submitted for this flow controlled QP, the
amount of writes are accumulated for each QP on the flow control list.
So all the user QPs that are actively ringing the DB get put on this
list and the number of writes they request are accumulated.
When the LLD upcalls iw_cxgb4 indicating DB EMPTY, which is in a workq
context, we change the DB state to FLOW_CONTROL, and begin resuming all
the QPs that are on the flow control list. This logic runs on until
the flow control list is empty or we exit FLOW_CONTROL mode (due to
a DB DROP upcall, for example). QPs are removed from this list, and
their accumulated DB write counts written to the DB FIFO. Sets of QPs,
called chunks in the code, are removed at one time. The chunk size is 64.
So 64 QPs are resumed at a time, and before the next chunk is resumed, the
logic waits (blocks) for the DB FIFO to drain. This prevents resuming to
quickly and overflowing the FIFO. Once the flow control list is empty,
the db state transitions back to NORMAL and user QPs are again allowed
to write directly to the user DB register.
The algorithm is designed such that if the DB write load is high enough,
then all the DB writes get submitted by the kernel using this flow
controlled approach to avoid DB drops. As the load lightens though, we
resume to normal DB writes directly by user applications.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch refactors the IB core umem code and vendor drivers to use a
linear (chained) SG table instead of chunk list. With this change the
relevant code becomes clearer—no need for nested loops to build and
use umem.
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Building mem.o for 32 bits x86 triggers a GCC warning:
drivers/infiniband/hw/cxgb4/mem.c: In function '_c4iw_write_mem_dma_aligned':
drivers/infiniband/hw/cxgb4/mem.c:79:25: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Silence that warning by casting "&wr_wait" to unsigned long before
casting it to __be64. That's what _c4iw_write_mem_inline() already does.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull networking fixes from David Miller:
"Some holiday bug fixes for 3.13... There is still one bug I'd like to
get fixed before 3.13-final.
The vlan code erroneously assignes the header ops of the underlying
real device to the VLAN device above it when the real device can
hardware offload VLAN handling. That's completely bogus because
header ops are tied to the device type, so they only expect to see a
'dev' argument compatible with their ops.
The fix is the have the VLAN code use a special set of header ops that
does the pass-thru correctly, by calling the underlying real device's
header ops but _also_ passing in the real device instead of the VLAN
device.
That fix is currently waiting some testing.
Anyways, of note here:
1) Fix bitmap edge case in radiotap, from Johannes Berg.
2) Fix oops on driver unload in rtlwifi, from Larry Finger.
3) Bonding doesn't do locking correctly during speed/duplex/link
changes, from Ding Tianhong.
4) Fix header parsing in GRE code, this bug has been around for a few
releases. From Timo Teräs.
5) SIT tunnel driver MTU check needs to take GSO into account, from
Eric Dumazet.
6) Minor info leak in inet_diag, from Daniel Borkmann.
7) Info leak in YAM hamradio driver, from Salva Peiró.
8) Fix route expiration state handling in ipv6 routing code, from Li
RongQing.
9) DCCP probe module does not check request_module()'s return value,
from Wang Weidong.
10) cpsw driver passes NULL device names to request_irq(), from
Mugunthan V N.
11) Prevent a NULL splat in RDS binding code, from Sasha Levin.
12) Fix 4G overflow test in tg3 driver, from Nithin Sujir.
13) Cure use after free in arc_emac and fec driver's software
timestamp handling, from Eric Dumazet.
14) SIT driver can fail to release the route when
iptunnel_handle_offloads() throws an error. From Li RongQing.
15) Several batman-adv fixes from Simon Wunderlich and Antonio
Quartulli.
16) Fix deadlock during TIPC socket release, from Ying Xue.
17) Fix regression in ROSE protocol recvmsg() msg_name handling, from
Florian Westphal.
18) stmmac PTP support releases wrong spinlock, from Vince Bridgers"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
stmmac: Fix incorrect spinlock release and PTP cap detection.
phy: IRQ cannot be shared
net: rose: restore old recvmsg behavior
xen-netback: fix guest-receive-side array sizes
fec: Do not assume that PHY reset is active low
tipc: fix deadlock during socket release
netfilter: nf_tables: fix wrong datatype in nft_validate_data_load()
batman-adv: fix vlan header access
batman-adv: clean nf state when removing protocol header
batman-adv: fix alignment for batadv_tvlv_tt_change
batman-adv: fix size of batadv_bla_claim_dst
batman-adv: fix size of batadv_icmp_header
batman-adv: fix header alignment by unrolling batadv_header
batman-adv: fix alignment for batadv_coded_packet
netfilter: nf_tables: fix oops when updating table with user chains
netfilter: nf_tables: fix dumping with large number of sets
ipv6: release dst properly in ipip6_tunnel_xmit
netxen: Correct off-by-one errors in bounds checks
net: Add some clarification to skb_tx_timestamp() comment.
arc_emac: fix potential use after free
...
This patch marks the function _c4iw_write_mem_dma() as static
because it is not used outside this file, which fixes the warning:
drivers/infiniband/hw/cxgb4/mem.c:176:5: warning: no previous prototype for ‘_c4iw_write_mem_dma’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Physical addresses may be wider than virtual addresses (e.g. on i386
with PAE) and must not be formatted with %p.
Compile-tested only.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Lustre uses a advertised max MR size of ~0ULL to indicate it should
use a dma_mr. Hence advertise max MR size as ~0ULL.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When polling, we do a GTS update if the accumulated cidx_inc == the CQ
depth / 16. However, if the CQ is large enough, Cq depth / 16 exceeds
the size of the field in the GTS word. So we also need to update if
cidx_inc hits CIDXINC_MASK to avoid overflowing the field.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
accept_cr() failed to set the arp error handler on a reused skb. This
results in a kernel crash if the arp does indeed time out.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch makes following fixes in QP flush logic:
- correctly flushes unsignaled WRs followed by a signaled WR
- supports for flushing a CQ bound to multiple QPs
- resets cidx_flush if a active queue starts getting HW CQEs again
- marks WQ in error when we leave RTS. This was only being done for
user queues, but we need it for kernel queues too so that
post_send/post_recv will start returning the appropriate error
synchronously
- eats unsignaled read resp CQEs. HW always inserts CQEs so we must
silently discard them if the read work request was unsignaled.
- handles QP flushes with pending SW CQEs. The flush and out of order
completion logic has a bug where if out of order completions are
flushed but not yet polled by the consumer and the qp is then
flushed then we end up inserting duplicate completions.
- c4iw_flush_sq() should only flush wrs that have not already been
flushed. Since we already track where in the SQ we've flushed via
sq.cidx_flush, just start at that point and flush any remaining.
This bug only caused a problem in the presence of unsignaled work
requests.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
[ Fixed sparse warning due to htonl/ntohl confusion. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move QP to TERMINATE instead to allow the peer to get the TERM
message. This bug wasn't detectable until newer FW that moves
connections out of RDMA mode as soon as an error is detected.
QP can exit RTS before the last AE arrives. This was introduced by
changes in the FW to kick connections out of RDMA mode as soon as an
error is detected. A side effect of this is that the driver can move
the QP out of RTS before the AE causing the connection to get kicked
out of RDMA mode is processed. Fix for this is to always post async
errors even if the QP is out of RTS.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add new cpl messages, cpl_act_open_req6 and cpl_t5_act_open_req6, for
initiating active open connections.
Use LLD api cxgb4_create_server and cxgb4_create_server6 for
initiating passive open connections. Similarly use cxgb4_remove_server
to remove the passive open connections in place of listen_stop.
Add support for iWARP over VLAN device and enable IPv6 support on VLAN device.
Make use of import_ep in c4iw_reconnect.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
[ Fix build when IPv6 is disabled and make sure iw_cxgb4 is not built-in
when ipv6 is a module. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the type of local_addr and remote_addr fields in struct
iw_cm_id from struct sockaddr_in to struct sockaddr_storage to hold
IPv6 and IPv4 addresses uniformly.
Change the references of local_addr and remote_addr in cxgb4, cxgb3,
nes and amso drivers to match this. However to be able to actully run
traffic over IPv6, low-level drivers have to add code to support this.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
[ Fix unused variable warnings when INFINIBAND_NES_DEBUG not set.
- Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull InfiniBand/RDMA changes from Roland Dreier:
- XRC transport fixes
- Fix DHCP on IPoIB
- mlx4 preparations for flow steering
- iSER fixes
- miscellaneous other fixes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (23 commits)
IB/iser: Add support for iser CM REQ additional info
IB/iser: Return error to upper layers on EAGAIN registration failures
IB/iser: Move informational messages from error to info level
IB/iser: Add module version
mlx4_core: Expose a few helpers to fill DMFS HW strucutures
mlx4_core: Directly expose fields of DMFS HW rule control segment
mlx4_core: Change a few DMFS fields names to match firmare spec
mlx4: Match DMFS promiscuous field names to firmware spec
mlx4_core: Move DMFS HW structs to common header file
IB/mlx4: Set link type for RAW PACKET QPs in the QP context
IB/mlx4: Disable VLAN stripping for RAW PACKET QPs
mlx4_core: Reduce warning message for SRQ_LIMIT event to debug level
RDMA/iwcm: Don't touch cmid after dropping reference
IB/qib: Correct qib_verbs_register_sysfs() error handling
IB/ipath: Correct ipath_verbs_register_sysfs() error handling
RDMA/cxgb4: Fix SQ allocation when on-chip SQ is disabled
SRPT: Fix odd use of WARN_ON()
IPoIB: Fix ipoib_hard_header() return value
RDMA: Rename random32() to prandom_u32()
RDMA/cxgb3: Fix uninitialized variable
...
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull infiniband/rdma fixes from Roland Dreier:
"Small batch of InfiniBand/RDMA fixes for 3.9:
- Fix for TX lockup in IPoIB
- QLogic -> Intel update for qib driver
- Small static checker fix for qib
- Fix error path return value in cxgb4"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/qib: change QLogic to Intel
IB/ipath: Silence a static checker warning
IPoIB: Fix send lockup due to missed TX completion
RDMA/cxgb4: Fix error return code in create_qp()
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.
As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:
Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.
before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
When neighbour table is full, dst_neigh_lookup/dst_neigh_lookup_skb will return
-ENOBUFS which is absolutely non zero, while all the code in kernel which use
above functions assume failure only on zero return which will cause panic. (for
example: : https://bugzilla.kernel.org/show_bug.cgi?id=54731).
This patch corrects above error with smallest changes to kernel source code and
also correct two return value check missing bugs in drivers/infiniband/hw/cxgb4/cm.c
Tested on my x86_64 SMP machine
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
T5 adapter does not support onchip queue memory. Present logic fails to
allocate QP for T5 and returns an error. Also, if module parameter ocqp_support
is zero then we are unable to allocate QP which should not be the case. Ideally
if ocqp_support parameter is 0 or onchip queue support is disable then host QP
should be allocated before returning an error.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Always bump the tcam_full stat. Also, bump wr reply timeout to 30 seconds.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It enables direct DMA by HW to memory region PBL arrays and fast register PBL
arrays from host memory, vs the T4 way of passing these arrays in the WR itself.
The result is lower latency for memory registration, and larger PBL array
support for fast register operations.
This patch also updates ULP_TX_MEM_WRITE command fields for T5. Ordering bit of
ULP_TX_MEM_WRITE is at bit position 22 in T5 and at 23 in T4.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both DB Flow-Control and DB Coalescing are disabled by default on T5
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for Chelsio T5 adapter.
Enables T5's Write Combining feature.
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enhances the IB core support for Memory Windows (MWs).
MWs allow an application to have better/flexible control over remote
access to memory.
Two types of MWs are supported, with the second type having two flavors:
Type 1 - associated with PD only
Type 2A - associated with QPN only
Type 2B - associated with PD and QPN
Applications can allocate a MW once, and then repeatedly bind the MW
to different ranges in MRs that are associated to the same PD. Type 1
windows are bound through a verb, while type 2 windows are bound by
posting a work request.
The 32-bit memory key is composed of a 24-bit index and an 8-bit
key. The key is changed with each bind, thus allowing more control
over the peer's use of the memory key.
The changes introduced are the following:
* add memory window type enum and a corresponding parameter to ib_alloc_mw.
* type 2 memory window bind work request support.
* create a struct that contains the common part of the bind verb struct
ibv_mw_bind and the bind work request into a single struct.
* add the ib_inc_rkey helper function to advance the tag part of an rkey.
Consumer interface details:
* new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
for these features.
Devices can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
IB_DEVICE_MEM_WINDOW_TYPE_2B if it supports type 2A or type 2B
memory windows. It can set neither to indicate it doesn't support
type 2 windows at all.
* modify existing provides and consumers code to the new param of
ib_alloc_mw and the ib_mw_bind_info structure
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>