The extended address vector is the highest bit in be32 variable,
but it was compared with the lowest. This patch fixes the endianness
of that check and removes already declared define.
Fixes: 17d2f88f92 ("IB/mlx5: Add ODP atomics support")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to mark relevant fields while setting the MTPPS register
add field select. Otherwise it can cause a misconfiguration in
firmware.
Fixes: ee7f12205a ('net/mlx5e: Implement 1PPS support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch adds below requester and responder side error counters,
which will be exposed by hardware counters interface and are supported
as part of query Q counters command extension.
+---------------------------+-------------------------------------+
| Name | Description |
|---------------------------+-------------------------------------|
|resp_local_length_error | Number of times responder detected |
| | local length errors |
|---------------------------+-------------------------------------|
|resp_cqe_error | Number of CQEs completed with error |
| | at responder |
|---------------------------+-------------------------------------|
|req_cqe_error | Number of CQEs completed with error |
| | at requester |
|---------------------------+-------------------------------------|
|req_remote_invalid_request | Number of times requester detected |
| | remote invalid request error |
|---------------------------+-------------------------------------|
|req_remote_access_error | Number of times requester detected |
| | remote access error |
|---------------------------+-------------------------------------|
|resp_remote_access_error | Number of times responder detected |
| | remote access error |
|---------------------------+-------------------------------------|
|resp_cqe_flush_error | Number of CQEs completed with |
| | flushed with error at responder |
|---------------------------+-------------------------------------|
|req_cqe_flush_error | Number of CQEs completed with |
| | flushed with error at requester |
+---------------------------+-------------------------------------+
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The extended address vector is the highest bit in be32 variable,
but it was compared with the lowest. This patch fixes the endianness
of that check and removes already declared define.
Fixes: 17d2f88f92 ("IB/mlx5: Add ODP atomics support")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Report 'ipoib_enhanced_offloads' capabilities from
the core layer, it will be used in the next patch from this series.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
RQs that were configured for "delay drop" will prevent packet drops
when their WQEs are depleted.
Marking an RQ to be drop-less is done by setting delay_drop_en in RQ
context using CREATE_RQ command.
Since this feature is globally activated/deactivated by using the
SET_DELAY_DROP command on all the marked RQs, we activated/deactivated
it according to the number of RQs with 'delay_drop' enabled.
When timeout is expired, then the feature is deactivated. Therefore
the driver handles the delay drop timeout event and reactivate it.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When delay drop timeout is expired, the firmware raises
general notification event of DELAY_DROP_TIMEOUT subtype.
In addition the feature is disable so the driver have to
reactivate the timeout.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to SET_DELAY_DROP command.
This command will be used in downstream patches for delay packet drop.
The timeout value should be indicated by delay_drop_timeout field.
Packet processing will be delayed till timeout value passed or until
more WQEs are posted.
Setting this value to 0 disables the feature.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When a user sets port_guid, node_guid or policy of an IB virtual
function, save this information in "struct mlx5_vf_context".
This information will be restored later when pci_resume is called.
To make sure this works, one can use aer-inject to generate PCI
errors on mlx5 devices and verify if relevant fields are restored
after PCI resume.
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds debug control parameters for congestion control which
can be read or written through debugfs. They are for reaction point and
notification point nodes.
These control parameters are as below:
+------------------------------+-----------------------------------------+
| Name | Description |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate | When set target rate is updated to |
| | current rate |
|------------------------------+-----------------------------------------|
|rp_clamp_tgt_rate_ati | When set update target rate based on |
| | timer as well |
|------------------------------+-----------------------------------------|
|rp_time_reset | time between rate increase if no |
| | CNP is received unit in usec |
|------------------------------+-----------------------------------------|
|rp_byte_reset | Number of bytes between rate inease if |
| | no CNP is received |
|------------------------------+-----------------------------------------|
|rp_threshold | Threshold for reaction point rate |
| | control |
|------------------------------+-----------------------------------------|
|rp_ai_rate | Rate for target rate, unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_hai_rate | Rate for hyper increase state |
| | unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_min_dec_fac | Minimum factor by which the current |
| | transmit rate can be changed when |
| | processing a CNP, unit is percerntage |
|------------------------------+-----------------------------------------|
|rp_min_rate | Minimum value for rate limit, |
| | unit in Mbps |
|------------------------------+-----------------------------------------|
|rp_rate_to_set_on_first_cnp | Rate that is set when first CNP is |
| | received, unit is Mbps |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_g | Used to calculate alpha |
|------------------------------+-----------------------------------------|
|rp_dce_tcp_rtt | Time between updates of alpha value, |
| | unit is usec |
|------------------------------+-----------------------------------------|
|rp_rate_reduce_monitor_period | Minimum time between consecutive rate |
| | reductions |
|------------------------------+-----------------------------------------|
|rp_initial_alpha_value | Initial value of alpha |
|------------------------------+-----------------------------------------|
|rp_gd | When CNP is received, flow rate is |
| | reduced based on gd, rp_gd is given as |
| | log2(rp_gd) |
|------------------------------+-----------------------------------------|
|np_cnp_dscp | dscp code point for generated cnp |
|------------------------------+-----------------------------------------|
|np_cnp_prio_mode | 802.1p priority for generated cnp |
|------------------------------+-----------------------------------------|
|np_cnp_prio | cnp priority mode |
+------------------------------+-----------------------------------------+
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some overlapping changes in the mlx5 driver.
A merge conflict resolution posted by Stephen Rothwell was used as a
guide.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Innova IPSec ESP crypto offload configuration paths.
Detect Innova IPSec device and set the NETIF_F_HW_ESP flag.
Configure Security Associations using the API introduced in a previous
patch.
Add Software-parser hardware descriptor layout
Software-Parser (swp) is a hardware feature in ConnectX which allows the
host software to specify protocol header offsets in the TX path, thus
overriding the hardware parser.
This is useful for protocols that the ASIC may not be able to parse on
its own.
Note that due to inline metadata, XDP is not supported in Innova IPSec.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add routines for manipulating the hardware IPSec SA database (SADB).
In Innova IPSec, a Security Association (SA) is added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.
Add implementation for Innova IPSec (FPGA-based) hardware.
These routines will be used by the IPSec offload support in a later patch
However they may also be used by others such as RDMA and RoCE IPSec.
mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.
In this patchset we add Innova IPSec support and mlx5/accel delegates
IPSec offloads to Innova routines.
In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add interface to initialize and interact with Innova FPGA SBU
connections.
A client driver may use these functions to set up a high-speed DMA
connection with its SBU hardware logic, and send/receive messages
over this connection.
A later patch in this patchset will make use of these functions for
Innova IPSec offload in mlx5 Ethernet driver.
Add commands to retrieve Innova FPGA SBU capabilities, and to
read/write Innova FPGA configuration space registers and memory,
over internal I2C.
At high level, the FPGA configuration space is divided such:
0x00000000 - 0x007fffff is reserved for the SBU
0x00800000 - 0xffffffff is reserved for the Shell
0x400000000 - ... is DDR memory
A later patchset will add support for accessing FPGA CrSpace and memory
over a high-speed connection. This is the reason for the ACCESS_TYPE
enumeration, which currently only supports I2C.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The Innova FPGA includes shell hardware and Sandbox-Unit (SBU) hardware.
The shell hardware is handled by mlx5_core itself, while the SBU is
handled by a client driver.
Reset the SBU to a well-known initial state when initializing a new
device, and set the FPGA to bypass mode when uninitializing a device.
This allows the client driver to assume that its device has been
reset when a new device is detected.
During SBU reset, the FPGA is put into SBU-bypass mode. In this mode
packets do not pass through the SBU, so it cannot affect the network
data stream at all.
A factory-image does not have an SBU, so skip these flows.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The FPGA QP is a high-bandwidth communication channel between the host
CPU and the FPGA device. It allows performing DMA operations between
host memory and the FPGA logic via the ConnectX chip.
Add ConnectX FW commands which create and manipulate FPGA QPs.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Previously, only mlx5_ib enabled RoCE on the port, but FPGA needs it as
well.
Add support for counting number of enables, so that FPGA and IB can work
in parallel and independently.
Program the HW to enable RoCE on the first enable call, and program to
disable RoCE on the last disable call.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reserved GIDs are entries in the GID table in use by the mlx5_core
and its submodules (e.g. FPGA, SRIOV, E-Swtich, netdev).
The entries are reserved at the high indexes of the GID table.
A mlx5 submodule may reserve a certain amount of GIDs for its own use
during the load sequence by calling mlx5_core_reserve_gids, and must
also take care to un-reserve these GIDs when it closes.
Reservation is only allowed during the load sequence and before any
interfaces (e.g. mlx5_ib or mlx5_en) are up.
After reservation, a submodule may call mlx5_core_reserved_gid_alloc/
free to allocate entries from the reserved GIDs pool.
Reserve a GID table entry for every supported FPGA QP.
A later patch in the patchset will remove them from being reported to
IB core.
Another such patch will make use of these for FPGA QPs in Innova NIC.
Added lib/mlx5.h to serve as a library for mlx5 submodlues, and to
expose only public mlx5 API, more mlx5 library files will be added in
future submissions.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Draining the health workqueue will ignore future health works including
the one that report hardware failure and thus we can't enter error state
Instead cancel the recovery flow and make sure only recovery flow won't
be scheduled.
Fixes: 5e44fca504 ('net/mlx5: Only cancel recovery work when cleaning up device')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
The offending commit pushed fwd the field by two bits but
didn't increment the offset, fix that. Currently, no damage
was done b/c this is just a field name, but lets have it right.
Fixes: f32f5bd2eb ('net/mlx5: Configure cache line size for start and end padding')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Enhance MCAM to allow the driver to query which access regs are
supported. For now, expose the regs needed for FW flashing.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
MCC (Management Component Control) allows to control a firmware
component update.
MCDA (Management Component Data Access) allows to read and write
a firmware component.
MCQI (Management Component Query Information) allows to query
information about firmware components.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Enable offloading of TC matching on ip ttl / hop-limit
As matching on ttl is supported only by newer HW brands (ConnectX-5),
we should do capability check before attempting to offload that.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Adding a support to flush all HW resources with one FW command and
skip all the heavy unload flows of the driver on kernel shutdown.
There's no need to free all the SW context since a new fresh kernel
will be loaded afterwards.
Regarding the FW resources, they should be closed, otherwise we will
have leakage in the FW. To accelerate this flow, we execute one command
in the beginning that tells the FW that the driver isn't going to close
any of the FW resources and asks the FW to clean up everything.
Once the commands complete, it's safe to close the PCI resources and
finish the routine.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add a new interface for commands execution that allows the
caller to wait for the command's completion in a busy-wait
loop (polling mode).
This is useful if we want to execute a command in a polling mode
while the driver is working in events mode for the rest of
the commands.
This interface will be used in the downstream patches.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Move "query queue counter out of buffer" helper function out of
qp.c to en_main.c, since mlx5e netdev driver is the only one to use it.
Also allocate the output buffer on the stack instead of the heap, to reduce
number of heap allocs on update_stats work.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Fixed few places where endianness was misspelled and
one spot whwere output was:
CHECK: 'endianess' may be misspelled - perhaps 'endianness'?
CHECK: 'ouput' may be misspelled - perhaps 'output'?
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Read port connector type from the firmware instead of caching it in the
driver metadata.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Update struct mlx5_ifc_create(modify)_flow_table_bits according to
the last device specification.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
HW can implement UMR wqe re-transmission in various ways.
Thus, add HCA cap to distinguish the needed fence for UMR to make
sure that the wqe wouldn't fail on mkey checks.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net'
restricting a HW workaround alongside cleanups in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-update-2017-05-23
First patch from Leon, came to remove the redundant usage of mlx5_vzalloc,
and directly use kvzalloc across all mlx5 drivers.
2nd patch from Noa, adds new device IDs into the supported devices list.
3rd and 4th patches from Ilan are adding the basic infrastructure and
support for Mellanox's mlx5 FPGA.
Last two patches from Tariq came to modify the outdated driver version
reported in ethtool and in mlx5_ib to more reflect the current driver state
and remove the redundant date string reported in the version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
mlx5-fixes-2017-05-23
Some TC offloads fixes from Or Gerlitz.
From Erez, mlx5 IPoIB RX fix to improve GRO.
From Mohamad, Command interface fix to improve mitigation against FW
commands timeouts.
From Tariq, Driver load Tolerance against affinity settings failures.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Masks for extracting part of the Completion Queue Entry (CQE)
field rss_hash_type was swapped, namely CQE_RSS_HTYPE_IP and
CQE_RSS_HTYPE_L4.
The bug resulted in setting skb->l4_hash, even-though the
rss_hash_type indicated that hash was NOT computed over the
L4 (UDP or TCP) part of the packet.
Added comments from the datasheet, to make it more clear what
these masks are selecting.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when firmware command gets stuck or it takes long time to
complete, the driver command will get timeout and the command slot is
freed and can be used for new commands, and if the firmware receive new
command on the old busy slot its behavior is unexpected and this could
be harmful.
To fix this when the driver command gets timeout we return failure,
but we don't free the command slot and we wait for the firmware to
explicitly respond to that command.
Once all the entries are busy we will stop processing new firmware
commands.
Fixes: 9cba4ebcf3 ('net/mlx5: Fix potential deadlock in command mode change')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mellanox Innova is a NIC with ConnectX and an FPGA on the same
board. The FPGA is a bump-on-the-wire and thus affects operation of
the mlx5_core driver on the ConnectX ASIC.
Add basic support for Innova in mlx5_core.
This allows using the Innova card as a regular NIC, by detecting
the FPGA capability bit, and verifying its load state before
initializing ConnectX interfaces.
Also detect FPGA fatal runtime failures and enter error state if
they ever happen.
All new FPGA-related logic is placed in its own subdirectory 'fpga',
which may be built by selecting CONFIG_MLX5_FPGA.
This prepares for further support of various Innova features in later
patchsets.
Additional details about hardware architecture will be provided as
more features get submitted.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Introduce new function for entering bad-health state.
This function will be called from FPGA-related logic in a later patch from
asynchronous event (IRQ) context, for that we change the spin lock to an
IRQ-safe one.
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Commit a7c3e901a4 ("mm: introduce kv[mz]alloc helpers") added
proper implementation of mlx5_vzalloc function to the MM core.
This made the mlx5_vzalloc function useless, so let's remove it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Root flow table is dynamically changed by the underlying flow steering
layer, and IPoIB/ULPs have no idea what will be the root flow table in
the future, hence we need a dynamic infrastructure to move Underlay QPs
with the root flow table.
Fixes: b3ba51498b ("net/mlx5: Refactor create flow table method to accept underlay QP")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Pull more rdma updates from Doug Ledford:
"As mentioned in my first pull request, this is the subsequent pull
requests I had. This is all I have, and in fact this cleans out the
RDMA subsystem's entire patchworks queue of kernel changes that are
ready to go (well, it did for the weekend anyway, a few new patches
are in, but they'll be coming during the -rc cycle).
The first tag contains a single patch that would have conflicted if
taken from my tree or DaveM's tree as it needed our trees merged to
come cleanly.
The second tag contains the patch series from Intel plus three other
stragllers that came in late last week. I took them because it allowed
me to legitimately claim that the RDMA patchworks queue was, for a
short time, 100% cleared of all waiting kernel patches, woohoo! :-).
I have it under my for-next tag, so it did get 0day and linux- next
over the end of last week, and linux-next did show one minor conflict.
Summary:
'for-linus' tag:
- mlx5/IPoIB fixup patch
'for-next' tag:
- the hfi1 15 patch set that landed late
- IPoIB get_link_ksettings which landed late because I asked for a
respin
- one late rxe change
- one -rc worthy fix that's in early"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/mlx5: Enable IPoIB acceleration
* tag 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
rxe: expose num_possible_cpus() cnum_comp_vectors
IB/rxe: Update caller's CRC for RXE_MEM_TYPE_DMA memory type
IB/hfi1: Clean up on context initialization failure
IB/hfi1: Fix an assign/ordering issue with shared context IDs
IB/hfi1: Clean up context initialization
IB/hfi1: Correctly clear the pkey
IB/hfi1: Search shared contexts on the opened device, not all devices
IB/hfi1: Remove atomic operations for SDMA_REQ_HAVE_AHG bit
IB/hfi1: Use filedata rather than filepointer
IB/hfi1: Name function prototype parameters
IB/hfi1: Fix a subcontext memory leak
IB/hfi1: Return an error on memory allocation failure
IB/hfi1: Adjust default eager_buffer_size to 8MB
IB/hfi1: Get rid of divide when setting the tx request header
IB/hfi1: Fix yield logic in send engine
IB/hfi1, IB/rdmavt: Move r_adefered to r_lock cache line
IB/hfi1: Fix checks for Offline transient state
IB/ipoib: add get_link_ksettings in ethtool
Enable mlx5 IPoIB acceleration by declaring
mlx5_ib_{alloc,free}_rdma_netdev and assigning the mlx5
IPoIB rdma_netdev callbacks.
In addition, this patch brings in sync mlx5's IPoIB parts for net and IB
trees. As a precaution, we disabled IPoIB acceleration by default (in
the mlx5_core Kconfig file).
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull rdma updates from Doug Ledford:
"More exchaustive description of primary updates in this release:
- Lots of driver fixes and misc fixes across the board.
- I had to base on a net-next tree because the IPoIB Accelorator
patches needed it.
Unfortunately, it was known to Mellanox that there would need to be
an IPoIB accelorator patch to the net tree (which left some
functions turned off by an #ifdef construct to avoid warnings about
defined but unused functions), then one to the RDMA tree, then a
fixup that went back and re-enabled the functions in the net tree
and enabled their use in the rdma tree
Also, a sparse fix was sent to the net tree after I did my pull,
and the fixup patch conflicts quite directly with that sparse fix,
so I'm going to submit the fixup patch towards the end of the merge
window by itself and based upon your master branch at the time.
- Two separate rounds of hfi1 fixes, one that got dropped from last
release because it came in just a day or two before the end of the
merge window and then the one from this release cycle.
Of note is that I now have a third series that just landed from
Intel yesterday. It is not included in this pull request, but I may
submit it by the end of the week. I'll talk to Intel about
improving the timing of thier submissions for my workflow.
- Changes to our idr usage in the RDMA subsystem that will tie into
our cgroup management and also into the upcoming changes for the
RDMA kernel<->userspace API.
- Addition of support for a netdev to be tied to an RDMA device at
the core level
- Addition of the VNIC driver from Intel.
While IPoIB provides IP over InfiniBand (and *only* IP, no lower
layer protocol headers are allowed or supported), the VNIC driver
presents a virtual Ethernet device with support for things like
varying Ethertypes, VLANs, priorities and other features of
Ethernet.
The virtual devices are centrally managed by the OPA fabric
manager, making this (for the time being) a strictly OPA specific
feature.
- Improvements to the On-Demand Paging support in the RDMA subsystem.
- Addition of three significant OPA changes.
While we added OPA support some time ago (via the hfi1 driver), the
RDMA subsystem has so far glossed over the areas where OPA and
InfiniBand differ.
With this release we are starting to add support for the OPA
extensions into the RDMA core in the following area: Extended port
information for OPA is now supported, extended Address Handle
attributes for OPA are now supported, and extended SA Queries to
get OPA specific subnet information is now supported.
Concise summary from the tag:
- idr usage and locking changes
- build fix for hns
- ipoib debug path record file fix
- hfi1 updates
- core RDMA netdev addition
- Intel VNIC driver addition
- Enhanced accelerators for IPoIB addition
- Debug cleanups in cxgb3/4
- Trivial cleanups from SF Markus Elfring
- Misc rxe fixes from Mellanox
- Misc ipoib fixes from Mellanox
- Lots of mlx4/mlx5 changes from Mellanox
- Misc fixes across the RDMA subsystem
- ODP paging fixes and improvements
- qedr updates
- hfi1 updates
- OPA port info patches
- OPA AH patches
- OPA SA Query patches"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (191 commits)
infiniband: avoid dereferencing uninitialized dst on error path
IB/SA: Add OPA addr header
IB/mlx5: Add port_xmit_wait to counter registers read
IB/ocrdma: fix out of bounds access to local buffer
IB/mlx4: Fix incorrect order of formal and actual parameters
IB/mlx4: Change flush logic so it adheres to the variable name
mlx5: Fix mlx5_ib_map_mr_sg mr length
IB/rxe: Don't clamp residual length to mtu
IB/SA: Add support to query OPA path records
IB/SA: Add OPA path record type
IB/SA: Split struct sa_path_rec based on IB and ROCE specific fields
IB/SA: Introduce path record specific types
IB/SA: Rename ib_sa_path_rec to sa_path_rec
IB/CM: Add braces when using sizeof
IB/core: Define 'opa' rdma_ah_attr type
IB/core: Define 'ib' and 'roce' rdma_ah_attr types
IB/core: Use rdma_ah_attr accessor functions
IB/core: Add accessor functions for rdma_ah_attr fields
IB/PVRDMA: Rename ib_ah_attr related functions
IB/mthca: Rename to_ib_ah_attr to to_rdma_ah_attr
...
Add port_xmit_wait to the error counters read by mlx5_ib_process_mad to
ensure sysfs port counter provides correct value for PortXmitWait.
Otherwise the sysfs port_xmit_wait file always contains zero.
The previous MAD_IFC implementation populated this counter, but it was
removed during the migration to PPCNT for error counters (32-bit only).
Signed-off-by: Tim Wright <tim@binbash.co.uk>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When IP tunnel encapsulation rules are offloaded, the kernel can't see
the traffic of the offloaded flow. The neighbour for the IP tunnel
destination of the offloaded flow can mistakenly become STALE and
deleted by the kernel since its 'used' value wasn't changed.
To make sure that a neighbour which is used by the HW won't become
STALE, we proactively update the neighbour 'used' value every
DELAY_PROBE_TIME period, when packets were matched and counted by the HW
for one of the tunnel encap flows related to this neighbour.
The periodic task that updates the used neighbours is scheduled when a
tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
itself as long as the representor's neighbours list isn't empty.
Add, remove, lookup and status change operations done over the
representor's neighbours list or the neighbour hash entry encaps list
are all serialized by RTNL lock.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This patch adds support to query the congestion related hardware counters
through new command and links them with other hw counters being available
in hw_counters sysfs location.
In order to reuse existing infrastructure it renames related q_counter
data structures to more generic counters to reflect q_counters and
congestion counters and maybe some other counters in the future.
New hardware counters:
* rp_cnp_handled - CNP packets handled by the reaction point
* rp_cnp_ignored - CNP packets ignored by the reaction point
* np_cnp_sent - CNP packets sent by notification point to respond to
CE marked RoCE packets
* np_ecn_marked_roce_packets - CE marked RoCE packets received by
notification point
It also avoids returning ENOSYS which is specific for invalid
system call and produces the following checkpatch.pl warning.
WARNING: ENOSYS means 'invalid syscall nr' and nothing else
+ return -ENOSYS;
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>