[ Upstream commit 6d0b65569c0a10b27c49bacd8d25bcd406003533 ]
Fix race condition between Interrupt thread and Chip reset thread in trying
to flush the same mailbox. With the race condition, the "ha->mbx_intr_comp"
will get an extra complete() call. The extra complete call create erroneous
mailbox timeout condition when the next mailbox is sent where the mailbox
call does not wait for interrupt to arrive. Instead, it advances without
waiting.
Add lock protection around the check for mailbox completion.
Cc: stable@vger.kernel.org
Fixes: b2000805a9 ("scsi: qla2xxx: Flush mailbox commands on chip reset")
Signed-off-by: Quinn Tran <quinn.tran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230821130045.34850-3-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5b51f35d127e7bef55fa869d2465e2bca4636454 upstream.
Link up failure occurred where driver failed to see certain events from FW
indicating link up (AEN 8011) and fabric login completion (AEN 8014).
Without these 2 events, driver would not proceed forward to scan the
fabric. The cause of this is due to delay in the receive of interrupt for
Mailbox 60 that causes qla to set the fw_started flag late. The late
setting of this flag causes other interrupts to be dropped. These dropped
interrupts happen to be the link up (AEN 8011) and fabric login completion
(AEN 8014).
Set fw_started flag early to prevent interrupts being dropped.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-6-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6dfe4344c168c6ca20fe7640649aacfcefcccb26 upstream.
System crash when using debug kernel due to link list corruption. The cause
of the link list corruption is due to session deletion was allowed to queue
up twice. Here's the internal trace that show the same port was allowed to
double queue for deletion on different cpu.
20808683956 015 qla2xxx [0000:13:00.1]-e801:4: Scheduling sess ffff93ebf9306800 for deletion 50:06:0e:80:12:48:ff:50 fc4_type 1
20808683957 027 qla2xxx [0000:13:00.1]-e801:4: Scheduling sess ffff93ebf9306800 for deletion 50:06:0e:80:12:48:ff:50 fc4_type 1
Move the clearing/setting of deleted flag lock.
Cc: stable@vger.kernel.org
Fixes: 726b854870 ("qla2xxx: Add framework for async fabric discovery")
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230714070104.40052-2-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc0cba0c7be8261a1625098bd1d695077ec621c9 upstream.
System crash due to use after free.
Current code allows terminate_rport_io to exit before making
sure all IOs has returned. For FCP-2 device, IO's can hang
on in HW because driver has not tear down the session in FW at
first sign of cable pull. When dev_loss_tmo timer pops,
terminate_rport_io is called and upper layer is about to
free various resources. Terminate_rport_io trigger qla to do
the final cleanup, but the cleanup might not be fast enough where it
leave qla still holding on to the same resource.
Wait for IO's to return to upper layer before resources are freed.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230428075339.32551-7-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0367076b0817d5c75dfb83001ce7ce5c64d803a9 upstream.
While adding and removing the controller, the following call trace was
observed:
WARNING: CPU: 3 PID: 623596 at kernel/dma/mapping.c:532 dma_free_attrs+0x33/0x50
CPU: 3 PID: 623596 Comm: sh Kdump: loaded Not tainted 5.14.0-96.el9.x86_64 #1
RIP: 0010:dma_free_attrs+0x33/0x50
Call Trace:
qla2x00_async_sns_sp_done+0x107/0x1b0 [qla2xxx]
qla2x00_abort_srb+0x8e/0x250 [qla2xxx]
? ql_dbg+0x70/0x100 [qla2xxx]
__qla2x00_abort_all_cmds+0x108/0x190 [qla2xxx]
qla2x00_abort_all_cmds+0x24/0x70 [qla2xxx]
qla2x00_abort_isp_cleanup+0x305/0x3e0 [qla2xxx]
qla2x00_remove_one+0x364/0x400 [qla2xxx]
pci_device_remove+0x36/0xa0
__device_release_driver+0x17a/0x230
device_release_driver+0x24/0x30
pci_stop_bus_device+0x68/0x90
pci_stop_and_remove_bus_device_locked+0x16/0x30
remove_store+0x75/0x90
kernfs_fop_write_iter+0x11c/0x1b0
new_sync_write+0x11f/0x1b0
vfs_write+0x1eb/0x280
ksys_write+0x5f/0xe0
do_syscall_64+0x5c/0x80
? do_user_addr_fault+0x1d8/0x680
? do_syscall_64+0x69/0x80
? exc_page_fault+0x62/0x140
? asm_exc_page_fault+0x8/0x30
entry_SYSCALL_64_after_hwframe+0x44/0xae
The command was completed in the abort path during driver unload with a
lock held, causing the warning in abort path. Hence complete the command
without any lock held.
Reported-by: Lin Li <lilin@redhat.com>
Tested-by: Lin Li <lilin@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Link: https://lore.kernel.org/r/20230313043711.13500-2-njavali@marvell.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3fbc74feb642deb688cc97f76d40b7287ddd4cb1 upstream.
If after an adapter reset the appearance of link is not recovered, the
devices are not rediscovered. This is result of a race condition between
adapter reset (abort_isp) and the topology scan. During adapter reset, the
ABORT_ISP_ACTIVE flag is set. Topology scan usually occurred after adapter
reset. In this case, the topology scan came earlier than usual where it
ran into problem due to ABORT_ISP_ACTIVE flag was still set.
kernel: qla2xxx [0000:13:00.0]-1005:1: Cmd 0x6a aborted with timeout since ISP Abort is pending
kernel: qla2xxx [0000:13:00.0]-28a0:1: MBX_GET_PORT_NAME failed, No FL Port.
kernel: qla2xxx [0000:13:00.0]-286b:1: qla2x00_configure_loop: exiting normally. local port wwpn 51402ec0123d9a80 id 012300)
kernel: qla2xxx [0000:13:00.0]-8017:1: ADAPTER RESET SUCCEEDED nexus=1:0:15.
Allow adapter reset to complete before any scan can start.
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c75e6aef5039830cce5d4cf764dd204522f89e6b upstream.
The following message and call trace was seen with debug kernels:
DMA-API: qla2xxx 0000:41:00.0: device driver failed to check map
error [device address=0x00000002a3ff38d8] [size=1024 bytes] [mapped as
single]
WARNING: CPU: 0 PID: 2930 at kernel/dma/debug.c:1017
check_unmap+0xf42/0x1990
Call Trace:
debug_dma_unmap_page+0xc9/0x100
qla_nvme_ls_unmap+0x141/0x210 [qla2xxx]
Remove DMA mapping from the driver altogether, as it is already done by FC
layer. This prevents the warning.
Fixes: c85ab7d9e27a ("scsi: qla2xxx: Fix missed DMA unmap for NVMe ls requests")
Cc: stable@vger.kernel.org
Signed-off-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b1ae65c082f74536ec292b15766f2846f0238373 upstream.
User experienced symptoms of adapter failure in NPIV environment. NPIV
hosts were allowed to trigger chip reset back to back due to NPIV link
state being slow to come online.
Fix link failure in NPIV environment by removing NPIV host from directly
being able to perform chip reset.
kernel: qla2xxx [0000:04:00.1]-6009:261: Loop down - aborting ISP.
kernel: qla2xxx [0000:04:00.1]-6009:262: Loop down - aborting ISP.
kernel: qla2xxx [0000:04:00.1]-6009:281: Loop down - aborting ISP.
kernel: qla2xxx [0000:04:00.1]-6009:285: Loop down - aborting ISP
Fixes: 0d6e61bc6a ("[SCSI] qla2xxx: Correct various NPIV issues.")
Cc: stable@vger.kernel.org
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 53661ded2460b414644532de6b99bd87f71987e9 ]
This partially reverts commit d2b292c3f6 ("scsi: qla2xxx: Enable ATIO
interrupt handshake for ISP27XX")
For some workloads where the host sends a batch of commands and then
pauses, ATIO interrupt coalesce can cause some incoming ATIO entries to be
ignored for extended periods of time, resulting in slow performance,
timeouts, and aborted commands.
Disable interrupt coalesce and re-enable the dedicated ATIO MSI-X
interrupt.
Link: https://lore.kernel.org/r/97dcf365-89ff-014d-a3e5-1404c6af511c@cybernetics.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 26f9ce53817a8fd84b69a73473a7de852a24c897 ]
Aborting commands that have already been sent to the firmware can
cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped)
For instance:
- Command passes rdx_to_xfer state, maps sgl, sends to the firmware
- Reset occurs, qla2xxx performs ISP error recovery, aborts the command
- Target stack calls qlt_abort_cmd() and then qlt_free_cmd()
- BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not
unmapped
Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag set.
Link: https://lore.kernel.org/r/AS8PR10MB4952D545F84B6B1DFD39EC1E9DEE9@AS8PR10MB4952.EURPRD10.PROD.OUTLOOK.COM
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c02aada06d19a215c8291bd968a99a270e96f734 upstream.
User experienced device lost. The log shows Get port data base command was
queued up, failed, and requeued again. Every time it is requeued, it set
the FCF_ASYNC_ACTIVE. This prevents any recovery code from occurring
because driver thinks a recovery is in progress for this session. In
essence, this session is hung. The reason it gets into this place is the
session deletion got in front of this call due to link perturbation.
Break the requeue cycle and exit. The session deletion code will trigger a
session relogin.
Link: https://lore.kernel.org/r/20220310092604.22950-8-njavali@marvell.com
Fixes: 726b854870 ("qla2xxx: Add framework for async fabric discovery")
Cc: stable@vger.kernel.org
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6a45c8e137d4e2c72eecf1ac7cf64f2fdfcead99 upstream.
User experienced some of the LUN failed to get rediscovered after long
cable pull test. The issue is triggered by a race condition between driver
setting session online state vs starting the LUN scan process at the same
time. Current code set the online state after notifying the session is
available. In this case, trigger to start the LUN scan process happened
before driver could set the session in online state. LUN scan ends up with
failure due to the session online check was failing.
Set the online state before reporting of the availability of the session.
Link: https://lore.kernel.org/r/20220310092604.22950-3-njavali@marvell.com
Fixes: aecf043443 ("scsi: qla2xxx: Fix Remote port registration")
Cc: stable@vger.kernel.org
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>