Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
Pull iommu updates from Joerg Roedel:
- Remove of the dev->archdata.iommu (or similar) pointers from most
architectures. Only Sparc is left, but this is private to Sparc as
their drivers don't use the IOMMU-API.
- ARM-SMMU updates from Will Deacon:
- Support for SMMU-500 implementation in Marvell Armada-AP806 SoC
- Support for SMMU-500 implementation in NVIDIA Tegra194 SoC
- DT compatible string updates
- Remove unused IOMMU_SYS_CACHE_ONLY flag
- Move ARM-SMMU drivers into their own subdirectory
- Intel VT-d updates from Lu Baolu:
- Misc tweaks and fixes for vSVA
- Report/response page request events
- Cleanups
- Move the Kconfig and Makefile bits for the AMD and Intel drivers into
their respective subdirectory.
- MT6779 IOMMU Support
- Support for new chipsets in the Renesas IOMMU driver
- Other misc cleanups and fixes (e.g. to improve compile test coverage)
* tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits)
iommu/amd: Move Kconfig and Makefile bits down into amd directory
iommu/vt-d: Move Kconfig and Makefile bits down into intel directory
iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu
iommu: Add gfp parameter to io_pgtable_ops->map()
iommu: Mark __iommu_map_sg() as static
iommu/vt-d: Rename intel-pasid.h to pasid.h
iommu/vt-d: Add page response ops support
iommu/vt-d: Report page request faults for guest SVA
iommu/vt-d: Add a helper to get svm and sdev for pasid
iommu/vt-d: Refactor device_to_iommu() helper
iommu/vt-d: Disable multiple GPASID-dev bind
iommu/vt-d: Warn on out-of-range invalidation address
iommu/vt-d: Fix devTLB flush for vSVA
iommu/vt-d: Handle non-page aligned address
iommu/vt-d: Fix PASID devTLB invalidation
iommu/vt-d: Remove global page support in devTLB flush
iommu/vt-d: Enforce PASID devTLB field mask
iommu: Make some functions static
iommu/amd: Remove double zero check
...
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
The APIC headers are relatively complex and bring in additional
header dependencies - while smp.h is a relatively simple header
included from high level headers.
Remove the dependency and add in the missing #include's in .c
files where they gained it indirectly before.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A pasid might be bound to a page table from a VM guest via the iommu
ops.sva_bind_gpasid. In this case, when a DMA page fault is detected
on the physical IOMMU, we need to inject the page fault request into
the guest. After the guest completes handling the page fault, a page
response need to be sent back via the iommu ops.page_response().
This adds support to report a page request fault. Any external module
which is interested in handling this fault should regiester a notifier
with iommu_register_device_fault_handler().
Co-developed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Co-developed-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20200724014925.15523-11-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
For the unlikely use case where multiple aux domains from the same pdev
are attached to a single guest and then bound to a single process
(thus same PASID) within that guest, we cannot easily support this case
by refcounting the number of users. As there is only one SL page table
per PASID while we have multiple aux domains thus multiple SL page tables
for the same PASID.
Extra unbinding guest PASID can happen due to race between normal and
exception cases. Termination of one aux domain may affect others unless
we actively track and switch aux domains to ensure the validity of SL
page tables and TLB states in the shared PASID entry.
Support for sharing second level PGDs across domains can reduce the
complexity but this is not available due to the limitations on VFIO
container architecture. We can revisit this decision once sharing PGDs
are available.
Overall, the complexity and potential glitch do not warrant this unlikely
use case thereby removed by this patch.
Fixes: 56722a4398 ("iommu/vt-d: Add bind guest PASID support")
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20200724014925.15523-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
For guest requested IOTLB invalidation, address and mask are provided as
part of the invalidation data. VT-d HW silently ignores any address bits
below the mask. SW shall also allow such case but give warning if
address does not align with the mask. This patch relax the fault
handling from error to warning and proceed with invalidation request
with the given mask.
Fixes: 6ee1b77ba3 ("iommu/vt-d: Add svm/sva invalidate function")
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20200724014925.15523-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
For guest SVA usage, in order to optimize for less VMEXIT, guest request
of IOTLB flush also includes device TLB.
On the host side, IOMMU driver performs IOTLB and implicit devTLB
invalidation. When PASID-selective granularity is requested by the guest
we need to derive the equivalent address range for devTLB instead of
using the address information in the UAPI data. The reason for that is,
unlike IOTLB flush, devTLB flush does not support PASID-selective
granularity. This is to say, we need to set the following in the PASID
based devTLB invalidation descriptor:
- entire 64 bit range in address ~(0x1 << 63)
- S bit = 1 (VT-d CH 6.5.2.6).
Without this fix, device TLB flush range is not set properly for PASID
selective granularity. This patch also merged devTLB flush code for both
implicit and explicit cases.
Fixes: 6ee1b77ba3 ("iommu/vt-d: Add svm/sva invalidate function")
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20200724014925.15523-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Address information for device TLB invalidation comes from userspace
when device is directly assigned to a guest with vIOMMU support.
VT-d requires page aligned address. This patch checks and enforce
address to be page aligned, otherwise reserved bits can be set in the
invalidation descriptor. Unrecoverable fault will be reported due to
non-zero value in the reserved bits.
Fixes: 61a06a16e3 ("iommu/vt-d: Support flushing more translation cache types")
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20200724014925.15523-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
DevTLB flush can be used for both DMA request with and without PASIDs.
The former uses PASID#0 (RID2PASID), latter uses non-zero PASID for SVA
usage.
This patch adds a check for PASID value such that devTLB flush with
PASID is used for SVA case. This is more efficient in that multiple
PASIDs can be used by a single device, when tearing down a PASID entry
we shall flush only the devTLB specific to a PASID.
Fixes: 6f7db75e1c ("iommu/vt-d: Add second level page table")
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20200724014925.15523-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit 711419e504 ("irqdomain: Add the missing assignment of
domain->fwnode for named fwnode") unintentionally caused a dangling pointer
page fault issue on firmware nodes that were freed after IRQ domain
allocation. Commit e3beca48a4 fixed that dangling pointer issue by only
freeing the firmware node after an IRQ domain allocation failure. That fix
no longer frees the firmware node immediately, but leaves the firmware node
allocated after the domain is removed.
The firmware node must be kept around through irq_domain_remove, but should be
freed it afterwards.
Add the missing free operations after domain removal where where appropriate.
Fixes: e3beca48a4 ("irqdomain/treewide: Keep firmware node unconditionally allocated")
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1595363169-7157-1-git-send-email-jonathan.derrick@intel.com
Quite some non OF/ACPI users of irqdomains allocate firmware nodes of type
IRQCHIP_FWNODE_NAMED or IRQCHIP_FWNODE_NAMED_ID and free them right after
creating the irqdomain. The only purpose of these FW nodes is to convey
name information. When this was introduced the core code did not store the
pointer to the node in the irqdomain. A recent change stored the firmware
node pointer in irqdomain for other reasons and missed to notice that the
usage sites which do the alloc_fwnode/create_domain/free_fwnode sequence
are broken by this. Storing a dangling pointer is dangerous itself, but in
case that the domain is destroyed later on this leads to a double free.
Remove the freeing of the firmware node after creating the irqdomain from
all affected call sites to cure this.
Fixes: 711419e504 ("irqdomain: Add the missing assignment of domain->fwnode for named fwnode")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/873661qakd.fsf@nanos.tec.linutronix.de
The Scalable-mode Page-walk Coherency (SMPWC) field in the VT-d extended
capability register indicates the hardware coherency behavior on paging
structures accessed through the pasid table entry. This is ignored in
current code and using ECAP.C instead which is only valid in legacy mode.
Fix this so that paging structure updates could be manually flushed from
the cache line if hardware page walking is not snooped.
Fixes: 765b6a98c1 ("iommu/vt-d: Enumerate the scalable mode capability")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lore.kernel.org/r/20200622231345.29722-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
PCI ACS is disabled if Intel IOMMU is off by default or intel_iommu=off
is used in command line. Unfortunately, Intel IOMMU will be forced on if
there're devices sitting on an external facing PCI port that is marked
as untrusted (for example, thunderbolt peripherals). That means, PCI ACS
is disabled while Intel IOMMU is forced on to isolate those devices. As
the result, the devices of an MFD will be grouped by a single group even
the ACS is supported on device.
[ 0.691263] pci 0000:00:07.1: Adding to iommu group 3
[ 0.691277] pci 0000:00:07.2: Adding to iommu group 3
[ 0.691292] pci 0000:00:07.3: Adding to iommu group 3
Fix it by requesting PCI ACS when Intel IOMMU is detected with platform
opt in hint.
Fixes: 89a6079df7 ("iommu/vt-d: Force IOMMU on for platform opt in hint")
Co-developed-by: Lalithambika Krishnakumar <lalithambika.krishnakumar@intel.com>
Signed-off-by: Lalithambika Krishnakumar <lalithambika.krishnakumar@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20200622231345.29722-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When using first-level translation for IOVA, currently the U/S bit in the
page table is cleared which implies DMA requests with user privilege are
blocked. As the result, following error messages might be observed when
passing through a device to user level:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [41:00.0] PASID 1 fault addr 7ecdcd000
[fault reason 129] SM: U/S set 0 for first-level translation
with user privilege
This fixes it by setting U/S bit in the first level page table and makes
IOVA over first level compatible with previous second-level translation.
Fixes: b802d070a5 ("iommu/vt-d: Use iova over first level")
Reported-by: Xin Zeng <xin.zeng@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20200622231345.29722-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Pull iommu driver directory structure cleanup from Joerg Roedel:
"Move the Intel and AMD IOMMU drivers into their own subdirectory.
Both drivers consist of several files by now and giving them their own
directory unclutters the IOMMU top-level directory a bit"
* tag 'iommu-drivers-move-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Move Intel IOMMU driver into subdirectory
iommu/amd: Move AMD IOMMU driver into subdirectory