Pull powerpc updates from Michael Ellerman:
- Add support for (optionally) using queued spinlocks & rwlocks.
- Support for a new faster system call ABI using the scv instruction on
Power9 or later.
- Drop support for the PROT_SAO mmap/mprotect flag as it will be
unsupported on Power10 and future processors, leaving us with no way
to implement the functionality it requests. This risks breaking
userspace, though we believe it is unused in practice.
- A bug fix for, and then the removal of, our custom stack expansion
checking. We now allow stack expansion up to the rlimit, like other
architectures.
- Remove the remnants of our (previously disabled) topology update
code, which tried to react to NUMA layout changes on virtualised
systems, but was prone to crashes and other problems.
- Add PMU support for Power10 CPUs.
- A change to our signal trampoline so that we don't unbalance the link
stack (branch return predictor) in the signal delivery path.
- Lots of other cleanups, refactorings, smaller features and so on as
usual.
Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.
* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
selftests/powerpc: Fix pkey syscall redefinitions
powerpc: Fix circular dependency between percpu.h and mmu.h
powerpc/powernv/sriov: Fix use of uninitialised variable
selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
powerpc/40x: Fix assembler warning about r0
powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
cpuidle: pseries: Fixup exit latency for CEDE(0)
cpuidle: pseries: Add function to parse extended CEDE records
cpuidle: pseries: Set the latency-hint before entering CEDE
selftests/powerpc: Fix online CPU selection
powerpc/perf: Consolidate perf_callchain_user_[64|32]()
powerpc/pseries/hotplug-cpu: Remove double free in error path
powerpc/pseries/mobility: Add pr_debug() for device tree changes
powerpc/pseries/mobility: Set pr_fmt()
powerpc/cacheinfo: Warn if cache object chain becomes unordered
powerpc/cacheinfo: Improve diagnostics about malformed cache lists
powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
powerpc/cacheinfo: Set pr_fmt()
powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
...
Pull KVM updates from Paolo Bonzini:
"s390:
- implement diag318
x86:
- Report last CPU for debugging
- Emulate smaller MAXPHYADDR in the guest than in the host
- .noinstr and tracing fixes from Thomas
- nested SVM page table switching optimization and fixes
Generic:
- Unify shadow MMU cache data structures across architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
KVM: SVM: Fix sev_pin_memory() error handling
KVM: LAPIC: Set the TDCR settable bits
KVM: x86: Specify max TDP level via kvm_configure_mmu()
KVM: x86/mmu: Rename max_page_level to max_huge_page_level
KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
KVM: VMX: Make vmx_load_mmu_pgd() static
KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
KVM: VMX: Drop a duplicate declaration of construct_eptp()
KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
KVM: Using macros instead of magic values
MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
KVM: VMX: Add guest physical address check in EPT violation and misconfig
KVM: VMX: introduce vmx_need_pf_intercept
KVM: x86: update exception bitmap on CPUID changes
KVM: x86: rename update_bp_intercept to update_exception_bitmap
...
Pull dma-mapping updates from Christoph Hellwig:
- make support for dma_ops optional
- move more code out of line
- add generic support for a dma_ops bypass mode
- misc cleanups
* tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping:
dma-contiguous: cleanup dma_alloc_contiguous
dma-debug: use named initializers for dir2name
powerpc: use the generic dma_ops_bypass mode
dma-mapping: add a dma_ops_bypass flag to struct device
dma-mapping: make support for dma ops optional
dma-mapping: inline the fast path dma-direct calls
dma-mapping: move the remaining DMA API calls out of line
Recently random.h started including percpu.h (see commit
f227e3ec3b ("random32: update the net random state on interrupt and
activity")), which broke corenet64_smp_defconfig:
In file included from /linux/arch/powerpc/include/asm/paca.h:18,
from /linux/arch/powerpc/include/asm/percpu.h:13,
from /linux/include/linux/random.h:14,
from /linux/lib/uuid.c:14:
/linux/arch/powerpc/include/asm/mmu.h:139:22: error: unknown type name 'next_tlbcam_idx'
139 | DECLARE_PER_CPU(int, next_tlbcam_idx);
This is due to a circular header dependency:
asm/mmu.h includes asm/percpu.h, which includes asm/paca.h, which
includes asm/mmu.h
Which means DECLARE_PER_CPU() isn't defined when mmu.h needs it.
We can fix it by moving the include of paca.h below the include of
asm-generic/percpu.h.
This moves the include of paca.h out of the #ifdef __powerpc64__, but
that is OK because paca.h is almost entirely inside #ifdef
CONFIG_PPC64 anyway.
It also moves the include of paca.h out of the #ifdef CONFIG_SMP,
which could possibly break something, but seems to have no ill
effects.
Fixes: f227e3ec3b ("random32: update the net random state on interrupt and activity")
Cc: stable@vger.kernel.org # v5.8
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200804130558.292328-1-mpe@ellerman.id.au
Though kdump kernel boots from loaded address, the first 64KB of it is
copied down to real 0. So, setup a backup region and let purgatory
copy the first 64KB of crashed kernel into this backup region before
booting into kdump kernel. Update reserve map with backup region and
crashed kernel's memory to avoid kdump kernel from accidentially using
that memory.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602294718.575379.16216507537038008623.stgit@hbathini
Currently, numa & prom are the only users of drmem LMB walk code.
Loading kdump with kexec_file also needs to walk the drmem LMBs to
setup the usable memory ranges for kdump kernel. But there are couple
of issues in using the code as is. One, walk_drmem_lmb() code is built
into the .init section currently, while kexec_file needs it later.
Two, there is no scope to pass data to the callback function for
processing and/or erroring out on certain conditions.
Fix that by, moving drmem LMB walk code out of .init section, adding
scope to pass data to the callback function and bailing out when an
error is encountered in the callback function.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602282727.575379.3979857013827701828.stgit@hbathini
crashkernel region could have an overlap with special memory regions
like OPAL, RTAS, TCE table & such. These regions are referred to as
excluded memory ranges. Setup these ranges during image probe in order
to avoid them while finding the buffer for different kdump segments.
Override arch_kexec_locate_mem_hole() to locate a memory hole taking
these ranges into account.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602281047.575379.6636807148335160795.stgit@hbathini
In kexec case, the kernel to be loaded uses the same memory layout as
the running kernel. So, passing on the DT of the running kernel would
be good enough.
But in case of kdump, different memory ranges are needed to manage
loading the kdump kernel, booting into it and exporting the elfcore of
the crashing kernel. The ranges are exclude memory ranges, usable
memory ranges, reserved memory ranges and crash memory ranges.
Exclude memory ranges specify the list of memory ranges to avoid while
loading kdump segments. Usable memory ranges list the memory ranges
that could be used for booting kdump kernel. Reserved memory ranges
list the memory regions for the loading kernel's reserve map. Crash
memory ranges list the memory ranges to be exported as the crashing
kernel's elfcore.
Add helper functions for setting up the above mentioned memory ranges.
This helpers facilitate in understanding the subsequent changes better
and make it easy to setup the different memory ranges listed above, as
and when appropriate.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602279194.575379.8526552316948643550.stgit@hbathini
This adds a kernel command line option that can be used to disable GTSE support.
Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.
This was done so that we can do VM migration between configs that enable/disable
GTSE support via hypervisor. To migrate a VM from a system that supports
GTSE to a system that doesn't, we can boot the guest with
radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
invalidates.
The check for hcall availability is done in pSeries_setup_arch so that
the panic message appears on the console. This should only happen on
a hypervisor that doesn't force the guest to hash translation even
though it can't handle the radix GTSE=0 request via CAS. With
radix_hcall_invalidate=on if the hypervisor doesn't support hcall_rpt_invalidate
hcall it should force the LPAR to hash translation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727085908.420806-1-aneesh.kumar@linux.ibm.com
There's a comment in time.h referring to CONFIG_POWER, which doesn't
exist. That confuses scripts/checkkconfigsymbols.py.
Presumably the comment was referring to a CONFIG_POWER vs CONFIG_PPC,
in which case for CONFIG_POWER we would #define __USE_RTC to 1. But
instead we have CONFIG_PPC_BOOK3S_601, and these days we have
IS_ENABLED().
So the comment is no longer relevant, drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-9-mpe@ellerman.id.au
KVM guests have certain restrictions and performance quirks when using
doorbells. This patch moves the EPAPR KVM guest test so it can be shared
with PSERIES, and uses that in doorbell setup code to apply the KVM
guest quirks and improves IPI performance for two cases:
- PowerVM guests may now use doorbells even if they are secure.
- KVM guests no longer use doorbells if XIVE is available.
There is a valid complaint that "KVM guest" is not a very reasonable
thing to test for, it's preferable for the hypervisor to advertise
particular behaviours to the guest so they could change if the
hypervisor implementation or configuration changes. However in this case
we were already assuming a KVM guest worst case, so this patch is about
containing those quirks. If KVM later advertises fast doorbells, we
should test for that and override the quirks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-4-npiggin@gmail.com
Commit 9908c826d5 ("powerpc/perf: Add Power10 PMU feature to DT CPU
features") defines MMCRA_BHRB_DISABLE as `0x2000000000UL`. Binutils
version less than 2.28 doesn't support UL suffix.
arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of range (0x0000002000000000 is not between 0xffffffffffff8000 and 0x000000000000ffff)
Fix this by wrapping it with the `_UL` macro.
Fixes: 9908c826d5 ("Add Power10 PMU feature to DT CPU features")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1595996214-5833-1-git-send-email-atrajeev@linux.vnet.ibm.com
When a memory slot is hot plugged to a SVM, PFNs associated with the
GFNs in that slot must be migrated to the secure-PFNs, aka device-PFNs.
Call kvmppc_uv_migrate_mem_slot() to accomplish this.
Disable page-merge for all pages in the memory slot.
Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[rearranged the code, and modified the commit log]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This implements the generic paravirt qspinlocks using H_PROD and
H_CONFER to kick and wait.
This uses an un-directed yield to any CPU rather than the directed
yield to a pre-empted lock holder that paravirtualised simple
spinlocks use, that requires no kick hcall. This is something that
could be investigated and improved in future.
Performance results can be found in the commit which added queued
spinlocks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-5-npiggin@gmail.com
These have shown significantly improved performance and fairness when
spinlock contention is moderate to high on very large systems.
With this series including subsequent patches, on a 16 socket 1536
thread POWER9, a stress test such as same-file open/close from all
CPUs gets big speedups, 11620op/s aggregate with simple spinlocks vs
384158op/s (33x faster), where the difference in throughput between
the fastest and slowest thread goes from 7x to 1.4x.
Thanks to the fast path being identical in terms of atomics and
barriers (after a subsequent optimisation patch), single threaded
performance is not changed (no measurable difference).
On smaller systems, performance and fairness seems to be generally
improved. Using dbench on tmpfs as a test (that starts to run into
kernel spinlock contention), a 2-socket OpenPOWER POWER9 system was
tested with bare metal and KVM guest configurations. Results can be
found here:
https://github.com/linuxppc/issues/issues/305#issuecomment-663487453
Observations are:
- Queued spinlocks are equal when contention is insignificant, as
expected and as measured with microbenchmarks.
- When there is contention, on bare metal queued spinlocks have better
throughput and max latency at all points.
- When virtualised, queued spinlocks are slightly worse approaching
peak throughput, but significantly better throughput and max latency
at all points beyond peak, until queued spinlock maximum latency
rises when clients are 2x vCPUs.
The regressions haven't been analysed very well yet, there are a lot
of things that can be tuned, particularly the paravirtualised locking,
but the numbers already look like a good net win even on relatively
small systems.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-4-npiggin@gmail.com
pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.
This patch also moves the PowerNV SR-IOV specific fields from pci_dn and
moves them into a platform specific structure. I'm not sure how they ended
up in there in the first place, but leaking platform specifics into common
code has proven to be a terrible idea so far so lets stop doing that.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-5-oohall@gmail.com
The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
follows the PCI bus structures because a reset asserted on an upstream
bridge will be propagated to the downstream bridges. On pseries there's a
1-1 correspondence between what the guest sees are a PHB and a PE so the
"tree" is really just a single node.
Current the EEH core is reponsible for setting up this PE tree which it
does by traversing the pci_dn tree. The structure of the pci_dn tree
matches the bus tree on PowerNV which leads to the PE tree being "correct"
this setup method doesn't make a whole lot of sense and it's actively
confusing for the pseries case where it doesn't really do anything.
We want to remove the dependence on pci_dn anyway so this patch move
choosing where to insert a new PE into the platform code rather than
being part of the generic EEH code. For PowerNV this simplifies the
tree building logic and removes the use of pci_dn. For pseries we
keep the existing logic. I'm not really convinced it does anything
due to the 1-1 PE-to-PHB correspondence so every device under that
PHB should be in the same PE, but I'd rather not remove it entirely
until we've had a chance to look at it more deeply.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-14-oohall@gmail.com
The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect
what they actually do. If the PE referred to be edev->pe_config_addr
already exists under that PHB then the edev is added to that PE. However,
if the PE doesn't exist the a new one is created for the edev.
The bulk of the implementation of eeh_add_to_parent_pe() covers that
second case. Similarly, most of eeh_remove_from_parent_pe() is
determining when it's safe to delete a PE.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-12-oohall@gmail.com
The edev->class_code field is never referenced anywhere except for the
platform specific probe functions. The same information is available in
the pci_dev for PowerNV and in the pci_dn on pseries so we can remove
the field.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-11-oohall@gmail.com