When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits. This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format. It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables. However, there are other causes for hypervisor
instruction page faults, such as not being to update a reference
(R) or change (C) bit. The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.
In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y. In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.
The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered. On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE. At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.
The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.
Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions. This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved. The
condition register and DSCR are now always preserved.
With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero. So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.
Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9. This enables the code to be simplified
quite a bit.
_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest. With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.
Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path. The calling
convention is made compatible with C.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is based on a patch by Suraj Jitindar Singh.
This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c. This is to
make future maintenance and modification easier. The algorithm
expressed in the C code is almost identical to the previous
algorithm.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap. This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede. Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller). The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.
In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller. That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit. This
patch does that.
With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt. In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode. Nested hypervisors can just use the TM
facilities as architected. Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.
Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when when CPU_FTR_HVMODE is set.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The kvmppc_gpa_to_ua() helper itself takes care of the permission
bits in the TCE and yet every single caller removes them.
This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
(which are GPAs + TCE permission bits) to make the callers simpler.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update
the hardware tables, we print a warning once, clear the entry and
continue. This is so as at the time the assumption was that if
a VFIO device is hotplugged into the guest, and the userspace replays
virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails,
then there is nothing useful we can do about it.
However the assumption is not valid as these handlers are not called for
TCE replay (VFIO ioctl interface is used for that) and these handlers
are for new TCEs.
This returns an error to the guest if there is a request which cannot be
processed. By now the only possible failure must be H_TOO_HARD.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The userspace can request an arbitrary supported page size for a DMA
window and this works fine as long as the mapped memory is backed with
the pages of the same or bigger size; if this is not the case,
mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
dangerously incorrect TCEs.
However since it is quite easy to misconfigure the KVM and we do not do
reverts to all changes made to TCE tables if an error happens in a middle,
we better do the acceptable page size validation before we even touch
the tables.
This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
against the preregistered memory page sizes.
Since the new check uses real/virtual mode helpers, this renames
kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
case and mirrors it for the virtual mode under the old name. The real
mode handler is not used for the virtual mode as:
1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
virtual mode does not have to use and since on POWER9+radix only virtual
mode handlers actually work, we do not want to slow down that path even
a bit.
This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
are static now.
From now on the attempts on mapping IOMMU pages bigger than allowed
will result in KVM exit.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that the 68k Mac port has adopted the via-pmu driver, the same RTC
code can be shared between m68k and powerpc. Replace duplicated code in
arch/powerpc and arch/m68k with common RTC accessors for Cuda and PMU.
Drop the problematic WARN_ON which was introduced in commit 22db552b50
("powerpc/powermac: Fix rtc read/write functions").
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael writes:
"powerpc fixes for 4.19 #4
Four regression fixes.
A fix for a change to lib/xz which broke our zImage loader when
building with XZ compression. OK'ed by Herbert who merged the
original patch.
The recent fix we did to avoid patching __init text broke some 32-bit
machines, fix that.
Our show_user_instructions() could be tricked into printing kernel
memory, add a check to avoid that.
And a fix for a change to our NUMA initialisation logic, which causes
crashes in some kdump configurations.
Thanks to:
Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."
* tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/numa: Skip onlining a offline node in kdump path
powerpc: Don't print kernel instructions in show_user_instructions()
powerpc/lib: fix book3s/32 boot failure due to code patching
lib/xz: Put CRC32_POLY_LE in xz_private.h
Paolo writes:
"KVM changes for 4.19-rc7
x86 and PPC bugfixes, mostly introduced in 4.19-rc1."
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: nVMX: fix entry with pending interrupt if APICv is enabled
KVM: VMX: hide flexpriority from guest when disabled at the module level
KVM: VMX: check for existence of secondary exec controls before accessing
KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page fault
KVM: x86: fix L1TF's MMIO GFN calculation
tools/kvm_stat: cut down decimal places in update interval dialog
KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS
KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly
KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled
KVM: x86: never trap MSR_KERNEL_GS_BASE
Recently we implemented show_user_instructions() which dumps the code
around the NIP when a user space process dies with an unhandled
signal. This was modelled on the x86 code, and we even went so far as
to implement the exact same bug, namely that if the user process
crashed with its NIP pointing into the kernel we will dump kernel text
to dmesg. eg:
bad-bctr[2996]: segfault (11) at c000000000010000 nip c000000000010000 lr 12d0b0894 code 1
bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
bad-bctr[2996]: code: 3860001c f8010080 48242371 60000000 <7c7b1b79> 4082002c e8010080 eb610048
This was discovered on x86 by Jann Horn and fixed in commit
342db04ae7 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
Fix it by checking the adjusted NIP value (pc) and number of
instructions against USER_DS, and bail if we fail the check, eg:
bad-bctr[2969]: segfault (11) at c000000000010000 nip c000000000010000 lr 107930894 code 1
bad-bctr[2969]: Bad NIP, not dumping instructions.
Fixes: 88b0fe1757 ("powerpc: Add show_user_instructions()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a mode where the vcore scheduling logic in HV KVM limits itself
to scheduling only virtual cores from the same VM on any given physical
core. This is enabled via a new module parameter on the kvm-hv module
called "one_vm_per_core". For this to work on POWER9, it is necessary to
set indep_threads_mode=N. (On POWER8, hardware limitations mean that KVM
is never in independent threads mode, regardless of the indep_threads_mode
setting.)
Thus the settings needed for this to work are:
1. The host is in SMT1 mode.
2. On POWER8, the host is not in 2-way or 4-way static split-core mode.
3. On POWER9, the indep_threads_mode parameter is N.
4. The one_vm_per_core parameter is Y.
With these settings, KVM can run up to 4 vcpus on a core at the same
time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest
threading mode), and will ensure that all of the vcpus belong to the
same VM.
This is intended for use in security-conscious settings where users are
concerned about possible side-channel attacks between threads which could
perhaps enable one VM to attack another VM on the same core, or the host.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.
To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.
Signed-off-by: Cameron Kaiser <spectre@floodgap.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process scoped LPID flush
did not have a flush, so add it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If CONFIG_PPC_WATCHDOG is enabled we always cap the decrementer to
0x7fffffff:
if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
set_dec(0x7fffffff);
else
set_dec(decrementer_max);
If there are no future events, we don't reprogram the decrementer
after this and we end up with 0x7fffffff even on a large decrementer
capable system.
As suggested by Nick, add a set_state_oneshot_stopped callback
so we program the decrementer with decrementer_max if there are
no future events.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Prior to this change only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point happened at the configurable
atsd_threshold which defaulted to 2M. Invalidates that size or smaller
would issue per-page invalidates for the whole range.
The NPU supports more invalidation sizes however: 64K, 2M, 1G, and all.
These invalidates target addresses aligned to their size. 2M is a common
invalidation size for GPU-enabled applications because that is a GPU
page size, so reducing the number of invalidates by 32x in that case is a
clear improvement.
ATSD latency is high in general so now we always issue a single invalidate
rather than multiple. This will over-invalidate in some cases, but for any
invalidation size over 2M it matches or improves the prior behavior.
There's also an improvement for single-page invalidates since the prior
version issued two invalidates for that case instead of one.
With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.
To show the benefit here are some performance numbers from a
microbenchmark which creates a 1G allocation then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.
One NPU (1 GPU):
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.3 5.6 5%
1M 39.3 57.4 46%
2M 49.7 82.6 66%
4M 286.6 285.7 0%
Two NPUs (6 GPUs):
mprotect rate (GB/s)
Stride Before After Speedup
64K 6.5 7.4 13%
1M 33.4 67.9 103%
2M 38.7 93.1 141%
4M 356.7 354.6 -1%
Anything over 2M is roughly the same as before since both cases issue a
single ATSD.
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. In both cases prior to this change, the sequence was:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
- EIEIO
- Write the launch value to issue the ATSD
First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.
Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:
for each NPU
- Write the target address to the XTS_ATSD_AVA register
EIEIO
for each NPU
- Write the launch value to issue the ATSD
Performance results were gathered using a microbenchmark which creates a
1G allocation then uses mprotect with PROT_NONE to trigger invalidates in
strides across the allocation.
With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).
With two NPUs active (on a 6-GPU system) the effect is more noticeable:
mprotect rate (GB/s)
Stride Before After Speedup
64K 5.9 6.5 10%
1M 31.2 33.4 7%
2M 36.3 38.7 7%
4M 322.6 356.7 11%
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When enumerating page size definitions to check hardware support,
we construct a constant which is (1U << (def->shift - 10)).
However, the array of page size definitions is only initalised for
various MMU_PAGE_* constants, so it contains a number of 0-initialised
elements with def->shift == 0. This means we end up shifting by a
very large number, which gives the following UBSan splat:
================================================================================
UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
Call Trace:
[c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
[c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
[c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
[c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
[c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
[c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
================================================================================
Fix this by first checking if the element exists (shift != 0) before
constructing the constant.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 71d29f43b6 ("KVM: PPC: Book3S HV: Don't use compound_order to
determine host mapping size", 2018-09-11) added a call to
__find_linux_pte() and a dereference of the returned PTE pointer to the
radix page fault path in the common case where the page is normal
system memory. Previously, __find_linux_pte() was only called for
mappings to physical addresses which don't have a page struct (e.g.
memory-mapped I/O) or where the page struct is marked as reserved
memory.
This exposes us to the possibility that the returned PTE pointer
could be NULL, for example in the case of a concurrent THP collapse
operation. Dereferencing the returned NULL pointer causes a host
crash.
To fix this, we check for NULL, and if it is NULL, we retry the
operation by returning to the guest, with the expectation that it
will generate the same page fault again (unless of course it has
been fixed up by another CPU in the meantime).
Fixes: 71d29f43b6 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members. The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add call to early_memtest() so that kernel compiled with
CONFIG_MEMTEST really perform memtest at startup when requested
via 'memtest' boot parameter.
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a process allocates a hugepage, the following leak is
reported by kmemleak. This is a false positive which is
due to the pointer to the table being stored in the PGD
as physical memory address and not virtual memory pointer.
unreferenced object 0xc30f8200 (size 512):
comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<e32b68da>] huge_pte_alloc+0xdc/0x1f8
[<9e0df1e1>] hugetlb_fault+0x560/0x8f8
[<7938ec6c>] follow_hugetlb_page+0x14c/0x44c
[<afbdb405>] __get_user_pages+0x1c4/0x3dc
[<b8fd7cd9>] __mm_populate+0xac/0x140
[<3215421e>] vm_mmap_pgoff+0xb4/0xb8
[<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
[<4fcd760f>] ret_from_syscall+0x0/0x38
See commit a984506c54 ("powerpc/mm: Don't report PUDs as
memory leaks when using kmemleak") for detailed explanation.
To fix that, this patch tells kmemleak to ignore the allocated
hugepage table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The comments in this file don't conform to the coding style so take
them to "Comment Formatting Re-Education Camp".
Suggested-by: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Reflow some comments and add full stops, fix spelling of Sergeant.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
for 64bit configs which use for CONFIG_LOG_BUF_SHIFT the same
or higher value than the default (currently 17).
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The code in machine_check_exception excludes 64s hvmode when
incrementing the MCE counter only to call opal_machine_check to
increment it specifically for this case.
Remove the exclusion and special case.
Fixes: a43c159042 ("powerpc/pseries: Flush SLB contents on SLB MCE
errors.")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.
This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
This function is not necessary, since MSR_TM_ACTIVE() just do the same and
could be used, removing the dualism and simplifying the code.
This patchset remove every instance of msr_tm_active() and replaced it
by MSR_TM_ACTIVE().
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a mismatch between function pnv_platform_error_reboot() definition
and declaration regarding function modifiers. In the declaration part, it
contains the function attribute __noreturn, while function definition
itself lacks it.
This was reported by sparse tool as an error:
arch/powerpc/platforms/powernv/opal.c:538:6: error: symbol 'pnv_platform_error_reboot' redeclared with different type (originally declared at arch/powerpc/platforms/powernv/powernv.h:11) - different modifiers
I checked and the function is already being considered as being 'noreturn'
by the compiler, thus, I understand this patch does not change any code
being generated.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a patch that adds support for PTRACE_SYSEMU ptrace request in
PowerPC architecture.
When ptrace(PTRACE_SYSEMU, ...) request is called, it will be handled by
the arch independent function ptrace_resume(), which will tag the task with
the TIF_SYSCALL_EMU flag. This flag needs to be handled from a platform
dependent point of view, which is what this patch does.
This patch adds this task's flag as part of the _TIF_SYSCALL_DOTRACE, which
is the MACRO that is used to trace syscalls at entrance/exit.
Since TIF_SYSCALL_EMU is now part of _TIF_SYSCALL_DOTRACE, if the task has
_TIF_SYSCALL_DOTRACE set, it will hit do_syscall_trace_enter() at syscall
entrance and do_syscall_trace_leave() at syscall leave.
do_syscall_trace_enter() needs to handle the TIF_SYSCALL_EMU flag properly,
which will interrupt the syscall executing if TIF_SYSCALL_EMU is set. The
output values should not be changed, i.e. the return value (r3) should
contain the original syscall argument on exit.
With this flag set, the syscall is not executed fundamentally, because
do_syscall_trace_enter() is returning -1 which is bigger than NR_syscall,
thus, skipping the syscall execution and exiting userspace.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Moving TIF_32BIT to use bit 20 instead of 4 in the task flag field.
This change is making room for an upcoming new task macro
(_TIF_SYSCALL_EMU) which is preferred to set a bit in the lower 16-bits
part of the word.
This upcoming flag macro will take part in a composed macro
(_TIF_SYSCALL_DOTRACE) which will contain other flags as well, and it is
preferred that the whole _TIF_SYSCALL_DOTRACE macro only sets the lower 16
bits of a word, so, it could be handled using immediate operations (as load
immediate, add immediate, ...) where the immediate operand (SI) is limited
to 16-bits.
Another possible solution would be using the LOAD_REG_IMMEDIATE() macro
to load a full 64-bits word immediate, but it takes 5 operations instead of
one.
Having TIF_32BITS being redefined to use an upper bit is not a problem
since there is only one place in the assembly code where TIF_32BIT is being
used, and it could be replaced with an operation with right shift (addis),
since it is used alone, i.e. not being part of a composed macro, which has
different bits set, and would require LOAD_REG_IMMEDIATE().
Tested on a 64 bits Big Endian machine running a 32 bits task.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PPC64, as register r13 points to the paca_struct at all time,
this patch adds a copy of the canary there, which is copied at
task_switch.
That new canary is then used by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r13
-mstack-protector-guard-offset=offsetof(struct paca_struct, canary))
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This functionality was tentatively added in the past
(commit 6533b7c16e ("powerpc: Initial stack protector
(-fstack-protector) support")) but had to be reverted
(commit f2574030b0 ("powerpc: Revert the initial stack
protector support") because of GCC implementing it differently
whether it had been built with libc support or not.
Now, GCC offers the possibility to manually set the
stack-protector mode (global or tls) regardless of libc support.
This time, the patch selects HAVE_STACKPROTECTOR only if
-mstack-protector-guard=tls is supported by GCC.
On PPC32, as register r2 points to current task_struct at
all time, the stack_canary located inside task_struct can be
used directly by using the following GCC options:
-mstack-protector-guard=tls
-mstack-protector-guard-reg=r2
-mstack-protector-guard-offset=offsetof(struct task_struct, stack_canary))
The protector is disabled for prom_init and bootx_init as
it is too early to handle it properly.
$ echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
[ 134.943666] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: lkdtm_CORRUPT_STACK+0x64/0x64
[ 134.943666]
[ 134.955414] CPU: 0 PID: 283 Comm: sh Not tainted 4.18.0-s3k-dev-12143-ga3272be41209 #835
[ 134.963380] Call Trace:
[ 134.965860] [c6615d60] [c001f76c] panic+0x118/0x260 (unreliable)
[ 134.971775] [c6615dc0] [c001f654] panic+0x0/0x260
[ 134.976435] [c6615dd0] [c032c368] lkdtm_CORRUPT_STACK_STRONG+0x0/0x64
[ 134.982769] [c6615e00] [ffffffff] 0xffffffff
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.
Fixes: 7c105b63bd ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>