We are still troubled by occasional random segmentation faults and
memory memory corruption on SMP machines. The causes quite a few
package builds to fail on the Debian buildd machines for parisc. When
gcc-6 failed to build three times in a row, I looked again at the TLB
related code. I found a couple of issues. This is the first.
In general, we need to ensure page table updates and corresponding TLB
purges are atomic. The attached patch fixes an instance in pci-dma.c
where the page table update was not guarded by the TLB lock.
Tested on rp3440 and c8000. So far, no further random segmentation
faults have been observed.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: <stable@vger.kernel.org> # v3.16+
Signed-off-by: Helge Deller <deller@gmx.de>
Drop the open-coded sched_clock() function and replace it by the provided
GENERIC_SCHED_CLOCK implementation. We have seen quite some hung tasks in the
past, which seem to be fixed by this patch.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Helge Deller <deller@gmx.de>
This patch fixes the following DTC warning with W=1:
"Node /memory has a reg or ranges property, but no unit name"
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
This patch fixes the following DTC warning with W=1:
"Node /memory has a reg or ranges property, but no unit name"
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
This patch fixes the following DTC warning with W=1:
"Node /soc has a reg or ranges property, but no unit name"
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
The LAST_BREAK macro in entry.S uses a different instruction sequence
for CONFIG_MARCH_Z900 builds. The branch target offset to skip the
store of the last breaking event address needs to take the different
length of the code block into account.
Fixes: f8fc82b471 ("s390: move sys_call_table and last_break from thread_info to thread_struct")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch fixes the following DTC warnings with W=1:
Warning (unit_address_vs_reg): Node /regulators/regulator@0 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@1 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@2 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@3 has a unit
name, but no reg property
Warning (unit_address_vs_reg): Node /regulators/regulator@4 has a unit
name, but no reg property
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35 reports the
error. We catch and report the error in do_page_fault().
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Setup AMOR (Authority Mask Override Register) in HV mode so that the
host and guest kernel can in turn setup IAMR.
This allows us to enable key 0 in a following patch.
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ensure that PSSCR is set to a safe value corresponding to no
state-loss each time a POWER9 CPU comes online.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is a nice interface for asking ftrace to dump all its tracing
buffers. The only down side for use in xmon is that it uses printk.
Depending on circumstances printk may not work when in xmon, but it also
may, so add a 'dt' command which dumps the ftrace buffers, and add a
note to the help to mentiont that it uses printk.
Calling this routine also disables tracing, which is problematic if you
return from xmon and expect the system to keep operating normally. So
after we do the dump turn tracing back on.
Both functions already have nop versions defined for when ftrace is not
enabled, so we don't need any extra #ifdefs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With commit e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO") we
started using the ppp value 0b110 to map kernel readonly. But that
facility was only added as part of ISA 2.04. For earlier ISA version
only supported ppp bit value for readonly mapping is 0b011. (This
implies both user and kernel get mapped using the same ppp bit value for
readonly mapping.).
Update the code such that for earlier architecture version we use ppp
value 0b011 for readonly mapping. We don't differentiate between power5+
and power5 here and apply the new ppp bits only from power6 (ISA 2.05).
This keep the changes minimal.
This fixes issue with PS3 spu usage reported at
https://lkml.kernel.org/r/rep.1421449714.geoff@infradead.org
Fixes: e58e87adc8 ("powerpc/mm: Update _PAGE_KERNEL_RO")
Cc: stable@vger.kernel.org # v4.7+
Tested-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit d0563a1297 ("powerpc: Implement {cmp}xchg for u8 and u16")
we removed the volatile from __cmpxchg().
This is leading to warnings such as:
drivers/gpu/drm/drm_lock.c: In function ‘drm_lock_take’:
arch/powerpc/include/asm/cmpxchg.h:484:37: warning: passing argument 1
of ‘__cmpxchg’ discards ‘volatile’ qualifier from pointer target
(__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_, \
There doesn't seem to be consensus across architectures whether the
argument is volatile or not, so at least for now put the volatile back.
Fixes: d0563a1297 ("powerpc: Implement {cmp}xchg for u8 and u16")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum
turbo frequencies of some cores in a CPU package may be higher than for
the other cores in the same package. In that case, better performance
(and possibly lower energy consumption as well) can be achieved by
making the scheduler prefer to run tasks on the CPUs with higher max
turbo frequencies.
To that end, set up a core priority metric to abstract the core
preferences based on the maximum turbo frequency. In that metric,
the cores with higher maximum turbo frequencies are higher-priority
than the other cores in the same package and that causes the scheduler
to favor them when making load-balancing decisions using the asymmertic
packing approach. At the same time, the priority of SMT threads with a
higher CPU number is reduced so as to avoid scheduling tasks on all of
the threads that belong to a favored core before all of the other cores
have been given a task to run.
The priority metric will be initialized by the P-state driver with the
help of the sched_set_itmt_core_prio() function. The P-state driver
will also determine whether or not ITMT is supported by the platform
and will call sched_set_itmt_support() to indicate that.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be
bigger that the maximal number of VCPUs, resulting in out-of-bounds
access.
Found by syzkaller:
BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...]
Write of size 1 by task a.out/27101
CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[...]
Call Trace:
[...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905
[...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495
[...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86
[...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360
[...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222
[...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235
[...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670
[...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668
[...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999
[...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: af1bae5497 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.
Found by syzkaller:
WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
Kernel panic - not syncing: panic_on_warn set ...
CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[...]
Call Trace:
[...] __dump_stack lib/dump_stack.c:15
[...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
[...] panic+0x1b7/0x3a3 kernel/panic.c:179
[...] __warn+0x1c4/0x1e0 kernel/panic.c:542
[...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
[...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
[...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
[...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
[...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
[...] complete_emulated_io arch/x86/kvm/x86.c:6870
[...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
[...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
[...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
[...] vfs_ioctl fs/ioctl.c:43
[...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
[...] SYSC_ioctl fs/ioctl.c:694
[...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
[...] entry_SYSCALL_64_fastpath+0x1f/0xc2
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
AMD hardware adds two additional bits to aid in nested page fault handling.
Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables
The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.
Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since MIPSr6 the Wired register is split into 2 fields, with the upper
16 bits of the register indicating a limit on the value that the wired
entry count in the bottom 16 bits of the register can take. This means
that simply reading the wired register doesn't get us a valid TLB entry
index any longer, and we instead need to retrieve only the lower 16 bits
of the register. Introduce a new num_wired_entries() function which does
this on MIPSr6 or higher and simply returns the value of the wired
register on older architecture revisions, and make use of it when
reading the number of wired entries.
Since commit e710d66683 ("MIPS: tlb-r4k: If there are wired entries,
don't use TLBINVF") we have been using a non-zero number of wired
entries to determine whether we should avoid use of the tlbinvf
instruction (which would invalidate wired entries) and instead loop over
TLB entries in local_flush_tlb_all(). This loop begins with the number
of wired entries, or before this patch some large bogus TLB index on
MIPSr6 systems. Thus since the aforementioned commit some MIPSr6 systems
with FTLBs have been prone to leaving stale address translations in the
FTLB & crashing in various weird & wonderful ways when we later observe
the wrong memory.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/14557/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
When configured with CONFIG_PPC_EARLY_DEBUG_OPAL=y the kernel expects
the OPAL entry and base addresses to be passed in r8 and r9
respectively. Currently the wrapper does not attempt to restore these
values before entering the decompressed kernel which causes the kernel
to branch into whatever happens to be in r9 when doing a write to the
OPAL console in early boot.
This patch adds a platform_ops hook that can be used to branch into the
new kernel. The OPAL console driver patches this at runtime so that if
the console is used it will be restored just prior to entering the
kernel.
Fixes: 656ad58ef1 ("powerpc/boot: Add OPAL console to epapr wrappers")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit:
90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
changed the coredumping code to construct the elf coredump file according
to register set size - and that's good: if binary crashes with 32-bit code
selector, generate 32-bit ELF core, otherwise - 64-bit core.
That was made for restoring 32-bit applications on x86_64: we want
32-bit application after restore to generate 32-bit ELF dump on crash.
All was quite good and recently I started reworking 32-bit applications
dumping part of CRIU: now it has two parasites (32 and 64) for seizing
compat/native tasks, after rework it'll have one parasite, working in
64-bit mode, to which 32-bit prologue long-jumps during infection.
And while it has worked for my work machine, in VM with
!CONFIG_X86_X32_ABI during reworking I faced that segfault in 32-bit
binary, that has long-jumped to 64-bit mode results in dereference
of garbage:
32-victim[19266]: segfault at f775ef65 ip 00000000f775ef65 sp 00000000f776aa50 error 14
BUG: unable to handle kernel paging request at ffffffffffffffff
IP: [<ffffffff81332ce0>] strlen+0x0/0x20
[...]
Call Trace:
[] elf_core_dump+0x11a9/0x1480
[] do_coredump+0xa6b/0xe60
[] get_signal+0x1a8/0x5c0
[] do_signal+0x23/0x660
[] exit_to_usermode_loop+0x34/0x65
[] prepare_exit_to_usermode+0x2f/0x40
[] retint_user+0x8/0x10
That's because we have 64-bit registers set (with according total size)
and we're writing it to elf_thread_core_info which has smaller size
on !CONFIG_X86_X32_ABI. That lead to overwriting ELF notes part.
Tested on 32-, 64-bit ELF crashes and on 32-bit binaries that have
jumped with 64-bit code selector - all is readable with gdb.
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At present KVM on powerpc always reports KVM_CAP_PPC_ALLOC_HTAB as enabled.
However, the ioctl() it advertises (KVM_PPC_ALLOCATE_HTAB) only actually
works on KVM HV. On KVM PR it will fail with ENOTTY.
QEMU already has a workaround for this, so it's not breaking things in
practice, but it would be better to advertise this correctly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the "again" parameter to the dummy version of
kvmppc_check_passthru(), so that it matches the real version.
This fixes compilation with CONFIG_BOOK3S_64_HV set but
CONFIG_KVM_XICS=n.
This includes asm/smp.h in book3s_hv_builtin.c to fix compilation
with CONFIG_SMP=n. The explicit inclusion is necessary to provide
definitions of hard_smp_processor_id() and get_hard_smp_processor_id()
in UP configs.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.
Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode
to the PCR.
We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).
Based on work by: Paul Mackerras <paulus@ozlabs.org>
[paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
guest_pcr_bit when arch_compat == 0, added comment.]
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With POWER9, each CPU thread has its own MMU context and can be
in the host or a guest independently of the other threads; there is
still however a restriction that all threads must use the same type
of address translation, either radix tree or hashed page table (HPT).
Since we only support HPT guests on a HPT host at this point, we
can treat the threads as being independent, and avoid all of the
work of coordinating the CPU threads. To make this simpler, we
introduce a new threads_per_vcore() function that returns 1 on
POWER9 and threads_per_subcore on POWER7/8, and use that instead
of threads_per_subcore or threads_per_core in various places.
This also changes the value of the KVM_CAP_PPC_SMT capability on
POWER9 systems from 4 to 1, so that userspace will not try to
create VMs with multiple vcpus per vcore. (If userspace did create
a VM that thought it was in an SMT mode, the VM might try to use
the msgsndp instruction, which will not work as expected. In
future it may be possible to trap and emulate msgsndp in order to
allow VMs to think they are in an SMT mode, if only for the purpose
of allowing migration from POWER8 systems.)
With all this, we can now run guests on POWER9 as long as the host
is running with HPT translation. Since userspace currently has no
way to request radix tree translation for the guest, the guest has
no choice but to use HPT translation.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The new XIVE interrupt controller on POWER9 can direct external
interrupts to the hypervisor or the guest. The interrupts directed to
the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
come in as a "hypervisor virtualization interrupt". This sets the
LPCR bit so that hypervisor virtualization interrupts can occur while
we are in the guest. We then also need to cope with exiting the guest
because of a hypervisor virtualization interrupt.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 replaces the various power-saving mode instructions on POWER8
(doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
a register, PSSCR, which controls the depth of the power-saving mode.
This replaces the use of the nap instruction when threads are idle
during guest execution with the stop instruction, and adds code to
set PSSCR to a value which will allow an SMT mode switch while the
thread is idle (given that the core as a whole won't be idle in these
cases).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines. KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.
In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE. Access to the emulated
XICS is via OPAL calls. The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.
This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware. When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.
In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending. The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process. The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On POWER9, the msgsnd instruction is able to send interrupts to
other cores, as well as other threads on the local core. Since
msgsnd is generally simpler and faster than sending an IPI via the
XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions. Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word. The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.
This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors. Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.
The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets. Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct. That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.
Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor. Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register). They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.
The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode. We only allow the
guest-accessible fields to be read or set by userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some special-purpose registers that were present and accessible
by guests on POWER8 no longer exist on POWER9, so this adds
feature sections to ensure that we don't try to context-switch
them when going into or out of a guest on POWER9. These are
all relatively obscure, rarely-used registers, but we had to
context-switch them on POWER8 to avoid creating a covert channel.
They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On POWER9, the SDR1 register (hashed page table base address) is no
longer used, and instead the hardware reads the HPT base address
and size from the partition table. The partition table entry also
contains the bits that specify the page size for the VRMA mapping,
which were previously in the LPCR. The VPM0 bit of the LPCR is
now reserved; the processor now always uses the VRMA (virtual
real-mode area) mechanism for guest real-mode accesses in HPT mode,
and the RMO (real-mode offset) mechanism has been dropped.
When entering or exiting the guest, we now only have to set the
LPIDR (logical partition ID register), not the SDR1 register.
There is also no requirement now to transition via a reserved
LPID value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adapts the KVM-HV hashed page table (HPT) code to read and write
HPT entries in the new format defined in Power ISA v3.00 on POWER9
machines. The new format moves the B (segment size) field from the
first doubleword to the second, and trims some bits from the AVA
(abbreviated virtual address) and ARPN (abbreviated real page number)
fields. As far as possible, the conversion is done when reading or
writing the HPT entries, and the rest of the code continues to use
the old format.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the ppc-kvm topic branch to get changes to
arch/powerpc code that are necessary for adding POWER9 KVM support.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits. For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.
Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values. So,
just use mult_frac() directly in tile's sched_clock().
Commit 4cecf6d401 ("sched, x86: Avoid unnecessary overflow
in sched_clock") by Salman Qazi results in essentially the same
generated code for x86 as this change does for tile. In fact,
a follow-on change by Salman introduced mult_frac() and switched
to using it, so the C code was largely identical at that point too.
Peter Zijlstra then added mul_u64_u32_shr() and switched x86
to use it. This is, in principle, better; by optimizing the
64x64->64 multiplies to be 32x32->64 multiplies we can potentially
save some time. However, the compiler piplines the 64x64->64
multiplies pretty well, and the conditional branch in the generic
mul_u64_u32_shr() causes some bubbles in execution, with the
result that it's pretty much a wash. If tilegx provided its own
implementation of mul_u64_u32_shr() without the conditional branch,
we could potentially save 3 cycles, but that seems like small gain
for a fair amount of additional build scaffolding; no other platform
currently provides a mul_u64_u32_shr() override, and tile doesn't
currently have an <asm/div64.h> header to put the override in.
Additionally, gcc currently has an optimization bug that prevents
it from recognizing the opportunity to use a 32x32->64 multiply,
and so the result would be no better than the existing mult_frac()
until such time as the compiler is fixed.
For now, just using mult_frac() seems like the right answer.
Cc: stable@kernel.org [v3.4+]
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>