On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.
This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode). Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent. This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.
To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address). SLOF uses these for IO.
However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself. The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled. The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.
This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus. Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.
Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[agraf: fix compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
kvm_no_guest() calls power7_wakeup_loss() to put the thread into the
deepest supported idle state. power7_wakeup_loss() is defined in
arch/powerpc/kernel/idle_power7.S, which is compiled only when
PPC_P7_NAP=y.
And PPC_P7_NAP is selected when PPC_POWERNV=y.
Hence in cases where PPC_POWERNV=n and KVM_BOOK3S_64_HV=y we see the
following error:
arch/powerpc/kvm/built-in.o: In function `kvm_no_guest':
arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x42c): undefined reference to `power7_wakeup_loss'
Fix this by adding PPC_POWERNV as a dependency for KVM_BOOK3S_64_HV.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For THP that is marked trans splitting, we return the pte.
This require the callers to handle the pmd_trans_splitting scenario,
if they care. All the current callers are either looking at pfn or
write_ok, hence we don't need to update them.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We can disable a THP split or a hugepage collapse by disabling irq.
We do send IPI to all the cpus in the early part of split/collapse,
and disabling local irq ensure we don't make progress with
split/collapse. If the THP is getting split we return NULL from
find_linux_pte_or_hugepte(). For all the current callers it should be ok.
We need to be careful if we want to use returned pte_t pointer outside
the irq disabled region. W.r.t to THP split, the pfn remains the same,
but then a hugepage collapse will result in a pfn change. There are
few steps we can take to avoid a hugepage collapse.One way is to take page
reference inside the irq disable region. Other option is to take
mmap_sem so that a parallel collapse will not happen. We can also
disable collapse by taking pmd_lock. Another method used by kvm
subsystem is to check whether we had a mmu_notifer update in between
using mmu_notifier_retry().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch remove helpers which we had used only once in the code.
Limiting page table walk variants help in ensuring that we won't
end up with code walking page table with wrong assumptions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pte can get updated from other CPUs as part of multiple activities
like THP split, huge page collapse, unmap. We need to make sure we
don't reload the pte value again and again for different checks.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull powerpc updates from Michael Ellerman:
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan
Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was
flashing your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan
Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by
Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather
than per machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended
transactions on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree
nodes, an MSI erratum workaround, a couple minor performance
improvements, config updates, and misc fixes/cleanup.
* tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits)
powerpc/powermac: Fix build error seen with powermac smp builds
powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
powerpc: Remove PPC32 code from pseries specific find_and_init_phbs()
powerpc/cell: Fix iommu breakage caused by controller_ops change
powerpc/eeh: Fix crash in eeh_add_device_early() on Cell
powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails
powerpc/pseries: Correct memory hotplug locking
powerpc: Fix missing L2 cache size in /sys/devices/system/cpu
powerpc: Add ppc64 hard lockup detector support
oprofile: Disable oprofile NMI timer on ppc64
powerpc/perf/hv-24x7: Add missing put_cpu_var()
powerpc/perf/hv-24x7: Break up single_24x7_request
powerpc/perf/hv-24x7: Define update_event_count()
powerpc/perf/hv-24x7: Whitespace cleanup
powerpc/perf/hv-24x7: Define add_event_to_24x7_request()
powerpc/perf/hv-24x7: Rename hv_24x7_event_update
powerpc/perf/hv-24x7: Move debug prints to separate function
powerpc/perf/hv-24x7: Drop event_24x7_request()
powerpc/perf/hv-24x7: Use pr_devel() to log message
...
Conflicts:
tools/testing/selftests/powerpc/Makefile
tools/testing/selftests/powerpc/tm/Makefile
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.1
The most interesting bit here is irqfd/ioeventfd support for ARM and
ARM64.
Summary:
ARM/ARM64:
fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390:
interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS:
FPU and MIPS SIMD Architecture (MSA) support. Includes some
patches from Ralf Baechle's MIPS tree.
x86:
bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
KVM: use slowpath for cross page cached accesses
kvm: mmu: lazy collapse small sptes into large sptes
KVM: x86: Clear CR2 on VCPU reset
KVM: x86: DR0-DR3 are not clear on reset
KVM: x86: BSP in MSR_IA32_APICBASE is writable
KVM: x86: simplify kvm_apic_map
KVM: x86: avoid logical_map when it is invalid
KVM: x86: fix mixed APIC mode broadcast
KVM: x86: use MDA for interrupt matching
kvm/ppc/mpic: drop unused IRQ_testbit
KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
KVM: vmx: pass error code with internal error #2
x86: vdso: fix pvclock races with task migration
KVM: remove kvm_read_hva and kvm_read_hva_atomic
KVM: x86: optimize delivery of TSC deadline timer interrupt
KVM: x86: extract blocking logic from __vcpu_run
kvm: x86: fix x86 eflags fixed bit
KVM: s390: migrate vcpu interrupt state
...
iodev.h contains definitions for the kvm_io_bus framework. This is
needed both by the generic KVM code in virt/kvm as well as by
architecture specific code under arch/. Putting the header file in
virt/kvm and using local includes in the architecture part seems at
least dodgy to me, so let's move the file into include/kvm, so that a
more natural "#include <kvm/iodev.h>" can be used by all of the code.
This also solves a problem later when using struct kvm_io_device
in arm_vgic.h.
Fixing up the FSF address in the GPL header and a wrong include path
on the way.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Sometimes the KVM code on powerpc needs to emulate load or store
instructions from the guest, which can include both normal and byte
reversed forms.
We currently (AFAICT) handle this correctly, but some variable names are
very misleading. In particular we use "is_bigendian" in several places to
actually mean "is the IO the same endian as the host", but we now support
little-endian powerpc hosts. This also ties into the misleadingly named
ld_le*() and st_le*() functions, which in fact always byteswap, even on
an LE host.
This patch cleans this up by renaming to more accurate "host_swabbed", and
uses the generic swab*() functions instead of the powerpc specific and
misleadingly named ld_le*() and st_le*() functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 4a157d61b4 ("KVM: PPC: Book3S HV: Fix endianness of
instruction obtained from HEIR register") had the side effect that
we no longer reset vcpu->arch.last_inst to -1 on guest exit in
the cases where the instruction is not fetched from the guest.
This means that if instruction emulation turns out to be required
in those cases, the host will emulate the wrong instruction, since
vcpu->arch.last_inst will contain the last instruction that was
emulated.
This fixes it by making sure that vcpu->arch.last_inst is reset
to -1 in those cases.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The VPA (virtual processor area) is defined by PAPR and is therefore
big-endian, so we need a be32_to_cpu when reading it in
kvmppc_get_yield_count(). Without this, H_CONFER always fails on a
little-endian host, causing SMP guests to waste time spinning on
spinlocks.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The flush_tlb hook in cpu_spec was introduced as a generic function hook
to invalidate TLBs. But the current implementation of flush_tlb hook
takes IS (invalidation selector) as an argument which is architecture
dependent. Hence, It is not right to have a generic routine where caller
has to pass non-generic argument.
This patch fixes this and makes flush_tlb hook as high level API.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
"Tighten rules for ACCESS_ONCE
This series tightens the rules for ACCESS_ONCE to only work on scalar
types. It also contains the necessary fixups as indicated by build
bots of linux-next. Now everything is in place to prevent new
non-scalar users of ACCESS_ONCE and we can continue to convert code to
READ_ONCE/WRITE_ONCE"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
kernel: Fix sparse warning for ACCESS_ONCE
next: sh: Fix compile error
kernel: tighten rules for ACCESS ONCE
mm/gup: Replace ACCESS_ONCE with READ_ONCE
x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
Pull powerpc updates from Michael Ellerman:
- Update of all defconfigs
- Addition of a bunch of config options to modernise our defconfigs
- Some PS3 updates from Geoff
- Optimised memcmp for 64 bit from Anton
- Fix for kprobes that allows 'perf probe' to work from Naveen
- Several cxl updates from Ian & Ryan
- Expanded support for the '24x7' PMU from Cody & Sukadev
- Freescale updates from Scott:
"Highlights include 8xx optimizations, some more work on datapath
device tree content, e300 machine check support, t1040 corenet
error reporting, and various cleanups and fixes"
* tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
cxl: Add missing return statement after handling AFU errror
cxl: Fail AFU initialisation if an invalid configuration record is found
cxl: Export optional AFU configuration record in sysfs
powerpc/mm: Warn on flushing tlb page in kernel context
powerpc/powernv: Add OPAL soft-poweroff routine
powerpc/perf/hv-24x7: Document sysfs event description entries
powerpc/perf/hv-gpci: add the remaining gpci requests
powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
perf: add PMU_EVENT_ATTR_STRING() helper
perf: provide sysfs_show for struct perf_pmu_events_attr
powerpc/kernel: Avoid initializing device-tree pointer twice
powerpc: Remove old compile time disabled syscall tracing code
powerpc/kernel: Make syscall_exit a local label
cxl: Fix device_node reference counting
powerpc/mm: bail out early when flushing TLB page
powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
perf/powerpc: reset event hw state when adding it to the PMU
powerpc/qe: Use strlcpy()
...
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The return value of kvm_arch_vcpu_postcreate is not checked in its
caller. This is okay, because only x86 provides vcpu_postcreate right
now and it could only fail if vcpu_load failed. But that is not
possible during KVM_CREATE_VCPU (kvm_arch_vcpu_load is void, too), so
just get rid of the unchecked return value.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.
The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about ~2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.
Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexander Graf <agraf@suse.de>
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
Commit 69111bac42 ("powerpc: Replace __get_cpu_var uses") introduced
compile breakage to the e500 target by introducing invalid automatically
created C syntax.
Fix up the breakage and make the code compile again.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The in-kernel XICS emulation is faster than doing it all in QEMU
and it has got a lot of testing, so enable it by default.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the H_CONFER hcall is implemented in kernel virtual mode,
meaning that whenever a guest thread does an H_CONFER, all the threads
in that virtual core have to exit the guest. This is bad for
performance because it interrupts the other threads even if they
are doing useful work.
The H_CONFER hcall is called by a guest VCPU when it is spinning on a
spinlock and it detects that the spinlock is held by a guest VCPU that
is currently not running on a physical CPU. The idea is to give this
VCPU's time slice to the holder VCPU so that it can make progress
towards releasing the lock.
To avoid having the other threads exit the guest unnecessarily,
we add a real-mode implementation of H_CONFER that checks whether
the other threads are doing anything. If all the other threads
are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
it returns H_TOO_HARD which causes a guest exit and allows the
H_CONFER to be handled in virtual mode.
Otherwise it spins for a short time (up to 10 microseconds) to give
other threads the chance to observe that this thread is trying to
confer. The spin loop also terminates when any thread exits the guest
or when all other threads are idle or trying to confer. If the
timeout is reached, the H_CONFER returns H_SUCCESS. In this case the
guest VCPU will recheck the spinlock word and most likely call
H_CONFER again.
This also improves the implementation of the H_CONFER virtual mode
handler. If the VCPU is part of a virtual core (vcore) which is
runnable, there will be a 'runner' VCPU which has taken responsibility
for running the vcore. In this case we yield to the runner VCPU
rather than the target VCPU.
We also introduce a check on the target VCPU's yield count: if it
differs from the yield count passed to H_CONFER, the target VCPU
has run since H_CONFER was called and may have already released
the lock. This check is required by PAPR.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the calculations of stolen time for PPC Book3S HV guests
uses fields in both the vcpu struct and the kvmppc_vcore struct. The
fields in the kvmppc_vcore struct are protected by the
vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
running the virtual core. This works correctly but confuses lockdep,
because it sees that the code takes the tbacct_lock for a vcpu in
kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
vcore_stolen_time(), and it thinks there is a possibility of deadlock,
causing it to print reports like this:
=============================================
[ INFO: possible recursive locking detected ]
3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
---------------------------------------------
qemu-system-ppc/6188 is trying to acquire lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
but task is already holding lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&vcpu->arch.tbacct_lock)->rlock);
lock(&(&vcpu->arch.tbacct_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by qemu-system-ppc/6188:
#0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
#1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
#2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
stack backtrace:
CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
Call Trace:
[c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
[c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
[c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
[c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
[c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
[c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
[c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
[c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
[c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
[c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
[c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
[c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
[c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
In order to make the locking easier to analyse, we change the code to
use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
preempt_tb fields. This lock needs to be an irq-safe lock since it is
used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
functions, which are called with the scheduler rq lock held, which is
an irq-safe lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function inst_set_field() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function get_fpr_index() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Removes some functions that are not used anywhere:
kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function sr_nx() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
The kvmppc_vcore_blocked() code does not check for the wait condition
after putting the process on the wait queue. This means that it is
possible for an external interrupt to become pending, but the vcpu to
remain asleep until the next decrementer interrupt. The fix is to
make one last check for pending exceptions and ceded state before
calling schedule().
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This fixes some inaccuracies in the state machine for the virtualized
ICP when implementing the H_IPI hcall (Set_MFFR and related states):
1. The old code wipes out any pending interrupts when the new MFRR is
more favored than the CPPR but less favored than a pending
interrupt (by always modifying xisr and the pending_pri). This can
cause us to lose a pending external interrupt.
The correct code here is to only modify the pending_pri and xisr in
the ICP if the MFRR is equal to or more favored than the current
pending pri (since in this case, it is guaranteed that that there
cannot be a pending external interrupt). The code changes are
required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.
2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
for whether MFRR is being made less favored AND further if new MFFR
is also less favored than the current CPPR, we check for any
resends pending in the ICP. These checks look like they are
designed to cover the case where if the MFRR is being made less
favored, we opportunistically trigger a resend of any interrupts
that had been previously rejected. Although, this is not a state
described by PAPR, this is an action we actually need to do
especially if the CPPR is already at 0xFF. Because in this case,
the resend bit will stay on until another ICP state change which
may be a long time coming and the interrupt stays pending until
then. The current code which checks for MFRR < CPPR is broken when
CPPR is 0xFF since it will not get triggered in that case.
Ideally, we would want to do a resend only if
prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr
where pending interrupt is the one that was rejected. But we don't
have the priority of the pending interrupt state saved, so we
simply trigger a resend whenever the MFRR is made less favored.
3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
virtual mode, we also need to save the ICP whose need_resend we
reset since this does not need to be my ICP (vcpu->arch.icp) as is
incorrectly assumed by the current code. A new field rm_resend_icp
is added to the kvmppc_icp structure for this purpose.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Testing with KSM active in the host showed occasional corruption of
guest memory. Typically a page that should have contained zeroes
would contain values that look like the contents of a user process
stack (values such as 0x0000_3fff_xxxx_xxx).
Code inspection in kvmppc_h_protect revealed that there was a race
condition with the possibility of granting write access to a page
which is read-only in the host page tables. The code attempts to keep
the host mapping read-only if the host userspace PTE is read-only, but
if that PTE had been temporarily made invalid for any reason, the
read-only check would not trigger and the host HPTE could end up
read-write. Examination of the guest HPT in the failure situation
revealed that there were indeed shared pages which should have been
read-only that were mapped read-write.
To close this race, we don't let a page go from being read-only to
being read-write, as far as the real HPTE mapping the page is
concerned (the guest view can go to read-write, but the actual mapping
stays read-only). When the guest tries to write to the page, we take
an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
writable HPTE for the page.
This eliminates the occasional corruption of shared pages
that was previously seen with KSM active.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the
next HPTE without unlocking the invalid one. In fact we should never
find an invalid and unlocked HPTE in the rmap chain, but for robustness
we should unlock it. This adds the missing unlock.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When injecting an IRQ, we only document which IRQ priority (which translates
to IRQ type) gets injected. However, when reading traces you don't necessarily
have all the numbers in your head to know which IRQ really is meant.
This patch converts the IRQ number field to a symbolic name that is in sync
with the respective define. That way it's a lot easier for readers to figure
out what interrupt gets injected.
Signed-off-by: Alexander Graf <agraf@suse.de>
When a secondary hardware thread has finished running a KVM guest, we
currently put that thread into nap mode using a nap instruction in
the KVM code. This changes the code so that instead of doing a nap
instruction directly, we instead cause the call to power7_nap() that
put the thread into nap mode to return. The reason for doing this is
to avoid having the KVM code having to know what low-power mode to
put the thread into.
In the case of a secondary thread used to run a KVM guest, the thread
will be offline from the point of view of the host kernel, and the
relevant power7_nap() call is the one in pnv_smp_cpu_disable().
In this case we don't want to clear pending IPIs in the offline loop
in that function, since that might cause us to miss the wakeup for
the next time the thread needs to run a guest. To tell whether or
not to clear the interrupt, we use the SRR1 value returned from
power7_nap(), and check if it indicates an external interrupt. We
arrange that the return from power7_nap() when we have finished running
a guest returns 0, so pending interrupts don't get flushed in that
case.
Note that it is important a secondary thread that has finished
executing in the guest, or that didn't have a guest to run, should
not return to power7_nap's caller while the kvm_hstate.hwthread_req
flag in the PACA is non-zero, because the return from power7_nap
will reenable the MMU, and the MMU might still be in guest context.
In this situation we spin at low priority in real mode waiting for
hwthread_req to become zero.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch fix spelling typo in printk and Kconfig within
various part of kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>