Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
 "Highlights include
  - full big real mode emulation on pre-Westmere Intel hosts (can be
    disabled with emulate_invalid_guest_state=0)
  - relatively small ppc and s390 updates
  - PCID/INVPCID support in guests
  - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
    interrupt intensive workloads)
  - Lockless write faults during live migration
  - EPT accessed/dirty bits support for new Intel processors"

Fix up conflicts in:
 - Documentation/virtual/kvm/api.txt:
   Stupid subchapter numbering, added next to each other.
 - arch/powerpc/kvm/booke_interrupts.S:
   PPC asm changes clashing with the KVM fixes
 - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:
   Duplicated commits through the kvm tree and the s390 tree, with
   subsequent edits in the KVM tree.

* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  KVM: fix race with level interrupts
  x86, hyper: fix build with !CONFIG_KVM_GUEST
  Revert "apic: fix kvm build on UP without IOAPIC"
  KVM guest: switch to apic_set_eoi_write, apic_write
  apic: add apic_set_eoi_write for PV use
  KVM: VMX: Implement PCID/INVPCID for guests with EPT
  KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
  KVM: PPC: Critical interrupt emulation support
  KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
  KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
  KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
  KVM: PPC: bookehv64: Add support for std/ld emulation.
  booke: Added crit/mc exception handler for e500v2
  booke/bookehv: Add host crit-watchdog exception support
  KVM: MMU: document mmu-lock and fast page fault
  KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
  KVM: MMU: trace fast page fault
  KVM: MMU: fast path of handling guest page fault
  KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
  KVM: MMU: fold tlb flush judgement into mmu_spte_update
  ...
@@ -1946,6 +1946,40 @@ the guest using the specified gsi pin. The irqfd is removed using
the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
and kvm_irqfd.gsi.

4.76 KVM_PPC_ALLOCATE_HTAB

Capability: KVM_CAP_PPC_ALLOC_HTAB
Architectures: powerpc
Type: vm ioctl
Parameters: Pointer to u32 containing hash table order (in/out)
Returns: 0 on success, -1 on error

This requests the host kernel to allocate an MMU hash table for a
guest using the PAPR paravirtualization interface. This only does
anything if the kernel is configured to use the Book 3S HV style of
virtualization. Otherwise the capability doesn't exist and the ioctl
returns an ENOTTY error. The rest of this description assumes Book 3S
HV.

There must be no vcpus running when this ioctl is called; if there
are, it will do nothing and return an EBUSY error.

The parameter is a pointer to a 32-bit unsigned integer variable
containing the order (log base 2) of the desired size of the hash
table, which must be between 18 and 46. On successful return from the
ioctl, it will have been updated with the order of the hash table that
was allocated.

If no hash table has been allocated when any vcpu is asked to run
(with the KVM_RUN ioctl), the host kernel will allocate a
default-sized hash table (16 MB).

If this ioctl is called when a hash table has already been allocated,
the kernel will clear out the existing hash table (zero all HPTEs) and
return the hash table order in the parameter. (If the guest is using
the virtualized real-mode area (VRMA) facility, the kernel will
re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
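
As an illustration of the calling convention described above, the
following is a minimal userspace sketch, not taken from the kernel tree.
The helper names (htab_size_bytes, htab_order_valid, alloc_htab) are
hypothetical, and the local range check with EINVAL is a choice made for
this sketch only; it says nothing about how the kernel itself treats an
out-of-range order. vm_fd is assumed to be a VM file descriptor obtained
from KVM_CREATE_VM.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Bounds on the hash table order, as stated in the description. */
#define HTAB_ORDER_MIN 18
#define HTAB_ORDER_MAX 46

/* Hypothetical helper: size in bytes of a hash table of a given order. */
uint64_t htab_size_bytes(uint32_t order)
{
	return (uint64_t)1 << order;
}

/* Hypothetical helper: is the order inside the documented range? */
int htab_order_valid(uint32_t order)
{
	return order >= HTAB_ORDER_MIN && order <= HTAB_ORDER_MAX;
}

/*
 * Hypothetical wrapper around the vm ioctl.
 *
 * Returns 0 on success, in which case *order holds the order of the
 * table that was actually allocated. Returns -1 on error with errno
 * set (for example ENOTTY or EBUSY from the kernel, or EINVAL from the
 * local range check, which is this sketch's own addition).
 */
int alloc_htab(int vm_fd, uint32_t *order)
{
	if (!htab_order_valid(*order)) {
		errno = EINVAL;
		return -1;
	}
	return ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, order);
}
```

An order of 24 corresponds to the 16 MB default size mentioned above,
since 1 << 24 bytes is 16 MB.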


5. The kvm_run structure
------------------------

@@ -6,7 +6,129 @@ KVM Lock Overview
(to be written)

2. Reference
2: Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast only if the
shadow page table is present and it is caused by write-protect, which means
we just need to change the W bit of the spte.

To avoid all the races we use the SPTE_HOST_WRITEABLE bit and the
SPTE_MMU_WRITEABLE bit on the spte:
- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_MMU_WRITEABLE = 1. This
is safe because any change to these bits can be detected by cmpxchg.
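
The lockless update described above can be sketched in portable C11 as
follows. This is an illustration only, not the kernel code: the function
name fast_fix_write_protect is hypothetical, the bit positions are
assumptions chosen for the example, and C11 atomics stand in for the
kernel's cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumptions, not the kernel's layout). */
#define SPTE_W              (UINT64_C(1) << 1)
#define SPTE_HOST_WRITEABLE (UINT64_C(1) << 9)
#define SPTE_MMU_WRITEABLE  (UINT64_C(1) << 10)

/*
 * Try to set the W bit without holding mmu-lock.
 * Returns true only if this call set the bit.
 */
bool fast_fix_write_protect(_Atomic uint64_t *sptep)
{
	uint64_t old_spte = atomic_load(sptep);

	/* Both bits must be set, otherwise the slow path is required. */
	if (!(old_spte & SPTE_HOST_WRITEABLE) ||
	    !(old_spte & SPTE_MMU_WRITEABLE))
		return false;

	/* Already writable: nothing to fix. */
	if (old_spte & SPTE_W)
		return false;

	/*
	 * The exchange succeeds only if the spte still equals the value
	 * read above, so a concurrent change to any bit makes it fail.
	 */
	return atomic_compare_exchange_strong(sptep, &old_spte,
					      old_spte | SPTE_W);
}
```

A failed exchange is not an error here; the caller would simply fall
back to the path that takes mmu-lock.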

But we need to carefully check these cases:
1): The mapping from gfn to pfn
The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg. This is an ABA problem; for example, the
following case can happen:

At the beginning:
gpte = gfn1
gfn1 is mapped to pfn1 on host
spte is the shadow page table entry corresponding with gpte and
spte = pfn1

   VCPU 0                                VCPU 1
on fast page fault path:

   old_spte = *spte;
                                         pfn1 is swapped out:
                                            spte = 0;

                                         pfn1 is re-alloced for gfn2.

                                         gpte is changed to point to
                                         gfn2 by the guest:
                                            spte = pfn1;

   if (cmpxchg(spte, old_spte, old_spte+W))
      mark_page_dirty(vcpu->kvm, gfn1)
           OOPS!!!

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
to pin gfn to pfn, because after gfn_to_pfn_atomic():
- We hold a reference on the pfn, which means the pfn can not be freed and
  reused for another gfn.
- The pfn is writable, which means it can not be shared between different
  gfns by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

Currently, to keep things simple, we disable fast page fault for
indirect shadow pages.

2): Dirty bit tracking
In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set since the
Accessed bit and Dirty bit can not be lost.

But this is no longer true with fast page fault since the spte can be marked
writable between reading spte and updating spte, as in the following case:

At the beginning:
spte.W = 0
spte.Accessed = 1

   VCPU 0                                VCPU 1
In mmu_spte_clear_track_bits():

   old_spte = *spte;

   /* 'if' condition is satisfied. */
   if (old_spte.Accessed == 1 &&
        old_spte.W == 0)
      spte = 0ull;
                                         on fast page fault path:
                                             spte.W = 1
                                         memory write on the spte:
                                             spte.Dirty = 1


   else
      old_spte = xchg(spte, 0ull)


   if (old_spte.Accessed == 1)
      kvm_set_pfn_accessed(spte.pfn);
   if (old_spte.Dirty == 1)
      kvm_set_pfn_dirty(spte.pfn);
      OOPS!!!

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this
means the spte is always atomically updated in this case.
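
The atomic variant can be sketched as below, again in portable C11 rather
than kernel code. The struct and function names are hypothetical and the
bit positions are assumptions for the example. The point is that the
Accessed and Dirty bits are taken from the value returned by the exchange,
not from an earlier plain read, so a Dirty bit set concurrently is still
observed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumptions for this sketch). */
#define SPTE_ACCESSED (UINT64_C(1) << 5)
#define SPTE_DIRTY    (UINT64_C(1) << 6)

/* Hypothetical result type: state of the tracking bits when cleared. */
struct track_bits {
	bool accessed;
	bool dirty;
};

/*
 * Clear the spte and report the Accessed/Dirty bits exactly as they
 * were at the moment the spte was replaced by zero.
 */
struct track_bits spte_clear_track_bits(_Atomic uint64_t *sptep)
{
	/* Read and clear in one atomic step, like xchg(spte, 0ull). */
	uint64_t old_spte = atomic_exchange(sptep, UINT64_C(0));
	struct track_bits bits = {
		.accessed = (old_spte & SPTE_ACCESSED) != 0,
		.dirty    = (old_spte & SPTE_DIRTY) != 0,
	};

	return bits;
}
```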

3): Flush TLBs due to spte update
If the spte is updated from writable to readonly, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
fast page fault path. In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since
this is a common function to update spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, so the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().
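
The flush decision can be sketched as follows. This is a simplified
illustration in portable C11, not the body of mmu_spte_update(): the
function name is hypothetical, the W bit position is an assumption, and
only the writable to read-only transition discussed above is modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative position of the W bit (assumption for this sketch). */
#define SPTE_W (UINT64_C(1) << 1)

/*
 * Atomically replace a present spte with another present spte.
 * Returns true if the caller must flush TLBs, that is, if the spte
 * really was writable when it was replaced and the new one is not.
 */
bool spte_update_needs_flush(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	/* The old value comes from the exchange, so a W bit set by a
	 * concurrent fast page fault is not missed. */
	uint64_t old_spte = atomic_exchange(sptep, new_spte);

	return (old_spte & SPTE_W) && !(new_spte & SPTE_W);
}
```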

3. Reference
------------

Name: kvm_lock
@@ -23,3 +145,9 @@ Arch: x86
Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
          - tsc offset in vmcb
Comment: 'raw' because updating the tsc offsets must not be preempted.

Name: kvm->mmu_lock
Type: spinlock_t
Arch: any
Protects: - shadow page/shadow tlb entry
Comment: it is a spinlock since it is used in mmu notifier.
@@ -223,3 +223,36 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
	steal: the amount of time in which this vCPU did not run, in
	nanoseconds. Time during which the vcpu is idle will not be
	reported as steal time.

MSR_KVM_EOI_EN: 0x4b564d04
	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
	when disabled. Bit 1 is reserved and must be zero. When PV end of
	interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
	physical address of a 4 byte memory area which must be in guest RAM and
	must be zeroed.

	The first, least significant bit of the 4 byte memory location will be
	written to by the hypervisor, typically at the time of interrupt
	injection. Value of 1 means that guest can skip writing EOI to the apic
	(using MSR or MMIO write); instead, it is sufficient to signal
	EOI by clearing the bit in guest memory - this location will
	later be polled by the hypervisor.
	Value of 0 means that the EOI write is required.

	It is always safe for the guest to ignore the optimization and perform
	the APIC EOI write anyway.

	Hypervisor is guaranteed to only modify this least
	significant bit while in the current VCPU context. This means that
	guest does not need to use either lock prefix or memory ordering
	primitives to synchronise with the hypervisor.

	However, hypervisor can set and clear this memory bit at any time.
	Therefore, to make sure hypervisor does not interrupt the
	guest and clear the least significant bit in the memory area
	in the window between guest testing it to detect
	whether it can skip the EOI apic write and guest
	clearing it to signal EOI to the hypervisor,
	guest must both read the least significant bit in the memory area and
	clear it using a single CPU instruction, such as test and clear, or
	compare and exchange.
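
The guest side of this protocol can be sketched as below. This is a
portable C11 illustration, not guest kernel code: the names
PV_EOI_ENABLE, pv_eoi_msr_value and pv_eoi_try_skip are hypothetical, and
returning 0 for a misaligned address is a choice made for this sketch. A
C11 atomic read-modify-write is stronger than the text requires (on x86
it typically emits a locked instruction), but it does satisfy the "read
and clear in a single step" rule.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* MSR index as given in the text. */
#define MSR_KVM_EOI_EN 0x4b564d04

/* Bit 0 of the MSR value enables PV EOI (name is hypothetical). */
#define PV_EOI_ENABLE UINT64_C(1)

/*
 * Build the value to write to MSR_KVM_EOI_EN for a 4 byte flag area at
 * guest physical address pa. The address must be 4-byte aligned so that
 * bit 1 (reserved) stays zero. Returns 0 for a misaligned address,
 * which in this sketch means "do not enable".
 */
uint64_t pv_eoi_msr_value(uint64_t pa)
{
	if (pa & UINT64_C(0x3))
		return 0;
	return pa | PV_EOI_ENABLE;
}

/*
 * Test and clear the least significant bit of the flag in one atomic
 * step. Returns true if the bit was set, meaning the APIC EOI write
 * can be skipped; false means the guest must perform the EOI write.
 */
bool pv_eoi_try_skip(_Atomic uint32_t *flag)
{
	uint32_t old = atomic_fetch_and_explicit(flag, ~UINT32_C(1),
						 memory_order_relaxed);

	return (old & 1) != 0;
}
```

A guest EOI routine would call pv_eoi_try_skip() first and fall back to
the normal APIC EOI write whenever it returns false.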
@@ -109,8 +109,6 @@ The following bits are safe to be set inside the guest:

MSR_EE
MSR_RI
MSR_CR
MSR_ME

If any other bit changes in the MSR, please still use mtmsr(d).
