Commit Graph

5421 Commits

Author SHA1 Message Date
Stanislav Lanci
806793f5f7 KVM: x86: AMD Processor Topology Information
This patch allow to enable x86 feature TOPOEXT. This is needed to provide
information about SMT on AMD Zen CPUs to the guest.

Signed-off-by: Stanislav Lanci <pixo@polepetko.eu>
Tested-by: Nick Sarnie <commendsarnex@gmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-31 18:25:34 +01:00
Vitaly Kuznetsov
d391f12070 x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
I was investigating an issue with seabios >= 1.10 which stopped working
for nested KVM on Hyper-V. The problem appears to be in
handle_ept_violation() function: when we do fast mmio we need to skip
the instruction so we do kvm_skip_emulated_instruction(). This, however,
depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
However, this is not the case.

Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
EPT MISCONFIG occurs. While on real hardware it was observed to be set,
some hypervisors follow the spec and don't set it; we end up advancing
IP with some random value.

I checked with Microsoft and they confirmed they don't fill
VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.

Fix the issue by doing instruction skip through emulator when running
nested.

Fixes: 68c3b4d167
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-31 18:25:34 +01:00
Thomas Gleixner
5fa4ec9cb2 x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
The reenlightment support for hyperv slapped a direct reference to
x86_hyper_type into the kvm code which results in the following build
failure when CONFIG_HYPERVISOR_GUEST=n:

arch/x86/kvm/x86.c:6259:6: error: ‘x86_hyper_type’ undeclared (first use in this function)
arch/x86/kvm/x86.c:6259:6: note: each undeclared identifier is reported only once for each function it appears in

Use the proper helper function to cure that.

The 32bit compile fails because of:

arch/x86/kvm/x86.c:5936:13: warning: ‘kvm_hyperv_tsc_notifier’ defined but not used [-Wunused-function]

which is a real trainwreck engineering artwork. The callsite is wrapped
into #ifdef CONFIG_X86_64, but the function itself has the #ifdef inside
the function body. Make the function itself wrapped into the ifdef to cure
that.

Qualiteee....

Fixes: 0092e4346f ("x86/kvm: Support Hyper-V reenlightenment")
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
2018-01-31 10:29:40 +01:00
Vitaly Kuznetsov
0092e4346f x86/kvm: Support Hyper-V reenlightenment
When running nested KVM on Hyper-V guests its required to update
masterclocks for all guests when L1 migrates to a host with different TSC
frequency.

Implement the procedure in the following way:
  - Pause all guests.
  - Tell the host (Hyper-V) to stop emulating TSC accesses.
  - Update the gtod copy, recompute clocks.
  - Unpause all guests.

This is somewhat similar to cpufreq but there are two important differences:
 - TSC emulation can only be disabled globally (on all CPUs)
 - The new TSC frequency is not known until emulation is turned off so
   there is no way to 'prepare' for the event upfront.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-8-vkuznets@redhat.com
2018-01-30 23:55:34 +01:00
Vitaly Kuznetsov
b0c39dc68e x86/kvm: Pass stable clocksource to guests when running nested on Hyper-V
Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource which is used on the
host is TSC.

When running nested on Hyper-V the guest normally uses a different one: TSC
page which is resistant to TSC frequency changes on events like L1
migration. Add support for it in KVM.

The only non-trivial change is in vgettsc(): when updating the gtod copy
both the clock readout and tsc value have to be updated now.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-7-vkuznets@redhat.com
2018-01-30 23:55:34 +01:00
Ingo Molnar
7e86548e2c Merge tag 'v4.15' into x86/pti, to be able to merge dependent changes
Time has come to switch PTI development over to a v4.15 base - we'll still
try to make sure that all PTI fixes backport cleanly to v4.14 and earlier.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-30 15:08:27 +01:00
Linus Torvalds
6304672b7f Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "Another set of melted spectrum related changes:

   - Code simplifications and cleanups for RSB and retpolines.

   - Make the indirect calls in KVM speculation safe.

   - Whitelist CPUs which are known not to speculate from Meltdown and
     prepare for the new CPUID flag which tells the kernel that a CPU is
     not affected.

   - A less rigorous variant of the module retpoline check which merily
     warns when a non-retpoline protected module is loaded and reflects
     that fact in the sysfs file.

   - Prepare for Indirect Branch Prediction Barrier support.

   - Prepare for exposure of the Speculation Control MSRs to guests, so
     guest OSes which depend on those "features" can use them. Includes
     a blacklist of the broken microcodes. The actual exposure of the
     MSRs through KVM is still being worked on"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Simplify indirect_branch_prediction_barrier()
  x86/retpoline: Simplify vmexit_fill_RSB()
  x86/cpufeatures: Clean up Spectre v2 related CPUID flags
  x86/cpu/bugs: Make retpoline module warning conditional
  x86/bugs: Drop one "mitigation" from dmesg
  x86/nospec: Fix header guards names
  x86/alternative: Print unadorned pointers
  x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
  x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
  x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
  x86/msr: Add definitions for new speculation control MSRs
  x86/cpufeatures: Add AMD feature bits for Speculation Control
  x86/cpufeatures: Add Intel feature bits for Speculation Control
  x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
  module/retpoline: Warn about missing retpoline in module
  KVM: VMX: Make indirect call speculation safe
  KVM: x86: Make indirect calls in emulator speculation safe
2018-01-29 19:08:02 -08:00
Paolo Bonzini
f21f165ef9 KVM: VMX: introduce alloc_loaded_vmcs
Group together the calls to alloc_vmcs and loaded_vmcs_init.  Soon we'll also
allocate an MSR bitmap there.

Cc: stable@vger.kernel.org       # prereq for Spectre mitigation
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-27 09:43:12 +01:00
Jim Mattson
de3a0021a6 KVM: nVMX: Eliminate vmcs02 pool
The potential performance advantages of a vmcs02 pool have never been
realized. To simplify the code, eliminate the pool. Instead, a single
vmcs02 is allocated per VCPU when the VCPU enters VMX operation.

Cc: stable@vger.kernel.org       # prereq for Spectre mitigation
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Ameya More <ameya.more@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-27 09:43:03 +01:00
Peter Zijlstra
c940a3fb1e KVM: VMX: Make indirect call speculation safe
Replace indirect call with CALL_NOSPEC.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rga@amazon.de
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180125095843.645776917@infradead.org
2018-01-25 14:14:42 +01:00
Peter Zijlstra
1a29b5b7f3 KVM: x86: Make indirect calls in emulator speculation safe
Replace the indirect calls with CALL_NOSPEC.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rga@amazon.de
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180125095843.595615683@infradead.org
2018-01-25 11:30:07 +01:00
Tianyu Lan
37b95951c5 KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
to fix it.

Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-17 15:01:11 +01:00
Paolo Bonzini
d7231e75f7 KVM: VMX: introduce X2APIC_MSR macro
Remove duplicate expression in nested_vmx_prepare_msr_bitmap, and make
the register names clearer in hardware_setup.

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Resolved rebase conflict after removing Intel PT. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:52:52 +01:00
Paolo Bonzini
c992384bde KVM: vmx: speed up MSR bitmap merge
The bulk of the MSR bitmap is either immutable, or can be copied from
the L1 bitmap.  By initializing it at VMXON time, and copying the mutable
parts one long at a time on vmentry (rather than one bit), about 4000
clock cycles (30%) can be saved on a nested VMLAUNCH/VMRESUME.

The resulting for loop only has four iterations, so it is cheap enough
to reinitialize the MSR write bitmaps on every iteration, and it makes
the code simpler.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:52:52 +01:00
Paolo Bonzini
1f6e5b2564 KVM: vmx: simplify MSR bitmap setup
The APICv-enabled MSR bitmap is a superset of the APICv-disabled bitmap.
Make that obvious in vmx_disable_intercept_msr_x2apic.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Resolved rebase conflict after removing Intel PT. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:52:48 +01:00
Paolo Bonzini
07f36616cd KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit
The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and
vmcs02), there is no need to reload it on vmexit to L1.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:23 +01:00
Paolo Bonzini
25a2e4fe8e KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full
These fields are also simple copies of the data in the vmcs12 struct.
For some of them, prepare_vmcs02 was skipping the copy when the field
was unused.  In prepare_vmcs02_full, we copy them always as long as the
field exists on the host, because the corresponding execution control
might be one of the shadowed fields.

Optimization opportunities remain for MSRs that, depending on the
entry/exit controls, have to be copied from either the vmcs01 or
the vmcs12: EFER (whose value is partly stored in the entry controls
too), PAT, DEBUGCTL (and also DR7).  Before moving these three and
the entry/exit controls to prepare_vmcs02_full, KVM would have to set
dirty_vmcs12 on writes to the L1 MSRs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:20 +01:00
Paolo Bonzini
8665c3f973 KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full
This part is separate for ease of review, because git prefers to move
prepare_vmcs02 below the initial long sequence of vmcs_write* operations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:17 +01:00
Paolo Bonzini
74a497fae7 KVM: nVMX: track dirty state of non-shadowed VMCS fields
VMCS12 fields that are not handled through shadow VMCS are rarely
written, and thus they are also almost constant in the vmcs02.  We can
thus optimize prepare_vmcs02 by skipping all the work for non-shadowed
fields in the common case.

This patch introduces the (pretty simple) tracking infrastructure; the
next patches will move work to prepare_vmcs02_full and save a few hundred
clock cycles per VMRESUME on a Haswell Xeon E5 system:

	                                before  after
	cpuid                           14159   13869
	vmcall                          15290   14951
	inl_from_kernel                 17703   17447
	outl_to_kernel                  16011   14692
	self_ipi_sti_nop                16763   15825
	self_ipi_tpr_sti_nop            17341   15935
	wr_tsc_adjust_msr               14510   14264
	rd_tsc_adjust_msr               15018   14311
	mmio-wildcard-eventfd:pci-mem   16381   14947
	mmio-datamatch-eventfd:pci-mem  18620   17858
	portio-wildcard-eventfd:pci-io  15121   14769
	portio-datamatch-eventfd:pci-io 15761   14831

(average savings 748, stdev 460).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:13 +01:00
Paolo Bonzini
c9e9deae76 KVM: VMX: split list of shadowed VMCS field to a separate file
Prepare for multiple inclusions of the list.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:05 +01:00
Jim Mattson
58e9ffae5e kvm: vmx: Reduce size of vmcs_field_to_offset_table
The vmcs_field_to_offset_table was a rather sparse table of short
integers with a maximum index of 0x6c16, amounting to 55342 bytes. Now
that we are considering support for multiple VMCS12 formats, it would
be unfortunate to replicate that large, sparse table. Rotating the
field encoding (as a 16-bit integer) left by 6 reduces that table to
5926 bytes.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:03 +01:00
Jim Mattson
d37f4267a7 kvm: vmx: Change vmcs_field_type to vmcs_field_width
Per the SDM, "[VMCS] Fields are grouped by width (16-bit, 32-bit,
etc.) and type (guest-state, host-state, etc.)." Previously, the width
was indicated by vmcs_field_type. To avoid confusion when we start
dealing with both field width and field type, change vmcs_field_type
to vmcs_field_width.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:50:01 +01:00
Jim Mattson
5b15706dbf kvm: vmx: Introduce VMCS12_MAX_FIELD_INDEX
This is the highest index value used in any supported VMCS12 field
encoding. It is used to populate the IA32_VMX_VMCS_ENUM MSR.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:49:58 +01:00
Paolo Bonzini
44900ba65e KVM: VMX: optimize shadow VMCS copying
Because all fields can be read/written with a single vmread/vmwrite on
64-bit kernels, the switch statements in copy_vmcs12_to_shadow and
copy_shadow_to_vmcs12 are unnecessary.

What I did in this patch is to copy the two parts of 64-bit fields
separately on 32-bit kernels, to keep all complicated #ifdef-ery
in init_vmcs_shadow_fields.  The disadvantage is that 64-bit fields
have to be listed separately in shadow_read_only/read_write_fields,
but those are few and we can validate the arrays when building the
VMREAD and VMWRITE bitmaps.  This saves a few hundred clock cycles
per nested vmexit.

However there is still a "switch" in vmcs_read_any and vmcs_write_any.
So, while at it, this patch reorders the fields by type, hoping that
the branch predictor appreciates it.

Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:49:56 +01:00
Paolo Bonzini
c5d167b27e KVM: vmx: shadow more fields that are read/written on every vmexits
Compared to when VMCS shadowing was added to KVM, we are reading/writing
a few more fields: the PML index, the interrupt status and the preemption
timer value.  The first two are because we are exposing more features
to nested guests, the preemption timer is simply because we have grown
a new optimization.  Adding them to the shadow VMCS field lists reduces
the cost of a vmexit by about 1000 clock cycles for each field that exists
on bare metal.

On the other hand, the guest BNDCFGS and TSC offset are not written on
fast paths, so remove them.

Suggested-by: Jim Mattson <jmattson@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:49:44 +01:00
Liran Alon
6b6977117f KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2
Consider the following scenario:
1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
to CPU B via virtual posted-interrupt mechanism.
2. CPU B is currently executing L2 guest.
3. vmx_deliver_nested_posted_interrupt() calls
kvm_vcpu_trigger_posted_interrupt() which will note that
vcpu->mode == IN_GUEST_MODE.
4. Assume that before CPU A sends the physical POSTED_INTR_NESTED_VECTOR
IPI, CPU B exits from L2 to L0 during event-delivery
(valid IDT-vectoring-info).
5. CPU A now sends the physical IPI. The IPI is received in host and
it's handler (smp_kvm_posted_intr_nested_ipi()) does nothing.
6. Assume that before CPU A sets pi_pending=true and KVM_REQ_EVENT,
CPU B continues to run in L0 and reach vcpu_enter_guest(). As
KVM_REQ_EVENT is not set yet, vcpu_enter_guest() will continue and resume
L2 guest.
7. At this point, CPU A sets pi_pending=true and KVM_REQ_EVENT but
it's too late! CPU B already entered L2 and KVM_REQ_EVENT will only be
consumed at next L2 entry!

Another scenario to consider:
1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
to CPU B via virtual posted-interrupt mechanism.
2. Assume that before CPU A calls kvm_vcpu_trigger_posted_interrupt(),
CPU B is at L0 and is about to resume into L2. Further assume that it is
in vcpu_enter_guest() after check for KVM_REQ_EVENT.
3. At this point, CPU A calls kvm_vcpu_trigger_posted_interrupt() which
will note that vcpu->mode != IN_GUEST_MODE. Therefore, do nothing and
return false. Then, will set pi_pending=true and KVM_REQ_EVENT.
4. Now CPU B continue and resumes into L2 guest without processing
the posted-interrupt until next L2 entry!

To fix both issues, we just need to change
vmx_deliver_nested_posted_interrupt() to set pi_pending=true and
KVM_REQ_EVENT before calling kvm_vcpu_trigger_posted_interrupt().

It will fix the first scenario by chaging step (6) to note that
KVM_REQ_EVENT and pi_pending=true and therefore process
nested posted-interrupt.

It will fix the second scenario by two possible ways:
1. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B has changed
vcpu->mode to IN_GUEST_MODE, physical IPI will be sent and will be received
when CPU resumes into L2.
2. If kvm_vcpu_trigger_posted_interrupt() is called while CPU B hasn't yet
changed vcpu->mode to IN_GUEST_MODE, then after CPU B will change
vcpu->mode it will call kvm_request_pending() which will return true and
therefore force another round of vcpu_enter_guest() which will note that
KVM_REQ_EVENT and pi_pending=true and therefore process nested
posted-interrupt.

Cc: stable@vger.kernel.org
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
[Add kvm_vcpu_kick to also handle the case where L1 doesn't intercept L2 HLT
 and L2 executes HLT instruction. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Liran Alon
851c1a18c5 KVM: nVMX: Fix injection to L2 when L1 don't intercept external-interrupts
Before each vmentry to guest, vcpu_enter_guest() calls sync_pir_to_irr()
which calls vmx_hwapic_irr_update() to update RVI.
Currently, vmx_hwapic_irr_update() contains a tweak in case it is called
when CPU is running L2 and L1 don't intercept external-interrupts.
In that case, code injects interrupt directly into L2 instead of
updating RVI.

Besides being hacky (wouldn't expect function updating RVI to also
inject interrupt), it also doesn't handle this case correctly.
The code contains several issues:
1. When code calls kvm_queue_interrupt() it just passes it max_irr which
represents the highest IRR currently pending in L1 LAPIC.
This is problematic as interrupt was injected to guest but it's bit is
still set in LAPIC IRR instead of being cleared from IRR and set in ISR.
2. Code doesn't check if LAPIC PPR is set to accept an interrupt of
max_irr priority. It just checks if interrupts are enabled in guest with
vmx_interrupt_allowed().

To fix the above issues:
1. Simplify vmx_hwapic_irr_update() to just update RVI.
Note that this shouldn't happen when CPU is running L2
(See comment in code).
2. Since now vmx_hwapic_irr_update() only does logic for L1
virtual-interrupt-delivery, inject_pending_event() should be the
one responsible for injecting the interrupt directly into L2.
Therefore, change kvm_cpu_has_injectable_intr() to check L1
LAPIC when CPU is running L2.
3. Change vmx_sync_pir_to_irr() to set KVM_REQ_EVENT when L1
has a pending injectable interrupt.

Fixes: 963fee1656 ("KVM: nVMX: Fix virtual interrupt delivery
injection")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Liran Alon
f27a85c498 KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
In case posted-interrupt was delivered to CPU while it is in host
(outside guest), then posted-interrupt delivery will be done by
calling sync_pir_to_irr() at vmentry after interrupts are disabled.

sync_pir_to_irr() will check vmx->pi_desc.control ON bit and if
set, it will sync vmx->pi_desc.pir to IRR and afterwards update RVI to
ensure virtual-interrupt-delivery will dispatch interrupt to guest.

However, it is possible that L1 will receive a posted-interrupt while
CPU runs at host and is about to enter L2. In this case, the call to
sync_pir_to_irr() will indeed update the L1's APIC IRR but
vcpu_enter_guest() will then just resume into L2 guest without
re-evaluating if it should exit from L2 to L1 as a result of this
new pending L1 event.

To address this case, if sync_pir_to_irr() has a new L1 injectable
interrupt and CPU is running L2, we force exit GUEST_MODE which will
result in another iteration of vcpu_run() run loop which will call
kvm_vcpu_running() which will call check_nested_events() which will
handle the pending L1 event properly.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Liran Alon
e7387b0e27 KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated
This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Liran Alon
fa59cc0038 KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr()
sync_pir_to_irr() is only called if vcpu->arch.apicv_active()==true.
In case it is false, VMX code make sure to set sync_pir_to_irr
to NULL.

Therefore, having SVM stubs allows to remove check for if
sync_pir_to_irr != NULL from all calling sites.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[Return highest IRR in the SVM case. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Liran Alon
5c7d4f9ad3 KVM: nVMX: Fix bug of injecting L2 exception into L1
kvm_clear_exception_queue() should clear pending exception.
This also includes exceptions which were only marked pending but not
yet injected. This is because exception.pending is used for both L1
and L2 to determine if an exception should be raised to guest.
Note that an exception which is pending but not yet injected will
be raised again once the guest will be resumed.

Consider the following scenario:
1) L0 KVM with ignore_msrs=false.
2) L1 prepare vmcs12 with the following:
    a) No intercepts on MSR (MSR_BITMAP exist and is filled with 0).
    b) No intercept for #GP.
    c) vmx-preemption-timer is configured.
3) L1 enters into L2.
4) L2 reads an unhandled MSR that exists in MSR_BITMAP
(such as 0x1fff).

L2 RDMSR could be handled as described below:
1) L2 exits to L0 on RDMSR and calls handle_rdmsr().
2) handle_rdmsr() calls kvm_inject_gp() which sets
KVM_REQ_EVENT, exception.pending=true and exception.injected=false.
3) vcpu_enter_guest() consumes KVM_REQ_EVENT and calls
inject_pending_event() which calls vmx_check_nested_events()
which sees that exception.pending=true but
nested_vmx_check_exception() returns 0 and therefore does nothing at
this point. However let's assume it later sees vmx-preemption-timer
expired and therefore exits from L2 to L1 by calling
nested_vmx_vmexit().
4) nested_vmx_vmexit() calls prepare_vmcs12()
which calls vmcs12_save_pending_event() but it does nothing as
exception.injected is false. Also prepare_vmcs12() calls
kvm_clear_exception_queue() which does nothing as
exception.injected is already false.
5) We now return from vmx_check_nested_events() with 0 while still
having exception.pending=true!
6) Therefore inject_pending_event() continues
and we inject L2 exception to L1!...

This commit will fix above issue by changing step (4) to
clear exception.pending in kvm_clear_exception_queue().

Fixes: 664f8e26b0 ("KVM: X86: Fix loss of exception which has not yet been injected")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Borislav Petkov
a6cb099a43 kvm/vmx: Use local vmx variable in vmx_get_msr()
... just like in vmx_set_msr().

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Haozhong Zhang
aa2e063aea KVM: MMU: consider host cache mode in MMIO page check
Some reserved pages, such as those from NVDIMM DAX devices, are not
for MMIO, and can be mapped with cached memory type for better
performance. However, the above check misconceives those pages as
MMIO.  Because KVM maps MMIO pages with UC memory type, the
performance of guest accesses to those pages would be harmed.
Therefore, we check the host memory type in addition and only treat
UC/UC-/WC pages as MMIO.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reported-by: Cuevas Escareno, Ivan D <ivan.d.cuevas.escareno@intel.com>
Reported-by: Kumar, Karthik <karthik.kumar@intel.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Paolo Bonzini
05992edc27 Merge branch 'kvm-insert-lfence'
Topic branch for CVE-2017-5753, avoiding conflicts in the next merge window.
2018-01-16 16:39:30 +01:00
Paolo Bonzini
505c9e94d8 KVM: x86: prefer "depends on" to "select" for SEV
Avoid reverse dependencies.  Instead, SEV will only be enabled if
the PSP driver is available.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:38:32 +01:00
Paolo Bonzini
65e38583c3 Merge branch 'sev-v9-p2' of https://github.com/codomania/kvm
This part of Secure Encrypted Virtualization (SEV) patch series focuses on KVM
changes required to create and manage SEV guests.

SEV is an extension to the AMD-V architecture which supports running encrypted
virtual machine (VMs) under the control of a hypervisor. Encrypted VMs have their
pages (code and data) secured such that only the guest itself has access to
unencrypted version. Each encrypted VM is associated with a unique encryption key;
if its data is accessed to a different entity using a different key the encrypted
guest's data will be incorrectly decrypted, leading to unintelligible data.
This security model ensures that hypervisor will no longer able to inspect or
alter any guest code or data.

The key management of this feature is handled by a separate processor known as
the AMD Secure Processor (AMD-SP) which is present on AMD SOCs. The SEV Key
Management Specification (see below) provides a set of commands which can be
used by hypervisor to load virtual machine keys through the AMD-SP driver.

The patch series adds a new ioctl in KVM driver (KVM_MEMORY_ENCRYPT_OP). The
ioctl will be used by qemu to issue SEV guest-specific commands defined in Key
Management Specification.

The following links provide additional details:

AMD Memory Encryption white paper:
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf

AMD64 Architecture Programmer's Manual:
    http://support.amd.com/TechDocs/24593.pdf
    SME is section 7.10
    SEV is section 15.34

SEV Key Management:
http://support.amd.com/TechDocs/55766_SEV-KM API_Specification.pdf

KVM Forum Presentation:
http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf

SEV Guest BIOS support:
  SEV support has been add to EDKII/OVMF BIOS
  https://github.com/tianocore/edk2

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-16 16:35:32 +01:00
Paolo Bonzini
476b7adaa3 KVM: x86: avoid unnecessary XSETBV on guest entry
xsetbv can be expensive when running on nested virtualization, try to
avoid it.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li
efdab99281 KVM: x86: fix escape of guest dr6 to the host
syzkaller reported:

   WARNING: CPU: 0 PID: 12927 at arch/x86/kernel/traps.c:780 do_debug+0x222/0x250
   CPU: 0 PID: 12927 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #16
   RIP: 0010:do_debug+0x222/0x250
   Call Trace:
    <#DB>
    debug+0x3e/0x70
   RIP: 0010:copy_user_enhanced_fast_string+0x10/0x20
    </#DB>
    _copy_from_user+0x5b/0x90
    SyS_timer_create+0x33/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

The testcase sets a watchpoint (with perf_event_open) on a buffer that is
passed to timer_create() as the struct sigevent argument.  In timer_create(),
copy_from_user()'s rep movsb triggers the BP.  The testcase also sets
the debug registers for the guest.

However, KVM only restores host debug registers when the host has active
watchpoints, which triggers a race condition when running the testcase with
multiple threads.  The guest's DR6.BS bit can escape to the host before
another thread invokes timer_create(), and do_debug() complains.

The fix is to respect do_debug()'s dr6 invariant when leaving KVM.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li
f38a7b7526 KVM: X86: support paravirtualized help for TLB shootdowns
When running on a virtual machine, IPIs are expensive when the target
CPU is sleeping.  Thus, it is nice to be able to avoid them for TLB
shootdowns.  KVM can just do the flush via INVVPID on the guest's behalf
the next time the CPU is scheduled.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Use "&" to test the bit instead of "==". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li
c2ba05ccfd KVM: X86: introduce invalidate_gpa argument to tlb flush
Introduce a new bool invalidate_gpa argument to kvm_x86_ops->tlb_flush,
it will be used by later patches to just flush guest tlb.

For VMX, this will use INVVPID instead of INVEPT, which will invalidate
combined mappings while keeping guest-physical mappings.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li
fa55eedd63 KVM: X86: Add KVM_VCPU_PREEMPTED
The next patch will add another bit to the preempted field in
kvm_steal_time.  Define a constant for bit 0 (the only one that is
currently used).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Paolo Bonzini
51776043af kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
This ioctl is obsolete (it was used by Xenner as far as I know) but
still let's not break it gratuitously...  Its handler is copying
directly into struct kvm.  Go through a bounce buffer instead, with
the added benefit that we can actually do something useful with the
flags argument---the previous code was exiting with -EINVAL but still
doing the copy.

This technically is a userspace ABI breakage, but since no one should be
using the ioctl, it's a good occasion to see if someone actually
complains.

Cc: kernel-hardening@lists.openwall.com
Cc: Kees Cook <keescook@chromium.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 12:08:07 -08:00
Linus Torvalds
40548c6b6c Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
 "This contains:

   - a PTI bugfix to avoid setting reserved CR3 bits when PCID is
     disabled. This seems to cause issues on a virtual machine at least
     and is incorrect according to the AMD manual.

   - a PTI bugfix which disables the perf BTS facility if PTI is
     enabled. The BTS AUX buffer is not globally visible and causes the
     CPU to fault when the mapping disappears on switching CR3 to user
     space. A full fix which restores BTS on PTI is non trivial and will
     be worked on.

   - PTI bugfixes for EFI and trusted boot which make sure that the user
     space visible page table entries have the NX bit cleared

   - removal of dead code in the PTI pagetable setup functions

   - add PTI documentation

   - add a selftest for vsyscall to verify that the kernel actually
     implements what it advertises.

   - a sysfs interface to expose vulnerability and mitigation
     information so there is a coherent way for users to retrieve the
     status.

   - the initial spectre_v2 mitigations, aka retpoline:

      + The necessary ASM thunk and compiler support

      + The ASM variants of retpoline and the conversion of affected ASM
        code

      + Make LFENCE serializing on AMD so it can be used as speculation
        trap

      + The RSB fill after vmexit

   - initial objtool support for retpoline

  As I said in the status mail this is the most of the set of patches
  which should go into 4.15 except two straight forward patches still on
  hold:

   - the retpoline add on of LFENCE which waits for ACKs

   - the RSB fill after context switch

  Both should be ready to go early next week and with that we'll have
  covered the major holes of spectre_v2 and go back to normality"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  x86,perf: Disable intel_bts when PTI
  security/Kconfig: Correct the Documentation reference for PTI
  x86/pti: Fix !PCID and sanitize defines
  selftests/x86: Add test_vsyscall
  x86/retpoline: Fill return stack buffer on vmexit
  x86/retpoline/irq32: Convert assembler indirect jumps
  x86/retpoline/checksum32: Convert assembler indirect jumps
  x86/retpoline/xen: Convert Xen hypercall indirect jumps
  x86/retpoline/hyperv: Convert assembler indirect jumps
  x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
  x86/retpoline/entry: Convert entry assembler indirect jumps
  x86/retpoline/crypto: Convert crypto assembler indirect jumps
  x86/spectre: Add boot time option to select Spectre v2 mitigation
  x86/retpoline: Add initial retpoline support
  objtool: Allow alternatives to be ignored
  objtool: Detect jumps to retpoline thunks
  x86/pti: Make unpoison of pgd for trusted boot work for real
  x86/alternatives: Fix optimize_nops() checking
  sysfs/cpu: Fix typos in vulnerability documentation
  x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
  ...
2018-01-14 09:51:25 -08:00
David Woodhouse
117cc7a908 x86/retpoline: Fill return stack buffer on vmexit
In accordance with the Intel and AMD documentation, we need to overwrite
all entries in the RSB on exiting a guest, to prevent malicious branch
target predictions from affecting the host kernel. This is needed both
for retpoline and for IBRS.

[ak: numbers again for the RSB stuffing labels]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515755487-8524-1-git-send-email-dwmw@amazon.co.uk
2018-01-12 12:33:37 +01:00
Paolo Bonzini
2aad9b3e07 Merge branch 'kvm-insert-lfence' into kvm-master
Topic branch for CVE-2017-5753, avoiding conflicts in the next merge window.
2018-01-11 18:20:48 +01:00
Andrew Honig
75f139aaf8 KVM: x86: Add memory barrier on vmcs field lookup
This adds a memory barrier when performing a lookup into
the vmcs_field_to_offset_table.  This is related to
CVE-2017-5753.

Signed-off-by: Andrew Honig <ahonig@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-11 18:20:31 +01:00
Paolo Bonzini
bd89525a82 KVM: x86: emulate #UD while in guest mode
This reverts commits ae1f576707
and ac9b305caa.

If the hardware doesn't support MOVBE, but L0 sets CPUID.01H:ECX.MOVBE
in L1's emulated CPUID information, then L1 is likely to pass that
CPUID bit through to L2. L2 will expect MOVBE to work, but if L1
doesn't intercept #UD, then any MOVBE instruction executed in L2 will
raise #UD, and the exception will be delivered in L2.

Commit ac9b305caa is a better and more
complete version of ae1f576707 ("KVM: nVMX: Do not emulate #UD while
in guest mode"); however, neither considers the above case.

Suggested-by: Jim Mattson <jmattson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-11 16:55:24 +01:00
Arnd Bergmann
ab271bd4df x86: kvm: propagate register_shrinker return code
Patch "mm,vmscan: mark register_shrinker() as __must_check" is
queued for 4.16 in linux-mm and adds a warning about the unchecked
call to register_shrinker:

arch/x86/kvm/mmu.c:5485:2: warning: ignoring return value of 'register_shrinker', declared with attribute warn_unused_result [-Wunused-result]

This changes the kvm_mmu_module_init() function to fail itself
when the call to register_shrinker fails.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-11 16:53:13 +01:00
Haozhong Zhang
2a266f2355 KVM MMU: check pending exception before injecting APF
For example, when two APF's for page ready happen after one exit and
the first one becomes pending, the second one will result in #DF.
Instead, just handle the second page fault synchronously.

Reported-by: Ross Zwisler <zwisler@gmail.com>
Message-ID: <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
Reported-by: Alec Blayne <ab@tevsa.net>
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-11 14:05:19 +01:00
Linus Torvalds
5b6c02f383 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
 "s390:
   - Two fixes for potential bitmap overruns in the cmma migration code

  x86:
   - Clear guest provided GPRs to defeat the Project Zero PoC for CVE
     2017-5715"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: vmx: Scrub hardware GPRs at VM-exit
  KVM: s390: prevent buffer overrun on memory hotplug during migration
  KVM: s390: fix cmma migration for multiple memory slots
2018-01-06 17:05:05 -08:00