If there is a possibility that a VM may migrate to a Skylake host,
then the hypervisor should report IA32_ARCH_CAPABILITIES.RSBA[bit 2]
as being set (future work, of course). This implies that
CPUID.(EAX=7,ECX=0):EDX.ARCH_CAPABILITIES[bit 29] should be
set. Therefore, kvm should report this CPUID bit as being supported
whether or not the host supports it. Userspace is still free to clear
the bit if it chooses.
For more information on RSBA, see Intel's white paper, "Retpoline: A
Branch Target Injection Mitigation" (Document Number 337131-001),
currently available at https://bugzilla.kernel.org/show_bug.cgi?id=199511.
Since the IA32_ARCH_CAPABILITIES MSR is emulated in kvm, there is no
dependency on hardware support for this feature.
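A minimal sketch of the idea, against the CPUID leaf 7 handling in
arch/x86/kvm/cpuid.c (F() is the local feature-bit helper; treat the
hunk as illustrative):

	/* KVM emulates IA32_ARCH_CAPABILITIES in software, so the
	 * CPUID bit can be advertised regardless of host support. */
	entry->edx |= F(ARCH_CAPABILITIES);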
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fixes: 28c1c9fabf ("KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES")
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The CPUID bits of OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0)
allow user applications to detect whether the OS has set CR4.OSXSAVE or
CR4.PKE. KVM is supposed to update these CPUID bits when CR4 is
updated, but the current KVM code doesn't handle some special cases
when the update comes from the emulator.
Here is one example:
Step 1: guest boots
Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1
Step 4: guest OS checks CPUID.OSXSAVE, sees 1, then executes xgetbv
Step 4 above will cause a #UD and a guest crash because the guest OS
hasn't turned on OSXSAVE yet. This patch solves the problem by
comparing the old_cr4 with the new cr4. If the related bits have been
changed, kvm_update_cpuid() needs to be called.
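A sketch of the fix in kvm_set_cr4() (illustrative):

	/* Refresh the guest-visible CPUID bits whenever a CR4 bit they
	 * mirror has changed, including changes from the emulator. */
	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
		kvm_update_cpuid(vcpu);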
Signed-off-by: Wei Huang <wei@redhat.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since 4.10, commit 8003c9ae20 ("KVM: LAPIC: add APIC Timer
periodic/oneshot mode VMX preemption timer support"), guests using
periodic LAPIC timers (such as FreeBSD 8.4) would see their timers
drift significantly over time.
Differences in the underlying clocks and numerical errors mean the
periods of the two timers (hv and sw) are not the same. This
difference will accumulate with every expiry resulting in a large
error between the hv and sw timer.
This means the sw timer may be running slow when compared to the hv
timer. When the timer is switched from hv to sw, the now active sw
timer will expire late. The guest VCPU is reentered and it switches to
using the hv timer. This timer catches up, injecting multiple IRQs
into the guest (of which the guest only sees one as it does not get to
run until the hv timer has caught up) and thus the guest's timer rate
is low (and becomes increasingly slow over time as the sw timer lags
further and further behind).
I believe a similar problem would occur if the hv timer is the slower
one, but I have not observed this.
Fix this by synchronizing the deadlines for both timers to the same
time source on every tick. This prevents the errors from accumulating.
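A sketch of the approach in advance_periodic_target_expiration()
(helpers as in arch/x86/kvm/lapic.c; illustrative rather than the
exact hunk):

	ktime_t now = ktime_get();
	u64 tscl = rdtsc();
	ktime_t delta;

	/* Advance the sw (hrtimer) deadline by one period, then derive
	 * the hv (TSC) deadline from the same target so the two timers
	 * cannot drift apart as errors accumulate. */
	apic->lapic_timer.target_expiration =
		ktime_add_ns(apic->lapic_timer.target_expiration,
			     apic->lapic_timer.period);
	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
		nsec_to_cycles(apic->vcpu, delta);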
Fixes: 8003c9ae20 ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: David Vrabel <david.vrabel@nutanix.com>
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enforce the invariant that existing VMCS12 field offsets must not
change. Experience has shown that without strict enforcement, this
invariant will not be maintained.
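A sketch of the enforcement, one compile-time assertion per field (the
macro name is illustrative):

	#define CHECK_OFFSET(field, loff) \
		BUILD_BUG_ON_MSG(offsetof(struct vmcs12, field) != loff, \
			"Offset of " #field " in struct vmcs12 has changed.")

	CHECK_OFFSET(launch_state, 8);	/* breaks the build if moved */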
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Changed the code to use BUILD_BUG_ON_MSG instead of the better
_Static_assert, which would require GCC >= 4.6. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Changing the VMCS12 layout will break save/restore compatibility with
older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
accepted upstream. Google has already been using these ioctls for some
time, and we implore the community not to disturb the existing layout.
Move the four most recently added fields to preserve the offsets of
the previously defined fields and reserve locations for the vmread and
vmwrite bitmaps, which will be used in the virtualization of VMCS
shadowing (to improve the performance of double-nesting).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Kept the SDM order in vmcs_field_to_offset_table. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The hypercall was added using a struct timespec based implementation,
but we should not use timespec in new code.
This changes it to timespec64. There is no functional change
here since the implementation is only used in 64-bit kernels
that use the same definition for timespec and timespec64.
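The change is mechanical; a sketch of the converted code in
kvm_pv_clock_pairing() (illustrative):

	struct timespec64 ts;	/* was: struct timespec */
	u64 cycle;

	if (!kvm_get_walltime_and_clockread(&ts, &cycle))
		return -KVM_EOPNOTSUPP;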
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When saving a vCPU's nested state, the vmcs02 is discarded. Only the
shadow vmcs12 is saved. The shadow vmcs12 contains all of the
information needed to reconstruct an equivalent vmcs02 on restore, but
we have to be able to deal with two contexts:
1. The nested state was saved immediately after an emulated VM-entry,
before the vmcs02 was ever launched.
2. The nested state was saved some time after the first successful
launch of the vmcs02.
Though it's an implementation detail rather than an architected bit,
vmx->nested_run_pending serves to distinguish between these two
cases. Hence, we save it as part of the vCPU's nested state. (Yes,
this is ugly.)
Even when restoring from a checkpoint, it may be necessary to build
the vmcs02 as if prepare_vmcs02 was called from nested_vmx_run. So,
the 'from_vmentry' argument should be dropped, and
vmx->nested_run_pending should be consulted instead. The nested state
restoration code then has to set vmx->nested_run_pending prior to
calling prepare_vmcs02. It's important that the restoration code set
vmx->nested_run_pending anyway, since the flag impacts things like
interrupt delivery as well.
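A sketch of the restore path (hypothetical names, e.g. 'saved', since
the KVM_{GET,SET}_NESTED_STATE ioctls are not upstream yet):

	/* Must be set before building the vmcs02: prepare_vmcs02() now
	 * consults it in place of the old 'from_vmentry' argument, and
	 * it also affects interrupt delivery. */
	vmx->nested.nested_run_pending = saved->nested_run_pending;
	prepare_vmcs02(vcpu, vmcs12);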
Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accommodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using
speculative store bypass disable (SSBD) under SVM. This will allow guests
to use SSBD on hardware that uses non-architectural mechanisms for enabling
SSBD.
[ tglx: Folded the migration fixup from Paolo Bonzini ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store
Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care
about the bit position of the SSBD bit and thus facilitate migration.
Also, the sibling coordination on Family 17H CPUs can only be done on
the host.
Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an
extra argument for the VIRT_SPEC_CTRL MSR.
Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU
data structure which is going to be used in later patches for the actual
implementation.
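The resulting interface, sketched:

	/* VMX hands in 0 for guest_virt_spec_ctrl; SVM passes the new
	 * per-vCPU virt_spec_ctrl value. */
	void x86_spec_ctrl_set_guest(u64 guest_spec_ctrl,
				     u64 guest_virt_spec_ctrl);
	void x86_spec_ctrl_restore_host(u64 guest_spec_ctrl,
					u64 guest_virt_spec_ctrl);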
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but
before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current'
to determine the host SSBD state of the thread. 'current' is GS based, but
host GS is not yet restored and the access causes a triple fault.
Move the call after the host GS restore.
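The resulting ordering in svm_vcpu_run(), sketched (illustrative):

	/* ... VMEXIT ... */
#ifdef CONFIG_X86_64
	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
#else
	loadsegment(gs, svm->host.gs);
#endif
	/* host GS is back, so 'current' is safe to use: */
	x86_spec_ctrl_restore_host(svm->spec_ctrl);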
Fixes: 885f82bfbc ("x86/process: Allow runtime control of Speculative Store Bypass")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
These private pages have special purposes in the virtualization of L1,
but not in the virtualization of L2. In particular, L1's APIC access
page should never be entered into L2's page tables, because this
causes a great deal of confusion when the APIC virtualization hardware
is being used to accelerate L2's accesses to its own APIC.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
L1 and L2 need to have disjoint mappings, so that L1's APIC access
page (under VMX) can be omitted from L2's mappings.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is only possible to share the APIC access page between L1 and L2 if
they also share the virtual-APIC page. If L2 has its own virtual-APIC
page, then MMIO accesses to L1's TPR from L2 will access L2's TPR
instead. Moreover, L1's local APIC has to be in xAPIC mode, which is
another condition that hasn't been checked.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, we toggled between SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE
and SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, depending on whether or
not the EXTD bit was set in MSR_IA32_APICBASE. However, if the local
APIC is disabled, we should not set either of these APIC
virtualization control bits.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The local APIC can be in one of three modes: disabled, xAPIC or
x2APIC. (A fourth mode, "invalid," is included for completeness.)
Using the new enumeration can make some of the APIC mode logic easier
to read. In kvm_set_apic_base, for instance, it is clear that one
cannot transition directly from x2APIC mode to xAPIC mode or directly
from APIC disabled to x2APIC mode.
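The enumeration, roughly as introduced (the values are chosen so the
mode falls directly out of the APIC base MSR bits):

	enum lapic_mode {
		LAPIC_MODE_DISABLED = 0,
		LAPIC_MODE_INVALID = X2APIC_ENABLE,
		LAPIC_MODE_XAPIC = MSR_IA32_APICBASE_ENABLE,
		LAPIC_MODE_X2APIC = MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE,
	};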
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
[Check invalid bits even if msr_info->host_initiated. Reported by
Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened MSR-Bitmap is a natural extension of Enlightened VMCS; the
Hyper-V Top Level Functional Specification states:
"The L1 hypervisor may collaborate with the L0 hypervisor to make MSR
accesses more efficient. It can enable enlightened MSR bitmaps by setting
the corresponding field in the enlightened VMCS to 1. When enabled, the L0
hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1
hypervisor must invalidate the corresponding clean field after making
changes to one of the MSR bitmaps."
I reached out to the Hyper-V team for additional details and got the
following information:
"The current Hyper-V implementation works as follows:
If the enlightened MSR bitmap is not enabled:
- All MSR accesses of L2 guests cause physical VM-Exits
If the enlightened MSR bitmap is enabled:
- Physical VM-Exits for L2 accesses to certain MSRs (currently FS_BASE,
GS_BASE and KERNEL_GS_BASE) are avoided, thus making these MSR accesses
faster."
I tested my series with a tight rdmsrl loop in L2; for KERNEL_GS_BASE
the results are:
Without Enlightened MSR-Bitmap: 1300 cycles/read
With Enlightened MSR-Bitmap: 120 cycles/read
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the logic to free a root page into a separate function to avoid
code duplication in mmu_free_roots(). Also, change it to an exported
function, i.e. kvm_mmu_free_roots().
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4,
and should be checked in that case. However, commit d1cd3ce900
("KVM: MMU: check guest CR3 reserved bits based on its physical
address width") removed the bit 63 check unconditionally. This patch
fixes it by checking bit 63 of CR3 when the PCIDE bit is not set in
CR4.
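A sketch of the corrected check (helpers as in KVM's headers;
illustrative):

	u64 rsvd = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

	/* Bit 63 is the PCID no-flush bit; it is only defined when
	 * CR4.PCIDE is set. */
	if (kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
		rsvd &= ~CR3_PCID_INVD;
	if (cr3 & rsvd)
		return 1;	/* caller injects #GP */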
Fixes: d1cd3ce900 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update SECONDARY_EXEC_DESC for UMIP emulation if and only if UMIP
is actually being emulated. Skipping the VMCS update eliminates
unnecessary VMREAD/VMWRITE when UMIP is supported in hardware,
and on platforms that don't have SECONDARY_VM_EXEC_CONTROL. The
latter case resolves a bug where KVM would fill the kernel log
with warnings due to failed VMWRITEs on older platforms.
Fixes: 0367f205a3 ("KVM: vmx: add support for emulating UMIP")
Cc: stable@vger.kernel.org #4.16
Reported-by: Paolo Zeppegno <pzeppegno@gmail.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the PCIDE bit is not set in CR4, then the MSB of CR3 is a reserved
bit. If the guest tries to set it, that should cause a #GP fault. So
mask out the bit only when the PCIDE bit is set.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even though the eventfd is released after the KVM SRCU grace period
elapses, the conn_to_evt data structure itself is not; it uses RCU
internally, instead. Fix the read-side critical section to happen
under rcu_read_lock/unlock; the result is still protected by
vcpu->kvm->srcu.
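The fixed read side, sketched:

	rcu_read_lock();
	eventfd = idr_find(&hv->conn_to_evt, param);
	rcu_read_unlock();
	if (!eventfd)
		return HV_STATUS_INVALID_PORT_ID;

	eventfd_signal(eventfd, 1);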
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IP increment should be done after the hypercall emulation, after
calling the various handlers. In this way, these handlers can
accurately identify the IP of the VMCALL if they need it.
This patch keeps the same functionality for the Hyper-V handler which does
not use the return code of the standard kvm_skip_emulated_instruction()
call.
Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
[Hyper-V hypercalls also need kvm_skip_emulated_instruction() - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel collateral will reference the SSB mitigation bit in
IA32_SPEC_CTRL[2] as SSBD (Speculative Store Bypass Disable).
Hence changing it.
It is not yet clear what the MSR_IA32_ARCH_CAPABILITIES (0x10a) Bit(4)
name is going to be. Following the rename it would be SSBD_NO, but
that expands to Speculative Store Bypass Disable No.
Also fixed the missing space in X86_FEATURE_AMD_SSBD.
[ tglx: Fixup x86_amd_rds_enable() and rds_tif_to_amd_ls_cfg() as well ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since commit 8003c9ae20 ("KVM: LAPIC: add APIC Timer periodic/oneshot
mode VMX preemption timer support"), a Windows 10 guest has shown
erratic timer spikes.
Here are the results of 150000 runs of a 1ms timer without any load:
            Before 8003c9ae20 | After 8003c9ae20
  Max                  1834us |          86000us
  Mean                 1100us |           1021us
  Deviation              59us |            149us
Here are the results of 150000 runs of a 1ms timer with a cpu-z stress
test:
            Before 8003c9ae20 | After 8003c9ae20
  Max                 32000us |         140000us
  Mean                 1006us |           1997us
  Deviation             140us |          11095us
The root cause of the problem is that starting an hrtimer with an
expiry time already in the past can take more than 20 milliseconds to
trigger the timer function. This can be solved by forwarding such past
timers immediately, rather than submitting them to hrtimer_start().
In case the timer is periodic, update the target expiration and call
hrtimer_start with it.
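An illustrative sketch ('expire' is the computed expiry time; helpers
as in lapic.c):

	if (ktime_before(expire, ktime_get())) {
		/* The deadline has already passed: fire now instead of
		 * asking hrtimer to catch up, which can take >20ms. */
		apic_timer_expired(apic);
		if (apic_lvtt_period(apic))
			advance_periodic_target_expiration(apic);
	} else {
		hrtimer_start(&apic->lapic_timer.timer, expire,
			      HRTIMER_MODE_ABS_PINNED);
	}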
v2: Check if the tsc deadline is already expired. Thank you Mika.
v3: Execute the past timers immediately rather than submitting them to
hrtimer_start().
v4: Rearm the periodic timer with advance_periodic_target_expiration(),
a simpler version of set_target_expiration(). Thank you Paolo.
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
8003c9ae20 ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Having everything in nospec-branch.h creates a hell of dependencies when
adding the prctl based switching mechanism. Move everything which is not
required in nospec-branch.h to spec-ctrl.h and fix up the includes in the
relevant files.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various
combinations of SPEC_CTRL MSR values.
The handling of the MSR (to take into account the host value of SPEC_CTRL
Bit(2)) is taken care of in patch:
KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
A guest may modify the SPEC_CTRL MSR from the value used by the
kernel. Since the kernel doesn't use IBRS, this means a value of zero is
what is needed in the host.
But 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to
the other bits as reserved, so the kernel should respect the boot-time
SPEC_CTRL value and use that.
This allows dealing with future extensions to the SPEC_CTRL interface,
if any.
Note: This uses wrmsrl() instead of native_wrmsrl(). It does not make
any difference, as paravirt will overwrite the callq *0xfff.. with the
wrmsrl assembler code.
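A sketch of the restore side (x86_spec_ctrl_base holds the boot-time
value):

	/* Return to the boot-time value rather than assuming zero, in
	 * case other SPEC_CTRL bits are defined in the future. */
	if (x86_spec_ctrl_base != guest_spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);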
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Currently, KVM flushes the TLB after a change to the APIC access page
address or the APIC mode when EPT mode is enabled. However, even in
shadow paging mode, a TLB flush is needed if VPIDs are being used, as
specified in the Intel SDM Section 29.4.5.
So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will
flush if either EPT or VPIDs are in use.
Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Call clear_siginfo to ensure every stack-allocated siginfo is properly
initialized before being passed to the signal sending functions.
Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.
The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.
In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo is not used on other paths in the function in which
it is declared.
Instances of using memset to initialize siginfo have been replaced
with calls to clear_siginfo for clarity.
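Typical usage after this change, sketched ('regs' is assumed to be in
scope and the si_* values are illustrative):

	struct siginfo info;

	clear_siginfo(&info);	/* no stale padding reaches userspace */
	info.si_signo = SIGTRAP;
	info.si_code = TRAP_BRKPT;
	info.si_addr = (void __user *)instruction_pointer(regs);
	force_sig_info(SIGTRAP, &info, current);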
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull kvm fixes from Paolo Bonzini:
"Bug fixes, plus a new test case and the associated infrastructure for
writing nested virtualization tests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: selftests: add vmx_tsc_adjust_test
kvm: x86: move MSR_IA32_TSC handling to x86.c
X86/KVM: Properly update 'tsc_offset' to represent the running guest
kvm: selftests: add -std=gnu99 cflags
x86: Add check for APIC access address for vmentry of L2 guests
KVM: X86: fix incorrect reference of trace_kvm_pi_irte_update
X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not available
kvm: selftests: fix spelling mistake: "divisable" and "divisible"
X86/VMX: Disable VMX preemption timer if MWAIT is not intercepted
According to the sub-section titled 'VM-Execution Control Fields' in the
section titled 'Basic VM-Entry Checks' in Intel SDM vol. 3C, the following
vmentry check must be enforced:
If the 'virtualize APIC-accesses' VM-execution control is 1, the
APIC-access address must satisfy the following checks:
- Bits 11:0 of the address must be 0.
- The address should not set any bits beyond the processor's
physical-address width.
This patch adds the necessary check to conform to this rule. If the
check fails, we cause the L2 VMENTRY to fail, which is what the
associated unit test (in the following patch) expects.
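The added check, sketched (page_address_valid() tests both page
alignment and the physical-address width):

	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
	    !page_address_valid(vcpu, vmcs12->apic_access_addr))
		return -EINVAL;	/* fail the vmentry */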
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In arch/x86/kvm/trace.h, this tracepoint is declared with host_irq as
the first argument and vcpu_id as the second, not the other way around.
Signed-off-by: hu huajun <huhuajun@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the processor does not have an "Always Running APIC Timer" (aka ARAT),
we should not give guests direct access to MWAIT. The LAPIC timer would
stop ticking in deep C-states, so any host deadlines would not wake up the
host kernel.
The host kernel intel_idle driver handles this by switching to broadcast
mode when ARAT is not available and MWAIT is issued with a deep C-state
that would stop the LAPIC timer. When MWAIT is passed through, we
cannot tell when MWAIT is issued.
So just disable this capability when LAPIC ARAT is not available. I am
not even sure whether there are any CPUs with VMX support but no LAPIC
ARAT.
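A sketch of the gate, roughly the helper consulted when reporting
KVM_CAP_X86_DISABLE_EXITS (the ARAT check is the new part):

	static inline bool kvm_can_mwait_in_guest(void)
	{
		return boot_cpu_has(X86_FEATURE_MWAIT) &&
			!boot_cpu_has_bug(X86_BUG_MONITOR) &&
			boot_cpu_has(X86_FEATURE_ARAT);
	}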
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMX-preemption timer is used by KVM as a way to set deadlines for
the guest (i.e. timer emulation). That was safe until very recently,
when the capability KVM_X86_DISABLE_EXITS_MWAIT to disable intercepting
MWAIT was introduced. According to Intel SDM 25.5.1:
"""
The VMX-preemption timer operates in the C-states C0, C1, and C2; it also
operates in the shutdown and wait-for-SIPI states. If the timer counts down
to zero in any state other than the wait-for-SIPI state, the logical
processor transitions to the C0 C-state and causes a VM exit; the timer
does not cause a VM exit if it counts down to zero in the wait-for-SIPI
state. The timer is not decremented in C-states deeper than C2.
"""
Now once the guest issues MWAIT with a C-state deeper than C2, the
preemption timer will never wake it up again, since it has stopped
ticking! Usually this is compensated for by other activity in the
system that wakes the core from the deep C-state (and causes a
VMExit), for example, if the host itself is ticking or it receives
interrupts.
So disable the VMX-preemption timer if MWAIT is exposed to the guest!
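A sketch of the fix, assuming the vmx_set_hv_timer() entry point:

	/* If MWAIT is not intercepted, the guest may sleep in a C-state
	 * deeper than C2, where the preemption timer stops counting. */
	if (kvm_mwait_in_guest(vcpu->kvm))
		return -EOPNOTSUPP;	/* fall back to the hrtimer */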
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Fixes: 4d5422cea3 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull kvm updates from Paolo Bonzini:
"ARM:
- VHE optimizations
- EL2 address space randomization
- speculative execution mitigations ("variant 3a", aka execution past
invalid privilege register access)
- bugfixes and cleanups
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as
of now)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
kvm: x86: fix a prototype warning
kvm: selftests: add sync_regs_test
kvm: selftests: add API testing infrastructure
kvm: x86: fix a compile warning
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
KVM: X86: Introduce handle_ud()
KVM: vmx: unify adjacent #ifdefs
x86: kvm: hide the unused 'cpu' variable
KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
kvm: Add emulation for movups/movupd
KVM: VMX: raise internal error for exception during invalid protected mode state
KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
KVM: x86: Fix misleading comments on handling pending exceptions
KVM: x86: Rename interrupt.pending to interrupt.injected
KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
x86/kvm: use Enlightened VMCS when running on Hyper-V
x86/hyper-v: detect nested features
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
...
Make the function static to avoid a
warning: no previous prototype for ‘vmx_enable_tdp’
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no easy way to force KVM to run an instruction through the
emulator (by design, as that would expose the x86 emulator as a
significant attack surface). However, we do wish to expose the x86
emulator when testing it (e.g. via kvm-unit-tests). Therefore, this
patch adds a "force emulation prefix" that is designed to raise #UD,
which KVM will trap; its #UD exit handler matches the "force emulation
prefix" and runs the instruction after the prefix through the x86
emulator.
To avoid exposing the x86 emulator by default, we add a module
parameter that is off by default.
A simple testcase here:
#include <stdio.h>
#include <string.h>
#define HYPERVISOR_INFO 0x40000000
#define CPUID(idx, eax, ebx, ecx, edx) \
	asm volatile (\
	"ud2a; .ascii \"kvm\"; cpuid" \
	:"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
	:"0"(idx) );

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char string[13];

	CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
	*(unsigned int *)(string + 0) = ebx;
	*(unsigned int *)(string + 4) = ecx;
	*(unsigned int *)(string + 8) = edx;
	string[12] = 0;

	if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
		printf("kvm guest\n");
	else
		printf("bare hardware\n");

	return 0;
}
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Correctly handle usermode exits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_save_host_state has multiple ifdefs for CONFIG_X86_64 that have
no other code between them. Simplify by reducing them to a single
conditional.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The newly introduced local variable is only accessed in one place on
x86_64, but not on 32-bit:
arch/x86/kvm/vmx.c: In function 'vmx_save_host_state':
arch/x86/kvm/vmx.c:2175:6: error: unused variable 'cpu' [-Werror=unused-variable]
This puts it into another #ifdef.
Fixes: 35060ed6a1 ("x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASE")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the WARN_ON in handle_ept_misconfig() as it is unnecessary
and causes false positives. Return the unmodified result of
kvm_mmu_page_fault() instead of converting a system error code to
KVM_EXIT_UNKNOWN so that userspace sees the error code of the
actual failure, not a generic "we don't know what went wrong".
* kvm_mmu_page_fault() will WARN if reserved bits are set in the
SPTEs, i.e. it covers the case where an EPT misconfig occurred
because of a KVM bug.
* The WARN_ON will fire on any system error code that is hit while
handling the fault, e.g. -ENOMEM from mmu_topup_memory_caches()
while handling a legitimate MMIO EPT misconfig or -EFAULT from
kvm_handle_bad_page() if the corresponding HVA is invalid. In
either case, userspace should receive the original error code
and firing a warning is incorrect behavior as KVM is operating
as designed.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The bug that led to commit 95e057e258
was a benign warning (no adverse effects other than the warning
itself) that was detected by syzkaller. Further inspection shows
that the WARN_ON in question, in handle_ept_misconfig(), is
unnecessary and flawed (this was also briefly discussed in the
original patch: https://patchwork.kernel.org/patch/10204649).
* The WARN_ON is unnecessary as kvm_mmu_page_fault() will WARN
if reserved bits are set in the SPTEs, i.e. it covers the case
where an EPT misconfig occurred because of a KVM bug.
* The WARN_ON is flawed because it will fire on any system error
code that is hit while handling the fault, e.g. -ENOMEM can be
returned by mmu_topup_memory_caches() while handling a legitimate
MMIO EPT misconfig.
The original behavior of returning -EFAULT when userspace munmaps
an HVA without first removing the memslot is correct and desirable,
i.e. KVM is letting userspace know it has generated a bad address.
Returning RET_PF_EMULATE masks the WARN_ON in the EPT misconfig path,
but does not fix the underlying bug, i.e. the WARN_ON is bogus.
Furthermore, returning RET_PF_EMULATE has the unwanted side effect of
causing KVM to attempt to emulate an instruction on any page fault
with an invalid HVA translation, e.g. a not-present EPT violation
on a VM_PFNMAP VMA whose fault handler failed to insert a PFN.
* There is no guarantee that the fault is directly related to the
instruction, i.e. the fault could have been triggered by a side
effect memory access in the guest, e.g. while vectoring a #DB or
writing a tracing record. This could cause KVM to effectively
mask the fault if KVM doesn't model the behavior leading to the
fault, i.e. emulation could succeed and resume the guest.
* If emulation does fail, KVM will return EMULATION_FAILED instead
of -EFAULT, which is a red herring as the user will either debug
a bogus emulation attempt or scratch their head wondering why we
were attempting emulation in the first place.
TL;DR: revert to returning -EFAULT and remove the bogus WARN_ON in
handle_ept_misconfig in a future patch.
This reverts commit 95e057e258.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>