Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:
   - GICv4.1 support

   - 32bit host removal

  PPC:
   - secure (encrypted) using under the Protected Execution Framework
     ultravisor

  s390:
   - allow disabling GISA (hardware interrupt injection) and protected
     VMs/ultravisor support.

  x86:
   - New dirty bitmap flag that sets all bits in the bitmap when dirty
     page logging is enabled; this is faster because it doesn't require
     bulk modification of the page tables.

   - Initial work on making nested SVM event injection more similar to
     VMX, and less buggy.

   - Various cleanups to MMU code (though the big ones and related
     optimizations were delayed to 5.8). Instead of using cr3 in
     function names which occasionally means eptp, KVM too has
     standardized on "pgd".

   - A large refactoring of CPUID features, which now use an array that
     parallels the core x86_features.

   - Some removal of pointer chasing from kvm_x86_ops, which will also
     be switched to static calls as soon as they are available.

   - New Tigerlake CPUID features.

   - More bugfixes, optimizations and cleanups.

  Generic:
   - selftests: cleanups, new MMU notifier stress test, steal-time test

   - CSV output for kvm_stat"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits)
  x86/kvm: fix a missing-prototypes "vmread_error"
  KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y
  KVM: VMX: Add a trampoline to fix VMREAD error handling
  KVM: SVM: Annotate svm_x86_ops as __initdata
  KVM: VMX: Annotate vmx_x86_ops as __initdata
  KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup()
  KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection
  KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes
  KVM: VMX: Configure runtime hooks using vmx_x86_ops
  KVM: VMX: Move hardware_setup() definition below vmx_x86_ops
  KVM: x86: Move init-only kvm_x86_ops to separate struct
  KVM: Pass kvm_init()'s opaque param to additional arch funcs
  s390/gmap: return proper error code on ksm unsharing
  KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move()
  KVM: Fix out of range accesses to memslots
  KVM: X86: Micro-optimize IPI fastpath delay
  KVM: X86: Delay read msr data iff writes ICR MSR
  KVM: PPC: Book3S HV: Add a capability for enabling secure guests
  KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs
  KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs
  ...
This commit is contained in:
Linus Torvalds
2020-04-02 15:13:15 -07:00
206 changed files with 7874 additions and 9705 deletions

View File

@@ -1574,8 +1574,8 @@ This ioctl would set vcpu's xcr to the value userspace specified.
};
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0)
#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1)
#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2)
#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) /* deprecated */
#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) /* deprecated */
struct kvm_cpuid_entry2 {
__u32 function;
@@ -1626,13 +1626,6 @@ emulate them efficiently. The fields in each entry are defined as follows:
KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
if the index field is valid
KVM_CPUID_FLAG_STATEFUL_FUNC:
if cpuid for this function returns different values for successive
invocations; there will be several entries with the same function,
all with this flag set
KVM_CPUID_FLAG_STATE_READ_NEXT:
for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
the first entry to be read by a cpu
eax, ebx, ecx, edx:
the values returned by the cpuid instruction for
@@ -2117,7 +2110,8 @@ Errors:
====== ============================================================
 ENOENT   no such register
 EINVAL   invalid register ID, or no such register
 EINVAL   invalid register ID, or no such register or used with VMs in
protected virtualization mode on s390
 EPERM    (arm64) register access not allowed before vcpu finalization
====== ============================================================
@@ -2552,7 +2546,8 @@ Errors include:
======== ============================================================
 ENOENT   no such register
 EINVAL   invalid register ID, or no such register
 EINVAL   invalid register ID, or no such register or used with VMs in
protected virtualization mode on s390
 EPERM    (arm64) register access not allowed before vcpu finalization
======== ============================================================
@@ -3347,8 +3342,8 @@ The member 'flags' is used for passing flags from userspace.
::
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0)
#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1)
#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2)
#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) /* deprecated */
#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) /* deprecated */
struct kvm_cpuid_entry2 {
__u32 function;
@@ -3394,13 +3389,6 @@ The fields in each entry are defined as follows:
KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
if the index field is valid
KVM_CPUID_FLAG_STATEFUL_FUNC:
if cpuid for this function returns different values for successive
invocations; there will be several entries with the same function,
all with this flag set
KVM_CPUID_FLAG_STATE_READ_NEXT:
for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
the first entry to be read by a cpu
eax, ebx, ecx, edx:
@@ -4649,6 +4637,60 @@ the clear cpu reset definition in the POP. However, the cpu is not put
into ESA mode. This reset is a superset of the initial reset.
4.125 KVM_S390_PV_COMMAND
-------------------------
:Capability: KVM_CAP_S390_PROTECTED
:Architectures: s390
:Type: vm ioctl
:Parameters: struct kvm_pv_cmd
:Returns: 0 on success, < 0 on error
::
struct kvm_pv_cmd {
__u32 cmd; /* Command to be executed */
__u16 rc; /* Ultravisor return code */
__u16 rrc; /* Ultravisor return reason code */
__u64 data; /* Data or address */
__u32 flags; /* flags for future extensions. Must be 0 for now */
__u32 reserved[3];
};
cmd values:
KVM_PV_ENABLE
Allocate memory and register the VM with the Ultravisor, thereby
donating memory to the Ultravisor that will become inaccessible to
KVM. All existing CPUs are converted to protected ones. After this
command has succeeded, any CPU added via hotplug will become
protected during its creation as well.
Errors:
===== =============================
EINTR an unmasked signal is pending
===== =============================
KVM_PV_DISABLE
Deregister the VM from the Ultravisor and reclaim the memory that
had been donated to the Ultravisor, making it usable by the kernel
again. All registered VCPUs are converted back to non-protected
ones.
KVM_PV_VM_SET_SEC_PARMS
Pass the image header from VM memory to the Ultravisor in
preparation of image unpacking and verification.
KVM_PV_VM_UNPACK
Unpack (protect and decrypt) a page of the encrypted boot image.
KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
5. The kvm_run structure
========================
@@ -5707,8 +5749,13 @@ and injected exceptions.
:Architectures: x86, arm, arm64, mips
:Parameters: args[0] whether feature should be enabled or not
With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
clear and write-protect all pages that are returned as dirty.
Valid flags are::
#define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0)
#define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
With KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE is set, KVM_GET_DIRTY_LOG will not
automatically clear and write-protect all pages that are returned as dirty.
Rather, userspace will have to do this operation separately using
KVM_CLEAR_DIRTY_LOG.
@@ -5719,18 +5766,42 @@ than requiring to sync a full memslot; this ensures that KVM does not
take spinlocks for an extended period of time. Second, in some cases a
large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
userspace actually using the data in the page. Pages can be modified
during this time, which is inefficint for both the guest and userspace:
during this time, which is inefficient for both the guest and userspace:
the guest will incur a higher penalty due to write protection faults,
while userspace can see false reports of dirty pages. Manual reprotection
helps reducing this time, improving guest performance and reducing the
number of dirty log false positives.
With KVM_DIRTY_LOG_INITIALLY_SET set, all the bits of the dirty bitmap
will be initialized to 1 when created. This also improves performance because
dirty logging can be enabled gradually in small chunks on the first call
to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
x86 for now).
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
it hard or impossible to use it correctly. The availability of
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
7.19 KVM_CAP_PPC_SECURE_GUEST
------------------------------
:Architectures: ppc
This capability indicates that KVM is running on a host that has
ultravisor firmware and thus can support a secure guest. On such a
system, a guest can ask the ultravisor to make it a secure guest,
one whose memory is inaccessible to the host except for pages which
are explicitly requested to be shared with the host. The ultravisor
notifies KVM when a guest requests to become a secure guest, and KVM
has the opportunity to veto the transition.
If present, this capability can be enabled for a VM, meaning that KVM
will allow the transition to secure guest mode. Otherwise KVM will
veto the transition.
8. Other capabilities.
======================
@@ -6027,3 +6098,14 @@ Architectures: s390
This capability indicates that the KVM_S390_NORMAL_RESET and
KVM_S390_CLEAR_RESET ioctls are available.
8.23 KVM_CAP_S390_PROTECTED
Architecture: s390
This capability indicates that the Ultravisor has been initialized and
KVM can therefore start protected VMs.
This capability governs the KVM_S390_PV_COMMAND ioctl and the
KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
guests when the state change is invalid.

View File

@@ -11,6 +11,11 @@ hypervisor when running as a guest (under Xen, KVM or any other
hypervisor), or any hypervisor-specific interaction when the kernel is
used as a host.
Note: KVM/arm has been removed from the kernel. The API described
here is still valid though, as it allows the kernel to kexec when
booted at HYP. It can also be used by a hypervisor other than KVM
if necessary.
On arm and arm64 (without VHE), the kernel doesn't run in hypervisor
mode, but still needs to interact with it, allowing a built-in
hypervisor to be either installed or torn down.

View File

@@ -108,16 +108,9 @@ Groups:
mask or unmask the adapter, as specified in mask
KVM_S390_IO_ADAPTER_MAP
perform a gmap translation for the guest address provided in addr,
pin a userspace page for the translated address and add it to the
list of mappings
.. note:: A new mapping will be created unconditionally; therefore,
the calling code should avoid making duplicate mappings.
This is now a no-op. The mapping is purely done by the irq route.
KVM_S390_IO_ADAPTER_UNMAP
release a userspace page for the translated address specified in addr
from the list of mappings
This is now a no-op. The mapping is purely done by the irq route.
KVM_DEV_FLIC_AISM
modify the adapter-interruption-suppression mode for a given isc if the

View File

@@ -18,6 +18,8 @@ KVM
nested-vmx
ppc-pv
s390-diag
s390-pv
s390-pv-boot
timekeeping
vcpu-requests

View File

@@ -96,19 +96,18 @@ will happen:
We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
to pin gfn to pfn, because after gfn_to_pfn_atomic():
to gfn. For indirect sp, we disabled fast page fault for simplicity.
A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
- We have held the refcount of pfn that means the pfn can not be freed and
be reused for another gfn.
- The pfn is writable that means it can not be shared between different gfns
- The pfn is writable and therefore it cannot be shared between different gfns
by KSM.
Then, we can ensure the dirty bitmaps is correctly set for a gfn.
Currently, to simplify the whole things, we disable fast page fault for
indirect shadow page.
2) Dirty bit tracking
In the origin code, the spte can be fast updated (non-atomically) if the

View File

@@ -0,0 +1,84 @@
.. SPDX-License-Identifier: GPL-2.0
======================================
s390 (IBM Z) Boot/IPL of Protected VMs
======================================
Summary
-------
The memory of Protected Virtual Machines (PVMs) is not accessible to
I/O or the hypervisor. In those cases where the hypervisor needs to
access the memory of a PVM, that memory must be made accessible.
Memory made accessible to the hypervisor will be encrypted. See
:doc:`s390-pv` for details."
On IPL (boot) a small plaintext bootloader is started, which provides
information about the encrypted components and necessary metadata to
KVM to decrypt the protected virtual machine.
Based on this data, KVM will make the protected virtual machine known
to the Ultravisor (UV) and instruct it to secure the memory of the
PVM, decrypt the components and verify the data and address list
hashes, to ensure integrity. Afterwards KVM can run the PVM via the
SIE instruction which the UV will intercept and execute on KVM's
behalf.
As the guest image is just like an opaque kernel image that does the
switch into PV mode itself, the user can load encrypted guest
executables and data via every available method (network, dasd, scsi,
direct kernel, ...) without the need to change the boot process.
Diag308
-------
This diagnose instruction is the basic mechanism to handle IPL and
related operations for virtual machines. The VM can set and retrieve
IPL information blocks, that specify the IPL method/devices and
request VM memory and subsystem resets, as well as IPLs.
For PVMs this concept has been extended with new subcodes:
Subcode 8: Set an IPL Information Block of type 5 (information block
for PVMs)
Subcode 9: Store the saved block in guest memory
Subcode 10: Move into Protected Virtualization mode
The new PV load-device-specific-parameters field specifies all data
that is necessary to move into PV mode.
* PV Header origin
* PV Header length
* List of Components composed of
* AES-XTS Tweak prefix
* Origin
* Size
The PV header contains the keys and hashes, which the UV will use to
decrypt and verify the PV, as well as control flags and a start PSW.
The components are for instance an encrypted kernel, kernel parameters
and initrd. The components are decrypted by the UV.
After the initial import of the encrypted data, all defined pages will
contain the guest content. All non-specified pages will start out as
zero pages on first access.
When running in protected virtualization mode, some subcodes will result in
exceptions or return error codes.
Subcodes 4 and 7, which specify operations that do not clear the guest
memory, will result in specification exceptions. This is because the
UV will clear all memory when a secure VM is removed, and therefore
non-clearing IPL subcodes are not allowed.
Subcodes 8, 9, 10 will result in specification exceptions.
Re-IPL into a protected mode is only possible via a detour into non
protected mode.
Keys
----
Every CEC will have a unique public key to enable tooling to build
encrypted images.
See `s390-tools <https://github.com/ibm-s390-tools/s390-tools/>`_
for the tooling.

View File

@@ -0,0 +1,116 @@
.. SPDX-License-Identifier: GPL-2.0
=========================================
s390 (IBM Z) Ultravisor and Protected VMs
=========================================
Summary
-------
Protected virtual machines (PVM) are KVM VMs that do not allow KVM to
access VM state like guest memory or guest registers. Instead, the
PVMs are mostly managed by a new entity called Ultravisor (UV). The UV
provides an API that can be used by PVMs and KVM to request management
actions.
Each guest starts in non-protected mode and then may make a request to
transition into protected mode. On transition, KVM registers the guest
and its VCPUs with the Ultravisor and prepares everything for running
it.
The Ultravisor will secure and decrypt the guest's boot memory
(i.e. kernel/initrd). It will safeguard state changes like VCPU
starts/stops and injected interrupts while the guest is running.
As access to the guest's state, such as the SIE state description, is
normally needed to be able to run a VM, some changes have been made in
the behavior of the SIE instruction. A new format 4 state description
has been introduced, where some fields have different meanings for a
PVM. SIE exits are minimized as much as possible to improve speed and
reduce exposed guest state.
Interrupt injection
-------------------
Interrupt injection is safeguarded by the Ultravisor. As KVM doesn't
have access to the VCPUs' lowcores, injection is handled via the
format 4 state description.
Machine check, external, IO and restart interruptions each can be
injected on SIE entry via a bit in the interrupt injection control
field (offset 0x54). If the guest cpu is not enabled for the interrupt
at the time of injection, a validity interception is recognized. The
format 4 state description contains fields in the interception data
block where data associated with the interrupt can be transported.
Program and Service Call exceptions have another layer of
safeguarding; they can only be injected for instructions that have
been intercepted into KVM. The exceptions need to be a valid outcome
of an instruction emulation by KVM, e.g. we can never inject a
addressing exception as they are reported by SIE since KVM has no
access to the guest memory.
Mask notification interceptions
-------------------------------
KVM cannot intercept lctl(g) and lpsw(e) anymore in order to be
notified when a PVM enables a certain class of interrupt. As a
replacement, two new interception codes have been introduced: One
indicating that the contents of CRs 0, 6, or 14 have been changed,
indicating different interruption subclasses; and one indicating that
PSW bit 13 has been changed, indicating that a machine check
intervention was requested and those are now enabled.
Instruction emulation
---------------------
With the format 4 state description for PVMs, the SIE instruction already
interprets more instructions than it does with format 2. It is not able
to interpret every instruction, but needs to hand some tasks to KVM;
therefore, the SIE and the ultravisor safeguard emulation inputs and outputs.
The control structures associated with SIE provide the Secure
Instruction Data Area (SIDA), the Interception Parameters (IP) and the
Secure Interception General Register Save Area. Guest GRs and most of
the instruction data, such as I/O data structures, are filtered.
Instruction data is copied to and from the SIDA when needed. Guest
GRs are put into / retrieved from the Secure Interception General
Register Save Area.
Only GR values needed to emulate an instruction will be copied into this
save area and the real register numbers will be hidden.
The Interception Parameters state description field still contains the
the bytes of the instruction text, but with pre-set register values
instead of the actual ones. I.e. each instruction always uses the same
instruction text, in order not to leak guest instruction text.
This also implies that the register content that a guest had in r<n>
may be in r<m> from the hypervisor's point of view.
The Secure Instruction Data Area contains instruction storage
data. Instruction data, i.e. data being referenced by an instruction
like the SCCB for sclp, is moved via the SIDA. When an instruction is
intercepted, the SIE will only allow data and program interrupts for
this instruction to be moved to the guest via the two data areas
discussed before. Other data is either ignored or results in validity
interceptions.
Instruction emulation interceptions
-----------------------------------
There are two types of SIE secure instruction intercepts: the normal
and the notification type. Normal secure instruction intercepts will
make the guest pending for instruction completion of the intercepted
instruction type, i.e. on SIE entry it is attempted to complete
emulation of the instruction with the data provided by KVM. That might
be a program exception or instruction completion.
The notification type intercepts inform KVM about guest environment
changes due to guest instruction interpretation. Such an interception
is recognized, for example, for the store prefix instruction to provide
the new lowcore location. On SIE reentry, any KVM data in the data areas
is ignored and execution continues as if the guest instruction had
completed. For that reason KVM is not allowed to inject a program
interrupt.
Links
-----
`KVM Forum 2019 presentation <https://static.sched.com/hosted_files/kvmforum2019/3b/ibm_protected_vms_s390x.pdf>`_