.. SPDX-License-Identifier: GPL-2.0

==============================
Running nested guests with KVM
==============================

A nested guest is the ability to run a guest inside another guest (it
can be KVM-based or a different hypervisor). The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |        L1 (Guest Hypervisor)       |
              |           KVM (/dev/kvm)           |
              |                                    |
      .------------------------------------------------------.
      |                  L0 (Host Hypervisor)                 |
      |                     KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |       Hardware (with virtualization extensions)      |
      '------------------------------------------------------'
Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup — L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures; and will largely focus on
          x86.
Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes). Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest). This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).
Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD. (Though your Linux distribution might
override this default.)
In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify if the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
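
For instance, the AMD equivalents of steps 3 and 5 might look like the
following (a minimal sketch; the config file name is arbitrary, and
depending on the kernel version the parameter may read back as ``1`` or
``Y``)::

    $ cat /etc/modprobe.d/kvm_amd.conf
    options kvm-amd nested=1

    $ cat /sys/module/kvm_amd/parameters/nested
    1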
Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)", APIC Virtualization on your bare
metal host (L0). Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).
Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

then the guest hypervisor will subsequently be capable of running a
nested guest with accelerated KVM.
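
Before going further, it can be worth confirming inside the L1 guest
that the virtualization extensions really made it through and that KVM
is usable there; a quick, minimal check could be::

    $ grep -c -E 'vmx|svm' /proc/cpuinfo    # non-zero if VMX/SVM is visible in L1
    $ ls -l /dev/kvm                        # present once the KVM modules are loaded in L1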
Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter — i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature — with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm
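
To double-check the s390x prerequisites, you can inspect the KVM module
parameters on L0 and look for the ``sie`` facility in the L1 guest; a
minimal sketch (assuming the default sysfs paths)::

    # on L0: nested must be enabled, hpage disabled
    $ cat /sys/module/kvm/parameters/nested
    $ cat /sys/module/kvm/parameters/hpage

    # in L1: the "sie" feature should appear among the CPU features
    $ grep -w sie /proc/cpuinfo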
Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.
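
If you manage the guests with libvirt, a live migration of an L2 guest
to another L1 guest could look roughly like the following (a minimal
sketch; the guest name ``l2-guest`` and the destination URI are
placeholders, and shared or pre-copied storage is assumed)::

    # run inside the source L1 guest hypervisor
    $ virsh migrate --live --verbose l2-guest \
          qemu+ssh://other-l1.example.com/system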
Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup. If you are running any kind
  of "nesting" at all, say so. Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM. Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation, or what QEMU calls "TCG", while
  they think they're running nested KVM; this confuses "nested Virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).
  A quick way to check is shown after this list.
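
One way to verify you are really running KVM (and not TCG) inside L1 is
to check for the KVM device and modules there, and, if the L2 guest is
QEMU-based, to query its monitor; a minimal sketch (assuming the L2
guest exposes a monitor prompt)::

    # inside L1: the KVM modules must be loaded and /dev/kvm present
    $ lsmod | grep kvm
    $ ls /dev/kvm

    # in the QEMU monitor of the L2 guest: should report "kvm support: enabled"
    (qemu) info kvm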
Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt, and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
  ``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
  complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1
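
To cut down on the back-and-forth, it can help to gather all of the
above in one go; a rough helper script along these lines (run separately
on L0 and L1; the script name and output file name are arbitrary, and
the QEMU/libvirt version commands assume those packages are installed)
might look like::

    #!/bin/sh
    # collect-nested-debug.sh -- gather basic debugging info on this level (L0 or L1)
    out=nested-debug-$(hostname)-$(date +%Y%m%d).txt
    {
        echo "### kernel";    uname -a
        echo "### qemu";      qemu-system-x86_64 --version 2>/dev/null
        echo "### libvirt";   virsh --version 2>/dev/null
        echo "### cpuinfo";   cat /proc/cpuinfo
        echo "### lscpu";     lscpu
        echo "### dmesg";     dmesg
    } > "$out"
    echo "wrote $out"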
x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1
s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0