.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

Since the host cannot directly access guest registers or memory, much of
the normal functionality of a hypervisor must be moved into the guest. This
is implemented using a Virtualization Exception (#VE) that is handled by
the guest kernel. Some #VEs are handled entirely inside the guest kernel,
but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
==================

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions. The
details for these instructions are discussed below.

Instruction-based #VE
---------------------

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
---------------------

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
--------------------

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests. Their use likely
indicates a bug in the guest. The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor. Guests can make a
hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling. They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module. Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.
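
The three categories can be pictured as a small dispatch in the guest. The
following is only a userspace sketch of that idea, not kernel code: the
names ``msr_class``, ``guest_rdmsr()`` and ``demo_msr_hypercall()`` are
hypothetical, and no real MSR is assigned to any category here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three categories described above. */
enum msr_class { MSR_CLASS_GP, MSR_CLASS_VE, MSR_CLASS_JUST_WORKS };

/* Hypothetical hypercall hook: fills *val and returns true on success. */
typedef bool (*msr_hypercall_fn)(uint32_t msr, uint64_t *val);

/* Stand-in hypervisor that answers every read with a fixed pattern. */
bool demo_msr_hypercall(uint32_t msr, uint64_t *val)
{
	*val = 0xabcd0000ULL | msr;
	return true;
}

/*
 * Model of a guest MSR read. Returns true if the read produced a value,
 * false if the guest should treat the access as a fault.
 */
bool guest_rdmsr(enum msr_class cls, uint32_t msr, uint64_t hw_value,
		 msr_hypercall_fn hcall, uint64_t *val)
{
	assert(val != NULL && hcall != NULL);

	switch (cls) {
	case MSR_CLASS_JUST_WORKS:
		*val = hw_value;	/* no guest handling required */
		return true;
	case MSR_CLASS_VE:
		return hcall(msr, val);	/* #VE handler defers to the hypervisor */
	case MSR_CLASS_GP:
	default:
		return false;		/* likely a guest bug; not retried */
	}
}
```

The #GP case deliberately does not fall back to the hypercall, reflecting
the note above that such an attempt is unlikely to succeed.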

CPUID Behavior
--------------

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0. For these bit
  fields, the hypervisor can mask off the native values, but it cannot
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.
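
The difference between the two virtualization types comes down to bit
arithmetic. The sketch below is illustrative only: the enum and function
names are hypothetical, and bits outside the configurable field are
assumed, for simplicity, to keep their native value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the two virtualization types described above. */
enum cpuid_virt_type {
	CPUID_HOST_CONTROLLED,	/* hypervisor picks the value outright */
	CPUID_NATIVE_OR_ZERO,	/* hypervisor may only mask native bits off */
};

/*
 * Compute what the guest would see in one CPUID output register.
 * 'field_mask' selects the configurable bits, 'native' is the hardware
 * value and 'host_cfg' is the value requested by the hypervisor.
 */
uint32_t cpuid_guest_view(enum cpuid_virt_type type, uint32_t field_mask,
			  uint32_t native, uint32_t host_cfg)
{
	/* Assumption of this sketch: other bits are reported natively. */
	uint32_t other = native & ~field_mask;
	uint32_t field;

	assert(type == CPUID_HOST_CONTROLLED || type == CPUID_NATIVE_OR_ZERO);

	if (type == CPUID_HOST_CONTROLLED)
		field = host_cfg & field_mask;
	else
		field = native & host_cfg & field_mask;	/* can clear, never set */

	return other | field;
}
```

With a native field of ``0x5`` and a requested value of ``0xa``, the
native-or-zero type yields ``0x0`` because no requested bit is natively
set, while the host-controlled type yields ``0xa``.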

#VE on Memory Accesses
======================

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections. Its content is protected
against access from the hypervisor. Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared. It selects the behavior with a bit in its page table
entries. This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
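
As a rough illustration of selecting private or shared with a page table
bit, the sketch below assumes that the shared bit is the highest bit of the
guest physical address width. That position is an assumption of this
example and is not stated in this document. The helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the shared bit is the top bit of the guest physical address. */
uint64_t td_shared_mask(unsigned int gpa_width)
{
	assert(gpa_width >= 1 && gpa_width <= 64);
	return 1ULL << (gpa_width - 1);
}

/* Mark a page table entry as shared with the hypervisor. */
uint64_t pte_mkshared(uint64_t pte, unsigned int gpa_width)
{
	return pte | td_shared_mask(gpa_width);
}

/* Mark a page table entry as private (the default for guest memory). */
uint64_t pte_mkprivate(uint64_t pte, unsigned int gpa_width)
{
	return pte & ~td_shared_mask(gpa_width);
}

bool pte_is_shared(uint64_t pte, unsigned int gpa_width)
{
	return (pte & td_shared_mask(gpa_width)) != 0;
}
```

Because the guest owns its page tables, only the guest can flip this bit,
which is what keeps the choice out of the hypervisor's hands.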

#VE on Shared Memory
--------------------

Access to shared mappings can cause a #VE. The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages where it can safely handle a #VE.
For instance, the guest should be careful not to access shared memory in
the #VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks. A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace. Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory. The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
--------------------

An access to a private mapping can also cause a #VE. Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE. It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
=================

Just like page faults or #GPs, #VE exceptions can be either handled or be
fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business. A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues. The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked. The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
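
The blocking rule can be modelled as a two-state machine. This is a toy
userspace model with hypothetical names (``vcpu_model``, ``raise_ve()``,
``veinfo_get()``), intended only to show why the handler should fetch the
#VE info before doing anything that might trigger another #VE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ve_outcome { VE_DELIVERED, VE_DOUBLE_FAULT };

/* Toy model of the per-vCPU state tracked while a #VE is outstanding. */
struct vcpu_model {
	bool ve_blocked;	/* true between #VE delivery and VEINFO.GET */
};

/* A #VE-triggering action occurs on this vCPU. */
enum ve_outcome raise_ve(struct vcpu_model *v)
{
	assert(v != NULL);

	if (v->ve_blocked)
		return VE_DOUBLE_FAULT;	/* nested #VE while blocked: #DF */

	v->ve_blocked = true;		/* interrupts, NMIs and #VE now blocked */
	return VE_DELIVERED;
}

/* Stand-in for the TDG.VP.VEINFO.GET TDCALL, which lifts the block. */
void veinfo_get(struct vcpu_model *v)
{
	assert(v != NULL);
	v->ve_blocked = false;
}
```

In this model, calling ``raise_ve()`` twice without an intervening
``veinfo_get()`` produces the unrecoverable double fault described above.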

MMIO handling
=============

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access. That is not possible in TDX guests because a VMEXIT
would expose the register state to the host. TDX guests don't trust the
host and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest. The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.
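
The point of the conversion is that only the fields needed for the access
cross over to the host. The following userspace sketch illustrates this
with hypothetical names: ``struct mmio_op`` stands for the result of
decoding the faulting instruction, and ``mmio_hypercall()`` stands in for
the TDCALL, backed here by a single fake device register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of decoding the faulting MMIO instruction. */
struct mmio_op {
	bool write;
	unsigned int size;	/* access width in bytes: 1, 2, 4 or 8 */
	uint64_t gpa;		/* guest physical address of the access */
	uint64_t value;		/* data to write, or result of a read */
};

/* Stand-in for the host side: one 64-bit device register. */
static uint64_t fake_device_reg;

/* Stand-in for the hypercall; only these four fields reach the host. */
bool mmio_hypercall(bool write, unsigned int size, uint64_t gpa, uint64_t *val)
{
	uint64_t mask;

	(void)gpa;	/* the fake device ignores the address */
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return false;

	mask = (size == 8) ? UINT64_MAX : (1ULL << (size * 8)) - 1;
	if (write)
		fake_device_reg = *val & mask;
	else
		*val = fake_device_reg & mask;
	return true;
}

/* Guest #VE path: emulate the access by forwarding the decoded fields. */
bool handle_mmio_ve(struct mmio_op *op)
{
	assert(op != NULL);
	return mmio_hypercall(op->write, op->size, op->gpa, &op->value);
}
```

The rest of the guest register state never appears in the call, in
contrast to a VMEXIT-based scheme.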

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
=========================

All TDX guest memory starts out as private at boot. This memory cannot
be accessed by the hypervisor. However, some kernel users like device
drivers might have a need to share data with the hypervisor. To do this,
memory must be converted between shared and private. This can be
accomplished using some existing memory encryption helpers:

* set_memory_decrypted() converts a range of pages to shared.
* set_memory_encrypted() converts memory back to private.
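
The effect of the two helpers can be modelled in userspace as follows. The
functions below are stand-ins that only borrow the kernel helpers' names
and ``(unsigned long addr, int numpages)`` signature. As a simplification
of this sketch, ``addr`` is a page index into a small array rather than a
virtual address, and ``page_is_shared()`` is a hypothetical query helper.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NPAGES 16UL

/* Model state: every page starts out private, as at boot. */
static bool page_shared[MODEL_NPAGES];

static int set_range(unsigned long first, int numpages, bool shared)
{
	/* Reject ranges that fall outside the modelled memory. */
	if (numpages < 0 || first > MODEL_NPAGES ||
	    (unsigned long)numpages > MODEL_NPAGES - first)
		return -1;

	for (unsigned long i = 0; i < (unsigned long)numpages; i++)
		page_shared[first + i] = shared;
	return 0;
}

/* Stand-in: convert a range of pages to shared. */
int set_memory_decrypted(unsigned long addr, int numpages)
{
	return set_range(addr, numpages, true);
}

/* Stand-in: convert a range of pages back to private. */
int set_memory_encrypted(unsigned long addr, int numpages)
{
	return set_range(addr, numpages, false);
}

bool page_is_shared(unsigned long page)
{
	assert(page < MODEL_NPAGES);
	return page_shared[page];
}
```

Both helpers return an error code, so a caller should check the result
before assuming that a conversion took place.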

Device drivers are the primary users of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer is converted at allocation
time. Check force_dma_unencrypted() for details.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html