.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

In addition to the convenience of letting the device use application
virtual addresses, SVA does not require pinning pages for DMA.

PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to function much the same way as the CPU handling
application page-faults. For more information please refer to the PCIe
specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. The IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use PRI to request that the virtual address be paged into the CPU page
tables. The device must then use ATS again to fetch the translation
before use.
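
The ATS/PRI sequence described above can be sketched as a small software
model. This is only an illustration under stated assumptions: the structure
and function names (``sva_model``, ``device_access()`` and so on) are
hypothetical, the address space is reduced to eight pages, and no real IOMMU
or PCIe interface is involved.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8                 /* toy address space: 8 virtual pages */

/*
 * Hypothetical model state: which pages the OS has made present, and
 * which translations the device has cached in its device TLB.
 */
struct sva_model {
    bool cpu_present[NPAGES];
    bool dev_tlb[NPAGES];
    unsigned int page_requests;  /* number of PRI requests issued */
};

/* ATS request: a translation is returned only for present pages. */
static bool ats_translate(struct sva_model *m, size_t vpn)
{
    if (!m->cpu_present[vpn])
        return false;
    m->dev_tlb[vpn] = true;      /* device caches the translation */
    return true;
}

/* PRI request: ask the OS to page the address in. */
static void pri_page_request(struct sva_model *m, size_t vpn)
{
    m->page_requests++;
    m->cpu_present[vpn] = true;
}

/* Device access: device TLB hit, else ATS, else PRI followed by ATS. */
static bool device_access(struct sva_model *m, size_t vpn)
{
    if (vpn >= NPAGES)
        return false;
    if (m->dev_tlb[vpn])
        return true;
    if (ats_translate(m, vpn))
        return true;
    pri_page_request(m, vpn);
    return ats_translate(m, vpn);    /* ATS is used again before use */
}

/* mmu_notifier-style invalidation: drop the page and the cached entry. */
static void invalidate_page(struct sva_model *m, size_t vpn)
{
    if (vpn < NPAGES) {
        m->cpu_present[vpn] = false;
        m->dev_tlb[vpn] = false;
    }
}
```

In this model the first ``device_access()`` to a page issues one page
request, later accesses hit the device TLB, and ``invalidate_page()`` forces
the next access back through ATS and PRI.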

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization compared to hard
partitioning of resources, which can result in underutilization. To let the
hardware distinguish the context for which work submitted through the SWQ
interface is being executed, SIOV uses the Process Address Space ID (PASID),
which is a 20-bit number defined by the PCI-SIG.

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O at per-PASID granularity in addition to using the PCIe
Requester ID (RID), which is the Bus/Device/Function.
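
As an illustration of the two identifiers, the sketch below packs a
Bus/Device/Function triple into a 16-bit RID (8-bit bus, 5-bit device,
3-bit function, as in the PCIe specification) and range-checks a 20-bit
PASID. The helper names and the combined ``io_context_key()`` are
hypothetical; they only show how an (RID, PASID) pair could identify one
I/O context.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MAX  ((1u << PASID_BITS) - 1)    /* 0xFFFFF */

/* Pack Bus (8 bits), Device (5 bits), Function (3 bits) into a RID. */
static uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* A PASID is valid only if it fits in 20 bits. */
static bool pasid_is_valid(uint32_t pasid)
{
    return pasid <= PASID_MAX;
}

/* Hypothetical lookup key for tracking I/O per (RID, PASID) pair. */
static uint64_t io_context_key(uint16_t rid, uint32_t pasid)
{
    return ((uint64_t)rid << PASID_BITS) | (pasid & PASID_MAX);
}
```

For example, device 6a:01.0 has RID 0x6a08, and the same device produces a
different key for each PASID it uses.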

ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and returns a status indicating
whether the command was accepted by hardware. This lets the submitter know
whether the submission needs to be retried, or whether other device-specific
mechanisms for fairness or forward progress should be used.

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.
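
Because ENQCMD reports whether the device accepted the descriptor, a
submitter typically wraps it in a bounded retry loop. The sketch below shows
one possible shape of such a loop. It is hardware independent: the
``submit_fn`` hook, ``submit_with_retry()`` and the ``fake_submit()`` test
double are hypothetical names, and the hook is assumed to return 0 on
acceptance and non-zero on retry. The 64-byte descriptor size follows the
ENQCMD instruction definition; the descriptor layout itself is device
specific.

```c
#include <stdint.h>

/* 64-byte work descriptor; the layout is device specific, so opaque here. */
struct work_desc {
    uint8_t bytes[64];
};

/*
 * Hypothetical submit hook standing in for ENQCMD: returns 0 when the
 * device accepted the descriptor, non-zero when it asks for a retry.
 */
typedef int (*submit_fn)(void *portal, const struct work_desc *desc);

/*
 * Try to submit up to max_tries times. Returns the number of attempts
 * used on success, or -1 if the device never accepted the descriptor.
 */
static int submit_with_retry(submit_fn submit, void *portal,
                             const struct work_desc *desc, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit(portal, desc) == 0)
            return attempt;
        /* A real submitter would back off here before retrying. */
    }
    return -1;
}

/* Test double: a "device" that rejects the first *busy submissions. */
static int fake_submit(void *portal, const struct work_desc *desc)
{
    int *busy = portal;

    (void)desc;
    if (*busy > 0) {
        (*busy)--;
        return 1;       /* retry */
    }
    return 0;           /* accepted */
}
```

On real hardware the hook would wrap the ENQCMD instruction writing to the
mmap()'d portal. Bounding the number of attempts keeps a busy queue from
turning into an unbounded spin in the submitter.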

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This forces any
  future access by the device to this virtual address to go through ATS
  again. If the IOMMU responds that the page is not present, the device
  requests that the page be paged in via the PCIe PRI protocol before
  performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.
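
For reference, the value held in the MSR can be pictured with the helpers
below. They assume the layout documented in the Intel SDM (PASID in bits
19:0, valid flag in bit 31, MSR address 0xD93). The constant names mirror
those used in the kernel sources; the helper functions are hypothetical and
only manipulate a plain integer, they do not access the MSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_PASID        0x00000d93u    /* MSR address */
#define MSR_IA32_PASID_VALID  (1ull << 31)   /* valid flag */
#define PASID_MASK            0xfffffull     /* PASID in bits 19:0 */

/* Build the MSR value for an allocated PASID. */
static uint64_t pasid_msr_encode(uint32_t pasid)
{
    return (pasid & PASID_MASK) | MSR_IA32_PASID_VALID;
}

/* True if the MSR value carries a usable PASID. */
static bool pasid_msr_is_valid(uint64_t msr)
{
    return (msr & MSR_IA32_PASID_VALID) != 0;
}

/* Extract the 20-bit PASID from the MSR value. */
static uint32_t pasid_msr_pasid(uint64_t msr)
{
    return (uint32_t)(msr & PASID_MASK);
}
```

A cleared MSR (value 0) is therefore distinguishable from a loaded one even
when the PASID itself is small.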

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests
from this process. When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process page-table address used by the CPU (e.g.
the %cr3 register on x86).

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, and thus the same MSR value.

PASID Life Cycle Management
===========================

PASID is initialized as INVALID_IOASID (-1) when a process is created.

Only processes that access SVA-capable devices need to have a PASID
allocated. This allocation happens when a process opens/binds an SVA-capable
device but finds no PASID for this process. Subsequent binds of the same or
other devices will share the same PASID.

Although the PASID is allocated to the process by opening a device,
it is not active in any of the threads of that process. It is loaded into the
IA32_PASID MSR lazily when a thread tries to submit a work descriptor
to a device using ENQCMD.

That first access will trigger a #GP fault because the IA32_PASID MSR
has not been initialized with the PASID value assigned to the process
when the device was opened. The Linux #GP handler notes that a PASID has
been allocated for the process, and so initializes the IA32_PASID MSR
and returns so that the ENQCMD instruction is re-executed.
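
The lazy-load decision made in the #GP path can be modelled as follows. This
is a user-space sketch of the logic described above, not the kernel's code:
``mm_model``, ``thread_model`` and ``fixup_enqcmd_gp()`` are hypothetical
names, the MSR is a plain struct field, and the valid flag is assumed to be
bit 31.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_PASID    UINT32_MAX     /* models INVALID_IOASID (-1) */
#define PASID_MSR_VALID  (1ull << 31)   /* assumed valid flag in the MSR */

struct mm_model {
    uint32_t pasid;                     /* one PASID per address space */
};

struct thread_model {
    struct mm_model *mm;
    uint64_t pasid_msr;                 /* per-thread IA32_PASID value */
};

/*
 * Called on a #GP raised by ENQCMD. Returns true when the fault was
 * fixed up and the instruction should be re-executed, false when the
 * fault is genuine and must be delivered to the task.
 */
static bool fixup_enqcmd_gp(struct thread_model *t)
{
    if (t->mm->pasid == INVALID_PASID)
        return false;       /* no PASID was allocated for this process */
    if (t->pasid_msr & PASID_MSR_VALID)
        return false;       /* MSR already loaded: #GP has another cause */

    t->pasid_msr = t->mm->pasid | PASID_MSR_VALID;
    return true;
}
```

Each thread takes the fixup path at most once: after the MSR is loaded, a
further #GP is treated as a real fault.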

On fork(2) or exec(2) the PASID is removed from the process as it no
longer has the same address space that it had when the device was opened.

On clone(2) the new task shares the same address space, so it can use the
PASID allocated to the process. The IA32_PASID MSR is not preemptively
initialized for the new task, for three reasons: the PASID might not be
allocated yet, the kernel does not know whether this thread is going to
access the device, and a cleared IA32_PASID MSR reduces context switch
overhead through the xstate init optimization. Since #GP faults have to be
handled on any threads that were created before the PASID was assigned to
the mm of the process, newly created threads might as well be treated in a
consistent way.
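
The allocation and inheritance rules above can be summarized in a toy model.
All names here (``bind_device()``, ``fork_mm()``, ``clone_thread()``) are
hypothetical, and the counter-based allocator is a stand-in for whatever
allocator the kernel actually uses.

```c
#include <stdint.h>

#define INVALID_PASID UINT32_MAX        /* models INVALID_IOASID (-1) */

struct mm_model {
    uint32_t pasid;                     /* one PASID per address space */
};

struct task_model {
    struct mm_model *mm;
    uint64_t pasid_msr;                 /* per-thread IA32_PASID value */
};

static uint32_t next_pasid = 1;         /* toy allocator state */

/* First bind allocates a PASID; later binds reuse the same one. */
static uint32_t bind_device(struct mm_model *mm)
{
    if (mm->pasid == INVALID_PASID)
        mm->pasid = next_pasid++;
    return mm->pasid;
}

/* fork(2)/exec(2): the new address space starts without a PASID. */
static void fork_mm(struct mm_model *child)
{
    child->pasid = INVALID_PASID;
}

/* clone(2) thread: shares the mm, but its MSR starts out cleared. */
static struct task_model clone_thread(const struct task_model *parent)
{
    struct task_model t = { parent->mm, 0 };

    return t;
}
```

A cloned thread therefore sees the process PASID through its shared mm, but
still goes through the lazy #GP path on its first ENQCMD.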

Because freeing the PASID and clearing the IA32_PASID MSR in every thread
at unbind time is complex, the PASID is freed lazily, only on mm exit.

If a process does a close(2) of the device file descriptor and munmap(2)
of the device MMIO portal, then the driver will unbind the device. The
PASID is still marked VALID in the IA32_PASID MSR for any threads in the
process that accessed the device. But this is harmless as without the
MMIO portal they cannot submit new work to the device.

Relationships
=============

* Each process has many threads, but only one PASID.
* Devices have a limited number (roughly tens to thousands) of hardware
  workqueues. The device driver manages allocating hardware workqueues.
* A single mmap() maps a single hardware workqueue as a "portal" and
  each portal maps down to a single workqueue.
* For each device with which a process interacts, there must be
  one or more mmap()'d portals.
* Many threads within a process can share a single portal to access
  a single device.
* Multiple processes can separately mmap() the same portal, in
  which case they still share one device hardware workqueue.
* The single process-wide PASID is used by all threads to interact
  with all devices. There is not, for instance, a PASID for each
  thread or each thread<->device pair.

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
there is a separate hardware instance required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. The
hardware also manages the queue depth for Shared Work Queues (SWQ), and
consumers don't need to track queue depth. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check the Deferrable Memory Write (DMWr) capability on the
device and only submit ENQCMD when the device supports it. In the new DMWr
PCIe terminology, devices need to support the DMWr completer capability. In
addition, all switch ports must support DMWr routing, and DMWr must be
enabled by the PCIe subsystem, much like how PCIe atomic operations are
managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.
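
A shared workqueue can be pictured as one device-managed ring that accepts
work from several processes and tags each entry with the submitter's PASID.
The model below is a sketch under stated assumptions: the names, the fixed
depth of four entries, and the 0/1 return convention are all hypothetical,
and a real device implements this in hardware.

```c
#include <stdint.h>

#define SWQ_DEPTH 4                     /* toy queue depth */

struct swq_entry {
    uint32_t pasid;                     /* identifies the submitter */
    uint64_t desc_addr;                 /* virtual address of the work */
};

struct swq {
    struct swq_entry slots[SWQ_DEPTH];
    unsigned int head;                  /* index of the oldest entry */
    unsigned int count;                 /* depth tracked by the device */
};

/* Returns 0 if the entry was accepted, 1 if the submitter must retry. */
static int swq_enqueue(struct swq *q, uint32_t pasid, uint64_t desc_addr)
{
    unsigned int tail;

    if (q->count == SWQ_DEPTH)
        return 1;                       /* no space: indicate retry */
    tail = (q->head + q->count) % SWQ_DEPTH;
    q->slots[tail].pasid = pasid;
    q->slots[tail].desc_addr = desc_addr;
    q->count++;
    return 0;
}

/* Device side: take the oldest entry. Returns 1 if the queue is empty. */
static int swq_dequeue(struct swq *q, struct swq_entry *out)
{
    if (q->count == 0)
        return 1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % SWQ_DEPTH;
    q->count--;
    return 0;
}
```

Submitters never read ``count``; they only see the accept or retry status,
which matches the point that consumers do not need to track queue depth.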

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full-blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each virtual function
must present an almost fully functional interface to software, supporting
the traditional BARs, space for interrupts via MSI-X, and its own register
layout. Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device. This allows device
hardware to optimize device resource creation, and lets the number of
virtual devices grow dynamically on demand. SR-IOV creation and management
is very static in nature. Consult the references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.
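
Conceptually, the translation is a lookup from guest PASID to host PASID.
The sketch below uses a small flat array to show the idea only. The actual
table format is defined in the instruction set reference and is not
reproduced here; ``pasid_xlate``, ``xlate_map()`` and ``xlate_lookup()`` are
hypothetical names, and the table size and entry encoding are assumptions
made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUEST_PASIDS 16                 /* toy table size */
#define XLATE_VALID  (1u << 31)         /* assumed per-entry valid flag */
#define PASID_MASK   0xfffffu           /* 20-bit host PASID */

struct pasid_xlate {
    uint32_t entry[GUEST_PASIDS];       /* indexed by guest PASID */
};

/* VMM side: install a guest-to-host mapping. Returns false if invalid. */
static bool xlate_map(struct pasid_xlate *t, uint32_t guest, uint32_t host)
{
    if (guest >= GUEST_PASIDS || host > PASID_MASK)
        return false;
    t->entry[guest] = host | XLATE_VALID;
    return true;
}

/* Submission side: translate a guest PASID, failing if it is unmapped. */
static bool xlate_lookup(const struct pasid_xlate *t, uint32_t guest,
                         uint32_t *host)
{
    if (guest >= GUEST_PASIDS || !(t->entry[guest] & XLATE_VALID))
        return false;
    *host = t->entry[guest] & PASID_MASK;
    return true;
}
```

An unmapped guest PASID fails the lookup in this model, so a guest cannot
tag work with a host PASID the VMM did not assign to it.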

* Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU
supporting such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support: the device asks the IOMMU to look up an address before
use via Address Translation Service (ATS) requests. If the mapping exists
but there is no page allocated by the OS, the IOMMU hardware reports that
no mapping exists.

The device then requests that the virtual address be mapped via the Page
Request Interface (PRI). Once the OS has successfully completed the mapping,
it returns the response to the device. The device then requests the
translation again and continues.

The IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings
from the OS.

References
==========

VT-d:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf