- .. SPDX-License-Identifier: GPL-2.0
- ===============================
- Software Guard eXtensions (SGX)
- ===============================
- Overview
- ========
- Software Guard eXtensions (SGX) hardware enables user space applications
- to set aside private memory regions of code and data:
- * Privileged (ring-0) ENCLS functions orchestrate the construction of the
- regions.
- * Unprivileged (ring-3) ENCLU functions allow an application to enter and
- execute inside the regions.
- These memory regions are called enclaves. An enclave can only be entered at a
- fixed set of entry points. Each entry point can hold a single hardware thread
- at a time. While the enclave is loaded from a regular binary file by using
- ENCLS functions, only the threads inside the enclave can access its memory. The
- CPU denies access to the region from outside the enclave, and encrypts its
- contents before they leave the LLC.
- Support can be checked with:
- ``grep sgx /proc/cpuinfo``
- SGX must both be supported in the processor and enabled by the BIOS. If SGX
- appears to be unsupported on a system which has hardware support, ensure
- support is enabled in the BIOS. If a BIOS presents a choice between "Enabled"
- and "Software Enabled" modes for SGX, choose "Enabled".
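The same check can be done programmatically. The following is a minimal sketch, not kernel code: the helper name ``cpuinfo_has_flag`` is hypothetical, and it assumes the caller has already read the ``flags`` line of ``/proc/cpuinfo`` into a string. Unlike a plain ``grep sgx``, it matches whole tokens only, so ``sgx_lc`` alone does not count as ``sgx``.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `flag` appears in `line` as a whole
 * token delimited by spaces, tabs, a newline, or the string boundaries.
 */
bool cpuinfo_has_flag(const char *line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = line;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		char end = p[len];
		bool ends = end == '\0' || end == ' ' || end == '\t' || end == '\n';

		if (starts && ends)
			return true;
		p += 1;	/* keep searching after a partial match */
	}
	return false;
}
```

A caller would read ``/proc/cpuinfo`` line by line and pass each line starting with ``flags`` to this function.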
- Enclave Page Cache
- ==================
- SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
- with an enclave. It is contained in a BIOS-reserved region of physical memory.
- Unlike pages used for regular memory, EPC pages can only be accessed from
- outside of the enclave during enclave construction with special, limited SGX
- instructions.
- Only a CPU executing inside an enclave can directly access enclave memory.
- However, a CPU executing inside an enclave may access normal memory outside the
- enclave.
- The kernel manages enclave memory similar to how it treats device memory.
- Enclave Page Types
- ------------------
- **SGX Enclave Control Structure (SECS)**
- The enclave's address range, attributes, and other global data are defined
- by this structure.
- **Regular (REG)**
- Regular EPC pages contain the code and data of an enclave.
- **Thread Control Structure (TCS)**
- Thread Control Structure pages define the entry points to an enclave and
- track the execution state of an enclave thread.
- **Version Array (VA)**
- Version Array pages contain 512 slots, each of which can contain a version
- number for a page evicted from the EPC.
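As a worked example of the slot arithmetic, the sketch below computes how many VA pages are needed to track a given number of evicted pages. The function and macro names are illustrative only; the figure of 512 slots per VA page comes from the text above (with a 4 KiB page, this works out to 8 bytes per slot).

```c
#include <stdint.h>

/* Slots per Version Array page, as stated in the text above. */
#define VA_SLOTS_PER_PAGE 512u

/*
 * Hypothetical helper: number of VA pages required to hold one version
 * slot for each of `evicted` pages (rounds up to a whole page).
 */
uint64_t va_pages_needed(uint64_t evicted)
{
	return evicted / VA_SLOTS_PER_PAGE +
	       (evicted % VA_SLOTS_PER_PAGE != 0);
}
```

For instance, tracking 513 evicted pages requires two VA pages, since the first page fills after 512 evictions.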
- Enclave Page Cache Map
- ----------------------
- The processor tracks EPC pages in a hardware metadata structure called the
- *Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page
- which describes the owning enclave, access rights, and page type, among other
- things.
- EPCM permissions are separate from the normal page tables. This prevents the
- kernel from, for instance, allowing writes to data which an enclave wishes to
- remain read-only. EPCM permissions may only impose additional restrictions on
- top of normal x86 page permissions.
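One way to picture the relationship between the two permission sets is as an intersection: an access succeeds only if both the page tables and the EPCM allow it. The following sketch models that rule; the ``PERM_*`` bit values and the function name are illustrative assumptions, not hardware or kernel definitions.

```c
#include <stdint.h>

/* Illustrative permission bits (assumed for this sketch only). */
#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/*
 * Model: the EPCM can only remove rights that the page tables grant,
 * so the effective rights are the bitwise AND of the two sets.
 */
uint8_t effective_perms(uint8_t pte_perms, uint8_t epcm_perms)
{
	return pte_perms & epcm_perms;
}
```

In this model, a kernel that maps a page read-write while the EPCM entry says read-only still ends up with a read-only page, which is the property the paragraph above describes.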
- For all intents and purposes, the SGX architecture allows the processor to
- invalidate all EPCM entries at will. This requires that software be prepared to
- handle an EPCM fault at any time. In practice, this can happen on events like
- power transitions when the ephemeral key that encrypts enclave memory is lost.
- Application interface
- =====================
- Enclave build functions
- -----------------------
- In addition to the traditional compiler and linker build process, SGX has a
- separate enclave “build” process. Enclaves must be built before they can be
- executed (entered). The first step in building an enclave is opening the
- **/dev/sgx_enclave** device. Since enclave memory is protected from direct
- access, special privileged instructions are then used to copy data into enclave
- pages and establish enclave page permissions.
- .. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
- :functions: sgx_ioc_enclave_create
- sgx_ioc_enclave_add_pages
- sgx_ioc_enclave_init
- sgx_ioc_enclave_provision
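The first step of the build process, opening the device, can be sketched as follows. This is a minimal illustration rather than a complete loader: the function name is hypothetical, the path is passed in so the caller can supply ``/dev/sgx_enclave``, and opening with ``O_RDWR`` is an assumption of this sketch. The subsequent create, add-pages, and init ioctls are not shown.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: open the enclave device node at `path`.
 * Returns an open file descriptor on success, or a negative errno
 * value (for example -ENOENT when the node does not exist).
 */
int open_enclave_device(const char *path)
{
	int fd = open(path, O_RDWR);

	return fd >= 0 ? fd : -errno;
}
```

A result of ``-ENOENT`` for ``/dev/sgx_enclave`` typically suggests that the driver is not available, for instance because SGX is unsupported or disabled in the BIOS.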
- Enclave runtime management
- --------------------------
- Systems supporting SGX2 additionally support changes to initialized
- enclaves: modifying enclave page permissions and type, and dynamically
- adding and removing enclave pages. When an enclave accesses an address
- within its address range that does not have a backing page, a new
- regular page is dynamically added to the enclave. The enclave is
- still required to run EACCEPT on the new page before it can be used.
- .. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
- :functions: sgx_ioc_enclave_restrict_permissions
- sgx_ioc_enclave_modify_types
- sgx_ioc_enclave_remove_pages
- Enclave vDSO
- ------------
- Entering an enclave can only be done through SGX-specific EENTER and ERESUME
- functions, and is a non-trivial process. Because of the complexity of
- transitioning to and from an enclave, enclaves typically utilize a library to
- handle the actual transitions. This is roughly analogous to how glibc
- implementations are used by most applications to wrap system calls.
- Another crucial characteristic of enclaves is that they can generate exceptions
- as part of their normal operation that need to be handled in the enclave or are
- unique to SGX.
- Instead of the traditional signal mechanism to handle these exceptions, SGX
- can leverage special exception fixup provided by the vDSO. The kernel-provided
- vDSO function wraps low-level transitions to/from the enclave like EENTER and
- ERESUME. The vDSO function intercepts exceptions that would otherwise generate
- a signal and returns the fault information directly to its caller. This avoids
- the need to juggle signal handlers.
- .. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
- :functions: vdso_sgx_enter_enclave_t
- ksgxd
- =====
- SGX support includes a kernel thread called *ksgxd*.
- EPC sanitization
- ----------------
- ksgxd is started when SGX initializes. Enclave memory is typically ready
- for use when the processor powers on or resets. However, if SGX has been in
- use since the reset, enclave pages may be in an inconsistent state. This might
- occur after a crash and kexec() cycle, for instance. At boot, ksgxd
- reinitializes all enclave pages so that they can be allocated and re-used.
- The sanitization is done by going through EPC address space and applying the
- EREMOVE function to each physical page. Some enclave pages like SECS pages have
- hardware dependencies on other pages which prevents EREMOVE from functioning.
- Executing two EREMOVE passes removes the dependencies.
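To see why two passes suffice, consider the toy model below. It is a simulation only: the types and function names are hypothetical, and ``model_eremove`` merely mimics the one rule described above, namely that removal of a SECS page fails while it still has child pages.

```c
#include <stdbool.h>
#include <stddef.h>

enum page_kind { PAGE_FREE, PAGE_CHILD, PAGE_SECS };

struct model_page {
	enum page_kind kind;
	size_t secs;		/* index of the owning SECS (PAGE_CHILD only) */
	size_t children;	/* live child pages (PAGE_SECS only) */
};

/* Model of EREMOVE: refuses a SECS page that still has children. */
static bool model_eremove(struct model_page *pages, size_t i)
{
	struct model_page *pg = &pages[i];

	if (pg->kind == PAGE_SECS && pg->children > 0)
		return false;
	if (pg->kind == PAGE_CHILD)
		pages[pg->secs].children--;
	pg->kind = PAGE_FREE;
	return true;
}

/* One pass in address order; returns how many pages could not be removed. */
size_t sanitize_pass(struct model_page *pages, size_t n)
{
	size_t failed = 0;

	for (size_t i = 0; i < n; i++)
		if (pages[i].kind != PAGE_FREE && !model_eremove(pages, i))
			failed++;
	return failed;
}
```

If a SECS page sits at a lower address than its children, the first pass skips it but removes the children, so the second pass finds it childless and removes it as well.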
- Page reclaimer
- --------------
- Similar to the core kswapd, ksgxd is responsible for managing the
- overcommitment of enclave memory. If the system runs out of enclave memory,
- *ksgxd* “swaps” enclave memory to normal memory.
- Launch Control
- ==============
- SGX provides a launch control mechanism. After all enclave pages have been
- copied, the kernel executes the EINIT function, which initializes the enclave.
- Only after this can the CPU execute inside the enclave.
- The EINIT function takes an RSA-3072 signature of the enclave measurement. The
- function checks that the measurement is correct and that the signature was
- made with the key whose SHA256 hash is stored in the four
- **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs.
- Those MSRs can be configured by the BIOS to be either read-only or writable.
- Linux supports only the writable configuration in order to give the kernel
- full control over launch control policy. Before calling the EINIT function,
- the driver sets the MSRs to match the enclave's signing key.
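Since a SHA256 digest is 32 bytes and each MSR holds 64 bits, the hash spans exactly four MSRs. The sketch below shows one way to split a digest into four 64-bit words. The function name is hypothetical, and the byte order (least significant byte first within each word, as on little-endian x86) is an assumption of this sketch; writing the MSRs themselves is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: split a 32-byte digest into four 64-bit words,
 * one per hash MSR. Assumes little-endian packing within each word.
 */
void lepubkeyhash_words(const uint8_t digest[32], uint64_t words[4])
{
	for (size_t i = 0; i < 4; i++) {
		uint64_t w = 0;

		for (size_t b = 0; b < 8; b++)
			w |= (uint64_t)digest[i * 8 + b] << (8 * b);
		words[i] = w;
	}
}
```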
- Encryption engines
- ==================
- In order to conceal the enclave data while it is out of the CPU package, the
- memory controller has an encryption engine to transparently encrypt and decrypt
- enclave memory.
- In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
- encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with root in
- SRAM to maintain integrity of the encrypted data. This provides integrity and
- anti-replay protection but does not scale to large memory sizes because the time
- required to update the Merkle tree grows logarithmically in relation to the
- memory size.
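The logarithmic growth can be illustrated by counting tree levels: each update of a protected block has to touch one node per level between the leaf and the root. The helper below is a generic sketch; its name is hypothetical, and the arity passed to it is an illustrative parameter, not a statement about the actual MEE design.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of levels an `arity`-ary tree needs so
 * that a single root covers `leaves` leaf blocks. Returns 0 when one
 * node suffices or when `arity` is less than 2.
 */
unsigned merkle_levels(uint64_t leaves, unsigned arity)
{
	unsigned levels = 0;
	uint64_t covered = 1;

	if (arity < 2)
		return 0;

	while (covered < leaves) {
		levels++;
		/* Stop before the multiplication would overflow. */
		if (covered > UINT64_MAX / arity)
			break;
		covered *= arity;
	}
	return levels;
}
```

With an arity of 8, going from 64 to 512 leaves (eight times the protected memory) adds only one level, from 2 to 3.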
- CPUs starting from Ice Lake use Total Memory Encryption (TME) in place of
- MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
- means integrity and replay attacks are not mitigated. However, TME includes
- additional changes to prevent cipher text from being returned and SW memory
- aliases from being created.
- DMA to enclave memory is blocked by range registers on both MEE and TME systems
- (SDM section 41.10).
- Usage Models
- ============
- Shared Library
- --------------
- Sensitive data and the code that acts on it are partitioned from the application
- into a separate library. The library is then linked as a DSO which can be loaded
- into an enclave. The application can then make individual function calls into
- the enclave through special SGX instructions. A run-time within the enclave is
- configured to marshal function parameters into and out of the enclave and to
- call the correct library function.
- Application Container
- ---------------------
- An application may be loaded into a container enclave which is specially
- configured with a library OS and run-time which permits the application to run.
- The enclave run-time and library OS work together to execute the application
- when a thread enters the enclave.
- Impact of Potential Kernel SGX Bugs
- ===================================
- EPC leaks
- ---------
- When EPC page leaks happen, a WARNING like this is shown in dmesg:
- "EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..."
- This is effectively a kernel use-after-free of an EPC page, and due
- to the way SGX works, the bug is detected at freeing. Rather than
- adding the page back to the pool of available EPC pages, the kernel
- intentionally leaks the page to avoid additional errors in the future.
- When this happens, the kernel will likely soon leak more EPC pages, and
- SGX will likely become unusable because the memory available to SGX is
- limited. However, while this may be fatal to SGX, the rest of the kernel
- is unlikely to be impacted and should continue to work.
- As a result, when this happens, users should stop running any new
- SGX workloads (or any new workloads at all) and migrate all valuable
- workloads. Although a machine reboot can recover all EPC memory, the bug
- should be reported to Linux developers.
- Virtual EPC
- ===========
- The implementation also has a virtual EPC driver to support SGX enclaves
- in guests. Unlike the SGX driver, an EPC page allocated by the virtual
- EPC driver doesn't have a specific enclave associated with it. This is
- because KVM doesn't track how a guest uses EPC pages.
- As a result, the SGX core page reclaimer doesn't support reclaiming EPC
- pages allocated to KVM guests through the virtual EPC driver. If the
- user wants to deploy SGX applications both on the host and in guests
- on the same machine, the user should reserve enough EPC (by taking out
- total virtual EPC size of all SGX VMs from the physical EPC size) for
- host SGX applications so they can run with acceptable performance.
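The reservation arithmetic described above can be written down as a small helper. This is a planning sketch under stated assumptions: the function name is hypothetical, sizes are in bytes, and how the physical EPC size and per-VM virtual EPC sizes are obtained is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: EPC left for host enclaves after subtracting the
 * virtual EPC of every SGX VM. Returns 0 if the guests alone would
 * consume or exceed the physical EPC.
 */
uint64_t host_epc_remaining(uint64_t physical_epc,
			    const uint64_t *vepc_sizes, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		if (vepc_sizes[i] > physical_epc - total)
			return 0;	/* guests are over-subscribed */
		total += vepc_sizes[i];
	}
	return physical_epc - total;
}
```

For example, with 128 MiB of physical EPC and two guests given 32 MiB and 16 MiB, 80 MiB remains for host enclaves.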
- Architectural behavior is to restore all EPC pages to an uninitialized
- state also after a guest reboot. Because this state can be reached only
- through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
- provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
- on all pages in the virtual EPC.
- ``EREMOVE`` can fail for three reasons. Userspace must pay attention
- to expected failures and handle them as follows:
- 1. Page removal will always fail when any thread is running in the
- enclave to which the page belongs. In this case the ioctl will
- return ``EBUSY`` independent of whether it has successfully removed
- some pages; userspace can avoid these failures by preventing execution
- of any vcpu which maps the virtual EPC.
- 2. Page removal will cause a general protection fault if two calls to
- ``EREMOVE`` happen concurrently for pages that refer to the same
- "SECS" metadata pages. This can happen if there are concurrent
- invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
- file descriptor in the guest is closed at the same time as
- ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
- This can be avoided in userspace by serializing calls to the ioctl()
- and to close(), but in general it should not be a problem.
- 3. Finally, page removal will fail for SECS metadata pages which still
- have child pages. Child pages can be removed by executing
- ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
- mapped into the guest. This means that the ioctl() must be called
- twice: an initial set of calls to remove child pages and a subsequent
- set of calls to remove SECS pages. The second set of calls is only
- required for those mappings that returned a nonzero value from the
- first call. It indicates a bug in the kernel or the userspace client
- if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
- a return code other than 0.
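The two-round procedure from the third case can be sketched as follows. To keep the example self-contained, the ioctl is abstracted behind a callback; ``remove_all_fn``, ``vepc_remove_all_pages`` and ``mock_remove_all`` are hypothetical names, and the convention that a negative callback result means an error such as ``EBUSY`` is an assumption of this sketch. In real code the callback would wrap ``SGX_IOC_VEPC_REMOVE_ALL`` on a ``/dev/sgx_vepc`` file descriptor.

```c
#include <stddef.h>

/* Stand-in for calling SGX_IOC_VEPC_REMOVE_ALL on one descriptor. */
typedef int (*remove_all_fn)(int fd);

/*
 * Round 1 removes child pages on every descriptor; round 2 is repeated
 * only for descriptors that reported a nonzero result in round 1.
 * `pending` is caller-provided scratch space with `n` entries.
 * Returns 0 on success, -1 on a round-1 error, -2 if round 2 still
 * reports leftover pages (which would indicate a bug).
 */
int vepc_remove_all_pages(const int *fds, int *pending, size_t n,
			  remove_all_fn remove_all)
{
	for (size_t i = 0; i < n; i++) {
		pending[i] = remove_all(fds[i]);
		if (pending[i] < 0)
			return -1;
	}
	for (size_t i = 0; i < n; i++) {
		if (pending[i] != 0 && remove_all(fds[i]) != 0)
			return -2;
	}
	return 0;
}

/*
 * Toy callback for exercising the logic: odd-numbered descriptors
 * (0..3) report one leftover SECS page on their first call only.
 */
int mock_calls[4];

int mock_remove_all(int fd)
{
	return mock_calls[fd]++ == 0 ? fd % 2 : 0;
}
```

Running all first-round calls before any second-round call matters because a SECS page and its children may live behind different descriptors.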