
======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.

Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user
selectable. For some architectures like arm that wish to have these be
selectable, the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX
to enable a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT``
determines the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

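For illustration only, an architecture opting in might add something
like the following to its Kconfig (``MYARCH`` and the ``if`` condition
here are hypothetical placeholders, not taken from any real
architecture):

```
config MYARCH
	# Hypothetical sketch: expose the strict-RWX options as prompts
	# rather than forcing them on, defaulting them to enabled.
	select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
	select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
```
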
Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at ``__init`` time, these can
be marked with the ``__ro_after_init`` attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.

Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under), this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.

Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.

Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.