
.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <[email protected]>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems could show up at a later time, making
them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing difficult-to-diagnose
corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after overflow and the response to the overflow itself
are architecture dependent.

.. note::
   As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
   support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support Virtually Mapped Kernel Stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables, or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack
  (see the sketch after this list).
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.
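
As an illustration of the second requirement, here is a hedged sketch; it
is not actual arch code and the helper name is made up. The idea is to
touch every page of the next task's stack while still running on a fully
mapped stack, so that any on-demand vmap page-table entries are populated
before the switch.

::

  #include <linux/sched/task_stack.h>
  #include <linux/compiler.h>

  /*
   * Hypothetical helper, for illustration only: fault in any on-demand
   * vmap page-table entries covering the next task's stack before we
   * start running on that stack.
   */
  static inline void sketch_prefault_vmap_stack(struct task_struct *next)
  {
          unsigned long addr = (unsigned long)task_stack_page(next);
          unsigned long end = addr + THREAD_SIZE;

          for (; addr < end; addr += PAGE_SIZE)
                  READ_ONCE(*(volatile unsigned char *)addr);
  }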

VMAP_STACK
----------

The VMAP_STACK bool configuration option, when enabled, allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually-mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::
   Using this feature with KASAN requires architecture support
   for backing virtual mappings with real shadow memory, and
   KASAN_VMALLOC must be enabled.

.. note::
   When VMAP_STACK is enabled, it is not possible to run DMA on stack
   allocated data.
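
The DMA restriction follows from how the streaming DMA API translates
addresses: helpers such as dma_map_single() expect memory reachable through
the kernel's linear mapping, which is not the case for a vmalloc'd stack.
A hedged sketch of the safe pattern, with a hypothetical driver function
and error handling trimmed to the essentials, is:

::

  #include <linux/dma-mapping.h>
  #include <linux/slab.h>
  #include <linux/errno.h>

  /*
   * Hypothetical example: with VMAP_STACK, a buffer declared on the
   * stack lives in vmalloc space and must not be handed to the DMA API.
   * Use heap memory (or a DMA pool) instead.
   */
  static int sketch_do_dma(struct device *dev, size_t len)
  {
          void *buf = kmalloc(len, GFP_KERNEL);  /* DMA-able memory */
          dma_addr_t dma;

          if (!buf)
                  return -ENOMEM;

          dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
          if (dma_mapping_error(dev, dma)) {
                  kfree(buf);
                  return -EIO;
          }

          /* ... program the device with 'dma' and wait for completion ... */

          dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
          kfree(buf);
          return 0;
  }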

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

Allocation
----------

When a new kernel thread is created, the thread stack is allocated from
virtually contiguous memory pages from the page level allocator. These
pages are mapped into contiguous kernel virtual space with PAGE_KERNEL
protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections (a simplified sketch follows the list
below).

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually on assigning/releasing stacks to tasks.
  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
- The vm_struct is cached so that the stack can be located when thread
  freeing is initiated in interrupt context. free_thread_stack() can be
  called in interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- This does not address interrupt stacks, according to the original patch.
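
For orientation, here is a hedged, simplified sketch of that allocation,
modelled on alloc_thread_stack_node() in kernel/fork.c; the stack caching
and error handling are trimmed, and the helper name is made up.

::

  #include <linux/vmalloc.h>
  #include <linux/thread_info.h>

  /*
   * Simplified sketch of a vmalloc-backed stack allocation (the real
   * code, with stack caching and manual memcg accounting, lives in
   * kernel/fork.c).  __GFP_ACCOUNT is masked off because cached stacks
   * may later be handed to a different task.
   */
  static void *sketch_alloc_vmap_stack(int node)
  {
          return __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
                                      VMALLOC_START, VMALLOC_END,
                                      THREADINFO_GFP & ~__GFP_ACCOUNT,
                                      PAGE_KERNEL, 0, node,
                                      __builtin_return_address(0));
  }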

Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread(), via kernel_clone(). These are a few hints for searching
the code base to understand when and how a thread stack is allocated.

The bulk of the code is in:
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-NULL stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

  struct vm_struct *stack_vm_area;
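
As a small, hedged illustration (the helper name below is made up), code
that needs to know whether a task's stack is vmalloc-backed can simply
test that pointer, keeping in mind that the field only exists when
CONFIG_VMAP_STACK is set:

::

  #include <linux/sched.h>

  /*
   * Hypothetical helper, for illustration: with CONFIG_VMAP_STACK the
   * pointer is set when the stack is allocated from vmalloc space;
   * without it, the field does not exist at all.
   */
  static inline bool sketch_stack_is_vmapped(const struct task_struct *tsk)
  {
  #ifdef CONFIG_VMAP_STACK
          return tsk->stack_vm_area != NULL;
  #else
          return false;
  #endif
  }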

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into the guard pages, handlers have to be careful not to
overflow the stack again. When handlers are called, it is likely that
very little stack space is left.

On x86, this is done by handling the page fault indicating the kernel
stack overflow on the double-fault stack.
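
As a hedged sketch of the idea (this is not the actual x86 code, and the
helper name is made up), a fault handler can recognise an overflow into a
guard page by checking whether the faulting address falls in the page just
below or just above the task's stack:

::

  #include <linux/sched/task_stack.h>

  /*
   * Hypothetical check, for illustration: the guard pages sit immediately
   * below and above the THREAD_SIZE stack area, so a fault landing there
   * is very likely a stack overflow or underflow.  Real handlers, such as
   * x86's double-fault path, run on a separate exception stack because
   * almost no task stack space is left at this point.
   */
  static bool sketch_hit_stack_guard_page(struct task_struct *tsk,
                                          unsigned long fault_addr)
  {
          unsigned long stack = (unsigned long)task_stack_page(tsk);

          return (fault_addr >= stack - PAGE_SIZE && fault_addr < stack) ||
                 (fault_addr >= stack + THREAD_SIZE &&
                  fault_addr < stack + THREAD_SIZE + PAGE_SIZE);
  }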

Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

  void lkdtm_STACK_GUARD_PAGE_LEADING()
  void lkdtm_STACK_GUARD_PAGE_TRAILING()
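
Assuming CONFIG_LKDTM is enabled, these tests are normally triggered at
runtime through lkdtm's debugfs interface, for example by writing
STACK_GUARD_PAGE_LEADING to /sys/kernel/debug/provoke-crash/DIRECT; with
guard pages in place, the access should fault immediately rather than
silently corrupting adjacent memory.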

Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU), and then, if vmapped stacks are in use, the entire
  stack can be cached for reuse on the same cpu.