===================================
Light-weight System Calls for IA-64
===================================

                        Started: 13-Jan-2003

                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                      <[email protected]>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

- kernel mode:
    Both the register stack and the memory stack have been
    switched over to kernel memory.  The user-level state is saved
    in a pt_regs structure at the top of the kernel memory stack.

- user mode:
    Both the register stack and the memory stack are in
    user memory.  The user-level state is contained in the
    CPU registers.

- bank 0 interruption-handling mode:
    This is the non-interruptible state in which all
    interruption-handlers start execution.  The user-level
    state remains in the CPU registers and some kernel state may
    be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

- execution is at privilege level 0 (most-privileged)

- CPU registers may contain a mixture of user-level and kernel-level
  state (it is the responsibility of the kernel to ensure that no
  security-sensitive kernel-level state is leaked back to
  user-level)

- execution is interruptible and preemptible (an fsys-mode handler
  can disable interrupts and avoid all other interruption-sources
  to avoid preemption)

- neither the memory-stack nor the register-stack can be trusted while
  in fsys-mode (they point to the user-level stacks, which may
  be invalid or hold completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).

How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

    user_mode(regs)
    user_stack(task,regs)
    fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

    !user_mode(regs) && user_stack(task,regs)

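The relationship between the three predicates can be illustrated with a
small, self-contained C model.  This is only a sketch: struct model_regs
and the model_*() functions are hypothetical names invented for this
example, the two fields do not reflect the real pt_regs layout, and the
"task" argument of the real macros is dropped for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-in for the saved CPU state.  The real
 * pt_regs structure looks different; these fields exist only to make
 * the boolean relationship visible.
 */
struct model_regs {
    unsigned int cpl;       /* privilege level at interruption time, 0..3 */
    bool on_user_stacks;    /* stacks not yet switched to kernel memory */
};

/* Models user_mode(): the state was executing at privilege level 3. */
bool model_user_mode(const struct model_regs *regs)
{
    return regs->cpl == 3;
}

/* Models user_stack(): the state was executing on the user-level stacks. */
bool model_user_stack(const struct model_regs *regs)
{
    return regs->on_user_stacks;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
bool model_fsys_mode(const struct model_regs *regs)
{
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, only the combination "privilege level 0 and still on the
user stacks" is reported as fsys-mode; plain user mode and full kernel
mode both yield false.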
How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit state of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

========= ===============================================================
r10       0
r11       saved ar.pfs (a user-level value)
r15       system call number
r16       "current" task pointer (in normal kernel-mode, this is in r13)
r32-r39   system call arguments
b6        return address (a user-level value)
ar.pfs    previous frame-state (a user-level value)
PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
-         all other registers may contain values passed in from user-mode
========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

========= ===========================================================
r11       saved ar.pfs (as passed into the fsyscall handler)
r15       system call number (as passed into the fsyscall handler)
r32-r39   system call arguments (as passed into the fsyscall handler)
b6        return address (as passed into the fsyscall handler)
ar.pfs    previous frame-state (as passed into the fsyscall handler)
========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

* Fsyscall-handlers MUST check for any pending work in the flags
  member of the thread-info structure; if any of the
  TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
  doing a full system call (by calling fsys_fallback_syscall).

* Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
  r15, b6, and ar.pfs) because they will be needed in case of a
  system call restart.  Of course, all "preserved" registers also
  must be preserved, in accordance with the normal calling conventions.

* Fsyscall-handlers MUST check whether argument registers contain a
  NaT value before using them in any way that could trigger a
  NaT-consumption fault.  If a system call argument is found to
  contain a NaT value, an fsyscall-handler may return immediately
  with r8=EINVAL, r10=-1.

* Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
  any other operation that would trigger mandatory RSE
  (register-stack engine) traffic.

* Fsyscall-handlers MUST NOT write to any stacked registers because
  it is not safe to assume that user-level called a handler with the
  proper number of arguments.

* Fsyscall-handlers need to be careful when accessing per-CPU variables:
  unless proper safe-guards are taken (e.g., interruptions are avoided),
  execution may be pre-empted and resumed on another CPU at any given
  time.

* Fsyscall-handlers must be careful not to leak sensitive kernel
  information back to user-level.  In particular, before returning to
  user-level, care needs to be taken to clear any scratch registers
  that could contain sensitive information (note that the current
  task pointer is not considered sensitive: it's already exposed
  through ar.k6).

* Fsyscall-handlers MUST NOT access user-memory without first
  validating access-permission (this can be done typically via
  probe.r.fault and/or probe.w.fault) and without guarding against
  memory access exceptions (this can be done with the EX() macros
  defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

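The first and third restrictions in the list above (the pending-work
check and the NaT check) amount to a small decision made before any
real work is done.  The following C model sketches that decision.  It
is not how a handler is written in practice (real handlers are IA-64
assembly in fsys.S); MODEL_TIF_ALLWORK_MASK, enum model_path and
model_fsys_entry() are hypothetical names with made-up values, and the
order of the two checks is an arbitrary choice of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mask value; the real TIF_ALLWORK_MASK bits are not shown. */
#define MODEL_TIF_ALLWORK_MASK 0x0fu

enum model_path {
    MODEL_FAST,       /* the handler may complete in fsys-mode */
    MODEL_FALLBACK,   /* pending work: do a full system call instead */
    MODEL_EINVAL      /* NaT argument: return at once with an error */
};

struct model_result {
    enum model_path path;
    long r8;          /* error code, only meaningful for MODEL_EINVAL */
    long r10;         /* -1 signals failure to user-level */
};

/* Decide how a handler proceeds, given the thread flags and whether
 * an argument register it is about to use holds a NaT value. */
struct model_result model_fsys_entry(unsigned int thread_flags, bool arg_is_nat)
{
    struct model_result res = { MODEL_FAST, 0, 0 };

    if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
        res.path = MODEL_FALLBACK;
    } else if (arg_is_nat) {
        res.path = MODEL_EINVAL;
        res.r8 = EINVAL;
        res.r10 = -1;
    }
    return res;
}
```

A caller of the model would branch on the returned path: MODEL_FALLBACK
corresponds to calling fsys_fallback_syscall, and MODEL_EINVAL to the
immediate return with r8=EINVAL, r10=-1.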
Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

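The deferral described above can be pictured as a two-step state
machine.  The C model below is a rough sketch under simplifying
assumptions: struct model_task and both functions are hypothetical
names, signal delivery is reduced to a counter, and the trap handler
and the kernel exit path are folded into one function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these are kernel identifiers. */
struct model_task {
    bool in_fsys_mode;      /* privilege level 0, still on user stacks */
    bool psr_lp;            /* lower-privilege transfer trap armed */
    bool signal_pending;
    int signals_delivered;
};

/* Models the check described for do_notify_resume_user(): while the
 * interrupted context is in fsys-mode, only arm the trap and return. */
void model_notify_resume(struct model_task *t)
{
    if (t->in_fsys_mode) {
        t->psr_lp = true;
        return;
    }
    if (t->signal_pending) {
        t->signal_pending = false;
        t->signals_delivered++;
    }
}

/* Models the "br.ret" that lowers the privilege level and, if PSR.lp
 * was set, the resulting trap followed by the kernel exit path. */
void model_fsys_return(struct model_task *t)
{
    t->in_fsys_mode = false;
    if (t->psr_lp) {
        t->psr_lp = false;          /* trap handler clears PSR.lp */
        model_notify_resume(t);     /* exit path delivers pending signals */
    }
}
```

Calling model_notify_resume() on a task that is in fsys-mode leaves the
signal pending; the later model_fsys_return() is what delivers it.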
PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

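The "lazy redirect" used for PSR.tb and PSR.ss can be sketched as a
small C model of what the trap handler does to the saved machine state.
This is an illustration only: struct model_state, model_lazy_redirect()
and the MODEL_SYSCALL_VIA_BREAK address are hypothetical, and anything
the real trap handlers do beyond the two state changes named in the
table is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up address standing in for syscall_via_break() in the gate page. */
#define MODEL_SYSCALL_VIA_BREAK 0x1000UL

/* Hypothetical, reduced view of the saved machine state. */
struct model_state {
    unsigned long ip;       /* where execution resumes after the trap */
    unsigned int cpl;       /* saved privilege level, 0..3 */
    bool on_user_stacks;    /* stacks not yet switched to kernel memory */
};

/*
 * If the trap was taken in fsys-mode, rewrite the saved state so that
 * execution resumes at syscall_via_break() with privilege level 3.
 * Returns true if the redirect was applied, false otherwise.
 */
bool model_lazy_redirect(struct model_state *s)
{
    bool in_fsys_mode = (s->cpl != 3) && s->on_user_stacks;

    if (!in_fsys_mode)
        return false;

    s->ip = MODEL_SYSCALL_VIA_BREAK;
    s->cpl = 3;
    return true;
}
```

The effect is that the system call is simply re-issued through the
break-based path, which is why the table notes that a syscall restart
must still be safe at the point where the trap can occur.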
Using fast system calls
=======================

To use fast system calls, userspace applications need only call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate, so movl)
                                  // found by inspection of System.map for the
                                  // __kernel_syscall_via_epc() function.  See
                                  // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087                  // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;           // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

* AT_SYSINFO : is the address of __kernel_syscall_via_epc()
* AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.

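As a sketch of how a user-level program might read these two
auxiliary-vector entries, the following C fragment uses glibc's
getauxval() from <sys/auxv.h>.  The helper names gate_epc_entry(),
gate_dso_base() and looks_like_elf() are made up for this example.  It
assumes a Linux system with glibc 2.16 or later; getauxval() returns 0
for an entry the kernel did not supply, which is to be expected for
AT_SYSINFO on architectures that do not provide it.

```c
#include <assert.h>
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Address advertised via AT_SYSINFO, or 0 if the entry is absent. */
unsigned long gate_epc_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the kernel-provided ELF DSO, or 0 if absent. */
unsigned long gate_dso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the memory at p starts with the ELF magic bytes. */
int looks_like_elf(const void *p)
{
    return memcmp(p, ELFMAG, SELFMAG) == 0;
}
```

A program would normally leave the lookup to the dynamic loader; calling
these helpers directly is mainly useful for inspection, for example to
confirm that the address returned by gate_dso_base() starts with a valid
ELF header before parsing it further.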