===================
Classic BPF vs eBPF
===================

eBPF is designed to be JITed with a one-to-one mapping, which also opens up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

Some core changes of the eBPF format from classic BPF:

- Number of registers increases from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass arguments to functions via registers,
  the number of arguments from an eBPF program to an in-kernel function is
  restricted to 5, and one register is used to accept the return value from
  an in-kernel function. Natively, x86_64 passes the first 6 arguments in
  registers, and aarch64/sparcv9/mips64 have 7 - 8 registers for arguments;
  x86_64 has 6 callee saved registers, and aarch64/sparcv9/mips64 have 11 or
  more callee saved registers.

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and the eBPF calling convention maps directly to the ABIs used by the
  kernel on 64-bit architectures.

  On 32-bit architectures, the JIT may map programs that use only 32-bit
  arithmetic and may let more complex programs be interpreted.

  R0 - R5 are scratch registers and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions; it can only
  call predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to the x86_64 and arm64 subregister definition,
  but makes other JITs more difficult.

  32-bit architectures run 64-bit eBPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit because on 64-bit architectures pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit eBPF registers would otherwise require defining a register-pair
  ABI. A direct eBPF register to HW register mapping could then not be used,
  and the JIT would need to do combine/split/move operations for every
  register in and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are replaced with alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the eBPF program needs to place the
  function arguments into the R1 to R5 registers to satisfy the calling
  convention; the interpreter will then take them from the registers and pass
  them to the in-kernel function. If the R1 - R5 registers are mapped to CPU
  registers that are used for argument passing on the given architecture, the
  JIT compiler doesn't need to emit extra moves. Function arguments will be in
  the correct registers and the BPF_CALL instruction will be JITed as a single
  'call' HW instruction. This calling convention was picked to cover common
  call situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  the return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

For example, consider three C functions::

  u64 f1() { return (*_f2)(1); }
  u64 f2(u64 a) { return f3(a + 1, a); }
  u64 f3(u64 a, u64 b) { return a - b; }

GCC can compile f1, f3 into x86_64::

  f1:
    movl $1, %edi
    movq _f2(%rip), %rax
    jmp  *%rax
  f3:
    movq %rdi, %rax
    subq %rsi, %rax
    ret

Function f2 in eBPF may look like::

  f2:
    bpf_mov R2, R1
    bpf_add R1, 1
    bpf_call f3
    bpf_exit

If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3
and the returns will be seamless. Without the JIT, the __bpf_prog_run()
interpreter needs to be used to call into f2.

For practical reasons all eBPF programs have only one argument 'ctx', which is
already placed into R1 (e.g. on __bpf_prog_run() startup), and the programs
can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
are currently not supported, but these restrictions can be lifted if necessary
in the future.

On 64-bit architectures all registers map to HW registers one to one. For
example, the x86_64 JIT compiler can map them as ...

::

  R0 - rax
  R1 - rdi
  R2 - rsi
  R3 - rdx
  R4 - rcx
  R5 - r8
  R6 - rbx
  R7 - r13
  R8 - r14
  R9 - r15
  R10 - rbp

... since the x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument
passing and rbx, r12 - r15 are callee saved.

Then the following eBPF pseudo-program::

  bpf_mov R6, R1 /* save ctx */
  bpf_mov R2, 2
  bpf_mov R3, 3
  bpf_mov R4, 4
  bpf_mov R5, 5
  bpf_call foo
  bpf_mov R7, R0 /* save foo() return value */
  bpf_mov R1, R6 /* restore ctx for next call */
  bpf_mov R2, 6
  bpf_mov R3, 7
  bpf_mov R4, 8
  bpf_mov R5, 9
  bpf_call bar
  bpf_add R0, R7
  bpf_exit

After JIT to x86_64, it may look like::

  push %rbp
  mov %rsp,%rbp
  sub $0x228,%rsp
  mov %rbx,-0x228(%rbp)
  mov %r13,-0x220(%rbp)
  mov %rdi,%rbx
  mov $0x2,%esi
  mov $0x3,%edx
  mov $0x4,%ecx
  mov $0x5,%r8d
  callq foo
  mov %rax,%r13
  mov %rbx,%rdi
  mov $0x6,%esi
  mov $0x7,%edx
  mov $0x8,%ecx
  mov $0x9,%r8d
  callq bar
  add %r13,%rax
  mov -0x228(%rbp),%rbx
  mov -0x220(%rbp),%r13
  leaveq
  retq

In this example, this is equivalent to the following C code::

  u64 bpf_filter(u64 ctx)
  {
      return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
  }

In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in the proper
registers and place their return value into ``%rax``, which is R0 in eBPF.
Prologue and epilogue are emitted by the JIT and are implicit in the
interpreter. R0-R5 are scratch registers, so an eBPF program needs to preserve
them across calls as defined by the calling convention.

For example, the following program is invalid::

  bpf_mov R1, 1
  bpf_call foo
  bpf_mov R0, R1
  bpf_exit

After the call the registers R1-R5 contain junk values and cannot be read.
An in-kernel verifier (see verifier.rst) is used to validate eBPF programs.
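
The rule that makes the program above invalid can be illustrated with a toy
model that only tracks whether each register may be read. This is a minimal
sketch, not the kernel verifier: ``struct reg_state`` and all function names
below are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, NUM_REGS };

/* Toy register state: only tracks whether a register may be read. */
struct reg_state {
    bool readable[NUM_REGS];
};

/* On program entry only R1 (ctx) and R10 (frame pointer) hold values. */
static void reg_state_init(struct reg_state *s)
{
    for (int i = 0; i < NUM_REGS; i++)
        s->readable[i] = false;
    s->readable[R1] = true;
    s->readable[R10] = true;
}

/* Writing a register (e.g. bpf_mov dst, imm) makes it readable. */
static void reg_write(struct reg_state *s, int reg)
{
    s->readable[reg] = true;
}

/* A call clobbers R1 - R5 and leaves the return value in R0. */
static void reg_call(struct reg_state *s)
{
    for (int i = R1; i <= R5; i++)
        s->readable[i] = false;
    s->readable[R0] = true;
}

/* Returns true when this toy model would accept a read of 'reg'. */
static bool reg_can_read(const struct reg_state *s, int reg)
{
    return s->readable[reg];
}

/* Replays the invalid program; returns true if it would be accepted. */
static bool invalid_program_accepted(void)
{
    struct reg_state s;

    reg_state_init(&s);
    reg_write(&s, R1);              /* bpf_mov R1, 1  */
    reg_call(&s);                   /* bpf_call foo   */
    if (!reg_can_read(&s, R1))      /* bpf_mov R0, R1 */
        return false;
    reg_write(&s, R0);
    return true;                    /* bpf_exit       */
}
```

In this model ``invalid_program_accepted()`` returns false because R1 is read
after the call, while a value saved in R6 before the call would still be
readable afterwards.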

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and eBPF use two-operand instructions,
which helps to do a one-to-one mapping between eBPF insns and x86 insns
during JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by a specific use case. For seccomp, register R1 points
to seccomp_data; for converted BPF filters, R1 points to a skb.

A program that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
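
The two layouts can be written down as C structures to check that both
encodings occupy 8 bytes. This is a sketch under stated assumptions: the
struct names ``classic_insn`` and ``ebpf_insn`` and the helper ``ebpf_make()``
are made up for this example (they are modelled on ``struct sock_filter`` and
``struct bpf_insn`` from the kernel UAPI headers), and 8-bit bit-fields rely on
compiler behaviour as provided by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Classic BPF instruction: op:16, jt:8, jf:8, k:32 */
struct classic_insn {
    uint16_t code;  /* opcode */
    uint8_t  jt;    /* jump offset if the condition is true */
    uint8_t  jf;    /* jump offset if the condition is false */
    uint32_t k;     /* generic 32-bit field */
};

/* eBPF instruction: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ebpf_insn {
    uint8_t code;           /* opcode */
    uint8_t dst_reg : 4;    /* destination register */
    uint8_t src_reg : 4;    /* source register */
    int16_t off;            /* signed offset */
    int32_t imm;            /* signed immediate */
};

/* Both formats are 64 bits wide. */
_Static_assert(sizeof(struct classic_insn) == 8, "classic insn is 8 bytes");
_Static_assert(sizeof(struct ebpf_insn) == 8, "eBPF insn is 8 bytes");

/* Hypothetical helper that fills in one eBPF instruction. */
static struct ebpf_insn ebpf_make(uint8_t code, uint8_t dst, uint8_t src,
                                  int16_t off, int32_t imm)
{
    struct ebpf_insn insn = {
        .code = code,
        .dst_reg = dst & 0xf,
        .src_reg = src & 0xf,
        .off = off,
        .imm = imm,
    };
    return insn;
}
```

For instance, ``ebpf_make(0xb7, 6, 0, 0, 2)`` would describe a 64-bit move of
the immediate 2 into R6, using the opcode values listed in the next section.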

So far 87 eBPF instructions were implemented. The 8-bit 'op' opcode field
has room for new instructions. Some of them may use a 16/24/32 byte encoding.
New instructions must be a multiple of 8 bytes to preserve backward
compatibility.

eBPF is a general purpose RISC instruction set. Not every register and
every instruction is used during translation from original BPF to eBPF.
For example, socket filters do not use the ``exclusive add`` instruction, but
tracing filters may use it to maintain counters of events. Register R9
is not used by socket filters either, but more complex filters may run
out of registers and would have to resort to spill/fill to stack.

eBPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp use it as an assembler. Tracing
filters may use it as an assembler to generate code from the kernel. In-kernel
usage may not be bounded by security considerations, since generated eBPF code
may be optimizing an internal code path and not be exposed to user space.
Safety of eBPF can come from the verifier (see verifier.rst). In such use
cases as described, it may be used as a safe instruction set.

Just like the original BPF, eBPF runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: the first step does a depth-first search to
disallow loops and perform other CFG validation; the second step starts from
the first insn and descends all possible paths. It simulates execution of
every insn and observes the state change of registers and stack.
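
The first step can be sketched as a depth-first search that rejects any back
edge in the control flow graph. This is a simplified illustration, not the
kernel's implementation: ``struct toy_insn``, the ``INSN_*`` kinds and
``cfg_is_loop_free()`` are hypothetical names, the search is recursive for
brevity, and only the jump-target rule (target = pc + 1 + off) is taken from
eBPF itself.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INSNS 16

/* Hypothetical, simplified instruction kinds for this sketch only. */
enum insn_kind { INSN_PLAIN, INSN_JA, INSN_COND_JMP, INSN_EXIT };

struct toy_insn {
    enum insn_kind kind;
    int off;    /* jump target is pc + 1 + off */
};

enum color { WHITE, GREY, BLACK };  /* unvisited, on DFS stack, finished */

/* Returns false if a back edge or an out-of-range target is reachable. */
static bool dfs(const struct toy_insn *prog, int len, int pc,
                enum color *color)
{
    if (pc < 0 || pc >= len)
        return false;       /* jump or fall-through leaves the program */
    if (color[pc] == GREY)
        return false;       /* back edge: the program contains a loop */
    if (color[pc] == BLACK)
        return true;        /* already fully explored */

    color[pc] = GREY;
    bool ok = true;
    switch (prog[pc].kind) {
    case INSN_EXIT:
        break;
    case INSN_JA:
        ok = dfs(prog, len, pc + 1 + prog[pc].off, color);
        break;
    case INSN_COND_JMP:
        /* jt/fall-through: explore the jump target and the next insn */
        ok = dfs(prog, len, pc + 1 + prog[pc].off, color) &&
             dfs(prog, len, pc + 1, color);
        break;
    case INSN_PLAIN:
        ok = dfs(prog, len, pc + 1, color);
        break;
    }
    color[pc] = BLACK;
    return ok;
}

/* Entry point: true when no loop is reachable from the first insn. */
static bool cfg_is_loop_free(const struct toy_insn *prog, int len)
{
    enum color color[MAX_INSNS] = { WHITE };

    if (len <= 0 || len > MAX_INSNS)
        return false;
    return dfs(prog, len, 0, color);
}
```

A forward conditional jump followed by an exit passes this check, while an
unconditional jump with a negative offset back to an instruction that is still
on the search stack is rejected.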

opcode encoding
===============

eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF.

For arithmetic and jump instructions the 8-bit 'code' field is divided into three
parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

The 4th bit encodes the source operand ...

::

  BPF_K     0x00
  BPF_X     0x08

* in classic BPF, this means::

    BPF_SRC(code) == BPF_X - use register X as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

* in eBPF, this means::

    BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
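
As a small worked example, the opcode byte for these two instructions can be
composed and taken apart again with bit masks. This is a sketch: the masks are
derived from the 4/1/3 bit layout described above, the constants are copied
from the tables in this section, and ``make_alu_code()`` is a hypothetical
helper introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors following the 4/1/3 bit layout of the 'code' byte. */
#define BPF_CLASS(code) ((code) & 0x07)  /* three LSB bits */
#define BPF_SRC(code)   ((code) & 0x08)  /* 4th bit */
#define BPF_OP(code)    ((code) & 0xf0)  /* four MSB bits */

/* Values copied from the tables in this section. */
#define BPF_ALU   0x04
#define BPF_ALU64 0x07
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_XOR   0xa0

/* Hypothetical helper: OR the three fields into one opcode byte. */
static uint8_t make_alu_code(uint8_t op, uint8_t src, uint8_t cls)
{
    return (uint8_t)(op | src | cls);
}
```

With these definitions, ``make_alu_code(BPF_ADD, BPF_X, BPF_ALU)`` yields 0x0c
and ``make_alu_code(BPF_XOR, BPF_K, BPF_ALU)`` yields 0xa4; applying
``BPF_OP()``, ``BPF_SRC()`` and ``BPF_CLASS()`` to either byte recovers the
original fields.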

Classic BPF uses the BPF_MISC class to represent A = X and X = A moves.
eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.::

  dst_reg = dst_reg + src_reg
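
The difference between the two ALU classes can be expressed in plain C. The
sketch below assumes registers are modelled as ``uint64_t`` values; the
function names ``alu32_add()`` and ``alu64_add()`` are made up for this
example. The 32-bit variant zero-extends its result into the full register, as
described for 32-bit subregisters earlier in this document.

```c
#include <assert.h>
#include <stdint.h>

/* BPF_ADD | BPF_X | BPF_ALU: 32-bit add, result zero-extended to 64 bits. */
static uint64_t alu32_add(uint64_t dst_reg, uint64_t src_reg)
{
    return (uint32_t)((uint32_t)dst_reg + (uint32_t)src_reg);
}

/* BPF_ADD | BPF_X | BPF_ALU64: full 64-bit add. */
static uint64_t alu64_add(uint64_t dst_reg, uint64_t src_reg)
{
    return dst_reg + src_reg;
}
```

Adding 1 to 0xffffffff wraps to 0 with ``alu32_add()`` but gives 0x100000000
with ``alu64_add()``, and any upper 32 bits of the inputs are discarded by the
32-bit variant.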

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into the return register
and perform function exit. eBPF is modeled to match the CPU, so
BPF_JMP | BPF_EXIT in eBPF means function exit only. The eBPF program needs to
store the return value into register R0 before doing a BPF_EXIT. Class 6 in
eBPF is used as BPF_JMP32 to mean exactly the same operations as BPF_JMP, but
with 32-bit wide operands for the comparisons instead.

For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

  B  - 1 byte
  H  - 2 byte
  W  - 4 byte
  DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS     0x20
  BPF_IND     0x40
  BPF_MEM     0x60
  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
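
The size and mode fields of a load/store opcode can be extracted in the same
way as the arithmetic fields. This is a sketch: the masks follow from the
3/2/3 bit layout shown above, the size constants are copied from the list in
this section, and ``size_in_bytes()`` is a hypothetical helper used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors following the 3/2/3 bit layout of the 'code' byte. */
#define BPF_SIZE(code) ((code) & 0x18)  /* two middle bits */
#define BPF_MODE(code) ((code) & 0xe0)  /* three MSB bits */

/* Size modifiers copied from the list above. */
#define BPF_W  0x00
#define BPF_H  0x08
#define BPF_B  0x10
#define BPF_DW 0x18

/* Hypothetical helper: width in bytes of the access encoded in 'code'. */
static int size_in_bytes(uint8_t code)
{
    switch (BPF_SIZE(code)) {
    case BPF_B:
        return 1;
    case BPF_H:
        return 2;
    case BPF_W:
        return 4;
    case BPF_DW:
        return 8;
    }
    return -1;  /* not reachable: the 2-bit field has only four values */
}
```

For example, the byte 0x79 combines BPF_LDX (0x01), BPF_MEM (0x60) and BPF_DW
(0x18), so ``size_in_bytes(0x79)`` gives 8 and ``BPF_MODE(0x79)`` gives 0x60.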