****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library that
establishes an optimal number of connections between client and server
machines over an RDMA transport (InfiniBand, RoCE, iWARP). It is optimized
for transferring (reading/writing) IO blocks.

In its core interface it follows BIO semantics: it can either write data
from an sg list to the remote side or request ("read") a data transfer from
the remote side into a given sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see the "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, the rtrs client uses the
memory chunks reserved for it on the server side. Their number, size and
addresses need to be exchanged between client and server during the
connection establishment phase. Apart from the memory-related information,
the client needs to inform the server about the session name and identify
each path and connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

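The server-side use of the immediate field can be sketched in user-space C as
follows. This is only an illustration: the 4/28 split between type and
payload, the 19/9 split between request id and errno, and all names
(imm_pack, io_rsp_imm, ...) are assumptions made for the example and are not
taken from this document.

```c
#include <stdint.h>

/* Assumed layout: top 4 bits carry a message type, low 28 bits a payload. */
#define IMM_PAYL_BITS 28
#define IMM_PAYL_MASK ((1u << IMM_PAYL_BITS) - 1)

/* Assumed response payload: low 19 bits request id, next 9 bits |errno|. */
#define RSP_ID_BITS  19
#define RSP_ID_MASK  ((1u << RSP_ID_BITS) - 1)
#define RSP_ERR_MASK 0x1ffu

/* Hypothetical type codes, for illustration only. */
enum imm_type { IMM_IO_REQ = 0, IMM_IO_RSP = 1, IMM_HB_MSG = 2, IMM_HB_ACK = 3 };

/* Combine a type and a payload into one 32-bit immediate value. */
static uint32_t imm_pack(uint32_t type, uint32_t payload)
{
	return ((type & 0xfu) << IMM_PAYL_BITS) | (payload & IMM_PAYL_MASK);
}

/* Split an immediate value back into type and payload. */
static void imm_unpack(uint32_t imm, uint32_t *type, uint32_t *payload)
{
	*type = imm >> IMM_PAYL_BITS;
	*payload = imm & IMM_PAYL_MASK;
}

/* Server side: acknowledge request req_id with a (negative or zero) errno. */
static uint32_t io_rsp_imm(uint32_t req_id, int err)
{
	uint32_t e = (uint32_t)(err < 0 ? -err : err) & RSP_ERR_MASK;

	return imm_pack(IMM_IO_RSP, (e << RSP_ID_BITS) | (req_id & RSP_ID_MASK));
}

/* Client side: recover the request id and the negative errno. */
static void io_rsp_decode(uint32_t payload, uint32_t *req_id, int *err)
{
	*req_id = payload & RSP_ID_MASK;
	*err = -(int)((payload >> RSP_ID_BITS) & RSP_ERR_MASK);
}
```

Packing both values into the immediate lets the acknowledgement travel as an
otherwise empty rdma message, which matches the "(id + errno)" notation used
in the IO path diagrams of this document.
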
The module parameter always_invalidate was introduced to address the security
problem discussed at the LPC RDMA MC 2019. When always_invalidate=Y, the
server invalidates each rdma buffer before handing it over to the RNBD
server, which then passes it to the block layer. A new rkey is generated and
registered for the buffer after it returns from the block layer and the RNBD
server. The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO cause a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This
might be a reasonable option in a scenario where all the clients and all the
servers are located within a secure datacenter.

Connection establishment
------------------------
1. Client starts establishing the connections belonging to a path of a
session one by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect
requests. Those include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve situations where the
client is trying to reconnect a path while the server is still destroying the
old one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one io. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field), which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

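Steps 6 and 7 can be pictured with the following user-space sketch. The
threshold HB_MISSED_MAX, the two path states and both helper functions are
hypothetical and exist only for this example; the document does not specify
how many missed heartbeats count as a timeout or how the healthy path is
chosen.

```c
#include <stdbool.h>
#include <stddef.h>

#define HB_MISSED_MAX 5 /* hypothetical: consecutive misses treated as timeout */

enum path_state { PATH_CONNECTED, PATH_RECONNECTING };

struct path {
	enum path_state state;
	unsigned int hb_missed; /* consecutive heartbeat periods without an answer */
};

/*
 * Called once per heartbeat period. Returns true when the path has just
 * timed out, i.e. the caller should fail over inflight IO and reconnect.
 */
static bool hb_tick(struct path *p, bool peer_seen)
{
	if (p->state != PATH_CONNECTED)
		return false;
	if (peer_seen) {
		p->hb_missed = 0;
		return false;
	}
	if (++p->hb_missed < HB_MISSED_MAX)
		return false;
	p->state = PATH_RECONNECTING;
	return true;
}

/* Return the index of the first connected path, or -1 if there is none. */
static int pick_healthy_path(const struct path *paths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (paths[i].state == PATH_CONNECTED)
			return (int)i;
	return -1;
}
```

A caller would resubmit the inflight IO of the timed-out path on the index
returned by pick_healthy_path() and fail it only when -1 is returned.
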
CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]

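The fields carried by the connection request and the role of the reconnect
counter can be sketched as below. The struct layout, the CON_MAGIC and
PROTO_VER values and the policy in check_con_req() are assumptions for
illustration; the document only lists which pieces of information the message
carries. Rejecting a request whose counter differs from the one stored for a
still-existing path is one plausible way to handle a client that reconnects
while the server is tearing the old path down.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CON_MAGIC 0x1BBD /* hypothetical value */
#define PROTO_VER 2      /* hypothetical value */

/* Hypothetical in-memory view of the connection request. */
struct con_req {
	uint16_t magic;
	uint16_t version;
	uint16_t cid;       /* id of the connection being established */
	uint16_t cid_num;   /* total number of connections (client cpus) */
	uint16_t recon_cnt; /* incremented by the client on every reconnect */
	uint8_t sess_uuid[16];
	uint8_t path_uuid[16];
};

/* Server-side state of a path already known under the same path uuid. */
struct srv_path {
	uint16_t recon_cnt;
};

/*
 * Validate a request. 'existing' is the path found by uuid, or NULL if the
 * server has to create a new one. Returns 0 or a negative errno.
 */
static int check_con_req(const struct con_req *req, const struct srv_path *existing)
{
	if (req->magic != CON_MAGIC || req->version != PROTO_VER)
		return -EINVAL;
	if (req->cid >= req->cid_num)
		return -EINVAL;
	/* Old incarnation of the path is still around: ask the client to retry. */
	if (existing && existing->recon_cnt != req->recon_cnt)
		return -ECONNRESET;
	return 0;
}
```
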
IO path
-------
* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma writes there the user data, the user
header and the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the
message only contains the size of the user header. The client tells the
server which chunk has been accessed and at what offset the
RTRS_MSG_RDMA_WRITE can be found by using the IMM field.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

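How the client could tell the server "which chunk, at what offset" in one
value is sketched below. It assumes that all chunks have the same
power-of-two size (2^chunk_bits bytes) and that user data, user header and
the message are placed back to back in the chunk in that order; neither
assumption, nor any of the names, comes from this document.

```c
#include <stdint.h>

/*
 * Client side: build the payload for a write request.
 * chunk_bits is assumed to be in the range 1..27 so that the shifts are
 * well defined and the result can fit a sub-32-bit payload.
 */
static uint32_t io_req_payload(uint32_t chunk_id, uint32_t data_len,
			       uint32_t usr_hdr_len, unsigned int chunk_bits)
{
	/* The message follows the user data and the user header. */
	uint32_t msg_off = data_len + usr_hdr_len;

	return (chunk_id << chunk_bits) | (msg_off & ((1u << chunk_bits) - 1));
}

/* Server side: recover the chunk index and the offset of the message. */
static void io_req_decode(uint32_t payload, unsigned int chunk_bits,
			  uint32_t *chunk_id, uint32_t *msg_off)
{
	*chunk_id = payload >> chunk_bits;
	*msg_off = payload & ((1u << chunk_bits) - 1);
}
```

With 128 KiB chunks (chunk_bits = 17), a 4096 byte write with a 64 byte user
header into chunk 3 places the message at offset 4160 of that chunk.
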
* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory
chunks on the server side and rdma writes there the user data, the user
header and the RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the
message only contains the size of the user header. The client tells the
server which chunk has been accessed and at what offset the
RTRS_MSG_RDMA_WRITE can be found by using the IMM field. The server first
invalidates the rkey associated with the memory chunk and, once that
finishes, passes the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" rdma message
with an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer and finishes the IO, then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

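The client-side handling of the new rkey message might look like the sketch
below. The message layout, the MSG_RKEY_RSP type code and the remote_buf
table are hypothetical; the only behaviour taken from the text is "validate
the message, then update the rkey of the affected buffer".

```c
#include <errno.h>
#include <stdint.h>

#define MSG_RKEY_RSP 5 /* hypothetical type code */

/* Hypothetical layout of the rkey response message. */
struct rkey_rsp {
	uint16_t type;
	uint32_t buf_id; /* index of the server chunk that was re-registered */
	uint32_t rkey;   /* new remote key for that chunk */
};

/* Client-side record of one remote chunk. */
struct remote_buf {
	uint64_t addr;
	uint32_t rkey;
};

/*
 * Validate the message and store the new rkey. queue_depth is the number of
 * chunks announced by the server. Returns 0 or a negative errno.
 */
static int handle_rkey_rsp(struct remote_buf *bufs, uint32_t queue_depth,
			   const struct rkey_rsp *msg)
{
	if (msg->type != MSG_RKEY_RSP)
		return -EINVAL;
	if (msg->buf_id >= queue_depth)
		return -EINVAL;
	bufs[msg->buf_id].rkey = msg->rkey;
	return 0;
}
```

Checking buf_id against the queue depth before indexing keeps a malformed
message from writing outside the client's buffer table.
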
* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory
chunks on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size
of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses along with keys for the data to be read into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

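The content of the read message described in step 1 can be modelled roughly
as follows. Field widths, the MAX_SG cap, the type code and the flag value
are assumptions made for this sketch, and byte order on the wire is ignored.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MSG_READ         4   /* hypothetical type code */
#define MSG_NEED_INVAL_F 0x1 /* hypothetical flag: client asks for invalidation */
#define MAX_SG           4   /* hypothetical cap on descriptors per message */

/* One client buffer the server should place data into. */
struct sg_desc {
	uint64_t addr;
	uint32_t key;
	uint32_t len;
};

/* Hypothetical layout of the read message. */
struct msg_rdma_read {
	uint16_t type;
	uint16_t usr_len; /* size of the user header */
	uint16_t flags;
	uint16_t sg_cnt;  /* number of valid entries in desc[] */
	struct sg_desc desc[MAX_SG];
};

/* Fill in a read message; returns 0 on success, -1 on a bad sg count. */
static int build_read_msg(struct msg_rdma_read *msg, uint16_t usr_len,
			  bool need_inval, const struct sg_desc *sg,
			  uint16_t sg_cnt)
{
	if (sg_cnt == 0 || sg_cnt > MAX_SG)
		return -1;
	memset(msg, 0, sizeof(*msg));
	msg->type = MSG_READ;
	msg->usr_len = usr_len;
	msg->flags = need_inval ? MSG_NEED_INVAL_F : 0;
	msg->sg_cnt = sg_cnt;
	memcpy(msg->desc, sg, sg_cnt * sizeof(*sg));
	return 0;
}
```

The server would walk desc[0..sg_cnt-1] to know where, and with which key, to
write the requested data on the client.
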
* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory
chunks on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), the size
of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
once that finishes, passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" rdma message with an immediate field. The 32 bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, updates the rkey for the rbuffer and finishes the
IO, then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <[email protected]>
Fabian Holler <[email protected]>
Guoqing Jiang <[email protected]>
Jack Wang <[email protected]>
Kleber Souza <[email protected]>
Lutz Pogrell <[email protected]>
Milind Dumbare <[email protected]>
Roman Penyaev <[email protected]>