- ****************************
- RDMA Transport (RTRS)
- ****************************
- RTRS (RDMA Transport) is a reliable high-speed transport library
- that establishes an optimal number of connections
- between client and server machines using an RDMA (InfiniBand, RoCE, iWARP)
- transport. It is optimized to transfer (read/write) IO blocks.
- In its core interface it follows the BIO semantics: it can
- either write data from an sg list to the remote side
- or request ("read") a data transfer from the remote side into a given
- sg list.
- RTRS provides I/O fail-over and load-balancing capabilities by using
- multipath I/O (see "add_path" and "mp_policy" configuration entries in
- Documentation/ABI/testing/sysfs-class-rtrs-client).
- RTRS is used by the RNBD (RDMA Network Block Device) modules.
- ==================
- Transport protocol
- ==================
- Overview
- --------
- An established connection between a client and a server is called an RTRS
- session. A session is associated with a set of memory chunks reserved on the
- server side for a given client for RDMA transfers. A session
- consists of multiple paths, each representing a separate physical link
- between client and server. The paths are used for load balancing and failover.
- Each path consists of as many connections (QPs) as there are CPUs on
- the client.
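The session/path/connection hierarchy described above can be sketched as follows. This is only an illustrative model: the class names, the round-robin policy and the CPU-to-connection mapping are assumptions made for the example, not the driver's actual data structures or its mp_policy implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    name: str
    num_cons: int            # one connection (QP) per client CPU
    connected: bool = True


@dataclass
class Session:
    paths: list = field(default_factory=list)
    next_idx: int = 0        # round-robin cursor (hypothetical policy state)

    def pick_path(self):
        """Return the next connected path in round-robin order, or None."""
        n = len(self.paths)
        for i in range(n):
            idx = (self.next_idx + i) % n
            if self.paths[idx].connected:
                self.next_idx = (idx + 1) % n
                return self.paths[idx]
        return None


def pick_connection(path, cpu):
    """Map the submitting CPU to one of the path's connections (assumed mapping)."""
    return cpu % path.num_cons
```

Skipping disconnected paths in pick_path() is what gives failover in this model: as long as one path stays connected, IO keeps flowing over it.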
- When processing an incoming write or read request, the RTRS client uses the memory
- chunks reserved for it on the server side. Their number, size and addresses
- need to be exchanged between client and server during the connection
- establishment phase. Apart from the memory-related information, the client needs to
- inform the server about the session name and identify each path and connection
- individually.
- On an established session, the client sends write or read messages to the server.
- The server uses the immediate field to tell the client which request is being
- acknowledged and to report the errno. The client uses the immediate field to tell
- the server which of the memory chunks has been accessed and at which offset the message
- can be found.
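As a rough illustration of how a single 32-bit immediate value can carry both a message type and a payload such as (id + errno), consider the sketch below. The bit widths (4 type bits, 28 payload bits, 9 errno bits) and the numeric type values are assumptions chosen for the example. The text above only states that the immediate field carries this information.

```python
IMM_TYPE_BITS = 4        # assumed width of the type tag
IMM_PAYLOAD_BITS = 28    # assumed width of the payload
ERRNO_BITS = 9           # assumed width reserved for |errno|

IO_REQ_IMM = 0           # hypothetical numeric values
IO_RSP_IMM = 1


def pack_imm(imm_type, payload):
    """Combine a type tag and a payload into one 32-bit immediate."""
    if not 0 <= imm_type < (1 << IMM_TYPE_BITS):
        raise ValueError("type out of range")
    if not 0 <= payload < (1 << IMM_PAYLOAD_BITS):
        raise ValueError("payload out of range")
    return (imm_type << IMM_PAYLOAD_BITS) | payload


def unpack_imm(imm):
    """Split a 32-bit immediate into (type, payload)."""
    return imm >> IMM_PAYLOAD_BITS, imm & ((1 << IMM_PAYLOAD_BITS) - 1)


def pack_io_rsp(msg_id, err):
    """Server side: acknowledge request msg_id with a (negative or zero) errno."""
    payload = (msg_id << ERRNO_BITS) | (abs(err) & ((1 << ERRNO_BITS) - 1))
    return pack_imm(IO_RSP_IMM, payload)


def unpack_io_rsp(payload):
    """Client side: recover (msg_id, errno) from the response payload."""
    return payload >> ERRNO_BITS, -(payload & ((1 << ERRNO_BITS) - 1))
```

For example, pack_io_rsp(42, -5) produces an immediate that the client decodes back into request id 42 and errno -5.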
- The module parameter always_invalidate was introduced to address the security problem
- discussed at LPC RDMA MC 2019. When always_invalidate=Y, the server
- invalidates each RDMA buffer before handing it over to the RNBD server, which
- then passes it to the block layer. A new rkey is generated and registered for the
- buffer after it returns from the block layer and the RNBD server.
- The new rkey is sent back to the client along with the IO result.
- This procedure is the default behaviour of the driver. The invalidation and
- registration on each IO cause a performance drop of up to 20%. A user of the
- driver may choose to load the modules with this mechanism switched off
- (always_invalidate=N) if they understand and accept the risk of a malicious
- client being able to corrupt the memory of a server it is connected to. This might
- be a reasonable option in a scenario where all the clients and all the servers
- are located within a secure datacenter.
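The effect of invalidating and re-registering a buffer on each IO can be modelled in a few lines. This is a toy simulation only: ServerChunk and its methods are hypothetical names, rkeys are drawn from a simple counter, and the access check that real RDMA hardware performs is reduced to a comparison.

```python
from itertools import count

_rkeys = count(1)    # stand-in for key generation, purely illustrative


class ServerChunk:
    def __init__(self, size):
        self.data = bytearray(size)
        self.rkey = next(_rkeys)     # None while the buffer is invalidated

    def remote_write(self, rkey, offset, payload):
        """Model of a remote RDMA write, checked against the current rkey."""
        if self.rkey is None or rkey != self.rkey:
            raise PermissionError("invalid or stale rkey")
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside of chunk")
        self.data[offset:offset + len(payload)] = payload

    def invalidate(self):
        """Done before the buffer is handed to the upper layer."""
        self.rkey = None

    def reregister(self):
        """Done after the upper layer returns the buffer; yields a new rkey."""
        self.rkey = next(_rkeys)
        return self.rkey
```

In this model, a client that keeps writing with the old rkey while the buffer is in use by the block layer is rejected, which is the protection always_invalidate=Y is meant to provide.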
- Connection establishment
- ------------------------
- 1. The client establishes the connections belonging to a path of a session one
- by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
- Each message includes the uuid of the session and the uuid of the path to be
- established. The server uses them to find an existing session/path or
- to create a new one when necessary. The message also contains the protocol
- version and magic for compatibility, the total number of connections per session
- (as many as there are CPUs on the client), the id of the current connection and
- the reconnect counter, which is used to resolve situations where the
- client is trying to reconnect a path while the server is still destroying the old
- one.
- 2. The server accepts the connection requests one by one and attaches
- RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
- protocol version, the messages include an error code, the queue depth supported by
- the server (the number of memory chunks which are going to be allocated for that
- session) and the maximum size of one IO. The RTRS_MSG_NEW_RKEY_F flag is set
- when always_invalidate=Y.
- 3. After all connections of a path are established, the client sends the
- RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
- This message requests the address information from the server.
- 4. The server replies to the session info request message with RTRS_MSG_INFO_RSP,
- which contains the addresses and keys of the RDMA buffers allocated for that
- session.
- 5. The session becomes connected after all paths to be established are connected
- (i.e. steps 1-4 have finished for all paths requested for a session).
- 6. Server and client periodically exchange heartbeat messages (empty RDMA
- messages with an immediate field), which are used to detect a crash on the remote
- side or a network outage in the absence of IO.
- 7. On any RDMA-related error or in the case of a heartbeat timeout, the
- corresponding path is disconnected, all inflight IOs are failed over to a
- healthy path, if any, and the reconnect mechanism is triggered.
- CLT SRV
- *for each connection belonging to a path and for each path:
- RTRS_MSG_CON_REQ ------------------->
- <------------------- RTRS_MSG_CON_RSP
- ...
- *after all connections are established:
- RTRS_MSG_INFO_REQ ------------------->
- <------------------- RTRS_MSG_INFO_RSP
- *heartbeat is started from both sides:
- -------------------> [RTRS_HB_MSG_IMM]
- [RTRS_HB_MSG_ACK] <-------------------
- [RTRS_HB_MSG_IMM] <-------------------
- -------------------> [RTRS_HB_MSG_ACK]
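The server side of steps 1 and 2 can be sketched as a find-or-create lookup keyed by the session and path uuids. Everything here is an assumption made for illustration: the field names, the protocol version value, the default queue depth and IO size, and in particular the policy of rejecting a request whose reconnect counter differs from the one stored for an existing path.

```python
import errno
from dataclasses import dataclass

PROTO_VERSION = 2            # hypothetical value


@dataclass(frozen=True)
class ConReq:
    sess_uuid: str
    path_uuid: str
    version: int
    cid: int                 # id of this connection
    cid_num: int             # total number of connections
    recon_cnt: int           # reconnect counter


@dataclass(frozen=True)
class ConRsp:
    err: int
    queue_depth: int = 0
    max_io_size: int = 0


class Server:
    def __init__(self, queue_depth=512, max_io_size=128 * 1024):
        self.queue_depth = queue_depth
        self.max_io_size = max_io_size
        self.sessions = {}   # sess_uuid -> {path_uuid -> path state}

    def handle_con_req(self, req):
        if req.version != PROTO_VERSION:
            return ConRsp(-errno.EPROTO)
        if not 0 <= req.cid < req.cid_num:
            return ConRsp(-errno.EINVAL)
        # Find an existing session/path by uuid, or create a new one.
        paths = self.sessions.setdefault(req.sess_uuid, {})
        path = paths.setdefault(
            req.path_uuid,
            {"recon_cnt": req.recon_cnt, "cid_num": req.cid_num, "cids": set()},
        )
        # Assumed policy: a differing counter means the old path still exists.
        if path["recon_cnt"] != req.recon_cnt:
            return ConRsp(-errno.EBUSY)
        path["cids"].add(req.cid)
        return ConRsp(0, self.queue_depth, self.max_io_size)

    def path_established(self, sess_uuid, path_uuid):
        path = self.sessions[sess_uuid][path_uuid]
        return len(path["cids"]) == path["cid_num"]
```

Once path_established() returns true for a path, the client in this model would proceed with the RTRS_MSG_INFO_REQ / RTRS_MSG_INFO_RSP exchange from steps 3 and 4.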
- IO path
- -------
- * Write (always_invalidate=N) *
- 1. When processing a write request, the client selects one of the memory chunks
- on the server side and RDMA-writes the user data, the user header and the
- RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message only
- contains the size of the user header. The client uses the IMM field to tell the
- server which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE
- message can be found.
- 2. When confirming a write request, the server sends an "empty" RDMA message with
- an immediate field. The 32-bit field is used to specify the outstanding
- inflight IO and the error code.
- CLT SRV
- usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
- [RTRS_IO_RSP_IMM] <----------------- (id + errno)
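The write request above can be sketched as laying out usr_data + usr_hdr + message in a chunk and encoding (chunk id, message offset) into the immediate payload. The wire layout of the message (two little-endian 16-bit fields), the chunk size of 128 KiB and the numeric type value are assumptions for this example only.

```python
import struct

CHUNK_BITS = 17                  # assumed chunk size: 1 << 17 = 128 KiB
MSG_FMT = "<HH"                  # assumed layout: type, size of user header
MSG_WRITE = 2                    # hypothetical numeric type value


def build_write(chunk_id, usr_data, usr_hdr):
    """Client side: return (bytes to RDMA-write into the chunk, IMM payload)."""
    msg = struct.pack(MSG_FMT, MSG_WRITE, len(usr_hdr))
    buf = usr_data + usr_hdr + msg
    if len(buf) > (1 << CHUNK_BITS):
        raise ValueError("request does not fit into one chunk")
    msg_off = len(usr_data) + len(usr_hdr)
    return buf, (chunk_id << CHUNK_BITS) | msg_off


def parse_write(chunk, payload):
    """Server side: locate the message via the IMM payload, then split the chunk."""
    chunk_id = payload >> CHUNK_BITS
    off = payload & ((1 << CHUNK_BITS) - 1)
    msg_type, usr_len = struct.unpack_from(MSG_FMT, chunk, off)
    if msg_type != MSG_WRITE:
        raise ValueError("unexpected message type")
    return chunk_id, chunk[:off - usr_len], chunk[off - usr_len:off]
```

Because the message only carries the size of the user header, the server in this sketch derives the boundary between user data and user header from the message offset.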
- * Write (always_invalidate=Y) *
- 1. When processing a write request, the client selects one of the memory chunks
- on the server side and RDMA-writes the user data, the user header and the
- RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message only
- contains the size of the user header. The client uses the IMM field to tell the
- server which chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE
- message can be found. The server first invalidates the rkey associated with the
- memory chunk; when the invalidation finishes, it passes the IO to the RNBD server module.
- 2. When confirming a write request, the server sends an "empty" RDMA message with
- an immediate field. The 32-bit field is used to specify the outstanding
- inflight IO and the error code. The new rkey is sent back using a
- SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
- the message, updates the rkey for the rbuffer and finishes the IO, then posts
- the recv buffer back for later use.
- CLT SRV
- usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
- [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP)
- [RTRS_IO_RSP_IMM] <----------------- (id + errno)
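The client-side handling of the new rkey message might look roughly like the sketch below. It assumes, for simplicity, that the new rkey and the (id + errno) information arrive in a single completion. The Client class, its fields and the callback-based completion are hypothetical.

```python
class Client:
    def __init__(self, rkeys):
        self.rkeys = list(rkeys)     # current rkey of each server-side rbuffer
        self.inflight = {}           # msg_id -> completion callback
        self.posted_recvs = 0        # recv buffers handed back for later use

    def submit(self, msg_id, done):
        self.inflight[msg_id] = done

    def on_rkey_rsp(self, buf_id, new_rkey, msg_id, err):
        # 1. Validate the message before trusting any of its fields.
        if not 0 <= buf_id < len(self.rkeys):
            raise ValueError("rkey response for unknown buffer")
        if msg_id not in self.inflight:
            raise ValueError("rkey response for unknown request")
        # 2. Update the rkey first, so the next IO uses the fresh key.
        self.rkeys[buf_id] = new_rkey
        # 3. Finish the IO.
        self.inflight.pop(msg_id)(err)
        # 4. Post the recv buffer back.
        self.posted_recvs += 1
```

Updating the rkey before completing the IO matters in this model: the completion callback may immediately submit a new request that reuses the same rbuffer.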
- * Read (always_invalidate=N) *
- 1. When processing a read request, the client selects one of the memory chunks
- on the server side and RDMA-writes the user header and the
- RTRS_MSG_RDMA_READ message there. This message contains the type (read), the size of
- the user header, flags (specifying whether memory invalidation is necessary) and the
- list of addresses, along with keys, for the data to be read into.
- 2. When confirming a read request, the server transfers the requested data first,
- attaches an invalidation message if requested and finally sends an "empty" RDMA
- message with an immediate field. The 32-bit field is used to specify the
- outstanding inflight IO and the error code.
- CLT SRV
- usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
- [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno)
- or in case client requested invalidation:
- [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno)
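A minimal model of the read path: the client describes where the data should land as a list of (address, key, length) descriptors, and the server scatters the data into them and picks the response type based on the invalidation flag. The descriptor layout, the flag value and the numeric response types are assumptions for this sketch.

```python
import errno
from dataclasses import dataclass

F_NEED_INVAL = 0x1           # hypothetical flag: client requests invalidation
IO_RSP_IMM = 1               # hypothetical numeric values
IO_RSP_IMM_W_INV = 2


@dataclass
class SgDesc:
    addr: int                # client-side address to write into
    key: int                 # rkey covering that address
    length: int


@dataclass
class ReadMsg:
    usr_len: int             # size of the user header
    flags: int
    descs: list              # list of SgDesc


def serve_read(msg, data):
    """Return (imm_type, errno, writes); writes is a list of (addr, key, bytes)."""
    imm_type = IO_RSP_IMM_W_INV if msg.flags & F_NEED_INVAL else IO_RSP_IMM
    if len(data) > sum(d.length for d in msg.descs):
        return imm_type, -errno.EMSGSIZE, []
    writes, pos = [], 0
    for d in msg.descs:
        part = data[pos:pos + d.length]
        if not part:
            break
        writes.append((d.addr, d.key, part))
        pos += len(part)
    return imm_type, 0, writes
```

Each entry in the returned list stands for one RDMA write from the server into client memory. The immediate with (id + errno) would follow the last of them.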
- * Read (always_invalidate=Y) *
- 1. When processing a read request, the client selects one of the memory chunks
- on the server side and RDMA-writes the user header and the
- RTRS_MSG_RDMA_READ message there. This message contains the type (read), the size of
- the user header, flags (specifying whether memory invalidation is necessary) and the
- list of addresses, along with keys, for the data to be read into.
- The server first invalidates the rkey associated with the memory chunk; when
- the invalidation finishes, it passes the IO to the RNBD server module.
- 2. When confirming a read request, the server transfers the requested data first,
- attaches an invalidation message if requested and finally sends an "empty" RDMA
- message with an immediate field. The 32-bit field is used to specify the
- outstanding inflight IO and the error code. The new rkey is sent back using a
- SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
- the message, updates the rkey for the rbuffer and finishes the IO, then posts
- the recv buffer back for later use.
- CLT SRV
- usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
- [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno)
- [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP)
- or in case client requested invalidation:
- [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno)
- =========================================
- Contributors List (in alphabetical order)
- =========================================
- Danil Kipnis <[email protected]>
- Fabian Holler <[email protected]>
- Guoqing Jiang <[email protected]>
- Jack Wang <[email protected]>
- Kleber Souza <[email protected]>
- Lutz Pogrell <[email protected]>
- Milind Dumbare <[email protected]>
- Roman Penyaev <[email protected]>