Merge tag 'docs-5.1' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "A fairly routine cycle for docs - lots of typo fixes, some new documents,
  and more translations. There's also some LICENSES adjustments from Thomas"

* tag 'docs-5.1' of git://git.lwn.net/linux: (74 commits)
  docs: Bring some order to filesystem documentation
  Documentation/locking/lockdep: Drop last two chars of sample states
  doc: rcu: Suspicious RCU usage is a warning
  docs: driver-api: iio: fix errors in documentation
  Documentation/process/howto: Update for 4.x -> 5.x versioning
  docs: Explicitly state that the 'Fixes:' tag shouldn't split lines
  doc: security: Add kern-doc for lsm_hooks.h
  doc: sctp: Merge and clean up rst files
  Docs: Correct /proc/stat path
  scripts/spdxcheck.py: fix C++ comment style detection
  doc: fix typos in license-rules.rst
  Documentation: fix admin-guide/README.rst minimum gcc version requirement
  doc: process: complete removal of info about -git patches
  doc: translations: sync translations 'remove info about -git patches'
  perf-security: wrap paragraphs on 72 columns
  perf-security: elaborate on perf_events/Perf privileged users
  perf-security: document collected perf_events/Perf data categories
  perf-security: document perf_events/Perf resource control
  sysfs.txt: add note on available attribute macros
  docs: kernel-doc: typo "if ... if" -> "if ... is"
  ...
143
Documentation/networking/checksum-offloads.rst
Normal file
@@ -0,0 +1,143 @@
.. SPDX-License-Identifier: GPL-2.0

=================
Checksum Offloads
=================


Introduction
============

This document describes a set of techniques in the Linux networking stack to
take advantage of checksum offload capabilities of various NICs.

The following technologies are described:

* TX Checksum Offload
* LCO: Local Checksum Offload
* RCO: Remote Checksum Offload

Things that should be documented here but aren't yet:

* RX Checksum Offload
* CHECKSUM_UNNECESSARY conversion


TX Checksum Offload
===================

The interface for offloading a transmit checksum to a device is explained in
detail in comments near the top of include/linux/skbuff.h.

In brief, it allows the stack to request that the device fill in a single
ones-complement checksum defined by the sk_buff fields skb->csum_start and
skb->csum_offset. The device should compute the 16-bit ones-complement
checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
packet, and fill in the result at (csum_start + csum_offset).

Because csum_offset cannot be negative, this ensures that the previous value of
the checksum field is included in the checksum computation, so it can be used
to supply any needed corrections to the checksum (such as the sum of the
pseudo-header for UDP or TCP).
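
The transmit-side arithmetic described above can be modelled in a few lines of
Python. This is only an illustrative sketch, not kernel code: the helper names
ones_complement_sum() and device_fill_checksum(), the packet bytes, and the
pseudo-header sum 0x1234 are all invented for the example, and csum_start is
assumed to sit at an even offset.

```python
def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Return the 16-bit ones-complement sum of the big-endian words in data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = initial
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def device_fill_checksum(packet: bytearray, csum_start: int, csum_offset: int) -> None:
    """Model of the device: sum csum_start..end, then store the complement."""
    field = csum_start + csum_offset
    csum = ~ones_complement_sum(bytes(packet[csum_start:])) & 0xFFFF
    packet[field:field + 2] = csum.to_bytes(2, "big")


# Hypothetical packet: 4 bytes of lower-layer header, then a UDP-like header
# whose checksum field (offset 6) is pre-seeded with a made-up pseudo-header
# sum, then 4 bytes of payload.
PSEUDO_SUM = 0x1234
packet = bytearray(b"\xaa\xbb\xcc\xdd"
                   b"\x04\xd2\x16\x2e\x00\x0c" + PSEUDO_SUM.to_bytes(2, "big")
                   + b"data")
device_fill_checksum(packet, csum_start=4, csum_offset=6)

# A receiver adds the pseudo-header sum to the sum of the segment; a correct
# checksum makes the total all-ones.
assert ones_complement_sum(bytes(packet[4:]), PSEUDO_SUM) == 0xFFFF
```

Because the seeded field lies inside the summed range, the stack never has to
hand the pseudo-header to the device separately.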

This interface only allows a single checksum to be offloaded. Where
encapsulation is used, the packet may have multiple checksum fields in
different header layers, and the rest will have to be handled by another
mechanism such as LCO or RCO.

CRC32c can also be offloaded using this interface, by means of filling
skb->csum_start and skb->csum_offset as described above, and setting
skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.

No offloading of the IP header checksum is performed; it is always done in
software. This is OK because when we build the IP header, we obviously have it
in cache, so summing it isn't expensive. It's also rather short.

The requirements for GSO are more complicated, because when segmenting an
encapsulated packet both the inner and outer checksums may need to be edited or
recomputed for each resulting segment. See the skbuff.h comment (section 'E')
for more details.

A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features.txt for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
csum_offset given in the SKB; if it tries to deduce these itself in hardware
(as some NICs do) the driver should check that the values in the SKB match
those which the hardware will deduce, and if not, fall back to checksumming in
software instead (with skb_csum_hwoffload_help() or one of the
skb_checksum_help() / skb_crc32c_csum_help() functions, as mentioned in
include/linux/skbuff.h).

The stack should, for the most part, assume that checksum offload is supported
by the underlying device. The only place that should check is
validate_xmit_skb(), and the functions it calls directly or indirectly. That
function compares the offload features requested by the SKB (which may include
other offloads besides TX Checksum Offload) and, if they are not supported or
enabled on the device (determined by netdev->features), performs the
corresponding offload in software. In the case of TX Checksum Offload, that
means calling skb_csum_hwoffload_help(skb, features).


LCO: Local Checksum Offload
===========================

LCO is a technique for efficiently computing the outer checksum of an
encapsulated datagram when the inner checksum is due to be offloaded.

The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
to the complement of the sum of the pseudo-header, because everything else gets
'cancelled out' by the checksum field. This is because the sum was
complemented before being written to the checksum field.

More generally, this holds in any case where the 'IP-style' ones-complement
checksum is used, and thus any checksum that TX Checksum Offload supports.

That is, if we have set up TX Checksum Offload with a start/offset pair, we
know that after the device has filled in that checksum, the ones-complement sum
from csum_start to the end of the packet will be equal to the complement of
whatever value we put in the checksum field beforehand. This allows us to
compute the outer checksum without looking at the payload: we simply stop
summing when we get to csum_start, then add the complement of the 16-bit word
at (csum_start + csum_offset).

Then, when the true inner checksum is filled in (either by hardware or by
skb_checksum_help()), the outer checksum will become correct by virtue of the
arithmetic.
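
The identity that LCO relies on can be checked numerically. The following
Python sketch is a model under stated assumptions rather than the kernel's
lco_csum(): the function names lco_sum() and fill_inner_checksum(), the packet
layout, and the seed value 0x1234 are hypothetical, and the distance from the
outer header to csum_start is assumed to be even.

```python
def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Return the 16-bit ones-complement sum of the big-endian words in data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = initial
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def lco_sum(packet: bytes, outer_start: int, csum_start: int, csum_offset: int) -> int:
    """Predict the sum of outer_start..end without reading past csum_start,
    apart from the 16-bit seed stored in the inner checksum field."""
    field = csum_start + csum_offset
    seed = int.from_bytes(packet[field:field + 2], "big")
    return ones_complement_sum(packet[outer_start:csum_start], ~seed & 0xFFFF)


def fill_inner_checksum(packet: bytearray, csum_start: int, csum_offset: int) -> None:
    """Stand-in for the device (or a software helper) completing the offload."""
    field = csum_start + csum_offset
    csum = ~ones_complement_sum(bytes(packet[csum_start:])) & 0xFFFF
    packet[field:field + 2] = csum.to_bytes(2, "big")


# Hypothetical packet: 8-byte outer header, 8-byte inner UDP-like header with
# a seeded checksum field (0x1234) at offset 6, and 4 bytes of payload.
packet = bytearray(b"\x11\x22\x33\x44\x55\x66\x77\x88"
                   b"\x04\xd2\x16\x2e\x00\x0c\x12\x34"
                   b"data")
predicted = lco_sum(bytes(packet), outer_start=0, csum_start=8, csum_offset=6)
fill_inner_checksum(packet, csum_start=8, csum_offset=6)

# Summing the whole packet after the inner checksum is filled in gives the
# value predicted earlier without touching the payload.
assert predicted == ones_complement_sum(bytes(packet))
```

The outer checksum is then the complement of the predicted sum, combined with
whatever pseudo-header the outer protocol requires.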

LCO is performed by the stack when constructing an outer UDP header for an
encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for the
IPv6 equivalents, in udp6_set_csum().

It is also performed when constructing an IPv4 GRE header, in
net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
constructing an IPv6 GRE header; the GRE checksum is computed over the whole
packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
LCO here as IPv6 GRE still uses an IP-style checksum.

All of the LCO implementations use a helper function lco_csum(), in
include/linux/skbuff.h.

LCO can safely be used for nested encapsulations; in this case, the outer
encapsulation layer will sum over both its own header and the 'middle' header.
This does mean that the 'middle' header will get summed multiple times, but
there doesn't seem to be a way to avoid that without incurring bigger costs
(e.g. in SKB bloat).


RCO: Remote Checksum Offload
============================

RCO is a technique for eliding the inner checksum of an encapsulated datagram,
allowing the outer checksum to be offloaded. It does, however, involve a
change to the encapsulation protocols, which the receiver must also support.
For this reason, it is disabled by default.

RCO is detailed in the following Internet-Drafts:

* https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
* https://tools.ietf.org/html/draft-herbert-vxlan-rco-00

In Linux, RCO is implemented individually in each encapsulation protocol, and
most tunnel types have flags controlling its use. For instance, VXLAN has the
flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
used when transmitting to a given remote destination.
@@ -1,122 +0,0 @@
Checksum Offloads in the Linux Networking Stack


Introduction
============

This document describes a set of techniques in the Linux networking stack
to take advantage of checksum offload capabilities of various NICs.

The following technologies are described:
* TX Checksum Offload
* LCO: Local Checksum Offload
* RCO: Remote Checksum Offload

Things that should be documented here but aren't yet:
* RX Checksum Offload
* CHECKSUM_UNNECESSARY conversion


TX Checksum Offload
===================

The interface for offloading a transmit checksum to a device is explained
in detail in comments near the top of include/linux/skbuff.h.
In brief, it allows the stack to request that the device fill in a single
ones-complement checksum defined by the sk_buff fields skb->csum_start and
skb->csum_offset. The device should compute the 16-bit ones-complement
checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
packet, and fill in the result at (csum_start + csum_offset).
Because csum_offset cannot be negative, this ensures that the previous
value of the checksum field is included in the checksum computation, so
it can be used to supply any needed corrections to the checksum (such as
the sum of the pseudo-header for UDP or TCP).
This interface only allows a single checksum to be offloaded. Where
encapsulation is used, the packet may have multiple checksum fields in
different header layers, and the rest will have to be handled by another
mechanism such as LCO or RCO.
CRC32c can also be offloaded using this interface, by means of filling
skb->csum_start and skb->csum_offset as described above, and setting
skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
No offloading of the IP header checksum is performed; it is always done in
software. This is OK because when we build the IP header, we obviously
have it in cache, so summing it isn't expensive. It's also rather short.
The requirements for GSO are more complicated, because when segmenting an
encapsulated packet both the inner and outer checksums may need to be
edited or recomputed for each resulting segment. See the skbuff.h comment
(section 'E') for more details.

A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features.txt for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
and csum_offset given in the SKB; if it tries to deduce these itself in
hardware (as some NICs do) the driver should check that the values in the
SKB match those which the hardware will deduce, and if not, fall back to
checksumming in software instead (with skb_csum_hwoffload_help() or one of
the skb_checksum_help() / skb_crc32c_csum_help() functions, as mentioned in
include/linux/skbuff.h).

The stack should, for the most part, assume that checksum offload is
supported by the underlying device. The only place that should check is
validate_xmit_skb(), and the functions it calls directly or indirectly.
That function compares the offload features requested by the SKB (which
may include other offloads besides TX Checksum Offload) and, if they are
not supported or enabled on the device (determined by netdev->features),
performs the corresponding offload in software. In the case of TX
Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features).


LCO: Local Checksum Offload
===========================

LCO is a technique for efficiently computing the outer checksum of an
encapsulated datagram when the inner checksum is due to be offloaded.
The ones-complement sum of a correctly checksummed TCP or UDP packet is
equal to the complement of the sum of the pseudo header, because everything
else gets 'cancelled out' by the checksum field. This is because the sum was
complemented before being written to the checksum field.
More generally, this holds in any case where the 'IP-style' ones complement
checksum is used, and thus any checksum that TX Checksum Offload supports.
That is, if we have set up TX Checksum Offload with a start/offset pair, we
know that after the device has filled in that checksum, the ones
complement sum from csum_start to the end of the packet will be equal to
the complement of whatever value we put in the checksum field beforehand.
This allows us to compute the outer checksum without looking at the payload:
we simply stop summing when we get to csum_start, then add the complement of
the 16-bit word at (csum_start + csum_offset).
Then, when the true inner checksum is filled in (either by hardware or by
skb_checksum_help()), the outer checksum will become correct by virtue of
the arithmetic.

LCO is performed by the stack when constructing an outer UDP header for an
encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for
the IPv6 equivalents, in udp6_set_csum().
It is also performed when constructing an IPv4 GRE header, in
net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
constructing an IPv6 GRE header; the GRE checksum is computed over the
whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
All of the LCO implementations use a helper function lco_csum(), in
include/linux/skbuff.h.

LCO can safely be used for nested encapsulations; in this case, the outer
encapsulation layer will sum over both its own header and the 'middle'
header. This does mean that the 'middle' header will get summed multiple
times, but there doesn't seem to be a way to avoid that without incurring
bigger costs (e.g. in SKB bloat).


RCO: Remote Checksum Offload
============================

RCO is a technique for eliding the inner checksum of an encapsulated
datagram, allowing the outer checksum to be offloaded. It does, however,
involve a change to the encapsulation protocols, which the receiver must
also support. For this reason, it is disabled by default.
RCO is detailed in the following Internet-Drafts:
https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
In Linux, RCO is implemented individually in each encapsulation protocol,
and most tunnel types have flags controlling its use. For instance, VXLAN
has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
RCO should be used when transmitting to a given remote destination.
@@ -36,6 +36,9 @@ Contents:
 alias
 bridge
 snmp_counter
+checksum-offloads
+segmentation-offloads
+scaling

 .. only:: subproject

@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
 Scaling in the Linux Networking Stack
+=====================================


 Introduction
@@ -10,11 +14,11 @@ multi-processor systems.

 The following technologies are described:

-RSS: Receive Side Scaling
-RPS: Receive Packet Steering
-RFS: Receive Flow Steering
-Accelerated Receive Flow Steering
-XPS: Transmit Packet Steering
+- RSS: Receive Side Scaling
+- RPS: Receive Packet Steering
+- RFS: Receive Flow Steering
+- Accelerated Receive Flow Steering
+- XPS: Transmit Packet Steering


 RSS: Receive Side Scaling
@@ -45,7 +49,9 @@ programmable filters. For example, webserver bound TCP port 80 packets
 can be directed to their own receive queue. Such “n-tuple” filters can
 be configured from ethtool (--config-ntuple).

-==== RSS Configuration
+
+RSS Configuration
+-----------------

 The driver for a multi-queue capable NIC typically provides a kernel
 module parameter for specifying the number of hardware queues to
@@ -63,7 +69,9 @@ commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
 indirection table could be done to give different queues different
 relative weights.

-== RSS IRQ Configuration
+
+RSS IRQ Configuration
+~~~~~~~~~~~~~~~~~~~~~

 Each receive queue has a separate IRQ associated with it. The NIC triggers
 this to notify a CPU when new packets arrive on the given queue. The
@@ -77,7 +85,9 @@ affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
 will be running irqbalance, a daemon that dynamically optimizes IRQ
 assignments and as a result may override any manual settings.

-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 RSS should be enabled when latency is a concern or whenever receive
 interrupt processing forms a bottleneck. Spreading load between CPUs
@@ -105,10 +115,12 @@ Whereas RSS selects the queue and hence CPU that will run the hardware
 interrupt handler, RPS selects the CPU to perform protocol processing
 above the interrupt handler. This is accomplished by placing the packet
 on the desired CPU’s backlog queue and waking up the CPU for processing.
-RPS has some advantages over RSS: 1) it can be used with any NIC,
-2) software filters can easily be added to hash over new protocols,
+RPS has some advantages over RSS:
+
+1) it can be used with any NIC
+2) software filters can easily be added to hash over new protocols
 3) it does not increase hardware device interrupt rate (although it does
-introduce inter-processor interrupts (IPIs)).
+introduce inter-processor interrupts (IPIs))

 RPS is called during bottom half of the receive interrupt handler, when
 a driver sends a packet up the network stack with netif_rx() or
@@ -135,21 +147,25 @@ packets have been queued to their backlog queue. The IPI wakes backlog
 processing on the remote CPU, and any queued packets are then processed
 up the networking stack.

-==== RPS Configuration
+
+RPS Configuration
+-----------------

 RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
 by default for SMP). Even when compiled in, RPS remains disabled until
 explicitly configured. The list of CPUs to which RPS may forward traffic
-can be configured for each receive queue using a sysfs file entry:
+can be configured for each receive queue using a sysfs file entry::

-/sys/class/net/<dev>/queues/rx-<n>/rps_cpus
+/sys/class/net/<dev>/queues/rx-<n>/rps_cpus

 This file implements a bitmap of CPUs. RPS is disabled when it is zero
 (the default), in which case packets are processed on the interrupting
 CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
 the bitmap.
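
As an illustration of the bitmap, the sketch below turns a list of CPU numbers
into a mask string. It assumes the file accepts the usual kernel cpumask
notation of comma-separated 32-bit hexadecimal groups, most significant group
first; the helper name cpus_to_mask() is made up for this example.

```python
def cpus_to_mask(cpus) -> str:
    """Format CPU numbers as comma-separated 32-bit hex groups, highest first."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # one bit per CPU, bit 0 is CPU 0
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if not mask:
            break
    return ",".join(reversed(groups))


print(cpus_to_mask([0, 1, 2, 3]))  # 0000000f
print(cpus_to_mask([32]))          # 00000001,00000000
```

The resulting string would then be written to the rps_cpus file of the chosen
receive queue; an empty CPU list yields an all-zero mask, which leaves RPS
disabled.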

-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 For a single queue device, a typical RPS configuration would be to set
 the rps_cpus to the CPUs in the same memory domain of the interrupting
@@ -163,7 +179,9 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
 RPS might be beneficial if the rps_cpus for each queue are the ones that
 share the same memory domain as the interrupting CPU for that queue.

-==== RPS Flow Limit
+
+RPS Flow Limit
+--------------

 RPS scales kernel receive processing across CPUs without introducing
 reordering. The trade-off to sending all packets from the same flow
@@ -187,29 +205,33 @@ No packets are dropped when the input packet queue length is below
 the threshold, so flow limit does not sever connections outright:
 even large flows maintain connectivity.

-== Interface
+
+Interface
+~~~~~~~~~

 Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
 turned on. It is implemented for each CPU independently (to avoid lock
 and cache contention) and toggled per CPU by setting the relevant bit
 in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
-bitmap interface as rps_cpus (see above) when called from procfs:
+bitmap interface as rps_cpus (see above) when called from procfs::

-/proc/sys/net/core/flow_limit_cpu_bitmap
+/proc/sys/net/core/flow_limit_cpu_bitmap

 Per-flow rate is calculated by hashing each packet into a hashtable
 bucket and incrementing a per-bucket counter. The hash function is
 the same that selects a CPU in RPS, but as the number of buckets can
 be much larger than the number of CPUs, flow limit has finer-grained
 identification of large flows and fewer false positives. The default
-table has 4096 buckets. This value can be modified through sysctl
+table has 4096 buckets. This value can be modified through sysctl::

-net.core.flow_limit_table_len
+net.core.flow_limit_table_len

 The value is only consulted when a new table is allocated. Modifying
 it does not update active tables.
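
The per-bucket accounting described above can be pictured with a toy model.
This is a loose sketch and not the kernel implementation: zlib.crc32 stands in
for the RPS flow hash, and the class name, the sliding window of recent
packets, and its length are assumptions added so that the counters behave like
a rate.

```python
import zlib
from collections import deque


class FlowLimitSketch:
    """Toy per-CPU flow accounting: hash each packet into a bucket and count."""

    def __init__(self, table_len: int = 4096, window: int = 128):
        self.buckets = [0] * table_len  # one counter per hash bucket
        self.recent = deque()           # buckets of the last `window` packets
        self.window = window

    def record(self, flow_key: bytes) -> int:
        """Account one packet and return the recent count of its bucket."""
        bucket = zlib.crc32(flow_key) % len(self.buckets)
        if len(self.recent) == self.window:
            self.buckets[self.recent.popleft()] -= 1  # forget the oldest packet
        self.recent.append(bucket)
        self.buckets[bucket] += 1
        return self.buckets[bucket]


limiter = FlowLimitSketch(table_len=4096, window=4)
counts = [limiter.record(b"10.0.0.1:1234->10.0.0.2:80") for _ in range(5)]
assert counts == [1, 2, 3, 4, 4]
```

With more buckets than CPUs, two unrelated flows are less likely to share a
counter, which is the finer-grained identification the text refers to.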

-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 Flow limit is useful on systems with many concurrent connections,
 where a single connection taking up 50% of a CPU indicates a problem.
@@ -280,10 +302,10 @@ table), the packet is enqueued onto that CPU’s backlog. If they differ,
 the current CPU is updated to match the desired CPU if one of the
 following is true:

-- The current CPU's queue head counter >= the recorded tail counter
-  value in rps_dev_flow[i]
-- The current CPU is unset (>= nr_cpu_ids)
-- The current CPU is offline
+- The current CPU's queue head counter >= the recorded tail counter
+  value in rps_dev_flow[i]
+- The current CPU is unset (>= nr_cpu_ids)
+- The current CPU is offline

 After this check, the packet is sent to the (possibly updated) current
 CPU. These rules aim to ensure that a flow only moves to a new CPU when
@@ -291,19 +313,23 @@ there are no packets outstanding on the old CPU, as the outstanding
 packets could arrive later than those about to be processed on the new
 CPU.
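
The three conditions listed above amount to a small predicate. The Python
sketch below is only a model of that rule: the function name and its
parameters are hypothetical, and plain integers stand in for the per-CPU queue
head counters and the tail value recorded in rps_dev_flow[i].

```python
def should_move_flow(current_cpu: int, nr_cpu_ids: int, online_cpus: set,
                     queue_head: dict, recorded_tail: int) -> bool:
    """Return True when the flow may switch from current_cpu to the desired CPU."""
    if current_cpu >= nr_cpu_ids:       # the current CPU is unset
        return True
    if current_cpu not in online_cpus:  # the current CPU is offline
        return True
    # Every packet queued for this flow on the old CPU has been dequeued.
    return queue_head[current_cpu] >= recorded_tail


# Packets are still outstanding on CPU 0, so the flow stays put.
assert should_move_flow(0, 8, {0, 1}, {0: 4, 1: 0}, recorded_tail=5) is False
# Once CPU 0 has drained past the recorded tail, the flow may move.
assert should_move_flow(0, 8, {0, 1}, {0: 5, 1: 0}, recorded_tail=5) is True
```

Checking the unset and offline cases first avoids looking up a queue counter
for a CPU that does not exist in this model.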

-==== RFS Configuration
+
+RFS Configuration
+-----------------

 RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
 by default for SMP). The functionality remains disabled until explicitly
-configured. The number of entries in the global flow table is set through:
+configured. The number of entries in the global flow table is set through::

-/proc/sys/net/core/rps_sock_flow_entries
+/proc/sys/net/core/rps_sock_flow_entries

-The number of entries in the per-queue flow table are set through:
+The number of entries in the per-queue flow table are set through::

-/sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
+/sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 Both of these need to be set before RFS is enabled for a receive queue.
 Values for both are rounded up to the nearest power of two. The
@@ -347,7 +373,9 @@ functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
 to populate the map. For each CPU, the corresponding queue in the map is
 set to be one whose processing CPU is closest in cache locality.

-==== Accelerated RFS Configuration
+
+Accelerated RFS Configuration
+-----------------------------

 Accelerated RFS is only available if the kernel is compiled with
 CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
@@ -356,11 +384,14 @@ of CPU to queues is automatically deduced from the IRQ affinities
 configured for each receive queue by the driver, so no additional
 configuration should be necessary.

-== Suggested Configuration
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 This technique should be enabled whenever one wants to use RFS and the
 NIC supports hardware acceleration.

+
 XPS: Transmit Packet Steering
 =============================

@@ -430,20 +461,25 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.

-==== XPS Configuration
+XPS Configuration
+-----------------

 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). The functionality remains disabled until explicitly
 configured. To enable XPS, the bitmap of CPUs/receive-queues that may
 use a transmit queue is configured using the sysfs file entry:

-For selection based on CPUs map:
-/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+For selection based on CPUs map::

-For selection based on receive-queues map:
-/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+/sys/class/net/<dev>/queues/tx-<n>/xps_cpus

-== Suggested Configuration
+For selection based on receive-queues map::
+
+/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~

 For a network device with a single transmission queue, XPS configuration
 has no effect, since there is no choice in this case. In a multi-queue
@@ -460,16 +496,18 @@ explicitly configured mapping receive-queue(s) to transmit queue(s). If the
 user configuration for receive-queue map does not apply, then the transmit
 queue is selected based on the CPUs map.

-Per TX Queue rate limitation:
-=============================
+
+Per TX Queue rate limitation
+============================

 These are rate-limitation mechanisms implemented by HW, where currently
-a max-rate attribute is supported, by setting a Mbps value to
+a max-rate attribute is supported, by setting a Mbps value to::

-/sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
+/sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

 A value of zero means disabled, and this is the default.

+
 Further Information
 ===================
 RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
@@ -480,5 +518,6 @@ Accelerated RFS was introduced in 2.6.35. Original patches were
 submitted by Ben Hutchings (bwh@kernel.org)

 Authors:
-Tom Herbert (therbert@google.com)
-Willem de Bruijn (willemb@google.com)
+
+- Tom Herbert (therbert@google.com)
+- Willem de Bruijn (willemb@google.com)
@@ -1,4 +1,9 @@
-Segmentation Offloads in the Linux Networking Stack
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Segmentation Offloads
+=====================
+

 Introduction
 ============
@@ -15,6 +20,7 @@ The following technologies are described:
 * Partial Generic Segmentation Offload - GSO_PARTIAL
 * SCTP acceleration with GSO - GSO_BY_FRAGS

+
 TCP Segmentation Offload
 ========================

@@ -42,6 +48,7 @@ NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO
 and we will either increment the IP ID for all frames, or leave it at a
 static value based on driver preference.

+
 UDP Fragmentation Offload
 =========================

@@ -54,6 +61,7 @@ UFO is deprecated: modern kernels will no longer generate UFO skbs, but can
 still receive them from tuntap and similar devices. Offload of UDP-based
 tunnel protocols is still supported.

+
 IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
 ========================================================

@@ -71,17 +79,19 @@ refer to the tunnel headers as the outer headers, while the encapsulated
 data is normally referred to as the inner headers. Below is the list of
 calls to access the given headers:

-IPIP/SIT Tunnel:
-             Outer                  Inner
-MAC          skb_mac_header
-Network      skb_network_header     skb_inner_network_header
-Transport    skb_transport_header
+IPIP/SIT Tunnel::

-UDP/GRE Tunnel:
-             Outer                  Inner
-MAC          skb_mac_header         skb_inner_mac_header
-Network      skb_network_header     skb_inner_network_header
-Transport    skb_transport_header   skb_inner_transport_header
+             Outer                  Inner
+MAC          skb_mac_header
+Network      skb_network_header     skb_inner_network_header
+Transport    skb_transport_header
+
+UDP/GRE Tunnel::
+
+             Outer                  Inner
+MAC          skb_mac_header         skb_inner_mac_header
+Network      skb_network_header     skb_inner_network_header
+Transport    skb_transport_header   skb_inner_transport_header

 In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
 SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the
@@ -93,6 +103,7 @@ header has requested a remote checksum offload. In this case the inner
 headers will be left with a partial checksum and only the outer header
 checksum will be computed.

+
 Generic Segmentation Offload
 ============================

@@ -106,6 +117,7 @@ Before enabling any hardware segmentation offload a corresponding software
 offload is required in GSO. Otherwise it becomes possible for a frame to
 be re-routed between devices and end up being unable to be transmitted.

+
 Generic Receive Offload
 =======================

@@ -117,6 +129,7 @@ this is IPv4 ID in the case that the DF bit is set for a given IP header.
 If the value of the IPv4 ID is not sequentially incrementing it will be
 altered so that it is when a frame assembled via GRO is segmented via GSO.

+
 Partial Generic Segmentation Offload
 ====================================

@@ -134,6 +147,7 @@ is the outer IPv4 ID field. It is up to the device drivers to guarantee
 that the IPv4 ID field is incremented in the case that a given header does
 not have the DF bit set.

+
 SCTP acceleration with GSO
 ===========================

@@ -157,14 +171,14 @@ appropriately.

 There are some helpers to make this easier:

-- skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
-  an skb is an SCTP GSO skb.
+- skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+  an skb is an SCTP GSO skb.

-- For size checks, the skb_gso_validate_*_len family of helpers correctly
-  considers GSO_BY_FRAGS.
+- For size checks, the skb_gso_validate_*_len family of helpers correctly
+  considers GSO_BY_FRAGS.

-- For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
-  will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
+- For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
+  will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.

 This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits
 set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE.