Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfilter network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
This commit is contained in:
Linus Torvalds
2017-05-02 16:40:27 -07:00
1690 changed files with 109196 additions and 46486 deletions

@@ -595,10 +595,9 @@ got from bpf_prog_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most of the
architectures. x86-64, aarch64 and s390x perform JIT compilation from eBPF
instruction set, however, future work will migrate other JIT compilers as well,
so that they will profit from the very same benefits.
Currently, the classic BPF format is being used for JITing on most 32-bit
architectures, whereas x86-64, aarch64, s390x, powerpc64 and sparc64 perform
JIT compilation from the eBPF instruction set.
Some core changes of the new internal format:

@@ -63,6 +63,78 @@ Additional Configurations
The latest release of ethtool can be found from
https://www.kernel.org/pub/software/network/ethtool
Flow Director n-tuple traffic filters (FDir)
---------------------------------------------
The driver utilizes the ethtool interface for configuring ntuple filters,
via "ethtool -N <device> <filter>".
The sctp4, ip4, udp4, and tcp4 flow types are supported with the standard
fields including src-ip, dst-ip, src-port and dst-port. The driver only
supports fully enabling or fully masking the fields, so use of the mask
fields for partial matches is not supported.
Additionally, the driver supports using the action to specify filters for a
Virtual Function. You can specify the action as a 64-bit value, where the
lower 32 bits represent the queue number, while the next 8 bits represent
which VF. Note that 0 is the PF, so the VF identifier is offset by 1. For
example:
... action 0x800000002 ...
would direct traffic to Virtual Function 7 (8 minus 1), on queue 2 of that
VF.
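The action layout described above can be illustrated with a small sketch. The helper names `decode_fdir_action` and `encode_fdir_action` are hypothetical and not part of the driver or ethtool; they only restate the bit layout given in the text (lower 32 bits for the queue, next 8 bits for the function, where 0 is the PF).

```python
from typing import Optional, Tuple


def decode_fdir_action(action: int) -> Tuple[int, int]:
    """Split an action value into (function, queue).

    function 0 is the PF; function n (n >= 1) is VF n-1.
    """
    queue = action & 0xFFFFFFFF        # lower 32 bits: queue number
    function = (action >> 32) & 0xFF   # next 8 bits: 0 = PF, n = VF n-1
    return function, queue


def encode_fdir_action(queue: int, vf: Optional[int] = None) -> int:
    """Build an action value; vf=None targets the PF (hypothetical helper)."""
    function = 0 if vf is None else vf + 1   # VF identifier is offset by 1
    return (function << 32) | (queue & 0xFFFFFFFF)


function, queue = decode_fdir_action(0x800000002)
print(function, queue)   # function 8 means VF 7
```

With this layout, `encode_fdir_action(2, vf=7)` yields the same value as the example in the text.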
The driver also supports using the user-defined field to specify 2 bytes of
arbitrary data to match within the packet payload in addition to the regular
fields. The data is specified in the lower 32 bits of the user-def field in
the following way:
+----------------------------+---------------------------+
| 31    28    24    20    16 | 15    12     8     4     0|
+----------------------------+---------------------------+
| offset into packet payload |  2 bytes of flexible data |
+----------------------------+---------------------------+
As an example,
... user-def 0x4FFFF ....
means to match the value 0xFFFF 4 bytes into the packet payload. Note that
the offset is based on the beginning of the payload, and not the beginning
of the packet. Thus
flow-type tcp4 ... user-def 0x8BEAF ....
would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
TCP/IPv4 payload.
For ICMP, the hardware parses the ICMP header as 4 bytes of header and 4
bytes of payload, so if you want to match an ICMP frame's payload you may need
to add 4 to the offset in order to match the data.
Furthermore, the offset can only be up to a value of 64, as the hardware
will only read up to 64 bytes of data from the payload. It must also be even
as the flexible data is 2 bytes long and must be aligned to byte 0 of the
packet payload.
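As a rough illustration of the user-def layout and its constraints, the following sketch packs an offset and 2 bytes of data into one value. The function names are hypothetical, and treating 64 as an inclusive upper bound for the offset is an assumption based on the wording "up to a value of 64".

```python
from typing import Tuple


def encode_user_def(offset: int, data: int) -> int:
    """Pack a payload offset and 2 bytes of flexible data (hypothetical helper)."""
    if not 0 <= data <= 0xFFFF:
        raise ValueError("flexible data must fit in 2 bytes")
    # Assumption: offsets 0..64 inclusive are allowed, and they must be even.
    if not 0 <= offset <= 64 or offset % 2 != 0:
        raise ValueError("offset must be even and between 0 and 64")
    return (offset << 16) | data       # upper 16 bits: offset, lower 16: data


def decode_user_def(value: int) -> Tuple[int, int]:
    """Return (offset, data) from the lower 32 bits of a user-def value."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


print(hex(encode_user_def(4, 0xFFFF)))
print(hex(encode_user_def(8, 0xBEAF)))
```

The two calls reproduce the `0x4FFFF` and `0x8BEAF` values used in the examples above.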
When programming filters, the hardware is limited to using a single input
set for each flow type. This means that it is an error to program two
different filters with the same type that don't match on the same fields.
Thus the second of the following two commands will fail:
ethtool -N <device> flow-type tcp4 src-ip 192.168.0.7 action 5
ethtool -N <device> flow-type tcp4 dst-ip 192.168.15.18 action 1
This is because the first filter will be accepted and reprogram the input
set for TCPv4 filters, but the second filter will be unable to reprogram the
input set until all the conflicting TCPv4 filters are first removed.
Note that the user-defined flexible offset is also considered part of the
input set and cannot be programmed separately for multiple filters of the
same type. However, the flexible data is not part of the input set and
multiple filters may use the same offset but match against different data.
Data Center Bridging (DCB)
--------------------------
DCB configuration is not currently supported.

@@ -73,6 +73,14 @@ fib_multipath_use_neigh - BOOLEAN
0 - disabled
1 - enabled
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes. Only valid
for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
Default: 0 (Layer 3)
Possible values:
0 - Layer 3
1 - Layer 4
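The difference between the two policies can be sketched as follows. This is not the kernel's hash function: `zlib.crc32` is a stand-in, `select_nexthop` is a hypothetical name, and the sketch assumes that Layer 3 hashing covers only the source and destination addresses while Layer 4 hashing also covers the ports and protocol.

```python
import zlib


def select_nexthop(src_ip, dst_ip, src_port, dst_port, proto,
                   n_nexthops, hash_policy=0):
    """Pick a nexthop index for a flow (illustrative sketch only)."""
    if hash_policy == 0:
        key = (src_ip, dst_ip)                              # Layer 3: addresses only
    else:
        key = (src_ip, dst_ip, src_port, dst_port, proto)   # Layer 4: 5-tuple
    digest = zlib.crc32(repr(key).encode())                 # stand-in hash
    return digest % n_nexthops


# Under policy 0, two flows between the same hosts always share a nexthop,
# because the ports do not contribute to the hash.
a = select_nexthop("10.0.0.1", "10.0.0.2", 1000, 80, 6, 4, hash_policy=0)
b = select_nexthop("10.0.0.1", "10.0.0.2", 2000, 443, 6, 4, hash_policy=0)
print(a == b)
```

Under policy 1 the same two flows may land on different nexthops, which is the point of Layer 4 hashing.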
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
this when using large numbers of interfaces and/or routes.
@@ -594,6 +602,14 @@ tcp_fastopen - INTEGER
Note that additional client or server features are only
effective if the basic support (0x1 and 0x2) are enabled respectively.
tcp_fastopen_blackhole_timeout_sec - INTEGER
Initial time period in seconds to disable Fastopen on active TCP sockets
when a TFO firewall blackhole issue happens.
This time period will grow exponentially when more blackhole issues
get detected right after Fastopen is re-enabled and will reset to
initial value when the blackhole issue goes away.
By default, it is set to 1hr.
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
@@ -640,11 +656,6 @@ tcp_tso_win_divisor - INTEGER
building larger TSO frames.
Default: 3
tcp_tw_recycle - BOOLEAN
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
@@ -853,12 +864,21 @@ ip_dynaddr - BOOLEAN
ip_early_demux - BOOLEAN
Optimize input packet processing down to one demux for
certain kinds of local sockets. Currently we only do this
for established TCP sockets.
for established TCP and connected UDP sockets.
It may add an additional cost for pure routing workloads that
reduces overall throughput, in such case you should disable it.
Default: 1
tcp_early_demux - BOOLEAN
Enable early demux for established TCP sockets.
Default: 1
udp_early_demux - BOOLEAN
Enable early demux for connected UDP sockets. Disable this if
your system could experience more unconnected load.
Default: 1
icmp_echo_ignore_all - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it.
@@ -1458,11 +1478,20 @@ accept_ra_pinfo - BOOLEAN
Functional default: enabled if accept_ra is enabled.
disabled if accept_ra is disabled.
accept_ra_rt_info_min_plen - INTEGER
Minimum prefix length of Route Information in RA.
Route Information w/ prefix smaller than this variable shall
be ignored.
Functional default: 0 if accept_ra_rtr_pref is enabled.
-1 if accept_ra_rtr_pref is disabled.
accept_ra_rt_info_max_plen - INTEGER
Maximum prefix length of Route Information in RA.
Route Information w/ prefix larger than or equal to this
variable shall be ignored.
Route Information w/ prefix larger than this variable shall
be ignored.
Functional default: 0 if accept_ra_rtr_pref is enabled.
-1 if accept_ra_rtr_pref is disabled.
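The two bounds can be read together as a simple range check. The sketch below follows the updated wording, in which only prefixes strictly larger than the maximum are ignored; `accept_route_info` is a hypothetical name and the function is not kernel code.

```python
def accept_route_info(prefix_len: int, min_plen: int, max_plen: int) -> bool:
    """Return True if a Route Information option would be accepted.

    Prefixes shorter than min_plen or longer than max_plen are ignored.
    """
    return min_plen <= prefix_len <= max_plen


print(accept_route_info(48, 0, 64))
print(accept_route_info(65, 0, 64))
```

Note that with a maximum of -1 no prefix length can pass the check.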

@@ -175,6 +175,14 @@ nat_icmp_send - BOOLEAN
for VS/NAT when the load balancer receives packets from real
servers but the connection entries don't exist.
pmtu_disc - BOOLEAN
0 - disabled
not 0 - enabled (default)
By default, reject with FRAG_NEEDED all DF packets that exceed
the PMTU, irrespective of the forwarding method. For TUN method
the flag can be disabled to fragment such packets.
secure_tcp - INTEGER
0 - disabled (default)
@@ -185,15 +193,59 @@ secure_tcp - INTEGER
The value definition is the same as that of drop_entry and
drop_packet.
sync_threshold - INTEGER
default 3
sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period
default 3 50
It sets synchronization threshold, which is the minimum number
of incoming packets that a connection needs to receive before
the connection will be synchronized. A connection will be
synchronized, every time the number of its incoming packets
modulus 50 equals the threshold. The range of the threshold is
from 0 to 49.
It sets synchronization threshold, which is the minimum number
of incoming packets that a connection needs to receive before
the connection will be synchronized. A connection will be
synchronized, every time the number of its incoming packets
modulus sync_period equals the threshold. The range of the
threshold is from 0 to sync_period.
When sync_period and sync_refresh_period are 0, send sync only
for state changes or only once when pkts matches sync_threshold
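The packet-count rule for sync_threshold and sync_period can be sketched as below. `should_sync` is a hypothetical helper that looks only at the packet counter; it deliberately ignores state changes and sync_refresh_period, which also influence when sync messages are sent.

```python
def should_sync(pkts: int, sync_threshold: int = 3, sync_period: int = 50) -> bool:
    """Decide from the packet count alone whether to sync (sketch)."""
    if sync_period > 0:
        # Periodic case: sync whenever pkts modulo the period hits the threshold.
        return pkts % sync_period == sync_threshold
    # sync_period == 0: sync only once, when pkts matches the threshold.
    return pkts == sync_threshold


print([n for n in range(1, 120) if should_sync(n)])
```

With the defaults of 3 and 50, the list contains the packet counts 3, 53 and 103.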
sync_refresh_period - UNSIGNED INTEGER
default 0
In seconds, difference in reported connection timer that triggers
new sync message. It can be used to avoid sync messages for the
specified period (or half of the connection timeout if it is lower)
if connection state is not changed since last sync.
This is useful for normal connections with high traffic to reduce
sync rate. Additionally, retry sync_retries times with period of
sync_refresh_period/8.
sync_retries - INTEGER
default 0
Defines sync retries with period of sync_refresh_period/8. Useful
to protect against loss of sync messages. The range of the
sync_retries is from 0 to 3.
sync_qlen_max - UNSIGNED LONG
Hard limit for queued sync messages that are not sent yet. It
defaults to 1/32 of the memory pages but actually represents
number of messages. It will protect us from allocating large
parts of memory when the sending rate is lower than the queuing
rate.
sync_sock_size - INTEGER
default 0
Configuration of SNDBUF (master) or RCVBUF (slave) socket limit.
Default value is 0 (preserve system defaults).
sync_ports - INTEGER
default 1
The number of threads that master and backup servers can use for
sync traffic. Every thread will use single UDP port, thread 0 will
use the default port 8848 while last thread will use port
8848+sync_ports-1.
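The port assignment described above amounts to a contiguous range starting at the default port. A minimal sketch, with `sync_port_numbers` as a hypothetical helper name:

```python
DEFAULT_SYNC_PORT = 8848   # default port named in the text


def sync_port_numbers(sync_ports: int = 1) -> list:
    """Return the UDP port used by each sync thread, thread 0 first."""
    return [DEFAULT_SYNC_PORT + thread for thread in range(sync_ports)]


print(sync_port_numbers(3))
```

For sync_ports set to 3, the threads use ports 8848, 8849 and 8850.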
snat_reroute - BOOLEAN
0 - disabled

@@ -19,6 +19,25 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0
ip_ttl_propagate - BOOL
Control whether TTL is propagated from the IPv4/IPv6 header to
the MPLS header on imposing labels and propagated from the
MPLS header to the IPv4/IPv6 header on popping the last label.
If disabled, the MPLS transport network will appear as a
single hop to transit traffic.
0 - disabled / RFC 3443 [Short] Pipe Model
1 - enabled / RFC 3443 Uniform Model (default)
default_ttl - INTEGER
Default TTL value to use for MPLS packets where it cannot be
propagated from an IP header, either because one isn't present
or ip_ttl_propagate has been disabled.
Possible values: 1 - 255
Default: 255
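How the two settings interact when a label is imposed can be sketched as follows. `mpls_ttl_on_push` is a hypothetical helper; it models only the choice between the IP TTL and default_ttl and ignores any TTL decrement performed while forwarding.

```python
from typing import Optional


def mpls_ttl_on_push(ip_ttl: Optional[int],
                     ip_ttl_propagate: bool = True,
                     default_ttl: int = 255) -> int:
    """Choose the TTL for a newly imposed MPLS header (sketch).

    ip_ttl is None when the payload has no IP header.
    """
    if ip_ttl_propagate and ip_ttl is not None:
        return ip_ttl          # Uniform Model: copy the TTL from the IP header
    return default_ttl         # no IP header, or propagation disabled


print(mpls_ttl_on_push(64))
print(mpls_ttl_on_push(64, ip_ttl_propagate=False))
print(mpls_ttl_on_push(None))
```

The first call propagates the IP TTL of 64; the other two fall back to default_ttl.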
conf/<interface>/input - BOOL
Control whether packets can be input on this interface.

@@ -13,43 +13,43 @@ an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.
                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+-+--+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+---+------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.