Commit Graph

53672 Commits

Author SHA1 Message Date
William Tu
5540fbf438 bpf: clear the ip_tunnel_info.
The percpu metadata_dst might carry the stale ip_tunnel_info
and cause incorrect behavior.  When mixing tests using ipv4/ipv6
bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
ipv4's src ip addr as its ipv6 src address, because the previous
tunnel info does not clean up.  The patch zeros the fields in
ip_tunnel_info.

Signed-off-by: William Tu <u9012063@gmail.com>
Reported-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-25 09:51:54 +02:00
David S. Miller
c749fa181b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-24 23:59:11 -04:00
Eyal Birger
12bed760a7 bpf: add helper for getting xfrm states
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 22:26:58 +02:00
Eric Dumazet
091311debc net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()
rt6_remove_exception_rt() is called under rcu_read_lock() only.

We lock rt6_exception_lock a bit later, so we do not hold
rt6_exception_lock yet.

Fixes: 8a14e46f14 ("net/ipv6: Fix missing rcu dereferences on from")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 16:19:14 -04:00
Colin Ian King
95ad7544ad net/tls: remove redundant second null check on sgout
A duplicated null check on sgout is redundant as it is known to be
already true because of the identical earlier check. Remove it.
Detected by cppcheck:

net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
true.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 16:02:10 -04:00
Chris Novakovic
c04d2cb200 ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers
Distributed filesystems are most effective when the server and client
clocks are synchronised. Embedded devices often use NFS for their
root filesystem but typically do not contain an RTC, so the clocks of
the NFS server and the embedded device will be out-of-sync when the root
filesystem is mounted (and may not be synchronised until late in the
boot process).

Extend ipconfig with the ability to export IP addresses of NTP servers
it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as
follows:

 - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
   kernel command line parameters, one NTP server can be specified in
   the new "<ntp0-ip>" parameter.
 - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
   the DHCPDISCOVER message, and record the IP addresses of up to three
   NTP servers sent by the responding DHCP server in the subsequent
   DHCPOFFER message.

ipconfig will only write the NTP server IP addresses it discovers to
/proc/net/ipconfig/ntp_servers, one per line (in the order received from
the DHCP server, if DHCP autoconfiguration is used); making use of these
NTP servers is the responsibility of a user space process (e.g. an
initrd/initram script that invokes an NTP client before mounting an NFS
root filesystem).

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:42 -04:00
Chris Novakovic
4d019b3f80 ipconfig: Create /proc/net/ipconfig directory
To allow ipconfig to report IP configuration details to user space
processes without cluttering /proc/net, create a new subdirectory
/proc/net/ipconfig. All files containing IP configuration details should
be written to this directory.

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:42 -04:00
Chris Novakovic
300eec7c0a ipconfig: Correctly initialise ic_nameservers
ic_nameservers, which stores the list of name servers discovered by
ipconfig, is initialised (i.e. has all of its elements set to NONE, or
0xffffffff) by ic_nameservers_predef() in the following scenarios:

 - before the "ip=" and "nfsaddrs=" kernel command line parameters are
   parsed (in ip_auto_config_setup());
 - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
   order to clear any values that may have been set after parsing "ip="
   or "nfsaddrs=" and are no longer needed.

This means that ic_nameservers_predef() is not called when neither "ip="
nor "nfsaddrs=" is specified on the kernel command line. In this
scenario, every element in ic_nameservers remains set to 0x00000000,
which is indistinguishable from ANY and causes pnp_seq_show() to write
the following (bogus) information to /proc/net/pnp:

  #MANUAL
  nameserver 0.0.0.0
  nameserver 0.0.0.0
  nameserver 0.0.0.0

This is potentially problematic for systems that blindly link
/etc/resolv.conf to /proc/net/pnp.

Ensure that ic_nameservers is also initialised when neither "ip=" nor
"nfsaddrs=" are specified by calling ic_nameservers_predef() in
ip_auto_config(), but only when ip_auto_config_setup() was not called
earlier. This causes the following to be written to /proc/net/pnp, and
is consistent with what gets written when ipconfig is configured
manually but no name servers are specified on the kernel command line:

  #MANUAL

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:41 -04:00
Chris Novakovic
de1fa15b66 ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers
When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
server option, limiting the BOOTP server to responding with at most 2
name servers even though ipconfig in fact supports an arbitrary number
of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
3).

Only request name servers in the request packet if CONF_NAMESERVERS_MAX
is positive (to comply with [1, §3.8]), and allocate enough space in the
packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
number we can accept in response.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
    https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:41 -04:00
Chris Novakovic
4e1a8af28d ipconfig: BOOTP: Don't request IEN-116 name servers
When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
Server" [1, §3.7]), but tag 5 in the response isn't processed by
ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
to have been the intended tag to request.

This won't cause any breakage for existing users, as tag 5 responses
provided by BOOTP servers weren't being processed anyway.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
    https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:41 -04:00
Chris Novakovic
e18bdc83ae ipconfig: Tidy up reporting of name servers
Commit 5e953778a2 ("ipconfig: add
nameserver IPs to kernel-parameter ip=") adds the IP addresses of
discovered name servers to the summary printed by ipconfig when
configuration is complete. It appears the intention in ip_auto_config()
was to print the name servers on a new line (especially given the
spacing and lack of comma before "nameserver0="), but they're actually
printed on the same line as the NFS root filesystem configuration
summary:

  [    0.686186] IP-Config: Complete:
  [    0.686226]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
  [    0.686328]      host=test, domain=example.com, nis-domain=(none)
  [    0.686386]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=     nameserver0=10.0.0.1

This makes it harder to read and parse ipconfig's output. Instead, print
the name servers on a separate line:

  [    0.791250] IP-Config: Complete:
  [    0.791289]      device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
  [    0.791407]      host=test, domain=example.com, nis-domain=(none)
  [    0.791475]      bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
  [    0.791476]      nameserver0=10.0.0.1

Signed-off-by: Chris Novakovic <chris@chrisn.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:40:41 -04:00
Eric Dumazet
8c2320e84c tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets
RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive,
given they have no result.
We can omit the calls for sockets that have no md5 keys.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:20:03 -04:00
Willem de Bruijn
a6361f0ca4 packet: fix bitfield update race
Updates to the bitfields in struct packet_sock are not atomic.
Serialize these read-modify-write cycles.

Move po->running into a separate variable. Its writes are protected by
po->bind_lock (except for one startup case at packet_create). Also
replace a textual precondition warning with lockdep annotation.

All others are set only in packet_setsockopt. Serialize these
updates by holding the socket lock. Analogous to other field updates,
also hold the lock when testing whether a ring is active (pg_vec).

Fixes: 8dc4194474 ("[PACKET]: Add optional checksum computation for recvmsg")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Reported-by: Byoungyoung Lee <byoungyoung@purdue.edu>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:17:08 -04:00
Yafang Shao
a06ac0d67d Revert "net: init sk_cookie for inet socket"
This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket")

Per discussion with Eric, when update sock_net(sk)->cookie_gen, the
whole cache cache line will be invalidated, as this cache line is shared
with all cpus, that may cause great performace hit.

Bellow is the data form Eric.
"Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on
my host" when running synflood test.

Have to revert it to prevent from cache line false sharing.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 11:15:32 -04:00
Ilya Dryomov
7b4c443d13 libceph: reschedule a tick in finish_hunting()
If we go without an established session for a while, backoff delay will
climb to 30 seconds.  The keepalive timeout is also 30 seconds, so it's
pretty easily hit after a prolonged hunting for a monitor: we don't get
a chance to send out a keepalive in time, which means we never get back
a keepalive ack in time, cutting an established session and attempting
to connect to a different monitor every 30 seconds:

  [Sun Apr 1 23:37:05 2018] libceph: mon0 10.80.20.99:6789 session established
  [Sun Apr 1 23:37:36 2018] libceph: mon0 10.80.20.99:6789 session lost, hunting for new mon
  [Sun Apr 1 23:37:36 2018] libceph: mon2 10.80.20.103:6789 session established
  [Sun Apr 1 23:38:07 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new mon
  [Sun Apr 1 23:38:07 2018] libceph: mon1 10.80.20.100:6789 session established
  [Sun Apr 1 23:38:37 2018] libceph: mon1 10.80.20.100:6789 session lost, hunting for new mon
  [Sun Apr 1 23:38:37 2018] libceph: mon2 10.80.20.103:6789 session established
  [Sun Apr 1 23:39:08 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new mon

The regular keepalive interval is 10 seconds.  After ->hunting is
cleared in finish_hunting(), call __schedule_delayed() to ensure we
send out a keepalive after 10 seconds.

Cc: stable@vger.kernel.org # 4.7+
Link: http://tracker.ceph.com/issues/23537
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-04-24 10:40:21 +02:00
Ilya Dryomov
facb9f6eba libceph: un-backoff on tick when we have a authenticated session
This means that if we do some backoff, then authenticate, and are
healthy for an extended period of time, a subsequent failure won't
leave us starting our hunting sequence with a large backoff.

Mirrors ceph.git commit d466bc6e66abba9b464b0b69687cf45c9dccf383.

Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-04-24 10:39:52 +02:00
Florian Westphal
bd2bbdb497 netfilter: merge meta_bridge into nft_meta
It overcomplicates things for no reason.
nft_meta_bridge only offers retrieval of bridge port interface name.

Because of this being its own module, we had to export all nft_meta
functions, which we can then make static again (which even reduces
the size of nft_meta -- including bridge port retrieval...):

before:
   text    data     bss     dec     hex filename
   1838     832       0    2670     a6e net/bridge/netfilter/nft_meta_bridge.ko
   6147     936       1    7084    1bac net/netfilter/nft_meta.ko

after:
   5826     936       1    6763    1a6b net/netfilter/nft_meta.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:22 +02:00
Florian Westphal
99a0efbeeb netfilter: nf_tables: always use an upper set size for dynsets
nft rejects rules that lack a timeout and a size limit when they're used
to add elements from packet path.

Pick a sane upperlimit instead of rejecting outright.
The upperlimit is visible to userspace, just as if it would have been
given during set declaration.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:21 +02:00
Florian Westphal
8e1102d5a1 netfilter: nf_tables: support timeouts larger than 23 days
Marco De Benedetto says:
 I would like to use a timeout of 30 days for elements in a set but it
 seems there is a some kind of problem above 24d20h31m23s.

Fix this by using 'jiffies64' for timeout handling to get same behaviour
on 32 and 64bit systems.

nftables passes timeouts as u64 in milliseconds to the kernel,
but on kernel side we used a mixture of 'long' and jiffies conversions
rather than u64 and jiffies64.

Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1237
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:20 +02:00
Taehee Yoo
dc3c09d327 netfilter: xtables: use ipt_get_target_c instead of ipt_get_target
ipt_get_target is used to get struct xt_entry_target
and ipt_get_target_c is used to get const struct xt_entry_target.
However in the ipt_do_table, ipt_get_target is used to get
const struct xt_entry_target. it should be replaced by ipt_get_target_c.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:19 +02:00
Taehee Yoo
a1d768f1a0 netfilter: ebtables: add ebt_get_target and ebt_get_target_c
ebt_get_target similar to {ip/ip6/arp}t_get_target.
and ebt_get_target_c similar to {ip/ip6/arp}t_get_target_c.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:18 +02:00
Taehee Yoo
4351bef053 netfilter: x_tables: remove duplicate ip6t_get_target function call
In the check_target, ip6t_get_target is called twice.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:17 +02:00
Taehee Yoo
cd9a5a1580 netfilter: ebtables: remove EBT_MATCH and EBT_NOMATCH
EBT_MATCH and EBT_NOMATCH are used to change return value.
match functions(ebt_xxx.c) return false when received frame is not matched
and returns true when received frame is matched.
but, EBT_MATCH_ITERATE understands oppositely.
so, to change return value, EBT_MATCH and EBT_NOMATCH are used.
but, we can use operation '!' simply.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:16 +02:00
Taehee Yoo
e4de6ead16 netfilter: ebtables: add ebt_free_table_info function
A ebt_free_table_info frees all of chainstacks.
It similar to xt_free_table_info. this inline function
reduces code line.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:15 +02:00
Taehee Yoo
35341a6159 netfilter: add __exit mark to helper modules
There are no __exit mark in the helper modules.
because these exit functions used to be called by init function
but now that is not. so we can add __exit mark.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:14 +02:00
Thierry Du Tre
2eb0f624b7 netfilter: add NAT support for shifted portmap ranges
This is a patch proposal to support shifted ranges in portmaps.  (i.e. tcp/udp
incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100)

Currently DNAT only works for single port or identical port ranges.  (i.e.
ports 5000-5100 on WAN interface redirected to a LAN host while original
destination port is not altered) When different port ranges are configured,
either 'random' mode should be used, or else all incoming connections are
mapped onto the first port in the redirect range. (in described example
WAN:5000-5100 will all be mapped to 192.168.1.5:2000)

This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET
which uses a base port value to calculate an offset with the destination port
present in the incoming stream. That offset is then applied as index within the
redirect port range (index modulo rangewidth to handle range overflow).

In described example the base port would be 5000. An incoming stream with
destination port 5004 would result in an offset value 4 which means that the
NAT'ed stream will be using destination port 2004.

Other possibilities include deterministic mapping of larger or multiple ranges
to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port
51xx)

This patch does not change any current behavior. It just adds new NAT proto
range functionality which must be selected via the specific flag when intended
to use.

A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed
which makes this functionality immediately available.

Signed-off-by: Thierry Du Tre <thierry@dtsystems.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:12 +02:00
Phil Sutter
71cc0873e0 netfilter: nf_tables: Simplify set backend selection
Drop nft_set_type's ability to act as a container of multiple backend
implementations it chooses from. Instead consolidate the whole selection
logic in nft_select_set_ops() and the actual backend provided estimate()
callback.

This turns nf_tables_set_types into a list containing all available
backends which is traversed when selecting one matching userspace
requested criteria.

Also, this change allows to embed nft_set_ops structure into
nft_set_type and pull flags field into the latter as it's only used
during selection phase.

A crucial part of this change is to make sure the new layout respects
hash backend constraints formerly enforced by nft_hash_select_ops()
function: This is achieved by introduction of a specific estimate()
callback for nft_hash_fast_ops which returns false for key lengths != 4.
In turn, nft_hash_estimate() is changed to return false for key lengths
== 4 so it won't be chosen by accident. Also, both callbacks must return
false for unbounded sets as their size estimate depends on a known
maximum element count.

Note that this patch partially reverts commit 4f2921ca21 ("netfilter:
nf_tables: meter: pick a set backend that supports updates") by making
nft_set_ops_candidate() not explicitly look for an update callback but
make NFT_SET_EVAL a regular backend feature flag which is checked along
with the others. This way all feature requirements are checked in one
go.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:11 +02:00
Pablo Neira Ayuso
36dd1bcc07 netfilter: nf_tables: initial support for extended ACK reporting
Keep it simple to start with, just report attribute offsets that can be
useful to userspace when representating errors to users.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:10 +02:00
Pablo Neira Ayuso
cac20fcdf1 netfilter: nf_tables: simplify lookup functions
Replace the nf_tables_ prefix by nft_ and merge code into single lookup
function whenever possible. In many cases we go over the 80-chars
boundary function names, this save us ~50 LoC.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:09 +02:00
Felix Fietkau
df1e202531 netfilter: nf_flow_table: fix offloading connections with SNAT+DNAT
Pass all NAT types to the flow offload struct, otherwise parts of the
address/port pair do not get translated properly, causing connection
stalls

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:07 +02:00
Felix Fietkau
33894c367d netfilter: nf_flow_table: add missing condition for TCP state check
Avoid looking at unrelated fields in UDP packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:04 +02:00
Felix Fietkau
b6f27d322a netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen
Allow the slow path to handle the shutdown of the connection with proper
timeouts. The packet containing RST/FIN is also sent to the slow path
and the TCP conntrack module will update its state.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:02 +02:00
Felix Fietkau
da5984e510 netfilter: nf_flow_table: add support for sending flows back to the slow path
Since conntrack hasn't seen any packets from the offloaded flow in a
while, and the timeout for offloaded flows is set to an extremely long
value, we need to fix up the state before we can send a flow back to the
slow path.

For TCP, reset td_maxwin in both directions, which makes it resync its
state on the next packets.

Use the regular timeout for TCP and UDP established connections.

This allows the slow path to take over again once the offload state has
been torn down

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:59 +02:00
Felix Fietkau
ba03137f4c netfilter: nf_flow_table: in flow_offload_lookup, skip entries being deleted
Preparation for sending flows back to the slow path

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:57 +02:00
Felix Fietkau
59c466dd68 netfilter: nf_flow_table: add a new flow state for tearing down offloading
On cleanup, this will be treated differently from FLOW_OFFLOAD_DYING:

If FLOW_OFFLOAD_DYING is set, the connection is going away, so both the
offload state and the connection tracking entry will be deleted.

If FLOW_OFFLOAD_TEARDOWN is set, the connection remains alive, but
the offload state is torn down. This is useful for cases that require
more complex state tracking / timeout handling on TCP, or if the
connection has been idle for too long.

Support for sending flows back to the slow path will be implemented in
a following patch

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:54 +02:00
Felix Fietkau
6bdc3c68d9 netfilter: nf_flow_table: make flow_offload_dead inline
It is too trivial to keep as a separate exported function

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:52 +02:00
Felix Fietkau
84453a9025 netfilter: nf_flow_table: track flow tables in nf_flow_table directly
Avoids having nf_flow_table depend on nftables (useful for future
iptables backport work)

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:50 +02:00
Felix Fietkau
17857d9299 netfilter: nf_flow_table: fix priv pointer for netdev hook
The offload ip hook expects a pointer to the flowtable, not to the
rhashtable. Since the rhashtable is the first member, this is safe for
the moment, but breaks as soon as the structure layout changes

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:47 +02:00
Felix Fietkau
a268de77fa netfilter: nf_flow_table: move init code to nf_flow_table_core.c
Reduces duplication of .gc and .params in flowtable type definitions and
makes the API clearer

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:45 +02:00
Felix Fietkau
1e80380b14 netfilter: nf_flow_table: relax mixed ipv4/ipv6 flowtable dependencies
Since the offload hook code was moved, this table no longer depends on
the IPv4 and IPv6 flowtable modules

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:39 +02:00
Felix Fietkau
a908fdec3d netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table
Useful as preparation for adding iptables support for offload.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:18 +02:00
Felix Fietkau
3aeb51d7e7 netfilter: nf_flow_table: move ip header check out of nf_flow_exceeds_mtu
Allows the function to be shared with the IPv6 hook code

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:28:15 +02:00
Felix Fietkau
7d20868717 netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table
Allows some minor code sharing with the ipv6 hook code and is also
useful as preparation for adding iptables support for offload

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:27:16 +02:00
Roopa Prabhu
9c20b9372f net: fib_rules: fix l3mdev netlink attr processing
Fixes: b16fb418b1 ("net: fib_rules: add extack support")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 23:20:12 -04:00
Guillaume Nault
eb1c28c058 l2tp: check sockaddr length in pppol2tp_connect()
Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
it actually points to valid data.

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 21:10:43 -04:00
David S. Miller
77621f024d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Fix SIP conntrack with phones sending session descriptions for different
   media types but same port numbers, from Florian Westphal.

2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
   Anastasov.

3) Skip compat array allocation in ebtables if there is no entries, also
   from Florian.

4) Do not lose left/right bits when shifting marks from xt_connmark, from
   Jack Ma.

5) Silence false positive memleak in conntrack extensions, from Cong Wang.

6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.

7) Cannot kfree rule that is already in list in nf_tables, switch order
   so this error handling is not required, from Florian Westphal.

8) Release set name in error path, from Florian.

9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.

10) NAT chain and extensions depend on NF_TABLES.

11) Out of bound access when renaming chains, from Taehee Yoo.

12) Incorrect casting in xt_connmark leads to wrong bitshifting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:22:24 -04:00
David Ahern
8a14e46f14 net/ipv6: Fix missing rcu dereferences on from
kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
   sure it is always called with rcu lock held.

2. change rt6_do_redirect to take a reference on 'from' when accessed
   the first time so it can be used the sceond time outside of the lock

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:55 -04:00
David Ahern
c3c14da028 net/ipv6: add rcu locking to ip6_negative_advice
syzbot reported a suspicious rcu_dereference_check:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
  dst_negative_advice include/net/sock.h:1786 [inline]
  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911

Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:54 -04:00
Alexander Aring
f18fa5de5b net: ieee802154: 6lowpan: fix frag reassembly
This patch initialize stack variables which are used in
frag_lowpan_compare_key to zero. In my case there are padding bytes in the
structures ieee802154_addr as well in frag_lowpan_compare_key. Otherwise
the key variable contains random bytes. The result is that a compare of
two keys by memcmp works incorrect.

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reported-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
2018-04-23 20:56:24 +02:00
Eric Dumazet
aa8f877849 ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
KMSAN reported use of uninit-value that I tracked to lack
of proper size check on RTA_TABLE attribute.

I also believe RTA_PREFSRC lacks a similar check.

Fixes: 86872cb579 ("[IPv6] route: FIB6 configuration using struct fib6_config")
Fixes: c3968a857a ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 12:01:21 -04:00