57846 Commits

Author SHA1 Message Date
David Ahern
c55c89884d ipv6: Handle all fib6_nh in a nexthop in __ip6_route_redirect
Add a hook in __ip6_route_redirect to handle a nexthop struct in a
fib6_info. Use nexthop_for_each_fib6_nh and fib6_nh_redirect_match
to call ip6_redirect_nh_match for each fib6_nh looking for a match.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
e659ba31d8 ipv6: Handle all fib6_nh in a nexthop in exception handling
Add a hook in rt6_flush_exceptions, rt6_remove_exception_rt,
rt6_update_exception_stamp_rt, and rt6_age_exceptions to handle
nexthop struct in a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
2c170e0753 ipv6: Handle all fib6_nh in a nexthop in fib6_info_uses_dev
Add a hook in fib6_info_uses_dev to handle nexthop struct in a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
a1b7a1f06c ipv6: Handle all fib6_nh in a nexthop in rt6_nlmsg_size
Add a hook in rt6_nlmsg_size to handle nexthop struct in a fib6_info.
rt6_nh_nlmsg_size is used to sum the space needed for all nexthops in
the fib entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
17a5984eee ipv6: Handle all fib6_nh in a nexthop in __find_rr_leaf
Add a hook in __find_rr_leaf to handle nexthop struct in a fib6_info.
nexthop_for_each_fib6_nh is used to walk each fib6_nh in a nexthop and
call find_match. On a match, use the fib6_nh saved in the callback arg
to set up fib6_result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
962b680383 ipv6: Handle all fib6_nh in a nexthop in rt6_device_match
Add a hook in rt6_device_match to handle nexthop struct in a fib6_info.
The new rt6_nh_dev_match uses nexthop_for_each_fib6_nh to walk each
fib6_nh in a nexthop and call __rt6_device_match. On match,
rt6_nh_dev_match returns the fib6_nh and rt6_device_match uses it to
set up fib6_result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
2ab75bfb17 ipv6: Handle all fib6_nh in a nexthop in fib6_drop_pcpu_from
Use nexthop_for_each_fib6_nh to walk all fib6_nh in a nexthop when
dropping 'from' reference in pcpu routes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
f88c9aa12f nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop struct
IPv6 has traditionally had a single fib6_nh per fib6_info. With
nexthops we can have multiple fib6_nh associated with a fib6_info.
Add a nexthop helper to invoke a callback for each fib6_nh in a
'struct nexthop'. If the callback returns non-0, the loop is
stopped and the return value passed to the caller.
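
The early-exit convention described above can be sketched in plain userspace C. This is only an illustration: `struct nh_entry`, `for_each_nh`, `struct match_arg` and `match_ifindex` are hypothetical stand-ins, not the kernel's `struct nexthop` or the helper added by this commit.

```c
#include <stddef.h>

/* Hypothetical stand-in for one fib6_nh: only an interface index. */
struct nh_entry {
	int ifindex;
};

/* Callback type: return non-zero to stop the walk. */
typedef int (*nh_cb)(struct nh_entry *nh, void *arg);

/* Invoke cb for each entry. Stop at the first non-zero return and
 * hand that value back to the caller. Returns 0 if the callback
 * never asked to stop.
 */
static int for_each_nh(struct nh_entry *nhs, size_t count, nh_cb cb, void *arg)
{
	size_t i;

	for (i = 0; i < count; i++) {
		int rc = cb(&nhs[i], arg);

		if (rc)
			return rc;
	}
	return 0;
}

/* Callback argument: what to look for, and where the match is saved. */
struct match_arg {
	int ifindex;
	struct nh_entry *match;
};

static int match_ifindex(struct nh_entry *nh, void *arg)
{
	struct match_arg *m = arg;

	if (nh->ifindex != m->ifindex)
		return 0;	/* keep walking */
	m->match = nh;		/* save the entry for the caller */
	return 1;		/* non-zero stops the loop */
}
```

A caller would fill in a `struct match_arg`, pass it to `for_each_nh()` together with `match_ifindex`, and read `match` back when the return value is non-zero.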

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
YueHaibing
948622f984 tcp: Make tcp_fastopen_alloc_ctx static
Fix sparse warning:

net/ipv4/tcp_fastopen.c:75:29: warning:
 symbol 'tcp_fastopen_alloc_ctx' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:38:50 -07:00
Jozsef Kadlecsik
fe03d47456 Update my email address
It's better to use my kadlec@netfilter.org email address in
the source code. I might not be able to use
kadlec@blackhole.kfki.hu in the future.

Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 13:00:24 +02:00
Stefano Brivio
11921796f4 ipset: Fix memory accounting for hash types on resize
If a fresh array block is allocated during resize, the current in-memory
set size should be increased by the size of the block, not replaced by it.

Before the fix, adding entries to a hash set type, leading to a table
resize, caused an inconsistent memory size to be reported. This becomes
more obvious when swapping sets with similar sizes:

  # cat hash_ip_size.sh
  #!/bin/sh
  FAIL_RETRIES=10

  tries=0
  while [ ${tries} -lt ${FAIL_RETRIES} ]; do
  	ipset create t1 hash:ip
  	for i in `seq 1 4345`; do
  		ipset add t1 1.2.$((i / 255)).$((i % 255))
  	done
  	t1_init="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset create t2 hash:ip
  	for i in `seq 1 4360`; do
  		ipset add t2 1.2.$((i / 255)).$((i % 255))
  	done
  	t2_init="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset swap t1 t2
  	t1_swap="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')"
  	t2_swap="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset destroy t1
  	ipset destroy t2
  	tries=$((tries + 1))

  	if [ ${t1_init} -lt 10000 ] || [ ${t2_init} -lt 10000 ]; then
  		echo "FAIL after ${tries} tries:"
  		echo "T1 size ${t1_init}, after swap ${t1_swap}"
  		echo "T2 size ${t2_init}, after swap ${t2_swap}"
  		exit 1
  	fi
  done
  echo "PASS"
  # echo -n 'func hash_ip4_resize +p' > /sys/kernel/debug/dynamic_debug/control
  # ./hash_ip_size.sh
  [ 2035.018673] attempt to resize set t1 from 10 to 11, t 00000000fe6551fa
  [ 2035.078583] set t1 resized from 10 (00000000fe6551fa) to 11 (00000000172a0163)
  [ 2035.080353] Table destroy by resize 00000000fe6551fa
  FAIL after 4 tries:
  T1 size 9064, after swap 71128
  T2 size 71128, after swap 9064
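
The difference between replacing and increasing the accounted size can be shown with a minimal sketch. The structure and function names below (`struct set_stats`, `account_block_buggy`, `account_block_fixed`) are hypothetical and do not reproduce the actual ipset code; they only model the counter behaviour described in this commit message.

```c
#include <stddef.h>

/* Hypothetical set bookkeeping; only the memory counter matters here. */
struct set_stats {
	size_t ext_size;	/* bytes currently accounted to the set */
};

/* Buggy variant: the counter is overwritten by the new block size. */
static void account_block_buggy(struct set_stats *s, size_t block_size)
{
	s->ext_size = block_size;
}

/* Fixed variant: the new block is added to what is already counted. */
static void account_block_fixed(struct set_stats *s, size_t block_size)
{
	s->ext_size += block_size;
}
```

With the buggy variant, whatever was accounted before the resize is lost, which would explain a reported size that is smaller than expected.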

Reported-by: NOYB <JunkYardMail1@Frontier.com>
Fixes: 9e41f26a50 ("netfilter: ipset: Count non-static extension memory for userspace")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:59:23 +02:00
Jozsef Kadlecsik
b1732e1638 netfilter: ipset: Fix error path in set_target_v3_checkentry()
Fix error path and release the references properly.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:58:26 +02:00
Jozsef Kadlecsik
13c6ba1f85 netfilter: ipset: Fix the last missing check of nla_parse_deprecated()
In dump_init() the outdated comment was incorrect and we had a missing
validation check of nla_parse_deprecated().

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:50:00 +02:00
Aditya Pakki
f4f5748bfe netfilter: ipset: fix a missing check of nla_parse
When nla_parse fails, we should not use the results (the first
argument). The fix checks if it fails, and if so, returns its error code
upstream.

Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:47:37 +02:00
Florent Fourcot
f0cb839084 netfilter: ipset: merge uadd and udel functions
Both functions use exactly the same code, except for the command value
passed to the call_ad function.

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:42:56 +02:00
Florent Fourcot
24c509b2e2 netfilter: ipset: remove useless memset() calls
One of the memset calls is buggy: it does not erase the full array, only a
pointer-sized part of it. Moreover, after a check, the first step of
nla_parse_nested/nla_parse is to erase the tb array as well. We can safely
remove both calls.
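
The class of bug mentioned here, clearing only a pointer's worth of bytes, usually comes from applying sizeof to an array that has already decayed to a pointer. The following is a generic sketch of that pitfall, not the removed ipset code; `TB_LEN`, `size_seen_through_pointer` and `clear_table` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define TB_LEN 8	/* hypothetical table length */

/* What sizeof() yields once the array has decayed to a pointer:
 * the size of one pointer, regardless of how long the array is.
 */
static size_t size_seen_through_pointer(const void **tb)
{
	return sizeof(tb);
}

/* Clear the whole table by passing the element count explicitly. */
static void clear_table(const void **tb, size_t n)
{
	memset(tb, 0, n * sizeof(*tb));
}
```

A memset() sized with the first expression would leave all but the first slot of an eight-entry table untouched.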

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:29:31 +02:00
Eric Dumazet
c67b85558f ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
When autoflowlabel is in effect, the flowlabel is derived from the
non-zero skb->hash returned by skb_get_hash_flowi6().

If skb->hash is zero, a flow dissection is performed.

Since all TCP skbs sent from the ESTABLISHED state inherit their
skb->hash from sk->sk_txhash, we had better keep a copy
of sk->sk_txhash in the TIME_WAIT socket.

After this patch, ACK or RST packets sent on behalf of
a TIME_WAIT socket have the flowlabel that was previously
used by the flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:10:19 -07:00
Yuchung Cheng
fcc2202a9d tcp: fix undo spurious SYNACK in passive Fast Open
Commit 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK
retransmit") may cause tcp_fastretrans_alert() to warn about pending
retransmission in Open state. This is triggered when the Fast Open
server both sends data and has spurious SYNACK retransmission during
the handshake, and the data packets were lost or reordered.

The root cause is a bit complicated:

(1) Upon receiving SYN-data: a full socket is created with
    snd_una = ISN + 1 by tcp_create_openreq_child()

(2) On SYNACK timeout the server/sender enters CA_Loss state.

(3) Upon receiving the final ACK to complete the handshake, sender
    does not mark FLAG_SND_UNA_ADVANCED since (1)

    Sender then calls tcp_process_loss() since state is CA_Loss by (2)

(4) tcp_process_loss() does not invoke undo operations but instead
    marks REXMIT_LOST to force retransmission

(5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
    changes state to CA_Open but has positive tp->retrans_out

(6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()

The step that goes wrong is (4) where the undo operation should
have been invoked because the ACK successfully acknowledged the
SYN sequence. This fixes that by specifically checking undo
when the SYN-ACK sequence is acknowledged. Then after
tcp_process_loss() the state is further adjusted
in tcp_fastretrans_alert() to avoid triggering the warning in (6).

Fixes: 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:04:11 -07:00
Matteo Croce
c1a9d65954 mpls: fix af_mpls dependencies
MPLS routing code relies on sysctl to work, so let it select PROC_SYSCTL.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:57:24 -07:00
Hariprasad Kelam
de47c5d8e1 af_key: make use of BUG_ON macro
Fix the warnings below, reported by coccicheck:

net/key/af_key.c:932:2-5: WARNING: Use BUG_ON instead of if condition
followed by BUG.
net/key/af_key.c:948:2-5: WARNING: Use BUG_ON instead of if condition
followed by BUG.

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:56:42 -07:00
David S. Miller
62f42a114b Merge tag 'linux-can-fixes-for-5.2-20190607' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:

====================
pull-request: can 2019-06-07

This is a pull request of 9 patches for net/master.

The first patch is by Alexander Dahl and removes a duplicate menu entry from
the Kconfig. The next patch by Joakim Zhang fixes the timeout in the flexcan
driver when setting small bit rates. Anssi Hannula's patch for the xilinx_can
driver fixes the bittiming_const for CAN FD core. The two patches by Sean
Nyekjaer bring mcp25625 to the existing mcp251x driver. The patch by Eugen
Hristev implements an errata for the m_can driver. YueHaibing's patch fixes the
error handling in can_init(). The patch by Fabio Estevam for the flexcan
driver removes an unneeded registration message during flexcan_probe(). And the
last patch is by Willem de Bruijn and adds the missing purge of the socket
error queue on sock destruct.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:44:01 -07:00
Eric Dumazet
392096736a ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
syzbot found a crash in tcp_v6_send_reset() caused by my latest
change.

The problem is that if an skb has been queued to the socket prequeue,
skb_dst(skb)->dev can no longer point to the device.

Fortunately in this case the socket pointer is not NULL.

A similar issue has been fixed in commit 0f85feae6b ("tcp: fix
more NULL deref after prequeue changes"), I should have known better.

Fixes: 323a53c412 ("ipv6: tcp: enable flowlabel reflection in some RST packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:42:23 -07:00
Sebastian Andrzej Siewior
6dcdd884e2 net: hwbm: Make the hwbm_pool lock a mutex
Based on review, `lock' is only acquired in hwbm_pool_add() which is
invoked via ->probe(), ->resume() and ->ndo_change_mtu(). Based on this
the lock can become a mutex and there is no need to disable interrupts
during the procedure.
Now that the lock is a mutex, hwbm_pool_add() no longer invokes
hwbm_pool_refill() in an atomic context so we can pass GFP_KERNEL to
hwbm_pool_refill() and remove the `gfp' argument from hwbm_pool_add().

Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:10 -07:00
Sebastian Andrzej Siewior
92dcabd7a0 net: Don't disable interrupts in __netdev_alloc_skb()
__netdev_alloc_skb() can be used from any context and is used by NAPI
and non-NAPI drivers. Non-NAPI drivers use it in interrupt context and
NAPI drivers use it during initial allocation (->ndo_open() or
->ndo_change_mtu()). Some NAPI drivers share the same function for the
initial allocation and the allocation in their NAPI callback.

The interrupts are disabled in order to ensure locked access from every
context to `netdev_alloc_cache'.

Let __netdev_alloc_skb() check if interrupts are disabled. If they are, use
`netdev_alloc_cache'. Otherwise disable BH and use `napi_alloc_cache.page'.
The IRQ check is cheaper compared to disabling & enabling interrupts and
memory allocation with disabled interrupts does not work on -RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:09 -07:00
Sebastian Andrzej Siewior
7ba7aeabba net: Don't disable interrupts in napi_alloc_frag()
netdev_alloc_frag() can be used from any context and is used by NAPI
and non-NAPI drivers. Non-NAPI drivers use it in interrupt context
and NAPI drivers use it during initial allocation (->ndo_open() or
->ndo_change_mtu()). Some NAPI drivers share the same function for the
initial allocation and the allocation in their NAPI callback.

The interrupts are disabled in order to ensure locked access from every
context to `netdev_alloc_cache'.

Let netdev_alloc_frag() check if interrupts are disabled. If they are,
use `netdev_alloc_cache' otherwise disable BH and invoke
__napi_alloc_frag() for the allocation. The IRQ check is cheaper
compared to disabling & enabling interrupts and memory allocation with
disabled interrupts does not work on -RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:09 -07:00
George Wilkie
2f3f7d1fa0 mpls: fix warning with multi-label encap
If you configure a route with multiple labels, e.g.
  ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4
A warning is logged:
  kernel: [  130.561819] netlink: 'ip': attribute type 1 has an invalid
  length.

This happens because mpls_iptunnel_policy has set the type of
MPLS_IPTUNNEL_DST to fixed size NLA_U32.
Change it to a minimum size.
nla_get_labels() does the remaining validation.

Fixes: e3e4712ec0 ("mpls: ip tunnel support")
Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:26:34 -07:00
Eric Dumazet
65a3c497c0 ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
Before taking a refcount, make sure the object is not already
scheduled for deletion.

Same fix is needed in ipv6_flowlabel_opt()
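
The pattern of taking a reference only while the count is still non-zero can be sketched in userspace with C11 atomics. This is an analogue for illustration, not the kernel's atomic_inc_not_zero() implementation; `ref_get_unless_zero` is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count is still non-zero.
 * Returns true on success, false if the count has already dropped
 * to zero, meaning the object is on its way to being freed.
 */
static bool ref_get_unless_zero(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A plain increment would happily move the count from 0 to 1 and hand out a pointer to an object that is about to be released; the compare-and-exchange loop refuses in that case.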

Fixes: 18367681a1 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:07:14 -07:00
Enrico Weigelt
c3fee640bc net: ipv4: fib_semantics: fix uninitialized variable
fix an uninitialized variable:

  CC      net/ipv4/fib_semantics.o
net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw':
net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (!tbl || err) {
            ^~

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 12:47:30 -07:00
Vladimir Oltean
f3097be21b net: dsa: sja1105: Add a state machine for RX timestamping
Meta frame reception relies on the hardware keeping its promise that it
will send no other traffic towards the CPU port between a link-local
frame and a meta frame.  Otherwise there is no other way to associate
the meta frame with the link-local frame it's holding a timestamp of.
The receive function is made stateful, and buffers a timestampable frame
until its meta frame arrives, then merges the two, drops the meta and
releases the link-local frame up the stack.
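
The buffering behaviour described above can be modelled as a two-state receive function. The sketch below is a deliberately simplified userspace model: `struct frame`, `struct rx_state` and `rx_one` are hypothetical, the real driver operates on sk_buffs, and handling of a second timestampable frame arriving before its meta frame is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified frame. */
struct frame {
	bool is_meta;		/* meta frame carrying a timestamp */
	bool needs_tstamp;	/* link-local frame awaiting a timestamp */
	uint64_t tstamp;	/* filled in from the meta frame */
};

struct rx_state {
	struct frame *stashed;	/* timestampable frame waiting for its meta */
};

/* Feed one received frame in. Returns the frame to pass up the
 * stack, or NULL if the frame was buffered or consumed.
 */
static struct frame *rx_one(struct rx_state *st, struct frame *f)
{
	if (f->is_meta) {
		struct frame *ready = st->stashed;

		if (!ready)
			return NULL;	/* unexpected meta frame: drop it */
		ready->tstamp = f->tstamp;
		st->stashed = NULL;
		return ready;		/* meta is dropped, frame released */
	}
	if (f->needs_tstamp) {
		st->stashed = f;	/* hold until the meta frame arrives */
		return NULL;
	}
	return f;			/* regular traffic passes straight through */
}
```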

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
e53e18a6fe net: dsa: sja1105: Receive and decode meta frames
This adds support in the tagger for understanding the source port and
switch id of meta frames.  Their timestamp is also extracted but not
used yet - this needs to be done in a state machine that modifies the
previously received timestampable frame - will be added in a follow-up
patch.

Also take the opportunity to:
- Remove a comment in sja1105_filter made obsolete by e8d67fa569
  ("net: dsa: sja1105: Don't store frame type in skb->cb")
- Reorder the checks in sja1105_filter to optimize for the most likely
  scenario first: regular traffic.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
79fa706139 net: dsa: sja1105: Make sja1105_is_link_local not match meta frames
Although meta frames are configured to be sent at SJA1105_META_DMAC
(01-80-C2-00-00-0E) which is a multicast MAC address that would also be
trapped by the switch to the CPU, were it to receive it on a front-panel
port, meta frames are conceptually not link-local frames, they only
carry their RX timestamps.

The choice of sending meta frames at a multicast DMAC is a pragmatic
one, to avoid installing an extra entry to the DSA master port's
multicast MAC filter.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
d3f9b90bf1 net: dsa: sja1105: Build a minimal understanding of meta frames
Meta frames are sent on the CPU port by the switch if RX timestamping is
enabled. They contain a partial timestamp of the previous frame.

They are Ethernet frames with the Ethernet header constructed out of:

- SJA1105_META_DMAC
- SJA1105_META_SMAC
- ETH_P_SJA1105_META

The Ethernet payload will be decoded in a follow-up patch.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
42824463d3 net: dsa: sja1105: Limit use of incl_srcpt to bridge+vlan mode
The incl_srcpt setting makes the switch mangle the destination MACs of
multicast frames trapped to the CPU - a primitive tagging mechanism that
works even when we cannot use the 802.1Q software features.

The downside is that the two multicast MAC addresses that the switch
traps for L2 PTP (01-80-C2-00-00-0E and 01-1B-19-00-00-00) quickly turn
into a lot more, as the switch encodes the source port and switch id
into bytes 3 and 4 of the MAC. The resulting range of MAC addresses
would need to be installed manually into the DSA master port's multicast
MAC filter, and even then, most devices might not have a large enough
MAC filtering table.

As a result, only limit use of incl_srcpt to when it's strictly
necessary: when under a VLAN filtering bridge.  This fixes PTP in
non-bridged mode (standalone ports). Otherwise, PTP frames, as well as
metadata follow-up frames holding RX timestamps won't be received
because they will be blocked by the master port's MAC filter.
Linuxptp doesn't help, because it only requests the addition of the
unmodified PTP MACs to the multicast filter.
This issue is not seen in bridged mode because the master port is put in
promiscuous mode when the slave ports are enslaved to a bridge.
Therefore, there is no downside to having the incl_srcpt mechanism
active there.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
d461933638 net: dsa: tag_8021q: Create helper function for removing VLAN header
This removes the existing implementation from tag_sja1105, which was
partially incorrect (it was not changing the MAC header offset, thereby
leaving it to point 4 bytes earlier than it should have).

This overwrites the VLAN tag by moving the Ethernet source and
destination MACs 4 bytes to the right. Then skb->data (assumed to be
pointing immediately after the EtherType) is temporarily pushed to the
beginning of the new Ethernet header, the new Ethernet header offset and
length are recorded, then skb->data is moved back to where it was.
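
The byte movement described above can be sketched on a raw buffer. This only illustrates the overwrite of the tag; it leaves out the skb->data and MAC header offset bookkeeping, and `strip_vlan_tag` is a hypothetical name rather than the helper added by this commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN   6
#define VLAN_HLEN 4	/* TPID (2 bytes) + TCI (2 bytes) */

/* Strip the 802.1Q tag from a tagged Ethernet frame in place.
 * The two MAC addresses (12 bytes) are moved 4 bytes to the right,
 * overwriting the tag. Returns a pointer to the start of the new,
 * untagged frame, which is VLAN_HLEN bytes shorter.
 */
static uint8_t *strip_vlan_tag(uint8_t *frame)
{
	/* Source and destination overlap, so memmove(), not memcpy(). */
	memmove(frame + VLAN_HLEN, frame, 2 * MAC_LEN);
	return frame + VLAN_HLEN;
}
```

After the move, the EtherType that followed the tag sits directly behind the source MAC, so the bytes starting at the returned pointer form an ordinary untagged Ethernet header.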

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Vladimir Oltean
5e3f847a02 net: dsa: Add teardown callback for drivers
This is helpful for e.g. draining per-driver (not per-port) tagger
queues.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Vladimir Oltean
146d442c23 net: dsa: Keep a pointer to the skb clone for TX timestamping
For drivers that use deferred_xmit for PTP frames (such as sja1105),
there is no need to perform matching between PTP frames and their egress
timestamps, since the sending process can be serialized.

In that case, it makes sense to have the pointer to the skb clone that
DSA made directly in the skb->cb. It will be used for pushing the egress
timestamp back in the application socket's error queue.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Linus Torvalds
9331b6740f Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull yet more SPDX updates from Greg KH:
 "Another round of SPDX header file fixes for 5.2-rc4

  These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
  added, based on the text in the files. We are slowly chipping away at
  the 700+ different ways people tried to write the license text. All of
  these were reviewed on the spdx mailing list by a number of different
  people.

  We now have over 60% of the kernel files covered with SPDX tags:
	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
	Files checked:            64533
	Files with SPDX:          40392
	Files with errors:            0

  I think the majority of the "easy" fixups are now done, it's now the
  start of the longer-tail of crazy variants to wade through"

* tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
  ...
2019-06-08 12:52:42 -07:00
David S. Miller
38e406f600 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-06-07

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
   32-bits for alu32 ops, from Björn and Luke with selftests covering all
   relevant BPF alu ops from Björn and Jiong.

2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
   __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
   that skb->data is pointing to transport header, from Martin.

3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
   workqueue, and a missing restore of sk_write_space when psock gets dropped,
   from Jakub and John.

4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
   breaks standard applications like DNS if reverse NAT is not performed upon
   receive, from Daniel.

5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
   fails to verify that the length of the tuple is long enough, from Lorenz.

6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
   {un,}successful probe) as that is expected to be propagated as an fd to
   load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
   from Michal.

7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.

8) Minor misc fixes in docs, samples and selftests, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 14:46:47 -07:00
Willem de Bruijn
fd704bd5ee can: purge socket error queue on sock destruct
CAN supports software tx timestamps as of the below commit. Purge
any queued timestamp packets on socket destroy.

Fixes: 51f31cabe3 ("ip: support for TX timestamps on UDP and RAW sockets")
Reported-by: syzbot+a90604060cb40f5bdd16@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07 23:03:54 +02:00
YueHaibing
c5a3aed1cd can: af_can: Fix error path of can_init()
This patch adds an error path for can_init() to avoid a possible crash if an
error occurs.

Fixes: 0d66548a10 ("[CAN]: Add PF_CAN core module")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07 23:03:54 +02:00
David S. Miller
a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Linus Torvalds
1e1d926369 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

 2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

 3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

 4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

 5) Memory leak on unload in RDS, from Zhu Yanjun.

 6) sctp_process_init leak, from Neil Horman.

 7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
  pktgen: do not sleep with the thread lock held.
  net: mvpp2: Use strscpy to handle stat strings
  net: rds: fix memory leak in rds_ib_flush_mr_pool
  ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
  ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
  Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
  net: aquantia: fix wol configuration not applied sometimes
  ethtool: fix potential userspace buffer overflow
  Fix memory leak in sctp_process_init
  net: rds: fix memory leak when unload rds_rdma
  ipv6: fix the check before getting the cookie in rt6_get_cookie
  ipv4: not do cache for local delivery if bc_forwarding is enabled
  s390/qeth: handle error when updating TX queue count
  s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
  s390/qeth: check dst entry before use
  s390/qeth: handle limited IPv4 broadcast in L3 TX path
  net: fix indirect calls helpers for ptype list hooks.
  net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
  udp: only choose unbound UDP socket for multicast when not in a VRF
  net/tls: replace the sleeping lock around RX resync with a bit lock
  ...
2019-06-07 09:29:14 -07:00
Guillaume Nault
8a3dca6325 netfilter: ipv6: nf_defrag: accept duplicate fragments again
When fixing the skb leak introduced by the conversion to rbtree, I
forgot about the special case of duplicate fragments. The condition
under the 'insert_error' label isn't effective anymore as
nf_ct_frg6_gather() doesn't override the returned value anymore. So
duplicate fragments now get NF_DROP verdict.

To accept duplicate fragments again, handle them specially as soon as
inet_frag_queue_insert() reports them. Return -EINPROGRESS which will
translate to NF_STOLEN verdict, like any accepted fragment. However,
such packets don't carry any new information and aren't queued, so we
just drop them immediately.

Fixes: a0d56cb911 ("netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-07 14:49:01 +02:00
Roman Gushchin
4ecabd55c9 bpf: allow CGROUP_SKB programs to use bpf_skb_cgroup_id() helper
Currently bpf_skb_cgroup_id() is not supported for CGROUP_SKB
programs. An attempt to load such a program generates an error
like this:

    libbpf:
    0: (b7) r6 = 0
    ...
    9: (85) call bpf_skb_cgroup_id#79
    unknown func bpf_skb_cgroup_id#79

There are no particular reasons for denying it, and we have some
use cases where it might be useful.

So let's add it to the list of allowed helpers.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-07 01:56:52 +02:00
Daniel Borkmann
983695fa67 bpf: fix unconnected udp hooks
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparently to the application,
this needs reverse translation on the recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.
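
The state tracking described above can be illustrated with a small
userspace model in C. It is a sketch under assumptions, not BPF code: the
fixed-size table with round-robin replacement stands in for the LRU map,
and struct addr, rev_nat, on_sendmsg() and on_recvmsg() are hypothetical
names introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct addr {
	uint32_t ip;
	uint16_t port;
};

/* One reverse-NAT record: which frontend a socket originally targeted. */
struct rev_entry {
	uint64_t cookie;	/* socket cookie identifying the sender */
	struct addr backend;	/* address the traffic was redirected to */
	struct addr frontend;	/* original service address */
	int used;
};

#define REV_SLOTS 16
static struct rev_entry rev_nat[REV_SLOTS];	/* stand-in for the LRU map */
static size_t rev_next;

static int addr_eq(struct addr a, struct addr b)
{
	return a.ip == b.ip && a.port == b.port;
}

/* sendmsg side: redirect frontend -> backend and remember the mapping. */
static void on_sendmsg(uint64_t cookie, struct addr *dst,
		       struct addr frontend, struct addr backend)
{
	struct rev_entry *e;

	if (!addr_eq(*dst, frontend))
		return;
	e = &rev_nat[rev_next++ % REV_SLOTS];	/* oldest slot is overwritten */
	e->cookie = cookie;
	e->backend = backend;
	e->frontend = frontend;
	e->used = 1;
	*dst = backend;
}

/* recvmsg side: replace the backend source with the original frontend. */
static void on_recvmsg(uint64_t cookie, struct addr *src)
{
	for (size_t i = 0; i < REV_SLOTS; i++) {
		struct rev_entry *e = &rev_nat[i];

		if (e->used && e->cookie == cookie &&
		    addr_eq(e->backend, *src)) {
			*src = e->frontend;
			return;
		}
	}
}
```

With such a mapping in place, a reply from the backend appears to the
application as coming from the service address it originally sent to,
which is what the msg_name checks in nslookup and dig expect.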

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And at the packet level, tcpdump shows that we're using the backend
server when talking via the 147.75.207.20{7,8} frontend:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have the same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present, which can be the case
in both connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we act
in a similar way and call into the BPF program whenever a non-NULL
msg->msg_name was passed, independent of sk->sk_state being TCP_ESTABLISHED
or not. Note that for the TCP case, msg->msg_name is ignored in the regular
recvmsg path and is therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
Dirk van der Merwe
b9727d7f95 net/tls: export TLS per skb encryption
While offloading TLS connections, drivers need to handle the case where
out of order packets need to be transmitted.

Other drivers obtain the entire TLS record for the specific skb to
provide as context to hardware for encryption. However, other designs
may also want to keep the hardware state intact and perform the
out of order encryption entirely on the host.

To achieve this, export the already existing software encryption
fallback path so drivers could access this.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 14:13:40 -07:00
Linus Torvalds
459aa077a2 Merge tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
 "These are mostly stable bugfixes found during testing, many during the
  recent NFS bake-a-thon.

  Stable bugfixes:
   - SUNRPC: Fix regression in umount of a secure mount
   - SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
   - NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
   - NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled

  Other bugfixes:
   - xprtrdma: Use struct_size() in kzalloc()"

* tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
  NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
  SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
  SUNRPC fix regression in umount of a secure mount
  xprtrdma: Use struct_size() in kzalloc()
2019-06-06 12:19:37 -07:00
Paolo Abeni
720f1de402 pktgen: do not sleep with the thread lock held.
Currently, the process issuing a "start" command on the pktgen procfs
interface acquires the pktgen thread lock and never releases it until
all pktgen threads have completed. This can indefinitely block any
other pktgen command and any (even unrelated) netdevice removal, as
the pktgen netdev notifier acquires the same lock.

The issue is demonstrated by the following script, reported by Matteo:

ip -b - <<'EOF'
	link add type dummy
	link add type veth
	link set dummy0 up
EOF
modprobe pktgen
echo reset >/proc/net/pktgen/pgctrl
{
	echo rem_device_all
	echo add_device dummy0
} >/proc/net/pktgen/kpktgend_0
echo count 0 >/proc/net/pktgen/dummy0
echo start >/proc/net/pktgen/pgctrl &
sleep 1
rmmod veth

Fix the above by releasing the thread lock around the sleep call.

Additionally, we must prevent racing with a forceful rmmod, as the
thread lock no longer protects against it. Instead, acquire a self-reference
before waiting for any thread. As a side effect, running

rmmod pktgen

while some thread is running now fails with a "module in use" error;
before this patch, such a command hung indefinitely.
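
The locking pattern described above, dropping the lock around the sleep
so that other commands can make progress, can be sketched in userspace C
with pthreads. This is a single-threaded model under assumptions, not the
pktgen code: thread_lock, running_workers, wait_for_workers() and the
nap hook are hypothetical names chosen for the example.

```c
#include <pthread.h>

static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;
static int running_workers;	/* protected by thread_lock */
static int lock_was_free;	/* set if the lock could be taken mid-wait */

/*
 * Wait until no worker is running. The lock is held only while the
 * counter is inspected; nap() stands in for the sleep and is always
 * called with the lock released.
 */
static void wait_for_workers(void (*nap)(void))
{
	pthread_mutex_lock(&thread_lock);
	while (running_workers > 0) {
		pthread_mutex_unlock(&thread_lock);
		nap();
		pthread_mutex_lock(&thread_lock);
	}
	pthread_mutex_unlock(&thread_lock);
}

/*
 * Stand-in for "another command" arriving during the sleep: it can take
 * the lock because the waiter released it, and here it also retires one
 * worker so that the wait eventually ends.
 */
static void other_command(void)
{
	if (pthread_mutex_trylock(&thread_lock) == 0) {
		lock_was_free = 1;
		running_workers--;
		pthread_mutex_unlock(&thread_lock);
	}
}
```

Had the waiter kept the lock across nap(), the trylock in other_command()
would fail every time, which mirrors the blocked commands and netdevice
removals described above.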

Note: the issue predates the commit reported in the fixes tag, but
this fix can't be applied before the mentioned commit.

v1 -> v2:
 - no need to check for thread existence after flipping the lock,
   pktgen threads are freed only at net exit time

Fixes: 6146e6a43b ("[PKTGEN]: Removes thread_{un,}lock() macros.")
Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 11:31:35 -07:00
Colin Ian King
4daa95af7f ipv6: fix spelling mistake: "wtih" -> "with"
There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:37:28 -07:00
Zhu Yanjun
85cb928787 net: rds: fix memory leak in rds_ib_flush_mr_pool
When the following tests last for several hours, the problem will occur.

Server:
    rds-stress -r 1.1.1.16 -D 1M
Client:
    rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30

The following output is then observed.

"
Starting up....
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu %
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
"
From vmcore, we can find that clean_list is NULL.

From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
 rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                         int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.

In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
        *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);

/* more than one entry in llist nodes */
if (clean_nodes->next)
        llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, the llist_entry() assignment is skipped, yet only
clean_nodes->next, not clean_nodes itself, is added to clean_list.
So the head node clean_nodes is discarded and can never be used again.
The workqueue runs periodically, so more and more nodes are
discarded. Eventually clean_list becomes NULL,
and the problem above occurs.
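
The leak can be reproduced with a simplified list model in plain C. This
is a sketch under assumptions: struct node, struct pool, pool_add_batch()
and requeue() are hypothetical stand-ins rather than the kernel llist API,
and the non-leaking branch shows one possible way to keep the head node,
not necessarily the change made by this patch.

```c
#include <stddef.h>

/* Simplified stand-in for a lock-less list node. */
struct node {
	struct node *next;
};

struct pool {
	struct node *clean_list;
};

/* Push the chain first..last onto the front of the clean list. */
static void pool_add_batch(struct pool *p, struct node *first,
			   struct node *last)
{
	last->next = p->clean_list;
	p->clean_list = first;
}

/* Count the nodes currently on the clean list. */
static size_t pool_count(const struct pool *p)
{
	size_t n = 0;

	for (const struct node *it = p->clean_list; it; it = it->next)
		n++;
	return n;
}

/*
 * Return cleaned nodes to the pool. With skip_head_always set, the head
 * is left out even when the caller does not take it (ret == NULL), which
 * is the leak described above. Otherwise the head is skipped only when
 * it is actually handed to the caller.
 */
static void requeue(struct pool *p, struct node *nodes, struct node *tail,
		    struct node **ret, int skip_head_always)
{
	if (ret)
		*ret = nodes;

	if (skip_head_always) {
		if (nodes->next)
			pool_add_batch(p, nodes->next, tail);
		return;
	}

	if (ret)
		nodes = nodes->next;
	if (nodes)
		pool_add_batch(p, nodes, tail);
}
```

Requeuing a three-node chain with ret == NULL returns only two nodes to
the pool in the leaking variant and all three in the other one.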

Fixes: 1bc144b625 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:32:16 -07:00