Commit Graph

48346 Commits

Author SHA1 Message Date
Jason A. Donenfeld
fef0035c0f netlink: do not proceed if dump's start() errs
Drivers that use the start method for netlink dumping rely on dumpit not
being called if start fails. For example, ila_xlat.c allocates memory
and assigns it to cb->args[0] in its start() function. It might fail to
do that and return -ENOMEM instead. However, even when returning an
error, dumpit will be called, which, in the example above, quickly
dereferences the memory in cb->args[0], which will OOPS the kernel. This
is but one example of how this goes wrong.

Since start() has always been a function with an int return type, it
therefore makes sense to use it properly, rather than ignoring it. This
patch thus returns early and does not call dumpit() when start() fails.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-30 16:13:31 +01:00
Sven Eckelmann
706cc9f51d batman-adv: Add argument names for function ptr definitions
checkpatch started to report unnamed arguments in function pointer
definitions. Add the corresponding names to these definitions to avoid this
warning.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-09-30 09:31:34 +02:00
Maciej Żenczykowski
84e14fe353 net-ipv6: add support for sockopt(SOL_IPV6, IPV6_FREEBIND)
So far we've been relying on sockopt(SOL_IP, IP_FREEBIND) being usable
even on IPv6 sockets.

However, it turns out it is perfectly reasonable to want to set freebind
on an AF_INET6 SOCK_RAW socket - but there is no way to set any SOL_IP
socket option on such a socket (they're all blindly errored out).

One use case for this is to allow spoofing src ip on a raw socket
via sendmsg cmsg.

Tested:
  built, and booted
  # python
  >>> import socket
  >>> SOL_IP = socket.SOL_IP
  >>> SOL_IPV6 = socket.IPPROTO_IPV6
  >>> IP_FREEBIND = 15
  >>> IPV6_FREEBIND = 78
  >>> s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0)
  >>> s.getsockopt(SOL_IP, IP_FREEBIND)
  0
  >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
  0
  >>> s.setsockopt(SOL_IPV6, IPV6_FREEBIND, 1)
  >>> s.getsockopt(SOL_IP, IP_FREEBIND)
  1
  >>> s.getsockopt(SOL_IPV6, IPV6_FREEBIND)
  1

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-30 05:30:52 +01:00
Mike Manning
1f372c7bfb net: ipv6: send NS for DAD when link operationally up
The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.

Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-30 05:24:45 +01:00
Artem Savkov
e6b72ee88a netfilter: ebtables: fix race condition in frame_filter_net_init()
It is possible for ebt_in_hook to be triggered before ebt_table is assigned
resulting in a NULL-pointer dereference. Make sure hooks are
registered as the last step.

Fixes: aee12a0a37 ("ebtables: remove nf_hook_register usage")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-29 13:36:06 +02:00
JingPiao Chen
0d18779be1 netfilter: nf_tables: fix update chain error
# nft add table filter
 # nft add chain filter c1
 # nft rename chain filter c1 c2

Error: Could not process rule: No such file or directory
rename chain filter c1 c2
^^^^^^^^^^^^^^^^^^^^^^^^^^

 # nft add chain filter c2
 # nft rename chain filter c1 c2
 # nft list table filter

table ip filter {
	chain c2 {
	}

	chain c2 {
	}
}

Fixes: 664b0f8cd8 ("netfilter: nf_tables: add generation mask to chains")
Signed-off-by: JingPiao Chen <chenjingpiao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-29 13:34:36 +02:00
Ross Lagerwall
e5173418ac netfilter: ipset: Fix race between dump and swap
Fix a race between ip_set_dump_start() and ip_set_swap().
The race is as follows:
* Without holding the ref lock, ip_set_swap() checks ref_netlink of the
  set and it is 0.
* ip_set_dump_start() takes a reference on the set.
* ip_set_swap() does the swap (even though it now has a non-zero
  reference count).
* ip_set_dump_start() gets the set from ip_set_list again which is now a
  different set since it has been swapped.
* ip_set_dump_start() calls __ip_set_put_netlink() and hits a BUG_ON due
  to the reference count being 0.

Fix this race by extending the critical region in which the ref lock is
held to include checking the ref counts.

The race can be reproduced with the following script:
  while :; do
    ipset destroy hash_ip1
    ipset destroy hash_ip2
    ipset create hash_ip1 hash:ip family inet hashsize 1024 \
        maxelem 500000
    ipset create hash_ip2 hash:ip family inet hashsize 300000 \
        maxelem 500000
    ipset create hash_ip3 hash:ip family inet hashsize 1024 \
        maxelem 500000
    ipset save &
    ipset swap hash_ip3 hash_ip2
    ipset destroy hash_ip3
    wait
  done

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-29 12:15:14 +02:00
David Ahern
fa8fefaa67 net: ipv4: remove fib_info arg to fib_check_nh
fib_check_nh does not use the fib_info arg; remove t.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-29 06:19:32 +01:00
David Ahern
c7c3e5913b net: ipv4: remove fib_weight
fib_weight in fib_info is set but not used. Remove it and the
helpers for setting it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-29 06:19:32 +01:00
Hoang Tran
cf5d74b85e tcp: fix under-evaluated ssthresh in TCP Vegas
With the commit 76174004a0 (tcp: do not slow start when cwnd equals
ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
under-evaluate the ssthresh.

Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-29 06:07:00 +01:00
Nikolay Aleksandrov
5af48b59f3 net: bridge: add per-port group_fwd_mask with less restrictions
We need to be able to transparently forward most link-local frames via
tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
mask which restricts the forwarding of STP and LACP, but we need to be able
to forward these over tunnels and control that forwarding on a per-port
basis thus add a new per-port group_fwd_mask option which only disallows
mac pause frames to be forwarded (they're always dropped anyway).
The patch does not change the current default situation - all of the others
are still restricted unless configured for forwarding.
We have successfully tested this patch with LACP and STP forwarding over
VxLAN and qinq tunnels.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-29 06:02:55 +01:00
Linus Torvalds
e49aa15ef6 Revert "Bluetooth: Add option for disabling legacy ioctl interfaces"
This reverts commit dbbccdc4ce.

It turns out that the "legacy" users aren't so legacy at all, and that
turning off the legacy ioctl will break the current Qt bluetooth stack
for bluetooth LE devices that were released just a couple of months ago.

So it's simply not true that this was a legacy interface that hasn't
been needed and is only limited to old legacy BT devices.  Because I
actually read Kconfig help messages, and actively try to turn off
features that I don't need, I turned the option off.

Then I spent _way_ too much time debugging BLE issues until I realized
that it wasn't the Qt and subsurface development that had broken one of
my dive computer BLE downloads, but simply my broken kernel config.

Maybe in a decade it will be true that this is a legacy interface.  And
maybe with a better help-text and correct dependencies, this kind of
legacy removal might be acceptable.  But as things are right now both
the commit message and the Kconfig help text were misleading, and the
Kconfig option had the wrong dependenencies.

There's no reason to keep that broken Kconfig option in the tree.

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-28 13:20:32 -07:00
Christoph Paasch
9d538fa60b net: Set sk_prot_creator when cloning sockets to the right proto
sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.

Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.

With slub-debugging and MEMCG_KMEM enabled this gives the warning
	"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"

A C-program to trigger this:

void main(void)
{
        int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
        int new_fd, newest_fd, client_fd;
        struct sockaddr_in6 bind_addr;
        struct sockaddr_in bind_addr4, client_addr1, client_addr2;
        struct sockaddr unsp;
        int val;

        memset(&bind_addr, 0, sizeof(bind_addr));
        bind_addr.sin6_family = AF_INET6;
        bind_addr.sin6_port = ntohs(42424);

        memset(&client_addr1, 0, sizeof(client_addr1));
        client_addr1.sin_family = AF_INET;
        client_addr1.sin_port = ntohs(42424);
        client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&client_addr2, 0, sizeof(client_addr2));
        client_addr2.sin_family = AF_INET;
        client_addr2.sin_port = ntohs(42421);
        client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&unsp, 0, sizeof(unsp));
        unsp.sa_family = AF_UNSPEC;

        bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));

        listen(fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
        new_fd = accept(fd, NULL, NULL);
        close(fd);

        val = AF_INET;
        setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));

        connect(new_fd, &unsp, sizeof(unsp));

        memset(&bind_addr4, 0, sizeof(bind_addr4));
        bind_addr4.sin_family = AF_INET;
        bind_addr4.sin_port = ntohs(42421);
        bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));

        listen(new_fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));

        newest_fd = accept(new_fd, NULL, NULL);
        close(new_fd);

        close(client_fd);
        close(new_fd);
}

As far as I can see, this bug has been there since the beginning of the
git-days.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:33:22 -07:00
Willem de Bruijn
da7c956101 packet: only test po->has_vnet_hdr once in packet_snd
Packet socket option po->has_vnet_hdr can be updated concurrently with
other operations if no ring is attached.

Do not test the option twice in packet_snd, as the value may change in
between calls. A race on setsockopt disable may cause a packet > mtu
to be sent without having GSO options set.

Fixes: bfd5f4a3d6 ("packet: Add GSO/csum offload support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:24:31 -07:00
Willem de Bruijn
4971613c16 packet: in packet_do_bind, test fanout with bind_lock held
Once a socket has po->fanout set, it remains a member of the group
until it is destroyed. The prot_hook must be constant and identical
across sockets in the group.

If fanout_add races with packet_do_bind between the test of po->fanout
and taking the lock, the bind call may make type or dev inconsistent
with that of the fanout group.

Hold po->bind_lock when testing po->fanout to avoid this race.

I had to introduce artificial delay (local_bh_enable) to actually
observe the race.

Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:24:31 -07:00
Florian Westphal
4c82a95e52 rtnetlink: rtnl_have_link_slave_info doesn't need rtnl
it can be switched to rcu.

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:20:49 -07:00
Florian Westphal
b1e66b9a67 rtnetlink: add helpers to dump netnsid information
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:20:49 -07:00
Florian Westphal
250fc3dfdb rtnetlink: add helpers to dump vf information
similar to earlier patches, split out more parts of this function to
better see what is happening and where we assume rtnl is locked.

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:20:49 -07:00
Florian Westphal
79110a0426 rtnetlink: add helper to put master and link ifindexes
rtnl_fill_ifinfo currently requires caller to hold the rtnl mutex.
Unfortunately the function is quite large which makes it harder to see
which spots require the lock, which spots assume it and which ones could
do without.

Add helpers to factor out the ifindex dumping, one can use rcu to avoid
rtnl dependency.

Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:20:49 -07:00
Florian Fainelli
e804441cfe net: dsa: Fix network device registration order
We cannot be registering the network device first, then setting its
carrier off and finally connecting it to a PHY, doing that leaves a
window during which the carrier is at best inconsistent, and at worse
the device is not usable without a down/up sequence since the network
device is visible to user space with possibly no PHY device attached.

Re-order steps so that they make logical sense. This fixes some devices
where the port was not usable after e.g: an unbind then bind of the
driver.

Fixes: 0071f56e46 ("dsa: Register netdev before phy")
Fixes: 91da11f870 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:12:53 -07:00
Cong Wang
e7614370d6 net_sched: use idr to allocate u32 filter handles
Instead of calling u32_lookup_ht() in a loop to find
a unused handle, just switch to idr API to allocate
new handles. u32 filters are special as the handle
could contain a hash table id and a key id, so we
need two IDR to allocate each of them.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Cong Wang
1d8134fea2 net_sched: use idr to allocate basic filter handles
Instead of calling basic_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Cong Wang
76cf546c28 net_sched: use idr to allocate bpf filter handles
Instead of calling cls_bpf_get() in a loop to find
a unused handle, just switch to idr API to allocate
new handles.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:43:41 -07:00
Eric Dumazet
8f1975e31d inetpeer: speed up inetpeer_invalidate_tree()
As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:40:59 -07:00
Eric Dumazet
35f493b87e inetpeer: fix RCU lookup() again
My prior fix was not complete, as we were dereferencing a pointer
three times per node, not twice as I initially thought.

Fixes: 4cc5b44b29 ("inetpeer: fix RCU lookup()")
Fixes: b145425f26 ("inetpeer: remove AVL implementation in favor of RB tree")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 09:39:34 -07:00
Sven Eckelmann
48915aed60 batman-adv: Fix "line over 80 characters" checkpatch warning
Fixes: 242c1a28eb ("net: Remove useless function skb_header_release")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-09-28 17:20:45 +02:00
Sven Eckelmann
825ffe1f7b batman-adv: Remove unnecessary parentheses
checkpatch introduced with commit 63b7c73ec8 ("checkpatch: add --strict
check for ifs with unnecessary parentheses") an additional test which
identifies some unnecessary parentheses.

Remove these unnecessary parentheses to avoid the warnings and to unify the
coding style slightly more.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-09-28 17:20:45 +02:00
Simon Wunderlich
7e7c1afb67 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-09-28 17:20:45 +02:00
Artem Savkov
dd269db849 xfrm: don't call xfrm_policy_cache_flush under xfrm_state_lock
I might be wrong but it doesn't look like xfrm_state_lock is required
for xfrm_policy_cache_flush and calling it under this lock triggers both
"sleeping function called from invalid context" and "possible circular
locking dependency detected" warnings on flush.

Fixes: ec30d78c14 xfrm: add xdst pcpu cache
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-09-28 09:39:05 +02:00
Yotam Gigi
478e4c2f00 net: mroute: Check if rule is a default rule
When the ipmr starts, it adds one default FIB rule that matches all packets
and sends them to the DEFAULT (multicast) FIB table. A more complex rule
can be added by user to specify that for a specific interface, a packet
should be look up at either an arbitrary table or according to the l3mdev
of the interface.

For drivers willing to offload the ipmr logic into a hardware but don't
want to offload all the FIB rules functionality, provide a function that
can indicate whether the FIB rule is the default multicast rule, thus only
one routing table is needed.

This way, a driver can register to the FIB notification chain, get
notifications about FIB rules added and trigger some kind of an internal
abort mechanism when a non default rule is added by the user.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-27 11:33:27 -07:00
Yotam Gigi
c7c0bbeae9 net: ipmr: Add MFC offload indication
Allow drivers, registered to the fib notification chain indicate whether a
multicast MFC route is offloaded or not, similarly to unicast routes. The
indication of whether a route is offloaded is done using the mfc_flags
field on an mfc_cache struct, and the information is sent to the userspace
via the RTNetlink interface only.

Currently, MFC routes are either offloaded or not, thus there is no need to
add per-VIF offload indication.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-27 11:33:27 -07:00
Yotam Gigi
b362053a7c ipmr: Send FIB notifications on MFC and VIF entries
Use the newly introduced notification chain to send events upon VIF and MFC
addition and deletion. The MFC notifications are sent only on resolved MFC
entries, as unresolved cannot be offloaded.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-27 11:33:27 -07:00
Yotam Gigi
4d65b94878 ipmr: Add FIB notification access functions
Make the ipmr module register as a FIB notifier. To do that, implement both
the ipmr_seq_read and ipmr_dump ops.

The ipmr_seq_read op returns a sequence counter that is incremented on
every notification related operation done by the ipmr. To implement that,
add a sequence counter in the netns_ipv4 struct and increment it whenever a
new MFC route or VIF are added or deleted. The sequence operations are
protected by the RTNL lock.

The ipmr_dump iterates the list of MFC routes and the list of VIF entries
and sends notifications about them. The entries dump is done under RCU
where the VIF dump uses the mrt_lock too, as the vif->dev field can change
under RCU.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-27 11:33:27 -07:00
Yotam Gigi
310ebbba3b ipmr: Add reference count to MFC entries
Next commits will introduce MFC notifications through the atomic
fib_notification chain, thus allowing modules to be aware of MFC entries.

Due to the fact that modules may need to hold a reference to an MFC entry,
add reference count to MFC entries to prevent them from being freed while
these modules use them.

The reference counting is done only on resolved MFC entries currently.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-27 11:33:27 -07:00
Dan Carpenter
c2cc187e53 sctp: Fix a big endian bug in sctp_diag_dump()
The sctp_for_each_transport() function takes an pointer to int.  The
cb->args[] array holds longs so it's only using the high 32 bits.  It
works on little endian system but will break on big endian 64 bit
machines.

Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 21:16:29 -07:00
Vivien Didelot
69b2c16296 net: dsa: use phy_ethtool_nway_reset
Use phy_ethtool_nway_reset now that dsa_slave_nway_reset does exactly
the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 20:06:34 -07:00
Vivien Didelot
aa62a8ca85 net: dsa: use phy_ethtool_set_link_ksettings
Use phy_ethtool_set_link_ksettings now that dsa_slave_set_link_ksettings
does exactly the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 20:06:34 -07:00
Vivien Didelot
771df31ace net: dsa: use phy_ethtool_get_link_ksettings
Use phy_ethtool_get_link_ksettings now that dsa_slave_get_link_ksettings
does exactly the same.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 20:06:34 -07:00
Vivien Didelot
0115dcd178 net: dsa: use slave device phydev
There is no need to store a phy_device in dsa_slave_priv since
net_device already provides one. Simply s/p->phy/dev->phydev/.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 20:06:34 -07:00
Vivien Didelot
f4344e0a48 net: dsa: return -ENODEV is there is no slave PHY
Instead of returning -EOPNOTSUPP when a slave device has no PHY,
directly return -ENODEV as ethtool and phylib do.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 20:06:34 -07:00
Daniel Borkmann
de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
Daniel Borkmann
6aaae2b6c4 bpf: rename bpf_compute_data_end into bpf_compute_data_pointers
Just do the rename into bpf_compute_data_pointers() as we'll add
one more pointer here to recompute.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
Florian Westphal
e23ed762db netfilter: ipset: pernet ops must be unregistered last
Removing the ipset module leaves a small window where one cpu performs
module removal while another runs a command like 'ipset flush'.

ipset uses net_generic(), unregistering the pernet ops frees this
storage area.

Fix it by first removing the user-visible api handlers and the pernet
ops last.

Fixes: 1785e8f473 ("netfiler: ipset: Add net namespace for ipset")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:15:17 +02:00
Jozsef Kadlecsik
48596a8ddc netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses
Wrong comparison prevented the hash types to add a range with more than
2^31 addresses but reported as a success.

Fixes Netfilter's bugzilla id #1005, reported by Oleg Serditov and
Oliver Ford.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:15:04 +02:00
Subash Abhinov Kasiviswanathan
89fcbb564f netfilter: xt_socket: Restore mark from full sockets only
An out of bounds error was detected on an ARM64 target with
Android based kernel 4.9. This occurs while trying to
restore mark on a skb from an inet request socket.

BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 net/netfilter/xt_socket.c:248
Read of size 4 at addr ffffffc06a8d824c by task syz-fuzzer/1532
CPU: 7 PID: 1532 Comm: syz-fuzzer Tainted: G        W  O    4.9.41+ #1
Call trace:
[<ffffff900808d2f8>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:76
[<ffffff900808d760>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226
[<ffffff90085f7dc8>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffff90085f7dc8>] dump_stack+0xe4/0x134 lib/dump_stack.c:51
[<ffffff900830f358>] print_address_description+0x68/0x258 mm/kasan/report.c:248
[<ffffff900830f770>] kasan_report_error mm/kasan/report.c:347 [inline]
[<ffffff900830f770>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371
[<ffffff900830fdec>] kasan_report+0x5c/0x70 mm/kasan/report.c:372
[<ffffff900830de98>] check_memory_region_inline mm/kasan/kasan.c:308 [inline]
[<ffffff900830de98>] __asan_load4+0x88/0xa0 mm/kasan/kasan.c:740
[<ffffff90097498f8>] socket_match.isra.2+0xc8/0x1f0 net/netfilter/xt_socket.c:248
[<ffffff9009749a5c>] socket_mt4_v1_v2_v3+0x3c/0x48 net/netfilter/xt_socket.c:272
[<ffffff90097f7e4c>] ipt_do_table+0x54c/0xad8 net/ipv4/netfilter/ip_tables.c:311
[<ffffff90097fcf14>] iptable_mangle_hook+0x6c/0x220 net/ipv4/netfilter/iptable_mangle.c:90
...
Allocated by task 1532:
 save_stack_trace_tsk+0x0/0x2a0 arch/arm64/kernel/stacktrace.c:131
 save_stack_trace+0x28/0x38 arch/arm64/kernel/stacktrace.c:215
 save_stack mm/kasan/kasan.c:495 [inline]
 set_track mm/kasan/kasan.c:507 [inline]
 kasan_kmalloc+0xd8/0x188 mm/kasan/kasan.c:599
 kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:537
 slab_post_alloc_hook mm/slab.h:417 [inline]
 slab_alloc_node mm/slub.c:2728 [inline]
 slab_alloc mm/slub.c:2736 [inline]
 kmem_cache_alloc+0x14c/0x2e8 mm/slub.c:2741
 reqsk_alloc include/net/request_sock.h:87 [inline]
 inet_reqsk_alloc+0x4c/0x238 net/ipv4/tcp_input.c:6236
 tcp_conn_request+0x2b0/0xea8 net/ipv4/tcp_input.c:6341
 tcp_v4_conn_request+0xe0/0x100 net/ipv4/tcp_ipv4.c:1256
 tcp_rcv_state_process+0x384/0x18a8 net/ipv4/tcp_input.c:5926
 tcp_v4_do_rcv+0x2f0/0x3e0 net/ipv4/tcp_ipv4.c:1430
 tcp_v4_rcv+0x1278/0x1350 net/ipv4/tcp_ipv4.c:1709
 ip_local_deliver_finish+0x174/0x3e0 net/ipv4/ip_input.c:216

v1->v2: Change socket_mt6_v1_v2_v3() as well as mentioned by Eric
v2->v3: Put the correct fixes tag

Fixes: 01555e74bd ("netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flag")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:04:34 +02:00
Sabrina Dubroca
62b982eeb4 l2tp: fix race condition in l2tp_tunnel_delete
If we try to delete the same tunnel twice, the first delete operation
does a lookup (l2tp_tunnel_get), finds the tunnel, calls
l2tp_tunnel_delete, which queues it for deletion by
l2tp_tunnel_del_work.

The second delete operation also finds the tunnel and calls
l2tp_tunnel_delete. If the workqueue has already fired and started
running l2tp_tunnel_del_work, then l2tp_tunnel_delete will queue the
same tunnel a second time, and try to free the socket again.

Add a dead flag to prevent firing the workqueue twice. Then we can
remove the check of queue_work's result that was meant to prevent that
race but doesn't.

Reproducer:

    ip l2tp add tunnel tunnel_id 3000 peer_tunnel_id 4000 local 192.168.0.2 remote 192.168.0.1 encap udp udp_sport 5000 udp_dport 6000
    ip l2tp add session name l2tp1 tunnel_id 3000 session_id 1000 peer_session_id 2000
    ip link set l2tp1 up
    ip l2tp del tunnel tunnel_id 3000
    ip l2tp del tunnel tunnel_id 3000

Fixes: f8ccac0e44 ("l2tp: put tunnel socket release on a workqueue")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 10:24:34 -07:00
Alexey Kodanev
36f6ee22d2 vti: fix use after free in vti_tunnel_xmit/vti6_tnl_xmit
When running LTP IPsec tests, KASan might report:

BUG: KASAN: use-after-free in vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
Read of size 4 at addr ffff880dc6ad1980 by task swapper/0/0
...
Call Trace:
  <IRQ>
  dump_stack+0x63/0x89
  print_address_description+0x7c/0x290
  kasan_report+0x28d/0x370
  ? vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
  __asan_report_load4_noabort+0x19/0x20
  vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
  ? vti_init_net+0x190/0x190 [ip_vti]
  ? save_stack_trace+0x1b/0x20
  ? save_stack+0x46/0xd0
  dev_hard_start_xmit+0x147/0x510
  ? icmp_echo.part.24+0x1f0/0x210
  __dev_queue_xmit+0x1394/0x1c60
...
Freed by task 0:
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x70/0xc0
  kmem_cache_free+0x81/0x1e0
  kfree_skbmem+0xb1/0xe0
  kfree_skb+0x75/0x170
  kfree_skb_list+0x3e/0x60
  __dev_queue_xmit+0x1298/0x1c60
  dev_queue_xmit+0x10/0x20
  neigh_resolve_output+0x3a8/0x740
  ip_finish_output2+0x5c0/0xe70
  ip_finish_output+0x4ba/0x680
  ip_output+0x1c1/0x3a0
  xfrm_output_resume+0xc65/0x13d0
  xfrm_output+0x1e4/0x380
  xfrm4_output_finish+0x5c/0x70

Can be fixed if we get skb->len before dst_output().

Fixes: b9959fd3b0 ("vti: switch to new ip tunnel code")
Fixes: 22e1b23daf ("vti6: Support inter address family tunneling.")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 09:58:21 -07:00
Tobias Klauser
d9db5e3680 kcm: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 09:54:06 -07:00
Tobias Klauser
63a4e80be4 ipv6: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 09:54:06 -07:00
Tobias Klauser
98e4fcff3e datagram: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 09:54:06 -07:00