Commit Graph

53672 Commits

Author SHA1 Message Date
Yafang Shao
c6849a3ac1 net: init sk_cookie for inet socket
With sk_cookie we can identify a socket, that is very helpful for
traceing and statistic, i.e. tcp tracepiont and ebpf.
So we'd better init it by default for inet socket.
When using it, we just need call atomic64_read(&sk->sk_cookie).

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 11:56:44 -04:00
Roopa Prabhu
b16fb418b1 net: fib_rules: add extack support
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 10:21:24 -04:00
Roopa Prabhu
f9d4b0c1e9 fib_rules: move common handling of newrule delrule msgs into fib_nl2rule
This reduces code duplication in the fib rule add and del paths.
Get rid of validate_rulemsg. This became obvious when adding duplicate
extack support in fib newrule/delrule error paths.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 10:21:24 -04:00
Yafang Shao
6163849d28 net: introduce a new tracepoint for tcp_rcv_space_adjust
tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.

When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.
With these two tracepoints, we could figure out whether the user program
processes this packet immediately or there's latency.

Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.

Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 09:58:18 -04:00
Jann Horn
7e5a206ab6 tcp: don't read out-of-bounds opsize
The old code reads the "opsize" variable from out-of-bounds memory (first
byte behind the segment) if a broken TCP segment ends directly after an
opcode that is neither EOL nor NOP.

The result of the read isn't used for anything, so the worst thing that
could theoretically happen is a pagefault; and since the physmap is usually
mostly contiguous, even that seems pretty unlikely.

The following C reproducer triggers the uninitialized read - however, you
can't actually see anything happen unless you put something like a
pr_warn() in tcp_parse_md5sig_option() to print the opsize.

====================================
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <assert.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++) {
    *sum += ntohs(((unsigned short *)data)[i]);
  }
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
  } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
    .ip = {
      .ihl = sizeof(struct iphdr)/4,
      .version = 4,
      .tot_len = htons(sizeof(syn_packet)),
      .ttl = 30,
      .protocol = IPPROTO_TCP,
      /* FIXUP check */
      .saddr = IPADDR(192,168,42,2),
      .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
      .source = htons(1),
      .dest = htons(1337),
      .seq = 0x12345678,
      .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
      .syn = 1,
      .window = htons(64),
      .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
      /* INVALID: trailing MD5SIG opcode after NOPs */
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 19
    }
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
      err(1, "packet write failed");
  }
}
====================================

Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 09:51:06 -04:00
Alexander Aring
d57493d6d1 net: sched: ife: check on metadata length
This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Alexander Aring
cc74eddd0f net: sched: ife: handle malformed tlv length
There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Alexander Aring
f6cd14537f net: sched: ife: signal not finding metaid
We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:12:00 -04:00
Doron Roberts-Kedes
7c5aba211d strparser: Do not call mod_delayed_work with a timeout of LONG_MAX
struct sock's sk_rcvtimeo is initialized to
LONG_MAX/MAX_SCHEDULE_TIMEOUT in sock_init_data. Calling
mod_delayed_work with a timeout of LONG_MAX causes spurious execution of
the work function. timer->expires is set equal to jiffies + LONG_MAX.
When timer_base->clk falls behind the current value of jiffies,
the delta between timer_base->clk and jiffies + LONG_MAX causes the
expiration to be in the past. Returning early from strp_start_timer if
timeo == LONG_MAX solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a ("strparser: Stream parser for messages")
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:09:16 -04:00
Ahmed Abdelsalam
a957fa190a ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.

The net_device is required for set_tun_src(). However calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results on the following bug.

Using just dst->dev should fix this BUG.

[  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[  196.243329] Oops: 0000 [#1] SMP PTI
[  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[  196.247804] Call Trace:
[  196.247972]  seg6_do_srh+0x15b/0x1c0
[  196.248156]  seg6_output+0x3c/0x220
[  196.248341]  ? prandom_u32+0x14/0x20
[  196.248526]  ? ip_idents_reserve+0x6c/0x80
[  196.248723]  ? __ip_select_ident+0x90/0x100
[  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
[  196.249133]  lwtunnel_output+0x44/0x70
[  196.249328]  ip_send_skb+0x15/0x40
[  196.249515]  raw_sendmsg+0x8c3/0xac0
[  196.249701]  ? _copy_from_user+0x2e/0x60
[  196.249897]  ? rw_copy_check_uvector+0x53/0x110
[  196.250106]  ? _copy_from_user+0x2e/0x60
[  196.250299]  ? copy_msghdr_from_user+0xce/0x140
[  196.250508]  sock_sendmsg+0x36/0x40
[  196.250690]  ___sys_sendmsg+0x292/0x2a0
[  196.250881]  ? _cond_resched+0x15/0x30
[  196.251074]  ? copy_termios+0x1e/0x70
[  196.251261]  ? _copy_to_user+0x22/0x30
[  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
[  196.251782]  ? _cond_resched+0x15/0x30
[  196.251972]  ? mutex_lock+0xe/0x30
[  196.252152]  ? vvar_fault+0xd2/0x110
[  196.252337]  ? __do_fault+0x1f/0xc0
[  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
[  196.252727]  ? __sys_sendmsg+0x63/0xa0
[  196.252919]  __sys_sendmsg+0x63/0xa0
[  196.253107]  do_syscall_64+0x72/0x200
[  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  196.253530] RIP: 0033:0x7fc4480b0690
[  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[  196.256445] CR2: 0000000000000000
[  196.256676] ---[ end trace 71af7d093603885c ]---

Fixes: 8936ef7604 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:04:17 -04:00
Cong Wang
3a04ce7130 llc: fix NULL pointer deref for SOCK_ZAPPED
For SOCK_ZAPPED socket, we don't need to care about llc->sap,
so we should just skip these refcount functions in this case.

Fixes: f7e4367268 ("llc: hold llc_sap before release_sock()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:56:22 -04:00
Cong Wang
b905ef9ab9 llc: delete timers synchronously in llc_sk_free()
The connection timers of an llc sock could be still flying
after we delete them in llc_sk_free(), and even possibly
after we free the sock. We could just wait synchronously
here in case of troubles.

Note, I leave other call paths as they are, since they may
not have to wait, at least we can change them to synchronously
when needed.

Also, move the code to net/llc/llc_conn.c, which is apparently
a better place.

Reported-by: <syzbot+f922284c18ea23a8e457@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:55:03 -04:00
Guillaume Nault
5411b6187a l2tp: fix {pppol2tp, l2tp_dfs}_seq_stop() in case of seq_file overflow
Commit 0e0c3fee3a ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
assumed that if pppol2tp_seq_stop() was called with non-NULL private
data (the 'v' pointer), then pppol2tp_seq_start() would not be called
again. It turns out that this isn't guaranteed, and overflowing the
seq_file's buffer in pppol2tp_seq_show() is a way to get into this
situation.

Therefore, pppol2tp_seq_stop() needs to reset pd->tunnel, so that
pppol2tp_seq_start() won't drop a reference again if it gets called.
We also have to clear pd->session, because the rest of the code expects
a non-NULL tunnel when pd->session is set.

The l2tp_debugfs module has the same issue. Fix it in the same way.

Fixes: 0e0c3fee3a ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
Fixes: f726214d9b ("l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 14:46:35 -04:00
Sven Eckelmann
dc06338df9 batman-adv: Remove unused dentry without DEBUGFS
The debug_dir variable in the main structures is only accessed when
CONFIG_BATMAN_ADV_DEBUGFS is enabled. It can be dropped to potentially save
some bytes of memory.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 10:11:07 +02:00
Sven Eckelmann
6486379d87 batman-adv: Avoid bool in structures
Using the bool type for structure member is considered inappropriate [1]
for the kernel. Its size is not well defined (but usually 1 byte but maybe
also 4 byte).

[1] https://lkml.org/lkml/2017/11/21/384

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 10:11:06 +02:00
Linus Lüssing
f26e4e98b1 batman-adv: Avoid old nodes disabling multicast optimizations completely
Instead of disabling multicast optimizations mesh-wide once a node with
no multicast optimizations capabilities joins the mesh, do the
following:

Just insert such nodes into the WANT_ALL_IPV4/IPV6 lists. This is
sufficient to avoid multicast packet loss to such unsupportive nodes.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
Sven Eckelmann
1ba93211e2 batman-adv: Disable CONFIG_BATMAN_ADV_DEBUGFS by default
All tools which were known to the batman-adv development team are
supporting the batman-adv netlink interface since a while. Also debugfs is
not supported for batman-adv interfaces in any non-default netns. Thus
disabling CONFIG_BATMAN_ADV_DEBUGFS by default should not cause problems on
most systems. It is still possible to enable it in case it is still
required in a specific setup.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
Simon Wunderlich
f546a99d62 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:14 +02:00
Colin Ian King
65cc02a8e1 batman-adv: don't pass a NULL hard_iface to batadv_hardif_put
In the case where hard_iface is NULL, the error path may pass a null
pointer to batadv_hardif_put causing a null pointer dereference error.
Avoid this by only calling the function if  hard_iface not null.

Detected by CoverityScan, CID#1466456 ("Explicit null dereferenced")

Fixes: 53dd9a68ba ("batman-adv: add multicast flags netlink support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-22 09:29:01 +02:00
David S. Miller
e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David Ahern
8ae869714b net/ipv6: Remove unncessary check on f6i in fib6_check
Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
a68886a691 net/ipv6: Make from in rt6_info rcu protected
When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
5bcaa41b96 net/ipv6: Move release of fib6_info from pcpu routes to helper
Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a87b7dc9f7 net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
4d85cd0c2a net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a269f1a764 net/ipv6: Rename rt6_get_cookie_safe
rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
6a3e030f08 net/ipv6: Clean up rt expires helpers
rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David S. Miller
1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Felix Fietkau
1a999d899b netfilter: nf_flow_table: rename nf_flow_table.c to nf_flow_table_core.c
Preparation for adding more code to the same module

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:41:55 +02:00
Felix Fietkau
4f3780c004 netfilter: nf_flow_table: cache mtu in struct flow_offload_tuple
Reduces the number of cache lines touched in the offload forwarding
path. This is safe because PMTU limits are bypassed for the forwarding
path (see commit f87c10a8aa for more details).

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:20:40 +02:00
Felix Fietkau
07cb9623ee ipv6: make ip6_dst_mtu_forward inline
Just like ip_dst_mtu_maybe_forward(), to avoid a dependency with ipv6.ko.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21 19:20:04 +02:00
Linus Torvalds
a72db42cee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Unbalanced refcounting in TIPC, from Jon Maloy.

 2) Only allow TCP_MD5SIG to be set on sockets in close or listen state.
    Once the connection is established it makes no sense to change this.
    From Eric Dumazet.

 3) Missing attribute validation in neigh_dump_table(), also from Eric
    Dumazet.

 4) Fix address comparisons in SCTP, from Xin Long.

 5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller.

 6) Fix tunnel refcounting in l2tp, from Guillaume Nault.

 7) Fix double list insert in team driver, from Paolo Abeni.

 8) af_vsock.ko module was accidently made unremovable, from Stefan
    Hajnoczi.

 9) Fix reference to freed llc_sap object in llc stack, from Cong Wang.

10) Don't assume netdevice struct is DMA'able memory in virtio_net
    driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
  net/smc: fix shutdown in state SMC_LISTEN
  bnxt_en: Fix memory fault in bnxt_ethtool_init()
  virtio_net: sparse annotation fix
  virtio_net: fix adding vids on big-endian
  virtio_net: split out ctrl buffer
  net: hns: Avoid action name truncation
  docs: ip-sysctl.txt: fix name of some ipv6 variables
  vmxnet3: fix incorrect dereference when rxvlan is disabled
  llc: hold llc_sap before release_sock()
  MAINTAINERS: Direct networking documentation changes to netdev
  atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
  net: qmi_wwan: add Wistron Neweb D19Q1
  net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
  net: stmmac: Disable ACS Feature for GMAC >= 4
  net: mvpp2: Fix DMA address mask size
  net: change the comment of dev_mc_init
  net: qualcomm: rmnet: Fix warning seen with fill_info
  tun: fix vlan packet truncation
  tipc: fix infinite loop when dumping link monitor summary
  tipc: fix use-after-free in tipc_nametbl_stop
  ...
2018-04-20 09:34:39 -07:00
Eric Dumazet
263243d6c2 net/ipv6: Fix ip6_convert_metrics() bug
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.

BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487

CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 kasan_report_error mm/kasan/report.c:352 [inline]
 kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
 refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
 fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
 fib6_info_release include/net/ip6_fib.h:304 [inline]
 ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
 inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 SYSC_sendmsg net/socket.c:2164 [inline]
 SyS_sendmsg+0x29/0x30 net/socket.c:2162
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:36:15 -04:00
GhantaKrishnamurthy MohanKrishna
682cd3cf94 tipc: confgiure and apply UDP bearer MTU on running links
Currently, we have option to configure MTU of UDP media. The configured
MTU takes effect on the links going up after that moment. I.e, a user
has to reset bearer to have new value applied across its links. This is
confusing and disturbing on a running cluster.

We now introduce the functionality to change the default UDP bearer MTU
in struct tipc_bearer. Additionally, the links are updated dynamically,
without any need for a reset, when bearer value is changed. We leverage
the existing per-link functionality and the design being symetrical to
the confguration of link tolerance.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna
901271e040 tipc: implement configuration of UDP media MTU
In previous commit, we changed the default emulated MTU for UDP bearers
to 14k.

This commit adds the functionality to set/change the default value
by configuring new MTU for UDP media. UDP bearer(s) have to be disabled
and enabled back for the new MTU to take effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
GhantaKrishnamurthy MohanKrishna
a4dfa72d0a tipc: set default MTU for UDP media
Currently, all bearers are configured with MTU value same as the
underlying L2 device. However, in case of bearers with media type
UDP, higher throughput is possible with a fixed and higher emulated
MTU value than adapting to the underlying L2 MTU.

In this commit, we introduce a parameter mtu in struct tipc_media
and a default value is set for UDP. A default value of 14k
was determined by experimentation and found to have a higher throughput
than 16k. MTU for UDP bearers are assigned the above set value of
media MTU.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:04:05 -04:00
Srinivas Dasari
2f0605a697 nl80211: Free connkeys on external authentication failure
The failure scenario while processing
NL80211_ATTR_EXTERNAL_AUTH_SUPPORT does not free
the connkeys. This commit addresses the same.

Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-04-20 09:58:03 +02:00
Ursula Braun
1255fcb2a6 net/smc: fix shutdown in state SMC_LISTEN
Calling shutdown with SHUT_RD and SHUT_RDWR for a listening SMC socket
crashes, because
   commit 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
releases the internal clcsock in smc_close_active() and sets smc->clcsock
to NULL.
For SHUT_RD the smc_close_active() call is removed.
For SHUT_RDWR the kernel_sock_shutdown() call is omitted, since the
clcsock is already released.

Fixes: 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 16:38:39 -04:00
David Ahern
27b10608a2 net/ipv6: Fix gfp_flags arg to addrconf_prefix_route
Eric noticed that __ipv6_ifa_notify is called under rcu_read_lock, so
the gfp argument to addrconf_prefix_route can not be GFP_KERNEL.

While scrubbing other calls I noticed addrconf_addr_gen has one
place with GFP_ATOMIC that can be GFP_KERNEL.

Fixes: acb54e3cba ("net/ipv6: Add gfp_flags to route add functions")
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
dcd1f57295 net/ipv6: Remove fib6_idev
fib6_idev can be obtained from __in6_dev_get on the nexthop device
rather than caching it in the fib6_info. Remove it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
eea68cd371 net/ipv6: Remove unnecessary checks on fib6_idev
Prior to 4832c30d54 ("net: ipv6: put host and anycast routes on device
with address") host routes and anycast routes were installed with the
device set to loopback (or VRF device once that feature was added). In the
older code dst.dev was set to loopback (needed for packet tx) and rt6i_idev
was used to denote the actual interface.

Commit 4832c30d54 changed the code to have dst.dev pointing to the real
device with the switch to lo or vrf device done on dst clones. As a
consequence of this change a couple of device checks during route lookups
are no longer needed. Remove them.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
9ee8cbb2fd net/ipv6: Remove aca_idev
aca_idev has only 1 user - inet6_fill_ifacaddr - and it only
wants the device index which can be extracted from the fib6_info
nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
360a9887c8 net/ipv6: Rename addrconf_dst_alloc
addrconf_dst_alloc now returns a fib6_info. Update the name
and its users to reflect the change.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
93c2fb253d net/ipv6: Rename fib6_info struct elements
Change the prefix for fib6_info struct elements from rt6i_ to fib6_.
rt6i_pcpu and rt6i_exception_bucket are left as is given that they
point to rt6_info entries.

Rename only; not functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:12 -04:00
Cong Wang
f7e4367268 llc: hold llc_sap before release_sock()
syzbot reported we still access llc->sap in llc_backlog_rcv()
after it is freed in llc_sap_remove_socket():

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 llc_conn_ac_send_sabme_cmd_p_set_x+0x3a8/0x460 net/llc/llc_c_ac.c:785
 llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline]
 llc_conn_service net/llc/llc_conn.c:400 [inline]
 llc_conn_state_process+0x4e1/0x13a0 net/llc/llc_conn.c:75
 llc_backlog_rcv+0x195/0x1e0 net/llc/llc_conn.c:891
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x12f/0x3a0 net/core/sock.c:2335
 release_sock+0xa4/0x2b0 net/core/sock.c:2850
 llc_ui_release+0xc8/0x220 net/llc/af_llc.c:204

llc->sap is refcount'ed and llc_sap_remove_socket() is paired
with llc_sap_add_socket(). This can be amended by holding its refcount
before llc_sap_remove_socket() and releasing it after release_sock().

Reported-by: <syzbot+6e181fc95081c2cf9051@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:54:53 -04:00
Eric Dumazet
88078d98d1 net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:44:11 -04:00
Colin Ian King
5e84b38b07 net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
Trivial fix to spelling mistake

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:37:10 -04:00
Felix Fietkau
047b300e6e netfilter: nf_flow_table: clean up flow_offload_alloc
Reduce code duplication and make it much easier to read

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 19:22:06 +02:00
Yuchung Cheng
feb5f2ec64 tcp: export packets delivery info
Export data delivered and delivered with CE marks to
1) SNMP TCPDelivered and TCPDeliveredCE
2) getsockopt(TCP_INFO)
3) Timestamping API SOF_TIMESTAMPING_OPT_STATS

Note that for SCM_TSTAMP_ACK, the delivery info in
SOF_TIMESTAMPING_OPT_STATS is reported before the info
was fully updated on the ACK.

These stats help application monitor TCP delivery and ECN status
on per host, per connection, even per message level.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00
Yuchung Cheng
e21db6f69a tcp: track total bytes delivered with ECN CE marks
Introduce a new delivered_ce stat in tcp socket to estimate
number of packets being marked with CE bits. The estimation is
done via ACKs with ECE bit. Depending on the actual receiver
behavior, the estimation could have biases.

Since the TCP sender can't really see the CE bit in the data path,
so the sender is technically counting packets marked delivered with
the "ECE / ECN-Echo" flag set.

With RFC3168 ECN, because the ECE bit is sticky, this count can
drastically overestimate the nummber of CE-marked data packets

With DCTCP-style ECN this should be reasonably precise unless there
is loss in the ACK path, in which case it's not precise.

With AccECN proposal this can be made still more precise, even in
the case some degree of ACK loss.

However this is sender's best estimate of CE information.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 13:05:16 -04:00