Commit Graph

6483 Commits

Author SHA1 Message Date
Martin KaFai Lau
2dbb9b9e6d bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.

BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].

At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
point,  it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.

For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb().  Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.

Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.

The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).

In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages.  Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.

The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").

The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()".  Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case).  The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want.  This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match).  The bpf prog can decide if it wants to do SK_DROP as its
discretion.

When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk.  If not, the kernel
will select one from "reuse->socks[]" (as before this patch).

The SK_DROP and SK_PASS handling logic will be in the next patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
Martin KaFai Lau
5dc4c4b7d4 bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contact/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]

During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
ensured in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.

When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update.  Supporting different
value_size between lookup and update seems unintuitive also.

We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
David S. Miller
fd685657cd Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains netfilter updates for your net-next tree:

1) Expose NFT_OSF_MAXGENRELEN maximum OS name length from the new OS
   passive fingerprint matching extension, from Fernando Fernandez.

2) Add extension to support for fine grain conntrack timeout policies
   from nf_tables. As preparation works, this patchset moves
   nf_ct_untimeout() to nf_conntrack_timeout and it also decouples the
   timeout policy from the ctnl_timeout object, most work done by
   Harsha Sharma.

3) Enable connection tracking when conntrack helper is in place.

4) Missing enumeration in uapi header when splitting original xt_osf
   to nfnetlink_osf, also from Fernando.

5) Fix a sparse warning due to incorrect typing in the nf_osf_find(),
   from Wei Yongjun.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-10 10:33:08 -07:00
Fernando Fernandez Mancera
212dfd909e netfilter: nfnetlink_osf: add missing enum in nfnetlink_osf uapi header
xt_osf_window_size_options was originally part of
include/uapi/linux/netfilter/xt_osf.h, restore it.

Fixes: bfb15f2a95 ("netfilter: extract Passive OS fingerprint infrastructure from xt_osf")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-08 18:23:49 +02:00
Pieter Jansen van Vuuren
0a6e77784f net/sched: allow flower to match tunnel options
Allow matching on options in Geneve tunnel headers.
This makes use of existing tunnel metadata support.

The options can be described in the form
CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent ffff: \
     flower \
       enc_src_ip 10.0.99.192 \
       enc_dst_ip 10.0.99.193 \
       enc_key_id 11 \
       geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
       ip_proto udp \
       action mirred egress redirect dev eth1

This patch adds support for matching Geneve options in the order
supplied by the user. This leads to an efficient implementation in
the software datapath (and in our opinion hardware datapaths that
offload this feature). It is also compatible with Geneve options
matching provided by the Open vSwitch kernel datapath which is
relevant here as the Flower classifier may be used as a mechanism
to program flows into hardware as a form of Open vSwitch datapath
offload (sometimes referred to as OVS-TC). The netlink
Kernel/Userspace API may be extended, for example by adding a flag,
if other matching options are desired, for example matching given
options in any order. This would require an implementation in the
TC software datapath. And be done in a way that drivers that
facilitate offload of the Flower classifier can reject or accept
such flows based on hardware datapath capabilities.

This approach was discussed and agreed on at Netconf 2017 in Seoul.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:22:15 -07:00
Florian Fainelli
6cfef793b5 ethtool: Add WAKE_FILTER and RX_CLS_FLOW_WAKE
Add the ability to specify through ethtool::rxnfc that a rule location is
special and will be used to participate in Wake-on-LAN, by e.g: having a
specific pattern be matched. When this is the case, fs->ring_cookie must
be set to the special value RX_CLS_FLOW_WAKE.

We also define an additional ethtool::wolinfo flag: WAKE_FILTER which
can be used to configure an Ethernet adapter to allow Wake-on-LAN using
previously programmed filters.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:15:03 -07:00
David S. Miller
1ba982806c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-07

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add cgroup local storage for BPF programs, which provides a fast
   accessible memory for storing various per-cgroup data like number
   of transmitted packets, etc, from Roman.

2) Support bpf_get_socket_cookie() BPF helper in several more program
   types that have a full socket available, from Andrey.

3) Significantly improve the performance of perf events which are
   reported from BPF offload. Also convert a couple of BPF AF_XDP
   samples overto use libbpf, both from Jakub.

4) seg6local LWT provides the End.DT6 action, which allows to
   decapsulate an outer IPv6 header containing a Segment Routing Header.
   Adds this action now to the seg6local BPF interface, from Mathieu.

5) Do not mark dst register as unbounded in MOV64 instruction when
   both src and dst register are the same, from Arthur.

6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
   instructions on arm64 for the AF_XDP sample code, from Brian.

7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
   over from Python 2 to Python 3, from Jeremy.

8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.

9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.

10) Several improvements to the README.rst from the BPF documentation
    to make it more consistent with RST format, from Tobin.

11) Replace all occurrences of strerror() by calls to strerror_r()
    in libbpf and fix a FORTIFY_SOURCE build error along with it,
    from Thomas.

12) Fix a bug in bpftool's get_btf() function to correctly propagate
    an error via PTR_ERR(), from Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 11:02:05 -07:00
Harsha Sharma
7e0b2b57f0 netfilter: nft_ct: add ct timeout support
This patch allows to add, list and delete connection tracking timeout
policies via nft objref infrastructure and assigning these timeout
via nft rule.

%./libnftnl/examples/nft-ct-timeout-add ip raw cttime tcp

Ruleset:

table ip raw {
   ct timeout cttime {
       protocol tcp;
       policy = {established: 111, close: 13 }
   }

   chain output {
       type filter hook output priority -300; policy accept;
       ct timeout set "cttime"
   }
}

%./libnftnl/examples/nft-rule-ct-timeout-add ip raw output cttime

%conntrack -E
[NEW] tcp      6 111 ESTABLISHED src=172.16.19.128 dst=172.16.19.1
sport=22 dport=41360 [UNREPLIED] src=172.16.19.1 dst=172.16.19.128
sport=41360 dport=22

%nft delete rule ip raw output handle <handle>
%./libnftnl/examples/nft-ct-timeout-del ip raw cttime

Joint work with Pablo Neira.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:23 +02:00
Fernando Fernandez Mancera
35a8a3bd1c netfilter: nft_osf: use NFT_OSF_MAXGENRELEN instead of IFNAMSIZ
As no "genre" on pf.os exceed 16 bytes of length, we reduce
NFT_OSF_MAXGENRELEN parameter to 16 bytes and use it instead of IFNAMSIZ.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:04 +02:00
Jason Wang
429711aec2 vhost: switch to use new message format
We use to have message like:

struct vhost_msg {
	int type;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

Unfortunately, there will be a hole of 32bit in 64bit machine because
of the alignment. This leads a different formats between 32bit API and
64bit API. What's more it will break 32bit program running on 64bit
machine.

So fixing this by introducing a new message type with an explicit
32bit reserved field after type like:

struct vhost_msg_v2 {
	__u32 type;
	__u32 reserved;
	union {
		struct vhost_iotlb_msg iotlb;
		__u8 padding[64];
	};
};

We will have a consistent ABI after switching to use this. To enable
this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for
userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2).

Fixes: 6b1e6cc785 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 10:41:04 -07:00
Wanpeng Li
aaffcfd1e8 KVM: X86: Implement PV IPIs in linux guest
Implement paravirtual apic hooks to enable PV IPIs for KVM if the "send IPI"
hypercall is available.  The hypercall lets a guest send IPIs, with
at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
hypercall in 32-bit mode.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:22 +02:00
Wanpeng Li
4180bf1b65 KVM: X86: Implement "send IPI" hypercall
Using hypercall to send IPIs by one vmexit instead of one by one for
xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster
mode. Intel guest can enter x2apic cluster mode when interrupt remmaping
is enabled in qemu, however, latest AMD EPYC still just supports xapic
mode which can get great improvement by Exit-less IPIs. This patchset
lets a guest send multicast IPIs, with at most 128 destinations per
hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.

Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):

x2apic cluster mode, vanilla

 Dry-run:                         0,            2392199 ns
 Self-IPI:                  6907514,           15027589 ns
 Normal IPI:              223910476,          251301666 ns
 Broadcast IPI:                   0,         9282161150 ns
 Broadcast lock:                  0,         8812934104 ns

x2apic cluster mode, pv-ipi

 Dry-run:                         0,            2449341 ns
 Self-IPI:                  6720360,           15028732 ns
 Normal IPI:              228643307,          255708477 ns
 Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
 Broadcast lock:                  0,         8316124651 ns

x2apic physical mode, vanilla

 Dry-run:                         0,            3135933 ns
 Self-IPI:                  8572670,           17901757 ns
 Normal IPI:              226444334,          255421709 ns
 Broadcast IPI:                   0,        19845070887 ns
 Broadcast lock:                  0,        19827383656 ns

x2apic physical mode, pv-ipi

 Dry-run:                         0,            2446381 ns
 Self-IPI:                  6788217,           15021056 ns
 Normal IPI:              219454441,          249583458 ns
 Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
 Broadcast lock:                  0,         9143618799 ns

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:20 +02:00
Jim Mattson
8fcc4b5923 kvm: nVMX: Introduce KVM_CAP_NESTED_STATE
For nested virtualization L0 KVM is managing a bit of state for L2 guests,
this state can not be captured through the currently available IOCTLs. In
fact the state captured through all of these IOCTLs is usually a mix of L1
and L2 state. It is also dependent on whether the L2 guest was running at
the moment when the process was interrupted to save its state.

With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
that is in VMX operation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
             address previous comments.
           - handle nested.smm state.
           - rebase & a bit of refactoring.
           - Merge 7/8 and 8/8 into one patch. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:30 +02:00
Paolo Bonzini
d2ce98ca0a Merge tag 'v4.18-rc6' into HEAD
Pull bug fixes into the KVM development tree to avoid nasty conflicts.
2018-08-06 17:31:36 +02:00
Christoph Hellwig
bfe4037e72 aio: implement IOCB_CMD_POLL
Simple one-shot poll through the io_submit() interface.  To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL.  It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:33 +02:00
Jens Axboe
05b9ba4b55 Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Peter Oskolkov
7969e5c40d ip: discard IPv4 datagrams with overlapping segments.
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
David S. Miller
074fb88016 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree:

1) Support for transparent proxying for nf_tables, from Mate Eckl.

2) Patchset to add OS passive fingerprint recognition for nf_tables,
   from Fernando Fernandez. This takes common code from xt_osf and
   place it into the new nfnetlink_osf module for codebase sharing.

3) Lightweight tunneling support for nf_tables.

4) meta and lookup are likely going to be used in rulesets, make them
   direct calls. From Florian Westphal.

A bunch of incremental updates:

5) use PTR_ERR_OR_ZERO() from nft_numgen, from YueHaibing.

6) Use kvmalloc_array() to allocate hashtables, from Li RongQing.

7) Explicit dependencies between nfnetlink_cttimeout and conntrack
   timeout extensions, from Harsha Sharma.

8) Simplify NLM_F_CREATE handling in nf_tables.

9) Removed unused variable in the get element command, from
   YueHaibing.

10) Expose bridge hook priorities through uapi, from Mate Eckl.

And a few fixes for previous Netfilter batch for net-next:

11) Use per-netns mutex from flowtable event, from Florian Westphal.

12) Remove explicit dependency on iptables CT target from conntrack
    zones, from Florian.

13) Fix use-after-free in rmmod nf_conntrack path, also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 16:25:22 -07:00
Florian Fainelli
d89d415561 ethtool: Remove trailing semicolon for static inline
Android's header sanitization tool chokes on static inline functions having a
trailing semicolon, leading to an incorrectly parsed header file. While the
tool should obviously be fixed, also fix the header files for the two affected
functions: ethtool_get_flow_spec_ring() and ethtool_get_flow_spec_ring_vf().

Fixes: 8cf6f497de ("ethtool: Add helper routines to pass vf to rx_flow_spec")
Reporetd-by: Blair Prescott <blair.prescott@broadcom.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 14:56:23 -07:00
Máté Eckl
94276fa8a2 netfilter: bridge: Expose nf_tables bridge hook priorities through uapi
Netfilter exposes standard hook priorities in case of ipv4, ipv6 and
arp but not in case of bridge.

This patch exposes the hook priority values of the bridge family (which are
different from the formerly mentioned) via uapi so that they can be used by
user-space applications just like the others.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:15:09 +02:00
Pablo Neira Ayuso
aaecfdb5c5 netfilter: nf_tables: match on tunnel metadata
This patch allows us to match on the tunnel metadata that is available
of the packet. We can use this to validate if the packet comes from/goes
to tunnel and the corresponding tunnel ID.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:12:19 +02:00
Pablo Neira Ayuso
af308b94a2 netfilter: nf_tables: add tunnel support
This patch implements the tunnel object type that can be used to
configure tunnels via metadata template through the existing lightweight
API from the ingress path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:12:12 +02:00
Guillaume Nault
e9697e2eff l2tp: ignore L2TP_ATTR_MTU
This attribute's handling is broken. It can only be used when creating
Ethernet pseudo-wires, in which case its value can be used as the
initial MTU for the l2tpeth device.
However, when handling update requests, L2TP_ATTR_MTU only modifies
session->mtu. This value is never propagated to the l2tpeth device.
Dump requests also return the value of session->mtu, which is not
synchronised anymore with the device MTU.

The same problem occurs if the device MTU is properly updated using the
generic IFLA_MTU attribute. In this case, session->mtu is not updated,
and L2TP_ATTR_MTU will report an invalid value again when dumping the
session.

It does not seem worthwhile to complexify l2tp_eth.c to synchronise
session->mtu with the device MTU. Even the ip-l2tp manpage advises to
use 'ip link' to initialise the MTU of l2tpeth devices (iproute2 does
not handle L2TP_ATTR_MTU at all anyway). So let's just ignore it
entirely.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 10:03:57 -07:00
Fernando Fernandez Mancera
ddba40be59 netfilter: nfnetlink_osf: rename nf_osf header file to nfnetlink_osf
The first client of the nf_osf.h userspace header is nft_osf, coming in
this batch, rename it to nfnetlink_osf.h as there are no userspace
clients for this yet, hence this looks consistent with other nfnetlink
subsystem.

Suggested-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:38:30 +02:00
Fernando Fernandez Mancera
7cca1ed0bb netfilter: nf_osf: move nf_osf_fingers to non-uapi header file
All warnings (new ones prefixed by >>):

>> ./usr/include/linux/netfilter/nf_osf.h:73: userspace cannot reference function or variable defined in the kernel

Fixes: f932495208 ("netfilter: nfnetlink_osf: extract nfnetlink_subsystem code from xt_osf.c")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:38:29 +02:00
Potnuri Bharat Teja
b9855f4ca0 iw_cxgb4: RDMA write with immediate support
Adds iw_cxgb4 functionality to support RDMA_WRITE_WITH_IMMEDATE opcode.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-02 20:16:02 -06:00
Yixian Liu
0425e3e6e0 RDMA/hns: Support flush cqe for hip08 in kernel space
According to IB protocol, there are some cases that work requests must
return the flush error completion status through the completion queue. Due
to hardware limitation, the driver needs to assist the flush process.

This patch adds the support of flush cqe for hip08 in the cases that
needed, such as poll cqe, post send, post recv and aeqe handle.

The patch also considered the compatibility between kernel and user space.

Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-02 20:03:25 -06:00
Roman Gushchin
cd33943176 bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
de9cbbaadb bpf: introduce cgroup storage maps
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps:
a special type of maps which are implementing the cgroup storage.

>From the userspace point of view it's almost a generic
hash map with the (cgroup inode id, attachment type) pair
used as a key.

The only difference is that some operations are restricted:
  1) a user can't create new entries,
  2) a user can't remove existing entries.

The lookup from userspace is o(log(n)).

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Todor Tomov
6e15bec49f media: v4l: Add new 10-bit packed grayscale format
The new format will be called V4L2_PIX_FMT_Y10P.
It is similar to the V4L2_PIX_FMT_SBGGR10P family formats
but V4L2_PIX_FMT_Y10P is a grayscale format.

Signed-off-by: Todor Tomov <todor.tomov@linaro.org>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hansverk@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-08-02 06:07:05 -04:00
Todor Tomov
451af0bf04 media: v4l: Add new 2X8 10-bit grayscale media bus code
The code will be called MEDIA_BUS_FMT_Y10_2X8_PADHI_LE.
It is similar to MEDIA_BUS_FMT_SBGGR10_2X8_PADHI_LE
but MEDIA_BUS_FMT_Y10_2X8_PADHI_LE describes grayscale
data.

Signed-off-by: Todor Tomov <todor.tomov@linaro.org>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hansverk@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-08-02 06:04:57 -04:00
Sakari Ailus
83b15832ab media: doc-rst: Add packed Bayer raw14 pixel formats
These formats are compressed 14-bit raw bayer formats with four different
pixel orders. They are similar to 10-bit variants. The formats added by
this patch are

	V4L2_PIX_FMT_SBGGR14P
	V4L2_PIX_FMT_SGBRG14P
	V4L2_PIX_FMT_SGRBG14P
	V4L2_PIX_FMT_SRGGB14P

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Todor Tomov <todor.tomov@linaro.org>
Signed-off-by: Hans Verkuil <hansverk@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-08-02 06:01:02 -04:00
Wei Wang
7ec65372ca tcp: add stat of data packet reordering events
Introduce a new TCP stats to record the number of reordering events seen
and expose it in both tcp_info (TCP_INFO) and opt_stats
(SOF_TIMESTAMPING_OPT_STATS).
Application can use this stats to track the frequency of the reordering
events in addition to the existing reordering stats which tracks the
magnitude of the latest reordering event.

Note: this new stats tracks reordering events triggered by ACKs, which
could often be fewer than the actual number of packets being delivered
out-of-order.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
7e10b6554f tcp: add dsack blocks received stats
Introduce a new TCP stat to record the number of DSACK blocks received
(RFC4989 tcpEStatsStackDSACKDups) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
fb31c9b9f6 tcp: add data bytes retransmitted stats
Introduce a new TCP stat to record the number of bytes retransmitted
(RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
ba113c3aa7 tcp: add data bytes sent stats
Introduce a new TCP stat to record the number of bytes sent
(RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Dave Airlie
f8f15c34ac Merge tag 'drm-msm-next-2018-07-30' of git://people.freedesktop.org/~robclark/linux into drm-next
A bit larger this time around, due to introduction of "dpu1" support
for the display controller in sdm845 and beyond.  This has been on
list and undergoing refactoring since Feb (going from ~110kloc to
~30kloc), and all my review complaints have been addressed, so I'd be
happy to see this upstream so further feature work can procede on top
of upstream.

Also includes the gpu coredump support, which should be useful for
debugging gpu crashes.  And various other misc fixes and such.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGv-8y3zguY0Mj1vh=o+vrv_bJ8AwZ96wBXYPvMeQT2XcA@mail.gmail.com
2018-08-01 08:52:19 +10:00
Finn Thain
ebd722275f macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
Now that the PowerMac via-pmu driver supports m68k PowerBooks,
switch over to that driver and remove the via-pmu68k driver.

Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-31 19:56:42 +10:00
Finn Thain
54c990775f macintosh/via-pmu68k: Don't load driver on unsupported hardware
Don't load the via-pmu68k driver on early PowerBooks. The M50753 PMU
device found in those models was never supported by this driver.
Attempting to load the driver usually causes a boot hang.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Reviewed-by: Michael Schmitz <schmitzmic@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-31 19:56:42 +10:00
Andrey Ignatov
d692f1138a bpf: Support bpf_get_socket_cookie in more prog types
bpf_get_socket_cookie() helper can be used to identify skb that
correspond to the same socket.

Though socket cookie can be useful in many other use-cases where socket is
available in program context. Specifically BPF_PROG_TYPE_CGROUP_SOCK_ADDR
and BPF_PROG_TYPE_SOCK_OPS programs can benefit from it so that one of
them can augment a value in a map prepared earlier by other program for
the same socket.

The patch adds support to call bpf_get_socket_cookie() from
BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS.

It doesn't introduce new helpers. Instead it reuses same helper name
bpf_get_socket_cookie() but adds support to this helper to accept
`struct bpf_sock_addr` and `struct bpf_sock_ops`.

Documentation in bpf.h is changed in a way that should not break
automatic generation of markdown.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31 09:33:48 +02:00
Janosch Frank
a449938297 KVM: s390: Add huge page enablement control
General KVM huge page support on s390 has to be enabled via the
kvm.hpage module parameter. Either nested or hpage can be enabled, as
we currently do not support vSIE for huge backed guests. Once the vSIE
support is added we will either drop the parameter or enable it as
default.

For a guest the feature has to be enabled through the new
KVM_CAP_S390_HPAGE_1M capability and the hpage module
parameter. Enabling it means that cmm can't be enabled for the vm and
disables pfmf and storage key interpretation.

This is due to the fact that in some cases, in upcoming patches, we
have to split huge pages in the guest mapping to be able to set more
granular memory protection on 4k pages. These split pages have fake
page tables that are not visible to the Linux memory management which
subsequently will not manage its PGSTEs, while the SIE will. Disabling
these features lets us manage PGSTE data in a consistent matter and
solve that problem.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 23:13:38 +02:00
Mauro Carvalho Chehab
d21c249b26 media: dvb/audio.h: get rid of unused APIs
There are a number of other ioctls that aren't used anywhere
inside the Kernel tree.

Get rid of them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-30 16:21:49 -04:00
Mauro Carvalho Chehab
b41e44b4cb media: dvb/video.h: get rid of unused APIs
There are a number of other ioctls that aren't used anywhere
inside the Kernel tree.

Get rid of them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-30 15:43:47 -04:00
Linus Torvalds
0634922a78 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes:

   - AMD IBS data corruptor fix (uncovered by UBSAN)

   - an Intel PEBS entry unwind error fix

   - a HW-tracing crash fix

   - a MAINTAINERS update"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix crash when using HW tracing kernel filters
  perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)
  MAINTAINERS: Add Naveen N. Rao as kprobes co-maintainer
  perf/x86/amd/ibs: Don't access non-started event
2018-07-30 11:45:30 -07:00
Paolo Abeni
802bfb1915 net/sched: user-space can't set unknown tcfa_action values
Currently, when initializing an action, the user-space can specify
and use arbitrary values for the tcfa_action field. If the value
is unknown by the kernel, is implicitly threaded as TC_ACT_UNSPEC.

This change explicitly checks for unknown values at action creation
time, and explicitly convert them to TC_ACT_UNSPEC. No functional
changes are introduced, but this will allow introducing tcfa_action
values not exposed to user-space in a later patch.

Note: we can't use the above to hide TC_ACT_REDIRECT from user-space,
as the latter is already part of uAPI.

v3 -> v4:
 - use an helper to check for action validity (JiriP)
 - emit an extack for invalid actions (JiriP)
v4 -> v5:
 - keep messages on a single line, drop net_warn (Marcelo)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 09:31:13 -07:00
Mauro Carvalho Chehab
ea8532daee media: videodev2: get rid of VIDIOC_RESERVED
While this ioctl is there at least since Kernel 2.6.12-rc2, it
was never used by any upstream driver.

Get rid of it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-30 10:23:34 -04:00
Máté Eckl
4ed8eb6570 netfilter: nf_tables: Add native tproxy support
A great portion of the code is taken from xt_TPROXY.c

There are some changes compared to the iptables implementation:
 - tproxy statement is not terminal here
 - Either address or port has to be specified, but at least one of them
   is necessary. If one of them is not specified, the evaluation will be
   performed with the original attribute of the packet (ie. target port
   is not specified => the packet's dport will be used).

To make this work in inet tables, the tproxy structure has a family
member (typically called priv->family) which is not necessarily equal to
ctx->family.

priv->family can have three values legally:
 - NFPROTO_IPV4 if the table family is ip OR if table family is inet,
   but an ipv4 address is specified as a target address. The rule only
   evaluates ipv4 packets in this case.
 - NFPROTO_IPV6 if the table family is ip6 OR if table family is inet,
   but an ipv6 address is specified as a target address. The rule only
   evaluates ipv6 packets in this case.
 - NFPROTO_UNSPEC if the table family is inet AND if only the port is
   specified. The rule will evaluate both ipv4 and ipv6 packets.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-30 14:07:12 +02:00
Fernando Fernandez Mancera
b96af92d6e netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf
Add basic module functions into nft_osf.[ch] in order to implement OSF
module in nf_tables.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-30 14:07:11 +02:00
Fernando Fernandez Mancera
f932495208 netfilter: nfnetlink_osf: extract nfnetlink_subsystem code from xt_osf.c
Move nfnetlink osf subsystem from xt_osf.c to standalone module so we can
reuse it from the new nft_ost extension.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-30 14:07:11 +02:00