Commit Graph

44719 Commits

Author SHA1 Message Date
Stefan Hajnoczi
4192f672fa vsock: make listener child lock ordering explicit
There are several places where the listener and pending or accept queue
child sockets are accessed at the same time.  Lockdep is unhappy that
two locks from the same class are held.

Tell lockdep that it is safe and document the lock ordering.

Originally Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> sent a similar
patch asking whether this is safe.  I have audited the code and also
covered the vsock_pending_work() function.

Suggested-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27 10:44:46 -04:00
Paolo Abeni
48f1dcb55a ipv6: enforce egress device match in per table nexthop lookups
with the commit 8c14586fc3 ("net: ipv6: Use passed in table for
nexthop lookups"), net hop lookup is first performed on route creation
in the passed-in table.
However device match is not enforced in table lookup, so the found
route can be later discarded due to egress device mismatch and no
global lookup will be performed.
This cause the following to fail:

ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dummy1 up
ip link set dummy2 up
ip route add 2001:db8:8086::/48 dev dummy1 metric 20
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20
ip route add 2001:db8:8086::/48 dev dummy2 metric 21
ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21
RTNETLINK answers: No route to host

This change fixes the issue enforcing device lookup in
ip6_nh_lookup_table()

v1->v2: updated commit message title

Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27 10:37:20 -04:00
David S. Miller
5db15872c5 Merge tag 'linux-can-next-for-4.8-20160623' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:

====================
pull-request: can-next 2016-06-17

this is a pull request of 4 patches for net-next/master.

Arnd Bergmann's patch fixes a regresseion in af_can introduced in
linux-can-next-for-4.8-20160617. There are two patches by Ramesh
Shanmugasundaram, which add CAN-2.0 support to the rcar_canfd driver.
And a patch by Ed Spiridonov that adds better error diagnoses messages
to the Ed Spiridonov driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27 10:33:42 -04:00
Amitoj Kaur Chawla
810bf11033 tipc: Use kmemdup instead of kmalloc and memcpy
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@

-  to = \(kmalloc\|kzalloc\)(size,flag);
+  to = kmemdup(from,size,flag);
   if (to==NULL || ...) S
-  memcpy(to, from, size);

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27 09:56:58 -04:00
David S. Miller
2b7c4f7a0e Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:

====================
rxrpc: Get rid of conn bundle and transport structs

Here's the next part of the AF_RXRPC rewrite.  The primary purpose of this
set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs.
This simplifies things for future development of the connection handling.

To this end, the following significant changes are made:

 (1) The rxrpc_connection struct is given pointers to the local and peer
     endpoints, inside the rxrpc_conn_parameters struct.  Pointers to the
     transport's copy of these pointers are then redirected to the
     connection struct.

 (2) Exclusive connection handling is fixed.  Exclusive connections should
     do just one call and then be retired.  They are used in security
     negotiations and, I believe, the idea is to avoid reuse of negotiated
     security contexts.

     The current code is doing a single connection per socket and doing all
     the calls over that.  With this change it gets a new connection for
     each call made.

 (3) A new sendmsg() control message marker is added to make individual
     calls operate over exclusive connections.  This should be used in
     future in preference to the sockopt that marks a socket as "exclusive
     connection".

 (4) IDs for client connections initiated by a machine are now allocated
     from a global pool using the IDR facility and are unique across all
     client connections, no matter their destination.  The IDR facility is
     then used to look up a connection on the connection ID alone.  Other
     parameters are then verified afterwards.

     Note that the IDR facility may use a lot of memory if the IDs it holds
     are widely scattered.  Given this, in a future commit, client
     connections will be retired if they are more than a certain distance
     from the last ID allocated.

     The client epoch is advanced by 1 each time the client ID counter
     wraps.  Connections outside the current epoch will also be retired in
     a future commit.

 (5) The connection bundle concept is removed and the client connection
     tree is moved into the local endpoint.  The queue for waiting for a
     call channel is moved to the rxrpc_connection struct as there can only
     be one connection for any particular key going to any particular peer
     now.

 (6) The rxrpc_transport struct is removed and the service connection tree
     is moved into the peer struct.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-26 16:01:54 -04:00
Eric Dumazet
4d202a0d31 net_sched: generalize bulk dequeue
When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a346 "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.

With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).

For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...

Idea is to bulk-dequeue packets mapping to the same transmit queue.

This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :

1) NUMA node contention :   610,000 pps -> 1,110,000 pps
2) No node contention   : 1,380,000 pps -> 1,930,000 pps

Now we should work to add batches on the enqueue() side ;)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Eric Dumazet
338ed9b4de net_sched: sch_htb: export class backlog in dumps
We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.

Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.

Tested:

$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
 Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
 rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
 lended: 9021770 borrowed: 0 giants: 0
 tokens: 9 ctokens: 9

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Eric Dumazet
008830bc32 net_sched: fq_codel: cache skb->truesize into skb->cb
Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Eric Dumazet
520ac30f45 net_sched: drop packets after root qdisc lock is released
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Jiri Slaby
a2a5f1a2d2 net: ircomm, cleanup TIOCGSERIAL
In ircomm_tty_get_serial_info, struct serial_struct is memset to 0 and
then some members set to 0 explicitly.

Remove the latter as it is obviously superfluous.

And remove the retinfo check against NULL. copy_to_user will take care
of that.

Part of hub6 cleanup series.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Samuel Ortiz <samuel@sortiz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-25 08:56:30 -07:00
Jarno Rajahalme
7d904c7bcd openvswitch: Only set mark and labels with a commit flag.
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.

OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels.  Validate for this in the kernel API.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
Jarno Rajahalme
1c1779fa54 openvswitch: Set mark and labels before confirming.
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
Arturo Borrero
0071e184a5 netfilter: nf_tables: add support for inverted logic in nft_lookup
Introduce a new configuration option for this expression, which allows users
to invert the logic of set lookups.

In _init() we will now return EINVAL if NFT_LOOKUP_F_INV is in anyway
related to a map lookup.

The code in the _eval() function has been untangled and updated to sopport the
XOR of options, as we should consider 4 cases:
 * lookup false, invert false -> NFT_BREAK
 * lookup false, invert true -> return w/o NFT_BREAK
 * lookup true, invert false -> return w/o NFT_BREAK
 * lookup true, invert true -> NFT_BREAK

Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:29 +02:00
Pablo Neira Ayuso
82bec71d46 netfilter: nf_tables: get rid of NFT_BASECHAIN_DISABLED
This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.

This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:28 +02:00
Florian Westphal
3183ab8997 netfilter: conntrack: allow increasing bucket size via sysctl too
No need to restrict this to module parameter.

We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.

This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.

We only allow changing this value from the initial net namespace.

Tested using http-client-benchmark vs. httpterm with concurrent

while true;do
 echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:28 +02:00
Pablo Neira Ayuso
8eee54be73 netfilter: nft_hash: support deletion of inactive elements
New elements are inactive in the preparation phase, and its
NFT_SET_ELEM_BUSY_MASK flag is set on.

This busy flag doesn't allow us to delete it from the same transaction,
following a sequence like:

	begin transaction
	add element X
	delete element X
	end transaction

This sequence is valid and may be triggered by robots. To resolve this
problem, allow deactivating elements that are active in the current
generation (ie. those that has been just added in this batch).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:27 +02:00
Pablo Neira Ayuso
4e5001651f netfilter: nft_rbtree: check for next generation when deactivating elements
set->ops->deactivate() is invoked from nft_del_setelem() that happens
from the transaction path, so we have to check if the object is active
in the next generation, not the current.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:26 +02:00
Pablo Neira Ayuso
37a9cc5255 netfilter: nf_tables: add generation mask to sets
Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:26 +02:00
Pablo Neira Ayuso
664b0f8cd8 netfilter: nf_tables: add generation mask to chains
Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:25 +02:00
Pablo Neira Ayuso
f2a6d76676 netfilter: nf_tables: add generation mask to tables
This patch addresses two problems:

1) The netlink dump is inconsistent when interfering with an ongoing
   transaction update for several reasons:

1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
     be skipping these inactive objects in the dump.

1.b) We perform speculative deletion during the preparation phase, that
     may result in skipping active objects.

1.c) The listing order changes, which generates noise when tracking
     incremental ruleset update via tools like git or our own
     testsuite.

2) We don't allow to add and to update the object in the same batch,
   eg. add table x; add table x { flags dormant\; }.

In order to resolve these problems:

1) If the user requests a deletion, the object becomes inactive in the
   next generation. Then, ignore objects that scheduled to be deleted
   from the lookup path, as they will be effectively removed in the
   next generation.

2) From the get/dump path, if the object is not currently active, we
   skip it.

3) Support 'add X -> update X' sequence from a transaction.

After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:24 +02:00
Pablo Neira Ayuso
889f7ee7c6 netfilter: nf_tables: add generic macros to check for generation mask
Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:24 +02:00
Vishwanath Pai
7643507fe8 netfilter: xt_NFLOG: nflog-range does not truncate packets
li->u.ulog.copy_len is currently ignored by the kernel, we should truncate
the packet to either li->u.ulog.copy_len (if set) or copy_range before
sending it to userspace. 0 is a valid input for copy_len, so add a new
flag to indicate whether this was option was specified by the user or not.

Add two flags to indicate whether nflog-size/copy_len was set or not.
XT_NFLOG_F_COPY_LEN is for XT_NFLOG and NFLOG_F_COPY_LEN for nfnetlink_log

On the userspace side, this was initially represented by the option
nflog-range, this will be replaced by --nflog-size now. --nflog-range would
still exist but does not do anything.

Reported-by: Joe Dollard <jdollard@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:23 +02:00
Liping Zhang
e1dbbc5907 netfilter: nf_reject_ipv4: don't send tcp RST if the packet is non-TCP
In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.

This means that user can add such a rule like follows via nfnetlink:
  "nft add rule filter forward ip protocol udp reject with tcp reset"

This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:22 +02:00
Eric W. Biederman
d91ee87d8d vfs: Pass data, ns, and ns->userns to mount_ns
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks.  The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts.  Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:53 -05:00
Eric Dumazet
21de12ee55 netem: fix a use after free
If the packet was dropped by lower qdisc, then we must not
access it later.

Save qdisc_pkt_len(skb) in a temp variable.

Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 15:07:44 -04:00
WANG Cong
817e9f2c5c act_ife: acquire ife_mod_lock before reading ifeoplist
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 12:02:36 -04:00
WANG Cong
067a7cd06f act_ife: only acquire tcf_lock for existing actions
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.

For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.

Joint work with Jamal.

Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 12:02:36 -04:00
Herbert Xu
962fcef33b esp: Fix ESN generation under UDP encapsulation
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.

This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.

Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 11:52:00 -04:00
Liping Zhang
62131e5d73 netfilter: nft_meta: set skb->nf_trace appropriately
When user add a nft rule to set nftrace to zero, for example:

  # nft add rule ip filter input nftrace set 0

We should set nf_trace to zero also.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 14:15:33 +02:00
Liping Zhang
6cafaf4764 netfilter: nf_tables: fix memory leak if expr init fails
If expr init fails then we need to free it.

So when the user add a nft rule as follows:

  # nft add rule filter input tcp dport 22 flow table ssh \
    { ip saddr limit rate 0/second }

memory leak will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 14:15:24 +02:00
Eric W. Biederman
9847371a84 netfilter: Allow xt_owner in any user namespace
Making this work is a little tricky as it really isn't kosher to
change the xt_owner_match_info in a check function.

Without changing xt_owner_match_info we need to know the user
namespace the uids and gids are specified in.  In the common case
net->user_ns == current_user_ns().  Verify net->user_ns ==
current_user_ns() in owner_check so we can later assume it in
owner_mt.

In owner_check also verify that all of the uids and gids specified are
in net->user_ns and that the expected min/max relationship exists
between the uids and gids in xt_owner_match_info.

In owner_mt get the network namespace from the outgoing socket, as this
must be the same network namespace as the netfilter rules, and use that
network namespace to find the user namespace the uids and gids in
xt_match_owner_info are encoded in.  Then convert from their encoded
from into the kernel internal format for uids and gids and perform the
owner match.

Similar to ping_group_range, this code does not try to detect
noncontiguous UID/GID ranges.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:58:55 +02:00
Florian Westphal
6c8dee9842 netfilter: move zone info into struct nf_conn
Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.

This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.

The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:33:12 +02:00
Shivani Bhardwaj
7e53e7f8ca netfilter: nf_log: Remove NULL check
If 'logger' was NULL, there would be a direct jump to the label 'out',
since it has already been checked for NULL, remove this unnecessary
check.

Signed-off-by: Shivani Bhardwaj <shivanib134@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:32:43 +02:00
Florian Westphal
5a75cdebab netfilter: conntrack: align nf_conn on cacheline boundary
increases struct size by 32 bytes (288 -> 320), but it is the right thing,
else any attempt to (re-)arrange nf_conn members by cacheline won't work.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:31:54 +02:00
Liping Zhang
36f959c491 netfilter: xt_TRACE: add explicitly nf_logger_find_get call
Consider such situation, if nf_log_ipv4 kernel module is not installed,
and the user add a following iptables rule:
  # iptables -t raw -I PREROUTING -j TRACE

There will be no trace log generated until the user install nf_log_ipv4
module manully. So we should add request related nf_log module
appropriately here.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:26:49 +02:00
Liping Zhang
f3bb53338e netfilter: nf_log: handle NFPROTO_INET properly in nf_logger_[find_get|put]
When we request NFPROTO_INET, it means both NFPROTO_IPV4 and NFPROTO_IPV6.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:24:42 +02:00
Xiubo Li
a6d0bae148 netfilter: x_tables: fix possible ZERO_SIZE_PTR pointer dereferencing error.
Since we cannot make sure that the 'hook_mask' will always be none
zero here. If it equals to zero, the num_hooks will be zero too,
and then kmalloc() will return ZERO_SIZE_PTR, which is (void *)16.

Then the following error check will fails:
  ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
  if (ops == NULL)
          return ERR_PTR(-ENOMEM);

So this patch will fix this with just doing the zero check before
kmalloc() is called.

Maybe the case above will never happen here, but in theory.

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 12:13:06 +02:00
Arnd Bergmann
2781ff5c8f can: only call can_stat_update with procfs
The change to leave out procfs support in CAN when CONFIG_PROC_FS
is not set was incomplete and leads to a build error:

net/built-in.o: In function `can_init':
:(.init.text+0x9858): undefined reference to `can_stat_update'
ERROR: "can_stat_update" [net/can/can.ko] undefined!

This tries a better approach, encapsulating all of the calls
within IS_ENABLED(), so we also leave out the timer function
from the object file.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a20fadf853 ("can: build proc support only if CONFIG_PROC_FS is activated")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-06-23 11:23:49 +02:00
William Tu
b95e5928fc openvswitch: Add packet len info to upcall.
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:34:39 -04:00
Jon Paul Maloy
27777daa8b tipc: unclone unbundled buffers before forwarding
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.

We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.

Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:33:35 -04:00
Jiri Slaby
d19af0a764 kcm: fix /proc memory leak
Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
kmemleak:
unreferenced object 0xffff88059c0e3458 (size 192):
  comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
  hex dump (first 32 bytes):
    28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00  (Eq.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8156a2de>] kmem_cache_alloc_trace+0x16e/0x230
    [<ffffffff8162a479>] seq_open+0x79/0x1d0
    [<ffffffffa0578510>] kcm_seq_open+0x0/0x30 [kcm]
    [<ffffffff8162a479>] seq_open+0x79/0x1d0
    [<ffffffff8162a8cf>] __seq_open_private+0x2f/0xa0
    [<ffffffff81712548>] seq_open_net+0x38/0xa0
...

It is caused by a missing free in the ->release path. So fix it by
providing seq_release_net as the ->release method.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:32:23 -04:00
David Howells
aa390bbe21 rxrpc: Kill off the rxrpc_transport struct
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.

Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock.  Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 14:00:23 +01:00
David Howells
999b69f892 rxrpc: Kill the client connection bundle concept
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).

This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct.  Bundling can be
reintroduced later if necessary.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:20:55 +01:00
David Howells
5627cc8b96 rxrpc: Provide more refcount helper functions
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.

Also make it such that local and peer put functions can take a NULL
pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
985a5c824a rxrpc: Make rxrpc_send_packet() take a connection not a transport
Make rxrpc_send_packet() take a connection not a transport as part of the
phasing out of the rxrpc_transport struct.

Whilst we're at it, rename the function to rxrpc_send_data_packet() to
differentiate it from the other packet sending functions.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
f4e7da8cde rxrpc: Calls displayed in /proc may in future lack a connection
Allocated rxrpc calls displayed in /proc/net/rxrpc_calls may in future be
on the proc list before they're connected or after they've been
disconnected - in which case they may not have a pointer to a connection
struct that can be used to get data from there.

Deal with this by using stuff from the call struct in preference where
possible and printing "no_connection" rather than a peer address if no
connection is assigned.

This change also has the added bonus that the service ID is now taken from
the call rather the connection which will allow per-call service upgrades
to be shown - something required for AuriStor server compatibility.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
f4552c2d24 rxrpc: Validate the net address given to rxrpc_kernel_begin_call()
Validate the net address given to rxrpc_kernel_begin_call() before using
it.

Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
4a3388c803 rxrpc: Use IDR to allocate client conn IDs on a machine-wide basis
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier.  When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space.  The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.

This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.

The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.

Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated.  We will also need to retire connections from an
old epoch.

Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections.  This
will go away later when RCU is used for server connections also.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:02 +01:00
David Howells
b3f575043f rxrpc: rxrpc_connection_lock shouldn't be a BH lock, but conn_lock is
rxrpc_connection_lock shouldn't be accessed as a BH-excluding lock.  It's
only accessed in a few places and none of those are in BH-context.

rxrpc_transport::conn_lock, however, *is* a BH-excluding lock and should be
accessed so consistently.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:02 +01:00
David Howells
42886ffe77 rxrpc: Pass sk_buff * rather than rxrpc_host_header * to functions
Pass a pointer to struct sk_buff rather than struct rxrpc_host_header to
functions so that they can in the future get at transport protocol parameters
rather than just RxRPC parameters.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:01 +01:00