Commit Graph

45659 Commits

Author SHA1 Message Date
Jarno Rajahalme
7d904c7bcd openvswitch: Only set mark and labels with a commit flag.
Only set conntrack mark or labels when the commit flag is specified.
This makes sure we can not set them before the connection has been
persisted, as in that case the mark and labels would be lost in an
event of an userspace upcall.

OVS userspace already requires the commit flag to accept setting
ct_mark and/or ct_labels.  Validate for this in the kernel API.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
Jarno Rajahalme
1c1779fa54 openvswitch: Set mark and labels before confirming.
Set conntrack mark and labels right before committing so that
the initial conntrack NEW event has the mark and labels.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 11:55:51 -04:00
Arturo Borrero
0071e184a5 netfilter: nf_tables: add support for inverted logic in nft_lookup
Introduce a new configuration option for this expression, which allows users
to invert the logic of set lookups.

In _init() we will now return EINVAL if NFT_LOOKUP_F_INV is in anyway
related to a map lookup.

The code in the _eval() function has been untangled and updated to sopport the
XOR of options, as we should consider 4 cases:
 * lookup false, invert false -> NFT_BREAK
 * lookup false, invert true -> return w/o NFT_BREAK
 * lookup true, invert false -> return w/o NFT_BREAK
 * lookup true, invert true -> NFT_BREAK

Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:29 +02:00
Pablo Neira Ayuso
82bec71d46 netfilter: nf_tables: get rid of NFT_BASECHAIN_DISABLED
This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.

This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:28 +02:00
Florian Westphal
3183ab8997 netfilter: conntrack: allow increasing bucket size via sysctl too
No need to restrict this to module parameter.

We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.

This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.

We only allow changing this value from the initial net namespace.

Tested using http-client-benchmark vs. httpterm with concurrent

while true;do
 echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:28 +02:00
Pablo Neira Ayuso
8eee54be73 netfilter: nft_hash: support deletion of inactive elements
New elements are inactive in the preparation phase, and its
NFT_SET_ELEM_BUSY_MASK flag is set on.

This busy flag doesn't allow us to delete it from the same transaction,
following a sequence like:

	begin transaction
	add element X
	delete element X
	end transaction

This sequence is valid and may be triggered by robots. To resolve this
problem, allow deactivating elements that are active in the current
generation (ie. those that has been just added in this batch).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:27 +02:00
Pablo Neira Ayuso
4e5001651f netfilter: nft_rbtree: check for next generation when deactivating elements
set->ops->deactivate() is invoked from nft_del_setelem() that happens
from the transaction path, so we have to check if the object is active
in the next generation, not the current.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:26 +02:00
Pablo Neira Ayuso
37a9cc5255 netfilter: nf_tables: add generation mask to sets
Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:26 +02:00
Pablo Neira Ayuso
664b0f8cd8 netfilter: nf_tables: add generation mask to chains
Similar to ("netfilter: nf_tables: add generation mask to tables").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:25 +02:00
Pablo Neira Ayuso
f2a6d76676 netfilter: nf_tables: add generation mask to tables
This patch addresses two problems:

1) The netlink dump is inconsistent when interfering with an ongoing
   transaction update for several reasons:

1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
     be skipping these inactive objects in the dump.

1.b) We perform speculative deletion during the preparation phase, that
     may result in skipping active objects.

1.c) The listing order changes, which generates noise when tracking
     incremental ruleset update via tools like git or our own
     testsuite.

2) We don't allow to add and to update the object in the same batch,
   eg. add table x; add table x { flags dormant\; }.

In order to resolve these problems:

1) If the user requests a deletion, the object becomes inactive in the
   next generation. Then, ignore objects that scheduled to be deleted
   from the lookup path, as they will be effectively removed in the
   next generation.

2) From the get/dump path, if the object is not currently active, we
   skip it.

3) Support 'add X -> update X' sequence from a transaction.

After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:24 +02:00
Pablo Neira Ayuso
889f7ee7c6 netfilter: nf_tables: add generic macros to check for generation mask
Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:24 +02:00
Vishwanath Pai
7643507fe8 netfilter: xt_NFLOG: nflog-range does not truncate packets
li->u.ulog.copy_len is currently ignored by the kernel, we should truncate
the packet to either li->u.ulog.copy_len (if set) or copy_range before
sending it to userspace. 0 is a valid input for copy_len, so add a new
flag to indicate whether this was option was specified by the user or not.

Add two flags to indicate whether nflog-size/copy_len was set or not.
XT_NFLOG_F_COPY_LEN is for XT_NFLOG and NFLOG_F_COPY_LEN for nfnetlink_log

On the userspace side, this was initially represented by the option
nflog-range, this will be replaced by --nflog-size now. --nflog-range would
still exist but does not do anything.

Reported-by: Joe Dollard <jdollard@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:23 +02:00
Liping Zhang
e1dbbc5907 netfilter: nf_reject_ipv4: don't send tcp RST if the packet is non-TCP
In iptables, if the user add a rule to send tcp RST and specify the
non-TCP protocol, such as UDP, kernel will reject this request. But
in nftables, this validity check only occurs in nft tool, i.e. only
in userspace.

This means that user can add such a rule like follows via nfnetlink:
  "nft add rule filter forward ip protocol udp reject with tcp reset"

This will generate some confusing tcp RST packets. So we should send
tcp RST only when it is TCP packet.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-24 11:03:22 +02:00
Eric W. Biederman
d91ee87d8d vfs: Pass data, ns, and ns->userns to mount_ns
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks.  The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts.  Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:53 -05:00
Eric Dumazet
21de12ee55 netem: fix a use after free
If the packet was dropped by lower qdisc, then we must not
access it later.

Save qdisc_pkt_len(skb) in a temp variable.

Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 15:07:44 -04:00
WANG Cong
817e9f2c5c act_ife: acquire ife_mod_lock before reading ifeoplist
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 12:02:36 -04:00
WANG Cong
067a7cd06f act_ife: only acquire tcf_lock for existing actions
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.

For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.

Joint work with Jamal.

Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 12:02:36 -04:00
Herbert Xu
962fcef33b esp: Fix ESN generation under UDP encapsulation
Blair Steven noticed that ESN in conjunction with UDP encapsulation
is broken because we set the temporary ESP header to the wrong spot.

This patch fixes this by first of all using the right spot, i.e.,
4 bytes off the real ESP header, and then saving this information
so that after encryption we can restore it properly.

Fixes: 7021b2e1cd ("esp4: Switch to new AEAD interface")
Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-23 11:52:00 -04:00
Liping Zhang
62131e5d73 netfilter: nft_meta: set skb->nf_trace appropriately
When user add a nft rule to set nftrace to zero, for example:

  # nft add rule ip filter input nftrace set 0

We should set nf_trace to zero also.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 14:15:33 +02:00
Liping Zhang
6cafaf4764 netfilter: nf_tables: fix memory leak if expr init fails
If expr init fails then we need to free it.

So when the user add a nft rule as follows:

  # nft add rule filter input tcp dport 22 flow table ssh \
    { ip saddr limit rate 0/second }

memory leak will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 14:15:24 +02:00
Eric W. Biederman
9847371a84 netfilter: Allow xt_owner in any user namespace
Making this work is a little tricky as it really isn't kosher to
change the xt_owner_match_info in a check function.

Without changing xt_owner_match_info we need to know the user
namespace the uids and gids are specified in.  In the common case
net->user_ns == current_user_ns().  Verify net->user_ns ==
current_user_ns() in owner_check so we can later assume it in
owner_mt.

In owner_check also verify that all of the uids and gids specified are
in net->user_ns and that the expected min/max relationship exists
between the uids and gids in xt_owner_match_info.

In owner_mt get the network namespace from the outgoing socket, as this
must be the same network namespace as the netfilter rules, and use that
network namespace to find the user namespace the uids and gids in
xt_match_owner_info are encoded in.  Then convert from their encoded
from into the kernel internal format for uids and gids and perform the
owner match.

Similar to ping_group_range, this code does not try to detect
noncontiguous UID/GID ranges.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:58:55 +02:00
Florian Westphal
6c8dee9842 netfilter: move zone info into struct nf_conn
Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.

This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.

The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:33:12 +02:00
Shivani Bhardwaj
7e53e7f8ca netfilter: nf_log: Remove NULL check
If 'logger' was NULL, there would be a direct jump to the label 'out',
since it has already been checked for NULL, remove this unnecessary
check.

Signed-off-by: Shivani Bhardwaj <shivanib134@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:32:43 +02:00
Florian Westphal
5a75cdebab netfilter: conntrack: align nf_conn on cacheline boundary
increases struct size by 32 bytes (288 -> 320), but it is the right thing,
else any attempt to (re-)arrange nf_conn members by cacheline won't work.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:31:54 +02:00
Liping Zhang
36f959c491 netfilter: xt_TRACE: add explicitly nf_logger_find_get call
Consider such situation, if nf_log_ipv4 kernel module is not installed,
and the user add a following iptables rule:
  # iptables -t raw -I PREROUTING -j TRACE

There will be no trace log generated until the user install nf_log_ipv4
module manully. So we should add request related nf_log module
appropriately here.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:26:49 +02:00
Liping Zhang
f3bb53338e netfilter: nf_log: handle NFPROTO_INET properly in nf_logger_[find_get|put]
When we request NFPROTO_INET, it means both NFPROTO_IPV4 and NFPROTO_IPV6.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 13:24:42 +02:00
Xiubo Li
a6d0bae148 netfilter: x_tables: fix possible ZERO_SIZE_PTR pointer dereferencing error.
Since we cannot make sure that the 'hook_mask' will always be none
zero here. If it equals to zero, the num_hooks will be zero too,
and then kmalloc() will return ZERO_SIZE_PTR, which is (void *)16.

Then the following error check will fails:
  ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
  if (ops == NULL)
          return ERR_PTR(-ENOMEM);

So this patch will fix this with just doing the zero check before
kmalloc() is called.

Maybe the case above will never happen here, but in theory.

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23 12:13:06 +02:00
Arnd Bergmann
2781ff5c8f can: only call can_stat_update with procfs
The change to leave out procfs support in CAN when CONFIG_PROC_FS
is not set was incomplete and leads to a build error:

net/built-in.o: In function `can_init':
:(.init.text+0x9858): undefined reference to `can_stat_update'
ERROR: "can_stat_update" [net/can/can.ko] undefined!

This tries a better approach, encapsulating all of the calls
within IS_ENABLED(), so we also leave out the timer function
from the object file.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a20fadf853 ("can: build proc support only if CONFIG_PROC_FS is activated")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-06-23 11:23:49 +02:00
William Tu
b95e5928fc openvswitch: Add packet len info to upcall.
The commit f2a4d086ed ("openvswitch: Add packet truncation support.")
introduces packet truncation before sending to userspace upcall receiver.
This patch passes up the skb->len before truncation so that the upcall
receiver knows the original packet size. Potentially this will be used
by sFlow, where OVS translates sFlow config header=N to a sample action,
truncating packet to N byte in kernel datapath. Thus, only N bytes instead
of full-packet size is copied from kernel to userspace, saving the
kernel-to-userspace bandwidth.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:34:39 -04:00
Jon Paul Maloy
27777daa8b tipc: unclone unbundled buffers before forwarding
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.

We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.

Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:33:35 -04:00
Jiri Slaby
d19af0a764 kcm: fix /proc memory leak
Every open of /proc/net/kcm leaks 16 bytes of memory as is reported by
kmemleak:
unreferenced object 0xffff88059c0e3458 (size 192):
  comm "cat", pid 1401, jiffies 4294935742 (age 310.720s)
  hex dump (first 32 bytes):
    28 45 71 96 05 88 ff ff 00 10 00 00 00 00 00 00  (Eq.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8156a2de>] kmem_cache_alloc_trace+0x16e/0x230
    [<ffffffff8162a479>] seq_open+0x79/0x1d0
    [<ffffffffa0578510>] kcm_seq_open+0x0/0x30 [kcm]
    [<ffffffff8162a479>] seq_open+0x79/0x1d0
    [<ffffffff8162a8cf>] __seq_open_private+0x2f/0xa0
    [<ffffffff81712548>] seq_open_net+0x38/0xa0
...

It is caused by a missing free in the ->release path. So fix it by
providing seq_release_net as the ->release method.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: cd6e111bf5 (kcm: Add statistics and proc interfaces)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tom Herbert <tom@herbertland.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:32:23 -04:00
David Howells
aa390bbe21 rxrpc: Kill off the rxrpc_transport struct
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.

Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock.  Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 14:00:23 +01:00
David Howells
999b69f892 rxrpc: Kill the client connection bundle concept
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).

This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct.  Bundling can be
reintroduced later if necessary.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:20:55 +01:00
David Howells
5627cc8b96 rxrpc: Provide more refcount helper functions
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.

Also make it such that local and peer put functions can take a NULL
pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
985a5c824a rxrpc: Make rxrpc_send_packet() take a connection not a transport
Make rxrpc_send_packet() take a connection not a transport as part of the
phasing out of the rxrpc_transport struct.

Whilst we're at it, rename the function to rxrpc_send_data_packet() to
differentiate it from the other packet sending functions.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
f4e7da8cde rxrpc: Calls displayed in /proc may in future lack a connection
Allocated rxrpc calls displayed in /proc/net/rxrpc_calls may in future be
on the proc list before they're connected or after they've been
disconnected - in which case they may not have a pointer to a connection
struct that can be used to get data from there.

Deal with this by using stuff from the call struct in preference where
possible and printing "no_connection" rather than a peer address if no
connection is assigned.

This change also has the added bonus that the service ID is now taken from
the call rather the connection which will allow per-call service upgrades
to be shown - something required for AuriStor server compatibility.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
f4552c2d24 rxrpc: Validate the net address given to rxrpc_kernel_begin_call()
Validate the net address given to rxrpc_kernel_begin_call() before using
it.

Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells
4a3388c803 rxrpc: Use IDR to allocate client conn IDs on a machine-wide basis
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier.  When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space.  The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.

This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.

The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.

Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated.  We will also need to retire connections from an
old epoch.

Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections.  This
will go away later when RCU is used for server connections also.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:02 +01:00
David Howells
b3f575043f rxrpc: rxrpc_connection_lock shouldn't be a BH lock, but conn_lock is
rxrpc_connection_lock shouldn't be accessed as a BH-excluding lock.  It's
only accessed in a few places and none of those are in BH-context.

rxrpc_transport::conn_lock, however, *is* a BH-excluding lock and should be
accessed so consistently.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:02 +01:00
David Howells
42886ffe77 rxrpc: Pass sk_buff * rather than rxrpc_host_header * to functions
Pass a pointer to struct sk_buff rather than struct rxrpc_host_header to
functions so that they can in the future get at transport protocol parameters
rather than just RxRPC parameters.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:01 +01:00
David Howells
cc8feb8edd rxrpc: Fix exclusive connection handling
"Exclusive connections" are meant to be used for a single client call and
then scrapped.  The idea is to limit the use of the negotiated security
context.  The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.

This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.

Further, using a socket option for this is a poor choice.  It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually.  To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.

The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:00 +01:00
David Howells
85f32278bd rxrpc: Replace conn->trans->{local,peer} with conn->params.{local,peer}
Replace accesses of conn->trans->{local,peer} with
conn->params.{local,peer} thus making it easier for a future commit to
remove the rxrpc_transport struct.

This also reduces the number of memory accesses involved.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:00 +01:00
David Howells
19ffa01c9c rxrpc: Use structs to hold connection params and protocol info
Define and use a structure to hold connection parameters.  This makes it
easier to pass multiple connection parameters around.

Define and use a structure to hold protocol information used to hash a
connection for lookup on incoming packet.  Most of these fields will be
disposed of eventually, including the duplicate local pointer.

Whilst we're at it rename "proto" to "family" when referring to a protocol
family.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:09:59 +01:00
Arnd Bergmann
2f9f9f5210 rxrpc: fix uninitialized variable use
Hashing the peer key was introduced for AF_INET, but gcc
warns about the rxrpc_peer_hash_key function returning uninitialized
data for any other value of srx->transport.family:

net/rxrpc/peer_object.c: In function 'rxrpc_peer_hash_key':
net/rxrpc/peer_object.c:57:15: error: 'p' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Assuming that nothing else can be set here, this changes the
function to just return zero in case of an unknown address
family.

Fixes: be6e6707f6 ("rxrpc: Rework peer object handling to use hash table and RCU")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:09:58 +01:00
Dan Carpenter
0e4699e4a3 rxrpc: checking for IS_ERR() instead of NULL
rxrpc_lookup_peer_rcu() and rxrpc_lookup_peer() return NULL on error, never
error pointers, so IS_ERR() can't be used.

Fix three callers of those functions.

Fixes: be6e6707f6 ('rxrpc: Rework peer object handling to use hash table and RCU')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:09:58 +01:00
Eric Dumazet
20e1954fe2 ipv6: RFC 4884 partial support for SIT/GRE tunnels
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.

This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.

It also avoids complex skb logic if the data_len is not
a multiple of 8.

RFC states :

   The "original datagram" field MUST contain at least 128 octets.
   If the original datagram did not contain 128 octets, the
   "original datagram" field MUST be zero padded to 128 octets.

In practice routers use 128 bytes of original datagram, not more.

Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Eric Dumazet
9b8c6d7bf2 gre: better support for ICMP messages for gre+ipv6
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Eric Dumazet
2d7a3b276b ipv6: translate ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED
For better traceroute/mtr support for SIT and GRE tunnels,
we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED

We also have to translate the IPv4 source IP address of ICMP
message to IPv6 v4mapped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Eric Dumazet
5fbba8ac93 ip6: move ipip6_err_gen_icmpv6_unreach()
We want to use this helper from GRE as well, so this is
the time to move it in net/ipv6/icmp.c

Also add a @nhs parameter, since SIT and GRE have different
values for the header(s) to skip.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:39 -07:00
Eric Dumazet
b1cadc1a09 ipv6: icmp: add a force_saddr param to icmp6_send()
SIT or GRE tunnels might want to translate an IPV4 address
into a v4mapped one when translating ICMP to ICMPv6.

This patch adds the parameter to icmp6_send() but
does not change icmpv6_send() signature.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18 22:11:38 -07:00