Commit Graph

45659 Commits

Author SHA1 Message Date
Chuck Lever
08cf2efd54 xprtrdma: Use smaller buffers for RPC-over-RDMA headers
Commit 949317464b ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
9c40c49f14 xprtrdma: Initialize separate RPC call and reply buffers
RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5a6d1db455 SUNRPC: Add a transport-specific private field in rpc_rqst
Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbuf's work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
68778945e4 SUNRPC: Separate buffer pointers for RPC Call and Reply messages
For xprtrdma, the RPC Call and Reply buffers are involved in real
I/O operations.

To start with, the DMA direction of the I/O for a Call is opposite
that of a Reply.

In the current arrangement, the Reply buffer address is on a
four-byte alignment just past the call buffer. Would be friendlier
on some platforms if that was at a DMA cache alignment instead.

Because the current arrangement allocates a single memory region
which contains both buffers, the RPC Reply buffer often contains a
page boundary in it when the Call buffer is large enough (which is
frequent).

It would be a little nicer for setting up DMA operations (and
possible registration of the Reply buffer) if the two buffers were
separated, well-aligned, and contained as few page boundaries as
possible.

Now, I could just pad out the single memory region used for the pair
of buffers. But frequently that would mean a lot of unused space to
ensure the Reply buffer did not have a page boundary.

Add a separate pointer to rpc_rqst that points right to the RPC
Reply buffer. This makes no difference to xprtsock, but it will help
xprtrdma in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
3435c74aed SUNRPC: Generalize the RPC buffer release API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
5fe6eaa1f9 SUNRPC: Generalize the RPC buffer allocation API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Transports that want to allocate separate Call and Reply buffers
will ignore the "size" argument anyway.  Don't bother passing it.

The buf_alloc method can't return two pointers. Instead, make the
method's return value an error code, and set the rq_buffer pointer
in the method itself.

This gives call_allocate an opportunity to terminate an RPC instead
of looping forever when a permanent problem occurs. If a request is
just bogus, or the transport is in a state where it can't allocate
resources for any request, there needs to be a way to kill the RPC
right there and not loop.

This immediately fixes a rare problem in the backchannel send path,
which loops if the server happens to send a CB request whose
call+reply size is larger than a page (which it shouldn't do yet).

One more issue: looks like xprt_inject_disconnect was incorrectly
placed in the failure path in call_allocate. It needs to be in the
success path, as it is for other call-sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
b9c5bc03be SUNRPC: Refactor rpc_xdr_buf_init()
Clean up: there is some XDR initialization logic that is common
to the forward channel and backchannel. Move it to an XDR header
so it can be shared.

rpc_rqst::rq_buffer points to a buffer containing big-endian data.
Update its annotation as part of the clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever
eb342e9a38 xprtrdma: Eliminate INLINE_THRESHOLD macros
Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Andy Adamson
fda0ab4117 SUNRPC: rpc_clnt_add_xprt setup function for NFS layer
Use a setup function to call into the NFS layer to test an rpc_xprt
for session trunking so as to not leak the rpc_xprt_switch into
the nfs layer.

Search for the address in the rpc_xprt_switch first so as not to
put an unnecessary EXCHANGE_ID on the wire.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
39e5d2df95 SUNRPC search xprt switch for sockaddr
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
dd69171769 SUNRPC rpc_clnt_xprt_switch_add_xprt
Give the NFS layer access to the rpc_xprt_switch_add_xprt function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
3b58a8a904 SUNRPC rpc_clnt_xprt_switch_put
Give the NFS layer access to the xprt_switch_put function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson
7705f6abbb SUNRPC remove rpc_task_release_client from rpc_task_set_client
rpc_task_set_client is only called from rpc_run_task after
rpc_new_task and rpc_task_release_client is not needed as the
task is new.

When called from rpc_new_task, rpc_task_set_client also removed the
assigned rpc_xprt which is not desired.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Trond Myklebust
d002526886 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
Clean up.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla
2813b626e3 sunrpc: Remove unnecessary variable
The variable `err` is not used anywhere and just returns the
predefined value `0` at the end of the function. Hence, remove the
variable and return 0 explicitly.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:35 -04:00
Ilan Tayari
b588479358 xfrm: Fix memory leak of aead algorithm name
commit 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
introduced aead. The function attach_aead kmemdup()s the algorithm
name during xfrm_state_construct().
However this memory is never freed.
Implementation has since been slightly modified in
commit ee5c23176f ("xfrm: Clone states properly on migration")
without resolving this leak.
This patch adds a kfree() call for the aead algorithm name.

Fixes: 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-19 12:08:58 +02:00
David S. Miller
e867e87ae8 Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:

====================
rxrpc: Tracepoint addition and improvement

Here is a set of patches that add some more tracepoints and improve a couple
of existing ones.  New additions include:

 (1) Connection refcount tracking.

 (2) Client connection state machine tracking.

 (3) Tx and Rx packet lifecycle.

 (4) ACK reception and transmission.

 (5) recvmsg processing.

Updates include:

 (1) Print the symbolic packet name in the Rx packet tracepoint.

 (2) Additional call refcount trace events.

 (3) Improvements to sk_buff tracking with AF_RXRPC.

In addition:

 (1) Config option to inject packet loss during both transmission and
     reception.

 (2) Removal of some printks.

This series needs to be applied on top of the previously posted fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:52:21 -04:00
David S. Miller
5b0c6fc8ef Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:

====================
rxrpc: Fixes & miscellany

Here are some more AF_RXRPC fix patches with a couple of miscellaneous
changes also.  Fixes include:

 (1) Make RxRPC IPv6 support conditional on IPv6 being available.

 (2) Move the condition check in rxrpc_locate_data() into the caller and
     check the error return.

 (3) Fix the detection of the last received packet in recvmsg.

 (4) Account calls that need acceptance and clean up any unaccepted ones if
     the socket gets closed.

 (5) Fix the cleanup of client connections.

 (6) Fix the soft-ACK parsing and the retransmission of packets based on
     those ACKs.

 (7) Suppress transmission of an ACK when there's no pending ACK to
     transmit because another thread stole it.

And some miscellany:

 (8) Whitespace removal.

 (9) Switch-value consistency in rxrpc_send_call_packet().

(10) Fix the basic transmission packet size to allow for spur-of-the-moment
     jumbo DATA packet production.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:51:21 -04:00
Florian Westphal
48da34b7a7 sched: add and use qdisc_skb_head helpers
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
ed760cb8aa sched: replace __skb_dequeue with __qdisc_dequeue_head
After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
ec32336879 sched: remove qdisc arg from __qdisc_dequeue_head
Moves qdisc stat accouting to qdisc_dequeue_head.

The only direct caller of the __qdisc_dequeue_head version open-codes
this now.

This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
97d0678f91 sched: don't use skb queue helpers
A followup change will replace the sk_buff_head in the qdisc
struct with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler
warnings.

Open-code these accesses in an extra change to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal
1486587b2f pie: use qdisc_dequeue_head wrapper
Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Christophe Jaillet
e8bc8f9a67 sctp: Remove some redundant code
In commit 311b21774f ("sctp: simplify sk_receive_queue locking"), a call
to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was
hidden in 'sctp_skb_list_tail()'

Now, the code around it looks redundant. The '_init()' part of
'skb_queue_splice_tail_init()' should already do the same.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:34:01 -04:00
Mahesh Bandewar
e8bffe0cf9 net: Add _nf_(un)register_hooks symbols
Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
caller to hold RTNL mutex.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Mahesh Bandewar
d409b84768 ipv6: Export p6_route_input_lookup symbol
Make ip6_route_input_lookup available outside of ipv6 the module
similar to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Nogah Frankel
69ae6ad2ff net: core: Add offload stats to if_stats_msg
Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:33:42 -04:00
David S. Miller
c13ed534b8 Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:

====================
This time we have various things - all across the board:
 * MU-MIMO sniffer support in mac80211
 * a create_singlethread_workqueue() cleanup
 * interface dump filtering that was documented but not implemented
 * support for the new radiotap timestamp field
 * send delBA in two unexpected conditions (as required by the spec)
 * connect keys cleanups - allow only WEP with index 0-3
 * per-station aggregation limit to work around broken APs
 * debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:29:08 -04:00
David S. Miller
7ac327318e Merge tag 'mac80211-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:

====================
Two more fixes:
 * reject aggregation sessions for TSID/TID 8-16 that we
   can never use anyway and which could confuse drivers
 * check return value of skb_linearize()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:26:49 -04:00
Eric Dumazet
695b4ec0f0 pkt_sched: fq: use proper locking in fq_dump_stats()
When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.

Otherwise "tc -s qdisc ..." might report bogus values.

Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:15:08 -04:00
Thadeu Lima de Souza Cascardo
db74a3335e openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Cc: pravin shelar <pshelar@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Thadeu Lima de Souza Cascardo
40773966cc openvswitch: fix flow stats accounting when node 0 is not possible
On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Xin Long
41001faf95 sctp: not return ENOMEM err back in sctp_packet_transmit
As David and Marcelo's suggestion, ENOMEM err shouldn't return back to
user in transmit path. Instead, sctp's retransmit would take care of
the chunks that fail to send because of ENOMEM.

This patch is only to do some release job when alloc_skb fails, not to
return ENOMEM back any more.

Besides, it also cleans up sctp_packet_transmit's err path, and fixes
some issues in err path:

 - It didn't free the head skb in nomem: path.
 - No need to check nskb in no_route: path.
 - It should goto err: path if alloc_skb fails for head.
 - Not all the NOMEMs should free nskb.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long
83dbc3d4a3 sctp: make sctp_outq_flush/tail/uncork return void
sctp_outq_flush return value is meaningless now, this patch is
to make sctp_outq_flush return void, as well as sctp_outq_fail
and sctp_outq_uncork.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long
645194409b sctp: save transmit error to sk_err in sctp_outq_flush
Every time when sctp calls sctp_outq_flush, it sends out the chunks of
control queue, retransmit queue and data queue. Even if some trunks are
failed to transmit, it still has to flush all the transports, as it's
the only chance to clean that transmit_list.

So the latest transmit error here should be returned back. This transmit
error is an internal error of sctp stack.

I checked all the places where it uses the transmit error (the return
value of sctp_outq_flush), most of them are actually just save it to
sk_err.

Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if
it's failed to send a REPLY, which is actually incorrect, as we can't
be sure the error that sctp_outq_flush returns is from sending that
REPLY.

So it's meaningless for sctp_outq_flush to return error back.

This patch is to save transmit error to sk_err in sctp_outq_flush, the
new error can update the old value. Eventually, sctp_wait_for_* would
check for it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
b61c654f9b sctp: free msg->chunks when sctp_primitive_SEND return err
Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
made sctp_primitive_SEND return err only when asoc state is unavailable.
In this case, chunks are not enqueued, they have no chance to be freed if
we don't take care of them later.

This Patch is actually to revert commit 1cd4d5c432 ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of current msg.

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
66388f2c08 sctp: do not return the transmit err back to sctp_sendmsg
Once a chunk is enqueued successfully, sctp queues can take care of it.
Even if it is failed to transmit (like because of nomem), it should be
put into retransmit queue.

If sctp report this error to users, it confuses them, they may resend
that msg, but actually in kernel sctp stack is in charge of retransmit
it already.

Besides, this error probably is not from the failure of transmitting
current msg, but transmitting or retransmitting another msg's chunks,
as sctp_outq_flush just tries to send out all transports' chunks.

This patch is to make sctp_cmd_send_msg return avoid, and not return the
transmit err back to sctp_sendmsg

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long
2c89791eeb sctp: remove the unnecessary state check in sctp_outq_tail
Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
the asoc's state through statetable before calling sctp_outq_tail. So
there's no need to check the asoc's state again in sctp_outq_tail.

Besides, sctp_do_sm is protected by lock_sock, even if sending msg is
interrupted by timer events, the event's processes still need to acquire
lock_sock first. It means no others CMDs can be enqueue into side effect
list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it.

This patch is to remove redundant asoc->state check from sctp_outq_tail.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Alexei Starovoitov
8d79266bc4 ip6_tunnel: add collect_md mode to IPv6 tunnels
Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels
to operate in 'collect metadata' mode.
Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function
for both collect_md and traditional tunnels.
bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Alexei Starovoitov
cfc7381b30 ip_tunnel: add collect_md mode to IPIP tunnel
Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to
operate in 'collect metadata' mode.
bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away.
ovs can use it as well in the future (once appropriate ovs-vport
abstractions and user apis are added).
Note that just like in other tunnels we cannot cache the dst,
since tunnel_info metadata can be different for every packet.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Julia Lawall
eb94737d71 l2tp: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops
field of a net_device structure.  This field is declared const, so
net_device_ops structures that have this property can be declared as const
also.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device e;
position p;
@@
e.netdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct net_device_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct net_device_ops i = { ... };
// </smpl>

The result of size on this file before the change is:
   text	      data     bss     dec         hex	  filename
   3401        931      44    4376        1118	net/l2tp/l2tp_eth.o

and after the change it is:
   text	     data        bss	    dec	    hex	filename
   3993       347         44       4384    1120	net/l2tp/l2tp_eth.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:07:23 -04:00
Alan Cox
5ff904d55d llc: switch type to bool as the timeout is only tested versus 0
(As asked by Dave in Februrary)

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
Eric Dumazet
3613b3dbd1 tcp: prepare skbs for better sack shifting
With large BDP TCP flows and lossy networks, it is very important
to keep a low number of skbs in the write queue.

RACK and SACK processing can perform a linear scan of it.

We should avoid putting any payload in skb->head, so that SACK
shifting can be done if needed.

With this patch, we allow to pack ~0.5 MB per skb instead of
the 64KB initially cooked at tcp_sendmsg() time.

This gives a reduction of number of skbs in write queue by eight.
tcp_rack_detect_loss() likes this.

We still allow payload in skb->head for first skb put in the queue,
to not impact RPC workloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
phil.turnbull@oracle.com
8ab86c00e3 irda: Free skb on irda_accept error path.
skb is not freed if newsk is NULL. Rework the error path so free_skb is
unconditionally called on function exit.

Fixes: c3ea9fa274 ("[IrDA] af_irda: IRDA_ASSERT cleanups")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:31 -04:00
Eric Dumazet
ffb4d6c850 tcp: fix overflow in __tcp_retransmit_skb()
If a TCP socket gets a large write queue, an overflow can happen
in a test in __tcp_retransmit_skb() preventing all retransmits.

The flow then stalls and resets after timeouts.

Tested:

sysctl -w net.core.wmem_max=1000000000
netperf -H dest -- -s 1000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:30 -04:00
David Howells
8a681c3605 rxrpc: Add config to inject packet loss
Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.

Note that no locking is used, but it shouldn't really matter.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
71f3ca408f rxrpc: Improve skb tracing
Improve sk_buff tracing within AF_RXRPC by the following means:

 (1) Use an enum to note the event type rather than plain integers and use
     an array of event names rather than a big multi ?: list.

 (2) Distinguish Rx from Tx packets and account them separately.  This
     requires the call phase to be tracked so that we know what we might
     find in rxtx_buffer[].

 (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the
     event type.

 (4) A pair of 'rotate' events are added to indicate packets that are about
     to be rotated out of the Rx and Tx windows.

 (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for
     packet loss injection recording.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
ba39f3a0ed rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var
Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one
uses an uninitialised variable.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells
849979051c rxrpc: Add a tracepoint to follow what recvmsg does
Add a tracepoint to follow what recvmsg does within AF_RXRPC.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells
58dc63c998 rxrpc: Add a tracepoint to follow packets in the Rx buffer
Add a tracepoint to follow the life of packets that get added to a call's
receive buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00