Now that we send the pages using a struct msghdr, instead of
using sendpage(), we no longer need to 'prime the socket' with
an address for unconnected UDP messages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the client stream receive code receives an ESHUTDOWN error either
because the server closed the connection, or because it sent a
callback which cannot be processed, then we should shut down
the connection.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the message read completes, but the socket returned an error
condition, we should ensure to propagate that error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A zero length fragment is really a bug, but let's ensure we don't
go nuts when one turns up.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
To ensure that the receive worker has exclusive access to the stream record
info, we must not reset the contents other than when holding the
transport->recv_mutex.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
As reported by Dan Carpenter, this test for acred->cred being set is
inconsistent with the dereference of the pointer a few lines earlier.
An 'auth_cred' *always* has ->cred set - every place that creates one
initializes this field, often as the first thing done.
So remove this test.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Because we clear XPRT_SOCK_DATA_READY before reading, we can end up
with a situation where new data arrives, causing xs_data_ready() to
queue up a second receive worker job for the same socket, which then
immediately gets stuck waiting on the transport receive mutex.
The fix is to only clear XPRT_SOCK_DATA_READY once we're done reading,
and then to use poll() to check if we might need to queue up a new
job in order to deal with any new data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Set memalloc_nofs_save() on all the rpciod/xprtiod jobs so that we
ensure memory allocations for asynchronous rpc calls don't ever end
up recursing back to the NFS layer for memory reclaim.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull more nfsd fixes from Bruce Fields:
"Two small fixes, one for crashes using nfs/krb5 with older enctypes,
one that could prevent clients from reclaiming state after a kernel
upgrade"
* tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux:
sunrpc: fix 4 more call sites that were using stack memory with a scatterlist
Revert "nfsd4: return default lease period"
Pull more NFS client fixes from Anna Schumaker:
"Three fixes this time.
Nicolas's is for xprtrdma completion vector allocation on single-core
systems. Greg's adds an error check when allocating a debugfs dentry.
And Ben's is an additional fix for nfs_page_async_flush() to prevent
pages from accidentally getting truncated.
Summary:
- Make sure Send CQ is allocated on an existing compvec
- Properly check debugfs dentry before using it
- Don't use page_file_mapping() after removing a page"
* tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Don't use page_file_mapping after removing the page
rpc: properly check debugfs dentry before using it
xprtrdma: Make sure Send CQ is allocated on an existing compvec
While trying to reproduce a reported kernel panic on arm64, I discovered
that AUTH_GSS basically doesn't work at all with older enctypes on arm64
systems with CONFIG_VMAP_STACK enabled. It turns out there still a few
places using stack memory with scatterlists, causing krb5_encrypt() and
krb5_decrypt() to produce incorrect results (or a BUG if CONFIG_DEBUG_SG
is enabled).
Tested with cthon on v4.0/v4.1/v4.2 with krb5/krb5i/krb5p using
des3-cbc-sha1 and arcfour-hmac-md5.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
au_rslack is significantly smaller than (au_cslack << 2). Using
that value results in smaller receive buffers. In some cases this
eliminates an extra segment in Reply chunks (RPC/RDMA).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently rpc_inline_rcv_pages() uses au_rslack to estimate the
size of the upper layer reply header. This is fine for auth flavors
where au_verfsize == au_rslack.
However, some auth flavors have more going on. krb5i for example has
two more words after the verifier, and another blob following the
RPC message. The calculation involving au_rslack pushes the upper
layer reply header too far into the rcv_buf.
au_rslack is still valuable: it's the amount of buffer space needed
for the reply, and is used when allocating the reply buffer. We'll
keep that.
But, add a new field that can be used to properly estimate the
location of the upper layer header in each RPC reply, based on the
auth flavor in use.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
au_verfsize will be needed for a non-flavor-specific computation
in a subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Certain NFS results (eg. READLINK) might expect a data payload that
is not an exact multiple of 4 bytes. In this case, XDR encoding
is required to pad that payload so its length on the wire is a
multiple of 4 bytes. The constants that define the maximum size of
each NFS result do not appear to account for this extra word.
In each case where the data payload is to be received into pages:
- 1 word is added to the size of the receive buffer allocated by
call_allocate
- rpc_inline_rcv_pages subtracts 1 word from @hdrsize so that the
extra buffer space falls into the rcv_buf's tail iovec
- If buf->pagelen is word-aligned, an XDR pad is not needed and
is thus removed from the tail
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
prepare_reply_buffer() and its NFSv4 equivalents expose the details
of the RPC header and the auth slack values to upper layer
consumers, creating a layering violation, and duplicating code.
Remedy these issues by adding a new RPC client API that hides those
details from upper layers in a common helper function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The key action of xdr_buf_trim() is that it shortens buf->len, the
length of the xdr_buf's content. The other actions -- shortening the
head, pages, and tail components -- are actually not necessary. In
particular, changing the size of those components can corrupt the
RPC message contained in the buffer. This is an accident waiting to
happen rather than a current bug, as far as we know.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add infrastructure for trace points in the RPC_AUTH_GSS kernel
module, and add a few sample trace points. These report exceptional
or unexpected events, and observe the assignment of GSS sequence
numbers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
- Recover some instruction count because I'm about to introduce a
few xdr_inline_decode call sites
- Replace dprintk() call sites with trace points
- Reduce the hot path so it fits in fewer cachelines
I've also renamed it rpc_decode_header() to match everything else
in the RPC client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Enable distributions to enforce the rejection of ancient and
insecure Kerberos enctypes in the kernel's RPCSEC_GSS
implementation. These are the single-DES encryption types that
were deprecated in 2012 by RFC 6649.
Enctypes that were deprecated more recently (by RFC 8429) remain
fully supported for now because they are still likely to be widely
used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Simo Sorce <simo@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
tsh_size was added to accommodate transports that send a pre-amble
before each RPC message. However, this assumes the pre-amble is
fixed in size, which isn't true for some transports. That makes
tsh_size not very generic.
Also I'd like to make the estimation of RPC send and receive
buffer sizes more precise. tsh_size doesn't currently appear to be
accounted for at all by call_allocate.
Therefore let's just remove the tsh_size concept, and make the only
transports that have a non-zero tsh_size employ a direct approach.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Reduce dprintk noise by removing dprintk() call sites
from hot path that do not report exceptions. These are usually
replaceable with function graph tracing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We don't want READ payloads that are partially in the head iovec and
in the page buffer because this requires pull-up, which can be
expensive.
The NFS/RPC client tries hard to predict the size of the head iovec
so that the incoming READ data payload lands only in the page
vector, but it doesn't always get it right. To help diagnose such
problems, add a trace point in the logic that decodes READ-like
operations that reports whether pull-up is being done.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This can help field troubleshooting without needing the overhead of
a full network capture (ie, tcpdump).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Having access to the controlling rpc_rqst means a trace point in the
XDR code can report:
- the XID
- the task ID and client ID
- the p_name of RPC being processed
Subsequent patches will introduce such trace points.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Post RECV WRs in batches to reduce the hardware doorbell rate per
transport. This helps the RPC-over-RDMA client scale better in
number of transports.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In very rare cases, an NFS READ operation might predict that the
non-payload part of the RPC Call is large. For instance, an
NFSv4 COMPOUND with a large GETATTR result, in combination with a
large Kerberos credential, could push the non-payload part to be
several kilobytes.
If the non-payload part is larger than the connection's inline
threshold, the client is required to provision a Reply chunk. The
current Linux client does not check for this case. There are two
obvious ways to handle it:
a. Provision a Write chunk for the payload and a Reply chunk for
the non-payload part
b. Provision a Reply chunk for the whole RPC Reply
Some testing at a recent NFS bake-a-thon showed that servers can
mostly handle a. but there are some corner cases that do not work
yet. b. already works (it has to, to handle krb5i/p), but could be
somewhat less efficient. However, I expect this scenario to be very
rare -- no-one has reported a problem yet.
So I'm going to implement b. Sometime later I will provide some
patches to help make b. a little more efficient by more carefully
choosing the Reply chunk's segment sizes to ensure the payload is
optimally aligned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
These can result in a lot of log noise, and are able to be triggered
by client misbehavior. Since there are trace points in these
handlers now, there's no need to spam the log.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
CC [M] net/sunrpc/xprtrdma/svc_rdma_transport.o
linux/net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’:
linux/net/sunrpc/xprtrdma/svc_rdma_transport.c:452:19: warning: variable ‘sap’ set but not used [-Wunused-but-set-variable]
struct sockaddr *sap;
^
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the rpc server, When something happens that might be reason to wake
up a thread to do something, what we do is
- modify xpt_flags, sk_sock->flags, xpt_reserved, or
xpt_nr_rqsts to indicate the new situation
- call svc_xprt_enqueue() to decide whether to wake up a thread.
svc_xprt_enqueue may require multiple conditions to be true before
queueing up a thread to handle the xprt. In the SMP case, one of the
other CPU's may have set another required condition, and in that case,
although both CPUs run svc_xprt_enqueue(), it's possible that neither
call sees the writes done by the other CPU in time, and neither one
recognizes that all the required conditions have been set. A socket
could therefore be ignored indefinitely.
Add memory barries to ensure that any svc_xprt_enqueue() call will
always see the conditions changed by other CPUs before deciding to
ignore a socket.
I've never seen this race reported. In the unlikely event it happens,
another event will usually come along and the problem will fix itself.
So I don't think this is worth backporting to stable.
Chuck tried this patch and said "I don't see any performance
regressions, but my server has only a single last-level CPU cache."
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use READ_ONCE() to tell the compiler to not optimse away the read of
xprt->xpt_flags in svc_xprt_release_slot().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Two and a half years ago, the client was changed to use gathered
Send for larger inline messages, in commit 655fec6987 ("xprtrdma:
Use gathered Send for large inline messages"). Several fixes were
required because there are a few in-kernel device drivers whose
max_sge is 3, and these were broken by the change.
Apparently my memory is going, because some time later, I submitted
commit 25fd86eca1 ("svcrdma: Don't overrun the SGE array in
svc_rdma_send_ctxt"), and after that, commit f3c1fd0ee2 ("svcrdma:
Reduce max_send_sges"). These too incorrectly assumed in-kernel
device drivers would have more than a few Send SGEs available.
The fix for the server side is not the same. This is because the
fundamental problem on the server is that, whether or not the client
has provisioned a chunk for the RPC reply, the server must squeeze
even the most complex RPC replies into a single RDMA Send. Failing
in the send path because of Send SGE exhaustion should never be an
option.
Therefore, instead of failing when the send path runs out of SGEs,
switch to using a bounce buffer mechanism to handle RPC replies that
are too complex for the device to send directly. That allows us to
remove the max_sge check to enable drivers with small max_sge to
work again.
Reported-by: Don Dutile <ddutile@redhat.com>
Fixes: 25fd86eca1 ("svcrdma: Don't overrun the SGE array in ...")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When using Kerberos with v4.20, I've observed frequent connection
loss on heavy workloads. I traced it down to the client underrunning
the GSS sequence number window -- NFS servers are required to drop
the RPC with the low sequence number, and also drop the connection
to signal that an RPC was dropped.
Bisected to commit 918f3c1fe8 ("SUNRPC: Improve latency for
interactive tasks").
I've got a one-line workaround for this issue, which is easy to
backport to v4.20 while a more permanent solution is being derived.
Essentially, tk_owner-based sorting is disabled for RPCs that carry
a GSS sequence number.
Fixes: 918f3c1fe8 ("SUNRPC: Improve latency for interactive ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
According to RFC2203, the RPCSEC_GSS sequence numbers are bounded to
an upper limit of MAXSEQ = 0x80000000. Ensure that we handle that
correctly.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The clean up is handled by the caller, rpcrdma_buffer_create(), so this
call to rpcrdma_sendctxs_destroy() leads to a double free.
Fixes: ae72950abf ("xprtrdma: Add data structure to manage RDMA Send arguments")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>