xlog_recover_clear_agi_bucket didn't set the
type to XFS_BLFT_AGI_BUF, so we got a warning during log
replay (or an ASSERT on a debug build).
XFS (md0): Unknown buffer type 0!
XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1
Fix this, as was done in f19b872b for 2 other locations
with the same problem.
cc: <stable@vger.kernel.org> # 3.10 to current
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If the session has an error, then we want to start by recovering the
session, as any SEQUENCE we send is going to fail with a session
error.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In the case where SEQUENCE receives a NFS4ERR_BADSESSION or
NFS4ERR_DEADSESSION error, we just want to report the session as needing
recovery, and then we want to retry the operation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When looking at whether or not our dcache is valid, we really don't care
about the general state of the directory attribute cache. Instead, we
we only care about the state of the change attribute.
This fixes a performance issue when the client is responsible for
changing the directory contents; a number of NFSv4 operations will
atomically update the directory change attribute, but may not return
all the other attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should only care about checking the attributes if the page cache
is marked as dubious (using NFS_INO_REVAL_PAGECACHE) and the
NFS_INO_REVAL_FORCED flag is set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should only care about checking the attributes if the page cache
is marked as dubious (using NFS_INO_REVAL_PAGECACHE) and the
NFS_INO_REVAL_FORCED flag is set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 7cbdb4a286 altered the autofs indirect mount expire to
not hold a spin lock during the expire check.
The direct mount expire needs the same treatment because to
make autofs expires namespace aware may_umount_tree() needs to
to use a similar method to may_umount() when checking if a mount
tree is in use.
This means may_umount_tree() will end up taking the namespace_sem
for the check so the autofs direct mount expire won't be allowed
to hold a spin lock over the check.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If an automount mount is clone(2)ed into a file system that is propagation
private, when it later expires in the originating namespace, subsequent
calls to autofs ->d_automount() for that dentry in the original namespace
will return ELOOP until the mount is umounted in the cloned namespace.
Now that a struct path is available where needed use path_has_submounts()
instead of have_submounts() so we don't get false positives when checking
if a dentry is a mount point or contains mounts in the current namespace.
Link: http://lkml.kernel.org/r/20161011053423.27645.91233.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If an automount mount is clone(2)ed into a file system that is propagation
private, when it later expires in the originating namespace, subsequent
calls to autofs ->d_automount() for that dentry in the original namespace
will return ELOOP until the mount is umounted in the cloned namespace.
Now that a struct path is available where needed use path_is_mountpoint()
instead of d_mountpoint() so we don't get false positives when checking if
a dentry is a mount point in the current namespace.
Link: http://lkml.kernel.org/r/20161011053418.27645.15241.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before commit c3fe493ccd ('ext4: remove unneeded test in
ext4_alloc_file_blocks()') then it was possible for "depth" to be -1
but now, it's not possible that it is negative.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Combination of data=ordered mode and journal_async_commit mount option
is invalid. However the check in parse_options() fails to detect the
case where we simply end up defaulting to data=ordered mode and we
detect the problem only on remount which triggers hard to understand
failure to remount the filesystem.
Fix the checking of mount options to take into account also the default
mode by moving the check somewhat later in the mount sequence.
Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
mb_cache_entry_find_first() and mb_cache_entry_find_next() only return
cache entries with the 'e_reusable' bit set. This should be documented.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
mbcache used several different types to represent the number of entries
in the cache. For consistency within mbcache and with the shrinker API,
always use unsigned long.
This does not change behavior for current mbcache users (ext2 and ext4)
since they limit the entry count to a value which easily fits in an int.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
When mbcache is built as a module, any modules that use it (ext2 and/or
ext4) will depend on its symbols directly, incrementing its reference
count. Therefore, there is no need to do module_get/module_put.
Also note that since the module_get/module_put were in the mbcache
module itself, executing those lines of code was already dependent on
another reference to the mbcache module being held.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Add the layout error payload to the flexfiles layoutreturn private
data, and set up the encoding mechanisms. This is a refactoring in
preparation for adding the layout iostats payload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
mbcache can be a module that is loaded long after startup, when someone
asks to mount an ext2 or ext4 filesystem. Therefore it should not BUG()
if kmem_cache_create() fails, but rather just fail the module load.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
mbcache entries have an 'e_referenced' bit which users can set with
mb_cache_entry_touch() to indicate that an entry should be given another
pass through the LRU list before the shrinker can delete it. However,
mb_cache_shrink() actually would, when seeing an e_referenced entry at
the front of the list (the least-recently used end), place it right at
the front of the list again. The next iteration would then remove the
entry from the list and delete it. Consequently, e_referenced had
essentially no effect, so ext2/ext4 xattr blocks would sometimes not be
reused as often as expected.
Fix this by making the shrinker move e_referenced entries to the back of
the list rather than the front.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Add a callback to allow the flexfiles layout driver to initialise the
layout private payload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Couple conflicts resolved here:
1) In the MACB driver, a bug fix to properly initialize the
RX tail pointer properly overlapped with some changes
to support variable sized rings.
2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
overlapping with a reorganization of the driver to support
ACPI, OF, as well as PCI variants of the chip.
3) In 'net' we had several probe error path bug fixes to the
stmmac driver, meanwhile a lot of this code was cleaned up
and reorganized in 'net-next'.
4) The cls_flower classifier obtained a helper function in
'net-next' called __fl_delete() and this overlapped with
Daniel Borkamann's bug fix to use RCU for object destruction
in 'net'. It also overlapped with Jiri's change to guard
the rhashtable_remove_fast() call with a check against
tc_skip_sw().
5) In mlx4, a revert bug fix in 'net' overlapped with some
unrelated changes in 'net-next'.
6) In geneve, a stale header pointer after pskb_expand_head()
bug fix in 'net' overlapped with a large reorganization of
the same code in 'net-next'. Since the 'net-next' code no
longer had the bug in question, there was nothing to do
other than to simply take the 'net-next' hunks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup to allow layout drivers to attach private data to layoutreturn,
and manage the data.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
On a filesystem with no journal, a symlink longer than about 32
characters (exact length depending on padding for encryption) could not
be followed or read immediately after being created in an encrypted
directory. This happened because when the symlink data went through the
delayed allocation path instead of the journaling path, the symlink was
incorrectly detected as a "fast" symlink rather than a "slow" symlink
until its data was written out.
To fix this, disable delayed allocation for symlinks, since there is
no benefit for delayed allocation anyway.
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there have been no reads or writes to a given mirror since the last
layoutstats update, then don't resend the same data.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the use called stat() on an 'ls -l' workload, and the attribute
cache was successfully revalidate by READDIRPLUS, then we want to
report that back so that the readdir code continues to use
readdirplus.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is little point in setting NFS_INO_ADVISE_RDPLUS in nfs_lookup and
nfs_lookup_revalidate() unless a process is actually doing readdir on the
parent directory.
Furthermore, there is little point in using readdirplus if we're trying
to revalidate a negative dentry.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ben Coddington reports that commit 311324ad17, by adding the function
nfs_dir_mapping_need_revalidate() that checks page cache validity on
each call to nfs_readdir() causes a performance regression when
the directory is being modified.
If the directory is changing while we're iterating through the directory,
POSIX does not require us to invalidate the page cache unless the user
calls rewinddir(). However, we still do want to ensure that we use
readdirplus in order to avoid a load of stat() calls when the user
is doing an 'ls -l' workload.
The fix should be to invalidate the page cache immediately when we're
setting the NFS_INO_ADVISE_RDPLUS bit.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 311324ad17 ("NFS: Be more aggressive in using readdirplus...")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It now has only one field and is only used in one structure.
So replaced it in that structure by the field it contains.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A process can have two possible lock owner for a given open file:
a per-process Posix lock owner and a per-open-file flock owner
Use both of these when searching for a suitable stateid to use.
With this patch, READ/WRITE requests will use the correct stateid
if a flock lock is active.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only time that a lock_context is not immediately available is in
setattr, and now that it has an open_context, it can easily find one
with nfs_get_lock_context.
This removes the need for the on-stack nfs_lockowner.
This change is preparation for correctly support flock stateids.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The open_context can always lead directly to the state, and is always easily
available, so this is a straightforward change.
Doing this makes more information available to _nfs4_do_setattr() for use
in the next patch.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
An open file description (struct file) in a given process can be
associated with two different lock owners.
It can have a Posix lock owner which will be different in each process
that has a fd on the file.
It can have a Flock owner which will be the same in all processes.
When searching for a lock stateid to use, we need to consider both of these
owners
So add a new "flock_owner" to the "nfs_open_context" (of which there
is one for each open file description).
This flock_owner does not need to be reference-counted as there is a
1-1 relation between 'struct file' and nfs open contexts,
and it will never be part of a list of contexts. So there is no need
for a 'flock_context' - just the owner is enough.
The io_count included in the (Posix) lock_context provides no
guarantee that all read-aheads that could use the state have
completed, so not supporting it for flock locks in not a serious
problem. Synchronization between flock and read-ahead can be added
later if needed.
When creating an open_context for a non-openning create call, we don't have
a 'struct file' to pass in, so the lock context gets initialized with
a NULL owner, but this will never be used.
The flock_owner is not used at all in this patch, that will come later.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
this field is not used in any important way and probably should
have been removed by
Commit: 8003d3c4aa ("nfs4: treat lock owners as opaque values")
which removed the pid argument from nfs4_get_lock_state.
Except in unusual and uninteresting cases, two threads with the same
->tgid will have the same ->files pointer, so keeping them both
for comparison brings no benefit.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It's possible that two different servers can return the same (clientid,
verifier) pair purely by coincidence. Both are 64-bit values, but
depending on the server implementation, they can be highly predictable
and collisions may be quite likely, especially when there are lots of
servers.
So, check for this case. If the clientid and verifier both match, then
we actually know they *can't* be the same server, since a new
SETCLIENTID to an already-known server should have changed the verifier.
This helps fix a bug that could cause the client to mount a filesystem
from the wrong server.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Yongcheng Yang <yoyang@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>