We're doing a unnecessary extra lookup of the ino cache's
inode when we already have it (and holding a reference)
during the process of saving the ino cache contents to disk.
Therefore remove this extra lookup.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While running some snashot aware defrag tests I noticed I was panicing every
once and a while in key_search. This is because of the optimization that says
if we find a key at slot 0 it will be at slot 0 all the way down the rest of the
tree. This isn't the case for btrfs_search_old_slot since it will likely replay
changes to a buffer if something has changed since we took our sequence number.
So short circuit this optimization by setting prev_cmp to -1 every time we call
key_search so we will do our normal binary search. With this patch I am no
longer seeing the panics I was seeing before. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While looking at somebodys corruption I became completely convinced that
btrfs_split_item was broken, so I wrote this test to verify that it was working
as it was supposed to. Thankfully it appears to be working as intended, so just
add this test to make sure nobody breaks it in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is not necessary to store the NULL byte in a symlink inline file
extent. There's currently no code that requires the NULL byte to be
present in the extent. This change also doesn't break file format
compatibility nor the send/receive feature.
The VFS also doesn't need the NULL byte to be present in the extent,
as it reads up to inode->i_size bytes (which already excluded the NULL
byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()).
So with this change we save 1 byte per symlink file extent (which is
always inlined in the btree leaf) without losing backward and forward
compatibility.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The fact that btrfs_root_refs() returned 0 for the tree_root caused
bugs in the past, therefore it is set to 1 with this patch and
(hopefully) all affected code is adapted to this change.
I verified this change by temporarily adding WARN_ON() checks
everywhere where btrfs_root_refs() is used, checking whether the
logic of the code is changed by btrfs_root_refs() returning 1
instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
With these added checks, I ran the xfstests './check -g auto'.
The two roots chunk_root and log_root_tree that are only referenced
by the superblock and the log_roots below the log_root_tree still
have btrfs_root_refs() == 0, only the tree_root is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle are:
- (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
van Riel, Peter Zijlstra et al. Yay!
- optimize preemption counter handling: merge the NEED_RESCHED flag
into the preempt_count variable, by Peter Zijlstra.
- wait.h fixes and code reorganization from Peter Zijlstra
- cfs_bandwidth fixes from Ben Segall
- SMP load-balancer cleanups from Peter Zijstra
- idle balancer improvements from Jason Low
- other fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
stop_machine: Fix race between stop_two_cpus() and stop_cpus()
sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
sched: Fix asymmetric scheduling for POWER7
sched: Move completion code from core.c to completion.c
sched: Move wait code from core.c to wait.c
sched: Move wait.c into kernel/sched/
sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
sched: Avoid throttle_cfs_rq() racing with period_timer stopping
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
sched: Remove extra put_online_cpus() inside sched_setaffinity()
sched/rt: Fix task_tick_rt() comment
sched/wait: Fix build breakage
sched/wait: Introduce prepare_to_wait_event()
sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
sched: Remove get_online_cpus() usage
sched: Fix race in migrate_swap_stop()
...
A bit of cleanup plus some gratuitous variable renaming. I think using
structures instead of numeric offsets makes this code much more
understandable.
Also added a comment about current time range expected by
the server.
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Opens on current cifs/smb2/smb3 mounts with O_DIRECT flag fail
even when caching is disabled on the mount. This was
reported by those running SMB2 benchmarks who need to
be able to pass O_DIRECT on many of their open calls to
reduce caching effects, but would also be needed by other
applications.
When mounting with forcedirectio ("cache=none") cifs and smb2/smb3
do not go through the page cache and thus opens with O_DIRECT flag
should work (when posix extensions are negotiated we even are
able to send the flag to the server). This patch fixes that
in a simple way.
The 9P client has a similar situation (caching is often disabled)
and takes the same approach to O_DIRECT support ie works if caching
disabled, but if client caching enabled it fails with EINVAL.
A followon idea for a future patch as Pavel noted, could
be that files opened with O_DIRECT could cause us to change
inode->i_fop on the fly from
cifs_file_strict_ops
to
cifs_file_direct_ops
which would allow us to support this on non-forcedirectio mounts
(cache=strict and cache=loose) as well.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Andrey reported that he was seeing cifs.ko spam the logs with messages
like this:
CIFS VFS: Unexpected lookup error -26
He was listing the root directory of a server and hitting an error when
trying to QUERY_PATH_INFO against hiberfil.sys and pagefile.sys. The
right fix would be to switch the lookup code over to using FIND_FIRST,
but until then we really don't need to report this at a level of
KERN_ERR. Convert this message over to FYI level.
Reported-by: "Andrey Shernyukov" <andreysh@nioch.nsc.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Sometimes, the server will report an error that basically indicates
that it's running out of resources. These include these under SMB1:
NT_STATUS_NO_MEMORY
NT_STATUS_SECTION_TOO_BIG
NT_STATUS_TOO_MANY_PAGING_FILES
...and this one under SMB2:
STATUS_NO_MEMORY
Currently, this gets mapped to ENOMEM by the client, but that's
confusing as an ENOMEM error is typically an indicator that the
client is out of memory.
Change these errors to instead map to EREMOTEIO to indicate that
the problem is actually server-side and not on the client.
Reported-by: "ISHIKAWA,chiaki" <ishikawa@yk.rim.or.jp>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Now we treat any reparse point as a symbolic link and map it to a Unix
one that is not true in a common case due to many reparse point types
supported by SMB servers.
Distinguish reparse point types into two groups:
1) that can be accessed directly through a reparse point
(junctions, deduplicated files, NFS symlinks);
2) that need to be processed manually (Windows symbolic links, DFS);
and map only Windows symbolic links to Unix ones.
Cc: <stable@vger.kernel.org>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reported-and-tested-by: Joao Correia <joaomiguelcorreia@gmail.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
o Changes from v1
Use find_next(_zero)_bit suggested by jg.kim
When f2fs issues discard command, if segment is contiguous,
let's issue more large segment to gather adjacent segments.
** blktrace **
179,1 0 5859 42.619023770 971 C D 131072 + 2097152 [0]
179,1 0 33665 108.840475468 971 C D 2228224 + 2494464 [0]
179,1 0 33671 109.131616427 971 C D 14909440 + 344064 [0]
179,1 0 33677 109.137100677 971 C D 15261696 + 4096 [0]
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull gfs2 updates from Steven Whitehouse:
"The main feature of interest this time is quota updates. There are
some clean ups and some patches to use the new generic lru list code.
There is still plenty of scope for some further changes in due course -
faster lookups of quota structures is very much on the todo list.
Also, a start has been made towards the more tricky issue of using the
generic lru code with glocks, but that will have to be completed in a
subsequent merge window.
The other, more minor feature, is that there have been a number of
performance patches which relate to block allocation. In particular
they will improve performance when the disk is nearly full"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Use generic list_lru for quota
GFS2: Rename quota qd_lru_lock qd_lock
GFS2: Use reflink for quota data cache
GFS2: Use lockref for glocks
GFS2: Protect quota sync generation
GFS2: Inline qd_trylock into gfs2_quota_unlock
GFS2: Make two similar quota code fragments into a function
GFS2: Remove obsolete quota tunable
GFS2: Move gfs2_icbit_munge into quota.c
GFS2: Speed up starting point selection for block allocation
GFS2: Add allocation parameters structure
GFS2: Clean up reservation removal
GFS2: fix dentry leaks
GFS2: new function gfs2_rbm_incr
GFS2: Introduce rbm field bii
GFS2: Do not reset flags on active reservations
GFS2: introduce bi_blocks for optimization
GFS2: optimize rbm_from_block wrt bi_start
GFS2: d_splice_alias() can't return error
We need to break delegations on any operation that changes the set of
links pointing to an inode. Start with unlink.
Such operations also hold the i_mutex on a parent directory. Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client. To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:
acquire locks
...
test for delegation; if found:
take reference on inode
release locks
wait for delegation break
drop reference on inode
retry
It is possible this could never terminate. (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.) But this seems very unlikely.
The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack. We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.
Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Implement NFSv4 delegations at the vfs level using the new FL_DELEG lock
type.
Note nfsd is the only delegation user and is only using read
delegations. Warn on any attempt to set a write delegation for now.
We'll come back to that case later.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now FL_DELEG is just a synonym for FL_LEASE. So this patch doesn't
change behavior.
Next we'll modify break_lease to treat FL_DELEG leases differently, to
account for the fact that NFSv4 delegations should be broken in more
situations than Windows oplocks.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A read delegation is used by NFSv4 as a guarantee that a client can
perform local read opens without informing the server.
The open operation takes the last component of the pathname as an
argument, thus is also a lookup operation, and giving the client the
above guarantee means informing the client before we allow anything that
would change the set of names pointing to the inode.
Therefore, we need to break delegations on rename, link, and unlink.
We also need to prevent new delegations from being acquired while one of
these operations is in progress.
We could add some completely new locking for that purpose, but it's
simpler to use the i_mutex, since that's already taken by all the
operations we care about.
The single exception is rename. So, modify rename to take the i_mutex
on the file that is being renamed.
Also fix up lockdep and Documentation/filesystems/directory-locking to
reflect the change.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_MUTEX_QUOTA is now just being used whenever we want to lock two
non-directories. So the name isn't right. I_MUTEX_NONDIR2 isn't
especially elegant but it's the best I could think of.
Also fix some outdated documentation.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reserve I_MUTEX_PARENT and I_MUTEX_CHILD for locking of actual
directories.
(Also I_MUTEX_QUOTA isn't really a meaningful name for this locking
class any more; fixed in a later patch.)
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Suppose we're given the filehandle for a directory whose closest
ancestor in the dcache is its Nth ancestor.
The main loop in reconnect_path searches for an IS_ROOT ancestor of
target_dir, reconnects that ancestor to its parent, then recommences the
search for an IS_ROOT ancestor from target_dir.
This behavior is quadratic in N. And there's really no need to restart
the search from target_dir each time: once a directory has been looked
up, it won't become IS_ROOT again. So instead of starting from
target_dir each time, we can continue where we left off.
This simplifies the code and improves performance on very deep directory
heirachies. (I can't think of any reason anyone should need heirarchies
a hundred or more deep, but the performance improvement may be valuable
if only to limit damage in case of abuse.)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Also replace 3 easily-confused three-letter acronyms by more helpful
variable names.
Just cleanup, no change in functionality, with one exception: the
dentry_connected() check in the "out_reconnected" case will now only
check the ancestors of the current dentry instead of checking all the
way from target_dir. Since we've already verified connectivity up to
this dentry, that should be sufficient.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note this counter is now being set to 0 on every pass through the loop,
so it no longer serves any useful purpose.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two places here where we could race with a rename or remove:
- We could find the parent, but then be removed or renamed away
from that parent directory before finding our name in that
directory.
- We could find the parent, and find our name in that parent,
but then be renamed or removed before we look ourselves up by
that name in that parent.
In both cases the concurrent rename or remove will take care of
reconnecting the directory that we're currently examining. Our target
directory should then also be connected. Check this and clear
DISCONNECTED in these cases instead of looping around again.
Note: we *do* need to check that this actually happened if we want to be
robust in the face of corrupted filesystems: a corrupted filesystem
could just return a completely wrong parent, and we want to fail with an
error in that case before starting to clear DISCONNECTED on
non-DISCONNECTED filesystems.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Once we've found any connected parent, we know all our parents are
connected--that's true even if there's a concurrent rename. May as well
clear them all at once and be done with it.
Reviewed-by: Cristoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This would indicate a nasty bug in the dcache and has never triggered in
the past 10 years as far as I know.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The DCACHE_NEED_LOOKUP case referred to here was removed with
39e3c9553f "vfs: remove
DCACHE_NEED_LOOKUP".
There are only four real_lookup() callers and all of them pass in an
unhashed dentry just returned from d_alloc.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
DCACHE_DISCONNECTED should not be cleared until we're sure the dentry is
connected all the way up to the root of the filesystem. It *shouldn't*
be cleared as soon as the dentry is connected to a parent. That will
cause bugs at least on exportable filesystems.
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I can't for the life of me see any reason why anyone should care whether
a dentry that is never hooked into the dentry cache would need
DCACHE_DISCONNECTED set.
This originates from 4b936885ab "fs:
improve scalability of pseudo filesystems", which probably just made the
false assumption the DCACHE_DISCONNECTED was meant to be set on anything
not connected to a parent somehow.
So this is just confusing. Ideally the only uses of DCACHE_DISCONNECTED
would be in the filehandle-lookup code, which needs it to ensure
dentries are connected into the dentry tree before use.
I left d_alloc_pseudo there even though it's now equivalent to
__d_alloc(), just on the theory the name is better documentation of its
intended use outside dcache.c.
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Every hashed dentry is either hashed in the dentry_hashtable, or a
superblock's s_anon list.
__d_drop() assumes it can determine which is the case by checking
DCACHE_DISCONNECTED; this is not true.
It is true that when DCACHE_DISCONNECTED is cleared, the dentry is not
only hashed on dentry_hashtable, but is fully connected to its parents
back to the root.
But the converse is *not* true: fs/exportfs/expfs.c:reconnect_path()
attempts to connect a directory (found by filehandle lookup) back to
root by ascending to parents and performing lookups one at a time. It
does not clear DCACHE_DISCONNECTED until it's done, and that is not at
all an atomic process.
In particular, it is possible for DCACHE_DISCONNECTED to be set on a
dentry which is hashed on the dentry_hashtable.
Instead, use IS_ROOT() to check which hash chain a dentry is on. This
*does* work:
Dentries are hashed only by:
- d_obtain_alias, which adds an IS_ROOT() dentry to sb_anon.
- __d_rehash, called by _d_rehash: hashes to the dentry's
parent, and all callers of _d_rehash appear to have d_parent
set to a "real" parent.
- __d_rehash, called by __d_move: rehashes the moved dentry to
hash chain determined by target, and assigns target's d_parent
to its d_parent, before dropping the dentry's d_lock.
Therefore I believe it's safe for a holder of a dentry's d_lock to
assume that it is hashed on sb_anon if and only if IS_ROOT(dentry) is
true.
I believe the incorrect assumption about DCACHE_DISCONNECTED was
originally introduced by ceb5bdc2d2 "fs: dcache per-bucket dcache hash
locking".
Also add a comment while we're here.
Cc: Nick Piggin <npiggin@kernel.dk>
Acked-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Trevor Cordes <trevor@tecnopolis.ca>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Put a type field into struct dentry::d_flags to indicate if the dentry is one
of the following types that relate particularly to pathwalk:
Miss (negative dentry)
Directory
"Automount" directory (defective - no i_op->lookup())
Symlink
Other (regular, socket, fifo, device)
The type field is set to one of the first five types on a dentry by calls to
__d_instantiate() and d_obtain_alias() from information in the inode (if one is
given).
The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
dentry as a negative dentry.
Accessors provided are:
d_set_type(dentry, type)
d_is_directory(dentry)
d_is_autodir(dentry)
d_is_symlink(dentry)
d_is_file(dentry)
d_is_negative(dentry)
d_is_positive(dentry)
A bunch of checks in pathname resolution switched to those.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't abuse anon_inodes.c to host private files needed by aio;
we can bloody well declare a mini-fs of our own instead of
patching up what anon_inodes can create for us.
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>