We are holding ->truncate_mutex, so nobody else can alter our
block pointers. Rechecks/retries were needed back when we
only held the BKL there, and had to cope with write_begin/writepage
and writepage/truncate races. Can't happen anymore...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's a case when an indirect block gets dirtied for no good
reason - when there's a hole starting in the middle of the area
covered by it and spanning past its end, and truncate() is done
precisely to the beginning of the hole.
The block is obviously not modified at all - all removals happen
beyond it. However, the existing code ends up dirtying it just in
case. It's trivial to fix, and while it's not a real bug by any
stretch of the imagination, it makes the damn thing harder to follow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that the block has already been made unreachable from the inode,
so we don't have to worry about ufs_frag_map() walking into something
already freed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Have the caller fetch the block number *and* remove it from wherever
it was. Pass the block number instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We always have 0 < depth2 <= depth in there, so

	if (--depth) {
		if (--depth2)
			A
		B
	} else {
		C	// not using depth2
	}
	D	// not using depth2

is equivalent to

	if (--depth2)
		A with s/depth/depth - 1/
	if (--depth)
		B
	else
		C
	D
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For the calls in __ufs_truncate_blocks() it's just a matter of not
incrementing offsets[0] and not making that call - the immediately
following loop will be executed one extra time and we'll be just
fine. For the recursive call in ufs_trunc_branch() itself, just
assign NULL to offsets if we would be about to make such a call.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of manually checking that the array contains only zeroes,
find the position of the last non-zero entry (in __ufs_truncate(),
where we can conveniently do that) and use it to tell whether there's
any non-zero in the array tail passed to ufs_trunc_...indirect().
The goal of all that clumsiness is to be able to fold these functions
together.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rather than bitslicing the offset just formed as a sum of shifted
indices, pass the array of those indices itself. NULL is used as the
equivalent of "all zeroes" (== free the entire branch).
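Roughly, the interface change looks like this (a sketch with
hypothetical signatures, just to show the shape of it):

	/* before: callee bitslices a flat offset back into indices */
	ufs_trunc_branch(inode, offset, depth, p);

	/* after: caller passes the index array it already has;
	 * NULL == "all zeroes", i.e. free the entire branch */
	ufs_trunc_branch(inode, offsets, depth, p);
	ufs_trunc_branch(inode, NULL, depth, p);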
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
IOW, the distance of the cutoff from the beginning of the branch
(in blocks).
That (and the fact that the block just prior to the cutoff is
guaranteed to be present) allows us to tell whether to free the
triple indirect block just by looking at the offset.
While we are at it, using u64 for the index in the block is wrong -
those should be unsigned int.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use ufs_block_to_path() to find the cutoff path in the block pointers' tree.
For now just use the information about the depth (to bypass the fully
preserved subtrees); subsequent commits will use the information about
the actual path.
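Sketched (assuming ufs_block_to_path() follows the usual
*_block_to_path() shape - returning the depth and filling an index
array - and with DIRECT_BLOCK standing for the first block past the
new size):

	unsigned offsets[4];	/* indices along the cutoff path */
	int depth = ufs_block_to_path(inode, DIRECT_BLOCK, offsets);

	/* subtrees lying entirely to the left of the path in
	 * offsets[] are fully preserved and can be bypassed outright */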
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The type makes no sense - those are indices in block number arrays,
not block numbers. And no, UFS is not likely to grow indirect blocks
with 4G pointers in them...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is closely tied to block pointer handling there, can benefit
from the existing helpers, etc. - there's no point keeping them apart.
Trimmed the trailing whitespace in inode.c at the same time.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently - on lock_ufs(), eventually - on per-inode mutex.
lock_ufs() used to be mere BKL, which is much weaker, so it needed
those rechecks. BKL doesn't provide any exclusion once we lose CPU;
its blind replacement, OTOH, _does_. Making that per-filesystem was
an atrocity, but at least we can simplify life here. And yes, we
certainly need to make that sucker per-inode - these days inode.c and
truncate.c uses are needed only to protect the block pointers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There were 3 remaining users; in two of them we took ->s_lock immediately
after lock_ufs() and held it until just before unlock_ufs(); the third
one (statfs) could not be called from itself or from the other two
(remount and sync_fs). Just use ->s_lock in statfs and don't bother
with lock_ufs at all.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* stores to block pointers are under per-inode seqlock (meta_lock) and
mutex (truncate_mutex)
* fetches of block pointers are either under truncate_mutex, or wrapped
into seqretry loop on meta_lock
* all changes of ->i_size are under truncate_mutex and i_mutex
* all changes of ->i_lastfrag are under truncate_mutex
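For illustration, the reader side of the second rule is the usual
seqretry pattern (a minimal sketch, not the actual fs/ufs code; ufsi
is the ufs_inode_info, p points into the block pointer array):

	u64 block;
	unsigned seq;

	do {
		seq = read_seqbegin(&ufsi->meta_lock);
		block = ufs_data_ptr_to_cpu(sb, p);	/* fetch pointer */
	} while (read_seqretry(&ufsi->meta_lock, seq));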
It's similar to what ext2 is doing; the main difference is that unlike
ext2 we can't rely upon the atomicity of stores into block pointers -
on UFS2 they are 64-bit. So we can't cut the corner when switching
a pointer from NULL to non-NULL as we could in ext2_splice_branch(),
and need to use meta_lock on all modifications.
We use a seqlock where ext2 uses an rwlock; ext2 could probably also
benefit from such a change...
Another non-trivial difference is that with UFS we *cannot* have the
reader grab truncate_mutex in case of a race - it has to keep
retrying. That might be possible to change, but not until we lift
tail unpacking several levels up in the call chain.
After that commit we do *NOT* hold fs-wide serialization on accesses
to block pointers anymore. Moreover, lock_ufs() can become a normal
mutex now - it's only used on statfs, remount and sync_fs and none
of those uses are recursive. As a matter of fact, *now* it can be
collapsed with ->s_lock, and eventually be replaced with saner
per-cylinder-group spinlocks, but that's a separate story.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Right now it doesn't matter (lock_ufs() serializes everything),
but when we switch to per-inode locking, it will be needed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Broken in "[PATCH] ufs: truncate should allocate block for last byte",
all the way back in 2006. ufs_setattr() hadn't been the only user of
vmtruncate(), and eliminating the ->truncate() method required
corrections in a bunch of places. Eventually those places migrated
into the ->write_begin() failure exit and ->write_end() after a short
copy...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) move it inside ufs_truncate()
b) ufs_free_inode() doesn't need it - it's serialized on ->s_lock
c) ufs_write_inode() doesn't need it either (and can be called without
it anyway).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull ext4 bugfixes from Ted Ts'o:
"Bug fixes (all for stable kernels) for ext4:
- address corner cases for indirect blocks->extent migration
- fix reserved block accounting in invalidatepage when
page_size != block_size (i.e., on ppc or 1k block size file systems)
- fix deadlocks when a memcg is under heavy memory pressure
- fix fencepost error in lazytime optimization"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4: correctly migrate a file with a hole at the beginning
ext4: be more strict when migrating to non-extent based file
ext4: fix reservation release on invalidatepage for delalloc fs
ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
bufferhead: Add _gfp version for sb_getblk()
ext4: fix fencepost error in lazytime optimization
Ensure that the calls to renew_lease() in open_done() etc. only apply
to session-less versions of NFSv4.x (i.e. NFSv4.0).
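The shape of the fix, as a sketch (assuming the guard simply wraps the
existing renew_lease() calls; nfs4_has_session() is the usual NFSv4.1+
predicate):

	/* only sessionless NFSv4.0 renews the lease implicitly on
	 * OPEN; with sessions, the SEQUENCE op handles lease renewal */
	if (!nfs4_has_session(server->nfs_client))
		renew_lease(server, timestamp);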
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of just kicking off lease recovery, we should look into the
sequence flag errors and handle them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
RFC5661 states:
The server has encountered an unrecoverable fault with the
backchannel (e.g., it has lost track of the sequence ID for a slot
in the backchannel). The client MUST stop sending more requests
on the session's fore channel, wait for all outstanding requests
to complete on the fore and back channel, and then destroy the
session.
Ensure we do so...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Try to handle this for now by invalidating all outstanding layouts for this
server and then testing all the open+lock+delegation stateids.
At some later stage, we may want to optimise by separating out the testing of
delegation stateids only, and adding testing of layout stateids.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the server tells us that only some state has been revoked, then we
need to run the full TEST_STATEID dog and pony show in order to discover
which locks and delegations are still OK. Currently we blow away all
state, which means that we lose all locks!
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
ext4_free_blocks() is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open-coded loop and replace it with __GFP_NOFAIL. Without
the flag the allocator has no way to find out about the never-fail
requirement and cannot help in any way.
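The shape of the change (a sketch, not the verbatim diff):

	/* before: open-coded never-fail retry loop */
	do {
		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
					     GFP_NOFS);
		if (!new_entry)
			cond_resched();
	} while (!new_entry);

	/* after: let the allocator know the request must not fail */
	new_entry = kmem_cache_alloc(ext4_free_data_cachep,
				     GFP_NOFS | __GFP_NOFAIL);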
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
The brd driver is currently the only in-tree driver that may sleep
in its ->direct_access method. After some discussion on linux-fsdevel,
we decided that any driver may choose to sleep there. To ensure that
all callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().
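The annotation itself is tiny (sketched from the description; the
exact signature may differ):

	long bdev_direct_access(struct block_device *bdev, sector_t sector,
				void **addr, unsigned long *pfn, long size)
	{
		might_sleep();	/* ->direct_access is allowed to sleep */
		...
	}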
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a block device supports the ->direct_access method, bypass the
normal DIO path and use DAX to go straight to memcpy() instead of
allocating a DIO and a BIO.
Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().
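Conceptually the dispatch looks like this (a sketch along the lines of
the description above, not the exact patch):

	static ssize_t
	blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
			 loff_t offset)
	{
		struct inode *inode = iocb->ki_filp->f_mapping->host;

		if (IS_DAX(inode))	/* device has ->direct_access */
			return dax_do_io(iocb, inode, iter, offset,
					 blkdev_get_block, NULL,
					 DIO_SKIP_DIO_COUNT);
		return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
					    iter, offset, blkdev_get_block,
					    NULL, NULL, DIO_SKIP_DIO_COUNT);
	}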
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When userspace does a write, there's no need for the written data to
pollute the CPU cache. This matches the original XIP code.
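In the dax_io() write path this amounts to a one-liner (sketch):

	/* non-temporal copy: don't drag freshly written data into
	 * the CPU cache */
	len = copy_from_iter_nocache(addr, max - pos, iter);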
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull scheduler fixes from Ingo Molnar:
"Debug info and other statistics fixes and related enhancements"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Fix numa balancing stats in /proc/pid/sched
sched/numa: Show numa_group ID in /proc/sched_debug task listings
sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
sched/stat: Simplify the sched_info accounting dependency
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file. This causes the
migration to be done incorrectly, and then if there is a subsequent
delayed allocation write to the "hole", it reclaims the same data
blocks again and results in fs corruption.
# assuming 4k block size ext4, with delalloc enabled
# skip the first block and write to the second block
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
# converting to indirect-mapped file, which would move the data blocks
# to the beginning of the file, but extent status cache still marks
# that region as a hole
chattr -e /mnt/ext4/testfile
# delayed allocation writes to the "hole", reclaim the same data block
# again, results in i_blocks corruption
xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
umount /mnt/ext4
e2fsck -nf /dev/sda6
...
Inode 53, i_blocks is 16, should be 8. Fix? no
...
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:
a) delayed allocated extents could bypass the check on eh->eh_entries
and eh->eh_depth
This can be demonstrated by this script
xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfile
where testfile has two extents but is still converted to the
non-extent based file format.
b) only the extent length is checked but not the offset, which would
result in data loss (delalloc) or fs corruption (nodelalloc), because
a non-extent based file only supports at most (12 + 2^10 + 2^20 +
2^30) blocks
This can be demonstrated by
xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfile
sync
If delalloc is enabled, dmesg prints
EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
EXT4-fs (dm-4): This should not happen!! Data will be lost
If delalloc is disabled, e2fsck -nf shows corruption
Inode 53, i_size is 5497558142976, should be 4096. Fix? no
Fix the two issues by
a) forcing all delayed allocation blocks to be allocated before checking
eh->eh_depth and eh->eh_entries
b) ensuring that the last logical block of the extent is within the
direct map limit
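A sketch of the two checks (a hypothetical simplification, not the
verbatim patch):

	/* a) settle delalloc first, so that eh_depth/eh_entries
	 * reflect reality */
	ret = filemap_write_and_wait(inode->i_mapping);
	if (ret)
		return ret;

	/* b) the extent must end within what the indirect block map
	 * can address; max_indirect_block is a hypothetical name for
	 * 12 + 2^10 + 2^20 + 2^30 */
	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex) - 1;
	if (end >= max_indirect_block)
		return -EOPNOTSUPP;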
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
On a delalloc enabled file system, on an invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the
extent status tree.
However, currently there is a bug where on systems with page size >
block size we will always remove extents from the start of the page
regardless of where the actual delayed buffers are positioned in the
page. This leads to errors like this:
EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks
This can however cause data loss at writeback time if the file system
is in an ENOSPC condition, because we would be releasing the
reservation for someone else's delayed buffer.
Fix this by only removing extents that correspond to the part of the
page we want to invalidate.
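The key change is to start at the first block actually inside the
invalidated range instead of at the start of the page (a sketch using
the usual ext4 names, not the exact patch):

	/* first logical block covered by the invalidated range */
	lblk = (page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits)) +
	       (offset >> inode->i_blkbits);
	ext4_es_remove_extent(inode, lblk, to_release);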
This problem is reproducible with the following fio recipe (however I
was only able to reproduce it with fio-2.1 or older).
[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org