Commit Graph

41956 Commits

Author SHA1 Message Date
Al Viro
5a39c25562 ufs_inode_get{frag,block}(): get rid of retries
We are holding ->truncate_mutex, so nobody else can alter our
block pointers.  Rechecks/retries were needed back when we
only held BKL there, and had to cope with write_begin/writepage
and writepage/truncate races.  Can't happen anymore...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:52 -04:00
Al Viro
f53bd1421b __ufs_truncate_blocks(): avoid excessive dirtying of indirect blocks
There's a case when an indirect block gets dirtied for no good
reason - when there's a hole starting in the middle of area
covered by it and spanning past its end, and truncate() is done
precisely to the beginning of the hole.

The block is obviously not modified at all - all removals happen
beyond it.  However, existing code ends up dirtying it just in
case.  It's trivial to fix and while it's not a real bug by any
stretch of imagination, it makes the damn thing harder to follow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:51 -04:00
Al Viro
cc7231e309 free_full_branch(): don't bother modifying the block we are going to free
Note that it's already made unreachable from the inode, so we don't have
to worry about ufs_frag_map() walking into something already freed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:50 -04:00
Al Viro
b6eede0ec6 move marking inode dirty to the end of __ufs_truncate_blocks()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:49 -04:00
Al Viro
163073db51 free_full_branch(): saner calling conventions
Have caller fetch the block number *and* remove it from wherever
it was.  Pass the block number instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:48 -04:00
Al Viro
7b4e4f7f81 ufs_trunc_branch(): kill recursion
turn recursion into a pair of loops

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:47 -04:00
Al Viro
6aab6dd379 ufs_trunc_branch(): massage towards killing recursion
We always have 0 < depth2 <= depth in there, so
if (--depth) {
	if (--depth2)
		A
	B
} else {
	C // not using depth2
}
D // not using depth2

is equivalent to

if (--depth2)
	A with s/depth/depth - 1/
if (--depth)
	B
else
	C
D

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:46 -04:00
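The equivalence claimed in 6aab6dd379 can be checked mechanically. The sketch below is not kernel code: A, B, C and D are replaced by a hypothetical emit() helper that records which step ran and the value of depth it saw, and both forms are compared over the stated precondition 0 < depth2 <= depth.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical trace buffer standing in for the side effects of A, B, C, D. */
struct trace { char buf[64]; };

/* Record that step `tag` ran and which value of depth it observed. */
static void emit(struct trace *t, char tag, int depth)
{
    size_t n = strlen(t->buf);
    snprintf(t->buf + n, sizeof(t->buf) - n, "%c%d ", tag, depth);
}

/* The nested form from the commit message. */
static void nested_form(int depth, int depth2, struct trace *t)
{
    if (--depth) {
        if (--depth2)
            emit(t, 'A', depth);
        emit(t, 'B', depth);
    } else {
        emit(t, 'C', depth);        /* not using depth2 */
    }
    emit(t, 'D', depth);            /* not using depth2 */
}

/* The flattened form; A runs before --depth, hence depth - 1. */
static void flat_form(int depth, int depth2, struct trace *t)
{
    if (--depth2)
        emit(t, 'A', depth - 1);
    if (--depth)
        emit(t, 'B', depth);
    else
        emit(t, 'C', depth);
    emit(t, 'D', depth);
}

/* Returns 1 if both forms leave the same trace for one (depth, depth2) pair. */
int forms_agree_at(int depth, int depth2)
{
    struct trace a = { "" }, b = { "" };

    nested_form(depth, depth2, &a);
    flat_form(depth, depth2, &b);
    return strcmp(a.buf, b.buf) == 0;
}

/* Returns 1 if the forms agree for every 0 < depth2 <= depth <= max_depth. */
int forms_agree_up_to(int max_depth)
{
    for (int depth = 1; depth <= max_depth; depth++)
        for (int depth2 = 1; depth2 <= depth; depth2++)
            if (!forms_agree_at(depth, depth2))
                return 0;
    return 1;
}
```

The precondition matters: with depth2 > depth (for example depth = 1, depth2 = 2) the flattened form runs A while the nested form does not, so the rewrite is only safe because callers guarantee 0 < depth2 <= depth.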
Al Viro
6d1ebbca2b split ufs_truncate_branch() into full- and partial-branch variants
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:45 -04:00
Al Viro
a138b4b688 ufs: unify the logics for collecting adjacent data blocks to free
open-coded in several places...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:44 -04:00
Al Viro
a96574233c ufs_trunc_branch(): separate the calls with non-NULL offsets
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:43 -04:00
Al Viro
97e0f8f87c ufs_trunc_branch(): never call with offsets != NULL && depth2 == 0
For calls in __ufs_truncate_blocks() it's just a matter of not
incrementing offsets[0] and not making that call - immediately
following loop will be executed one extra time and we'll be just
fine.  For recursive call in ufs_trunc_branch() itself, just
assign NULL to offsets if we would be about to make such call.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:42 -04:00
Al Viro
42432739b5 __ufs_trunc_blocks(): turn the part after switch into a loop
... and turn the switch into if (), since all cases with
depth != 1 have just become identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:41 -04:00
Al Viro
ef3a315d4c __ufs_truncate_blocks(): unify freeing the full branches
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:40 -04:00
Al Viro
9e0fbbde27 unify ufs_trunc_..indirect()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:39 -04:00
Al Viro
6775e24d9c ufs_trunc_..indirect(): more massage towards unifying
Instead of manually checking that the array contains only zeroes,
find the position of the last non-zero (in __ufs_truncate(), where
we can conveniently do that) and use that to tell if there's
any non-zero in the array tail passed to ufs_trunc_...indirect().

The goal of all that clumsiness is to fold these functions
together.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:38 -04:00
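The idea in 6775e24d9c, replacing repeated "is the tail all zeroes?" scans with one precomputed position, can be sketched as follows. This is an illustration only; last_nonzero() and tail_has_nonzero() are hypothetical names, not the helpers used in fs/ufs.

```c
/*
 * Index of the last non-zero entry in offsets[0..depth-1],
 * or -1 if every entry is zero.  Computed once by the caller.
 */
int last_nonzero(const unsigned *offsets, int depth)
{
    for (int i = depth - 1; i >= 0; i--)
        if (offsets[i])
            return i;
    return -1;
}

/*
 * Does the tail offsets[from..depth-1] contain any non-zero entry?
 * With the position above this is a single comparison instead of a rescan.
 */
int tail_has_nonzero(int last, int from)
{
    return from <= last;
}
```

For a path such as {3, 0, 5, 0} the last non-zero index is 2, so the tail starting at index 1 still contains a non-zero entry while the tail starting at index 3 does not.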
Al Viro
85416288bf ufs_trunc_...indirect(): pass the array of indices instead of offsets
rather than bitslicing the offset just formed as sum of shifted indices,
pass the array of those indices itself.  NULL is used as equivalent
of "all zeroes" (== free the entire branch).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:37 -04:00
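To see why 85416288bf can pass the index array directly, note that summing shifted indices and then bitslicing the sum is a round trip that recovers the same indices. A minimal sketch, assuming every level uses the same hypothetical `shift` (bits per index) and every index fits in that many bits; the function names are illustrative, not taken from fs/ufs:

```c
#include <stdint.h>

/* Form one combined offset as the sum of shifted indices, top level first. */
uint64_t compose_offset(const unsigned *idx, int depth, unsigned shift)
{
    uint64_t off = 0;

    for (int i = 0; i < depth; i++)
        off = (off << shift) | idx[i];
    return off;
}

/* Bitslice the combined offset back into one index per level. */
void slice_offset(uint64_t off, int depth, unsigned shift, unsigned *idx)
{
    uint64_t mask = ((uint64_t)1 << shift) - 1;

    for (int i = depth - 1; i >= 0; i--) {
        idx[i] = (unsigned)(off & mask);
        off >>= shift;
    }
}
```

Since slice_offset(compose_offset(idx)) gives back idx, the callee learns nothing from the combined form that the array did not already say, and an all-zero array can be represented by NULL as the commit message describes.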
Al Viro
7a4fdda724 __ufs_truncate(); find cutoff distances into branches by offsets[] array
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:36 -04:00
Al Viro
7bad5939fc ufs_trunc_dindirect(): pass the number of blocks to keep
same as the previous two.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:35 -04:00
Al Viro
6ac36b8777 ufs_trunc_indirect(): pass the index of the first pointer to free
... instead of file offset.  Same cleanups as in the tindirect
conversion in previous commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:34 -04:00
Al Viro
18ca51d821 ufs_trunc_tindirect(): pass the number of blocks to keep
IOW, the distance of cutoff from the beginning of the branch
(in blocks).

That (and the fact that block just prior to cutoff is guaranteed to
be present) allows us to tell whether to free triple indirect block
just by looking at the offset.

While we are at it, using u64 for index in the block is wrong -
those should be unsigned int.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:33 -04:00
Al Viro
31cd043e1a ufs: beginning of __ufs_truncate_block() massage
Use ufs_block_to_path() to find the cutoff path in the block pointers' tree.
For now just use the information about the depth (to bypass the fully
preserved subtrees); subsequent commits will use the information about actual
path.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:32 -04:00
Al Viro
4e3911f3d7 ufs: the offsets ufs_block_to_path() puts into array are not sector_t
type makes no sense - those are indices in block number arrays, not
block numbers.  And no, UFS is not likely to grow indirect blocks with
4Gpointers in them...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:31 -04:00
Al Viro
010d331fc3 ufs: move truncate code into inode.c
It is closely tied to block pointers handling there, can benefit
from existing helpers, etc. - no point keeping them apart.

Trimmed the trailing whitespaces in inode.c at the same time.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:30 -04:00
Al Viro
0d23cf7616 ufs: no retries are needed on truncate
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:29 -04:00
Al Viro
687857930d ufs: ufs_trunc_...() has exclusion with everything that might cause allocations
Currently - on lock_ufs(), eventually - on per-inode mutex.
lock_ufs() used to be mere BKL, which is much weaker, so it needed
those rechecks.  BKL doesn't provide any exclusion once we lose CPU;
its blind replacement, OTOH, _does_.  Making that per-filesystem was
an atrocity, but at least we can simplify life here.  And yes, we
certainly need to make that sucker per-inode - these days inode.c and
truncate.c uses are needed only to protect the block pointers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:28 -04:00
Al Viro
6a799d3514 ufs: ufs_trunc_direct() always returns 0
make it return void

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:27 -04:00
Al Viro
dff7cfd36e ufs: kill lock_ufs()
There were 3 remaining users; in two of them we took ->s_lock immediately
after lock_ufs() and held it until just before unlock_ufs(); the third
one (statfs) could not be called from itself or from other two (remount
and sync_fs).  Just use ->s_lock in statfs and don't bother with lock_ufs
at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:26 -04:00
Al Viro
724bb09fdc ufs: don't use lock_ufs() for block pointers tree protection
* stores to block pointers are under per-inode seqlock (meta_lock) and
mutex (truncate_mutex)
* fetches of block pointers are either under truncate_mutex, or wrapped
into seqretry loop on meta_lock
* all changes of ->i_size are under truncate_mutex and i_mutex
* all changes of ->i_lastfrag are under truncate_mutex

It's similar to what ext2 is doing; the main difference is that unlike
ext2 we can't rely upon the atomicity of stores into block pointers -
on UFS2 they are 64bit.  So we can't cut the corner when switching
a pointer from NULL to non-NULL as we could in ext2_splice_branch()
and need to use meta_lock on all modifications.

We use seqlock where ext2 uses rwlock; ext2 could probably also benefit
from such change...

Another non-trivial difference is that with UFS we *cannot* have reader
grab truncate_mutex in case of race - it has to keep retrying.  That
might be possible to change, but not until we lift tail unpacking
several levels up in call chain.

After that commit we do *NOT* hold fs-wide serialization on accesses
to block pointers anymore.  Moreover, lock_ufs() can become a normal
mutex now - it's only used on statfs, remount and sync_fs and none
of those uses are recursive.  As a matter of fact, *now* it can be
collapsed with ->s_lock, and be eventually replaced with saner
per-cylinder-group spinlocks, but that's a separate story.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:25 -04:00
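The locking rules in 724bb09fdc (stores under a seqlock, lockless fetches wrapped in a retry loop) follow the usual sequence-counter pattern. The sketch below is a userspace analogue written with C11 atomics, not the kernel seqlock API; struct guarded_ptr and both functions are hypothetical. The 64-bit value is deliberately stored as two 32-bit halves to model a pointer store that is not atomic as a whole.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one 64-bit block pointer plus its sequence counter. */
struct guarded_ptr {
    atomic_uint seq;            /* even: stable; odd: a store is in progress */
    _Atomic uint32_t lo;        /* low half of the pointer */
    _Atomic uint32_t hi;        /* high half of the pointer */
};

/*
 * Writer side.  Callers must already be serialized against each other
 * (the role truncate_mutex plays in the commit message).
 */
void guarded_store(struct guarded_ptr *p, uint64_t v)
{
    unsigned s = atomic_load(&p->seq);

    atomic_store(&p->seq, s + 1);               /* mark "in progress" */
    atomic_store(&p->lo, (uint32_t)v);
    atomic_store(&p->hi, (uint32_t)(v >> 32));
    atomic_store(&p->seq, s + 2);               /* publish the new value */
}

/*
 * Reader side: retry until the same even sequence number is seen
 * both before and after reading the two halves.
 */
uint64_t guarded_fetch(struct guarded_ptr *p)
{
    unsigned before, after;
    uint32_t lo, hi;

    do {
        before = atomic_load(&p->seq);
        lo = atomic_load(&p->lo);
        hi = atomic_load(&p->hi);
        after = atomic_load(&p->seq);
    } while ((before & 1) || before != after);

    return ((uint64_t)hi << 32) | lo;
}
```

A reader that overlaps a store sees either an odd counter or a changed one and simply retries, which matches the remark that a racing reader "has to keep retrying" rather than falling back to the mutex. All operations here use the default sequentially consistent ordering for simplicity; real implementations use weaker ordering with explicit barriers.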
Al Viro
4af7b2c080 ufs: bforget() indirect blocks before freeing them
right now it doesn't matter (lock_ufs() serializes everything),
but when we switch to per-inode locking, it will be needed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:24 -04:00
Al Viro
493b4537a2 ufs: move lock_ufs() down into __ufs_truncate_blocks()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:23 -04:00
Al Viro
2401aa29ab ufs: move truncate_setsize() down into ufs_truncate()
just prior to __ufs_truncate_blocks(), with matching change of calling
conventions

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:22 -04:00
Al Viro
3b7a3a05e8 ufs: free excessive blocks upon ->write_begin() failure/short copy
Broken in "[PATCH] ufs: truncate should allocate block for last byte";
all the way back in 2006.  ufs_setattr() hadn't been the only user of
vmtruncate() and eliminating ->truncate() method required corrections
in a bunch of places.  Eventually those places had migrated into
->write_begin() failure exit and ->write_end() after short copy...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:21 -04:00
Al Viro
d622f167b8 ufs: switch ufs_evict_inode() to trimmed-down variant of ufs_truncate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:20 -04:00
Al Viro
f3e0f3da1b ufs: kill more lock_ufs() calls
a) move it inside ufs_truncate()
b) ufs_free_inode() doesn't need it - it's serialized on ->s_lock
c) ufs_write_inode() doesn't need it either (and can be called without
it anyway).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-06 17:39:19 -04:00
Linus Torvalds
1c4c7159ed Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes (all for stable kernels) for ext4:

   - address corner cases for indirect blocks->extent migration

   - fix reserved block accounting on invalidate_page when
     page_size != block_size (i.e., ppc or 1k block size file systems)

   - fix deadlocks when a memcg is under heavy memory pressure

   - fix fencepost error in lazytime optimization"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: replace open coded nofail allocation in ext4_free_blocks()
  ext4: correctly migrate a file with a hole at the beginning
  ext4: be more strict when migrating to non-extent based file
  ext4: fix reservation release on invalidatepage for delalloc fs
  ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
  bufferhead: Add _gfp version for sb_getblk()
  ext4: fix fencepost error in lazytime optimization
2015-07-05 16:24:54 -07:00
Trond Myklebust
be824167e3 NFSv4: Leases are renewed in sequence_done when we have sessions
Ensure that the calls to renew_lease() in open_done() etc. only apply
to session-less versions of NFSv4.x (i.e. NFSv4.0).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05 15:50:19 -04:00
Trond Myklebust
b15c7cdde4 NFSv4.1: nfs41_sequence_done should handle sequence flag errors
Instead of just kicking off lease recovery, we should look into the
sequence flag errors and handle them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05 15:50:19 -04:00
Trond Myklebust
b13529059c NFSv4.1: Handle SEQ4_STATUS_BACKCHANNEL_FAULT correctly
RFC5661 states:

      The server has encountered an unrecoverable fault with the
      backchannel (e.g., it has lost track of the sequence ID for a slot
      in the backchannel).  The client MUST stop sending more requests
      on the session's fore channel, wait for all outstanding requests
      to complete on the fore and back channel, and then destroy the
      session.

Ensure we do so...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05 15:50:18 -04:00
Trond Myklebust
4099287feb NFSv4.1: Handle SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit correctly
Try to handle this for now by invalidating all outstanding layouts for this
server and then testing all the open+lock+delegation stateids.
At some later stage, we may want to optimise by separating out the testing of
delegation stateids only, and adding testing of layout stateids.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05 15:50:18 -04:00
Trond Myklebust
8b895ce652 NFSv4.1: Handle SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED status bit correctly.
If the server tells us that only some state has been revoked, then we
need to run the full TEST_STATEID dog and pony show in order to discover
which locks and delegations are still OK. Currently we blow away all
state, which means that we lose all locks!

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05 15:46:38 -04:00
Michal Hocko
7444a072c3 ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out about the never-fail requirement
and cannot help in any way.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-05 12:33:44 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Matthew Wilcox
43c3dd08da dax: bdev_direct_access() may sleep
The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method.  To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:57 -04:00
Matthew Wilcox
bbab37ddc2 block: Add support for DAX reads/writes to block devices
If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:57 -04:00
Matthew Wilcox
872eb127e3 dax: Use copy_from_iter_nocache
When userspace does a write, there's no need for the written data to
pollute the CPU cache.  This matches the original XIP code.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04 15:56:56 -04:00
Linus Torvalds
22a093b2fb Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Debug info and other statistics fixes and related enhancements"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix numa balancing stats in /proc/pid/sched
  sched/numa: Show numa_group ID in /proc/sched_debug task listings
  sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
  sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
  sched/stat: Simplify the sched_info accounting dependency
2015-07-04 08:56:53 -07:00
Naveen N. Rao
5968cecedd sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
Expand /proc/pid/schedstat output:

 - enable it on CONFIG_TASK_DELAY_ACCT=y && !CONFIG_SCHEDSTATS kernels.

 - dump all zeroes on kernels that are booted with the 'nodelayacct'
   option, which boot option disables delay accounting on
   CONFIG_TASK_DELAY_ACCT=y kernels.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: a.p.zijlstra@chello.nl
Cc: ricklind@us.ibm.com
Link: http://lkml.kernel.org/r/5ccbef17d4bc841084ea6e6421d4e4a23b7b806f.1435654789.git.naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-04 10:04:31 +02:00
Eryu Guan
8974fec7d7 ext4: correctly migrate a file with a hole at the beginning
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file.  This caused the migration
to be done incorrectly; a subsequent delayed allocation write to the
"hole" would then reclaim the same data blocks again, resulting in fs
corruption.

  # assuming 4k block size ext4, with delalloc enabled
  # skip the first block and write to the second block
  xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

  # converting to indirect-mapped file, which would move the data blocks
  # to the beginning of the file, but extent status cache still marks
  # that region as a hole
  chattr -e /mnt/ext4/testfile

  # delayed allocation writes to the "hole", reclaim the same data block
  # again, results in i_blocks corruption
  xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
  umount /mnt/ext4
  e2fsck -nf /dev/sda6
  ...
  Inode 53, i_blocks is 16, should be 8.  Fix? no
  ...

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-04 00:03:44 -04:00
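The e2fsck numbers in 8974fec7d7 are consistent with one 4k data block being accounted twice, assuming i_blocks is counted in 512-byte units (the usual convention; that reading of the output is an inference, not something the commit message states). A small arithmetic sketch with a hypothetical helper:

```c
#include <stdint.h>

/* Number of 512-byte units occupied by `fs_blocks` blocks of `block_size` bytes. */
uint64_t i_blocks_units(uint64_t fs_blocks, uint64_t block_size)
{
    return fs_blocks * (block_size / 512);
}
```

One 4k block gives 8 units and two give 16, matching "i_blocks is 16, should be 8" for a file that really holds a single data block.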
Eryu Guan
d6f123a929 ext4: be more strict when migrating to non-extent based file
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:

a) delayed allocated extents could bypass the check on eh->eh_entries
   and eh->eh_depth

This can be demonstrated by this script

  xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile

where testfile has two extents but is still converted to a non-extent
based file format.

b) only extent length is checked but not the offset, which would result
   in data loss (delalloc) or fs corruption (nodelalloc), because
   non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
   blocks

This can be demonstrated by

  xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile
  sync

If delalloc is enabled, dmesg prints
  EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
  EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
  EXT4-fs (dm-4): This should not happen!! Data will be lost

If delalloc is disabled, e2fsck -nf shows corruption
  Inode 53, i_size is 5497558142976, should be 4096.  Fix? no

Fix the two issues by

a) forcing all delayed allocation blocks to be allocated before checking
   eh->eh_depth and eh->eh_entries
b) ensuring the last logical block of the extent is within the direct map

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-03 23:56:50 -04:00
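The limit quoted in d6f123a929 and the block number in the dmesg output can be reproduced with a few lines of arithmetic. This sketch assumes 4-byte block pointers in indirect blocks, which is what gives 2^10 pointers per 4k block; the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Blocks addressable through 12 direct pointers plus single, double and
 * triple indirect blocks, assuming 4-byte block pointers.
 */
uint64_t indirect_map_max_blocks(uint64_t block_size)
{
    uint64_t p = block_size / 4;    /* pointers per indirect block */

    return 12 + p + p * p + p * p * p;
}

/* Logical block number that contains the given byte offset. */
uint64_t logical_block(uint64_t byte_offset, uint64_t block_size)
{
    return byte_offset / block_size;
}
```

With 4k blocks the limit is 12 + 2^10 + 2^20 + 2^30 = 1074791436 blocks, while a write at 5T lands in logical block 1342177280, the number reported in the "block 1342177280 > max" warning. Likewise 5T + 4k = 5497558142976 bytes, the i_size shown by e2fsck.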
Lukas Czerner
9705acd63b ext4: fix reservation release on invalidatepage for delalloc fs
On a delalloc-enabled file system, the invalidatepage operation in
ext4_da_page_release_reservation() should clear the delayed buffer and
remove the extent covering that buffer from the extent status tree.

However, there is currently a bug: on systems with page size > block
size we always remove extents from the start of the page, regardless of
where the actual delayed buffers are positioned in the page.  This leads
to errors like this:

EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks

This can, however, cause data loss at writeback time if the file system
is in an ENOSPC condition, because we're releasing the reservation for
someone else's delayed buffer.

Fix this by only removing extents that correspond to the part of the
page we want to invalidate.

This problem is reproducible with the following fio recipe (however, I
was only able to reproduce it with fio-2.1 or older):

[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2015-07-03 21:13:55 -04:00
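The arithmetic behind the fix in 9705acd63b, mapping the invalidated byte range of a page to logical blocks instead of always starting at the first block of the page, can be sketched as below. This is not the ext4 code: the struct and function are hypothetical, and the sketch ignores whether each buffer is actually delayed, which the real function also has to check.

```c
#include <stdint.h>

/* A run of logical blocks: first block number and how many blocks follow. */
struct blk_range {
    uint64_t first;
    unsigned count;
};

/*
 * Logical blocks of page `page_index` that lie entirely inside the byte
 * range [offset, offset + length) of that page.  page_shift and blkbits
 * are log2 of the page size and block size, with page_shift >= blkbits.
 */
struct blk_range blocks_in_invalidated_range(uint64_t page_index,
                                             unsigned page_shift,
                                             unsigned blkbits,
                                             unsigned offset,
                                             unsigned length)
{
    unsigned bsize = 1u << blkbits;
    unsigned first_off = (offset + bsize - 1) & ~(bsize - 1);  /* round start up */
    unsigned end_off = (offset + length) & ~(bsize - 1);       /* round end down */
    struct blk_range r;

    r.first = (page_index << (page_shift - blkbits)) + (first_off >> blkbits);
    r.count = end_off > first_off ? (end_off - first_off) >> blkbits : 0;
    return r;
}
```

With 4k pages and 1k blocks, invalidating the second half of page 3 (offset 2048, length 2048) covers logical blocks 14 and 15. Starting from the first block of the page instead would touch block 12, which belongs to a part of the page that was not invalidated.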