Commit Graph

48450 Commits

Author SHA1 Message Date
Trond Myklebust
1592c4d62a Merge branch 'nfs-rdma' 2016-07-24 17:09:02 -04:00
Trond Myklebust
668f455dac Merge branch 'pnfs' 2016-07-24 17:08:59 -04:00
Trond Myklebust
362745268c Merge branch 'writeback' 2016-07-24 17:08:31 -04:00
Wei Fang
47be61845c fs/dcache.c: avoid soft-lockup in dput()
We triggered a soft lockup under a stress test that
opens/accesses/writes/closes one file concurrently on more than
five different CPUs:

WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
...
[<ffffffc0003986f8>] dput+0x100/0x298
[<ffffffc00038c2dc>] terminate_walk+0x4c/0x60
[<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8
[<ffffffc00038f780>] filename_lookup+0x38/0xf0
[<ffffffc000391180>] user_path_at_empty+0x78/0xd0
[<ffffffc0003911f4>] user_path_at+0x1c/0x28
[<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230

->d_lock trylock may fail many times because of concurrent
operations, and dput() may execute for a long time.

Fix this by replacing cpu_relax() with cond_resched().
dput() used to be sleepable, so making it sleepable again
should be safe.
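
A minimal sketch of the shape of the change in dput()'s retry path
(simplified; the surrounding refcount handling in fs/dcache.c is elided):

        void dput(struct dentry *dentry)
        {
                if (unlikely(!dentry))
                        return;
        repeat:
                /* ... fast path and refcount handling elided ... */
                dentry = dentry_kill(dentry);   /* trylocks ->d_lock internally */
                if (dentry) {
                        /* previously cpu_relax(): spinning here under heavy
                         * contention is what triggered the soft lockup */
                        cond_resched();
                        goto repeat;
                }
        }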

Cc: <stable@vger.kernel.org>
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-24 16:37:16 -04:00
Miklos Szeredi
285b102d3b vfs: new d_init method
Allow filesystems to initialize dentries at allocation time.
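
As a hedged illustration, a filesystem could use the new hook roughly
like this ("examplefs" and its per-dentry struct are made up for the
example):

        struct examplefs_dentry_info {
                unsigned long flags;            /* per-dentry state owned by the fs */
        };

        static int examplefs_d_init(struct dentry *dentry)
        {
                /* runs when the dentry is allocated, so the state is present
                 * before the dentry is ever visible to lookups */
                dentry->d_fsdata = kzalloc(sizeof(struct examplefs_dentry_info),
                                           GFP_KERNEL);
                return dentry->d_fsdata ? 0 : -ENOMEM;
        }

        static const struct dentry_operations examplefs_dentry_ops = {
                .d_init = examplefs_d_init,
        };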

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-24 16:36:29 -04:00
Al Viro
17648b871d Merge branch 'test.d_iput' into work.misc 2016-07-24 16:36:04 -04:00
Oleg Drokin
f4fdace947 vfs: Update lookup_dcache() comment
commit 6c51e513a3 ("lookup_dcache(): lift d_alloc() into callers")
removed the need_lookup argument from lookup_dcache(), but the
comment was not updated. It also no longer allocates a new dentry
if nothing was found.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-24 16:35:02 -04:00
Trond Myklebust
01d7b29f0e pNFS: Remove redundant smp_mb() from pnfs_init_lseg()
It's not visible yet, and won't be until after we grab the inode->i_lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:43 -04:00
Trond Myklebust
119cef97a4 pNFS: Cleanup - do layout segment initialisation in one place
...instead of splitting the initialisation over init_lseg() and
pnfs_layout_process().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:42 -04:00
Trond Myklebust
28c1acffea pNFS: Remove redundant stateid invalidation
The layout stateid will be invalidated once it holds no more layout
segments anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:42 -04:00
Trond Myklebust
f71dfe8fc9 pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
That's already being taken care of in pnfs_layout_remove_lseg().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:41 -04:00
Trond Myklebust
d9b61708fe pNFS: Clear the layout metadata if the server changed the layout stateid
If the server changed the layout stateid's "other" field, then
we should treat the old layout as being completely gone. In that
case, we want to clear the metadata such as scheduled layoutreturns.

Do this by calling pnfs_mark_layout_stateid_invalid().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:41 -04:00
Trond Myklebust
5f46be049b pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
Ensure nfs42_layoutstat_done() and layoutget don't open code layout stateid
invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:40 -04:00
Trond Myklebust
e036f46453 NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
When determining which layout segments to return, we do want
pnfs_mark_matching_lsegs_return to check that they match the layout
sequence id. This ensures that we don't waste time if the server
is replaying a layout recall that has already been satisfied.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:40 -04:00
Trond Myklebust
2d6cf5ab0b pNFS: Do not set plh_return_seq for non-callback related layoutreturns
In cases where we need to send a layoutreturn in order to propagate
an error, we should not tie that to a specific layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:39 -04:00
Trond Myklebust
e5fd1904b8 pNFS: Ensure layoutreturn acts as a completion for layout callbacks
When we return NFS_OK to the CB_LAYOUTRECALL, we are required to
send a layoutreturn that "completes" that layout recall request, using
the correct stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:39 -04:00
Trond Myklebust
793b7fe558 pNFS: Fix CB_LAYOUTRECALL stateid verification
We want to evaluate in this order:

If the client holds no layout for this inode, then return
NFS4ERR_NOMATCHING_LAYOUT; it probably forgot the layout.

If the client finds the inode among the list of layouts, but the corresponding
stateid has not yet been initialised, then return NFS4ERR_DELAY to ask the
server to retry once the outstanding LAYOUTGET is complete.

If the current layout stateid's "other" field does not match the recalled
stateid, return NFS4ERR_BAD_STATEID.

If already processing a layout recall with a newer stateid, return
NFS4ERR_OLD_STATEID. This can only happen for servers that are
non-compliant with the NFSv4.1 protocol.

If already processing a layout recall with an older stateid, return
NFS4ERR_DELAY to ask the server to retry once the outstanding
LAYOUTRETURN is complete. Again, this is technically non-compliant with
the NFSv4.1 protocol.

If the current layout sequence id is newer than the recalled stateid's
sequence id, return NFS4ERR_OLD_STATEID. This too implies protocol
non-compliance.

If the current layout sequence id is older than the recalled stateid's
sequence id+1, return NFS4ERR_DELAY.
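
A rough C-like sketch of that ordering; the helper names and the exact
sequence-number comparisons are illustrative, not the actual client code:

        if (lo == NULL)                         /* no layout held for this inode */
                return NFS4ERR_NOMATCHING_LAYOUT;
        if (!layout_stateid_initialised(lo))    /* LAYOUTGET still outstanding */
                return NFS4ERR_DELAY;
        if (!stateid_match_other(&lo->stateid, &recall_stateid))
                return NFS4ERR_BAD_STATEID;
        if (recall_in_progress_seq(lo) > recall_seq)    /* newer recall in hand */
                return NFS4ERR_OLD_STATEID;
        if (recall_in_progress_seq(lo) &&
            recall_in_progress_seq(lo) < recall_seq)    /* older recall pending */
                return NFS4ERR_DELAY;
        if (current_seq(lo) >= recall_seq)      /* recalled stateid is stale */
                return NFS4ERR_OLD_STATEID;
        if (recall_seq > current_seq(lo) + 1)   /* gap: LAYOUTGET reply in flight */
                return NFS4ERR_DELAY;
        return NFS4_OK;                         /* process the recall */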

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:38 -04:00
Trond Myklebust
ecebb80bf3 pNFS: Always update the layout barrier seqid on LAYOUTGET
Currently, pnfs_set_layout_stateid() will update the layout sequence
id barrier only if the stateid itself is newer than the current
layout stateid. However, in a situation where multiple LAYOUTGET calls
and a LAYOUTRETURN race, it is entirely possible for one of the
LAYOUTGETs to set the current stateid to something newer than the
LAYOUTRETURN that needs to set the barrier.

The fix is to allow the "update_barrier" flag to force a check as to
whether or not the barrier needs to be updated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:38 -04:00
Trond Myklebust
13bede18de pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
If the layout stateid is invalid, then pnfs_set_layout_stateid() must
always initialise it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 16:16:25 -04:00
Trond Myklebust
8e0acf9046 pNFS: Clear the layout return tracking on layout reinitialisation
Ensure that we don't carry over layoutreturn info from a previous
incarnation of this layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 12:51:49 -04:00
Trond Myklebust
45fcc7bca7 pNFS: LAYOUTRETURN should only update the stateid if the layout is valid
If the layout was completely returned, then ignore the returned layout
stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-24 12:51:49 -04:00
Trond Myklebust
dc05973b28 Merge commit 'e7bdea7750eb'
Needed in order to work on top of pNFS changes in Linus' upstream kernel.
2016-07-24 12:51:10 -04:00
Dan Williams
0606263f24 Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next 2016-07-24 08:05:44 -07:00
David S. Miller
de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Eric W. Biederman
aeaa4a79ff fs: Call d_automount with the filesystems creds
Seth Forshee reported a mount regression in nfs automounts
with "fs: Add user namespace member to struct super_block".

It turns out that the assumption that current->cred is something
reasonable during mount, while necessary to improve support of
unprivileged mounts, is wrong in the automount path.

To fix the existing filesystems, override current->cred with the
init_cred before calling d_automount and restore current->cred after
d_automount completes.

Supporting unprivileged mounts would require a more nuanced cred
selection, so fail on unprivileged mounts for the time being.  As none
of the filesystems that currently set FS_USERNS_MOUNT implement
d_automount, this check only guards against future problems.
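
A minimal sketch of the pattern described above, using the standard
override_creds()/revert_creds() helpers (the actual hunk lives in the
automount path in fs/namei.c):

        const struct cred *old_cred;
        struct vfsmount *mnt;

        /* run ->d_automount() with the init creds, then put the caller's
         * creds back regardless of the outcome */
        old_cred = override_creds(&init_cred);
        mnt = path->dentry->d_op->d_automount(path);
        revert_creds(old_cred);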

Fixes: 6e4eab577a ("fs: Add user namespace member to struct super_block")
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-07-23 14:51:26 -05:00
Linus Torvalds
88083e9845 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This contains a fix for a potential crash/corruption issue and another
  where the suid/sgid bits weren't cleared on write"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: verify upper dentry in ovl_remove_and_whiteout()
  ovl: Copy up underlying inode's ->i_mode to overlay inode
  ovl: handle ATTR_KILL*
2016-07-23 14:25:02 +09:00
Benjamin Coddington
149a4fddd0 nfs: don't create zero-length requests
NFS doesn't expect requests with wb_bytes set to zero and may make
unexpected decisions about how to handle that request at the page IO layer.
Skip request creation if we won't have any wb_bytes in the request.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-22 15:15:16 -04:00
Artem Savkov
297fae4d0b Fix NULL pointer dereference in bl_free_device().
When bl_parse_deviceid() fails in bl_alloc_deviceid_node() at the
blkdev_get_by_*() step, we get a pnfs_block_dev struct that is
uninitialized except for the bdev field, which is set to whatever error
blkdev_get_by_*() returned.  bl_free_device() then tries to call
blkdev_put() if bdev is not 0, resulting in a bad pointer dereference.

Fix this by setting bdev in struct pnfs_block_dev only if we didn't
get an error from blkdev_get_by_*().
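
A sketch of the fix, shown for the blkdev_get_by_dev() variant (the
blkdev_get_by_path() case is analogous):

        struct block_device *bdev;

        bdev = blkdev_get_by_dev(dev, FMODE_READ | FMODE_WRITE, NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);   /* d->bdev stays zeroed */

        /* only publish the pointer once we know it is valid, so that
         * bl_free_device() never calls blkdev_put() on an ERR_PTR value */
        d->bdev = bdev;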

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-22 15:14:21 -04:00
Yunlei He
fe94793e55 f2fs: get victim segment again after new cp
A previously selected segment may become free after write_checkpoint.
If we do garbage collection on this segment, and new_curseg happens
to reuse it, it may cause the f2fs_bug_on below.

	panic+0x154/0x29c
	do_garbage_collect+0x15c/0xaf4
	f2fs_gc+0x2dc/0x444
	f2fs_balance_fs.part.22+0xcc/0x14c
	f2fs_balance_fs+0x28/0x34
	f2fs_map_blocks+0x5ec/0x790
	f2fs_preallocate_blocks+0xe0/0x100
	f2fs_file_write_iter+0x64/0x11c
	new_sync_write+0xac/0x11c
	vfs_write+0x144/0x1e4
	SyS_write+0x60/0xc0

Here, we may trip the sit and ssa type checks during reset_curseg. So,
check whether the segment has become stale, and select a new victim to
avoid this.
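
A rough sketch of the idea; the helper used to test whether the victim
still holds valid blocks is an assumption here:

        /* after write_checkpoint() the chosen victim may already be free and
         * reused by new_curseg(); if so, go back and pick a fresh victim */
        if (get_valid_blocks(sbi, segno, sbi->segs_per_sec) == 0)
                goto gc_more;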

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-07-22 11:55:31 -07:00
Maxim Patlasov
cfc9fde0b0 ovl: verify upper dentry in ovl_remove_and_whiteout()
The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's look it up after ovl_lock_rename_workdir
and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417 ("ovl: verify upper dentry before unlink and rename").
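
A sketch of the verification step (simplified, with error unwinding
omitted; the overlayfs helpers named here follow the existing code):

        struct dentry *upper;
        int err;

        /* with the workdir/upperdir rename lock held, re-resolve the name
         * and make sure it is still the upper dentry we found earlier */
        upper = lookup_one_len(dentry->d_name.name, upperdir,
                               dentry->d_name.len);
        if (IS_ERR(upper))
                return PTR_ERR(upper);

        err = -ESTALE;
        if (upper == ovl_dentry_upper(dentry))
                err = 0;                /* still the same dentry: safe to whiteout */
        dput(upper);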

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-07-22 10:54:20 +02:00
Dave Chinner
f2bdfda9a1 Merge branch 'xfs-4.8-misc-fixes-4' into for-next 2016-07-22 14:10:56 +10:00
Dave Chinner
72ccbbe154 xfs: remove EXPERIMENTAL tag from sparse inode feature
Been around for long enough now, hasn't caused any regression test
failures in the past 3 months, so it's time to make it a fully
supported feature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 14:10:18 +10:00
Dave Chinner
28b783e47a xfs: bufferhead chains are invalid after end_page_writeback
In xfs_finish_page_writeback(), we have a loop that looks like this:

        do {
                if (off < bvec->bv_offset)
                        goto next_bh;
                if (off > end)
                        break;
                bh->b_end_io(bh, !error);
next_bh:
                off += bh->b_size;
        } while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have been marked as no longer
under IO.  The issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.....
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.....
 Call Trace:
  <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
  [<ffffffff8153a995>] print_trailer+0x115/0x1a0
  [<ffffffff81541174>] object_err+0x34/0x40
  [<ffffffff815436e7>] kasan_report_error+0x217/0x530
  [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
  [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
  [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
  [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
  [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
  [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
  [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
  [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
  [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340


Here the access occurs during IO completion, after the buffer
had been freed by direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the bh->b_end_io() callout.
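
The reworked loop ends up roughly like this: both the buffer size and the
chain pointer are sampled before b_end_io() gets a chance to trigger
end_page_writeback() (a sketch of the shape of the fix):

        bsize = bh->b_size;                     /* sampled once, up front */
        do {
                next = bh->b_this_page;         /* read before bh may be freed */
                if (off < bvec->bv_offset)
                        goto next_bh;
                if (off > end)
                        break;
                bh->b_end_io(bh, !error);       /* may end page writeback */
next_bh:
                off += bsize;
        } while ((bh = next) != head);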

Yet another example of why Bufferheads Must Die.

cc: <stable@vger.kernel.org> # 4.7
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:56:38 +10:00
Dave Chinner
b1c5ebb213 xfs: allocate log vector buffers outside CIL context lock
One of the problems we currently have with delayed logging is that
under serious memory pressure we can deadlock memory reclaim. This
occurs when memory reclaim (such as that run by kswapd) is reclaiming XFS
inodes and issues a log force to unpin inodes that are dirty in the
CIL.

The CIL is pushed, but this will only occur once it gets the CIL
context lock to ensure that all committing transactions are complete
and no new transactions start being committed to the CIL while the
push switches to a new context.

The deadlock occurs when the CIL context lock is held by a
committing process that is doing memory allocation for log vector
buffers, and that allocation is then blocked on memory reclaim
making progress. Memory reclaim, however, is blocked waiting for
a log force to make progress, and so we effectively deadlock at this
point.

To solve this problem, we have to move the CIL log vector buffer
allocation outside of the context lock so that memory reclaim can
always make progress when it needs to force the log. The problem
with doing this is that a CIL push can take place while we are
determining if we need to allocate a new log vector buffer for
an item and hence the current log vector may go away without
warning. That means we cannot rely on the existing log vector being
present when we finally grab the context lock and so we must have a
replacement buffer ready to go at all times.

To ensure this, introduce a "shadow log vector" buffer that is
always guaranteed to be present when we gain the CIL context lock
and format the item. This shadow buffer may or may not be used
during the formatting, but if the log item does not have an existing
log vector buffer or that buffer is too small for the new
modifications, we swap it for the new shadow buffer and format
the modifications into that new log vector buffer.

The result of this is that for any object we modify more than once
in a given CIL checkpoint, we double the memory required
to track dirty regions in the log. For single modifications, we
consume the shadow log vector we allocate on commit, and that gets
consumed by the checkpoint. However, if we make multiple
modifications, then the second transaction commit will allocate a
shadow log vector and hence we will end up with double the memory
usage as only one of the log vectors is consumed by the CIL
checkpoint. The remaining shadow vector will be freed when the log
item is freed.

This can probably be optimised in future - access to the shadow log
vector is serialised by the object lock (as opposed to the active
log vector, which is controlled by the CIL context lock) and so we
can probably free the shadow log vector from some objects when the log
item is marked clean on removal from the AIL.
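
A heavily simplified sketch of the commit-side decision; the field names
(li_lv, li_lv_shadow, lv_buf_len) follow the description above but should
be treated as assumptions here:

        /* the shadow buffer was sized and allocated before we took the CIL
         * context lock, so no memory allocation happens under that lock */
        if (!lip->li_lv || lip->li_lv->lv_buf_len < new_len) {
                /* no existing vector, or it is too small: switch to the
                 * preallocated shadow and format into that */
                lv = lip->li_lv_shadow;
        } else {
                /* existing log vector is big enough: reuse it and keep the
                 * shadow around for a later modification */
                lv = lip->li_lv;
        }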

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:52:35 +10:00
Dave Chinner
160ae76fa1 libxfs: directory node splitting does not have an extra block
xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tree is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry.  This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

Indeed, for directories there *may* be an extra block on this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory.

The problem is that the code assumes that there may be extra blocks
that we need to link into the tree once we've split the root, but
this is not true for either dir or attr trees, because the extra
attr block is always consumed by the last node split before we split
the root. Hence linking in an extra block is always wrong at the
root split level, and this manifests itself in repair as a directory
corruption in a repaired directory, leaving the directory rebuild
incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by ensuring the linking of the blocks after the root split
never tries to make use of the extra blocks that may be held in the
cursor. They are held there for other purposes and should never be
touched by the root splitting code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:51:05 +10:00
Arnd Bergmann
f021bd071f xfs: remove dax code from object file when disabled
We check IS_DAX(inode) before calling either xfs_file_dax_read or
xfs_file_dax_write, and this leads to the call being optimized out at
compile time when CONFIG_FS_DAX is disabled.

However, the two functions are marked STATIC, so they become global
symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global
functions that call into an undefined function and a broken "allmodconfig"
build:

fs/built-in.o: In function `xfs_file_dax_read':
fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io'
fs/built-in.o: In function `xfs_file_dax_write':
fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io'

Marking the two functions 'static noinline' instead of 'STATIC' will let
the compiler drop the symbols when there are no callers but avoid the
implicit inlining.
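
The change itself is just the storage-class annotation on the two
helpers, roughly:

        /* before: STATIC expands to a global symbol when CONFIG_XFS_DEBUG=y,
         * so the function survives even with no callers */
        STATIC ssize_t
        xfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);

        /* after: the compiler may drop the function entirely when IS_DAX()
         * is compile-time false, but will not implicitly inline it */
        static noinline ssize_t
        xfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);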

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 16d4d43595 ("xfs: split direct I/O and DAX path")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:50:55 +10:00
Brian Foster
99579ccec4 xfs: skip dirty pages in ->releasepage()
XFS has had scattered reports of delalloc blocks present at
->releasepage() time. This results in a warning with a stack trace
similar to the following:

 ...
 Call Trace:
  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
  [<ffffffffa2168539>] kswapd+0x4f9/0x970
  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100

This occurs because it is possible for shrink_active_list() to send
pages marked dirty to ->releasepage() when certain buffer_head threshold
conditions are met. shrink_active_list() doesn't check the page dirty
state, apparently to handle an old ext3 corner case where clean pages
would sometimes not have the dirty bit cleared; thus it is up to the
filesystem to determine how to handle the page.

XFS currently handles the delalloc case properly, but this behavior
makes the warning spurious. Update the XFS ->releasepage() handler to
explicitly skip dirty pages. Retain the existing delalloc/unwritten
checks so we continue to warn if such buffers exist on clean pages when
they shouldn't.
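
The updated check in xfs_vm_releasepage() ends up looking roughly like
this (a sketch; the warning is kept so lingering delalloc/unwritten
buffers on clean pages still complain):

        xfs_count_page_state(page, &delalloc, &unwritten);

        if (delalloc) {
                WARN_ON_ONCE(!PageDirty(page)); /* clean + delalloc is still a bug */
                return 0;                       /* dirty pages are simply skipped */
        }
        if (unwritten) {
                WARN_ON_ONCE(!PageDirty(page));
                return 0;
        }

        return try_to_free_buffers(page);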

Diagnosed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:50:38 +10:00
Bob Peterson
e1cb6be9e1 GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
Before this patch, if you used gfs2_jadd to add new journals of a
size smaller than the existing journals, replaying those new journals
would withdraw. That's because function gfs2_replay_incr_blk was
using the number of journal blocks (jd_block) from the superblock's
journal pointer. In other words, "My journal's max size" rather than
"the journal we're replaying's size." This patch changes the function
to use the size of the pertinent journal rather than always using the
journal we happen to be using.
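
A sketch of the reworked helper, assuming jd_blocks is the per-journal
block count on struct gfs2_jdesc:

        /* advance the replay block cursor, wrapping at the size of the
         * journal that is actually being replayed */
        void gfs2_replay_incr_blk(struct gfs2_jdesc *jd, unsigned int *blk)
        {
                if (++*blk == jd->jd_blocks)
                        *blk = 0;
        }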

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-21 13:02:44 -05:00
Trond Myklebust
e033fb51eb pNFS/files: filelayout_write_done_cb must call nfs_writeback_update_inode()
All write callbacks are required to call nfs_writeback_update_inode() upon
success to ensure that file size changes are recorded, and the attribute
cache is invalidated.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-07-21 09:46:42 -04:00
Chris Mason
8b8b08cbfb Btrfs: fix delalloc accounting after copy_from_user faults
Commit 56244ef151 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v4.6+
2016-07-21 04:03:40 -07:00
Miklos Szeredi
0f7d93416d Merge branch 'for-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-next 2016-07-21 11:14:30 +02:00
Al Viro
9aba36dea5 qstr constify instances in fs/dcache.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
beffb8feb6 qstr: constify instances in nfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
612645f7cf qstr: constify instances in ocfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
8ac790f312 qstr: constify instances in autofs4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
71e939634d qstr: constify instances in hfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
b5cce521e8 qstr: constify instances in hfsplus
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Al Viro
7f5458ec5c qstr: constify instances in logfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-20 23:30:06 -04:00
Toshi Kani
163d4baaeb block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capability may not be enabled conditionally.

In preparation for supporting DAX on device-mapper devices, add
QUEUE_FLAG_DAX to the request_queue flags to advertise DAX
support.  This will allow setting the DAX capability based on how
the mapped device is composed.
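
A minimal sketch of both sides, assuming the 4.8-era queue flag helpers
(the consumer-side snippet is hypothetical):

        /* driver side: advertise DAX support on the request queue */
        queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);

        /* consumer side (e.g. device-mapper table validation): test the
         * queue flag instead of poking at block_device_operations */
        if (blk_queue_dax(bdev_get_queue(bdev)))
                dax_supported = true;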

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:01:01 -06:00
Josef Bacik
bac357dcec Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction.  This enforces
the correct flush type to avoid both deadlocks and assertions.
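
Sketched, a reservation taken inside the transaction then looks something
like this (assuming the btrfs_block_rsv_add() flush-mode parameter):

        /* inside a running transaction: FLUSH_LIMIT bounds how much flushing
         * the reservation may trigger, so it never waits on work that itself
         * needs this transaction to complete */
        ret = btrfs_block_rsv_add(root, rsv, blocksize,
                                  BTRFS_RESERVE_FLUSH_LIMIT);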

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2016-07-20 16:58:04 -07:00