Commit Graph

56429 Commits

Author SHA1 Message Date
Brian Foster
63db7c815b xfs: use ->b_state to fix buffer I/O accounting release race
We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.

The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads was used to
successfully manufacture this problem.

Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. It turns out that we already have separate buffer
state bits and associated serialization for dealing with buffer LRU
state in the form of ->b_state and ->b_lock. Therefore, replace the
_XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
accounting wrappers appropriately and make sure they are used with
the correct locking. This ensures that buffer in-flight state can be
modified at buffer release time without racing with modifications
from a buffer lock holder.

Fixes: 9c7504aa72 ("xfs: track and serialize in-flight async buffers against unmount")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Libor Pechacek <lpechacek@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-31 08:22:52 -07:00
Linus Torvalds
f511c0b17b "Yes, people use FOLL_FORCE ;)"
This effectively reverts commit 8ee74a91ac ("proc: try to remove use
of FOLL_FORCE entirely")

It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:

Keno Fischer:
 "We used these semantics as a hardening mechanism in the julia JIT. By
  opening /proc/self/mem and using these semantics, we could avoid
  needing RWX pages, or a dual mapping approach. We do have fallbacks to
  these other methods (though getting EIO here actually causes an assert
  in released versions - we'll updated that to make sure to take the
  fall back in that case).

  Nevertheless the /proc/self/mem approach was our favored approach
  because it a) Required an attacker to be able to execute syscalls
  which is a taller order than getting memory write and b) didn't double
  the virtual address space requirements (as a dual mapping approach
  would).

  I think in general this feature is very useful for anybody who needs
  to precisely control the execution of some other process. Various
  debuggers (gdb/lldb/rr) certainly fall into that category, but there's
  another class of such processes (wine, various emulators) which may
  want to do that kind of thing.

  Now, I suspect most of these will have the other process under ptrace
  control, so maybe allowing (same_mm || ptraced) would be ok, but at
  least for the sandbox/remote-jit use case, it would be perfectly
  reasonable to not have the jit server be a ptracer"

Robert O'Callahan:
 "We write to readonly code and data mappings via /proc/.../mem in lots
  of different situations, particularly when we're adjusting program
  state during replay to match the recorded execution.

  Like Julia, we can add workarounds, but they could be expensive."

so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.

With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"

Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 12:38:59 -07:00
Jan Kara
67a7d5f561 ext4: fix fdatasync(2) after extent manipulation operations
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus fdatasync(2) has a work to do. As a result if we crash
e.g. after a punch hole and fdatasync, user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-29 13:24:55 -04:00
Miklos Szeredi
a082c6f680 ovl: filter trusted xattr for non-admin
Filesystems filter out extended attributes in the "trusted." domain for
unprivlieged callers.

Overlay calls underlying filesystem's method with elevated privs, so need
to do the filtering in overlayfs too.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 15:15:27 +02:00
Amir Goldstein
f3a1568582 ovl: mark upper merge dir with type origin entries "impure"
An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.

We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.

Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.

This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 11:48:00 +02:00
Kees Cook
7585d12f65 ocfs2: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.

Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:49 -07:00
Kees Cook
fee2aa7538 ntfs: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.

Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:48 -07:00
Kees Cook
fe3b81b446 NFS: Use ERR_CAST() to avoid cross-structure cast
When the call to nfs_devname() fails, the error path attempts to retain
the error via the mnt variable, but this requires a cast across very
different types (char * to struct vfsmount *), which the upcoming
structure layout randomization plugin flags as being potentially
dangerous in the face of randomization. This is a false positive, but
what this code actually wants to do is retain the error value, so this
patch explicitly sets it, instead of using what seems to be an
unexpected cast.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-28 10:11:47 -07:00
Al Viro
4d7edbc34c nfsd_readlink(): switch to vfs_get_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27 16:11:23 -04:00
Christoph Hellwig
a75d30c772 fs/locks: pass kernel struct flock to fcntl_getlk/setlk
This will make it easier to implement a sane compat fcntl syscall.

[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:19 -04:00
Mauro Carvalho Chehab
80b79dd0e2 fs: locks: Fix some troubles at kernel-doc comments
There are a few syntax violations that cause outputs of
a few comments to not be properly parsed in ReST format.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:18 -04:00
Jan Kara
a056bdaae7 ext4: fix data corruption for mmap writes
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.

Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:45:45 -04:00
Jan Kara
4f8caa60a5 ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero-out
allocated blocks and these blocks are actually converted from unwritten
extent the following race can happen:

CPU0					CPU1

page fault				page fault
...					...
ext4_map_blocks()
  ext4_ext_map_blocks()
    ext4_ext_handle_unwritten_extents()
      ext4_ext_convert_to_initialized()
	- zero out converted extent
	ext4_zeroout_es()
	  - inserts extent as initialized in status tree

					ext4_map_blocks()
					  ext4_es_lookup_extent()
					    - finds initialized extent
					write data
  ext4_issue_zeroout()
    - zeroes out new extent overwriting data

This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.

Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.

CC: stable@vger.kernel.org
Fixes: 12735f8819
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:40:52 -04:00
Linus Torvalds
cdbe020678 Merge tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Darrick Wong:
 "A few miscellaneous bug fixes & cleanups:

   - Fix indlen block reservation accounting bug when splitting delalloc
     extent

   - Fix warnings about unused variables that appeared in -rc1.

   - Don't spew errors when bmapping a local format directory

   - Fix an off-by-one error in a delalloc eof assertion

   - Make fsmap only return inode information for CAP_SYS_ADMIN

   - Fix a potential mount time deadlock recovering cow extents

   - Fix unaligned memory access in _btree_visit_blocks

   - Fix various SEEK_HOLE/SEEK_DATA bugs"

* tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
  xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
  xfs: Fix missed holes in SEEK_HOLE implementation
  xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
  xfs: fix unaligned access in xfs_btree_visit_blocks
  xfs: avoid mount-time deadlock in CoW extent recovery
  xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
  xfs: bad assertion for delalloc an extent that start at i_size
  xfs: fix warnings about unused stack variables
  xfs: BMAPX shouldn't barf on inline-format directories
  xfs: fix indlen accounting error on partial delalloc conversion
2017-05-26 12:13:08 -07:00
Al Viro
8d1a81a852 sanitize do_i2c_smbus_ioctl()
no need to mess with __copy_in_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-25 17:52:59 -04:00
Jan Kara
a54fba8f5a xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
Currently several places in xfs_find_get_desired_pgoff() handle the case
of a missing page. Make them all handled in one place after the loop has
terminated.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
d7fd24257a xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
There is an off-by-one error in loop termination conditions in
xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
5375023ae1 xfs: Fix missed holes in SEEK_HOLE implementation
XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
can be seen by the following command:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offset 56k was just ignored by SEEK_HOLE
implementation. The bug is in xfs_find_get_desired_pgoff() which does
not properly detect the case when pages are not contiguous.

Fix the problem by properly detecting when found page has larger offset
than expected.

CC: stable@vger.kernel.org
Fixes: d126d43f63
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eryu Guan
8affebe16d xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/xfs/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is uncovered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eric Sandeen
a4d768e702 xfs: fix unaligned access in xfs_btree_visit_blocks
This structure copy was throwing unaligned access warnings on sparc64:

Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]

xfs_btree_copy_ptrs does a memcpy, which avoids it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Tahsin Erdogan
b8cb5a545c ext4: fix quota charging for shared xattr blocks
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:24:07 -04:00
Eric Biggers
c41d342b39 ext4: remove redundant check for encrypted file on dio write path
Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO().  There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO().  There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.

Just remove the unnecessary BUG_ON().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: David Gstir <david@sigma-star.at>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:20:31 -04:00
Eric Biggers
d6b975504e ext4: remove unused d_name argument from ext4_search_dir() et al.
Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:10:49 -04:00
Eric Biggers
e5465795ca ext4: fix off-by-one error when writing back pages before dio read
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.

Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.

Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:05:29 -04:00
Eryu Guan
624327f879 ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:02:20 -04:00
Luis Henriques
42c99fc4c7 ceph: check that the new inode size is within limits in ceph_fallocate()
Currently the ceph client doesn't respect the rlimit in fallocate.  This
means that a user can allocate a file with size > RLIMIT_FSIZE.  This
patch adds the call to inode_newsize_ok() to verify filesystem limits and
ulimits.  This should make ceph successfully run xfstest generic/228.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-24 18:10:54 +02:00
Trond Myklebust
b49c15f97c NFSv4.0: Fix a lock leak in nfs40_walk_client_list
Xiaolong Ye's kernel test robot detected the following Oops:
[  299.158991] BUG: scheduling while atomic: mount.nfs/9387/0x00000002
[  299.169587] 2 locks held by mount.nfs/9387:
[  299.176165]  #0:  (nfs_clid_init_mutex){......}, at: [<ffffffff8130cc92>] nfs4_discover_server_trunking+0x47/0x1fc
[  299.201802]  #1:  (&(&nn->nfs_client_lock)->rlock){......}, at: [<ffffffff813125fa>] nfs40_walk_client_list+0x2e9/0x338
[  299.221979] CPU: 0 PID: 9387 Comm: mount.nfs Not tainted 4.11.0-rc7-00021-g14d1bbb #45
[  299.235584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[  299.251176] Call Trace:
[  299.255192]  dump_stack+0x61/0x7e
[  299.260416]  __schedule_bug+0x65/0x74
[  299.266208]  __schedule+0x5d/0x87c
[  299.271883]  schedule+0x89/0x9a
[  299.276937]  schedule_timeout+0x232/0x289
[  299.283223]  ? detach_if_pending+0x10b/0x10b
[  299.289935]  schedule_timeout_uninterruptible+0x2a/0x2c
[  299.298266]  ? put_rpccred+0x3e/0x115
[  299.304327]  ? schedule_timeout_uninterruptible+0x2a/0x2c
[  299.312851]  msleep+0x1e/0x22
[  299.317612]  nfs4_discover_server_trunking+0x102/0x1fc
[  299.325644]  nfs4_init_client+0x13f/0x194

It looks as if we recently added a spin_lock() leak to
nfs40_walk_client_list() when cleaning up the code.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Fixes: 14d1bbb0ca ("NFS: Create a common nfs4_match_client() function")
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 08:05:16 -04:00
Benjamin Coddington
08cb5b0f05 pnfs: Fix the check for requests in range of layout segment
It's possible and acceptable for NFS to attempt to add requests beyond the
range of the current pgio->pg_lseg, a case which should be caught and
limited by the pg_test operation.  However, the current handling of this
case replaces pgio->pg_lseg with a new layout segment (after a WARN) within
that pg_test operation.  That will cause all the previously added requests
to be submitted with this new layout segment, which may not be valid for
those requests.

Fix this problem by only returning zero for the number of bytes to coalesce
from pg_test for this case which allows any previously added requests to
complete on the current layout segment.  The check for requests starting
out of range of the layout segment moves to pg_init, so that the
replacement of pgio->pg_lseg will be done when the next request is added.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:55:02 -04:00
Dan Carpenter
662f9a105b pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
If xdr_inline_decode() fails then we end up returning ERR_PTR(0).  The
caller treats NULL returns as -ENOMEM so it doesn't really hurt runtime,
but obviously we intended to set an error code here.

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:54 -04:00
Olga Kornievskaia
6d3b5d8d8d NFS fix COMMIT after COPY
Fix a typo in the commit e092693443
"NFS append COMMIT after synchronous COPY"

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: e092693443 ("NFS append COMMIT after synchronous COPY")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:48 -04:00
Jan Kara
d8747d642e reiserfs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: reiserfs-devel@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Jan Kara
0f0b9b63e1 gfs2: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Eric Biggers
aaebdee8b8 f2fs: don't bother checking for encryption key in ->write_iter()
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().

This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:11:08 -07:00
Eric Biggers
b82a6ea6ec f2fs: don't bother checking for encryption key in ->mmap()
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:36 -07:00
Chao Yu
6afae6336a f2fs: wait discard IO completion without cmd_lock held
Wait discard IO completion outside cmd_lock to avoid long latency
of holding cmd_lock in IO busy scenario.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:03 -07:00
Chao Yu
e31b982157 f2fs: wake up all waiters in f2fs_submit_discard_endio
There could be more than one waiter waiting discard IO completion, so we
need use complete_all() instead of complete() in f2fs_submit_discard_endio
to avoid hungtask.

Fixes: 	ec9895add2 ("f2fs: don't hold cmd_lock during waiting discard
command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:54 -07:00
Chao Yu
04dfc23006 f2fs: show more info if fail to issue discard
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:45 -07:00
Chao Yu
fb830fc5cf f2fs: introduce io_list for serialize data/node IOs
Serialize data/node IOs by using fifo list instead of mutex lock,
it will help to enhance concurrency of f2fs, meanwhile keeping LFS
IO semantics.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:03 -07:00
Chao Yu
e41e6d75e5 f2fs: split wio_mutex
Split wio_mutex to adjust different temperature bio cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:23 -07:00
Yunlei He
963932a93c f2fs: combine huge num of discard rb tree consistence checks
Came across a hungtask caused by huge number of rb tree traversing
during adding discard addrs in cp. This patch combine these consistence
checks and move it to discard thread.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:19 -07:00
Yunlei He
dad48e7312 f2fs: fix a bug caused by NULL extent tree
Thread A:					Thread B:

-f2fs_remount
    -sbi->mount_opt.opt = 0;
						<--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)->extent_tree is NULL
        -default_options && parse_options
	    -remount return
						<---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Jaegeuk Kim
1d7be27082 f2fs: try to freeze in gc and discard threads
This allows to freeze gc and discard threads.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Yunlei He
b7b7c4cf1c f2fs: add a new function get_ssr_cost
This patch add a new method get_ssr_cost to select
SSR segment more accurately.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:17 -07:00
Hou Pengyang
bd80a4b981 f2fs: declare load_free_nid_bitmap static
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:16 -07:00
Jaegeuk Kim
cc15620bc8 f2fs: avoid f2fs_lock_op for IPU writes
Currently, if we do get_node_of_data before f2fs_lock_op, there may be dead lock
as follows, where process A would be in infinite loop, and B will NOT be awaked.

Process A(cp):            Process B:
f2fs_lock_all(sbi)
                        get_dnode_of_data <---- lock dn.node_page
flush_nodes             f2fs_lock_op

So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:15 -07:00
Jaegeuk Kim
a912b54d3a f2fs: split bio cache
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:39 -07:00
Jaegeuk Kim
81377bd628 f2fs: use fio instead of multiple parameters
This patch just changes using fio instead of parameters.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:38 -07:00
Jaegeuk Kim
b9109b0e49 f2fs: remove unnecessary read cases in merged IO flow
Merged IO flow doesn't need to care about read IOs.

f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:37 -07:00
Jaegeuk Kim
1919ffc0d7 f2fs: use f2fs_submit_page_bio for ra_meta_pages
This patch avoids to use f2fs_submit_merged_bio for read, which was the only
read case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:36 -07:00
Weichao Guo
e5dbd9563e f2fs: make sure f2fs_gc returns consistent errno
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:35 -07:00