Commit Graph

66232 Commits

Author SHA1 Message Date
Jens Axboe
807abcb088 io_uring: ensure double poll additions work with both request types
The double poll additions were centered around doing POLL_ADD on file
descriptors that use more than one waitqueue (typically one for read,
one for write) when being polled. However, it can also end up being
triggered for when we use poll triggered retry. For that case, we cannot
safely use req->io, as that could be used by the request type itself.

Add a second io_poll_iocb pointer in the structure we allocate for poll
based retry, and ensure we use the right one from the two paths.

Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 19:41:05 -06:00
Linus Torvalds
6a70f89cc5 Merge tag 'nfs-for-5.8-3' of git://git.linux-nfs.org/projects/anna/linux-nfs into master
Pull NFS client fixes from Anna Schumaker:
 "A few more NFS client bugfixes for Linux 5.8:

  NFS:
   - Fix interrupted slots by using the SEQUENCE operation

  SUNRPC:
   - revert d03727b248 to fix unkillable IOs

  xprtrdma:
   - Fix double-free in rpcrdma_ep_create()
   - Fix recursion into rpcrdma_xprt_disconnect()
   - Fix return code from rpcrdma_xprt_connect()
   - Fix handling of connect errors
   - Fix incorrect header size calculations"

* tag 'nfs-for-5.8-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC reverting d03727b248 ("NFSv4 fix CLOSE not waiting for direct IO compeletion")
  xprtrdma: fix incorrect header size calculations
  NFS: Fix interrupted slots by sending a solo SEQUENCE operation
  xprtrdma: Fix handling of connect errors
  xprtrdma: Fix return code from rpcrdma_xprt_connect()
  xprtrdma: Fix recursion into rpcrdma_xprt_disconnect()
  xprtrdma: Fix double-free in rpcrdma_ep_create()
2020-07-17 16:37:52 -07:00
Eric Sandeen
4750a171c3 xfs: preserve inode versioning across remounts
The MS_I_VERSION mount flag is exposed via the VFS, as documented
in the mount manpages etc; see the iversion and noiversion mount
options in mount(8).

As a result, mount -o remount looks for this option in /proc/mounts
and will only send the I_VERSION flag back in during remount it it
is present.  Since it's not there, a remount will /remove/ the
I_VERSION flag at the vfs level, and iversion functionality is lost.

xfs v5 superblocks intend to always have i_version enabled; it is
set as a default at mount time, but is lost during remount for the
reasons above.

The generic fix would be to expose this documented option in
/proc/mounts, but since that was rejected, fix it up again in the
xfs remount path instead, so that at least xfs won't suffer from
this misbehavior.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-17 13:20:20 -07:00
Olga Kornievskaia
65caafd0d2 SUNRPC reverting d03727b248 ("NFSv4 fix CLOSE not waiting for direct IO compeletion")
Reverting commit d03727b248 "NFSv4 fix CLOSE not waiting for
direct IO compeletion". This patch made it so that fput() by calling
inode_dio_done() in nfs_file_release() would wait uninterruptably
for any outstanding directIO to the file (but that wait on IO should
be killable).

The problem the patch was also trying to address was REMOVE returning
ERR_ACCESS because the file is still opened, is supposed to be resolved
by server returning ERR_FILE_OPEN and not ERR_ACCESS.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-07-17 14:47:38 -04:00
Linus Torvalds
4ebf8d7649 Merge tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block into master
Pull io_uring fix from Jens Axboe:
 "Fix for a case where, with automatic buffer selection, we can leak the
  buffer descriptor for recvmsg"

* tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block:
  io_uring: fix recvmsg memory leak with buffer selection
2020-07-17 10:47:51 -07:00
Linus Torvalds
0dd68a34ec Merge tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into master
Pull fuse fixes from Miklos Szeredi:

 - two regressions in this cycle caused by the conversion of writepage
   list to an rb_tree

 - two regressions in v5.4 cause by the conversion to the new mount API

 - saner behavior of fsconfig(2) for the reconfigure case

 - an ancient issue with FS_IOC_{GET,SET}FLAGS ioctls

* tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
  fuse: don't ignore errors from fuse_writepages_fill()
  fuse: clean up condition for writepage sending
  fuse: reject options on reconfigure via fsconfig(2)
  fuse: ignore 'data' argument of mount(..., MS_REMOUNT)
  fuse: use ->reconfigure() instead of ->remount_fs()
  fuse: fix warning in tree_insert() and clean up writepage insertion
  fuse: move rb_erase() before tree_insert()
2020-07-17 10:36:19 -07:00
Linus Torvalds
44fea37378 Merge tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into master
Pull overlayfs fixes from Miklos Szeredi:

 - fix a regression introduced in v4.20 in handling a regenerated
   squashfs lower layer

 - two regression fixes for this cycle, one of which is Oops inducing

 - miscellaneous issues

* tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix lookup of indexed hardlinks with metacopy
  ovl: fix unneeded call to ovl_change_flags()
  ovl: fix mount option checks for nfs_export with no upperdir
  ovl: force read-only sb on failure to create index dir
  ovl: fix regression with re-formatted lower squashfs
  ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on
  ovl: relax WARN_ON() when decoding lower directory file handle
  ovl: remove not used argument in ovl_check_origin
  ovl: change ovl_copy_up_flags static
  ovl: inode reference leak in ovl_is_inuse true case.
2020-07-17 10:29:19 -07:00
Olga Kornievskaia
dbc4fec6b6 NFSv4.0 allow nconnect for v4.0
It looks like this "else" is just a typo.  It turns off nconnect for
NFSv4.0 even though it works for every other version.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-07-17 13:16:23 -04:00
He Zhe
ab91e7a6da freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
commit 0688e64bc6 ("NFS: Allow signal interruption of NFS4ERR_DELAYed operations")
introduces nfs4_delay_interruptible which also needs an _unsafe version to
avoid the following call trace for the same reason explained in
commit 416ad3c9c0 ("freezer: add unsafe versions of freezable helpers for NFS")

CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180

Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-07-17 13:12:44 -04:00
Trond Myklebust
57f80c0eda Merge branch 'xattr-devel' 2020-07-17 10:35:48 -04:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Jason Yan
4df7a75f69 f2fs: Eliminate usage of uninitialized_var() macro
This is an effort to eliminate the uninitialized_var() macro[1].

The use of this macro is the wrong solution because it forces off ANY
analysis by the compiler for a given variable. It even masks "unused
variable" warnings.

Quoted from Linus[2]:

"It's a horrible thing to use, in that it adds extra cruft to the
source code, and then shuts up a compiler warning (even the _reliable_
warnings from gcc)."

Fix it by remove this variable since it is not needed at all.

[1] https://github.com/KSPP/linux/issues/81
[2] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200615085132.166470-1-yanaijie@huawei.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:32:26 -07:00
Christoph Hellwig
5b642d8b9f block: integrate bd_start_claiming into __blkdev_get
bd_start_claiming duplicates a lot of the work done in __blkdev_get.
Integrate the two functions to avoid the duplicate work, and to do the
right thing for the md -ERESTARTSYS corner case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 09:35:44 -06:00
Christoph Hellwig
ecbe6bc000 block: use bd_prepare_to_claim directly in the loop driver
The arcane magic in bd_start_claiming is only needed to be able to claim
a block_device that hasn't been fully set up.  Switch the loop driver
that claims from the ioctl path with a fully set up struct block_device
to just use the much simpler bd_prepare_to_claim directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 09:35:44 -06:00
Christoph Hellwig
58e46ed9cc block: refactor bd_start_claiming
Move the locking and assignment of bd_claiming from bd_start_claiming to
bd_prepare_to_claim.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 09:35:44 -06:00
Christoph Hellwig
c5638ab417 block: simplify the restart case in __blkdev_get
Insted of duplicating all the cleanup logic jump to the code that cleans
up anyway, and restart after that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 09:35:44 -06:00
Christoph Hellwig
9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmode operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00
Christoph Hellwig
c04011fe8c fs: add a vfs_fchown helper
Add a helper for struct file based chown operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:00 +02:00
Amir Goldstein
4518dfcf76 ovl: fix lookup of indexed hardlinks with metacopy
We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup().

When looking up an overlay dentry, upperdentry may be found by index
and not by name.  In that case, we fail to read the metacopy xattr
and falsly set the OVL_UPPERDATA on the overlay inode.

This caused a regression in xfstest overlay/033 when run with
OVERLAY_MOUNT_OPTIONS="-o metacopy=on".

Fixes: 28166ab3c8 ("ovl: initialize OVL_UPPERDATA in ovl_lookup()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 07:24:47 +02:00
Amir Goldstein
81a33c1ee9 ovl: fix unneeded call to ovl_change_flags()
The check if user has changed the overlay file was wrong, causing unneeded
call to ovl_change_flags() including taking f_lock on every file access.

Fixes: d989903058 ("ovl: do not generate duplicate fsnotify events for "fake" path")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 07:24:47 +02:00
David Howells
811f04bac1 afs: Fix interruption of operations
The afs filesystem driver allows unstarted operations to be cancelled by
signal, but most of these can easily be restarted (mkdir for example).  The
primary culprits for reproducing this are those applications that use
SIGALRM to display a progress counter.

File lock-extension operation is marked uninterruptible as we have a
limited time in which to do it, and the release op is marked
uninterruptible also as if we fail to unlock a file, we'll have to wait 20
mins before anyone can lock it again.

The store operation logs a warning if it gets interruption, e.g.:

	kAFS: Unexpected error from FS.StoreData -4

because it's run from the background - but it can also be run from
fdatasync()-type things.  However, store options aren't marked
interruptible at the moment.

Fix this in the following ways:

 (1) Mark store operations as uninterruptible.  It might make sense to
     relax this for certain situations, but I'm not sure how to make sure
     that background store ops aren't affected by signals to foreground
     processes that happen to trigger them.

 (2) In afs_get_io_locks(), where we're getting the serialisation lock for
     talking to the fileserver, return ERESTARTSYS rather than EINTR
     because a lot of the operations (e.g. mkdir) are restartable if we
     haven't yet started sending the op to the server.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-15 15:49:04 -07:00
Amir Goldstein
f0e1266ed2 ovl: fix mount option checks for nfs_export with no upperdir
Without upperdir mount option, there is no index dir and the dependency
checks nfs_export => index for mount options parsing are incorrect.

Allow the combination nfs_export=on,index=off with no upperdir and move
the check for dependency redirect_dir=nofollow for non-upper mount case
to mount options parsing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:11:15 +02:00
Amir Goldstein
470c156361 ovl: force read-only sb on failure to create index dir
With index feature enabled, on failure to create index dir, overlay is
being mounted read-only.  However, we do not forbid user to remount overlay
read-write.  Fix that by setting ofs->workdir to NULL, which prevents
remount read-write.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:11:15 +02:00
Amir Goldstein
a888db3101 ovl: fix regression with re-formatted lower squashfs
Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of lower
fs") relaxed the requirement for non null uuid with single lower layer to
allow enabling index and nfs_export features with single lower squashfs.

Fabian reported a regression in a setup when overlay re-uses an existing
upper layer and re-formats the lower squashfs image.  Because squashfs
has no uuid, the origin xattr in upper layer are decoded from the new
lower layer where they may resolve to a wrong origin file and user may
get an ESTALE or EIO error on lookup.

To avoid the reported regression while still allowing the new features
with single lower squashfs, do not allow decoding origin with lower null
uuid unless user opted-in to one of the new features that require
following the lower inode of non-dir upper (index, xino, metacopy).

Reported-by: Fabian <godi.beat@gmx.net>
Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid of lower fs")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:10:31 +02:00
Amir Goldstein
20396365a1 ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on
Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic
since v5.8-rc1 overlayfs updates.

 overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2)
 BUG: kernel NULL pointer dereference, address: 0000000000000030
 RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay]

Bisect point at commit c21c839b84 ("ovl: whiteout inode sharing")

Minimal reproducer:
--------------------------------------------------
rm -rf l u w m
mkdir -p l u w m
mkdir -p l/testdir
touch l/testdir/testfile
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
echo 1 > m/testdir/testfile
umount m
rm -rf u/testdir
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
umount m
--------------------------------------------------

When mount with nfs_export=on, and fail to verify an orphan index, we're
cleaning this index from indexdir by calling ovl_cleanup_and_whiteout().
This dereferences ofs->workdir, that was earlier set to NULL.

The design was that ovl->workdir will point at ovl->indexdir, but we are
assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup().
There is no reason not to do it sooner, because once we get success from
ofs->indexdir = ovl_workdir_create(... there is no turning back.

Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: c21c839b84 ("ovl: whiteout inode sharing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:09:59 +02:00
Amir Goldstein
124c2de2c0 ovl: relax WARN_ON() when decoding lower directory file handle
Decoding a lower directory file handle to overlay path with cold
inode/dentry cache may go as follows:

1. Decode real lower file handle to lower dir path
2. Check if lower dir is indexed (was copied up)
3. If indexed, get the upper dir path from index
4. Lookup upper dir path in overlay
5. If overlay path found, verify that overlay lower is the lower dir
   from step 1

On failure to verify step 5 above, user will get an ESTALE error and a
WARN_ON will be printed.

A mismatch in step 5 could be a result of lower directory that was renamed
while overlay was offline, after that lower directory has been copied up
and indexed.

This is a scripted reproducer based on xfstest overlay/052:

  # Create lower subdir
  create_dirs
  create_test_files $lower/lowertestdir/subdir
  mount_dirs
  # Copy up lower dir and encode lower subdir file handle
  touch $SCRATCH_MNT/lowertestdir
  test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
  # Rename lower dir offline
  unmount_dirs
  mv $lower/lowertestdir $lower/lowertestdir.new/
  mount_dirs
  # Attempt to decode lower subdir file handle
  test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle

Since this WARN_ON() can be triggered by user we need to relax it.

Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
Cc: <stable@vger.kernel.org> # v4.16+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:09:17 +02:00
youngjun
d78a0dcf64 ovl: remove not used argument in ovl_check_origin
ovl_check_origin outparam 'ctrp' argument not used by caller.  So remove
this argument.

Signed-off-by: youngjun <her0gyugyu@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:06:16 +02:00
youngjun
5ac8e8025a ovl: change ovl_copy_up_flags static
"ovl_copy_up_flags" is used in copy_up.c.
so, change it static.

Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:06:16 +02:00
youngjun
24f14009b8 ovl: inode reference leak in ovl_is_inuse true case.
When "ovl_is_inuse" true case, trap inode reference not put.  plus adding
the comment explaining sequence of ovl_is_inuse after ovl_setup_trap.

Fixes: 0be0bfd2de ("ovl: fix regression caused by overlapping layers detection")
Cc: <stable@vger.kernel.org> # v4.19+
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16 00:05:40 +02:00
Pavel Begunkov
681fda8d27 io_uring: fix recvmsg memory leak with buffer selection
io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
causes a leak when used with automatic buffer selection.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 13:35:56 -06:00
Amir Goldstein
9c61f3b560 fanotify: break up fanotify_alloc_event()
Break up fanotify_alloc_event() into helpers by event struct type.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:41:33 +02:00
Amir Goldstein
b8a6c3a2f0 fanotify: create overflow event type
The special overflow event is allocated as struct fanotify_path_event,
but with a null path.

Use a special event type to identify the overflow event, so the helper
fanotify_has_event_path() will always indicate a non null path.

Allocating the overflow event doesn't need any of the fancy stuff in
fanotify_alloc_event(), so create a simplified helper for allocating the
overflow event.

There is also no need to store and report the pid with an overflow event.

Link: https://lore.kernel.org/r/20200708111156.24659-7-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:37:03 +02:00
Amir Goldstein
956235afd1 inotify: do not use objectid when comparing events
inotify's event->wd is the object identifier.
Compare that instead of the common fsnotidy event objectid, so
we can get rid of the objectid field later.

Link: https://lore.kernel.org/r/20200708111156.24659-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:58 +02:00
Amir Goldstein
9991bb84b2 kernfs: do not call fsnotify() with name without a parent
When creating an FS_MODIFY event on inode itself (not on parent)
the file_name argument should be NULL.

The change to send a non NULL name to inode itself was done on purpuse
as part of another commit, as Tejun writes: "...While at it, supply the
target file name to fsnotify() from kernfs_node->name.".

But this is wrong practice and inconsistent with inotify behavior when
watching a single file.  When a child is being watched (as opposed to the
parent directory) the inotify event should contain the watch descriptor,
but not the file name.

Fixes: df6a58c5c5 ("kernfs: don't depend on d_find_any_alias()...")
Link: https://lore.kernel.org/r/20200708111156.24659-5-amir73il@gmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:52 +02:00
Amir Goldstein
9a02aa40dd nfsd: use fsnotify_data_inode() to get the unlinked inode
The inode argument to handle_event() is about to become obsolete.

Link: https://lore.kernel.org/r/20200708111156.24659-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:47 +02:00
Amir Goldstein
cbcf47adc8 fsnotify: return non const from fsnotify_data_inode()
Return non const inode pointer from fsnotify_data_inode().
None of the fsnotify hooks pass const inode pointer as data and
callers often need to cast to a non const pointer.

Link: https://lore.kernel.org/r/20200708111156.24659-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:45 +02:00
Amir Goldstein
c738fbabb0 fsnotify: fold fsnotify() call into fsnotify_parent()
All (two) callers of fsnotify_parent() also call fsnotify() to notify
the child inode. Move the second fsnotify() call into fsnotify_parent().

This will allow more flexibility in making decisions about which of the
two event falvors should be sent.

Using 'goto notify_child' in the inline helper seems a bit strange, but
it mimics the code in __fsnotify_parent() for clarity and the goto
pattern will become less strage after following patches are applied.

Link: https://lore.kernel.org/r/20200708111156.24659-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:41 +02:00
Mel Gorman
71d734103e fsnotify: Rearrange fast path to minimise overhead when there is no watcher
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed functional call even if there
are no watchers and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no
watchers.

The test motivating this was "perf bench sched messaging --pipe". Despite
the fact the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.

                              5.7.0                  5.7.0
                            vanilla      fastfsnotify-v1r1
Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)

                       5.7.0       5.7.0
                     vanilla fastfsnotify-v1r1
Duration User         157.05      152.79
Duration System      1279.98     1219.32
Duration Elapsed      182.81      174.52

This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
mostly function entry and the initial check for watchers.  The check for
watchers is complicated enough that inlining it may be controversial.

[Amir] Slightly simplify with mnt_or_sb_mask => marks_mask

Link: https://lore.kernel.org/r/20200708111156.24659-1-amir73il@gmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 15:29:10 +02:00
Jan Kara
47aaabdedf fanotify: Avoid softlockups when reading many events
When user provides large buffer for events and there are lots of events
available, we can try to copy them all to userspace without scheduling
which can softlockup the kernel (furthermore exacerbated by the
contention on notification_lock). Add a scheduling point after copying
each event.

Note that usually the real underlying problem is the cost of fanotify
event merging and the resulting contention on notification_lock but this
is a cheap way to somewhat reduce the problem until we can properly
address that.

Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Link: https://lore.kernel.org/lkml/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 15:23:28 +02:00
Chirantan Ekbote
31070f6cce fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
The ioctl encoding for this parameter is a long but the documentation says
it should be an int and the kernel drivers expect it to be an int.  If the
fuse driver treats this as a long it might end up scribbling over the stack
of a userspace process that only allocated enough space for an int.

This was previously discussed in [1] and a patch for fuse was proposed in
[2].  From what I can tell the patch in [2] was nacked in favor of adding
new, "fixed" ioctls and using those from userspace.  However there is still
no "fixed" version of these ioctls and the fact is that it's sometimes
infeasible to change all userspace to use the new one.

Handling the ioctls specially in the fuse driver seems like the most
pragmatic way for fuse servers to support them without causing crashes in
userspace applications that call them.

[1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
[2]: https://sourceforge.net/p/fuse/mailman/message/31771759/

Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-15 14:18:20 +02:00
Alexei Starovoitov
ec2ffdf65f Merge branch 'usermode-driver-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into bpf-next 2020-07-14 12:18:01 -07:00
He Zhe
59679d9933 freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS
commit 0688e64bc6 ("NFS: Allow signal interruption of
NFS4ERR_DELAYed operations") introduces nfs4_delay_interruptible
which also needs an _unsafe version to avoid the following call
trace for the same reason explained in commit 416ad3c9c0 ("freezer:
add unsafe versions of freezable helpers for NFS")

CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180

Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-07-14 19:25:41 +02:00
YueHaibing
8464e650b9 xfs: remove duplicated include from xfs_buf_item.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2020-07-14 08:47:33 -07:00
Christoph Hellwig
76622c88c2 xfs: remove SYNC_WAIT and SYNC_TRYLOCK
These two definitions are unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2020-07-14 08:47:33 -07:00
Gao Xiang
92a005448f xfs: get rid of unnecessary xfs_perag_{get,put} pairs
In the course of some operations, we look up the perag from
the mount multiple times to get or change perag information.
These are often very short pieces of code, so while the
lookup cost is generally low, the cost of the lookup is far
higher than the cost of the operation we are doing on the
perag.

Since we changed buffers to hold references to the perag
they are cached in, many modification contexts already hold
active references to the perag that are held across these
operations. This is especially true for any operation that
is serialised by an allocation group header buffer.

In these cases, we can just use the buffer's reference to
the perag to avoid needing to do lookups to access the
perag. This means that many operations don't need to do
perag lookups at all to access the perag because they've
already looked up objects that own persistent references
and hence can use that reference instead.

Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-14 08:47:33 -07:00
Vasily Averin
7779b047a5 fuse: don't ignore errors from fuse_writepages_fill()
fuse_writepages() ignores some errors taken from fuse_writepages_fill() I
believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
either guarantee that all data was successfully saved or return error.

Fixes: 26d614df1d ("fuse: Implement writepages callback")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14 14:45:42 +02:00
Miklos Szeredi
6ddf3af93e fuse: clean up condition for writepage sending
fuse_writepages_fill uses following construction:

if (wpa && ap->num_pages &&
    (A || B || C)) {
        action;
} else if (wpa && D) {
        if (E) {
                the same action;
        }
}

 - ap->num_pages check is always true and can be removed

 - "if" and "else if" calls the same action and can be merged.

Move checking A, B, C, D, E conditions to a helper, add comments.

Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14 14:45:41 +02:00
Miklos Szeredi
b330966f79 fuse: reject options on reconfigure via fsconfig(2)
Previous patch changed handling of remount/reconfigure to ignore all
options, including those that are unknown to the fuse kernel fs.  This was
done for backward compatibility, but this likely only affects the old
mount(2) API.

The new fsconfig(2) based reconfiguration could possibly be improved.  This
would make the new API less of a drop in replacement for the old, OTOH this
is a good chance to get rid of some weirdnesses in the old API.

Several other behaviors might make sense:

 1) unknown options are rejected, known options are ignored

 2) unknown options are rejected, known options are rejected if the value
 is changed, allowed otherwise

 3) all options are rejected

Prior to the backward compatibility fix to ignore all options all known
options were accepted (1), even if they change the value of a mount
parameter; fuse_reconfigure() does not look at the config values set by
fuse_parse_param().

To fix that we'd need to verify that the value provided is the same as set
in the initial configuration (2).  The major drawback is that this is much
more complex than just rejecting all attempts at changing options (3);
i.e. all options signify initial configuration values and don't make sense
on reconfigure.

This patch opts for (3) with the rationale that no mount options are
reconfigurable in fuse.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14 14:45:41 +02:00
Miklos Szeredi
e8b20a474c fuse: ignore 'data' argument of mount(..., MS_REMOUNT)
The command

  mount -o remount -o unknownoption /mnt/fuse

succeeds on kernel versions prior to v5.4 and fails on kernel version at or
after.  This is because fuse_parse_param() rejects any unrecognised options
in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.

This causes a regression in case the fuse filesystem is in fstab, since
remount sends all options found there to the kernel; even ones that are
meant for the initial mount and are consumed by the userspace fuse server.

Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
the conversion to the new API.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14 14:45:41 +02:00
Miklos Szeredi
0189a2d367 fuse: use ->reconfigure() instead of ->remount_fs()
s_op->remount_fs() is only called from legacy_reconfigure(), which is not
used after being converted to the new API.

Convert to using ->reconfigure().  This restores the previous behavior of
syncing the filesystem and rejecting MS_MANDLOCK on remount.

Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14 14:45:41 +02:00