Now that all filesystems have been converted to use
fscrypt_prepare_new_inode(), the encryption key for new symlink inodes
is now already set up whenever we try to encrypt the symlink target.
Enforce this rather than try to set up the key again when it may be too
late to do so safely.
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-9-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert ubifs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context().
Unlike ext4 and f2fs, this doesn't appear to fix any deadlock bug. But
it does shorten the code slightly and get all filesystems using the same
helper functions, so that fscrypt_inherit_context() can be removed.
It also fixes an incorrect error code where ubifs returned EPERM instead
of the expected ENOKEY.
Link: https://lore.kernel.org/r/20200917041136.178600-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert f2fs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from under f2fs_lock_op(), which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
This also fixes a f2fs-specific deadlock when the filesystem is mounted
with '-o test_dummy_encryption' and a file is created in an unencrypted
directory other than the root directory:
INFO: task touch:207 blocked for more than 30 seconds.
Not tainted 5.9.0-rc4-00099-g729e3d0919844 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch state:D stack: 0 pid: 207 ppid: 167 flags:0x00000000
Call Trace:
[...]
lock_page include/linux/pagemap.h:548 [inline]
pagecache_get_page+0x25e/0x310 mm/filemap.c:1682
find_or_create_page include/linux/pagemap.h:348 [inline]
grab_cache_page include/linux/pagemap.h:424 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2395 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2373 [inline]
__get_node_page.part.0+0x39/0x2d0 fs/f2fs/node.c:1350
__get_node_page fs/f2fs/node.c:35 [inline]
f2fs_get_node_page+0x2e/0x60 fs/f2fs/node.c:1399
read_inline_xattr+0x88/0x140 fs/f2fs/xattr.c:288
lookup_all_xattrs+0x1f9/0x2c0 fs/f2fs/xattr.c:344
f2fs_getxattr+0x9b/0x160 fs/f2fs/xattr.c:532
f2fs_get_context+0x1e/0x20 fs/f2fs/super.c:2460
fscrypt_get_encryption_info+0x9b/0x450 fs/crypto/keysetup.c:472
fscrypt_inherit_context+0x2f/0xb0 fs/crypto/policy.c:640
f2fs_init_inode_metadata+0xab/0x340 fs/f2fs/dir.c:540
f2fs_add_inline_entry+0x145/0x390 fs/f2fs/inline.c:621
f2fs_add_dentry+0x31/0x80 fs/f2fs/dir.c:757
f2fs_do_add_link+0xcd/0x130 fs/f2fs/dir.c:798
f2fs_add_link fs/f2fs/f2fs.h:3234 [inline]
f2fs_create+0x104/0x290 fs/f2fs/namei.c:344
lookup_open.isra.0+0x2de/0x500 fs/namei.c:3103
open_last_lookups+0xa9/0x340 fs/namei.c:3177
path_openat+0x8f/0x1b0 fs/namei.c:3365
do_filp_open+0x87/0x130 fs/namei.c:3395
do_sys_openat2+0x96/0x150 fs/open.c:1168
[...]
That happened because f2fs_add_inline_entry() locks the directory
inode's page in order to add the dentry, then f2fs_get_context() tries
to lock it recursively in order to read the encryption xattr. This
problem is specific to "test_dummy_encryption" because normally the
directory's fscrypt_info would be set up prior to
f2fs_add_inline_entry() in order to encrypt the new filename.
Regardless, the new design fixes this test_dummy_encryption deadlock as
well as potential deadlocks with fs reclaim, by setting up any needed
fscrypt_info structs prior to taking so many locks.
The test_dummy_encryption deadlock was reported by Daniel Rosenberg.
Reported-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert ext4 to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from within a transaction, which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
Link: https://lore.kernel.org/r/20200917041136.178600-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
To compute a new inode's xattr credits, we need to know whether the
inode will be encrypted or not. When we switch to use the new helper
function fscrypt_prepare_new_inode(), we won't find out whether the
inode will be encrypted until slightly later than is currently the case.
That will require moving the code block that computes the xattr credits.
To make this easier and reduce the length of __ext4_new_inode(), move
this code block into a new function ext4_xattr_credits_for_new_inode().
Link: https://lore.kernel.org/r/20200917041136.178600-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_get_encryption_info() is intended to be GFP_NOFS-safe. But
actually it isn't, since it uses functions like crypto_alloc_skcipher()
which aren't GFP_NOFS-safe, even when called under memalloc_nofs_save().
Therefore it can deadlock when called from a context that needs
GFP_NOFS, e.g. during an ext4 transaction or between f2fs_lock_op() and
f2fs_unlock_op(). This happens when creating a new encrypted file.
We can't fix this by just not setting up the key for new inodes right
away, since new symlinks need their key to encrypt the symlink target.
So we need to set up the new inode's key before starting the
transaction. But just calling fscrypt_get_encryption_info() earlier
doesn't work, since it assumes the encryption context is already set,
and the encryption context can't be set until the transaction.
The recently proposed fscrypt support for the ceph filesystem
(https://lkml.kernel.org/linux-fscrypt/20200821182813.52570-1-jlayton@kernel.org/T/#u)
will have this same ordering problem too, since ceph will need to
encrypt new symlinks before setting their encryption context.
Finally, f2fs can deadlock when the filesystem is mounted with
'-o test_dummy_encryption' and a new file is created in an existing
unencrypted directory. Similarly, this is caused by holding too many
locks when calling fscrypt_get_encryption_info().
To solve all these problems, add new helper functions:
- fscrypt_prepare_new_inode() sets up a new inode's encryption key
(fscrypt_info), using the parent directory's encryption policy and a
new random nonce. It neither reads nor writes the encryption context.
- fscrypt_set_context() persists the encryption context of a new inode,
using the information from the fscrypt_info already in memory. This
replaces fscrypt_inherit_context().
Temporarily keep fscrypt_inherit_context() around until all filesystems
have been converted to use fscrypt_set_context().
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The following sequence of commands,
mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0
mount -o rtdev=/dev/loop1 /dev/loop0 /mnt
xfs_growfs /mnt
... causes the following call trace to be printed on the console,
XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331
Call Trace:
xfs_buf_item_format+0x632/0x680
? kmem_alloc_large+0x29/0x90
? kmem_alloc+0x70/0x120
? xfs_log_commit_cil+0x132/0x940
xfs_log_commit_cil+0x26f/0x940
? xfs_buf_item_init+0x1ad/0x240
? xfs_growfs_rt_alloc+0x1fc/0x280
__xfs_trans_commit+0xac/0x370
xfs_growfs_rt_alloc+0x1fc/0x280
xfs_growfs_rt+0x1a0/0x5e0
xfs_file_ioctl+0x3fd/0xc70
? selinux_file_ioctl+0x174/0x220
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x3e/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurs because the buffer being formatted has the value of
XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of
bip->bli_formats->blf_flags.
This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF
and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of
bip->bli_formats->blf_flags before committing the corresponding
transaction.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode extent truncate path unmaps extents from the inode block
mapping, finishes deferred ops to free the associated extents and
then explicitly rolls the transaction before processing the next
extent. The latter extent roll is spurious as xfs_defer_finish()
always returns a clean transaction and automatically relogs inodes
attached to the transaction (with lock_flags == 0). This can
unnecessarily increase the number of log ticket regrants that occur
during a long running truncate operation. Remove the explicit
transaction roll.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pass the full length to iomap_zero() and dax_iomap_zero(), and have
them return how many bytes they actually handled. This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
iomap_write_end cannot return an error, so switch it to return
size_t instead of int and remove the error checking from the callers.
Also convert the arguments to size_t from unsigned int, in case anyone
ever wants to support a page size larger than 2GB.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Size the uptodate array dynamically to support larger pages in the
page cache. With a 64kB page, we're only saving 8 bytes per page today,
but with a 2MB maximum page size, we'd have to allocate more than 4kB
per page. Add a few debugging assertions.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that the bitmap is protected by a spinlock, we can use the
more efficient bitmap ops instead of individual test/set bit ops.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can skip most of the initialisation, although spinlocks still
need explicit initialisation as architectures may use a non-zero
value to indicate unlocked. The comment is no longer useful as
attach_page_private() handles the refcount now.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helper is useful for both THPs and for supporting block size larger
than page size. Convert all users that I could find (we have a few
different ways of writing this idiom, and I may have missed some).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
If iomap_unshare_actor() unshares to an inline iomap, the page was
not being flushed. block_write_end() and __iomap_write_end() already
contain flushes, so adding it to iomap_write_end_inline() seems like
the best place. That means we can remove it from iomap_write_actor().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While it is true that reading from an unmirrored source always uses
index 0, that is no longer true for mirrored sources when we fail over.
Fixes: 563c53e73b ("NFS: Fix flexfiles read failover")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The hash_lock field of the cache structure was a leftover
of a previous iteration of the code. It is now unused,
so remove it.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Convert the uses of fallthrough comments to fallthrough macro. Please see
commit 294f69e662 ("compiler_attributes.h: Add 'fallthrough' pseudo
keyword for switch/case use") for detail.
Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The variable error is ssize_t, which is signed and will
cast to unsigned when comapre with variable size, so add
a check to avoid unexpected result in case of negative
value of error.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The pointer clnt is being initialized with a value that is never
read and so this is assignment redundant and can be removed. The
pointer can removed because it is being used as a temporary
variable and it is clearer to make the direct assignment and remove
it completely.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A previous commit unified how we handle prep for these two functions,
but this means that we check the allowed context (SQPOLL, specifically)
later than we should. Move the ring type checking into the two parent
functions, instead of doing it after we've done some setup work.
Fixes: ec65fea5a8 ("io_uring: deduplicate io_openat{,2}_prep()")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These will naturally fail when attempted through SQPOLL, but either
with -EFAULT or -EBADF. Make it explicit that these are not workable
through SQPOLL and return -EINVAL, just like other ops that need to
use ->files.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some block devices, like dm, bubble back -EAGAIN through the completion
handler. We check for this in io_read(), but don't honor it for when
we have copied the iov. Return -EAGAIN for this case before retrying,
to force punt to io-wq.
Fixes: bcf5a06304 ("io_uring: support true async buffered reads, if file provides it")
Reported-by: Zorro Lang <zlang@redhat.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we already have mapped the necessary data for retry, then don't set
it up again. It's a pointless operation, and we leak the iovec if it's
a large (non-stack) vec.
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
definition of dirtytime_interval_handler to match its prototype in
linux/writeback.h which fixes the following sparse error/warning:
fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces)
fs/fs-writeback.c:2189:50: expected void *
fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer
fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)):
fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
fs/fs-writeback.c: note: in included file:
./include/linux/writeback.h:374:5: note: previously declared as:
./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 0615090c50 ("erofs: convert compressed files from
readpages to readahead"), add_to_page_cache_lru() was moved to mm
code, so that in below call path, no page will be cached into
@pagepool list or grabbed from @pagepool list:
- z_erofs_readpage
- z_erofs_do_read_page
- preload_compressed_pages
- erofs_allocpage
Let's get rid of this unneeded @pagepool parameter.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200917011821.22767-1-yuchao0@huawei.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
While it is true that reading from an unmirrored source always uses
index 0, that is no longer true for mirrored sources when we fail over.
Fixes: 563c53e73b ("NFS: Fix flexfiles read failover")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Submounts have their own superblock, which needs to be initialized.
However, they do not have a fuse_fs_context associated with them, and
the root node's attributes should be taken from the mountpoint's node.
Extend fuse_fill_super_common() to work for submounts by making the @ctx
parameter optional, and by adding a @submount_finode parameter.
(There is a plain "unsigned" in an existing code block that is being
indented by this commit. Extend it to "unsigned int" so checkpatch does
not complain.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We want to allow submounts for the same fuse_conn, but with different
superblocks so that each of the submounts has its own device ID. To do
so, we need to split all mount-specific information off of fuse_conn
into a new fuse_mount structure, so that multiple mounts can share a
single fuse_conn.
We need to take care only to perform connection-level actions once (i.e.
when the fuse_conn and thus the first fuse_mount are established, or
when the last fuse_mount and thus the fuse_conn are destroyed). For
example, fuse_sb_destroy() must invoke fuse_send_destroy() until the
last superblock is released.
To do so, we keep track of which fuse_mount is the root mount and
perform all fuse_conn-level actions only when this fuse_mount is
involved.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the last commit, all functions that handle some existing fuse_req
no longer need to be given the associated fuse_conn, because they can
get it from the fuse_req object.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Every fuse_req belongs to a fuse_conn. Right now, we always know which
fuse_conn that is based on the respective device, but we want to allow
multiple (sub)mounts per single connection, and then the corresponding
filesystem is not going to be so trivial to obtain.
Storing a pointer to the associated fuse_conn in every fuse_req will
allow us to trivially find any request's superblock (and thus
filesystem) even then.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
After unlock_request() pages from the ap->pages[] array may be put (e.g. by
aborting the connection) and the pages can be freed.
Prevent use after free by grabbing a reference to the page before calling
unlock_request().
The original patch was created by Pradeep P V K.
Reported-by: Pradeep P V K <ppvk@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>