Make sure we specify the layout segment range when calculating the
mirror count. In theory, that number could depend on the range to
which we're writing.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Both nfs_pageio_reset_read_mds() and nfs_pageio_reset_write_mds()
do call pnfs_generic_pg_cleanup() for us.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the application uses the AT_STATX_DONT_SYNC flag after doing readdir(),
then we should still mark the parent inode as seeing a readdirplus hit.
That ensures that we continue to use readdirplus in the 'ls -l' type
of workflow to do fast lookups of the dentries.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull io_uring fixes from Jens Axboe:
"Two late fixes again:
- Fix missing msg_name assignment in certain cases (Pavel)
- Correct a previous fix for full coverage (Pavel)"
* tag 'io_uring-5.8-2020-07-12' of git://git.kernel.dk/linux-block:
io_uring: fix not initialised work->flags
io_uring: fix missing msg_name assignment
Pull btrfs fixes from David Sterba:
"Two refcounting fixes and one prepartory patch for upcoming splice
cleanup:
- fix double put of block group with nodatacow
- fix missing block group put when remounting with discard=async
- explicitly set splice callback (no functional change), to ease
integrating splice cleanup patches"
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: wire up iter_file_splice_write
btrfs: fix double put of block group with nocow
btrfs: discard: add missing put when grabbing block group from unused list
59960b9deb ("io_uring: fix lazy work init") tried to fix missing
io_req_init_async(), but left out work.flags and hash. Do it earlier.
Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure to set msg.msg_name for the async portion of send/recvmsg,
as the header copy will copy to/from it.
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cifs fixes from Steve French:
"Four cifs/smb3 fixes: the three for stable fix problems found recently
with change notification including a reference count leak"
* tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
cifs: fix reference leak for tlink
smb3: fix unneeded error message on change notify
cifs: remove the retry in cifs_poxis_lock_set
smb3: fix access denied on change notify request to some servers
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.
Signed-off-by: Kees Cook <keescook@chromium.org>
debugfs_create_u32_array() allocates a small structure to wrap
the data and size information about the array. If users ever
try to remove the file this leads to a leak since nothing ever
frees this wrapper.
That said there are no upstream users of debugfs_create_u32_array()
that'd remove a u32 array file (we only have one u32 array user in
CMA), so there is no real bug here.
Make callers pass a wrapper they allocated. This way the lifetime
management of the wrapper is on the caller, and we can avoid the
potential leak in debugfs.
CC: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull io_uring fixes from Jens Axboe:
- Fix memleak for error path in registered files (Yang)
- Export CQ overflow state in flags, necessary to fix a case where
liburing doesn't know if it needs to enter the kernel (Xiaoguang)
- Fix for a regression in when user memory is accounted freed, causing
issues with back-to-back ring exit + init if the ulimit -l setting is
very tight.
* tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block:
io_uring: account user memory freed when exit has been queued
io_uring: fix memleak in io_sqe_files_register()
io_uring: fix memleak in __io_sqe_files_update()
io_uring: export cq overflow status to userspace
Pull in-kernel read and write op cleanups from Christoph Hellwig:
"Cleanup in-kernel read and write operations
Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
all users of in-kernel file I/O use them if they don't use iov_iter
based methods already.
The new WARN_ONs in combination with syzcaller already found a missing
input validation in 9p. The fix should be on your way through the
maintainer ASAP".
[ This is prep-work for the real changes coming 5.9 ]
* tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc:
fs: remove __vfs_read
fs: implement kernel_read using __kernel_read
integrity/ima: switch to using __kernel_read
fs: add a __kernel_read helper
fs: remove __vfs_write
fs: implement kernel_write using __kernel_write
fs: check FMODE_WRITE in __kernel_write
fs: unexport __kernel_write
bpfilter: switch to kernel_write
autofs: switch to kernel_write
cachefiles: switch to kernel_write
Pull gfs2 fixes from Andreas Gruenbacher:
"Fix gfs2 readahead deadlocks by adding a IOCB_NOIO flag that allows
gfs2 to use the generic fiel read iterator functions without having to
worry about being called back while holding locks".
* tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Rework read and page fault locking
fs: Add IOCB_NOIO flag for generic_file_read_iter
We currently account the memory after the exit work has been run, but
that leaves a gap where a process has closed its ring and until the
memory has been accounted as freed. If the memlocked ulimit is
borderline, then that can introduce spurious setup errors returning
-ENOMEM because the free work hasn't been run yet.
Account this as freed when we close the ring, as not to expose a tiny
gap where setting up a new ring can fail.
Fixes: 85faa7b834 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
on the mm of the current task. Drop the argument.
Move io_file_put_work() to where we have the other forward declarations
of functions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs implements the iter_write op and thus can use the more efficient
iov_iter based splice implementation. For now falling back to the less
efficient default is pretty harmless, but I have a pending series that
removes the default, and thus would cause btrfs to not support splice
at all.
Reported-by: Andy Lavr <andy.lavr@gmail.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:
======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30
but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
which lock already depends on the new lock.
:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_internal);
lock(fs_reclaim);
lock(sb_internal);
lock(fs_reclaim);
*** DEADLOCK ***
4 locks held by fsfreeze/4346:
#0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
#1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
#2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
#3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
stack backtrace:
Call Trace:
dump_stack+0xe0/0x19a
print_circular_bug.isra.10.cold.34+0x2f4/0x435
check_prev_add.constprop.19+0xca1/0x15f0
validate_chain.isra.14+0x11af/0x3b50
__lock_acquire+0x728/0x1200
lock_acquire+0x269/0x5a0
fs_reclaim_acquire.part.19+0x29/0x30
fs_reclaim_acquire+0x19/0x20
kmem_cache_alloc+0x3e/0x3f0
kmem_zone_alloc+0x79/0x150
xfs_trans_alloc+0xfa/0x9d0
xfs_sync_sb+0x86/0x170
xfs_log_sbcount+0x10f/0x140
xfs_quiesce_attr+0x134/0x270
xfs_fs_freeze+0x4a/0x70
freeze_super+0x1af/0x290
do_vfs_ioctl+0xedc/0x16c0
ksys_ioctl+0x41/0x80
__x64_sys_ioctl+0x73/0xa9
do_syscall_64+0x18f/0xd23
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen.
One way to avoid this splat is to add GFP_NOFS to the affected allocation
calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
This shouldn't matter unless the system is really running out of memory.
In that particular case, the filesystem freeze operation may fail while
it was succeeding previously.
Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.
# fsfreeze -f /home
# fsfreeze --unfreeze /home
# grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While debugging a patch that I wrote I was hitting use-after-free panics
when accessing block groups on unmount. This turned out to be because
in the nocow case if we bail out of doing the nocow for whatever reason
we need to call btrfs_dec_nocow_writers() if we called the inc. This
puts our block group, but a few error cases does
if (nocow) {
btrfs_dec_nocow_writers();
goto error;
}
unfortunately, error is
error:
if (nocow)
btrfs_dec_nocow_writers();
so we get a double put on our block group. Fix this by dropping the
error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
error label now.
Fixes: 762bf09893 ("btrfs: improve error handling in run_delalloc_nocow")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit
bf67fad19e ("efi: Use more granular check for availability for variable services")
introduced a check into the efivarfs, efi-pstore and other drivers that
aborts loading of the module if not all three variable runtime services
(GetVariable, SetVariable and GetNextVariable) are supported. However, this
results in efivarfs being unavailable entirely if only SetVariable support
is missing, which is only needed if you want to make any modifications.
Also, efi-pstore and the sysfs EFI variable interface could be backed by
another implementation of the 'efivars' abstraction, in which case it is
completely irrelevant which services are supported by the EFI firmware.
So make the generic 'efivars' abstraction dependent on the availibility of
the GetVariable and GetNextVariable EFI runtime services, and add a helper
'efivar_supports_writes()' to find out whether the currently active efivars
abstraction supports writes (and wire it up to the availability of
SetVariable for the generic one).
Then, use the efivar_supports_writes() helper to decide whether to permit
efivarfs to be mounted read-write, and whether to enable efi-pstore or the
sysfs EFI variable interface altogether.
Fixes: bf67fad19e ("efi: Use more granular check for availability for variable services")
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Link: https://lore.kernel.org/r/20200708171905.15396-1-grandmaster@al2klimov.de
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Almost all callers of ext2_find_entry() transform NULL return value to
-ENOENT, so just let ext2_find_entry() retuen -ENOENT instead of NULL
if no valid entry found, and also switch to check the return value of
ext2_inode_by_name() in ext2_lookup() and ext2_get_parent().
Link: https://lore.kernel.org/r/20200608034043.10451-2-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
The same to commit <36de928641ee4> (ext4: propagate errors up to
ext4_find_entry()'s callers') in ext4, also return error instead of NULL
pointer in case of some error happens in ext2_find_entry() (e.g. -ENOMEM
or -EIO). This could avoid a negative dentry cache entry installed even
it failed to read directory block due to IO error.
Link: https://lore.kernel.org/r/20200608034043.10451-1-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In the process of changing value for existing EA,
there is an improper assignment of e_value_offs(setting to 0),
because it will be reset to incorrect value in the following
loop(shifting EA values before target). Delayed assignment
can avoid this issue.
Link: https://lore.kernel.org/r/20200603084429.25344-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
meta inode's pages are used for encrypted, verity and compressed blocks,
so the meta inode's cache invalidation condition in do_checkpoint() should
consider compression as well, not just for verity and encryption, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:
static void test_cq_overflow(struct io_uring *ring)
{
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
int issued = 0;
int ret = 0;
do {
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
break;;
}
ret = io_uring_submit(ring);
if (ret <= 0) {
if (ret != -EBUSY)
fprintf(stderr, "sqe submit failed: %d\n", ret);
break;
}
issued++;
} while (ret > 0);
assert(ret == -EBUSY);
printf("issued requests: %d\n", issued);
while (issued) {
ret = io_uring_peek_cqe(ring, &cqe);
if (ret) {
if (ret != -EAGAIN) {
fprintf(stderr, "peek completion failed: %s\n",
strerror(ret));
break;
}
printf("left requets: %d\n", issued);
continue;
}
io_uring_cqe_seen(ring, cqe);
issued--;
printf("left requets: %d\n", issued);
}
}
int main(int argc, char *argv[])
{
int ret;
struct io_uring ring;
ret = io_uring_queue_init(16, &ring, 0);
if (ret) {
fprintf(stderr, "ring setup failed: %d\n", ret);
return 1;
}
test_cq_overflow(&ring);
return 0;
}
To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
check_disk_change isn't for consumers of the block layer, so remove
the comment mentioning it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
flush_disk has only two callers, so open code it there. That also helps
clarifying the error message for the particular case, and allows to remove
setting bd_invalidated in check_disk_size_change, which will be cleared
again instantly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's safe to call kfree() with a NULL pointer, but it's also pointless.
Most of the time we don't have any data to free, and at millions of
requests per second, the redundant function call adds noticeable
overhead (about 1.3% of the runtime).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The "apoll" variable is freed and then used on the next line. We need
to move the free down a few lines.
Fixes: 0be0b0e33b ("io_uring: simplify io_async_task_func()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wire up ext4 to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for
this part, since ext4 sometimes uses ll_rw_block() on file data.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up f2fs to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
This patch includes a fix for a race during IPU by
Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200702015607.1215430-4-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add support for inline encryption to fs/crypto/. With "inline
encryption", the block layer handles the decryption/encryption as part
of the bio, instead of the filesystem doing the crypto itself via
Linux's crypto API. This model is needed in order to take advantage of
the inline encryption hardware present on most modern mobile SoCs.
To use inline encryption, the filesystem needs to be mounted with
'-o inlinecrypt'. Blk-crypto will then be used instead of the traditional
filesystem-layer crypto whenever possible to encrypt the contents
of any encrypted files in that filesystem. Fscrypt still provides the key
and IV to use, and the actual ciphertext on-disk is still the same;
therefore it's testable using the existing fscrypt ciphertext verification
tests.
Note that since blk-crypto has a fallback to Linux's crypto API, and
also supports all the encryption modes currently supported by fscrypt,
this feature is usable and testable even without actual inline
encryption hardware.
Per-filesystem changes will be needed to set encryption contexts when
submitting bios and to implement the 'inlinecrypt' mount option. This
patch just adds the common code.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-3-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
- don't panic kernel if f2fs_get_node_page() fails in
f2fs_recover_inline_data() or f2fs_recover_inline_xattr();
- return error number of f2fs_truncate_blocks() to
f2fs_recover_inline_data()'s caller;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Consolidate the two in-kernel read helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_read, and an access_ok check that doesn't make sense for
kernel buffers to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>