3a0cdd336d92c429b51a79bf4f64b17eafa0325d
67840 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
3c08f772ad |
io_uring: don't attempt IO reissue from the ring exit path
[ Upstream commit 7c977a58dc83366e488c217fd88b1469d242bee5 ] If we're exiting the ring, just let the IO fail with -EAGAIN as nobody will care anyway. It's not the right context to reissue from. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
1c20e9040f |
io_uring: fix inconsistent lock state
[ Upstream commit 9ae1f8dd372e0e4c020b345cf9e09f519265e981 ] WARNING: inconsistent lock state inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs->lock); <Interrupt> lock(&fs->lock); *** DEADLOCK *** 1 lock held by syz-executor217/8450: #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442 stack backtrace: CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> [...] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_req_clean_work fs/io_uring.c:1398 [inline] io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046 io_free_req fs/io_uring.c:2269 [inline] io_double_put_req fs/io_uring.c:2392 [inline] io_put_req+0xf9/0x570 fs/io_uring.c:2388 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline] __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline] run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline] sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629 RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline] RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199 spin_unlock_irq include/linux/spinlock.h:404 [inline] io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639 io_queue_link_head fs/io_uring.c:6650 [inline] io_submit_sqe fs/io_uring.c:6697 [inline] io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Don't free requests from under hrtimer context (softirq) as it may sleep or take spinlocks improperly (e.g. non-irq versions). Cc: stable@vger.kernel.org # 5.6+ Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
04eb2b2fa1 |
cifs: fix allocation size on newly created files
commit 65af8f0166f4d15e61c63db498ec7981acdd897f upstream. Applications that create and extend and write to a file do not expect to see 0 allocation size. When file is extended, set its allocation size to a plausible value until we have a chance to query the server for it. When the file is cached this will prevent showing an impossible number of allocated blocks (like 0). This fixes e.g. xfstests 614 which does 1) create a file and set its size to 64K 2) mmap write 64K to the file 3) stat -c %b for the file (to query the number of allocated blocks) It was failing because we returned 0 blocks. Even though we would return the correct cached file size, we returned an impossible allocation size. Signed-off-by: Steve French <stfrench@microsoft.com> CC: <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
6cae809549 |
io_uring: ensure that SQPOLL thread is started for exit
commit 3ebba796fa251d042be42b929a2d916ee5c34a49 upstream. If we create it in a disabled state because IORING_SETUP_R_DISABLED is set on ring creation, we need to ensure that we've kicked the thread if we're exiting before it's been explicitly disabled. Otherwise we can run into a deadlock where exit is waiting go park the SQPOLL thread, but the SQPOLL thread itself is waiting to get a signal to start. That results in the below trace of both tasks hung, waiting on each other: INFO: task syz-executor458:8401 blocked for more than 143 seconds. Not tainted 5.11.0-next-20210226-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor458 state:D stack:27536 pid: 8401 ppid: 8400 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread_park fs/io_uring.c:7115 [inline] io_sq_thread_park+0xd5/0x130 fs/io_uring.c:7103 io_uring_cancel_task_requests+0x24c/0xd90 fs/io_uring.c:8745 __io_uring_files_cancel+0x110/0x230 fs/io_uring.c:8840 io_uring_files_cancel include/linux/io_uring.h:47 [inline] do_exit+0x299/0x2a60 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 __do_sys_exit_group kernel/exit.c:933 [inline] __se_sys_exit_group kernel/exit.c:931 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x43e899 RSP: 002b:00007ffe89376d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00000000004af2f0 RCX: 000000000043e899 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffffffffffffffc0 R09: 0000000010000000 R10: 0000000000008011 R11: 0000000000000246 R12: 00000000004af2f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 INFO: task iou-sqp-8401:8402 can't die for more than 143 seconds. task:iou-sqp-8401 state:D stack:30272 pid: 8402 ppid: 8400 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4324 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5075 schedule+0xcf/0x270 kernel/sched/core.c:5154 schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x168/0x270 kernel/sched/completion.c:138 io_sq_thread+0x27d/0x1ae0 fs/io_uring.c:6717 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 INFO: task iou-sqp-8401:8402 blocked for more than 143 seconds. Reported-by: syzbot+fb5458330b4442f2090d@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
a7acb61428 |
pstore: Fix warning in pstore_kill_sb()
commit 9c7d83ae6ba67d6c6199cce24573983db3b56332 upstream. syzbot is hitting WARN_ON(pstore_sb != sb) at pstore_kill_sb() [1], for the assumption that pstore_sb != NULL is wrong because pstore_fill_super() will not assign pstore_sb = sb when new_inode() for d_make_root() returned NULL (due to memory allocation fault injection). Since mount_single() calls pstore_kill_sb() when pstore_fill_super() failed, pstore_kill_sb() needs to be aware of such failure path. [1] https://syzkaller.appspot.com/bug?id=6abacb8da5137cb47a416f2bef95719ed60508a0 Reported-by: syzbot <syzbot+d0cf0ad6513e9a1da5df@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210214031307.57903-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
982b899ba6 |
NFSD: fix dest to src mount in inter-server COPY
commit 614c9750173e412663728215152cc6d12bcb3425 upstream.
A cleanup of the inter SSC copy needs to call fput() of the source
file handle to make sure that file structure is freed as well as
drop the reference on the superblock to unmount the source server.
Fixes:
|
||
![]() |
12628e7779 |
nfsd: don't abort copies early
commit bfdd89f232aa2de5a4b3fc985cba894148b830a8 upstream.
The typical result of the backwards comparison here is that the source
server in a server-to-server copy will return BAD_STATEID within a few
seconds of the copy starting, instead of giving the copy a full lease
period, so the copy_file_range() call will end up unnecessarily
returning a short read.
Fixes:
|
||
![]() |
5ea0aa29ad |
nfsd: Don't keep looking up unhashed files in the nfsd file cache
commit d30881f573e565ebb5dbb50b31ed6106b5c81328 upstream.
If a file is unhashed, then we're going to reject it anyway and retry,
so make sure we skip it when we're doing the RCU lockless lookup.
This avoids a number of unnecessary nfserr_jukebox returns from
nfsd_file_acquire()
Fixes:
|
||
![]() |
64195f022a |
afs: Stop listxattr() from listing "afs.*" attributes
commit a7889c6320b9200e3fe415238f546db677310fa9 upstream.
afs_listxattr() lists all the available special afs xattrs (i.e. those in
the "afs.*" space), no matter what type of server we're dealing with. But
OpenAFS servers, for example, cannot deal with some of the extra-capable
attributes that AuriStor (YFS) servers provide. Unfortunately, the
presence of the afs.yfs.* attributes causes errors[1] for anything that
tries to read them if the server is of the wrong type.
Fix the problem by removing afs_listxattr() so that none of the special
xattrs are listed (AFS doesn't support xattrs). It does mean, however,
that getfattr won't list them, though they can still be accessed with
getxattr() and setxattr().
This can be tested with something like:
getfattr -d -m ".*" /afs/example.com/path/to/file
With this change, none of the afs.* attributes should be visible.
Changes:
ver #2:
- Hide all of the afs.* xattrs, not just the ACL ones.
Fixes:
|
||
![]() |
78ba4793b0 |
afs: Fix accessing YFS xattrs on a non-YFS server
commit 64fcbb6158ecc684d84c64424830a9c37c77c5b9 upstream.
If someone attempts to access YFS-related xattrs (e.g. afs.yfs.acl) on a
file on a non-YFS AFS server (such as OpenAFS), then the kernel will jump
to a NULL function pointer because the afs_fetch_acl_operation descriptor
doesn't point to a function for issuing an operation on a non-YFS
server[1].
Fix this by making afs_wait_for_operation() check that the issue_afs_rpc
method is set before jumping to it and setting -ENOTSUPP if not. This fix
also covers other potential operations that also only exist on YFS servers.
afs_xattr_get/set_yfs() then need to translate -ENOTSUPP to -ENODATA as the
former error is internal to the kernel.
The bug shows up as an oops like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[...]
Call Trace:
afs_wait_for_operation+0x83/0x1b0 [kafs]
afs_xattr_get_yfs+0xe6/0x270 [kafs]
__vfs_getxattr+0x59/0x80
vfs_getxattr+0x11c/0x140
getxattr+0x181/0x250
? __check_object_size+0x13f/0x150
? __fput+0x16d/0x250
__x64_sys_fgetxattr+0x64/0xb0
do_syscall_64+0x49/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb120a9defe
This was triggered with "cp -a" which attempts to copy xattrs, including
afs ones, but is easier to reproduce with getfattr, e.g.:
getfattr -d -m ".*" /afs/openafs.org/
Fixes:
|
||
![]() |
2c8d6a9474 |
btrfs: fix slab cache flags for free space tree bitmap
commit 34e49994d0dcdb2d31d4d2908d04f4e9ce57e4d7 upstream.
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
that's a debugging flag and not always enabled. Also the other slabs are
created with at least SLAB_MEM_SPREAD that we want as well to average
the memory placement cost.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes:
|
||
![]() |
38ffe9eaeb |
btrfs: fix race when cloning extent buffer during rewind of an old root
commit dbcc7d57bffc0c8cac9dac11bec548597d59a6a5 upstream.
While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1210!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
Call Trace:
btrfs_search_old_slot+0x265/0x10d0
? lock_acquired+0xbb/0x600
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x30/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? resolve_indirect_refs+0xfc0/0xfc0
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x160/0x210
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? poison_range+0x38/0x40
? unpoison_range+0x14/0x40
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? lock_downgrade+0x3d0/0x3d0
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? _raw_spin_unlock+0x22/0x30
? __kasan_check_write+0x14/0x20
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x205e/0x4040
? __might_sleep+0x71/0xe0
? btrfs_ioctl_get_supported_features+0x30/0x30
? getrusage+0x4b6/0x9c0
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __might_fault+0x64/0xd0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __task_pid_nr_ns+0xd3/0x250
? lock_acquire+0xc7/0x510
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd1976e2427
Code: 00 00 90 48 8b 05 (...)
RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
Modules linked in:
---[ end trace ec8931a1c36e57be ]---
(gdb) l *(__tree_mod_log_rewind+0x3b1)
0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
1205 * the modification. as we're going backwards, we do the
1206 * opposite of each operation here.
1207 */
1208 switch (tm->op) {
1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
1210 BUG_ON(tm->slot < n);
1211 fallthrough;
1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
1213 case MOD_LOG_KEY_REMOVE:
1214 btrfs_set_node_key(eb, &tm->key, tm->slot);
Here's what happens to hit that BUG_ON():
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
tree_mod_log_insert_root(eb X, eb Y, 1);
3) At tree_mod_log_insert_root() we create tree mod log elements for each
slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
with a ->logical pointing to ebX->start. These are placed in an array
named tm_list.
Lets assume there are N elements (N pointers in eb X);
4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
to the level of eb X and ->generation set to the generation of eb X;
5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
tm_list as argument. After that, tree_mod_log_free_eb() calls
__tree_mod_log_insert() for each member of tm_list in reverse order,
from highest slot in eb X, slot N - 1, to slot 0 of eb X;
6) __tree_mod_log_insert() sets the sequence number of each given tree mod
log operation - it increments fs_info->tree_mod_seq and sets
fs_info->tree_mod_seq as the sequence number of the given tree mod log
operation.
This means that for the tm_list created at tree_mod_log_insert_root(),
the element corresponding to slot 0 of eb X has the highest sequence
number (1 + N), and the element corresponding to the last slot has the
lowest sequence number (2);
7) Then, after inserting tm_list's elements into the tree mod log rbtree,
the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
sequence number, which is N + 2;
8) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
9) Later some other task T allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
10) The tree mod log user task calls btrfs_search_old_slot(), which calls
get_old_root(), and finally that calls __tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
11) First iteration of the while loop finds the tree mod log element with
sequence number N + 2, for the logical address of eb Y and of type
MOD_LOG_ROOT_REPLACE;
12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
of the loop, and set root_logical to point to tm->old_root.logical
which corresponds to the logical address of eb X;
13) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
the old slot N - 1 of eb X (eb X had N items in it before being freed);
14) We then break out of the while loop and return the tree mod log operation
of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
eb X, to get_old_root();
15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
and set "logical" to the logical address of eb X, which was the old
root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
which results in adding a tree mod log operation of type
MOD_LOG_KEY_ADD to the tree mod log - this is done at
ctree.c:insert_ptr() - but after adding the tree mod log operation
and before updating the number of items in eb X from 0 to 1...
17) The task at get_old_root() calls tree_mod_log_search() and gets the
tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
Then it enters the following if branch:
if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
(...)
} (...)
Calls read_tree_block() for eb X, which gets a reference on eb X but
does not lock it - task T has it locked.
Then it clones eb X while it has nritems set to 0 in its header, before
task T sets nritems to 1 in eb X's header. From hereupon we use the
clone of eb X which no other task has access to;
18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
mod log operation we just got from tree_mod_log_search() in the
previous step and the cloned version of eb X;
19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
of items set in eb X's clone, which is 0. Then we enter the while loop,
and in its first iteration we process the MOD_LOG_KEY_ADD operation,
which just decrements "n" from 0 to (u32)-1, since "n" is declared with
a type of u32. At the end of this iteration we call rb_next() to find the
next tree mod log operation for eb X, that gives us the mod log operation
of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
number of N + 1 (steps 3 to 6);
20) Then we go back to the top of the while loop and trigger the following
BUG_ON():
(...)
switch (tm->op) {
case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Fixes:
|
||
![]() |
78486cf1f3 |
zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()
commit 6980d29ce4da223ad7f0751c7f1d61d3c6b54ab3 upstream.
In zonefs_open_zone(), if opened zone count is larger than
.s_max_open_zones threshold, we missed to recover .i_wr_refcnt,
fix this.
Fixes:
|
||
![]() |
9c1c5e81a0 |
zonefs: prevent use of seq files as swap file
commit 1601ea068b886da1f8f8d4e18b9403e9e24adef6 upstream.
The sequential write constraint of sequential zone file prevent their
use as swap files. Only allow conventional zone files to be used as swap
files.
Fixes:
|
||
![]() |
dfbdbf0f35 |
zonefs: Fix O_APPEND async write handling
commit ebfd68cd0c1e81267c757332385cb96df30dacce upstream.
zonefs updates the size of a sequential zone file inode only on
completion of direct writes. When executing asynchronous append writes
(with a file open with O_APPEND or using RWF_APPEND), the use of the
current inode size in generic_write_checks() to set an iocb offset thus
leads to unaligned write if an application issues an append write
operation with another write already being executed.
Fix this problem by introducing zonefs_write_checks() as a modified
version of generic_write_checks() using the file inode wp_offset for an
append write iocb offset. Also introduce zonefs_write_check_limits() to
replace generic_write_check_limits() call. This zonefs special helper
makes sure that the maximum file limit used is the maximum size of the
file being accessed.
Since zonefs_write_checks() already truncates the iov_iter, the calls
to iov_iter_truncate() in zonefs_file_dio_write() and
zonefs_file_buffered_write() are removed.
Fixes:
|
||
![]() |
df8596f577 |
Revert "nfsd4: a client's own opens needn't prevent delegations"
commit 6ee65a773096ab3f39d9b00311ac983be5bdeb7c upstream.
This reverts commit
|
||
![]() |
894ecf0cb5 |
Revert "nfsd4: remove check_conflicting_opens warning"
commit 4aa5e002034f0701c3335379fd6c22d7f3338cce upstream. This reverts commit |
||
![]() |
d955f13ea2 |
fuse: fix live lock in fuse_iget()
commit 775c5033a0d164622d9d10dd0f0a5531639ed3ed upstream. Commit 5d069dbe8aaf ("fuse: fix bad inode") replaced make_bad_inode() in fuse_iget() with a private implementation fuse_make_bad(). The private implementation fails to remove the bad inode from inode cache, so the retry loop with iget5_locked() finds the same bad inode and marks it bad forever. kmsg snip: [ ] rcu: INFO: rcu_sched self-detected stall on CPU ... [ ] ? bit_wait_io+0x50/0x50 [ ] ? fuse_init_file_inode+0x70/0x70 [ ] ? find_inode.isra.32+0x60/0xb0 [ ] ? fuse_init_file_inode+0x70/0x70 [ ] ilookup5_nowait+0x65/0x90 [ ] ? fuse_init_file_inode+0x70/0x70 [ ] ilookup5.part.36+0x2e/0x80 [ ] ? fuse_init_file_inode+0x70/0x70 [ ] ? fuse_inode_eq+0x20/0x20 [ ] iget5_locked+0x21/0x80 [ ] ? fuse_inode_eq+0x20/0x20 [ ] fuse_iget+0x96/0x1b0 Fixes: 5d069dbe8aaf ("fuse: fix bad inode") Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
5ab9464a2a |
binfmt_misc: fix possible deadlock in bm_register_write
commit e7850f4d844e0acfac7e570af611d89deade3146 upstream.
There is a deadlock in bm_register_write:
First, in the begining of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).
Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.
open_exec will call a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again, but it is already locked, and the code will get stuck in a deadlock
To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register
backtrace of where the lock occurs (#5):
0 schedule () at ./arch/x86/include/asm/current.h:15
1 0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
2 0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
3 __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
4 down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
5 0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
6 open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7 path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8 0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9 0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF
", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
To solve the issue, the open_exec call is moved to before the write
lock is taken by bm_register_write
Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes:
|
||
![]() |
109720342e |
configfs: fix a use-after-free in __configfs_open_file
[ Upstream commit 14fbbc8297728e880070f7b077b3301a8c698ef9 ] Commit |
||
![]() |
caa86901c8 |
NFSv4.2: fix return value of _nfs4_get_security_label()
[ Upstream commit 53cb245454df5b13d7063162afd7a785aed6ebf2 ]
An xattr 'get' handler is expected to return the length of the value on
success, yet _nfs4_get_security_label() (and consequently also
nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns
just 0 on success.
Fix this by returning label.len instead, which contains the length of
the result.
Fixes:
|
||
![]() |
e181960ec5 |
NFS: Don't gratuitously clear the inode cache when lookup failed
[ Upstream commit 47397915ede0192235474b145ebcd81b37b03624 ]
The fact that the lookup revalidation failed, does not mean that the
inode contents have changed.
Fixes:
|
||
![]() |
dd756d05be |
NFS: Don't revalidate the directory permissions on a lookup failure
[ Upstream commit 82e7ca1334ab16e2e04fafded1cab9dfcdc11b40 ]
There should be no reason to expect the directory permissions to change
just because the directory contents changed or a negative lookup timed
out. So let's avoid doing a full call to nfs_mark_for_revalidate() in
that case.
Furthermore, if this is a negative dentry, and we haven't actually done
a new lookup, then we have no reason yet to believe the directory has
changed at all. So let's remove the gratuitous directory inode
invalidation altogether when called from
nfs_lookup_revalidate_negative().
Reported-by: Geert Jansen <gerardu@amazon.com>
Fixes:
|
||
![]() |
d308202c1b |
cifs: do not send close in compound create+close requests
commit 04ad69c342fc4de5bd23be9ef15ea7574fb1a87e upstream. In case of interrupted syscalls, prevent sending CLOSE commands for compound CREATE+CLOSE requests by introducing an CIFS_CP_CREATE_CLOSE_OP flag to indicate lower layers that it should not send a CLOSE command to the MIDs corresponding the compound CREATE+CLOSE request. A simple reproducer: #!/bin/bash mount //server/share /mnt -o username=foo,password=*** tc qdisc add dev eth0 root netem delay 450ms stat -f /mnt &>/dev/null & pid=$! sleep 0.01 kill $pid tc qdisc del dev eth0 root umount /mnt Before patch: ... 6 0.256893470 192.168.122.2 → 192.168.122.15 SMB2 402 Create Request File: ;GetInfo Request FS_INFO/FileFsFullSizeInformation;Close Request 7 0.257144491 192.168.122.15 → 192.168.122.2 SMB2 498 Create Response File: ;GetInfo Response;Close Response 9 0.260798209 192.168.122.2 → 192.168.122.15 SMB2 146 Close Request File: 10 0.260841089 192.168.122.15 → 192.168.122.2 SMB2 130 Close Response, Error: STATUS_FILE_CLOSED Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
d44c9780ed |
block: Try to handle busy underlying device on discard
commit 56887cffe946bb0a90c74429fa94d6110a73119d upstream. Commit |
||
![]() |
64578f9417 |
ext4: don't try to processed freed blocks until mballoc is initialized
[ Upstream commit 027f14f5357279655c3ebc6d14daff8368d4f53f ] If we try to make any changes via the journal between when the journal is initialized, but before the multi-block allocated is initialized, we will end up deferencing a NULL pointer when the journal commit callback function calls ext4_process_freed_data(). The proximate cause of this failure was commit 2d01ddc86606 ("ext4: save error info to sb through journal if available") since file system corruption problems detected before the call to ext4_mb_init() would result in a journal commit before we aborted the mount of the file system.... and we would then trigger the NULL pointer deref. Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
82d6c12899 |
udf: fix silent AED tagLocation corruption
[ Upstream commit 63c9e47a1642fc817654a1bc18a6ec4bbcc0f056 ] When extending a file, udf_do_extend_file() may enter following empty indirect extent. At the end of udf_do_extend_file() we revert prev_epos to point to the last written extent. However if we end up not adding any further extent in udf_do_extend_file(), the reverting points prev_epos into the header area of the AED and following updates of the extents (in udf_update_extents()) will corrupt the header. Make sure that we do not follow indirect extent if we are not going to add any more extents so that returning back to the last written extent works correctly. Link: https://lore.kernel.org/r/20210107234116.6190-2-magnani@ieee.org Signed-off-by: Steven J. Magnani <magnani@ieee.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
3370a84d78 |
cifs: fix credit accounting for extra channel
commit a249cc8bc2e2fed680047d326eb9a50756724198 upstream. With multichannel, operations like the queries from "ls -lR" can cause all credits to be used and errors to be returned since max_credits was not being set correctly on the secondary channels and thus the client was requesting 0 credits incorrectly in some cases (which can lead to not having enough credits to perform any operation on that channel). Signed-off-by: Aurelien Aptel <aaptel@suse.com> CC: <stable@vger.kernel.org> # v5.8+ Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
3d0bbd97eb |
cifs: return proper error code in statfs(2)
commit 14302ee3301b3a77b331cc14efb95bf7184c73cc upstream. In cifs_statfs(), if server->ops->queryfs is not NULL, then we should use its return value rather than always returning 0. Instead, use rc variable as it is properly set to 0 in case there is no server->ops->queryfs. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
36e1efcdc5 |
mount: fix mounting of detached mounts onto targets that reside on shared mounts
commit ee2e3f50629f17b0752b55b2566c15ce8dafb557 upstream.
Creating a series of detached mounts, attaching them to the filesystem,
and unmounting them can be used to trigger an integer overflow in
ns->mounts causing the kernel to block any new mounts in count_mounts()
and returning ENOSPC because it falsely assumes that the maximum number
of mounts in the mount namespace has been reached, i.e. it thinks it
can't fit the new mounts into the mount namespace anymore.
Depending on the number of mounts in your system, this can be reproduced
on any kernel that supportes open_tree() and move_mount() by compiling
and running the following program:
/* SPDX-License-Identifier: LGPL-2.1+ */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
/* open_tree() */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef __NR_open_tree
#if defined __alpha__
#define __NR_open_tree 538
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_open_tree 4428
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_open_tree 6428
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_open_tree 5428
#endif
#elif defined __ia64__
#define __NR_open_tree (428 + 1024)
#else
#define __NR_open_tree 428
#endif
#endif
/* move_mount() */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#endif
#ifndef __NR_move_mount
#if defined __alpha__
#define __NR_move_mount 539
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_move_mount 4429
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_move_mount 6429
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_move_mount 5429
#endif
#elif defined __ia64__
#define __NR_move_mount (428 + 1024)
#else
#define __NR_move_mount 429
#endif
#endif
static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
{
return syscall(__NR_open_tree, dfd, filename, flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
}
static bool is_shared_mountpoint(const char *path)
{
bool shared = false;
FILE *f = NULL;
char *line = NULL;
int i;
size_t len = 0;
f = fopen("/proc/self/mountinfo", "re");
if (!f)
return 0;
while (getline(&line, &len, f) > 0) {
char *slider1, *slider2;
for (slider1 = line, i = 0; slider1 && i < 4; i++)
slider1 = strchr(slider1 + 1, ' ');
if (!slider1)
continue;
slider2 = strchr(slider1 + 1, ' ');
if (!slider2)
continue;
*slider2 = '\0';
if (strcmp(slider1 + 1, path) == 0) {
/* This is the path. Is it shared? */
slider1 = strchr(slider2 + 1, ' ');
if (slider1 && strstr(slider1, "shared:")) {
shared = true;
break;
}
}
}
fclose(f);
free(line);
return shared;
}
static void usage(void)
{
const char *text = "mount-new [--recursive] <base-dir>\n";
fprintf(stderr, "%s", text);
_exit(EXIT_SUCCESS);
}
#define exit_usage(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
usage(); \
})
#define exit_log(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
exit(EXIT_FAILURE); \
})
static const struct option longopts[] = {
{"help", no_argument, 0, 'a'},
{ NULL, no_argument, 0, 0 },
};
int main(int argc, char *argv[])
{
int exit_code = EXIT_SUCCESS, index = 0;
int dfd, fd_tree, new_argc, ret;
char *base_dir;
char *const *new_argv;
char target[PATH_MAX];
while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
switch (ret) {
case 'a':
/* fallthrough */
default:
usage();
}
}
new_argv = &argv[optind];
new_argc = argc - optind;
if (new_argc < 1)
exit_usage("Missing base directory\n");
base_dir = new_argv[0];
if (*base_dir != '/')
exit_log("Please specify an absolute path");
/* Ensure that target is a shared mountpoint. */
if (!is_shared_mountpoint(base_dir))
exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);
dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
if (dfd < 0)
exit_log("%m - Failed to open base directory \"%s\"", base_dir);
ret = mkdirat(dfd, "detached-move-mount", 0755);
if (ret < 0)
exit_log("%m - Failed to create required temporary directories");
ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
if (ret < 0 || (size_t)ret >= sizeof(target))
exit_log("%m - Failed to assemble target path");
/*
* Having a mount table with 10000 mounts is already quite excessive
* and shoult account even for weird test systems.
*/
for (size_t i = 0; i < 10000; i++) {
fd_tree = sys_open_tree(dfd, "detached-move-mount",
OPEN_TREE_CLONE |
OPEN_TREE_CLOEXEC |
AT_EMPTY_PATH);
if (fd_tree < 0) {
fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
if (ret < 0) {
if (errno == ENOSPC)
fprintf(stderr, "%m - Buggy mount counting");
else
fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
close(fd_tree);
ret = umount2(target, MNT_DETACH);
if (ret < 0) {
fprintf(stderr, "%m - Failed to unmount %s", target);
exit_code = EXIT_FAILURE;
break;
}
}
(void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
close(dfd);
exit(exit_code);
}
and wait for the kernel to refuse any new mounts by returning ENOSPC.
How many iterations are needed depends on the number of mounts in your
system. Assuming you have something like 50 mounts on a standard system
it should be almost instantaneous.
The root cause of this is that detached mounts aren't handled correctly
when source and target mount are identical and reside on a shared mount
causing a broken mount tree where the detached source itself is
propagated which propagation prevents for regular bind-mounts and new
mounts. This ultimately leads to a miscalculation of the number of
mounts in the mount namespace.
Detached mounts created via
open_tree(fd, path, OPEN_TREE_CLONE)
are essentially like an unattached new mount, or an unattached
bind-mount. They can then later on be attached to the filesystem via
move_mount() which calls into attach_recursive_mount(). Part of
attaching it to the filesystem is making sure that mounts get correctly
propagated in case the destination mountpoint is MS_SHARED, i.e. is a
shared mountpoint. This is done by calling into propagate_mnt() which
walks the list of peers calling propagate_one() on each mount in this
list making sure it receives the propagation event.
The propagate_one() functions thereby skips both new mounts and bind
mounts to not propagate them "into themselves". Both are identified by
checking whether the mount is already attached to any mount namespace in
mnt->mnt_ns. The is what the IS_MNT_NEW() helper is responsible for.
However, detached mounts have an anonymous mount namespace attached to
them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't
realize they need to be skipped causing the mount to propagate "into
itself" breaking the mount table and causing a disconnect between the
number of mounts recorded as being beneath or reachable from the target
mountpoint and the number of mounts actually recorded/counted in
ns->mounts ultimately causing an overflow which in turn prevents any new
mounts via the ENOSPC issue.
So teach propagation to handle detached mounts by making it aware of
them. I've been tracking this issue down for the last couple of days and
then verifying that the fix is correct by
unmounting everything in my current mount table leaving only /proc and
/sys mounted and running the reproducer above overnight verifying the
number of mounts counted in ns->mounts. With this fix the counts are
correct and the ENOSPC issue can't be reproduced.
This change will only have an effect on mounts created with the new
mount API since detached mounts cannot be created with the old mount API
so regressions are extremely unlikely.
Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com
Fixes:
|
||
![]() |
bf6dd437c3 |
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
commit 4d14c5cde5c268a2bc26addecf09489cb953ef64 upstream Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction") * |
||
![]() |
cf9317ceb5 |
btrfs: export and rename qgroup_reserve_meta
commit 80e9baed722c853056e0c5374f51524593cb1031 upstream Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
a2501d8766 |
io_uring: ignore double poll add on the same waitqueue head
commit 1c3b3e6527e57156bf4082f11c2151957560fe6a upstream.
syzbot reports a deadlock, attempting to lock the same spinlock twice:
============================================
WARNING: possible recursive locking detected
5.11.0-syzkaller #0 Not tainted
--------------------------------------------
swapper/1/0 is trying to acquire lock:
ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960
but task is already holding lock:
ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&runtime->sleep);
lock(&runtime->sleep);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by swapper/1/0:
#0: ffff888147474908 (&group->lock){..-.}-{2:2}, at: _snd_pcm_stream_lock_irqsave+0x9f/0xd0 sound/core/pcm_native.c:170
#1: ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0xfa/0x151 lib/dump_stack.c:120
print_deadlock_bug kernel/locking/lockdep.c:2829 [inline]
check_deadlock kernel/locking/lockdep.c:2872 [inline]
validate_chain kernel/locking/lockdep.c:3661 [inline]
__lock_acquire.cold+0x14c/0x3b4 kernel/locking/lockdep.c:4900
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960
__wake_up_common+0x147/0x650 kernel/sched/wait.c:108
__wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138
snd_pcm_update_state+0x46a/0x540 sound/core/pcm_lib.c:203
snd_pcm_update_hw_ptr0+0xa75/0x1a50 sound/core/pcm_lib.c:464
snd_pcm_period_elapsed+0x160/0x250 sound/core/pcm_lib.c:1805
dummy_hrtimer_callback+0x94/0x1b0 sound/drivers/dummy.c:378
__run_hrtimer kernel/time/hrtimer.c:1519 [inline]
__hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1600
__do_softirq+0x29b/0x9f6 kernel/softirq.c:345
invoke_softirq kernel/softirq.c:221 [inline]
__irq_exit_rcu kernel/softirq.c:422 [inline]
irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1100
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline]
RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:70 [inline]
RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:137 [inline]
RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline]
RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516
Code: dd 38 6e f8 84 db 75 ac e8 54 32 6e f8 e8 0f 1c 74 f8 e9 0c 00 00 00 e8 45 32 6e f8 0f 00 2d 4e 4a c5 00 e8 39 32 6e f8 fb f4 <9c> 5b 81 e3 00 02 00 00 fa 31 ff 48 89 de e8 14 3a 6e f8 48 85 db
RSP: 0018:ffffc90000d47d18 EFLAGS: 00000293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8880115c3780 RSI: ffffffff89052537 RDI: 0000000000000000
RBP: ffff888141127064 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffff81794168 R11: 0000000000000000 R12: 0000000000000001
R13: ffff888141127000 R14: ffff888141127064 R15: ffff888143331804
acpi_idle_enter+0x361/0x500 drivers/acpi/processor_idle.c:647
cpuidle_enter_state+0x1b1/0xc80 drivers/cpuidle/cpuidle.c:237
cpuidle_enter+0x4a/0xa0 drivers/cpuidle/cpuidle.c:351
call_cpuidle kernel/sched/idle.c:158 [inline]
cpuidle_idle_call kernel/sched/idle.c:239 [inline]
do_idle+0x3e1/0x590 kernel/sched/idle.c:300
cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:397
start_secondary+0x274/0x350 arch/x86/kernel/smpboot.c:272
secondary_startup_64_no_verify+0xb0/0xbb
which is due to the driver doing poll_wait() twice on the same
wait_queue_head. That is perfectly valid, but from checking the rest
of the kernel tree, it's the only driver that does this.
We can handle this just fine, we just need to ignore the second addition
as we'll get woken just fine on the first one.
Cc: stable@vger.kernel.org # 5.8+
Fixes:
|
||
![]() |
ae971992e9 |
btrfs: fix warning when creating a directory with smack enabled
commit fd57a98d6f0c98fa295813087f13afb26c224e73 upstream. When we have smack enabled, during the creation of a directory smack may attempt to add a "smack transmute" xattr on the inode, which results in the following warning and trace: WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0 Modules linked in: nft_objref nf_conntrack_netbios_ns (...) CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:start_transaction+0x489/0x4f0 Code: e9 be fc ff ff (...) RSP: 0018:ffffc90001887d10 EFLAGS: 00010202 RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003 RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000 RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004 R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2 R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000 FS: 00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? slab_free_freelist_hook+0xea/0x1b0 ? trace_hardirqs_on+0x1c/0xe0 btrfs_setxattr_trans+0x3c/0xf0 __vfs_setxattr+0x63/0x80 smack_d_instantiate+0x2d3/0x360 security_d_instantiate+0x29/0x40 d_instantiate_new+0x38/0x90 btrfs_mkdir+0x1cf/0x1e0 vfs_mkdir+0x14f/0x200 do_mkdirat+0x6d/0x110 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f673196ae6b Code: 8b 05 11 (...) RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053 RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000 R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0 irq event stamp: 11029 hardirqs last enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670 hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670 softirqs last enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 This happens because at btrfs_mkdir() we call d_instantiate_new() while holding a transaction handle, which results in the following call chain: btrfs_mkdir() trans = btrfs_start_transaction(root, 5); d_instantiate_new() smack_d_instantiate() __vfs_setxattr() btrfs_setxattr_trans() btrfs_start_transaction() start_transaction() WARN_ON() --> a tansaction start has TRANS_EXTWRITERS set in its type h->orig_rsv = h->block_rsv h->block_rsv = NULL btrfs_end_transaction(trans) Besides the warning triggered at start_transaction, we set the handle's block_rsv to NULL which may cause some surprises later on. So fix this by making btrfs_setxattr_trans() not start a transaction when we already have a handle on one, stored in current->journal_info, and use that handle. We are good to use the handle because at btrfs_mkdir() we did reserve space for the xattr and the inode item. Reported-by: Casey Schaufler <casey@schaufler-ca.com> CC: stable@vger.kernel.org # 5.4+ Acked-by: Casey Schaufler <casey@schaufler-ca.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
e6ba61aaff |
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
commit 4f6a49de64fd1b1dba5229c02047376da7cf24fd upstream.
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes:
|
||
![]() |
37ffce9668 |
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
commit 0f9c03d824f6f522d3bc43629635c9765546ebc5 upstream. Following commit |
||
![]() |
a64ad80223 |
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
commit 5011c5a663b9c6d6aff3d394f11049b371199627 upstream.
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct. Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.
Fixes:
|
||
![]() |
e82407d249 |
btrfs: fix race between extent freeing/allocation when using bitmaps
commit 3c17916510428dbccdf657de050c34e208347089 upstream. During allocation the allocator will try to allocate an extent using cluster policy. Once the current cluster is exhausted it will remove the entry under btrfs_free_cluster::lock and subsequently acquire btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem because there exists a race condition between removing the entry under one lock and doing the necessary accounting holding a different lock since extent freeing only uses the 2nd lock. This can result in the following situation: T1: T2: btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock> if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) { rb_erase(entry) spin_unlock(&cluster->lock); (total_bitmaps is still 4) spin_lock(&cluster->lock); <doesn't find entry in cluster->root> spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds <blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap> recalculate_thresholds <crashes, due to total_bitmaps becoming 5 and triggering an ASSERT> To fix this ensure that once depleted, the cluster entry is deleted when both cluster lock and tree locks are held in the allocator (T1), this ensures that even if there is a race with a concurrent insert_into_bitmap call it will correctly find the entry in the cluster and add the new space to it. CC: <stable@vger.kernel.org> # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
1559d94fec |
btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
commit 3660d0bcdb82807d434da9d2e57d88b37331182d upstream. When using the NO_HOLES feature, if we clone a file range that spans only a hole into a range that is at or beyond the current i_size of the destination file, we end up not setting the full sync runtime flag on the inode. As a result, if we then fsync the destination file and have a power failure, after log replay we can end up exposing stale data instead of having a hole for that range. The conditions for this to happen are the following: 1) We have a file with a size of, for example, 1280K; 2) There is a written (non-prealloc) extent for the file range from 1024K to 1280K with a length of 256K; 3) This particular file extent layout is durably persisted, so that the existing superblock persisted on disk points to a subvolume root where the file has that exact file extent layout and state; 4) The file is truncated to a smaller size, to an offset lower than the start offset of its last extent, for example to 800K. The truncate sets the full sync runtime flag on the inode; 6) Fsync the file to log it and clear the full sync runtime flag; 7) Clone a region that covers only a hole (implicit hole due to NO_HOLES) into the file with a destination offset that starts at or beyond the 256K file extent item we had - for example to offset 1024K; 8) Since the clone operation does not find extents in the source range, we end up in the if branch at the bottom of btrfs_clone() where we punch a hole for the file range starting at offset 1024K by calling btrfs_replace_file_extents(). There we end up not setting the full sync flag on the inode, because we don't know we are being called in a clone context (and not fallocate's punch hole operation), and neither do we create an extent map to represent a hole because the requested range is beyond eof; 9) A further fsync to the file will be a fast fsync, since the clone operation did not set the full sync flag, and therefore it relies on modified extent maps to correctly log the file layout. But since it does not find any extent map marking the range from 1024K (the previous eof) to the new eof, it does not log a file extent item for that range representing the hole; 10) After a power failure no hole for the range starting at 1024K is punched and we end up exposing stale data from the old 256K extent. Turning this into exact steps: $ mkfs.btrfs -f -O no-holes /dev/sdi $ mount /dev/sdi /mnt # Create our test file with 3 extents of 256K and a 256K hole at offset # 256K. The file has a size of 1280K. $ xfs_io -f -s \ -c "pwrite -S 0xab -b 256K 0 256K" \ -c "pwrite -S 0xcd -b 256K 512K 256K" \ -c "pwrite -S 0xef -b 256K 768K 256K" \ -c "pwrite -S 0x73 -b 256K 1024K 256K" \ /mnt/sdi/foobar # Make sure it's durably persisted. We want the last committed super # block to point to this particular file extent layout. sync # Now truncate our file to a smaller size, falling within a position of # the second extent. This sets the full sync runtime flag on the inode. # Then fsync the file to log it and clear the full sync flag from the # inode. The third extent is no longer part of the file and therefore # it is not logged. $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar # Now do a clone operation that only clones the hole and sets back the # file size to match the size it had before the truncate operation # (1280K). $ xfs_io \ -c "reflink /mnt/foobar 256K 1024K 256K" \ -c "fsync" \ /mnt/foobar # File data before power failure: $ od -A d -t x1 /mnt/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef * 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 1310720 <power fail> # Mount the fs again to replay the log tree. $ mount /dev/sdi /mnt # File data after power failure: $ od -A d -t x1 /mnt/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef * 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 * 1310720 The range from 1024K to 1280K should correspond to a hole but instead it points to stale data, to the 256K extent that should not exist after the truncate operation. The issue does not exists when not using NO_HOLES, because for that case we use file extent items to represent holes, these are found and copied during the loop that iterates over extents at btrfs_clone(), and that causes btrfs_replace_file_extents() to be called with a non-NULL extent_info argument and therefore set the full sync runtime flag on the inode. So fix this by making the code that deals with a trailing hole during cloning, at btrfs_clone(), to set the full sync flag on the inode, if the range starts at or beyond the current i_size. A test case for fstests will follow soon. Backporting notes: for kernel 5.4 the change goes to ioctl.c into btrfs_clone before the last call to btrfs_punch_hole_range. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
6fc9e5866c |
btrfs: fix race between swap file activation and snapshot creation
commit dd0734f2a866f9d619d4abf97c3d71bcdee40ea9 upstream.
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes:
|
||
![]() |
501fdd1cef |
btrfs: fix race between writes to swap files and scrub
commit 195a49eaf655eb914896c92cecd96bc863c9feb3 upstream.
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes:
|
||
![]() |
b2a4876132 |
btrfs: fix raid6 qstripe kmap
commit d70cef0d46729808dc53f145372c02b145c92604 upstream. When a qstripe is required an extra page is allocated and mapped. There were 3 problems: 1) There is no corresponding call of kunmap() for the qstripe page. 2) There is no reason to map the qstripe page more than once if the number of bits set in rbio->dbitmap is greater than one. 3) There is no reason to map the parity page and unmap it each time through the loop. The page memory can continue to be reused with a single mapping on each iteration by raid6_call.gen_syndrome() without remapping. So map the page for the duration of the loop. Similarly, improve the algorithm by mapping the parity page just 1 time. Fixes: |
||
![]() |
a01415e5e8 |
btrfs: avoid double put of block group when emptying cluster
commit 95c85fba1f64c3249c67f0078a29f8a125078189 upstream.
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes:
|
||
![]() |
5aa2717b6b |
btrfs: fix error handling in commit_fs_roots
[ Upstream commit 4f4317c13a40194940acf4a71670179c4faca2b5 ] While doing error injection I would sometimes get a corrupt file system. This is because I was injecting errors at btrfs_search_slot, but would only do it one time per stack. This uncovered a problem in commit_fs_roots, where if we get an error we would just break. However we're in a nested loop, the first loop being a loop to find all the dirty fs roots, and then subsequent root updates would succeed clearing the error value. This isn't likely to happen in real scenarios, however we could potentially get a random ENOMEM once and then not again, and we'd end up with a corrupted file system. Fix this by moving the error checking around a bit to the main loop, as this is the only place where something will fail, and return the error as soon as it occurs. With this patch my reproducer no longer corrupts the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
0d2d6857db |
f2fs: fix to set/clear I_LINKABLE under i_lock
[ Upstream commit 46085f37fc9e12d5c3539fb768b5ad7951e72acf ] fsstress + fault injection test case reports a warning message as below: WARNING: CPU: 13 PID: 6226 at fs/inode.c:361 inc_nlink+0x32/0x40 Call Trace: f2fs_init_inode_metadata+0x25c/0x4a0 [f2fs] f2fs_add_inline_entry+0x153/0x3b0 [f2fs] f2fs_add_dentry+0x75/0x80 [f2fs] f2fs_do_add_link+0x108/0x160 [f2fs] f2fs_rename2+0x6ab/0x14f0 [f2fs] vfs_rename+0x70c/0x940 do_renameat2+0x4d8/0x4f0 __x64_sys_renameat2+0x4b/0x60 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Following race case can cause this: Thread A Kworker - f2fs_rename - f2fs_create_whiteout - __f2fs_tmpfile - f2fs_i_links_write - f2fs_mark_inode_dirty_sync - mark_inode_dirty_sync - writeback_single_inode - __writeback_single_inode - spin_lock(&inode->i_lock) - inode->i_state |= I_LINKABLE - inode->i_state &= ~dirty - spin_unlock(&inode->i_lock) - f2fs_add_link - f2fs_do_add_link - f2fs_add_dentry - f2fs_add_inline_entry - f2fs_init_inode_metadata - f2fs_i_links_write - inc_nlink - WARN_ON(!(inode->i_state & I_LINKABLE)) Fix to add i_lock to avoid i_state update race condition. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
c86df2b84b |
f2fs: handle unallocated section and zone on pinned/atgc
[ Upstream commit 632faca72938f9f63049e48a8c438913828ac7a9 ] If we have large section/zone, unallocated segment makes them corrupted. E.g., - Pinned file: -1 119304647 119304647 - ATGC data: -1 119304647 119304647 Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
01fd84a436 |
fs: make unlazy_walk() error handling consistent
[ Upstream commit e36cffed20a324e116f329a94061ae30dd26fb51 ] Most callers check for non-zero return, and assume it's -ECHILD (which it always will be). One caller uses the actual error return. Clean this up and make it fully consistent, by having unlazy_walk() return a bool instead. Rename it to try_to_unlazy() and return true on success, and failure on error. That's easier to read. No functional changes in this patch. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
c57ba68e73 |
xfs: Fix assert failure in xfs_setattr_size()
commit 88a9e03beef22cc5fabea344f54b9a0dfe63de08 upstream. An assert failure is triggered by syzkaller test due to ATTR_KILL_PRIV is not cleared before xfs_setattr_size. As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size, just remove it from the assert. Signed-off-by: Yumei Huang <yuhuang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
5e0068a4fb |
erofs: fix shift-out-of-bounds of blkszbits
commit bde545295b710bdd13a0fcd4b9fddd2383eeeb3a upstream. syzbot generated a crafted bitszbits which can be shifted out-of-bounds[1]. So directly print unsupported blkszbits instead of blksize. [1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
40f6090d6e |
JFS: more checks for invalid superblock
commit 3bef198f1b17d1bb89260bad947ef084c0a2d1a6 upstream. syzbot is feeding invalid superblock data to JFS for mount testing. JFS does not check several of the fields -- just assumes that they are good since the JFS_MAGIC and version fields are good. In this case (syzbot reproducer), we have s_l2bsize == 0xda0c, pad == 0xf045, and s_state == 0x50, all of which are invalid IMO. Having s_l2bsize == 0xda0c causes this UBSAN warning: UBSAN: shift-out-of-bounds in fs/jfs/jfs_mount.c:373:25 shift exponent -9716 is negative s_l2bsize can be tested for correctness. pad can be tested for non-0 and punted. s_state can be tested for its valid values and punted. Do those 3 tests and if any of them fails, report the superblock as invalid/corrupt and let fsck handle it. With this patch, chkSuper() says this when JFS_DEBUG is enabled: jfs_mount: Mount Failure: superblock is corrupt! Mount JFS Failure: -22 jfs_mount failed w/return code = -22 The obvious problem with this method is that next week there could be another syzbot test that uses different fields for invalid values, this making this like a game of whack-a-mole. syzkaller link: https://syzkaller.appspot.com/bug?extid=36315852ece4132ec193 Reported-by: syzbot+36315852ece4132ec193@syzkaller.appspotmail.com Reported-by: kernel test robot <lkp@intel.com> # v2 Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |