841b9f6f681dbb51e1ee5323b92b3445e51e40bb
1922 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
a1ccc4f441 |
btrfs: do not pin logs too early during renames
[ Upstream commit bd54f381a12ac695593271a663d36d14220215b2 ] During renames we pin the logs of the roots a bit too early, before the calls to btrfs_insert_inode_ref(). We can pin the logs after those calls, since those will not change anything in a log tree. In a scenario where we have multiple and diverse filesystem operations running in parallel, those calls can take a significant amount of time, due to lock contention on extent buffers, and delay log commits from other tasks for longer than necessary. So just pin logs after calls to btrfs_insert_inode_ref() and right before the first operation that can update a log tree. The following script that uses dbench was used for testing: $ cat dbench-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 120 16 umount $MNT The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device and using a non-debug kernel config (Debian's default config). The results compare a branch without this patch and without the previous patch in the series, that has the subject: "btrfs: eliminate some false positives when checking if inode was logged" Versus the same branch with these two patches applied. dbench with 8 clients, results before: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4391359 0.009 249.745 Close 3225882 0.001 3.243 Rename 185953 0.065 240.643 Unlink 886669 0.049 249.906 Deltree 112 2.455 217.433 Mkdir 56 0.002 0.004 Qpathinfo 3980281 0.004 3.109 Qfileinfo 697579 0.001 0.187 Qfsinfo 729780 0.002 2.424 Sfileinfo 357764 0.004 1.415 Find 1538861 0.016 4.863 WriteX |
||
![]() |
e1c50b0c62 |
btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
[ Upstream commit 3324d0547861b16cf436d54abba7052e0c8aa9de ] Sweet Tea spotted a race between subvolume deletion and snapshotting that can result in the root item for the snapshot having the BTRFS_ROOT_SUBVOL_DEAD flag set. The race is: Thread 1 | Thread 2 ----------------------------------------------|---------- btrfs_delete_subvolume | btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)| |btrfs_mksubvol | down_read(subvol_sem) | create_snapshot | ... | create_pending_snapshot | copy root item from source down_write(subvol_sem) | This flag is only checked in send and swap activate, which this would cause to fail mysteriously. create_snapshot() now checks the root refs to reject a deleted subvolume, so we can fix this by locking subvol_sem earlier so that the BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically. CC: stable@vger.kernel.org # 4.14+ Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
d073f4608b |
btrfs: remove err variable from btrfs_delete_subvolume
[ Upstream commit ee0d904fd9c5662c58a737c77384f8959fdc8d12 ] Use only a single 'ret' to control whether we should abort the transaction or not. That's fine, because if we abort a transaction then btrfs_end_transaction will return the same value as passed to btrfs_abort_transaction. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 3324d0547861 ("btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
a842fb6038 |
btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid
[ Upstream commit abadc1fcd72e887a8f875dabe4a07aa8c28ac8af ] The former is going away as part of the inode map removal so switch callers to btrfs_find_free_objectid. No functional changes since with INODE_MAP disabled (default) find_free_objectid was called anyway. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 0004ff15ea26 ("btrfs: fix space cache inconsistency after error loading it from disk") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
c1ea39a77c |
btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
commit a4527e1853f8ff6e0b7c2dadad6268bd38427a31 upstream. When doing a direct IO read or write, we always return -ENOTBLK when we find a compressed extent (or an inline extent) so that we fallback to buffered IO. This however is not ideal in case we are in a NOWAIT context (io_uring for example), because buffered IO can block and we currently have no support for NOWAIT semantics for buffered IO, so if we need to fallback to buffered IO we should first signal the caller that we may need to block by returning -EAGAIN instead. This behaviour can also result in short reads being returned to user space, which although it's not incorrect and user space should be able to deal with partial reads, it's somewhat surprising and even some popular applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't deal with short reads properly (or at all). The short read case happens when we try to read from a range that has a non-compressed and non-inline extent followed by a compressed extent. After having read the first extent, when we find the compressed extent we return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to treat the request as a short read, returning 0 (success) and waiting for previously submitted bios to complete (this happens at fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at btrfs_file_read_iter(), we call filemap_read() to use buffered IO to read the remaining data, and pass it the number of bytes we were able to read with direct IO. Than at filemap_read() if we get a page fault error when accessing the read buffer, we return a partial read instead of an -EFAULT error, because the number of bytes previously read is greater than zero. So fix this by returning -EAGAIN for NOWAIT direct IO when we find a compressed or an inline extent. Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/ Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582 Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
76e086ce7b |
btrfs: do not warn for free space inode in cow_file_range
[ Upstream commit a7d16d9a07bbcb7dcd5214a1bea75c808830bc0d ] This is a long time leftover from when I originally added the free space inode, the point was to catch cases where we weren't honoring the NOCOW flag. However there exists a race with relocation, if we allocate our free space inode in a block group that is about to be relocated, we could trigger the COW path before the relocation has the opportunity to find the extents and delete the free space cache. In production where we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around 5k times in a 2 week period, so not super common but enough that it's at the top of our metrics. We're properly handling the error here, and with us phasing out v1 space cache anyway just drop the WARN_ON_ONCE. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
a044bca8ef |
btrfs: prevent subvol with swapfile from being deleted
commit 60021bd754c6ca0addc6817994f20290a321d8d6 upstream. A subvolume with an active swapfile must not be deleted otherwise it would not be possible to deactivate it. After the subvolume is deleted, we cannot swapoff the swapfile in this deleted subvolume because the path is unreachable. The swapfile is still active and holding references, the filesystem cannot be unmounted. The test looks like this: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt btrfs sub create $mnt/subvol touch $mnt/subvol/swapfile chmod 600 $mnt/subvol/swapfile chattr +C $mnt/subvol/swapfile dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096 mkswap $mnt/subvol/swapfile swapon $mnt/subvol/swapfile btrfs sub delete $mnt/subvol swapoff $mnt/subvol/swapfile # failed: No such file or directory swapoff --all unmount $mnt # target is busy. To prevent above issue, we simply check that whether the subvolume contains any active swapfile, and stop the deleting process. This behavior is like snapshot ioctl dealing with a swapfile. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Kaiwen Hu <kevinhu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
f8c3ec2e21 |
btrfs: respect the max size in the header when activating swap file
commit c2f822635df873c510bda6fb7fd1b10b7c31be2d upstream.
If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.
Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.
So fix this by not mapping file ranges beyond the max size defined in the
swap header.
This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac18ae ("mm/swap: consider max pages in
iomap_swapfile_add_extent").
Fixes:
|
||
![]() |
0901af53da |
btrfs: wake up async_delalloc_pages waiters after submit
commit ac98141d140444fe93e26471d3074c603b70e2ca upstream. We use the async_delalloc_pages mechanism to make sure that we've completed our async work before trying to continue our delalloc flushing. The reason for this is we need to see any ordered extents that were created by our delalloc flushing. However we're waking up before we do the submit work, which is before we create the ordered extents. This is a pretty wide race window where we could potentially think there are no ordered extents and thus exit shrink_delalloc prematurely. Fix this by waking us up after we've done the work to create ordered extents. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
d845f89d59 |
btrfs: fix race between marking inode needs to be logged and log syncing
commit bc0939fcfab0d7efb2ed12896b1af3d819954a14 upstream. We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
3134292a8e |
Revert "btrfs: compression: don't try to compress if we don't have enough pages"
commit 4e9655763b82a91e4c341835bb504a2b1590f984 upstream. This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763. [BUG] It's no longer possible to create compressed inline extent after commit f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
67fece6289 |
btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
[ Upstream commit 3f79f6f6247c83f448c8026c3ee16d4636ef8d4f ]
Cross-rename lacks a check when that would prevent exchanging a
directory and subvolume from different parent subvolume. This causes
data inconsistencies and is caught before commit by tree-checker,
turning the filesystem to read-only.
Calling the renameat2 with RENAME_EXCHANGE flags like
renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))
on two paths:
namesrc = dir1/subvol1/dir2
namedest = subvol2/subvol3
will cause key order problem with following write time tree-checker
report:
[1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
[1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
[1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
[1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160
[1194842.338793] inode generation 3 size 16 mode 40755
[1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12
[1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
[1194842.338817] dir oid 258 type 2
[1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
[1194842.338830] dir oid 257 type 2
[1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34
[1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34
[1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160
[1194842.338863] inode generation 6 size 8 mode 40755
[1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14
[1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
[1194842.338883] dir oid 256 type 2
[1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34
[1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14
[1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
[1194842.339245] ------------[ cut here ]------------
[1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
[1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
[1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
[1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
[1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
[1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
[1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
[1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
[1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
[1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
[1194842.512103] Call Trace:
[1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs]
[1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
[1194842.631837] run_one_async_start+0x18/0x30 [btrfs]
[1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs]
[1194842.647482] process_one_work+0x262/0x5e0
[1194842.647520] worker_thread+0x4c/0x320
[1194842.655935] ? process_one_work+0x5e0/0x5e0
[1194842.655946] kthread+0x135/0x160
[1194842.655953] ? set_kthread_struct+0x40/0x40
[1194842.655965] ret_from_fork+0x1f/0x30
[1194842.672465] irq event stamp: 1729
[1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
[1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
[1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
[1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
The corrupted data will not be written, and filesystem can be unmounted
and mounted again (all changes since the last commit will be lost).
Add the missing check for new_ino so that all non-subvolumes must reside
under the same parent subvolume. There's an exception allowing to
exchange two subvolumes from any parents as the directory representing a
subvolume is only a logical link and does not have any other structures
related to the parent subvolume, unlike files, directories etc, that
are always in the inode namespace of the parent subvolume.
Fixes:
|
||
![]() |
ad71a9ad74 |
btrfs: don't clear page extent mapped if we're not invalidating the full page
[ Upstream commit bcd77455d590eaa0422a5e84ae852007cfce574a ] [BUG] With current btrfs subpage rw support, the following script can lead to fs hang: $ mkfs.btrfs -f -s 4k $dev $ mount $dev -o nospace_cache $mnt $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt The fs will hang at btrfs_start_ordered_extent(). [CAUSE] In above test case, btrfs_invalidate() will be called with the following parameters: offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000 Since @offset is 0, btrfs_invalidate() will try to invalidate the full page, and finally call clear_page_extent_mapped() which will detach subpage structure from the page. And since the page no longer has subpage structure, the subpage dirty bitmap will be cleared, preventing the dirty range from being written back, thus no way to wake up the ordered extent. [FIX] Just follow other filesystems, only to invalidate the page if the range covers the full page. There are cases like truncate_setsize() which can call btrfs_invalidatepage() with offset == 0 and length != 0 for the last page of an inode. Although the old code will still try to invalidate the full page, we are still safe to just wait for ordered extent to finish. So it shouldn't cause extra problems. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
![]() |
6b00b1717f |
btrfs: compression: don't try to compress if we don't have enough pages
commit f2165627319ffd33a6217275e5690b1ab5c45763 upstream. The early check if we should attempt compression does not take into account the number of input pages. It can happen that there's only one page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED have been processed, or an isolated page that won't be converted to an inline extent. The single page would be compressed but a later check would drop it again because the result size must be at least one block shorter than the input. That can never work with just one page. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
0df50d47d1 |
btrfs: abort in rename_exchange if we fail to insert the second ref
commit dc09ef3562726cd520c8338c1640872a60187af5 upstream. Error injection stress uncovered a problem where we'd leave a dangling inode ref if we failed during a rename_exchange. This happens because we insert the inode ref for one side of the rename, and then for the other side. If this second inode ref insert fails we'll leave the first one dangling and leave a corrupt file system behind. Fix this by aborting if we did the insert for the first inode ref. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
b547a16b24 |
btrfs: mark ordered extent and inode with error if we fail to finish
commit d61bec08b904cf171835db98168f82bc338e92e4 upstream. While doing error injection testing I saw that sometimes we'd get an abort that wouldn't stop the current transaction commit from completing. This abort was coming from finish ordered IO, but at this point in the transaction commit we should have gotten an error and stopped. It turns out the abort came from finish ordered io while trying to write out the free space cache. It occurred to me that any failure inside of finish_ordered_io isn't actually raised to the person doing the writing, so we could have any number of failures in this path and think the ordered extent completed successfully and the inode was fine. Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and marking the mapping of the inode with mapping_set_error, so any callers that simply call fdatawait will also get the error. With this we're seeing the IO error on the free space inode when we fail to do the finish_ordered_io. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
56001dda03 |
btrfs: avoid RCU stalls while running delayed iputs
commit 71795ee590111e3636cc3c148289dfa9fa0a5fc3 upstream. Generally a delayed iput is added when we might do the final iput, so usually we'll end up sleeping while processing the delayed iputs naturally. However there's no guarantee of this, especially for small files. In production we noticed 5 instances of RCU stalls while testing a kernel release overnight across 1000 machines, so this is relatively common: host count: 5 rcu: INFO: rcu_sched self-detected stall on CPU rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208 (t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1 CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1 Call Trace: <IRQ> dump_stack+0x50/0x70 nmi_cpu_backtrace.cold.6+0x30/0x65 ? lapic_can_unplug_cpu.cold.30+0x40/0x40 nmi_trigger_cpumask_backtrace+0xba/0xca rcu_dump_cpu_stacks+0x99/0xc7 rcu_sched_clock_irq.cold.90+0x1b2/0x3a3 ? trigger_load_balance+0x5c/0x200 ? tick_sched_do_timer+0x60/0x60 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x24/0x50 tick_sched_timer+0x37/0x70 __hrtimer_run_queues+0xfe/0x270 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x5e/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0 RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029 RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200 RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000 R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200 R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870 ? kzalloc.constprop.57+0x30/0x30 list_lru_add+0x5a/0x100 inode_lru_list_add+0x20/0x40 iput+0x1c1/0x1f0 run_delayed_iput_locked+0x46/0x90 btrfs_run_delayed_iputs+0x3f/0x60 cleaner_kthread+0xf2/0x120 kthread+0x10b/0x130 Fix this by adding a cond_resched_lock() to the loop processing delayed iputs so we can avoid these sort of stalls. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
2c8d6a9474 |
btrfs: fix slab cache flags for free space tree bitmap
commit 34e49994d0dcdb2d31d4d2908d04f4e9ce57e4d7 upstream.
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
that's a debugging flag and not always enabled. Also the other slabs are
created with at least SLAB_MEM_SPREAD that we want as well to average
the memory placement cost.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes:
|
||
![]() |
bf6dd437c3 |
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
commit 4d14c5cde5c268a2bc26addecf09489cb953ef64 upstream Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction") * |
||
![]() |
6fc9e5866c |
btrfs: fix race between swap file activation and snapshot creation
commit dd0734f2a866f9d619d4abf97c3d71bcdee40ea9 upstream.
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes:
|
||
![]() |
501fdd1cef |
btrfs: fix race between writes to swap files and scrub
commit 195a49eaf655eb914896c92cecd96bc863c9feb3 upstream.
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes:
|
||
![]() |
6a402b937e |
btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
[ Upstream commit 951c80f83d61bd4b21794c8aba829c3c1a45c2d0 ] Commit |
||
![]() |
a6703c7115 |
btrfs: fix crash after non-aligned direct IO write with O_DSYNC
Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we end up triggering an assertion and crashing. Example reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT # Do a direct IO write with O_DSYNC into a non-aligned range... xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar umount $MNT When running the reproducer an assertion fails and produces the following trace: [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467 [ 2418.403745] ------------[ cut here ]------------ [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286! [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G D 5.10.15-btrfs-next-87 #1 [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs] [ 2418.407835] Code: e6 48 c7 (...) [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246 [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000 [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000 [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000 [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000 [ 2418.412713] FS: 00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000 [ 2418.413326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0 [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2418.415669] Call Trace: [ 2418.416254] btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs] [ 2418.416812] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] [ 2418.417380] btrfs_buffered_write+0x1b0/0x7f0 [btrfs] [ 2418.418315] btrfs_file_write_iter+0x2a9/0x770 [btrfs] [ 2418.418920] new_sync_write+0x11f/0x1c0 [ 2418.419430] vfs_write+0x2bb/0x3b0 [ 2418.419972] __x64_sys_pwrite64+0x90/0xc0 [ 2418.420486] do_syscall_64+0x33/0x80 [ 2418.420979] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 2418.421486] RIP: 0033:0x7f54fda0b986 [ 2418.421981] Code: 48 c7 c0 (...) [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012 [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986 [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003 [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400 [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000 [ 2418.426148] Modules linked in: btrfs blake2b_generic (...) [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]--- 1) At btrfs_file_write_iter() we set current->journal_info to BTRFS_DIO_SYNC_STUB; 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO(); 3) We can't do the direct IO write because it starts at a non-aligned offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from check_direct_IO() which does the alignment check), but we leave current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it at btrfs_dio_iomap_begin(), because we assume we always get there; 4) Then at __btrfs_direct_write() we see that the attempt to do the direct IO write was not successful, 0 bytes written, so we fallback to a buffered write by calling btrfs_buffered_write(); 5) There we call btrfs_check_data_free_space() which in turn calls btrfs_alloc_data_chunk_ondemand() and that calls btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA; 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion: int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes, enum btrfs_reserve_flush_enum flush) { struct btrfs_space_info *data_sinfo = fs_info->data_sinfo; int ret; ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA || flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE); ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA); (...) So fix that by setting the journal to NULL whenever check_direct_IO() returns a failure. This bug only affects 5.10 kernels, and the regression was introduced in 5.10-rc1 by commit |
||
![]() |
e3b5252b5c |
btrfs: shrink delalloc pages instead of full inodes
[ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ] Commit |
||
![]() |
17243f73ad |
btrfs: fix deadlock when cloning inline extent and low on free metadata space
[ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes:
|
||
![]() |
c334730988 |
btrfs: fix missing delalloc new bit for new delalloc ranges
When doing a buffered write, through one of the write family syscalls, we look for ranges which currently don't have allocated extents and set the 'delalloc new' bit on them, so that we can report a correct number of used blocks to the stat(2) syscall until delalloc is flushed and ordered extents complete. However there are a few other places where we can do a buffered write against a range that is mapped to a hole (no extent allocated) and where we do not set the 'new delalloc' bit. Those places are: - Doing a memory mapped write against a hole; - Cloning an inline extent into a hole starting at file offset 0; - Calling btrfs_cont_expand() when the i_size of the file is not aligned to the sector size and is located in a hole. For example when cloning to a destination offset beyond EOF. So after such cases, until the corresponding delalloc range is flushed and the respective ordered extents complete, we can report an incorrect number of blocks used through the stat(2) syscall. In some cases we can end up reporting 0 used blocks to stat(2), which is a particular bad value to report as it may mislead tools to think a file is completely sparse when its i_size is not zero, making them skip reading any data, an undesired consequence for tools such as archivers and other backup tools, as reported a long time ago in the following thread (and other past threads): https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html Example reproducer: $ cat reproducer.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -c "truncate 64K" \ -c "mmap -w 0 64K" \ -c "mwrite -S 0xab 0 64K" \ -c "munmap" \ $MNT/foo blocks_used=$(stat -c %b $MNT/foo) echo "blocks used: $blocks_used" if [ $blocks_used -eq 0 ]; then echo "ERROR: blocks used is 0" fi umount $DEV $ ./reproducer.sh blocks used: 0 ERROR: blocks used is 0 So move the logic that decides to set the 'delalloc bit' bit into the function btrfs_set_extent_delalloc(), since that is what we use for all those missing cases as well as for the cases that currently work well. This change is also preparatory work for an upcoming patch that fixes other problems related to tracking and reporting the number of bytes used by an inode. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
1afc708dca |
btrfs: fix relocation failure due to race with fallocate
When doing a fallocate() we have a short time window, after reserving an extent and before starting a transaction, where if relocation for the block group containing the reserved extent happens, we can end up missing the extent in the data relocation inode causing relocation to fail later. This only happens when we don't pass a transaction to the internal fallocate function __btrfs_prealloc_file_range(), which is for all the cases where fallocate() is called from user space (the internal use cases include space cache extent allocation and relocation). When the race triggers the relocation failure, it produces a trace like the following: [200611.995995] ------------[ cut here ]------------ [200611.997084] BTRFS: Transaction aborted (error -2) [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs] [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...) [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1 [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs] [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...) [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction. [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286 [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000 [200612.011078] BTRFS info (device sdb): forced readonly [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000 [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88 [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08 [200612.016182] FS: 00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000 [200612.017174] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0 [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [200612.021520] Call Trace: [200612.022434] btrfs_cow_block+0x10b/0x250 [btrfs] [200612.023407] do_relocation+0x54e/0x7b0 [btrfs] [200612.024343] ? do_raw_spin_unlock+0x4b/0xc0 [200612.025280] ? _raw_spin_unlock+0x29/0x40 [200612.026200] relocate_tree_blocks+0x3bc/0x6d0 [btrfs] [200612.027088] relocate_block_group+0x2f3/0x600 [btrfs] [200612.027961] btrfs_relocate_block_group+0x15e/0x340 [btrfs] [200612.028896] btrfs_relocate_chunk+0x38/0x110 [btrfs] [200612.029772] btrfs_balance+0xb22/0x1790 [btrfs] [200612.030601] ? btrfs_ioctl_balance+0x253/0x380 [btrfs] [200612.031414] btrfs_ioctl_balance+0x2cf/0x380 [btrfs] [200612.032279] btrfs_ioctl+0x620/0x36f0 [btrfs] [200612.033077] ? _raw_spin_unlock+0x29/0x40 [200612.033948] ? handle_mm_fault+0x116d/0x1ca0 [200612.034749] ? up_read+0x18/0x240 [200612.035542] ? __x64_sys_ioctl+0x83/0xb0 [200612.036244] __x64_sys_ioctl+0x83/0xb0 [200612.037269] do_syscall_64+0x33/0x80 [200612.038190] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [200612.038976] RIP: 0033:0x7fa55d07ed87 [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...) [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87 [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003 [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0 [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614 [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470 [200612.046361] irq event stamp: 0 [200612.047040] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0 [200612.048387] softirqs last enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0 [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0 [200612.049722] ---[ end trace 49006c6876e65227 ]--- The race happens like this: 1) Task A starts an fallocate() (plain or zero range) and it calls __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL; 2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to block group X; 3) Before task A gets into btrfs_replace_file_extents(), through the call to insert_prealloc_file_extent(), task B starts relocation of block group X; 4) Task B enters btrfs_relocate_block_group() and it sets block group X to RO mode; 5) Task B enters relocate_block_group(), it calls prepare_to_relocate() whichs joins/starts a transaction and then commits the transaction; 6) Task B then starts scanning the extent tree looking for extents that belong to block group X - it does not find yet the extent reserved by task A, since that extent was not yet added to the extent tree, as its delayed reference was not even yet created at this point; 7) The data relocation inode ends up not having the extent reserved by task A associated to it; 8) Task A then starts a transaction through btrfs_replace_file_extents(), inserts a file extent item in the subvolume tree pointing to the reserved extent and creates a delayed reference for it; 9) Task A finishes and returns success to user space; 10) Later on, while relocation is still in progress, the leaf where task A inserted the new file extent item is COWed, so we end up at __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in turn calls relocation.c:replace_file_extents(); 11) At relocation.c:replace_file_extents() we iterate over all the items in the leaf and find the file extent item pointing to the extent that was allocated by task A, and then call relocation.c:get_new_location(), to find the new location for the extent; 12) However relocation.c:get_new_location() fails, returning -ENOENT, because it couldn't find a corresponding file extent item associated with the data relocation inode. This is because the extent was not seen in the extent tree at step 6). The -ENOENT error is propagated to __btrfs_cow_block(), which aborts the transaction. So fix this simply by decrementing the block group's number of reservations after calling insert_prealloc_file_extent(), as relocation waits for that counter to go down to zero before calling prepare_to_relocate() and start looking for extents in the extent tree. This issue only started to happen recently as of commit |
||
![]() |
1fd4033dd0 |
btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag
Commit |
||
![]() |
e3c57805f8 |
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
Since we now perform direct reads using i_rwsem, we can remove this inode flag used to co-ordinate unlocked reads. The truncate call takes i_rwsem. This means it is correctly synchronized with concurrent direct reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
905eb88bce |
btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which was initialising it for inodes. No functional changes. Removing 8 bytes from extent_io_tree in turn reduces size of other structures where it is embedded, notably btrfs_inode where it reduces size by 24 bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
908930f3ed |
btrfs: stop calling submit_bio_hook for data inodes
Instead export and rename the function to btrfs_submit_data_bio and call it directly in submit_one_bio. This avoids paying the cost for speculative attacks mitigations and improves code readability. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
1f03d9cfda |
btrfs: remove extent_io_ops::readpage_end_io_hook
It's no longer used so let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
9a446d6a9f |
btrfs: replace readpage_end_io_hook with direct calls
Don't call readpage_end_io_hook for the btree inode. Instead of relying on indirect calls to implement metadata buffer validation simply check if the inode whose page we are processing equals the btree inode. If it does call the necessary function. This is an improvement in 2 directions: 1. We aren't paying the penalty of indirect calls in a post-speculation attacks world. 2. The function is now named more explicitly so it's obvious what's going on This is in preparation to removing struct extent_io_ops altogether. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
c0a4360305 |
btrfs: remove inode argument from btrfs_start_ordered_extent
The passed in ordered_extent struct is always well-formed and contains the inode making the explicit argument redundant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
510f85edf1 |
btrfs: remove inode argument from add_pending_csums
It's used to reference the csum root which can be done from the trans handle as well. Simplify the signature and while at it also remove the noinline attribute as the function uses only at most 16 bytes of stack space. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
3c38c877fc |
btrfs: sink inode argument in insert_ordered_extent_file_extent
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
71fe0a55da |
btrfs: switch btrfs_remove_ordered_extent to btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
633cc816f7 |
btrfs: clean BTRFS_I usage in btrfs_destroy_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
0f20881249 |
btrfs: open code extent_read_full_page to its sole caller
This makes reading the code a tad easier by decreasing the level of indirection by one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
6f15af6060 |
btrfs: sink read_flags argument into extent_read_full_page
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
003c286aef |
btrfs: sink mirror_num argument in extent_read_full_page
It's always set to 0 from the sole caller - btrfs_readpage. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
c1be9c1ad5 |
btrfs: promote extent_read_full_page to btrfs_readpage
Now that btrfs_readpage is the only caller of extent_read_full_page the latter can be open coded in the former. Use the occassion to rename __extent_read_full_page to extent_read_full_page. To facillitate this change submit_one_bio has to be exported as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
72cffee463 |
btrfs: remove mirror_num argument from extent_read_full_page
It's called only from btrfs_readpage which always passes 0 so just sink the argument into extent_read_full_page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
1a5ee1e626 |
btrfs: remove btrfs_get_extent indirection from __do_readpage
Now that this function is only responsible for reading data pages it's no longer necessary to pass get_extent_t parameter across several layers of functions. This patch removes this parameter from multiple functions: __get_extent_map/__do_readpage/__extent_read_full_page/ extent_read_full_page and simply calls btrfs_get_extent directly in __get_extent_map. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
306bfec02b |
btrfs: rename btrfs_punch_hole_range() to a more generic name
The function btrfs_punch_hole_range() is now used to replace all the file extents in a given file range with an extent described in the given struct btrfs_replace_extent_info argument. This extent can either be an existing extent that is being cloned or it can be a new extent (namely a prealloc extent). When that argument is NULL it only punches a hole (drops all the existing extents) in the file range. So rename the function to btrfs_replace_file_extents(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
bf385648fa |
btrfs: rename struct btrfs_clone_extent_info to a more generic name
Now that we can use btrfs_clone_extent_info to convey information for a new prealloc extent as well, and not just for existing extents that are being cloned, rename it to btrfs_replace_extent_info, which reflects the fact that this is now more generic and it is used to replace all existing extents in a file range with the extent described by the structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
fb870f6cdd |
btrfs: remove item_size member of struct btrfs_clone_extent_info
The value of item_size of struct btrfs_clone_extent_info is always set to the size of a non-inline file extent item, and in fact the infrastructure that uses this structure (btrfs_punch_hole_range()) does not work with inline file extents at all (and it is not supposed to). So just remove that field from the structure and use directly sizeof(struct btrfs_file_extent_item) instead. Also assert that the file extent type is not inline at btrfs_insert_clone_extent(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8fccebfa53 |
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
c3e1f96c37 |
btrfs: enumerate the type of exclusive operation in progress
Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
facee0a09c |
btrfs: make extent_fiemap take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |