There was previously MAX_ALIGNED_DEV_REQ_(UP|DOWN)SIZE macros which
evaluated to MAX_DEV_REQ_(UP|DOWN)SIZE+8. As it is unclear what this is
for, other than creating a situation where we accept more data than we
can parse, it is removed.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Previously, when finishing a checkpoint, we have persisted all fs meta
info including meta inode, node inode, dentry page of directory inode, so,
after a sudden power cut, f2fs can recover from last checkpoint with full
directory structure.
But during checkpoint, we didn't flush dirty pages of regular and symlink
inode, so such dirty datas still in memory will be lost in that moment of
power off.
In order to reduce the chance of lost data, this patch enables
f2fs_balance_fs_bg with the ability of data flushing. It will try to flush
user data before starting a checkpoint. So user's data written after last
checkpoint which may not be fsynced could be saved.
When we mount with data_flush option, after every period of cp_interval
(could be configured in sysfs: /sys/fs/f2fs/device/cp_interval) seconds
user data could be flushed into device once f2fs_balance_fs_bg was called
in kworker thread or gc thread.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add to stat dirty regular and symlink inode for showing in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When doing a direct IO write, __blockdev_direct_IO() can call the
btrfs_get_blocks_direct() callback one or more times before it calls the
btrfs_submit_direct() callback. However it can fail after calling the
first callback and before calling the second callback, which is a problem
because the first one creates ordered extents and the second one is the
one that submits bios that cover the ordered extents created by the first
one. That means the ordered extents will never complete nor have any of
the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
subsequent operations (such as other direct IO writes, buffered writes or
hole punching) that lock the same IO range and lookup for ordered extents
in the range to hang forever waiting for those ordered extents because
they can not complete ever, since no bio was submitted.
Fix this by tracking a range of created ordered extents that don't have
yet corresponding bios submitted and completing the ordered extents in
the range if __blockdev_direct_IO() fails with an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If readpages() (triggered by defrag or buffered reads) is called while a
direct IO write is in progress, we have a small time window where we can
deadlock, resulting in traces like the following being generated:
[84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
[84723.214310] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.217313] fio D ffff88023ec75218 0 2849 2835 0x00000000
[84723.218778] ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
[84723.220458] ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
[84723.230597] 0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
[84723.232085] Call Trace:
[84723.232625] [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
[84723.233529] [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.234398] [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
[84723.235384] [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
[84723.236426] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.237502] [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
[84723.238807] [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
[84723.242012] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
[84723.243064] [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
[84723.244116] [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
[84723.245029] [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
[84723.245942] [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
[84723.246596] [<ffffffff81478953>] bit_wait_io+0x39/0x45
[84723.247503] [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
[84723.248540] [<ffffffff8111684f>] __lock_page+0x66/0x68
[84723.249558] [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
[84723.250844] [<ffffffff81124a04>] lock_page+0x2c/0x2f
[84723.251871] [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
[84723.253274] [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
[84723.254757] [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
[84723.256378] [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
[84723.258556] [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
[84723.260064] [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
[84723.261479] [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
[84723.262961] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.264449] [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
[84723.265614] [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
[84723.266769] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.268264] [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
[84723.270954] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[84723.272465] [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
[84723.273734] [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
[84723.275101] [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
[84723.276200] [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
[84723.277298] [<ffffffff8116d79d>] SyS_write+0x50/0x7e
[84723.278327] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[84723.279595] INFO: lockdep is turned off.
[84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
[84723.380323] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84723.383003] btrfs D ffff88023ed75218 0 2923 2859 0x00000000
[84723.384277] ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
[84723.385748] ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
[84723.387152] ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
[84723.388620] Call Trace:
[84723.389105] [<ffffffff8147856a>] schedule+0x7d/0x95
[84723.391882] [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
[84723.393718] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[84723.395659] [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
[84723.397383] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.398852] [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
[84723.400561] [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
[84723.401787] [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
[84723.403121] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
[84723.404583] [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
[84723.406007] [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
[84723.407502] [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
[84723.408937] [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
[84723.410487] [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
[84723.411710] [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
[84723.413007] [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
[84723.414085] [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
[84723.415307] [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
[84723.416532] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
[84723.417731] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.418699] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
[84723.421532] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
[84723.422629] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
[84723.423712] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
[84723.424801] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[84723.425968] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[84723.427063] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[84723.428138] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
Consider the following logical and physical file layout:
logical: ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
4K 8K 16K
physical: ... 12853248 12857344 1103101952 ...
(= 12853248 + 4K)
Extents A and B are physically adjacent. The following diagram shows a
sequence of events that lead to the deadlock when we attempt to do a
direct IO write against the file range [4K, 16K[ and a defrag is triggered
simultaneously.
CPU 1 CPU 2
btrfs_direct_IO()
btrfs_get_blocks_direct()
creates ordered extent A, covering
the 4k prealloc extent A (range [4K, 8K[)
btrfs_defrag_file()
page_cache_sync_readahead([0K, 1M[)
btrfs_readpages()
extent_readpages()
locks all pages in the file
range [0K, 128K[ through calls
to add_to_page_cache_lru()
__do_contiguous_readpages()
finds ordered extent A
waits for it to complete
btrfs_get_blocks_direct() called again
lock_extent_direct(range [8K, 16K[)
finds a page in range [8K, 16K[ through
btrfs_page_exists_in_range()
invalidate_inode_pages2_range([8K, 16K[)
--> tries to lock pages that are already
locked by the task at CPU 2
--> our task, running __blockdev_direct_IO(),
hangs waiting to lock the pages and the
submit bio callback, btrfs_submit_direct(),
ends up never being called, resulting in the
ordered extent A never completing (because a
corresponding bio is never submitted) and
CPU 2 will wait for it forever while holding
the pages locked
---> deadlock!
Fix this by removing the page invalidation approach when attempting to
lock the range for IO from the callback btrfs_get_blocks_direct() and
falling back buffered IO. This was a rare case anyway and well behaved
applications do not mix concurrent direct IO writes with buffered reads
anyway, being a concurrent defrag the only normal case that could lead
to the deadlock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Commit 61de718fce ("Btrfs: fix memory corruption on failure to submit
bio for direct IO") fixed problems with the error handling code after we
fail to submit a bio for direct IO. However there were 2 problems that it
did not address when the failure is due to memory allocation failures for
direct IO writes:
1) We considered that there could be only one ordered extent for the whole
IO range, which is not always true, as we can have multiple;
2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
which can make other tasks running btrfs_wait_logged_extents() hang
forever, since they wait for that bit to be set. The general assumption
is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
Fix these issues by moving part of the btrfs_endio_direct_write() handler
into a new helper function and having that new helper function called when
we fail to allocate memory to submit the bio (and its private object) for
a direct IO write.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
When a transaction is aborted, or its commit fails before writing the new
superblock and calling btrfs_finish_extent_commit(), we leak reference
counts on the block groups attached to the transaction's delete_bgs list,
because btrfs_finish_extent_commit() is never called for those two cases.
Fix this by dropping their references at btrfs_put_transaction(), which
is called when transactions are aborted (by making the transaction kthread
commit the transaction) or if their commits fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:
[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---
This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_fallocate()
btrfs_alloc_data_chunk_ondemand()
btrfs_join_transaction()
--> starts a new transaction
do_chunk_alloc()
lock fs_info->chunk_mutex
btrfs_alloc_chunk()
--> creates extent map for
the new chunk with
em->bdev->map->stripes[i]->dev->devid
== X (X > 0)
--> extent map is added to
fs_info->mapping_tree
--> initial phase of bg A
allocation completes
unlock fs_info->chunk_mutex
lock fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
btrfs_end_transaction()
btrfs_create_pending_block_groups()
--> starts final phase of
bg A creation (update device,
extent, and chunk trees, etc)
btrfs_finish_chunk_alloc()
btrfs_update_device()
--> attempts to update a device
item with ID == 0ULL
(BTRFS_DEV_REPLACE_DEVID)
which is the current ID of
bg A's
em->bdev->map->stripes[i]->dev->devid
--> doesn't find such item
returns -ENOENT
--> the device id should have been X
and not 0ULL
got -ENOENT from
btrfs_finish_chunk_alloc()
and aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid, which is X (and X > 0)
frees the srcdev
unlock fs_info->chunk_mutex
So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.
This happened while running fstest btrfs/071.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Add a new option 'data_flush' to enable data flush functionality.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Maintain regular/symlink inode which has dirty pages in global dirty list
and record their total dirty pages count like the way of handling directory
inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce __f2fs_commit_super to include duplicated codes in
f2fs_commit_super for cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have already got one copy of valid super block in memory, do not grab
buffer header of super block all the time.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs use fields of f2fs_super_block struct directly in a grabbed buffer.
Once the buffer happen to be destroyed (e.g. through dd), it may bring
in unpredictable effect on f2fs.
This patch fixes to allocate additional buffer to store datas of super
block rather than using grabbed block buffer directly.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_map_blocks will set m_flags and m_len to 0, so we don't need to
reset m_flags ourselves, but have to reset m_len to correct value
before use it again.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We do need to serialize layout stateid morphing operations, but we
currently hold the ls_mutex across a layout recall which is pretty
ugly. It's also unnecessary -- once we've bumped the seqid and
copied it, we don't need to serialize the rest of the CB_LAYOUTRECALL
vs. anything else. Just drop the mutex once the copy is done.
This was causing a "workqueue leaked lock or atomic" warning and an
occasional deadlock.
There's more work to be done here but this fixes the immediate
regression.
Fixes: cc8a55320b "nfsd: serialize layout stateid morphing operations"
Cc: stable@vger.kernel.org
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Introduce __remove_dirty_inode to clean up codes in remove_dirty_dir_inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new dirt list node member in inode info for linking the inode to
global dirty list in superblock, instead of old implementation which
allocate slab cache memory as an entry to inode.
It avoids memory pressure due to slab cache allocation, and also makes
codes more clean.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
remove_dirty_dir_inode will be renamed to remove_dirty_inode as a generic
function in following patch for removing directory/regular/symlink inode
in global dirty list.
Here rename ino management related functions for readability, also in
order to avoid name conflict.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page. But if the first call returns an error,
our page will be unlocked and its not safe to call it again.
This bug goes all the way back to 2011, and it's not something commonly
hit.
While we're here, add a more explicit check for the page being truncated
away. The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Dave Jones found a warning from kasan in setup_cluster_bitmaps()
==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr ffff88039bef6828
Read of size 8 by task nfsd/1009
page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null)
index:0x0
flags: 0x8000000000000000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G W
4.4.0-rc3-backup-debug+ #1
ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
Call Trace:
[<ffffffffa680a43e>] dump_stack+0x4b/0x6d
[<ffffffffa62638d1>] kasan_report_error+0x501/0x520
[<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
[<ffffffffa6263948>] kasan_report+0x58/0x60
[<ffffffffa6814b00>] ? rb_last+0x10/0x40
[<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa6262ead>] __asan_load8+0x5d/0x70
[<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
[<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
[<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
[<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
[<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
Andrey noticed this was because we were doing list_first_entry on a list
that might be empty. Rework the tests a bit so we don't do that.
Signed-off-by: Chris Mason <clm@fb.com>
Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Dave Jones <dsj@fb.com>
Filesystem fuzzing revealed that we could get stuck in the
udf_process_sequence() loop.
The maximum limit was chosen arbitrarily but fixes the problem I saw.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
AV dislikes many parts of orangefs_devreq_writev. Besides making
orangefs_devreq_writev more easily readable and better commented,
this patch makes an effort to address some of the problems:
> The 5th is quietly ignored unless trailer_size is positive and
> status is zero. If trailer_size > 0 && status == 0, you verify that
> the length of the 5th segment is no more than trailer_size and copy
> it to vmalloc'ed buffer. Without bothering to zero the rest of that
> buffer out.
It was just wrong to allow a 5th segment that is not exactly equal to
trailer_size. Now that that's fixed, there's nothing to zero out in
the vmalloced buffer - it is exactly the right size to hold the
5th segment.
> Another API bogosity: when the 5th segment is present, successful writev()
> returns the sum of sizes of the first 4.
Added size of 5th segment to writev return...
> if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine
> and proceed with garbage.
If 4th segment isn't exactly sizeof(struct pvfs2_downcall_s), whine and fail.
> if the 32bit value 4 bytes into op->downcall is zero and 64bit
> value following it is non-zero, the latter is interpreted as the size of
> trailer data.
The latter is what userspace claimed was the length of the trailer data.
The kernel module now compares it to the trailer iovec's iov_len as a
sanity check.
> if there's no trailer, the 5th segment (if present) is completely ignored.
Whine and fail if there should be no trailer, yet a 5th segment is present.
> if vmalloc fails, act as if status (32bit at offset 5 into
> op->downcall) had been -ENOMEM and don't look at the 5th segment at all.
whine and fail with -ENOMEM.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When gfs2 was unmounting filesystems or changing them to read-only it
was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This
caused a race. If an inode glock got demoted in the gap between
clearing the bit and the shutdown flush, it would be unable to reserve
log space to clear out the active items list in inode_go_sync, causing an
error in inode_go_inval because the glock was still dirty.
To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
shutdown log flush. This means that, because of the locking on the log
blocks, either inode_go_sync will be able to reserve space to clean the
glock before the shutdown flush, or the shutdown flush will clean the
glock itself, before inode_go_sync fails to reserve the space. Either
way, the glock will be clean before inode_go_inval.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
gfs2 currently returns 31 bits of filename hash as a cookie that readdir
uses for an offset into the directory. When there are a large number of
directory entries, the likelihood of a collision goes up way too
quickly. GFS2 will now return cookies that are guaranteed unique for a
while, and then fail back to using 30 bits of filename hash.
Specifically, the directory leaf blocks are divided up into chunks based
on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
cookie is based off the chunk where it starts, in the linked list of
leaf blocks that it hashes to (there are 131072 hash buckets). Directory
entries will have unique names until they take reach chunk 8192.
Assuming the largest filenames possible, and the least efficient spacing
possible, this new method will still be able to return unique names when
the previous method has statistically more than a 99% chance of a
collision. The non-unique names it fails back to are guaranteed to not
collide with the unique names.
unique cookies will be in this format:
- 1 bit "0" to make sure the the returned cookie is positive
- 17 bits for the hash table index
- 1 bit for the mode "0"
- 13 bits for the offset
non-unique cookies will be in this format:
- 1 bit "0" to make sure the the returned cookie is positive
- 17 bits for the hash table index
- 1 bit for the mode "1"
- 13 more bits of the name hash
Another benefit of location based cookies, is that once a directory's
exhash table is fully extended (so that multiple hash table indexs do
not use the same leaf blocks), gfs2 can skip sorting the directory
entries until it reaches the non-unique ones, and then it only needs to
sort these. This provides a significant speed up for directory reads of
very large directories.
The only issue is that for these cookies to continue to point to the
correct entry as files are added and removed from the directory, gfs2
must keep the entries at the same offset in the leaf block when they are
split (see my previous patch). This means that until all the nodes in a
cluster are running with code that will split the directory leaf blocks
this way, none of the nodes can use the new cookie code. To deal with
this, gfs2 now has the mount option loccookie, which, if set, will make
it return these new location based cookies. This option must not be set
until all nodes in the cluster are at least running this version of the
kernel code, and you have guaranteed that there are no outstanding
cookies required by other software, such as NFS.
gfs2 uses some of the extra space at the end of the gfs2_dirent
structure to store the calculated readdir cookies. This keeps us from
needing to allocate a seperate array to hold these values. gfs2
recomputes the cookie stored in de_cookie for every readdir call. The
time it takes to do so is small, and if gfs2 expected this value to be
saved on disk, the new code wouldn't work correctly on filesystems
created with an earlier version of gfs2.
One issue with adding de_cookie to the union in the gfs2_dirent
structure is that it caused the union to align itself to a 4 byte
boundary, instead of its previous 2 byte boundary. This changed the
offset of de_rahead. To solve that, I pulled de_rahead out of the union,
since it does not need to be there.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Currently, when gfs2 splits a directory leaf block, the dirents that
need to be copied to the new leaf block are packed into the start of it.
This is good for space efficiency. However, if gfs2 were to copy those
dirents into the exact same offset in the new leaf block as they had in
the old block, it would be able to generate a readdir cookie based on
the dirent location, that would be guaranteed to be unique up well past
where the current code is statistically almost guaranteed to have
collisions. So, gfs2 now keeps the dirent's offset in the block the
same when it copies it to the new leaf block.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
At some point in the past, we used to have a timeout when GFS2 was
unmounting, trying to clear out its glocks. If the timeout expires,
it would dump the remaining glocks to the kernel messages so that
developers can debug the problem. That timeout was eliminated,
probably by accident. This patch reintroduces it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function update_statfs called gfs2_statfs_change_out
to update the master statfs buffer without the sd_statfs_spin held.
In theory, another process could call gfs2_statfs_sync, which takes
the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
a theoretical timing window in which one process could write the
master statfs buffer, then another comes along and re-reads it, wiping
out the changes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch makes no functional changes. Its goal is to reduce the
size of the gfs2 inode in memory by rearranging structures and
changing the size of some variables within the structure.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
@lend of filemap_write_and_wait_range is supposed to be a "offset
in bytes where the range ends (inclusive)". Subtract 1 to avoid
writing an extra page.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Before this patch, multi-block reservation structures were allocated
from a special slab. This patch folds the structure into the gfs2_inode
structure. The disadvantage is that the gfs2_inode needs more memory,
even when a file is opened read-only. The advantages are: (a) we don't
need the special slab and the extra time it takes to allocate and
deallocate from it. (b) we no longer need to worry that the structure
exists for things like quota management. (c) This also allows us to
remove the calls to get_write_access and put_write_access since we
know the structure will exist.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
If read_node_page return LOCKED_PAGE, in its caller it's better a) skip
unneeded 'Update' flag and mapping info verfication; b) check nid value
stored in footer structure of node page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The Kconfig currently controlling compilation of this code is:
config QUOTA
bool "Quota support"
...meaning that it currently is not being built as a module by anyone.
Lets remove the couple traces of modularity so that when reading the
driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering gets bumped to one level earlier when we
use the more appropriate fs_initcall here. However we've made similar
changes before without any fallout and none is expected here either.
We don't delete module.h because the code in turn tries to load other
modules as appropriate and so it still needs that header.
Cc: Jan Kara <jack@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The Kconfig currently controlling compilation of this code is:
config QUOTA_NETLINK_INTERFACE
bool "Report quota messages through netlink interface"
...meaning that it currently is not being built as a module by anyone.
Lets remove the couple traces of modularity so that when reading the
driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering gets bumped to one level earlier when we
use the more appropriate fs_initcall here. However we've made similar
changes before without any fallout and none is expected here either.
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Change the list operation to only return whether or not an attribute
should be listed. Copying the attribute names into the buffer is moved
to the callers.
Since the result only depends on the dentry and not on the attribute
name, we do not pass the attribute name to list operations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MIPS: fix DMA contiguous allocation
sh64: fix __NR_fgetxattr
ocfs2: fix SGID not inherited issue
mm/oom_kill.c: avoid attempting to kill init sharing same memory
drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
tmpfs: fix shmem_evict_inode() warnings on i_blocks
mm/hugetlb.c: fix resv map memory leak for placeholder entries
mm: hugetlb: call huge_pte_alloc() only if ptep is null
kernel: remove stop_machine() Kconfig dependency
mm: kmemleak: mark kmemleak_init prototype as __init
mm: fix kerneldoc on mem_cgroup_replace_page
osd fs: __r4w_get_page rely on PageUptodate for uptodate
MAINTAINERS: make Vladimir co-maintainer of the memory controller
mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo
memcg: fix memory.high target
mm: hugetlb: fix hugepage memory leak caused by wrong reserve count
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series. This contains:
- A bunch of fixes for lightnvm, should be the last round for this
series. From Matias and Wenwei.
- A writeback detach inode fix from Ilya, also marked for stable.
- A block (though it says SCSI) fix for an OOPS in SCSI runtime power
management.
- Module init error path fixes for null_blk from Minfei"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: Fix error path in module initialization
lightnvm: do not compile in debugging by default
lightnvm: prevent gennvm module unload on use
lightnvm: fix media mgr registration
lightnvm: replace req queue with nvmdev for lld
lightnvm: comments on constants
lightnvm: check mm before use
lightnvm: refactor spin_unlock in gennvm_get_blk
lightnvm: put blks when luns configure failed
lightnvm: use flags in rrpc_get_blk
block: detach bdev inode from its wb in __blkdev_put()
SCSI: Fix NULL pointer dereference in runtime PM