When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.
To avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.
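For illustration, a minimal sketch of the kmalloc-then-vmalloc fallback
pattern that kmem_alloc_large() provides (the helper name and flag
handling below are illustrative, not the exact fs/xfs/kmem.c code):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* Try physically contiguous memory first, then fall back to
     * virtually contiguous memory when fragmentation defeats
     * kmalloc().  buf_alloc_large() is a hypothetical name. */
    static void *buf_alloc_large(size_t size, gfp_t flags)
    {
            void *ptr;

            ptr = kmalloc(size, flags | __GFP_NOWARN);
            if (ptr)
                    return ptr;
            return __vmalloc(size, flags, PAGE_KERNEL);
    }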
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In the pstore compression code, we use the individual
zlib/lzo/lz4/lz4hc/842 compression algorithm APIs to implement the
pstore compression backends, and these implementations contain a lot of
repeated code. This patch uses the crypto compress API to simplify
them; a short sketch of that API's usage follows the list below.
1) rewrite allocate_buf_for_compression, free_buf_for_compression,
pstore_compress, pstore_decompress functions using crypto compress API.
2) drop compress, decompress, allocate, free functions in pstore_zbackend,
and add a zbufsize function to compute each algorithm's compression buffer size.
3) use late_initcall to call ramoops_init later, to make sure the crypto
subsystem has already been initialized.
4) use 'unsigned int' instead of 'size_t' for the length arguments of
the pstore_compress and pstore_decompress functions.
5) rename 'zlib' to 'deflate' to follow the crypto API's name convention.
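A minimal sketch of the crypto compress API usage described above
(buffer names illustrative; error handling trimmed):

    #include <linux/crypto.h>
    #include <linux/err.h>

    static struct crypto_comp *tfm;  /* e.g. crypto_alloc_comp("deflate", 0, 0) */

    /* Compress 'in' into 'out'; lengths are unsigned int, matching 4). */
    static int compress_buf(const u8 *in, unsigned int inlen,
                            u8 *out, unsigned int *outlen)
    {
            if (IS_ERR_OR_NULL(tfm))
                    return -EINVAL;
            return crypto_comp_compress(tfm, in, inlen, out, outlen);
    }

crypto_comp_decompress() has the same shape for the decompression path.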
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked error messages on allocation failures and Kconfig help]
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull overlayfs fixes from Miklos Szeredi:
"This fixes a corner case for NFS exporting (introduced in this cycle)
as well as fixing miscellaneous bugs"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: update Kconfig texts
ovl: redirect_dir=nofollow should not follow redirect for opaque lower
ovl: fix ptr_ret.cocci warnings
ovl: check ERR_PTR() return value from ovl_lookup_real()
ovl: check lower ancestry on encode of lower dir file handle
ovl: hash non-dir by lower inode for fsnotify
Pull xfs fixes from Darrick Wong:
- Fix some iomap locking problems
- Don't allocate cow blocks when we're zeroing file data
* tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't block on the ilock for RWF_NOWAIT
xfs: don't start out with the exclusive ilock for direct I/O
xfs: don't allocate COW blocks for zeroing holes or unwritten extents
Remove a variable length array from fs/reiserfs/reiserfs.h. EMPTY_DIR_SIZE
is used as an array size, but because it uses strlen() it is not
evaluated at compile time. Change its definition to use sizeof() to
force evaluation of the array length at compile time.
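A sketch of the kind of change involved (the exact macro body in
reiserfs.h may differ):

    /* Before: strlen() is not a compile-time constant expression, so
     * arrays sized with EMPTY_DIR_SIZE are variable length arrays. */
    #define EMPTY_DIR_SIZE \
            (DEH_SIZE * 2 + ROUND_UP(strlen(".")) + ROUND_UP(strlen("..")))

    /* After: sizeof("x") - 1 is a constant expression, so the array
     * length is known at compile time. */
    #define EMPTY_DIR_SIZE \
            (DEH_SIZE * 2 + ROUND_UP(sizeof(".") - 1) + \
             ROUND_UP(sizeof("..") - 1))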
Signed-off-by: Kyle Spiers <kyle@spiers.me>
Signed-off-by: Jan Kara <jack@suse.cz>
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.
In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.
Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Before this patch, GFS2 was setting the PageChecked flag for ordered
write pages. This is unnecessary. The ext3 file system only does it
for jdata, and it's only used in jdata circumstances. It only muddies
the already murky waters of writing pages in the aops.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_remove_from_ail is only ever used from log.c, so there
is no reason to declare it extern. This patch removes the extern and
declares it static.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Mark the source inode dirty during a rename instead of just updating the
underlying buffer head. Otherwise, fsync may find the inode clean and
will then skip flushing the journal. A subsequent power failure will
cause the rename to be lost. This happens in command sequences like:
xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo
mv foo bar
xfs_io -c 'fsync' bar
# power failure
Fixes xfstests generic/322, generic/376.
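The shape of the fix in gfs2_rename() is roughly (sketch):

    /* Mark the source inode itself dirty, not just its buffer head,
     * so a later fsync knows it must flush the journal. */
    mark_inode_dirty(&ip->i_inode);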
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly. The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient. This needs
fixing in two places:
In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit. This fixes callers that rely on
ap->allowed to be set even when quotas are off.
In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.
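In sketch form (illustrative, not the literal diff):

    /* In gfs2_quota_lock_check(): quotas off means no limit. */
    ap->allowed = UINT_MAX;

    /* In __gfs2_fallocate(): reset on every loop iteration so one
     * resource group's limit can't leak into the next. */
    while (len > 0) {
            max_blks = UINT_MAX;
            /* ... allocate from the next resource group ... */
    }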
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Instead of using a stack VLA for the parity workspace, preallocate a
memory region. The preallocation avoids having to perform allocations
while a crash dump is being written, etc. This also fixes a missed
release of librs on free.
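A sketch of the idea (names here are hypothetical):

    /* Preallocate the Reed-Solomon parity workspace at init time so
     * no allocation happens while a crash dump is being written. */
    static u16 *rs_workspace;

    static int workspace_init(int ecc_size)
    {
            rs_workspace = kmalloc_array(ecc_size, sizeof(*rs_workspace),
                                         GFP_KERNEL);
            return rs_workspace ? 0 : -ENOMEM;
    }

On teardown, the rs_control from init_rs() must also be released with
free_rs(), which is the missed librs release mentioned above.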
Signed-off-by: Kees Cook <keescook@chromium.org>
To allow for easier build test coverage and run-time testing, this allows
multiple compression algorithms to be built into pstore. Still, only
one is supported at a time (and it can be selected at build time or at
boot time, similar to how LSMs are selected).
Signed-off-by: Kees Cook <keescook@chromium.org>
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0. This can
lead to arbitrary incorrect behavior including crashes or disk
corruption. Revert the incorrect part of that commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add some hints about overlayfs kernel config options.
Enabling NFS export by default is especially recommended against, as it
incurs a performance penalty even if the filesystem is not actually
exported.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of casting, check that we don't end up with giant values, and
perform regular assignments to unsigned int variables rather than
re-casting size_t.
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently, pstore supports three compression algorithms: zlib, lzo and
lz4. This patch adds two more: lz4hc and 842.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked Kconfig help text slightly]
Signed-off-by: Kees Cook <keescook@chromium.org>
All of the conflicts were cases of overlapping changes.
In net/core/devlink.c, we have to take care that the
resource size_params has become a struct member rather
than a pointer to such an object.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull btrfs fixes from David Sterba:
- when NR_CPUS is large, a SRCU structure can significantly inflate the
size of the main filesystem structure, to the point that it cannot be
allocated by kmalloc, so the kvzalloc fallback is used
- improved error handling
- fix endianness when printing some filesystem attributes via sysfs;
this could happen when a filesystem is moved between hosts of
different endianness
- send fixes: the NO_HOLE mode should not send a write operation for a
file hole
- fix log replay for special files followed by file hardlinks
- fix log replay failure after unlink and link combination
- fix max chunk size calculation for DUP allocation
* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log replay failure after unlink and link combination
Btrfs: fix log replay failure after linking special file and fsync
Btrfs: send, fix issuing write op when processing hole in no data mode
btrfs: use proper endianness accessors for super_copy
btrfs: alloc_chunk: fix DUP stripe size handling
btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
btrfs: handle failure of add_pending_csums
btrfs: use kvzalloc to allocate btrfs_fs_info
Pull ceph fixes from Ilya Dryomov:
"A cap handling fix from Zhi that ensures that metadata writeback isn't
delayed and three error path memory leak fixups from Chengguang"
* tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client:
ceph: fix potential memory leak in init_caches()
ceph: fix dentry leak when failing to init debugfs
libceph, ceph: avoid memory leak when specifying same option several times
ceph: flush dirty caps of unlinked inode ASAP
Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This is a little larger than
usual at this time, but that's mainly because I was out on vacation
last week. Nothing in here is major in any way, it's just two weeks of
fixes. This contains:
- NVMe pull from Keith, with a set of fixes from the usual suspects.
- mq-deadline zone unlock fix from Damien, fixing an issue with the
SMR zone locking added for 4.16.
- two bcache fixes sent in by Michael, with changes from Coly and
Tang.
- comment typo fix from Eric for blktrace.
- return-value error handling fix for nbd, from Gustavo.
- fix a direct-io case where we don't defer to a completion handler,
making us sleep from IRQ device completion. From Jan.
- a small series from Jan fixing up holes around handling of bdev
references.
- small set of regression fixes from Jiufei, mostly fixing problems
around the gendisk pointer -> partition index change.
- regression fix from Ming, fixing a boundary issue with the discard
page cache invalidation.
- two-patch series from Ming, fixing both a core blk-mq-sched and
kyber issue around token freeing on a requeue condition"
* tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
block: fix a typo
block: display the correct diskname for bio
block: fix the count of PGPGOUT for WRITE_SAME
mq-deadline: Make sure to always unlock zones
nvmet: fix PSDT field check in command format
nvme-multipath: fix sysfs dangerously created links
nbd: fix return value in error handling path
bcache: fix kcrashes with fio in RAID5 backend dev
bcache: correct flash only vols (check all uuids)
blktrace_api.h: fix comment for struct blk_user_trace_setup
blockdev: Avoid two active bdev inodes for one device
genhd: Fix BUG in blkdev_open()
genhd: Fix use after free in __blkdev_get()
genhd: Add helper put_disk_and_module()
genhd: Rename get_disk() to get_disk_and_module()
genhd: Fix leaked module reference for NVME devices
direct-io: Fix sleep in atomic due to sync AIO
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
block: kyber: fix domain token leak during requeue
blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
...
When specifying iocharset multiple times in a mount, or once or
multiple times in a remount, the current option parsing may leave the
nls module with an inaccurate refcount. Also, in the failure cleanup of
option parsing, the condition for calling unload_nls() is not
sufficient.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Change the return code from -EINVAL to -ENOMEM when memory allocation
fails in fill_super(); also delete the redundant initial assignment of
the variable err.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If the logical volume integrity descriptor contains a non-closed
integrity type when mounting the volume, there is a high chance that the
volume is not consistent (the device was detached before the filesystem
was unmounted). Don't touch the integrity type of such a volume so that
fsck can recognize it and check the filesystem.
Reported-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
so that we don't block io_submit callers.
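The change is roughly of this shape (sketch):

    if (flags & IOMAP_NOWAIT) {
            /* io_submit callers must not block on the ilock. */
            if (!xfs_ilock_nowait(ip, lockmode))
                    return -EAGAIN;
    } else {
            xfs_ilock(ip, lockmode);
    }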
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no reason to take the ilock exclusively at the start of
xfs_file_iomap_begin for direct I/O, given that it will be demoted
just before calling xfs_iomap_write_direct anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The iomap zeroing interface is smart enough to skip zeroing holes or
unwritten extents. Don't subvert this logic for reflink files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The cache destroy operation for ceph_file_cachep is missing when
fscache registration fails.
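In sketch form, the error path gains the previously missing call:

    /* After ceph_fscache_register() fails, also tear down the file
     * cache created just above. */
    kmem_cache_destroy(ceph_file_cachep);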
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/foo
$ ln /mnt/testdir/foo /mnt/testdir/bar
$ sync
$ unlink /mnt/testdir/bar
$ touch /mnt/testdir/bar
$ xfs_io -c "fsync" /mnt/testdir/bar
<power failure>
$ mount /dev/sdb /mnt
mount: mount(2) failed: /mnt: No such file or directory
When replaying the log, for that example, we also see the following in
dmesg/syslog:
[71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
[71813.674204] ------------[ cut here ]------------
[71813.675694] BTRFS: Transaction aborted (error -2)
[71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
[71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
[71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
[71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
[71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
[71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
[71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
[71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
[71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
[71813.679669] Call Trace:
[71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
[71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
[71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
[71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
[71813.679669] ? rcu_read_unlock+0x3a/0x57
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
[71813.679669] walk_log_tree+0xad/0x188 [btrfs]
[71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
[71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
[71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
[71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] do_mount+0x6e5/0x973
[71813.679669] ? memdup_user+0x3e/0x5c
[71813.679669] SyS_mount+0x72/0x98
[71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
[71813.679669] RIP: 0033:0x7f7cedf150ba
[71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
[71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
[71813.679669] ---[ end trace 83bd473fc5b4663b ]---
[71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
[71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
[71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
[71814.128078] BTRFS error (device dm-0): open_ctree failed
This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated with it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated with it, making the call to btrfs_unlink_inode()
fail with a -ENOENT error.
Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If, in the same transaction, we rename a special file (fifo,
character/block device or symbolic link), create a hard link for it with
its old name, and then sync the log, we will end up with a log that
cannot be replayed: when attempting to replay it, an EEXIST error is
returned and mounting the filesystem fails. Example scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ mkfifo /mnt/testdir/foo
# Make sure everything done so far is durably persisted.
$ sync
# Create some unrelated file and fsync it, this is just to create a log
# tree. The file must be in the same directory as our special file.
$ touch /mnt/testdir/f1
$ xfs_io -c "fsync" /mnt/testdir/f1
# Rename our special file and then create a hard link with its old name.
$ mv /mnt/testdir/foo /mnt/testdir/bar
$ ln /mnt/testdir/bar /mnt/testdir/foo
# Create some other unrelated file and fsync it, this is just to persist
# the log tree which was modified by the previous rename and link
# operations. Alternatively we could have modified file f1 and fsync it.
$ touch /mnt/f2
$ xfs_io -c "fsync" /mnt/f2
<power failure>
$ mount /dev/sdc /mnt
mount: mount /dev/sdc on /mnt failed: File exists
This happens because both the log tree and the subvolume's tree have
an entry in the directory "testdir" with the same name; that is, there
is one key (258 INODE_REF 257) in the subvolume tree and another one in
the log tree (where 258 is the inode number of our special file and 257
is the inode of directory "testdir"). Only the data of those two keys
differs: in the subvolume tree the index field of the inode reference
has a value of 3, while in the log tree it has a value of 5. Because the
same key exists in both trees but with a different index, the log replay
fails with an -EEXIST error when attempting to replay the inode
reference from the log tree.
Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.
A new generic test case for fstests was also submitted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info::super_copy is a byte copy of the on-disk structure and all
members must use the accessor macros/functions to obtain the right
value. This was missing in update_super_roots and in sysfs readers.
Moving between hosts of opposite endianness will report bogus numbers in
sysfs, and mount may fail as the root will not be restored correctly. If
the filesystem is always used on hosts of the same endianness, this will
not be a problem.
Fix this by using the btrfs_set_super...() functions to set
fs_info::super_copy values, and for the sysfs, use the cached
fs_info::nodesize/sectorsize values.
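For example (sketch), the direct stores in update_super_roots become
accessor calls:

    /* Direct assignment stores a CPU-endian value into the on-disk
     * (little-endian) structure: */
    super->root = root_item->bytenr;

    /* The setget accessor converts to disk byte order: */
    btrfs_set_super_root(super, root_item->bytenr);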
CC: stable@vger.kernel.org
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.
The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.
Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check that
stripe_size does not exceed max_chunk_size. Since stripe_size is twice
the intended amount during this check, the check will reduce stripe_size
to max_chunk_size whenever the correct stripe_size is more than half of
max_chunk_size.
The later unconditional division then ensures we can't actually allocate
more than half of max_chunk_size.
Fix this by moving the division by dev_stripes before the max chunk size
check, so stripe_size always contains the right value, instead of
duct-taping it back with a division further on.
Since dev_stripes is 1 in all cases other than DUP, this change only
affects DUP.
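A sketch of the reordering (simplified from the chunk allocator):

    /* Divide by dev_stripes (2 for DUP) before the clamp, so the
     * max_chunk_size check sees the real stripe size. */
    stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
    if (stripe_size * data_stripes > max_chunk_size)
            stripe_size = div_u64(max_chunk_size, data_stripes);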
Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.
The real problem was already introduced with the rest of the code in
73c5de0051.
The user-visible result, however, is that the max chunk size for DUP
suddenly doubles, while the code is actually acting according to its
limits again, like it was 5 years ago.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called btrfs_csum_file_blocks(),
which can fail, but it never bothered handling the failure. In an ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent without detecting it. On read this
could lead to the filesystem erroring out due to a crc mismatch. Fix
this by propagating failures from add_pending_csums and handling them.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.
There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.
As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.
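A sketch of the workaround in the mount path:

    /* kvzalloc() falls back to vmalloc() when a physically contiguous
     * allocation of this size is unavailable. */
    fs_info = kvzalloc(sizeof(struct btrfs_fs_info), GFP_KERNEL);
    if (!fs_info)
            return ERR_PTR(-ENOMEM);

    /* ... and the matching release becomes kvfree(fs_info). */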
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull xfs fixes from Darrick Wong:
- fix some compiler warnings
- fix block reservations for transactions created during log recovery
- fix resource leaks when respecifying mount options
* tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix potential memory leak in mount option parsing
xfs: reserve blocks for refcount / rmap log item recovery
xfs: use memset to initialize xfs_scrub_agfl_info
These pernet_operations perform pretty simple actions, such as variable
initialization on init and debug checks on exit, and they can obviously
be executed in parallel with any others:
vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations just create and destroy /proc entries
and net_generic()->cb_ident_idr IDR. So, we are able to mark
them async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if a notification event is lost because event allocation
fails with ENOMEM, we just silently continue (except for fanotify
permission events, where we deny the access). This is undesirable, as
userspace has no way of knowing whether the notifications it got are
complete or not.
Treat lost events due to ENOMEM the same way as lost events due to queue
overflow so that userspace knows something bad happened and it likely
needs to rescan the filesystem.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fanotify queues of unlimited length do not expect that events can be
lost. Since these queues are used for system auditing and other
security-related tasks, losing events can even have security
implications. Currently, since the allocation is small (32 bytes), it
cannot fail; however, once we start accounting events in memcgs,
allocations can start failing. So avoid losing events due to memory
allocation failures by making event allocation use __GFP_NOFAIL.
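In sketch form (using fanotify's event cache):

    /* For unlimited queues the allocation must not fail; with
     * __GFP_NOFAIL the small allocation retries instead of failing. */
    event = kmem_cache_alloc(fanotify_event_cachep,
                             GFP_KERNEL | __GFP_NOFAIL);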
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, when a UDF filesystem is recorded without uid / gid (the ids
are set to -1), we assign INVALID_[UG]ID to the vfs inode unless the
user passes the uid= and gid= mount options. In that case the
filesystem cannot be modified in any way, as VFS refuses to modify files
with invalid ids (even by root). This is confusing to users and not a
very useful default, since this recording mode is generally used for
removable media. Use overflow[ug]id instead so that at least root can
modify the filesystem.
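A sketch of the mapping (make_kuid and overflowuid are real kernel
symbols; the invalid-id test is illustrative):

    /* An on-disk id of -1 means "invalid"; map it to the overflow id
     * so that at least root can still modify the inode. */
    if (!uid_valid(inode->i_uid))
            inode->i_uid = make_kuid(&init_user_ns, overflowuid);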
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The current code relies on the fact that an invalid uid/gid as defined
by UDF 2.60 3.3.3.1 and 3.3.3.2 coincides with an invalid uid/gid as
used by the user namespaces implementation. Since this is only a lucky
coincidence, clean this up to avoid future surprises in case the user
namespaces implementation changes. This is also more robust in the
presence of uids / gids that are valid from the UDF point of view but do
not map into the current user namespace.
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, newly created files belong to the current user despite the
uid=<number> / gid=<number> mount options. This is confusing to users
(as the owner of the file will change after remount / eviction from
cache) and also inconsistent with e.g. FAT with the same mount options.
So apply uid=<number> and gid=<number> to newly created inodes as well
and, like FAT, disallow changing the owner of the file in this case.
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, uid=ignore and gid=ignore make no sense without uid=<number>
and gid=<number> respectively, as they result in all files having an
invalid uid / gid, which then doesn't allow even root to modify the
files and thus causes confusion. And since commit ca76d2d803 "UDF: fix
UID and GID mount option ignorance" (from over 10 years ago),
uid=<number> overrides all uids on disk just as uid=ignore does. So just
silently ignore the uid=ignore mount option.
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>