Commit Graph

56429 Commits

Author SHA1 Message Date
Carlos Maiolino
e157ebdcb3 Cleanup old XFS_BTREE_* traces
Remove unused legacy btree traces left over from the IRIX era.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Eric Sandeen
4603fa744c xfs: remove unused m_dmevmask from xfs_mount struct
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used.  Remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Dave Chinner
cb0a8d2302 xfs: fall back to vmalloc when allocating log vector buffers
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.

To avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
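
A minimal sketch of the fallback pattern described above, using the generic
kvmalloc()/kvfree() helpers rather than XFS's internal kmem_alloc_large();
the function names and GFP flags are illustrative, not the actual XFS code:

  #include <linux/mm.h>
  #include <linux/slab.h>

  /* Try a physically contiguous allocation first; if that cannot be
   * satisfied, kvmalloc() falls back to vmalloc() instead of stalling
   * in reclaim while hunting for large contiguous regions. */
  static void *alloc_shadow_lv_buffer(size_t size)
  {
          return kvmalloc(size, GFP_KERNEL);
  }

  static void free_shadow_lv_buffer(void *buf)
  {
          kvfree(buf);    /* handles both kmalloc and vmalloc memory */
  }
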
Geliang Tang
cb3bee0369 pstore: Use crypto compress API
In the pstore compression part, we use the zlib/lzo/lz4/lz4hc/842
compression algorithm APIs to implement the pstore compression backends,
but there is a lot of repeated code in these implementations. This patch
uses the crypto compress API to simplify this code.

1) rewrite allocate_buf_for_compression, free_buf_for_compression,
pstore_compress, pstore_decompress functions using crypto compress API.
2) drop compress, decompress, allocate, free functions in pstore_zbackend,
and add zbufsize function to get each different compress buffer size.
3) use late_initcall to call ramoops_init later, to make sure the crypto
subsystem has already initialized.
4) use 'unsigned int' type instead of 'size_t' in pstore_compress,
pstore_decompress functions' length arguments.
5) rename 'zlib' to 'deflate' to follow the crypto API's name convention.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked error messages on allocation failures and Kconfig help]
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-09 14:16:29 -08:00
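
A minimal sketch of the synchronous crypto compress API the commit above
moves pstore onto; buffer sizing and error handling are simplified and the
function name is made up for the example:

  #include <linux/crypto.h>
  #include <linux/err.h>

  static int example_deflate(const u8 *in, unsigned int inlen,
                             u8 *out, unsigned int *outlen)
  {
          struct crypto_comp *tfm;
          int ret;

          /* "deflate" is the crypto API name for what pstore called "zlib" */
          tfm = crypto_alloc_comp("deflate", 0, 0);
          if (IS_ERR(tfm))
                  return PTR_ERR(tfm);

          ret = crypto_comp_compress(tfm, in, inlen, out, outlen);
          crypto_free_comp(tfm);
          return ret;
  }
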
Linus Torvalds
719ea86151 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a corner case for NFS exporting (introduced in this cycle)
  as well as fixing miscellaneous bugs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update Kconfig texts
  ovl: redirect_dir=nofollow should not follow redirect for opaque lower
  ovl: fix ptr_ret.cocci warnings
  ovl: check ERR_PTR() return value from ovl_lookup_real()
  ovl: check lower ancestry on encode of lower dir file handle
  ovl: hash non-dir by lower inode for fsnotify
2018-03-09 09:46:14 -08:00
Linus Torvalds
2d9b1d69c3 Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:

 - Fix some iomap locking problems

 - Don't allocate cow blocks when we're zeroing file data

* tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't block on the ilock for RWF_NOWAIT
  xfs: don't start out with the exclusive ilock for direct I/O
  xfs: don't allocate COW blocks for zeroing holes or unwritten extents
2018-03-09 09:37:29 -08:00
Kyle Spiers
a9cee1764b reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
Remove a Variable Length Array from fs/reiserfs/reiserfs.h. EMPTY_DIR_SIZE
is used as an array size, and since it uses strlen() it cannot be
evaluated at compile time. Change its definition to use sizeof() to
force evaluation of the array length at compile time.

Signed-off-by: Kyle Spiers <kyle@spiers.me>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-09 13:04:20 +01:00
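
A small, self-contained illustration of the VLA issue (the macro names are
made up; the real definition is EMPTY_DIR_SIZE in fs/reiserfs/reiserfs.h):
a strlen()-based size is not a compile-time constant, so an array sized
with it becomes a VLA, while sizeof("..") - 1 is a constant expression:

  #include <string.h>

  #define NAME_LEN_RUNTIME  strlen("..")        /* evaluated at run time */
  #define NAME_LEN_CONST    (sizeof("..") - 1)  /* compile-time constant */

  void example(void)
  {
          char fixed[NAME_LEN_CONST + 1];    /* ordinary fixed-size array */
          /* char vla[NAME_LEN_RUNTIME + 1];    would be a VLA */
          fixed[0] = '\0';
  }
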
Trond Myklebust
c4f24df942 NFS: Fix unstable write completion
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.

In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:32 -05:00
Trond Myklebust
9c6376ebdd pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
Ensure that we hold a reference to the layout header when processing
the pNFS return-on-close so that the refcount value does not inadvertently
go to zero.

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.10+
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
2018-03-08 12:56:31 -05:00
Trond Myklebust
d9ee65539d NFS: Fix an incorrect type in struct nfs_direct_req
The start offset needs to be of type loff_t.

Fixes: 5fadeb47dc ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:31 -05:00
Andreas Gruenbacher
d39d18e0ef gfs2: Improve gfs2_block_map comment
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Bob Peterson
b9e03f1861 GFS2: Only set PageChecked for jdata pages
Before this patch, GFS2 was setting the PageChecked flag for ordered
write pages. This is unnecessary. The ext3 file system only does it
for jdata, and it's only used in jdata circumstances. It only muddies
the already murky waters of writing pages in the aops.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Bob Peterson
9bc980cdb9 GFS2: Make function gfs2_remove_from_ail static
Function gfs2_remove_from_ail is only ever used from log.c, so there
is no reason to declare it extern. This patch removes the extern and
declares it static.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Andreas Gruenbacher
83998ccd9b gfs2: Dirty source inode during rename
Mark the source inode dirty during a rename instead of just updating the
underlying buffer head.  Otherwise, fsync may find the inode clean and
will then skip flushing the journal.  A subsequent power failure will
cause the rename to be lost.  This happens in command sequences like:

  xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo
  mv foo bar
  xfs_io -c 'fsync' bar
  # power failure

Fixes xfstests generic/322, generic/376.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Andreas Gruenbacher
174d1232eb gfs2: Fix fallocate chunk size
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly.  The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient.  This needs
fixing in two places:

In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit.  This fixes callers that rely on
ap->allowed to be set even when quotas are off.

In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Kees Cook
f2531f1976 pstore/ram: Do not use stack VLA for parity workspace
Instead of using a stack VLA for the parity workspace, preallocate a
memory region. The preallocation is done to keep from needing to perform
allocations during crash dump writing, etc. This also fixes a missed
release of librs on free.

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-07 12:47:06 -08:00
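
A sketch of the preallocation pattern the commit above applies; the
structure and field names are illustrative, not the actual ramoops code:

  #include <linux/slab.h>

  struct ecc_state {
          int ecc_size;          /* parity bytes per block */
          u16 *par;              /* preallocated parity workspace */
  };

  /* Allocate the workspace once at setup time so the crash-dump write
   * path never needs a stack VLA or a runtime allocation. */
  static int ecc_state_init(struct ecc_state *s)
  {
          s->par = kmalloc_array(s->ecc_size, sizeof(*s->par), GFP_KERNEL);
          return s->par ? 0 : -ENOMEM;
  }

  static void ecc_state_free(struct ecc_state *s)
  {
          kfree(s->par);
          s->par = NULL;
  }
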
Kees Cook
fe1d475888 pstore: Select compression at runtime
To allow for easier build test coverage and run-time testing, this allows
multiple compression algorithms to be built into pstore. Still only one
is supported to operate at a time (which can be selected at build time
or at boot time, similar to how LSMs are selected).

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-07 12:43:35 -08:00
Andreas Gruenbacher
3b5da96e45 gfs2: Fixes to "Implement iomap for block_map" (2)
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0.  This can
lead to arbitrary incorrect behavior including crashes or disk
corruption.  Revert the incorrect part of that commit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-07 11:40:38 -07:00
Miklos Szeredi
36cd95dfa1 ovl: update Kconfig texts
Add some hints about overlayfs kernel config options.

Enabling NFS export by default is especially recommended against, as it
incurs a performance penalty even if the filesystem is not actually
exported.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-03-07 11:47:15 +01:00
Kees Cook
555974068e pstore: Avoid size casts for 842 compression
Instead of casting, make sure we don't end up with giant values and just
perform regular assignments with unsigned int instead of re-cast size_t.

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-06 15:15:24 -08:00
Geliang Tang
239b716199 pstore: Add lz4hc and 842 compression support
Currently, pstore supports three compression algorithms: zlib,
lzo and lz4. This patch adds two more compression algorithms: lz4hc
and 842.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked Kconfig help text slightly]
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-06 15:06:11 -08:00
David S. Miller
0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to take care that the
resource size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Linus Torvalds
af8c081627 Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - when NR_CPUS is large, an SRCU structure can significantly inflate the
   size of the main filesystem structure, which then cannot be allocated
   by kmalloc, so the kvzalloc fallback is used

 - improved error handling

 - fix endianness when printing some filesystem attributes via sysfs;
   this could happen when a filesystem is moved between hosts of
   different endianness

 - send fixes: the NO_HOLE mode should not send a write operation for a
   file hole

 - fix log replay for special files followed by file hardlinks

 - fix log replay failure after unlink and link combination

 - fix max chunk size calculation for DUP allocation

* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log replay failure after unlink and link combination
  Btrfs: fix log replay failure after linking special file and fsync
  Btrfs: send, fix issuing write op when processing hole in no data mode
  btrfs: use proper endianness accessors for super_copy
  btrfs: alloc_chunk: fix DUP stripe size handling
  btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
  btrfs: handle failure of add_pending_csums
  btrfs: use kvzalloc to allocate btrfs_fs_info
2018-03-04 11:04:27 -08:00
Linus Torvalds
2833419a62 Merge tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "A cap handling fix from Zhi that ensures that metadata writeback isn't
  delayed and three error path memory leak fixups from Chengguang"

* tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix potential memory leak in init_caches()
  ceph: fix dentry leak when failing to init debugfs
  libceph, ceph: avoid memory leak when specifying same option several times
  ceph: flush dirty caps of unlinked inode ASAP
2018-03-02 10:05:10 -08:00
Linus Torvalds
fb6d47a592 Merge tag 'for-linus-20180302' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes for this series. This is a little larger than
  usual at this time, but that's mainly because I was out on vacation
  last week. Nothing in here is major in any way, it's just two weeks of
  fixes. This contains:

   - NVMe pull from Keith, with a set of fixes from the usual suspects.

   - mq-deadline zone unlock fix from Damien, fixing an issue with the
     SMR zone locking added for 4.16.

   - two bcache fixes sent in by Michael, with changes from Coly and
     Tang.

   - comment typo fix from Eric for blktrace.

   - return-value error handling fix for nbd, from Gustavo.

   - fix a direct-io case where we don't defer to a completion handler,
     making us sleep from IRQ device completion. From Jan.

   - a small series from Jan fixing up holes around handling of bdev
     references.

   - small set of regression fixes from Jiufei, mostly fixing problems
     around the gendisk pointer -> partition index change.

   - regression fix from Ming, fixing a boundary issue with the discard
     page cache invalidation.

   - two-patch series from Ming, fixing both a core blk-mq-sched and
     kyber issue around token freeing on a requeue condition"

* tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
  block: fix a typo
  block: display the correct diskname for bio
  block: fix the count of PGPGOUT for WRITE_SAME
  mq-deadline: Make sure to always unlock zones
  nvmet: fix PSDT field check in command format
  nvme-multipath: fix sysfs dangerously created links
  nbd: fix return value in error handling path
  bcache: fix kcrashes with fio in RAID5 backend dev
  bcache: correct flash only vols (check all uuids)
  blktrace_api.h: fix comment for struct blk_user_trace_setup
  blockdev: Avoid two active bdev inodes for one device
  genhd: Fix BUG in blkdev_open()
  genhd: Fix use after free in __blkdev_get()
  genhd: Add helper put_disk_and_module()
  genhd: Rename get_disk() to get_disk_and_module()
  genhd: Fix leaked module reference for NVME devices
  direct-io: Fix sleep in atomic due to sync AIO
  nvme-pci: Fix nvme queue cleanup if IRQ setup fails
  block: kyber: fix domain token leak during requeue
  blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
  ...
2018-03-02 09:35:36 -08:00
Chengguang Xu
785dffe1da udf: fix potential refcnt problem of nls module
When specifying iocharset multiple times in a mount, or once or multiple
times in a remount, the current option parsing may leave an inaccurate
refcount on the nls module.  Also, in the failure cleanup of option
parsing, the condition for calling unload_nls is not sufficient.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:23:12 +01:00
Chengguang Xu
d090990510 ext2: change return code to -ENOMEM when failing memory allocation
Change the return code from -EINVAL to -ENOMEM when memory allocation
fails in fill_super(), and delete the redundant initial assignment of
the variable err.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:23:11 +01:00
Jan Kara
b72e632c6c udf: Do not mark possibly inconsistent filesystems as closed
If the logical volume integrity descriptor contains a non-closed integrity
type when mounting the volume, chances are high that the volume is not
consistent (the device was detached before the filesystem was unmounted).
Don't touch the integrity type of such a volume so that fsck can recognize
it and check the filesystem.

Reported-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:22:57 +01:00
Christoph Hellwig
ff3d8b9c4c xfs: don't block on the ilock for RWF_NOWAIT
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
so that we don't block io_submit callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:45 -08:00
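
A condensed sketch of the non-blocking locking pattern the commit
describes for xfs_file_iomap_begin(); the local variable names follow the
XFS code, and the surrounding logic is omitted:

  if (flags & IOMAP_NOWAIT) {
          if (!xfs_ilock_nowait(ip, lockmode))
                  return -EAGAIN;        /* don't block io_submit() callers */
  } else {
          xfs_ilock(ip, lockmode);
  }
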
Christoph Hellwig
af5b5afe9a xfs: don't start out with the exclusive ilock for direct I/O
There is no reason to take the ilock exclusively at the start of
xfs_file_iomap_begin for direct I/O, given that it will be demoted
just before calling xfs_iomap_write_direct anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:12 -08:00
Christoph Hellwig
172ed391f6 xfs: don't allocate COW blocks for zeroing holes or unwritten extents
The iomap zeroing interface is smart enough to skip zeroing holes or
unwritten extents.  Don't subvert this logic for reflink files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:10:31 -08:00
Chengguang Xu
1c78924957 ceph: fix potential memory leak in init_caches()
The cache destroy operation for ceph_file_cachep is missing when
fscache registration fails.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-03-01 16:39:47 +01:00
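
A sketch of the shape of the fix, loosely following fs/ceph/super.c
(function and cache names are approximate): tear down ceph_file_cachep
when the subsequent fscache registration fails:

  #include <linux/slab.h>
  #include "super.h"      /* assumed: struct ceph_file_info, fscache hooks */

  static struct kmem_cache *ceph_file_cachep;

  static int example_init_caches(void)
  {
          int err;

          ceph_file_cachep = KMEM_CACHE(ceph_file_info, SLAB_MEM_SPREAD);
          if (!ceph_file_cachep)
                  return -ENOMEM;

          err = ceph_fscache_register();
          if (err) {
                  /* this destroy was the missing piece */
                  kmem_cache_destroy(ceph_file_cachep);
                  return err;
          }
          return 0;
  }
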
Filipe Manana
1f250e929a Btrfs: fix log replay failure after unlink and link combination
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/foo
  $ ln /mnt/testdir/foo /mnt/testdir/bar

  $ sync

  $ unlink /mnt/testdir/bar
  $ touch /mnt/testdir/bar
  $ xfs_io -c "fsync" /mnt/testdir/bar

  <power failure>

  $ mount /dev/sdb /mnt
  mount: mount(2) failed: /mnt: No such file or directory

When replaying the log, for that example, we also see the following in
dmesg/syslog:

  [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
  [71813.674204] ------------[ cut here ]------------
  [71813.675694] BTRFS: Transaction aborted (error -2)
  [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
  [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
  [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
  [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
  [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
  [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
  [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
  [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
  [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
  [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
  [71813.679669] Call Trace:
  [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
  [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
  [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
  [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
  [71813.679669]  ? rcu_read_unlock+0x3a/0x57
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
  [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
  [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
  [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
  [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
  [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  do_mount+0x6e5/0x973
  [71813.679669]  ? memdup_user+0x3e/0x5c
  [71813.679669]  SyS_mount+0x72/0x98
  [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
  [71813.679669] RIP: 0033:0x7f7cedf150ba
  [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
  [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
  [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
  [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
  [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
  [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
  [71814.128078] BTRFS error (device dm-0): open_ctree failed

This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.

Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:40 +01:00
Filipe Manana
9a6509c4da Btrfs: fix log replay failure after linking special file and fsync
If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it with its old name and
then sync the log, we end up with a log that cannot be replayed; when
attempting to replay it, an EEXIST error is returned and mounting the
filesystem fails. Example scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ mkfifo /mnt/testdir/foo
  # Make sure everything done so far is durably persisted.
  $ sync

  # Create some unrelated file and fsync it, this is just to create a log
  # tree. The file must be in the same directory as our special file.
  $ touch /mnt/testdir/f1
  $ xfs_io -c "fsync" /mnt/testdir/f1

  # Rename our special file and then create a hard link with its old name.
  $ mv /mnt/testdir/foo /mnt/testdir/bar
  $ ln /mnt/testdir/bar /mnt/testdir/foo

  # Create some other unrelated file and fsync it, this is just to persist
  # the log tree which was modified by the previous rename and link
  # operations. Alternatively we could have modified file f1 and fsync it.
  $ touch /mnt/f2
  $ xfs_io -c "fsync" /mnt/f2

  <power failure>

  $ mount /dev/sdc /mnt
  mount: mount /dev/sdc on /mnt failed: File exists

This happens because both the log tree and the subvolume's tree have an
entry in the directory "testdir" with the same name, that is, there is one
key (258 INODE_REF 257) in the subvolume tree and another one in the log
tree (where 258 is the inode number of our special file and 257 is the
inode of the directory "testdir"). Only the data of those two keys differs:
in the subvolume tree the index field of the inode reference has a value
of 3, while in the log tree it has a value of 5. Because the same key
exists in both trees but with a different index, the log replay fails with
an -EEXIST error when attempting to replay the inode reference from the
log tree.

Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.

A new generic test case for fstests was also submitted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:34 +01:00
Filipe Manana
d4dfc0f4d3 Btrfs: send, fix issuing write op when processing hole in no data mode
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.

Trivial reproducer:

  $ mkfs.btrfs -f -O no-holes /dev/sdc
  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdc /mnt/sdc
  $ mount /dev/sdd /mnt/sdd

  $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

  $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

  $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
  $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
       | btrfs receive -vv /mnt/sdd

Before this change the output of the second receive command is:

  receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
  utimes
  write foobar, offset 8192, len 8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

After this change it is:

  receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
  utimes
  update_extent foobar: offset=8192, len=8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:07 +01:00
Anand Jain
3c181c12c4 btrfs: use proper endianness accessors for super_copy
The fs_info::super_copy is a byte copy of the on-disk structure and all
members must use the accessor macros/functions to obtain the right
value.  This was missing in update_super_roots and in sysfs readers.

Moving between opposite endianness hosts will report bogus numbers in
sysfs, and mount may fail as the root will not be restored correctly. If
the filesystem is always used on hosts of the same endianness, this will
not be a problem.

Fix this by using the btrfs_set_super...() functions to set
fs_info::super_copy values, and for the sysfs, use the cached
fs_info::nodesize/sectorsize values.

CC: stable@vger.kernel.org
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:17:27 +01:00
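
A two-line sketch of the kind of change involved, in the spirit of
update_super_roots() (super and root_item stand for the usual local
variables; the exact hunk differs):

  /* sketch: go through the accessors on both sides instead of raw member
   * assignments, so the stored values end up little-endian on disk
   * regardless of host byte order */
  btrfs_set_super_root(super, btrfs_root_bytenr(root_item));
  btrfs_set_super_generation(super, btrfs_root_generation(root_item));
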
Hans van Kranenburg
92e222df7b btrfs: alloc_chunk: fix DUP stripe size handling
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.

The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.

Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since stripe_size
is twice the intended amount during this check, the check will reduce
stripe_size to max_chunk_size whenever the correct stripe_size is more
than half of max_chunk_size.

The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.

Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of adding a
duct-tape division further on to get it fixed again.

Since dev_stripes is 1 in all cases other than DUP, this change only
affects DUP.

Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.

The real problem was already introduced with the rest of the code in
73c5de0051.

The user-visible result, however, will be that the max chunk size for DUP
suddenly doubles, while the code is actually acting according to its
limits again, like it did 5 years ago.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:47 +01:00
Nikolay Borisov
765f3cebff btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:12 +01:00
Nikolay Borisov
ac01f26a27 btrfs: handle failure of add_pending_csums
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:00 +01:00
Jeff Mahoney
a8fd1f7174 btrfs: use kvzalloc to allocate btrfs_fs_info
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS.  On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.

There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.

As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:15:36 +01:00
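
A sketch of the workaround (error handling condensed; the matching free
side must use kvfree()):

  /* allocation side, e.g. in the mount path: */
  fs_info = kvzalloc(sizeof(struct btrfs_fs_info), GFP_KERNEL);
  if (!fs_info)
          return ERR_PTR(-ENOMEM);

  /* free side: kvfree() handles both kmalloc and vmalloc memory */
  kvfree(fs_info);
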
Linus Torvalds
c02be2334e Merge tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:

 - fix some compiler warnings

 - fix block reservations for transactions created during log recovery

 - fix resource leaks when respecifying mount options

* tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix potential memory leak in mount option parsing
  xfs: reserve blocks for refcount / rmap log item recovery
  xfs: use memset to initialize xfs_scrub_agfl_info
2018-02-28 11:40:51 -08:00
Kirill Tkhai
02df428ca2 net: Convert simple pernet_operations
These pernet_operations perform pretty simple actions,
like variable initialization on init, debug checks
on exit, and so on, and they can obviously be
executed in parallel with any others:

vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Kirill Tkhai
7300bd94e6 net: Convert nfs_net_ops
These pernet_operations just create and destroy /proc entries
and net_generic()->cb_ident_idr IDR. So, we are able to mark
them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Jan Kara
7b1f641776 fsnotify: Let userspace know about lost events due to ENOMEM
Currently, if a notification event is lost because its allocation fails
with ENOMEM, we just silently continue (except for fanotify permission
events, where we deny the access). This is undesirable as userspace has
no way of knowing whether the notifications it got are complete or not.
Treat events lost due to ENOMEM the same way as events lost due to queue
overflow so that userspace knows something bad happened and it likely
needs to rescan the filesystem.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara
1f5eaa9001 fanotify: Avoid lost events due to ENOMEM for unlimited queues
Fanotify queues of unlimited length do not expect that events can be lost.
Since these queues are used for system auditing and other security-related
tasks, losing events can even have security implications.
Currently the allocation is small (32 bytes) and cannot fail; however,
once we start accounting events in memcgs, allocations can start
failing. So avoid losing events due to a failure to allocate memory by
making event allocation use __GFP_NOFAIL.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
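
A sketch of the allocation change; the "unlimited queue" condition and the
cache name follow fs/notify/fanotify loosely and should be read as
illustrative:

  gfp_t gfp = GFP_KERNEL;

  /* unlimited queues (e.g. audit users) must not drop events */
  if (group->max_events == UINT_MAX)
          gfp |= __GFP_NOFAIL;

  event = kmem_cache_alloc(fanotify_event_cachep, gfp);
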
Jan Kara
f0c4a81711 udf: Remove never implemented mount options
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara
116e5258e4 udf: Provide saner default for invalid uid / gid
Currently, when a UDF filesystem is recorded without uid / gid (the ids are
set to -1), we assign INVALID_[UG]ID to the vfs inode unless the user
passes the uid= and gid= mount options. In that case the filesystem cannot
be modified in any way, as the VFS refuses to modify files with invalid
ids (even by root). This is confusing to users and not a very useful
default, since this recording mode is generally used for removable media.
Use overflow[ug]id instead so that at least root can modify the filesystem.

Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
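
A sketch of the resulting mapping for the uid case (gid is analogous; the
invalid-id test and the interaction with the uid= mount option are
simplified):

  #include <linux/fs.h>
  #include <linux/highuid.h>      /* overflowuid / overflowgid */
  #include <linux/uidgid.h>
  /* struct fileEntry comes from fs/udf/ecma_167.h */

  static void example_set_uid(struct inode *inode, struct fileEntry *fe)
  {
          uint32_t disk_uid = le32_to_cpu(fe->uid);

          if (disk_uid == (uint32_t)-1)   /* recorded as "invalid" on disk */
                  inode->i_uid = make_kuid(&init_user_ns, overflowuid);
          else
                  inode->i_uid = make_kuid(&init_user_ns, disk_uid);
  }
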
Jan Kara
0c9850f4d4 udf: Clean up handling of invalid uid/gid
Current code relies on the fact that invalid uid/gid as defined by UDF
2.60 3.3.3.1 and 3.3.3.2 coincides with invalid uid/gid as used by the
user namespaces implementation. Since this is only a lucky coincidence,
clean this up to avoid future surprises in case the user namespaces
implementation changes. This is also more robust in the presence of
uids / gids that are valid from the UDF point of view but do not map into
the current user namespace.

Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara
ecd10aa428 udf: Apply uid/gid mount options also to new inodes & chown
Currently newly created files belong to the current user despite the
uid=<number> / gid=<number> mount options. This is confusing to users
(as the owner of the file will change after a remount / eviction from the
cache) and also inconsistent with e.g. FAT with the same mount options. So
apply uid=<number> and gid=<number> also to newly created inodes and,
similarly to FAT, disallow changing the owner of the file in this case.

Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara
70260e4475 udf: Ignore [ug]id=ignore mount options
Currently uid=ignore and gid=ignore make no sense without uid=<number>
and gid=<number> respectively, as they result in all files having an
invalid uid / gid, which then doesn't allow even root to modify the files
and thus causes confusion. And since commit ca76d2d803 "UDF: fix UID and
GID mount option ignorance" (from over 10 years ago) uid=<number>
overrides all uids on disk just as uid=ignore does. So just silently
ignore the uid=ignore mount option.

Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00