Commit Graph

47524 Commits

Author SHA1 Message Date
Linus Torvalds
e5322c5406 Merge branch 'for-linus2' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Round 2 of this.  I cut back to the bare necessities, the patch is
  still larger than it usually would be at this time, due to the number
  of NVMe fixes in there.  This pull request contains:

   - The 4 core fixes from Ming, that fix both problems with exceeding
     the virtual boundary limit in case of merging, and the gap checking
     for cloned bio's.

   - NVMe fixes from Keith and Christoph:

        - Regression on larger user commands, causing problems with
          reading log pages (for instance). This touches both NVMe,
          and the block core since that is now generally utilized also
          for these types of commands.

        - Hot removal fixes.

        - User exploitable issue with passthrough IO commands, if !length
          is given, causing us to fault on writing to the zero
          page.

        - Fix for a hang under error conditions

   - And finally, the current series regression for umount with cgroup
     writeback, where the final flush would happen async and hence open
     up window after umount where the device wasn't consistent.  fsck
     right after umount would show this.  From Tejun"

* 'for-linus2' of git://git.kernel.dk/linux-block:
  block: support large requests in blk_rq_map_user_iov
  block: fix blk_rq_get_max_sectors for driver private requests
  nvme: fix max_segments integer truncation
  nvme: set queue limits for the admin queue
  writeback: flush inode cgroup wb switches instead of pinning super_block
  NVMe: Fix 0-length integrity payload
  NVMe: Don't allow unsupported flags
  NVMe: Move error handling to failed reset handler
  NVMe: Simplify device reset failure
  NVMe: Fix namespace removal deadlock
  NVMe: Use IDA for namespace disk naming
  NVMe: Don't unmap controller registers on reset
  block: merge: get the 1st and last bvec via helpers
  block: get the 1st and last bvec via helpers
  block: check virt boundary in bio_will_gap()
  block: bio: introduce helpers to get the 1st and last bvec
2016-03-04 18:17:17 -08:00
Linus Torvalds
c51797d25d Merge tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd
Pull jffs2 fixes from David Woodhouse:
 "This contains two important JFFS2 fixes marked for stable:

   - a lock ordering problem between the page lock and the internal
     f->sem mutex, which was causing occasional deadlocks in garbage
     collection

   - a scan failure causing moved directories to sometimes end up
     appearing to have hard links.

  There are also a couple of trivial MAINTAINERS file updates"

* tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd:
  MAINTAINERS: add maintainer entry for FREESCALE GPMI NAND driver
  Fix directory hardlinks from deleted directories
  jffs2: Fix page lock / f->sem deadlock
  Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
  MAINTAINERS: update Han's email
2016-03-04 17:36:46 -08:00
Linus Torvalds
2cdcb2b5b5 Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "Filipe nailed down a problem where tree log replay would do some work
  that orphan code wasn't expecting to be done yet, leading to BUG_ON"

* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix loading of orphan roots leading to BUG_ON
2016-03-04 17:31:32 -08:00
Yan, Zheng
5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Christoph Hellwig
c43c83a294 direct-io: only use block polling if explicitly requested
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Christoph Hellwig
97be7ebe53 vfs: add the RWF_HIPRI flag for preadv2/pwritev2
This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware.  The flag is purely
advisory and can be ignored if not supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Milosz Tanski
f17d8b3545 vfs: vfs: Define new syscalls preadv2,pwritev2
New syscalls that take an flag argument.   No flags are added yet in this
patch.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Christoph Hellwig
793b80ef14 vfs: pass a flags argument to vfs_readv/vfs_writev
This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations.  For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Filipe Manana
909c3a22da Btrfs: fix loading of orphan roots leading to BUG_ON
When looking for orphan roots during mount we can end up hitting a
BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
replayed and qgroups are enabled. This is because after a log tree is
replayed, a transaction commit is made, which triggers qgroup extent
accounting which in turn does backref walking which ends up reading and
inserting all roots in the radix tree fs_info->fs_root_radix, including
orphan roots (deleted snapshots). So after the log tree is replayed, when
finding orphan roots we hit the BUG_ON with the following trace:

[118209.182438] ------------[ cut here ]------------
[118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
[118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W       4.5.0-rc5-btrfs-next-24+ #1
[118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
[118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
[118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
[118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
[118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
[118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
[118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
[118209.186318] Stack:
[118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
[118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
[118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
[118209.186318] Call Trace:
[118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
[118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
[118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
[118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
[118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
[118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
[118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318]  RSP <ffff8800af34faa8>
[118209.230735] ---[ end trace 83938f987d85d477 ]---

So fix this by not treating the error -EEXIST, returned when attempting
to insert a root already inserted by the backref walking code, as an error.

The following test case for xfstests reproduces the bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  _run_btrfs_util_prog quota enable $SCRATCH_MNT

  # Create 2 directories with one file in one of them.
  # We use these just to trigger a transaction commit later, moving the file from
  # directory a to directory b and doing an fsync against directory a.
  mkdir $SCRATCH_MNT/a
  mkdir $SCRATCH_MNT/b
  touch $SCRATCH_MNT/a/f
  sync

  # Create our test file with 2 4K extents.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Create a snapshot and delete it. This doesn't really delete the snapshot
  # immediately, just makes it inaccessible and invisible to user space, the
  # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
  # which is woke up at the next transaction commit.
  # A root orphan item is inserted into the tree of tree roots, so that if a
  # power failure happens before the dedicated kernel thread does the snapshot
  # deletion, the next time the filesystem is mounted it resumes the snapshot
  # deletion.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap

  # Now overwrite half of the extents we wrote before. Because we made a snapshpot
  # before, which isn't really deleted yet (since no transaction commit happened
  # after we did the snapshot delete request), the non overwritten extents get
  # referenced twice, once by the default subvolume and once by the snapshot.
  $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Now move file f from directory a to directory b and fsync directory a.
  # The fsync on the directory a triggers a transaction commit (because a file
  # was moved from it to another directory) and the file fsync leaves a log tree
  # with file extent items to replay.
  mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  # Now simulate a power failure and mount the filesystem to replay the log tree.
  # After the log tree was replayed, we used to hit a BUG_ON() when processing
  # the root orphan item for the deleted snapshot. This is because when processing
  # an orphan root the code expected to be the first code inserting the root into
  # the fs_info->fs_root_radix radix tree, while in reallity it was the second
  # caller attempting to do it - the first caller was the transaction commit that
  # took place after replaying the log tree, when updating the qgroup counters.
  _flakey_drop_and_remount

  echo "File digest before after failure:"
  # Must match what he got before the power failure.
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  _unmount_flakey
  status=0
  exit

Fixes: 2d9e977610 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org  # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-03 15:28:59 -08:00
Shaohua Li
3684aa7099 block-dev: enable writeback cgroup support
block_dev's .writepages/.writepage already handles
wbc_init_bio/wbc_account_io. We only set the SB_I_CGROUPWB bit to
suppport writeback cgroup support.

Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:50:53 -07:00
Tejun Heo
a1a0e23e49 writeback: flush inode cgroup wb switches instead of pinning super_block
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Mike Marshall
9d9e7ba9ee Orangefs: improve gossip statements
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-03-03 13:46:48 -05:00
Konstantin Khlebnikov
b81de061fa ovl: copy new uid/gid into overlayfs runtime inode
Overlayfs must update uid/gid after chown, otherwise functions
like inode_owner_or_capable() will check user against stale uid.
Catched by xfstests generic/087, it chowns file and calls utimes.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:46 +01:00
Konstantin Khlebnikov
45d1173896 ovl: ignore lower entries when checking purity of non-directory entries
After rename file dentry still holds reference to lower dentry from
previous location. This doesn't matter for data access because data comes
from upper dentry. But this stale lower dentry taints dentry at new
location and turns it into non-pure upper. Such file leaves visible
whiteout entry after remove in directory which shouldn't have whiteouts at
all.

Overlayfs already tracks pureness of file location in oe->opaque.  This
patch just uses that for detecting actual path type.

Comment from Vivek Goyal's patch:

Here are the details of the problem. Do following.

$ mkdir upper lower work merged upper/dir/
$ touch lower/test
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=
work merged
$ mv merged/test merged/dir/
$ rm merged/dir/test
$ ls -l merged/dir/
/usr/bin/ls: cannot access merged/dir/test: No such file or directory
total 0
c????????? ? ? ? ?            ? test

Basic problem seems to be that once a file has been unlinked, a whiteout
has been left behind which was not needed and hence it becomes visible.

Whiteout is visible because parent dir is of not type MERGE, hence
od->is_real is set during ovl_dir_open(). And that means ovl_iterate()
passes on iterate handling directly to underlying fs. Underlying fs does
not know/filter whiteouts so it becomes visible to user.

Why did we leave a whiteout to begin with when we should not have.
ovl_do_remove() checks for OVL_TYPE_PURE_UPPER() and does not leave
whiteout if file is pure upper. In this case file is not found to be pure
upper hence whiteout is left.

So why file was not PURE_UPPER in this case? I think because dentry is
still carrying some leftover state which was valid before rename. For
example, od->numlower was set to 1 as it was a lower file. After rename,
this state is not valid anymore as there is no such file in lower.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Viktor Stanchev <me@viktorstanchev.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109611
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Rui Wang
ce9113bbcb ovl: fix getcwd() failure after unsuccessful rmdir
ovl_remove_upper() should do d_drop() only after it successfully
removes the dir, otherwise a subsequent getcwd() system call will
fail, breaking userspace programs.

This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491

Signed-off-by: Rui Wang <rui.y.wang@intel.com>
Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Konstantin Khlebnikov
b5891cfab0 ovl: fix working on distributed fs as lower layer
This adds missing .d_select_inode into alternative dentry_operations.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Fixes: 7c03b5d45b ("ovl: allow distributed fs as lower layer")
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Tested-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # 4.2+
2016-03-03 17:17:45 +01:00
Nikolay Borisov
ab73ef4639 quota: Fix possible GPF due to uninitialised pointers
When dqget() in __dquot_initialize() fails e.g. due to IO error,
__dquot_initialize() will pass an array of uninitialized pointers to
dqput_all() and thus can lead to deference of random data. Fix the
problem by properly initializing the array.

CC: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-03-03 11:01:58 +01:00
J. Bruce Fields
0f1738a10b nfsd4: resfh unused in nfsd4_secinfo
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-02 15:26:36 -08:00
Yang Shi
59692b7c71 f2fs: mutex can't be used by down_write_nest_lock()
f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check
a mutex, but down_write_nest_lock() is designed for two rw_sem accoring to the
comment in include/linux/rwsem.h. And, other than f2fs, it is just called in
mm/mmap.c with two rwsem.

So, it looks it is used wrongly by f2fs. And, it causes the below compile
warning on -rt kernel too.

In file included from fs/f2fs/xattr.c:25:0:
fs/f2fs/f2fs.h: In function 'f2fs_lock_all':
fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types]
  f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex);
                                  ^
fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write'
 #define f2fs_down_write(x, y) down_write_nest_lock(x, y)
                                                       ^
In file included from include/linux/rwsem.h:22:0,
                 from fs/f2fs/xattr.c:21:
include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *'
 static inline void down_write_nest_lock(struct rw_semaphore *sem,

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-03-02 10:22:14 -08:00
Liu Xue
8c2b1435b9 f2fs: recovery missing dot dentries in root directory
If f2fs was corrupted with missing dot dentries in root dirctory,
it needs to recover them after fsck.f2fs set F2FS_INLINE_DOTS flag
in directory inode when fsck.f2fs detects missing dot dentries.

Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-03-02 09:25:33 -08:00
Linus Torvalds
12f1d7e493 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Various small CIFS/SMB3 fixes for stable:

  Fixes address oops that can occur when accessing Macs with SMB3, and
  another problem found to Samba when read responses queued (e.g. with
  gluster under Samba)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix duplicate line introduced by clone_file_range patch
  Fix cifs_uniqueid_to_ino_t() function for s390x
  CIFS: Fix SMB2+ interim response processing for read requests
  cifs: fix out-of-bounds access in lease parsing
2016-03-02 09:15:21 -08:00
Linus Torvalds
39680f50ae userfaultfd: don't block on the last VM updates at exit time
The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).

However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.

To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier).  But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-02 09:03:18 -08:00
Greg Kroah-Hartman
523462df28 Merge 4.5-rc6 into char-misc-next
We want the fixes in here, and others are sending us pull requests based
on this kernel tree.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-01 16:38:16 -08:00
Linus Torvalds
f691b77b1f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull d_inode/d_flags race fix from Al Viro.

I love this fix.  Not only does it fix the race in the dentry type
handling, it entirely gets rid of the nasty and subtle memory ordering
rules for d_type and d_inode, and replaces them with the basic dentry
locking rules (sequence numbers under RCU, d_lock elsewhere).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use ->d_seq to get coherency between ->d_inode and ->d_flags
2016-03-01 15:30:45 -08:00
Christoph Hellwig
a7e5d03ba8 xfs: remove xfs_trans_get_block_res
Just use the t_blk_res field directly instead of obsfucating the reference
by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:21 +11:00
Eric Sandeen
12c3f05c7b xfs: fix up inode32/64 (re)mount handling
inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.

The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist.  Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.

Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size.  Today that does not happen.

This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:58:09 +11:00
Colin Ian King
5d518bd6ce xfs: fix format specifier , should be %llx and not %llu
busyp->bno is printed with a %llu format specifier when the
intention is to print a hexadecimal value. Trivial fix to
use %llx instead.  Found with smatch static analysis:

fs/xfs/xfs_discard.c:229 xfs_discard_extents() warn: '0x'
  prefix is confusing together with '%llu' specifier

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:57:04 +11:00
Eric Sandeen
a08ee40a79 xfs: sanitize remount options
Perform basic sanitization of remount options by
passing the option string and a dummy mount structure
through xfs_parseargs and returning the result.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:56:31 +11:00
Eric Sandeen
2e74af0e11 xfs: convert mount option parsing to tokens
This should be a no-op change, just switch to token parsing
like every other respectable filesystem does.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:55:38 +11:00
Mateusz Guzik
2e83b79b2d xfs: fix two memory leaks in xfs_attr_list.c error paths
This plugs 2 trivial leaks in xfs_attr_shortform_list and
xfs_attr3_leaf_list_int.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-02 09:51:09 +11:00
Chuck Lever
4500632f60 nfsd: Lower NFSv4.1 callback message size limit
The maximum size of a backchannel message on RPC-over-RDMA depends
on the connection's inline threshold. Today that threshold is
typically 1024 bytes, making the maximum message size 996 bytes.

The Linux server's CREATE_SESSION operation checks that the size
of callback Calls can be as large as 1044 bytes, to accommodate
RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the
true message size maximum of 996 bytes.

But the server's backchannel currently does not support RPCSEC_GSS.
The actual maximum size it needs is much smaller. It is safe to
reduce the limit to enable NFSv4.1 on RDMA backchannel operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:35 -08:00
Chuck Lever
4ce85c8cf8 nfsd: Update NFS server comments related to RDMA support
The server does indeed now support NFSv4.1 on RDMA transports. It
does not support shifting an RDMA-capable TCP transport (such as
iWARP) to RDMA mode.

Reported-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:32 -08:00
Kinglong Mee
8edf4b0288 nfsd: Fix a memory leak when meeting unsupported state_protect_how4
Remember free allocated client when meeting unsupported state protect how.

Fixes: 50c7b948ad ("nfsd: minor consolidation of mach_cred handling code")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:06:31 -08:00
J. Bruce Fields
4aed9c46af nfsd4: fix bad bounds checking
A number of spots in the xdr decoding follow a pattern like

	n = be32_to_cpup(p++);
	READ_BUF(n + 4);

where n is a u32.  The only bounds checking is done in READ_BUF itself,
but since it's checking (n + 4), it won't catch cases where n is very
large, (u32)(-4) or higher.  I'm not sure exactly what the consequences
are, but we've seen crashes soon after.

Instead, just break these up into two READ_BUF()s.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01 13:02:57 -08:00
Filipe Manana
5e33a2bd7c Btrfs: do not collect ordered extents when logging that inode exists
When logging that an inode exists, for example as part of a directory
fsync operation, we were collecting any ordered extents for the inode but
we ended up doing nothing with them except tagging them as processed, by
setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
collecting and processing them. This created a time window where a second
fsync against the inode, using the fast path, ended up not logging the
checksums for the new extents but it logged the extents since they were
part of the list of modified extents. This happened because the ordered
extents were not collected and checksums were not yet added to the csum
tree - the ordered extents have not gone through btrfs_finish_ordered_io()
yet (which is where we add them to the csum tree by calling
inode.c:add_pending_csums()).

So fix this by not collecting an inode's ordered extents if we are logging
it with the LOG_INODE_EXISTS mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:47 -08:00
Filipe Manana
affc0ff902 Btrfs: fix race when checking if we can skip fsync'ing an inode
If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
returns false, it's possible that we had an ordered extent in progress
(btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
last_trans field was not greater than the id of the last committed
transaction, but shortly after, before we checked if there were any
ongoing ordered extents, the ordered extent had just completed and
removed itself from the inode's ordered tree, in which case we end up not
logging the inode, losing some data if a power failure or crash happens
after the fsync handler returns and before the transaction is committed.

Fix this by checking first if there are any ongoing ordered extents
before comparing the inode's last_trans with the id of the last committed
transaction - when it completes, an ordered extent always updates the
inode's last_trans before it removes itself from the inode's ordered
tree (at btrfs_finish_ordered_io()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:44 -08:00
Filipe Manana
daac7ba61a Btrfs: fix listxattrs not listing all xattrs packed in the same item
In the listxattrs handler, we were not listing all the xattrs that are
packed in the same btree item, which happens when multiple xattrs have
a name that when crc32c hashed produce the same checksum value.

Fix this by processing them all.

The following test case for xfstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/attr

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _require_scratch
  _require_attrs

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a few xattrs. The first 3 xattrs have a name
  # that when given as input to a crc32c function result in the same checksum.
  # This made btrfs list only one of the xattrs through listxattrs system call
  # (because it packs xattrs with the same name checksum into the same btree
  # item).
  touch $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
  $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile

  # Now call getfattr with --dump, which calls the listxattrs system call.
  # It should list all the xattrs we have set before.
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:41 -08:00
Filipe Manana
ade770294d Btrfs: fix deadlock between direct IO reads and buffered writes
While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:

[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771]     0 15282      2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
[11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
[11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
[11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
[11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]

This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:

     [  extent A  ]  [  extent B  ]  [ ....
     0K              4K              8K

     CPU 1                                               CPU 2                             CPU 3

DIO read against range
 [0K, 8K[ starts

btrfs_direct_IO()
  --> calls btrfs_get_blocks_direct()
      which finds the extent map for the
      extent A and leaves the range
      [0K, 4K[ locked in the inode's
      io tree

                                                   buffered write against
                                                   range [4K, 8K[ starts

                                                   __btrfs_buffered_write()
                                                     --> dirties page at 4K

                                                                                     a user space
                                                                                     task calls sync
                                                                                     for e.g or
                                                                                     writepages() is
                                                                                     invoked by mm

                                                                                     writepages()
                                                                                       run_delalloc_range()
                                                                                         cow_file_range()
                                                                                           --> ordered extent X
                                                                                               for the buffered
                                                                                               write is created
                                                                                               and
                                                                                               writeback starts

  --> calls btrfs_get_blocks_direct()
      again, without submitting first
      a bio for reading extent A, and
      finds the extent map for extent B

  --> calls lock_extent_direct()

      --> locks range [4K, 8K[
      --> finds ordered extent X
          covering range [4K, 8K[
      --> unlocks range [4K, 8K[

                                                  buffered write against
                                                  range [0K, 8K[ starts

                                                  __btrfs_buffered_write()
                                                    prepare_pages()
                                                      --> locks pages with
                                                          offsets 0 and 4K
                                                    lock_and_cleanup_extent_if_need()
                                                      --> blocks attempting to
                                                          lock range [0K, 8K[ in
                                                          the inode's io tree,
                                                          because the range [0, 4K[
                                                          is already locked by the
                                                          direct IO task at CPU 1

      --> calls
          btrfs_start_ordered_extent(oe X)

          btrfs_start_ordered_extent(oe X)

            --> At this point writeback for ordered
                extent X has not finished yet

            filemap_fdatawrite_range()
              btrfs_writepages()
                extent_writepages()
                  extent_write_cache_pages()
                    --> finds page with offset 0
                        with the writeback tag
                        (and not dirty)
                    --> tries to lock it
                         --> deadlock, task at CPU 2
                             has the page locked and
                             is blocked on the io range
                             [0, 4K[ that was locked
                             earlier by this task

So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:37 -08:00
Filipe Manana
f4dfe68710 Btrfs: fix extent_same allowing destination offset beyond i_size
When using the same file as the source and destination for a dedup
(extent_same ioctl) operation we were allowing it to dedup to a
destination offset beyond the file's size, which doesn't make sense and
it's not allowed for the case where the source and destination files are
not the same file. This made de deduplication operation successful only
when the source range corresponded to a hole, a prealloc extent or an
extent with all bytes having a value of 0x00. This was also leaving a
file hole (between i_size and destination offset) without the
corresponding file extent items, which can be reproduced with the
following steps for example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt/sdi

  $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
  wrote 404990/404990 bytes at offset 304457
  395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)

  $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
  Deduping 2 total files
  (28672, 24576): /mnt/sdi/foobar
  (929792, 24576): /mnt/sdi/foobar
  1 files asked to be deduped
  i: 0, status: 0, bytes_deduped: 24576
  24576 total bytes deduped in this operation

  $ umount /mnt/sdi
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 257 errors 100, file extent discount
  Found file extent holes:
          start: 712704, len: 217088
  found 540673 bytes used err is 1
  total csum bytes: 400
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 123675
  file data blocks allocated: 671744
    referenced 671744
  btrfs-progs v4.2.3

So fix this by not allowing the destination to go beyond the file's size,
just as we do for the same where the source and destination files are not
the same.

A test for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:33 -08:00
Filipe Manana
2be63d5ce9 Btrfs: fix file loss on log replay after renaming a file and fsync
We have two cases where we end up deleting a file at log replay time
when we should not. For this to happen the file must have been renamed
and a directory inode must have been fsynced/logged.

Two examples that exercise these two cases are listed below.

  Case 1)

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir -p /mnt/a/b
  $ mkdir /mnt/c
  $ touch /mnt/a/b/foo
  $ sync
  $ mv /mnt/a/b/foo /mnt/c/
  # Create file bar just to make sure the fsync on directory a/ does
  # something and it's not a no-op.
  $ touch /mnt/a/bar
  $ xfs_io -c "fsync" /mnt/a
  < power fail / crash >

  The next time the filesystem is mounted, the log replay procedure
  deletes file foo.

  Case 2)

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/foo
  $ ln /mnt/a/foo /mnt/b/foo_link
  $ touch /mnt/b/bar
  $ sync
  $ unlink /mnt/b/foo_link
  $ mv /mnt/b/bar /mnt/c/
  $ xfs_io -c "fsync" /mnt/a/foo
  < power fail / crash >

  The next time the filesystem is mounted, the log replay procedure
  deletes file bar.

The reason why the files are deleted is because when we log inodes
other then the fsync target inode, we ignore their last_unlink_trans
value and leave the log without enough information to later replay the
rename operations. So we need to look at the last_unlink_trans values
and fallback to a transaction commit if they are greater than the
id of the last committed transaction.

So fix this by looking at the last_unlink_trans values and fallback to
transaction commits when needed. Also, when logging other inodes (for
case 1 we logged descendants of the fsync target inode while for case 2
we logged ascendants) we need to care about concurrent tasks updating
the last_unlink_trans of inodes we are logging (which was already an
existing problem in check_parent_dirs_for_sync()). Since we can not
acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
deadlocks with other concurrent operations that acquire the i_mutex of
2 inodes (other fsyncs or renames for example), we need to serialize on
the log_mutex of the inode we are logging. A task setting a new value for
an inode's last_unlink_trans must acquire the inode's log_mutex and it
must do this update before doing the actual unlink operation (which is
already the case except when deleting a snapshot). Conversely the task
logging the inode must first log the inode and then check the inode's
last_unlink_trans value while holding its log_mutex, as if its value is
not greater then the id of the last committed transaction it means it
logged a safe state of the inode's items, while if its value is not
smaller then the id of the last committed transaction it means the inode
state it has logged might not be safe (the concurrent task might have
just updated last_unlink_trans but hasn't done yet the unlink operation)
and therefore a transaction commit must be done.

Test cases for xfstests follow in separate patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:29 -08:00
Filipe Manana
1ec9a1ae1e Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.

For example:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
  $ btrfs subvolume delete /mnt/testdir/snap
  $ xfs_io -c "fsync" /mnt/testdir
  < crash / power failure and reboot >
  $ mount /dev/sdc /mnt
  mount: mount(2) failed: No such file or directory

And in dmesg/syslog we get the following message and trace:

[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[192066.363010] ------------[ cut here ]------------
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
[192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
[192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
[192066.385827] Call Trace:
[192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
[192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
[192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
[192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
[192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
[192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
[192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
[192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
[192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
[192066.444613] BTRFS: open_ctree failed

This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.

So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
roots, amongst other steps.

A test case for xfstests that triggers the issue follows.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create a snapshot at the root of our filesystem (mount point path), delete it,
  # fsync the mount point path, crash and mount to replay the log. This should
  # succeed and after the filesystem is mounted the snapshot should not be visible
  # anymore.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
  _flakey_drop_and_remount
  [ -e $SCRATCH_MNT/snap1 ] && \
      echo "Snapshot snap1 still exists after log replay"

  # Similar scenario as above, but this time the snapshot is created inside a
  # directory and not directly under the root (mount point path).
  mkdir $SCRATCH_MNT/testdir
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
  _flakey_drop_and_remount
  [ -e $SCRATCH_MNT/testdir/snap2 ] && \
      echo "Snapshot snap2 still exists after log replay"

  _unmount_flakey

  echo "Silence is golden"
  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01 08:23:25 -08:00
Chris Mason
c05c5ee5ea Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
Btrfs patchsets for 4.6
2016-03-01 08:13:56 -08:00
Steve French
9589995e46 CIFS: Fix duplicate line introduced by clone_file_range patch
Commit 04b38d6012 ("vfs: pull btrfs clone API to vfs layer")
added a duplicated line (in cifsfs.c) which causes a sparse compile
warning.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-03-01 09:38:00 -06:00
Dave Chinner
6448543735 xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
If the block size of a filesystem is not at least PAGE_SIZEd, then
at this point in time DAX cannot be used due to the fact we can't
guarantee extents are page sized or aligned without further work.
Hence disallow setting the DAX flag on an inode if the block size is
too small. Also, be defensive and check the block size when reading
an inode in off disk.

In future, we want to allow DAX to work on any filesystem, so this
is temporary while we sort of the correct conbination of extent size
hints and allocation alignment configurations needed to guarantee
page sized and aligned extent allocation for DAX enabled files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
3a6a854a82 xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
When we set or clear the XFS_DIFLAG2_DAX flag, we should also
set/clear the S_DAX flag in the VFS inode. To do this, we need to
ensure that we first flush and remove any cached entries in the
radix tree to ensure the correct data access method is used when we
next try to read or write data. We ahve to be especially careful
here to lock out page faults so they don't race with the flush and
invalidation before we change the access mode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
db10c697b4 xfs: S_DAX is only for regular files
Only regular files can use DAX for data operations, so we should
restrict setting it on the VFS inode to regular files. Setting it on
metadata inodes may cause the VFS to do the wrong thing for such
inodes, so avoid potential problems by restricting the scope of the
flag to what we know is supported.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
Dave Chinner
e889752905 xfs: XFS_DIFLAG_DAX is only for regular files or directories
Only file data can use DAX, so we should onyl be able to set this
flag on regular files. However, the flag also serves as an "inherit"
flag at file create time when set on directories, so limit the
FS_IOC_FSSETXATTR ioctl to only set this flag on regular files and
directories.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-01 09:41:33 +11:00
David Woodhouse
5817b9dc9c jffs2: Improve post-mount CRC scan efficiency
We need to finish doing the CRC checks before we can allow writes to
happen, and we currently process the inodes in order. This means a call
to jffs2_get_ino_cache() for each possible inode# up to c->highest_ino.

There may be a lot of lookups which fail, if the inode# space is used
sparsely. And the inode# space is *often* used sparsely, if a file
system contains a lot of stuff that was put there in the original
image, followed by lots of creation and deletion of new files.

Instead of processing them numerically with a lookup each time, just
walk the hash buckets instead.

[fix locking typo reported by Dan Carpenter]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2016-02-29 22:29:10 +00:00
Al Viro
a528aca7f3 use ->d_seq to get coherency between ->d_inode and ->d_flags
Games with ordering and barriers are way too brittle.  Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-29 12:16:43 -05:00
Yadan Fan
1ee9f4bd1a Fix cifs_uniqueid_to_ino_t() function for s390x
This issue is caused by commit 02323db17e ("cifs: fix
cifs_uniqueid_to_ino_t not to ever return 0"), when BITS_PER_LONG
is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t()
function will cast 64-bit fileid to 32-bit by using (ino_t)fileid,
because ino_t (typdefed __kernel_ino_t) is int type.

It's defined in arch/s390/include/uapi/asm/posix_types.h

    #ifndef __s390x__

    typedef unsigned long   __kernel_ino_t;
    ...
    #else /* __s390x__ */

    typedef unsigned int    __kernel_ino_t;

So the #ifdef condition is wrong for s390x, we can just still use
one cifs_uniqueid_to_ino_t() function with comparing sizeof(ino_t)
and sizeof(u64) to choose the correct execution accordingly.

Signed-off-by: Yadan Fan <ydfan@suse.com>
CC: stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29 00:46:55 -06:00