Commit Graph

56429 Commits

Author SHA1 Message Date
Su Yue
488d7c4566 btrfs: Check name_len before reading btrfs_get_name
In btrfs_get_name, there's btrfs_search_slot and reads name from
inode_ref/root_ref.

Call btrfs_is_name_len_valid in btrfs_get_name.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
59b0a7f2c7 btrfs: Check name_len before read in iterate_dir_item
Since iterate_dir_item checks name_len in its own way,
so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict
name_len check.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switched ENAMETOOLONG to EIO ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
3c1d418448 btrfs: Check name_len in btrfs_check_ref_name_override
In btrfs_log_inode, btrfs_search_forward gets the buffer and then
btrfs_check_ref_name_override will read name from ref/extref for the
first time.

Call btrfs_is_name_len_valid before reading name.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
8ee8c2d62d btrfs: Verify dir_item in replay_xattr_deletes
replay_xattr_deletes calls btrfs_search_slot to get buffer and reads
name.

Call verify_dir_item to check name_len in replay_xattr_deletes to avoid
reading out of boundary.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
26a836cec2 btrfs: Check name_len on add_inode_ref call path
replay_one_buffer first reads buffers and dispatches items accroding to
the item type.
In this patch, add_inode_ref handles inode_ref and inode_extref.
Then add_inode_ref calls ref_get_fields and extref_get_fields to read
ref/extref name for the first time.
So checking name_len before reading those two is fine.

add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
The call graph includes btrfs_match_dir_item_name to read dir_item name
in the parent dir.
Checking first dir_item is not enough. Change it to verify every
dir_item while doing matches.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
e79a33270d btrfs: Check name_len with boundary in verify dir_item
Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.

Example:
	1. Corrupt one dir_item name_len to be 255.
        2. Run 'ls -lar /mnt/test/ > /dev/null'
dmesg:
[   48.451449] BTRFS info (device vdb1): disk space caching is enabled
[   48.451453] BTRFS info (device vdb1): has skinny extents
[   48.489420] general protection fault: 0000 [#1] SMP
[   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
[   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
[   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
[   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
[   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
[   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
[   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
[   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
[   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
[   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
[   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
[   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
[   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.490008] Call Trace:
[   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
[   48.490008]  iterate_dir+0x181/0x1b0
[   48.490008]  SyS_getdents+0xa7/0x150
[   48.490008]  ? fillonedir+0x150/0x150
[   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
[   48.490008] RIP: 0033:0x7fef5032546b
[   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
[   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
[   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
[   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
[   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
[   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
[   48.499455] ---[ end trace 321920d8e8339505 ]---

Fix it by adding a parameter @slot and check name_len with item boundary
by calling btrfs_is_name_len_valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
rev
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
Su Yue
19c6dcbfa7 btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary
Introduce function btrfs_is_name_len_valid.

The function compares parameter @name_len with item boundary then
returns true if name_len is valid.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:16:04 +02:00
David Sterba
66b4993e95 btrfs: move dev stats accounting out of wait_dev_flush
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
2980d5745f btrfs: account as waiting for IO, while waiting fot the flush bio completion
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:39 +02:00
David Sterba
e0ae999414 btrfs: preallocate device flush bio
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.

Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 19:03:38 +02:00
Filipe Manana
fdb1388994 Btrfs: incremental send, fix invalid path for unlink commands
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).

Consider the following example scenario where this issue happens.

Parent snapshot:

 .                                                      (ino 256)
 |
 |--- dir1/                                             (ino 257)
       |--- dir2/                                       (ino 258)
       |     |--- file1                                 (ino 259)
       |     |--- file3                                 (ino 261)
       |
       |--- dir3/                                       (ino 262)
             |--- file22                                (ino 260)
             |--- dir4/                                 (ino 263)

Send snapshot:

 .                                                      (ino 256)
 |
 |--- dir1/                                             (ino 257)
       |--- dir2/                                       (ino 258)
       |--- dir3                                        (ino 260)
       |--- file3/                                      (ino 262)
             |--- dir4/                                 (ino 263)
                   |--- file11                          (ino 269)
                   |--- file33                          (ino 261)

When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:

 receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
 utimes
 utimes dir1
 utimes dir1/dir2
 link dir1/dir3/dir4/file11 -> dir1/dir2/file1
 unlink dir1/dir2/file1
 utimes dir1/dir2
 truncate dir1/dir3/dir4/file11 size=0
 utimes dir1/dir3/dir4/file11
 rename dir1/dir3 -> o262-7-0
 link dir1/dir3 -> o262-7-0/file22
 unlink dir1/dir3/file22
 ERROR: unlink dir1/dir3/file22 failed. Not a directory

The following steps happen during the computation of the incremental send
stream the lead to this issue:

1) Before we start processing the new and deleted references for inode
   260, we compute the full path of the deleted reference
   ("dir1/dir3/file22") and cache it in the list of deleted references
   for our inode.

2) We then start processing the new references for inode 260, for which
   there is only one new, located at "dir1/dir3". When processing this
   new reference, we check that inode 262, which was not yet processed,
   collides with the new reference and because of that we orphanize
   inode 262 so its new full path becomes "o262-7-0".

3) After the orphanization of inode 262, we create the new reference for
   inode 260 by issuing a link command with a target path of "dir1/dir3"
   and a source path of "o262-7-0/file22".

4) We then start processing the deleted references for inode 260, for
   which there is only one with the base name of "file22", and issue
   an unlink operation containing the target path computed at step 1,
   which is wrong because that path no longer exists and should be
   replaced with "o262-7-0/file22".

So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:53:10 +02:00
Filipe Manana
72c3668fed Btrfs: send, fix invalid path after renaming and linking file
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:

Parent snapshot

 .                  (ino 256)
 |
 |--- f1            (ino 257)
 |--- f2            (ino 258)
 |--- f3            (ino 259)

Send snapshot

 .                  (ino 256)
 |
 |--- f2            (ino 257)
 |--- f3            (ino 258)
 |--- f4            (ino 259)
 |--- f5            (ino 258)

The following steps happen when computing the incremental send stream:

1) When processing inode 257, inode 258 is orphanized (renamed to
   "o258-7-0"), because its current reference has the same name as the
   new reference for inode 257;

2) When processing inode 258, we iterate over all its new references,
   which have the names "f3" and "f5". The first iteration sees name
   "f5" and renames the inode from its orphan name ("o258-7-0") to
   "f5", while the second iteration sees the name "f3" and, incorrectly,
   issues a link operation with a target name matching the orphan name,
   which no longer exists. The first iteration had reset the current
   valid path of the inode to "f5", but in the second iteration we lost
   it because we found another inode, with a higher number of 259, which
   has a reference named "f3" as well, so we orphanized inode 259 and
   recomputed the current valid path of inode 258 to its old orphan
   name because inode 259 could be an ancestor of inode 258 and therefore
   the current valid path could contain the pre-orphanization name of
   inode 259. However in this case inode 259 is not an ancestor of inode
   258 so the current valid path should not be recomputed.
   This makes the receiver fail with the following error:

   ERROR: link f3 -> o258-7-0 failed: No such file or directory

So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.

A test case for fstests will follow soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:53:03 +02:00
Filipe Manana
609805d809 Btrfs: fix invalid extent maps due to hole punching
While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller then the sector size or with a
start offset that is not aligned to the sector size. Both cases are not
expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
  $ xfs_io -c "fpunch 60K 90K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
  $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
  $ umount /mnt

After the unmount operation we can see several warnings emmitted due to
underflows related to space reservation counters:

[ 2837.443299] ------------[ cut here ]------------
[ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.462379] Call Trace:
[ 2837.462379]  dump_stack+0x68/0x92
[ 2837.462379]  __warn+0xc2/0xdd
[ 2837.462379]  warn_slowpath_null+0x1d/0x1f
[ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.462379]  destroy_inode+0x3d/0x55
[ 2837.462379]  evict+0x177/0x17e
[ 2837.462379]  dispose_list+0x50/0x71
[ 2837.462379]  evict_inodes+0x132/0x141
[ 2837.462379]  generic_shutdown_super+0x3f/0xeb
[ 2837.462379]  kill_anon_super+0x12/0x1c
[ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.462379]  deactivate_locked_super+0x30/0x68
[ 2837.462379]  deactivate_super+0x36/0x39
[ 2837.462379]  cleanup_mnt+0x58/0x76
[ 2837.462379]  __cleanup_mnt+0x12/0x14
[ 2837.462379]  task_work_run+0x77/0x9b
[ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
[ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
[ 2837.596256] ------------[ cut here ]------------
[ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.663359] Call Trace:
[ 2837.663359]  dump_stack+0x68/0x92
[ 2837.663359]  __warn+0xc2/0xdd
[ 2837.663359]  warn_slowpath_null+0x1d/0x1f
[ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.663359]  ? evict_inodes+0x132/0x141
[ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.663359]  generic_shutdown_super+0x6a/0xeb
[ 2837.663359]  kill_anon_super+0x12/0x1c
[ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.663359]  deactivate_locked_super+0x30/0x68
[ 2837.663359]  deactivate_super+0x36/0x39
[ 2837.663359]  cleanup_mnt+0x58/0x76
[ 2837.663359]  __cleanup_mnt+0x12/0x14
[ 2837.663359]  task_work_run+0x77/0x9b
[ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
[ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
[ 2837.745595] ------------[ cut here ]------------
[ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.758526] Call Trace:
[ 2837.758925]  dump_stack+0x68/0x92
[ 2837.759383]  __warn+0xc2/0xdd
[ 2837.759383]  warn_slowpath_null+0x1d/0x1f
[ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.759383]  ? evict_inodes+0x132/0x141
[ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.759383]  generic_shutdown_super+0x6a/0xeb
[ 2837.759383]  kill_anon_super+0x12/0x1c
[ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.759383]  deactivate_locked_super+0x30/0x68
[ 2837.759383]  deactivate_super+0x36/0x39
[ 2837.759383]  cleanup_mnt+0x58/0x76
[ 2837.759383]  __cleanup_mnt+0x12/0x14
[ 2837.759383]  task_work_run+0x77/0x9b
[ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
[ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
[ 2837.778235] ------------[ cut here ]------------
[ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
[ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.800118] Call Trace:
[ 2837.800515]  dump_stack+0x68/0x92
[ 2837.801015]  __warn+0xc2/0xdd
[ 2837.801471]  warn_slowpath_null+0x1d/0x1f
[ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.801698]  ? evict_inodes+0x132/0x141
[ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.801698]  generic_shutdown_super+0x6a/0xeb
[ 2837.801698]  kill_anon_super+0x12/0x1c
[ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.801698]  deactivate_locked_super+0x30/0x68
[ 2837.801698]  deactivate_super+0x36/0x39
[ 2837.801698]  cleanup_mnt+0x58/0x76
[ 2837.801698]  __cleanup_mnt+0x12/0x14
[ 2837.801698]  task_work_run+0x77/0x9b
[ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
[ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
[ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
[ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
[ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0

What happens in the above example is the following:

1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
   is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
   This results in the creation of an extent map with a length of 2Kb
   starting at file offset 148Kb, through find_first_non_hole() ->
   btrfs_get_extent().

2) The second write (first write after the hole punch operation), sets
   the range [50Kb, 152Kb[ to delalloc.

3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
   map covering the range [148Kb, 150Kb[ and ends up calling
   set_extent_bit() for the same range, which results in splitting an
   existing extent state record, covering the range [148Kb, 152Kb[ into
   two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
   [150Kb, 152Kb[.

4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
   btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
   range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
   callback being invoked against the two 2Kb extent state records that
   cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
   the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
   with a length argument of 2048 bytes. That function rounds up the length
   to a sector size aligned length, so it ends up considering a length of
   4096 bytes, and then calls calc_csum_metadata_size() which results in
   decrementing the inode's csum_bytes counter by 4096 bytes, so after
   it stays a value of 0 bytes. Then the same happens when
   btrfs_clear_bit_hook() is called against the second extent state that
   has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
   rounded up to 4096 and calc_csum_metadata_size() ends up being called
   to decrement 4096 bytes from the inode's csum_bytes counter, which
   at that time has a value of 0, leading to an underflow, which is
   exactly what triggers the first warning, at btrfs_destroy_inode().
   All the other warnings relate to several space accounting counters
   that underflow as well due to similar reasons.

A similar case but where the hole punching operation creates an extent map
with a start offset not aligned to the sector size is the following:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
  $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
  $ umount /mnt

During the unmount operation we get similar traces for the same reasons as
in the first example.

So fix the hole punching operation to make sure it never creates extent
maps with a length that is not aligned to the sector size nor with a start
offset that is not aligned to the sector size, as this breaks all
assumptions and it's a land mine.

Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
Cc: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 16:52:45 +02:00
Jeff Mahoney
cddf3b2cb3 btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items
On an uncontended system, we can end up hitting soft lockups while
doing replace_path.  At the core, and frequently called is
btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
there.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21 15:48:01 +02:00
Dan Carpenter
0e9350de2e btrfs: use new block error code
This function is supposed to return blk_status_t error codes now but
there was a stray -ENOMEM left behind.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 07:47:34 -06:00
Christophe Jaillet
517a6e43c4 CIFS: Fix some return values in case of error in 'crypt_message'
'rc' is known to be 0 at this point. So if 'init_sg' or 'kzalloc' fails, we
should return -ENOMEM instead.

Also remove a useless 'rc' in a debug message as it is meaningless here.

Fixes: 026e93dc0a ("CIFS: Encrypt SMB3 requests before sending")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-06-21 00:09:28 -05:00
Bart Van Assche
ca18d6f769 block: Make most scsi_req_init() calls implicit
Instead of explicitly calling scsi_req_init() after blk_get_request(),
call that function from inside blk_get_request(). Add an
.initialize_rq_fn() callback function to the block drivers that need
it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
because it is too small to keep it as a separate function. Keep the
scsi_req_init() call in ide_prep_sense() because it follows a
blk_rq_init() call.

References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Colin Ian King
e125f5284f cifs: remove redundant return in cifs_creation_time_get
There is a redundant return in function cifs_creation_time_get
that appears to be old vestigial code than can be removed. So
remove it.

Detected by CoverityScan, CID#1361924 ("Structurally dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-06-20 19:14:40 -05:00
Pavel Shilovsky
dcd87838c0 CIFS: Improve readdir verbosity
Downgrade the loglevel for SMB2 to prevent filling the log
with messages if e.g. readdir was interrupted. Also make SMB2
and SMB1 codepaths do the same logging during readdir.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-06-20 19:13:47 -05:00
Colin Ian King
ecf3411a12 CIFS: check if pages is null rather than bv for a failed allocation
pages is being allocated however a null check on bv is being used
to see if the allocation failed. Fix this by checking if pages is
null.

Detected by CoverityScan, CID#1432974 ("Logically dead code")

Fixes: ccf7f4088a ("CIFS: Add asynchronous context to support kernel AIO")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-06-20 19:11:35 -05:00
Dan Carpenter
8a7b0d8e8d CIFS: Set ->should_dirty in cifs_user_readv()
The current code causes a static checker warning because ITER_IOVEC is
zero so the condition is never true.

Fixes: 6685c5e2d1 ("CIFS: Add asynchronous read support through kernel AIO")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-06-20 17:57:27 -05:00
Nikolay Borisov
104b4e5139 percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable.  Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add.  In terms of context-safety,
they're equivalent.  The only difference is that the __ version takes
a batch parameter.

Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.

This patch doesn't cause any functional changes.

tj: Minor updates to patch description for clarity.  Cosmetic
    indentation updates.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20 15:42:32 -04:00
Darrick J. Wong
61d819e7bc xfs: don't allow bmap on rt files
bmap returns a dumb LBA address but not the block device that goes with
that LBA.  Swapfiles don't care about this and will blindly assume that
the data volume is the correct blockdev, which is totally bogus for
files on the rt subvolume.  This results in the swap code doing IOs to
arbitrary locations on the data device(!) if the passed in mapping is a
realtime file, so just turn off bmap for rt files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-20 10:45:22 -07:00
Darrick J. Wong
5da8f2f890 xfs: allow reading of already-locked remote symbolic link
Expose the readlink variant that doesn't take the inode lock so that
the scrubber can inspect symlink contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:22 -07:00
Darrick J. Wong
ad017f6537 xfs: pass along transaction context when reading xattr block buffers
Teach the extended attribute reading functions to pass along a
transaction context if one was supplied.  The extended attribute scrub
code will use transactions to lock buffers and avoid deadlocking with
itself in the case of loops; since it will already have the inode
locked, also create xattr get/list helpers that don't take locks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:22 -07:00
Darrick J. Wong
acb9553cab xfs: pass along transaction context when reading directory block buffers
Teach the directory reading functions to pass along a transaction context
if one was supplied.  The directory scrub code will use transactions to
lock buffers and avoid deadlocking with itself in the case of loops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:22 -07:00
Darrick J. Wong
8e8877e6ed xfs: return the hash value of a leaf1 directory block
Modify the existing dir leafn lasthash function to enable us to
calculate the highest hash value of a leaf1 block.  This will be used by
the directory scrubbing code to check the sanity of hashes in leaf1
directory blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:21 -07:00
Darrick J. Wong
e7f5d5ca36 xfs: refactor the ifork block counting function
Refactor the inode fork block counting function to count extents for us
at the same time.  This will be used by the bmbt scrubber function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:21 -07:00
Darrick J. Wong
d29cb3e45e xfs: make _bmap_count_blocks consistent wrt delalloc extent behavior
There is an inconsistency in the way that _bmap_count_blocks deals with
delalloc reservations -- if the specified fork is in extents format,
*count is set to the total number of blocks referenced by the in-core
fork, including delalloc extents.  However, if the fork is in btree
format, *count is set to the number of blocks referenced by the on-disk
fork, which does /not/ include delalloc extents.

For the lone existing caller of _bmap_count_blocks this hasn't been an
issue because the function is only used to count xattr fork blocks
(where there aren't any delalloc reservations).  However, when scrub
comes along it will use this same function to check di_nblocks against
both on-disk extent maps, so we need this behavior to be consistent.

Therefore, fix _bmap_count_leaves not to include delalloc extents and
remove unnecessary parameters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20 10:45:21 -07:00
Bob Peterson
722f6f62a5 GFS2: Eliminate vestigial sd_log_flush_wrapped
Superblock variable sd_log_flush_wrapped is set, but never referenced,
so this patch eliminates it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-06-20 09:52:57 -05:00
Goldwyn Rodrigues
edf064e7c6 btrfs: nowait aio support
Return EAGAIN if any of the following checks fail
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
29a5d29ec1 xfs: nowait aio support
If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation either due to file extension, writing to a hole,
or COW or waiting for other DIOs to finish.

Return -EAGAIN if we don't have extent list in memory.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
728fbc0e10 ext4: nowait aio support
Return EAGAIN if any of the following checks fail for direct I/O:
  + i_rwsem is lockable
  + Writing beyond end of file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
03a07c92a9 block: return on congested block device
A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
a38d124370 fs: Introduce IOMAP_NOWAIT
IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
b745fafaf7 fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT
RWF_NOWAIT informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable
of returning -EAGAIN if the AIO call will block. This must be set by
supporting filesystems in the ->open() call.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
9830f4be15 fs: Use RWF_* flags for AIO operations
aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will
carry the RWF_* flags. We cannot use aio_flags because they are not
checked for validity which may break existing applications.

Note, the only place RWF_HIPRI comes in effect is dio_await_one().
All the rest of the locations, aio code return -EIOCBQUEUED before the
checks for RWF_HIPRI.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Goldwyn Rodrigues
fdd2f5b7de fs: Separate out kiocb flags setup based on RWF_* flags
Also added RWF_SUPPORTED to encompass all flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Nikolay Borisov
7dfb8be11b btrfs: Round down values which are written for total_bytes_size
We got an internal report about a file system not wanting to mount
following 99e3ecfcb9 ("Btrfs: add more validation checks for
superblock").

BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
fs_devices total_rw_bytes 1000203820544

Subtracting the numbers we get a difference of less than a 4kb. Upon
closer inspection it became apparent that mkfs actually rounds down the
size of the device to a multiple of sector size. However, the same
cannot be said for various functions which modify the total size and are
called from btrfs_balance as well as when adding a new device. So this
patch ensures that values being saved into on-disk data structures are
always rounded down to a multiple of sectorsize.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
Nikolay Borisov
eca152edf5 btrfs: Manually implement device_total_bytes getter/setter
The device->total_bytes member needs to always be rounded down to sectorsize
so that it corresponds to the value of super->total_bytes. However, there are
multiple places where the setter is fed a value which is not rounded which
can cause a fs to be unmountable due to the check introduced in
99e3ecfcb9 ("Btrfs: add more validation checks for superblock"). This patch
implements the getter/setter manually so that in a later patch I can add
necessary code to catch offenders.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
David Sterba
0d0c71b317 btrfs: obsolete and remove mount option alloc_start
The mount option alloc_start was used in the past for debugging and
stressing the chunk allocator. Not meant to be used by users, so we're
not breaking anybody's setup.

There was some added complexity handling changes of the value and when
it was not same as default. Such code has likely been untested and I
think it's better to remove it.

This patch kills all use of alloc_start, and by doing that also fixes
a bug when alloc_size is set, potentially called from statfs:

in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
protection is temporarily dropped so btrfs_account_dev_extents_size can
be called and then RCU is locked again! Doing that inside
list_for_each_entry_rcu is just asking for trouble, but unlikely to be
observed in practice.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:48 +02:00
David Sterba
fac03c8dae btrfs: move fs_info::fs_frozen to the flags
We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:42 +02:00
David Sterba
79b4f4c605 btrfs: cleanup duplicate return value in insert_inline_extent
The pattern when err is used for function exit and ret is used for
return values of callees is not used here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20 14:22:12 +02:00
Ard Biesheuvel
737326aa51 fs/proc: kcore: use kcore_list type to check for vmalloc/module address
Instead of passing each start address into is_vmalloc_or_module_addr()
to decide whether it falls into either the VMALLOC or the MODULES region,
we can simply check the type field of the current kcore_list entry, since
it will be set to KCORE_VMALLOC based on exactly the same conditions.

As a bonus, when reading the KCORE_TEXT region on architectures that have
one, this will avoid using vread() on the region if it happens to intersect
with a KCORE_VMALLOC region. This is due the fact that the KCORE_TEXT
region is the first one to be added to the kcore region list.

Reported-by: Tan Xiaojun <tanxiaojun@huawei.com>
Tested-by: Tan Xiaojun <tanxiaojun@huawei.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Laura Abbott <labbott@redhat.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-06-20 12:42:57 +01:00
Ingo Molnar
2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
5dd43ce2f6 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.

Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.

So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies.

This reduces the size of wait.h rather significantly, by about 30%.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:09 +02:00
Ingo Molnar
2141713616 sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.

Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:28 +02:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Jason A. Donenfeld
51b0817b0d cifs: use get_random_u32 for 32-bit lock random
Using get_random_u32 here is faster, more fitting of the use case, and
just as cryptographically secure. It also has the benefit of providing
better randomness at early boot, which is sometimes when this is used.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-19 22:06:28 -04:00
Darrick J. Wong
ea7cdd7b7b xfs: separate function to check if inode shares extents
Separate the "clear reflink flag" function into one function that checks
if the flag is needed, and a second function that checks and clears the
flag.  The inode scrub code will want to check the necessity of the flag
without clearing it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19 14:11:35 -07:00