Commit Graph

56429 Commits

Author SHA1 Message Date
Jan Kara
7b78fd02fb udf: Fix handling of Partition Descriptors
Current handling of Partition Descriptors in Volume Descriptor Sequence
is buggy in several ways. Firstly, it does not take descriptor sequence
numbers into account at all, thus any volume making serious use of them
would be unmountable. Secondly, it does not handle Volume Descriptor
Pointers or Volume Descriptor Sequence without Terminating Descriptor.

Fix these problems by properly remembering all Partition Descriptors in
the Volume Descriptor Sequence and their sequence numbers. This is made
more complicated by the fact that we don't know number of partitions in
advance and sequence numbers have to be tracked on per-partition basis.

Reported-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara
18cf4781c9 udf: Unify common handling of descriptors
When scanning Volume Descriptor Sequence, several descriptors have
exactly the same handling. Unify it.

Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:26 +01:00
Chengguang Xu
5b4c845ea4 xfs: fix potential memory leak in mount option parsing
When specifying string type mount option (e.g., logdev)
several times in a mount, current option parsing may
cause memory leak. Hence, call kfree for previous one
in this case.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-26 10:02:13 -08:00
Jan Kara
560e7cb2f3 blockdev: Avoid two active bdev inodes for one device
When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:

CPU0					CPU1
blkdev_open()
  bdev = bd_acquire()
					del_gendisk()
					  bdev_unhash_inode(bdev);
					remove device
					create new device with the same number
  __blkdev_get()
    disk = get_gendisk()
      - gets reference to gendisk of the new device

Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.

Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara
897366537f genhd: Fix use after free in __blkdev_get()
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use looked up gendisk after it is freed:

CPU0				CPU1			CPU2
							del_gendisk(disk);
							  bdev_unhash_inode(inode);
blkdev_open()			blkdev_open()
  bdev = bd_acquire(inode);
    - creates and returns new inode
				  bdev = bd_acquire(inode);
				    - returns the same inode
  __blkdev_get(devt)		  __blkdev_get(devt)
    disk = get_gendisk(devt);
      - got structure of device going away
							<finish device removal>
							<new device gets
							 created under the same
							 device number>
				  disk = get_gendisk(devt);
				    - got new device structure
				  if (!bdev->bd_openers) {
				    does the first open
				  }
    if (!bdev->bd_openers)
      - false
    } else {
      put_disk_and_module(disk)
        - remember this was old device - this was last ref and disk is
          now freed
    }
    disk_unblock_events(disk); -> oops

Fix the problem by making sure we drop reference to disk in
__blkdev_get() only after we are really done with it.

Reported-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara
9df6c29912 genhd: Add helper put_disk_and_module()
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:48:42 -07:00
Jan Kara
d9c10e5b88 direct-io: Fix sleep in atomic due to sync AIO
Commit e864f39569 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be15 "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be defered to a workqueue and thus AIO DIO
with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
sleep in atomic.

Fix the problem by checking directly iocb flags (the same way as it is
done in dio_complete()) instead of checking all conditions that could
lead to IO being synchronous.

CC: Christoph Hellwig <hch@lst.de>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
CC: stable@vger.kernel.org
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9830f4be15
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26 09:05:35 -07:00
Vivek Goyal
d1fe96c0e4 ovl: redirect_dir=nofollow should not follow redirect for opaque lower
redirect_dir=nofollow should not follow a redirect. But in a specific
configuration it can still follow it.  For example try this.

$ mkdir -p lower0 lower1/foo upper work merged
$ touch lower1/foo/lower-file.txt
$ setfattr -n "trusted.overlay.opaque" -v "y" lower1/foo
$ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=on none merged
$ cd merged
$ mv foo foo-renamed
$ umount merged

# mount again. This time with redirect_dir=nofollow
$ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=nofollow none merged
$ ls merged/foo-renamed/
# This lists lower-file.txt, while it should not have.

Basically, we are doing redirect check after we check for d.stop. And
if this is not last lower, and we find an opaque lower, d.stop will be
set.

ovl_lookup_single()
        if (!d->last && ovl_is_opaquedir(this)) {
                d->stop = d->opaque = true;
                goto out;
        }

To fix this, first check redirect is allowed. And after that check if
d.stop has been set or not.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: 438c84c2f0 ("ovl: don't follow redirects if redirect_dir=off")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-02-26 16:55:51 +01:00
Chengguang Xu
18106734b5 ceph: fix dentry leak when failing to init debugfs
When failing from ceph_fs_debugfs_init() in ceph_real_mount(),
there is lack of dput of root_dentry and it causes slab errors,
so change the calling order of ceph_fs_debugfs_init() and
open_root_dentry() and do some cleanups to avoid this issue.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-02-26 16:20:07 +01:00
Chengguang Xu
937441f3a3 libceph, ceph: avoid memory leak when specifying same option several times
When parsing string option, in order to avoid memory leak we need to
carefully free it first in case of specifying same option several times.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-02-26 16:19:30 +01:00
Zhi Zhang
6ef0bc6dde ceph: flush dirty caps of unlinked inode ASAP
Client should release unlinked inode from its cache ASAP. But client
can't release inode with dirty caps.

Link: http://tracker.ceph.com/issues/22886
Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-02-26 16:19:16 +01:00
Fengguang Wu
b5095f24e7 ovl: fix ptr_ret.cocci warnings
fs/overlayfs/export.c:459:10-16: WARNING: PTR_ERR_OR_ZERO can be used

 Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-02-26 12:45:20 +01:00
Linus Torvalds
c89be52426 Merge tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:

 - fix a broken cast in nfs4_callback_recallany()

 - fix an Oops during NFSv4 migration events

 - make struct nlmclnt_fl_close_lock_ops static

* tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: make struct nlmclnt_fl_close_lock_ops static
  nfs: system crashes after NFS4ERR_MOVED recovery
  NFSv4: Fix broken cast in nfs4_callback_recallany()
2018-02-25 13:43:18 -08:00
Will Deacon
8cc07c808c fs: dcache: Use READ_ONCE when accessing i_dir_seq
i_dir_seq is subject to concurrent modification by a cmpxchg or
store-release operation, so ensure that the relaxed access in
d_alloc_parallel uses READ_ONCE.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-25 12:51:10 -05:00
Will Deacon
015555fd4d fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
If d_alloc_parallel runs concurrently with __d_add, it is possible for
d_alloc_parallel to continuously retry whilst i_dir_seq has been
incremented to an odd value by __d_add:

CPU0:
__d_add
	n = start_dir_add(dir);
		cmpxchg(&dir->i_dir_seq, n, n + 1) == n

CPU1:
d_alloc_parallel
retry:
	seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
	hlist_bl_lock(b);
		bit_spin_lock(0, (unsigned long *)b); // Always succeeds

CPU0:
	__d_lookup_done(dentry)
		hlist_bl_lock
			bit_spin_lock(0, (unsigned long *)b); // Never succeeds

CPU1:
	if (unlikely(parent->d_inode->i_dir_seq != seq)) {
		hlist_bl_unlock(b);
		goto retry;
	}

Since the simple bit_spin_lock used to implement hlist_bl_lock does not
provide any fairness guarantees, then CPU1 can starve CPU0 of the lock
and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot
exit its retry loop because the sequence number always has the bottom
bit set.

This patch resolves the livelock by not taking hlist_bl_lock in
d_alloc_parallel if the sequence counter is odd, since any subsequent
masked comparison with i_dir_seq will fail anyway.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-25 12:51:09 -05:00
David S. Miller
f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Al Viro
3b82140963 lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent.  We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point.  Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.

Cc: stable@vger.kernel.org # 3.12+, counting backports
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-23 20:47:17 -05:00
Linus Torvalds
bae6cfe8a3 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
 "This fixes a build error that only shows up on blackfin"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  fs/signalfd: fix build error for BUS_MCEERR_AR
2018-02-22 17:04:06 -08:00
Darrick J. Wong
b31c2bdcd8 xfs: reserve blocks for refcount / rmap log item recovery
During log recovery, the per-AG reservations aren't yet set up, so log
recovery has to reserve enough blocks to handle all possible btree
splits.

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-02-22 14:41:25 -08:00
Eric Sandeen
86516eff3b xfs: use memset to initialize xfs_scrub_agfl_info
Apparently different gcc versions have competing and
incompatible notions of how to initialize at declaration,
so just give up and fall back to the time-tested memset().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-02-22 14:41:25 -08:00
Randy Dunlap
9026e820cb fs/signalfd: fix build error for BUS_MCEERR_AR
Fix build error in fs/signalfd.c by using same method that is used in
kernel/signal.c: separate blocks for different signal si_code values.

./fs/signalfd.c: error: 'BUS_MCEERR_AR' undeclared (first use in this function)

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-02-22 15:00:07 -06:00
Al Viro
304ec482f5 get rid of pointless includes of fs_struct.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-22 14:28:50 -05:00
Luck, Tony
bef3efbeb8 efivarfs: Limit the rate for non-root to read files
Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).

On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.

Linus suggested per-user rate limit to solve this.

So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.

In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-22 10:21:02 -08:00
Colin Ian King
1b72040645 NFS: make struct nlmclnt_fl_close_lock_ops static
The structure nlmclnt_fl_close_lock_ops s local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
fs/nfs/nfs3proc.c:876:33: warning: symbol 'nlmclnt_fl_close_lock_ops' was not
declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-22 12:23:01 -05:00
Bill.Baker@oracle.com
ad86f605c5 nfs: system crashes after NFS4ERR_MOVED recovery
nfs4_update_server unconditionally releases the nfs_client for the
source server. If migration fails, this can cause the source server's
nfs_client struct to be left with a low reference count, resulting in
use-after-free.  Also, adjust reference count handling for ELOOP.

NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
	nfs_put_client+0xfa/0x110 [nfs]
	nfs4_run_state_manager+0x30/0x40 [nfsv4]
	kthread+0xd8/0xf0

BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
	nfs4_xdr_enc_write+0x6b/0x160 [nfsv4]
	rpcauth_wrap_req+0xac/0xf0 [sunrpc]
	call_transmit+0x18c/0x2c0 [sunrpc]
	__rpc_execute+0xa6/0x490 [sunrpc]
	rpc_async_schedule+0x15/0x20 [sunrpc]
	process_one_work+0x160/0x470
	worker_thread+0x112/0x540
	? rescuer_thread+0x3f0/0x3f0
	kthread+0xd8/0xf0

This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"),
but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops")

Reported-by: Helen Chao <helen.chao@oracle.com>
Fixes: 52442f9b11 ("NFS4: Avoid migration loops")
Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-22 12:17:42 -05:00
Trond Myklebust
6d243a2356 NFSv4: Fix broken cast in nfs4_callback_recallany()
Passing a pointer to a unsigned integer to test_bit() is broken.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-21 16:35:50 -05:00
David S. Miller
f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Theodore Ts'o
044e6e3d74 ext4: don't update checksum of new initialized bitmaps
When reading the inode or block allocation bitmap, if the bitmap needs
to be initialized, do not update the checksum in the block group
descriptor.  That's because we're not set up to journal those changes.
Instead, just set the verified bit on the bitmap block, so that it's
not necessary to validate the checksum.

When a block or inode allocation actually happens, at that point the
checksum will be calculated, and update of the bg descriptor block
will be properly journalled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-02-19 14:16:47 -05:00
Theodore Ts'o
85e0c4e89c jbd2: if the journal is aborted then don't allow update of the log tail
This updates the jbd2 superblock unnecessarily, and on an abort we
shouldn't truncate the log.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-02-19 12:22:53 -05:00
Theodore Ts'o
fb7c02445c ext4: pass -ESHUTDOWN code to jbd2 layer
Previously the jbd2 layer assumed that a file system check would be
required after a journal abort.  In the case of the deliberate file
system shutdown, this should not be necessary.  Allow the jbd2 layer
to distinguish between these two cases by using the ESHUTDOWN errno.

Also add proper locking to __journal_abort_soft().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-02-18 23:45:18 -05:00
Theodore Ts'o
a6d9946bb9 ext4: eliminate sleep from shutdown ioctl
The msleep() when processing EXT4_GOING_FLAGS_NOLOGFLUSH was a hack to
avoid some races (that are now fixed), but in fact it introduced its
own race.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-02-18 23:16:28 -05:00
Theodore Ts'o
576d18ed60 ext4: shutdown should not prevent get_write_access
The ext4 forced shutdown flag needs to prevent new handles from being
started, but it needs to allow existing handles to complete.  So the
forced shutdown flag should not force ext4_journal_get_write_access to
fail.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-02-18 22:07:36 -05:00
Theodore Ts'o
ed65b00f8d jbd2: clarify bad journal block checksum message
There were two error messages emitted by jbd2, one for a bad checksum
for a jbd2 descriptor block, and one for a bad checksum for a jbd2
data block.  Change the data block checksum error so that the two can
be disambiguated.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-02-18 21:33:13 -05:00
Theodore Ts'o
ccf0f32acd ext4: add tracepoints for shutdown and file system errors
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-02-18 20:53:23 -05:00
Linus Torvalds
da370f1d63 Merge tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We have a few assorted fixes, some of them show up during fstests so I
  gave them more testing"

* tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
  Btrfs: fix null pointer dereference when replacing missing device
  btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
  btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
  Btrfs: fix unexpected -EEXIST when creating new inode
  Btrfs: fix use-after-free on root->orphan_block_rsv
  Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
  Btrfs: fix extent state leak from tree log
  Btrfs: fix crash due to not cleaning up tree log block's dirty bits
  Btrfs: fix deadlock in run_delalloc_nocow
2018-02-16 09:26:18 -08:00
Amir Goldstein
7168179fcf ovl: check ERR_PTR() return value from ovl_lookup_real()
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 0617015403 ("ovl: lookup indexed ancestor of lower dir")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-02-16 15:53:20 +01:00
Amir Goldstein
2ca3c148a0 ovl: check lower ancestry on encode of lower dir file handle
This change relaxes copy up on encode of merge dir with lower layer > 1
and handles the case of encoding a merge dir with lower layer 1, where an
ancestor is a non-indexed merge dir. In that case, decode of the lower
file handle will not have been possible if the non-indexed ancestor is
redirected before or after encode.

Before encoding a non-upper directory file handle from real layer N, we
need to check if it will be possible to reconnect an overlay dentry from
the real lower decoded dentry. This is done by following the overlay
ancestry up to a "layer N connected" ancestor and verifying that all
parents along the way are "layer N connectable". If an ancestor that is
NOT "layer N connectable" is found, we need to copy up an ancestor, which
is "layer N connectable", thus making that ancestor "layer N connected".
For example:

 layer 1: /a
 layer 2: /a/b/c

The overlay dentry /a is NOT "layer 2 connectable", because if dir /a is
copied up and renamed, upper dir /a will be indexed by lower dir /a from
layer 1. The dir /a from layer 2 will never be indexed, so the algorithm
in ovl_lookup_real_ancestor() (*) will not be able to lookup a connected
overlay dentry from the connected lower dentry /a/b/c.

To avoid this problem on decode time, we need to copy up an ancestor of
/a/b/c, which is "layer 2 connectable", on encode time. That ancestor is
/a/b. After copy up (and index) of /a/b, it will become "layer 2 connected"
and when the time comes to decode the file handle from lower dentry /a/b/c,
ovl_lookup_real_ancestor() will find the indexed ancestor /a/b and decoding
a connected overlay dentry will be accomplished.

(*) the algorithm in ovl_lookup_real_ancestor() can be improved to lookup
an entry /a in the lower layers above layer N and find the indexed dir /a
from layer 1. If that improvement is made, then the check for "layer N
connected" will need to verify there are no redirects in lower layers above
layer N. In the example above, /a will be "layer 2 connectable". However,
if layer 2 dir /a is a target of a layer 1 redirect, then /a will NOT be
"layer 2 connectable":

 layer 1: /A (redirect = /a)
 layer 2: /a/b/c

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-02-16 15:53:20 +01:00
Amir Goldstein
764baba801 ovl: hash non-dir by lower inode for fsnotify
Commit 31747eda41 ("ovl: hash directory inodes for fsnotify")
fixed an issue of inotify watch on directory that stops getting
events after dropping dentry caches.

A similar issue exists for non-dir non-upper files, for example:

$ mkdir -p lower upper work merged
$ touch lower/foo
$ mount -t overlay -o
lowerdir=lower,workdir=work,upperdir=upper none merged
$ inotifywait merged/foo &
$ echo 2 > /proc/sys/vm/drop_caches
$ cat merged/foo

inotifywait doesn't get the OPEN event, because ovl_lookup() called
from 'cat' allocates a new overlay inode and does not reuse the
watched inode.

Fix this by hashing non-dir overlay inodes by lower real inode in
the following cases that were not hashed before this change:
 - A non-upper overlay mount
 - A lower non-hardlink when index=off

A helper ovl_hash_bylower() was added to put all the logic and
documentation about which real inode an overlay inode is hashed by
into one place.

The issue dates back to initial version of overlayfs, but this
patch depends on ovl_inode code that was introduced in kernel v4.13.

Cc: <stable@vger.kernel.org> #v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-02-16 15:53:20 +01:00
Jan Kara
4b8d425215 udf: Convert descriptor index definitions to enum
Convert index definitions from defines to enum. It is a shorter
description and easier to modify. Also remove VDS_POS_VOL_DESC_PTR since
it is unused.

Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-16 11:15:16 +01:00
Jan Kara
67621675e9 udf: Allow volume descriptor sequence to be terminated by unrecorded block
According to ECMA-167 3/8.4.2 a volume descriptor sequence can be
terminated also by an unrecorded block within the extent of volume
descriptor sequence. Currently we errored out in such case making such
volumes unmountable. Handle that case by treating any invalid block as a
block terminating the sequence.

Reported-by: Pali Rohár <pali.rohar@gmail.com>
Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-16 11:15:09 +01:00
Jan Kara
7b568cba4f udf: Simplify handling of Volume Descriptor Pointers
According to ECMA-167 3/8.4.2 Volume Descriptor Pointer is terminating
current extent of Volume Descriptor Sequence. Also according to ECMA-167
3/8.4.3 Volume Descriptor Sequence Number is not significant for Volume
Descriptor Pointers. Simplify the handling of Volume Descriptor Pointers
to take this into account.

Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-16 11:15:04 +01:00
Jan Kara
91c9c9ec54 udf: Fix off-by-one in volume descriptor sequence length
We pass one block beyond end of volume descriptor sequence into
process_sequence() as 'lastblock' instead of the last block of the
sequence. When the sequence is not terminated with TD descriptor, this
could lead to false errors due to invalid blocks in volume descriptor
sequence and thus unmountable volumes.

Acked-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-16 11:14:41 +01:00
Kirill Tkhai
24dce0800b net: Export open_related_ns()
This function will be used to obtain net of tun device.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-15 15:34:42 -05:00
Linus Torvalds
e525de3ab0 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes all across the map:

   - /proc/kcore vsyscall related fixes
   - LTO fix
   - build warning fix
   - CPU hotplug fix
   - Kconfig NR_CPUS cleanups
   - cpu_has() cleanups/robustification
   - .gitignore fix
   - memory-failure unmapping fix
   - UV platform fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages
  x86/error_inject: Make just_return_func() globally visible
  x86/platform/UV: Fix GAM Range Table entries less than 1GB
  x86/build: Add arch/x86/tools/insn_decoder_test to .gitignore
  x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
  x86/mm/kcore: Add vsyscall page to /proc/kcore conditionally
  vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
  x86/Kconfig: Further simplify the NR_CPUS config
  x86/Kconfig: Simplify NR_CPUS config
  x86/MCE: Fix build warning introduced by "x86: do not use print_symbol()"
  x86/cpufeature: Update _static_cpu_has() to use all named variables
  x86/cpufeature: Reindent _static_cpu_has()
2018-02-14 17:31:51 -08:00
Linus Torvalds
6556677a80 Merge tag 'gfs2-4.16.rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
 "Fix regressions in the gfs2 iomap for block_map implementation we
  recently discovered in commit 3974320ca6"

* tag 'gfs2-4.16.rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fixes to "Implement iomap for block_map"
2018-02-14 10:14:59 -08:00
Kirill Tkhai
e1603b6eff inotify: Extend ioctl to allow to request id of new watch descriptor
Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().

CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:

	while (1) {
		int wd;

		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0) {
			break;
		} else if (wd == desired_wd_id) {
			ret = 0;
			break;
		}

		inotify_rm_watch(inotify_fd, wd);
	}

(You may find the actual code at the below link:
 https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

The cycle is suboptiomal and very expensive, but since there is no better
kernel support, it was the only way to restore that. Happily, we had met
mostly descriptors with small id, and this approach had worked somehow.

But recent time containers with inotify with big watch descriptors
begun to come, and this way stopped to work at all. When descriptor id
is something about 0x34d71d6, the restoring process spins in busy loop
for a long time, and the restore hungs and delay of migration from node
to node could easily be watched.

This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.

The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.

v2: Use INT_MAX instead of custom definition of max id,
as IDR subsystem guarantees id is between 0 and INT_MAX.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-14 11:16:28 +01:00
Andreas Gruenbacher
49edd5bf42 gfs2: Fixes to "Implement iomap for block_map"
It turns out that commit 3974320ca6 "Implement iomap for block_map"
introduced a few bugs that trigger occasional failures with xfstest
generic/476:

In gfs2_iomap_begin, we jump to do_alloc when we determine that we are
beyond the end of the allocated metadata (height > ip->i_height).
There, we can end up calling hole_size with a metapath that doesn't
match the current metadata tree, which doesn't make sense.  After
untangling the code at do_alloc, fix this by checking if the block we
are looking for is within the range of allocated metadata.

In addition, add a BUG() in case gfs2_iomap_begin is accidentally called
for reading stuffed files: this is handled separately.  Make sure we
don't truncate iomap->length for reads beyond the end of the file; in
that case, the entire range counts as a hole.

Finally, revert to taking a bitmap write lock when doing allocations.
It's unclear why that change didn't lead to any failures during testing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-02-13 13:38:10 -07:00
Kirill Tkhai
f039e184bc net: Convert proc_net_ns_ops
This patch starts to convert pernet_subsys, registered
before initcalls.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
{un,}register pernet net->proc_net and ->proc_net_stat.

Constructors and destructors of another pernet_operations
are not interested in foreign net's proc_net and proc_net_stat.
Proc filesystem privitives are synchronized on proc_subdir_lock.

So, proc_net_ns_ops methods are able to be executed
in parallel with methods of any other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:05 -05:00
Jia Zhang
595dd46ebf vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
Commit:

  df04abfd18 ("fs/proc/kcore.c: Add bounce buffer for ktext data")

... introduced a bounce buffer to work around CONFIG_HARDENED_USERCOPY=y.
However, accessing the vsyscall user page will cause an SMAP fault.

Replace memcpy() with copy_from_user() to fix this bug works, but adding
a common way to handle this sort of user page may be useful for future.

Currently, only vsyscall page requires KCORE_USER.

Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1518446694-21124-2-git-send-email-zhang.jia@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 09:15:58 +01:00
Denys Vlasenko
9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00