Commit Graph

48450 Commits

Author SHA1 Message Date
Jaegeuk Kim
bbf156f7af f2fs: fix lost xattrs of directories
This patch enhances the xattr consistency of dirs from suddern power-cuts.

Possible scenario would be:
1. dir->setxattr used by per-file encryption
2. file->setxattr goes into inline_xattr
3. file->fsync

In that case, we should do checkpoint for #1.
Otherwise we'd lose dir's key information for the file given #2.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:39 -07:00
Chao Yu
275b66b09e f2fs: support async discard
Like most filesystems, f2fs will issue discard command synchronously, so
when user trigger fstrim through ioctl, multiple discard commands will be
issued serially with sync mode, which makes poor performance.

In this patch we try to support async discard, so that all discard
commands can be issued and be waited for endio in batch to improve
performance.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:38 -07:00
Shuoran Liu
167451efb5 f2fs: set encryption name flag in add inline entry path
This patch sets encryption name flag in the add inline entry path
if filename is encrypted.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:37 -07:00
Chao Yu
e06f86e61d f2fs crypto: avoid unneeded memory allocation in ->readdir
When decrypting dirents in ->readdir, fscrypt_fname_disk_to_usr won't
change content of original encrypted dirent, we don't need to allocate
additional buffer for storing mirror of it, so get rid of it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:36 -07:00
Chao Yu
9421d57051 f2fs: fix to do security initialization of encrypted inode with original filename
When creating new inode, security_inode_init_security will be called for
initializing security info related to the inode, and filename is passed to
security module, it helps security module such as SElinux to know which
rule or label could be applied for the inode with specified name.

Previously, if new inode is created as an encrypted one, f2fs will transfer
encrypted filename to security module which may fail the check of security
policy belong to the inode. So in order to this issue, alter to transfer
original unencrypted filename instead.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:35 -07:00
Chao Yu
7ea984b060 f2fs: do in batch synchronously readahead during GC
In order to enhance performance, we try to readahead node page during
GC, but before loading node page we should get block address of node page
which is stored in NAT table, so synchronously read of single NAT page
block our readahead flow.

f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xa1e, oldaddr = 0xa1e, newaddr = 0xa1e, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x35e9, oldaddr = 0x72d7a, newaddr = 0x72d7a, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0xc1f, oldaddr = 0xc1f, newaddr = 0xc1f, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x389d, oldaddr = 0x72d7d, newaddr = 0x72d7d, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3a82, oldaddr = 0x72d7f, newaddr = 0x72d7f, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x3bfa, oldaddr = 0x72d86, newaddr = 0x72d86, rw = READAHEAD ^H, type = NODE

This patch adds one phase that do readahead NAT pages in batch before
readahead node page for more effeciently.

f2fs_submit_page_bio: dev = (251,0), ino = 2, page_index = 0x1952, oldaddr = 0x1952, newaddr = 0x1952, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc34, oldaddr = 0xc34, newaddr = 0xc34, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa33, oldaddr = 0xa33, newaddr = 0xa33, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc30, oldaddr = 0xc30, newaddr = 0xc30, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc32, oldaddr = 0xc32, newaddr = 0xc32, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc26, oldaddr = 0xc26, newaddr = 0xc26, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa2b, oldaddr = 0xa2b, newaddr = 0xa2b, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc23, oldaddr = 0xc23, newaddr = 0xc23, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc24, oldaddr = 0xc24, newaddr = 0xc24, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xa10, oldaddr = 0xa10, newaddr = 0xa10, rw = READ_SYNC(MP), type = META
f2fs_submit_page_mbio: dev = (251,0), ino = 2, page_index = 0xc2c, oldaddr = 0xc2c, newaddr = 0xc2c, rw = READ_SYNC(MP), type = META
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db7, oldaddr = 0x6be00, newaddr = 0x6be00, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5db9, oldaddr = 0x6be17, newaddr = 0x6be17, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dbc, oldaddr = 0x6be1a, newaddr = 0x6be1a, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc3, oldaddr = 0x6be20, newaddr = 0x6be20, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc7, oldaddr = 0x6be24, newaddr = 0x6be24, rw = READAHEAD ^H, type = NODE
f2fs_submit_page_bio: dev = (251,0), ino = 1, page_index = 0x5dc9, oldaddr = 0x6be25, newaddr = 0x6be25, rw = READAHEAD ^H, type = NODE

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:34 -07:00
Chao Yu
74fa5f3d43 f2fs: schedule in between two continous batch discards
In batch discard approach of fstrim will grab/release gc_mutex lock
repeatly, it makes contention of the lock becoming more intensive.

So after one batch discards were issued in checkpoint and the lock
was released, it's better to do schedule() to increase opportunity
of grabbing gc_mutex lock for other competitors.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-07 17:27:33 -07:00
Chris Mason
b7f3c7d345 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8 2016-09-07 12:55:36 -07:00
David Howells
5a42976d4f rxrpc: Add tracepoint for working out where aborts happen
Add a tracepoint for working out where local aborts happen.  Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.

rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 16:34:40 +01:00
Christophe JAILLET
240c5185c5 jfs: Simplify code
Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to
'list_splice_init'.

This has been spotted with the following coccinelle script:
/////
@@
expression y,z;
@@

-   list_splice(y,z);
-   INIT_LIST_HEAD(y);
+   list_splice_init(y,z);

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2016-09-06 12:17:24 -05:00
Jan Kara
f27792f5b7 udf: Remove useless check in udf_adinicb_write_begin()
As Al properly points out, len is guaranteed to be smaller than
PAGE_SIZE when we reach udf_adinicb_write_begin() as otherwise we would
have converted the file to the normal format.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-06 18:04:40 +02:00
Wang Xiaoguang
ce129655c9 btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.

	ticket = list_first_entry(&space_info->tickets,
				  struct reserve_ticket, list);
	if (last_ticket == ticket) {
		flush_state++;
	} else {
		last_ticket = ticket;
		flush_state = FLUSH_DELAYED_ITEMS_NR;
		if (commit_cycles)
			commit_cycles--;
	}

But indeed it's wrong, we should not rely on local variable's address to
do this check, because addresses may be same. In my test environment, I
dd one 168MB file in a 256MB fs, found that for this file, every time
wait_reserve_ticket() called, local variable ticket's address is same,

For above codes, assume a previous ticket's address is addrA, last_ticket
is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
wake up it, then another ticket is added, but with the same address addrA,
now last_ticket will be same to current ticket, then current ticket's flush
work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
which may result in some enospc issues(I have seen this in my test machine).

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-06 16:31:43 +02:00
Chris Mason
cbd60aa7cd Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
We use a btrfs_log_ctx structure to pass information into the
tree log commit, and get error values out.  It gets added to a per
log-transaction list which we walk when things go bad.

Commit d1433debe added an optimization to skip waiting for the log
commit, but didn't take root_log_ctx out of the list.  This
patch makes sure we remove things before exiting.

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: d1433debe7
cc: stable@vger.kernel.org # 3.15+
2016-09-06 05:57:25 -07:00
Dmitry Monakhov
e22834f024 ext4: improve ext4lazyinit scalability
ext4lazyinit is a global thread. This thread performs itable
initalization under li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because  lazy_thread holds it during full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
block on sb_start_write() so other mount/umount will be stuck forever.

This patch changes logic like follows:
- grab ->s_umount read sem before processing new li_request.
  After that it is safe to drop li_list_mtx because all callers of
  li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:38:36 -04:00
Jan Kara
6ae4c5a698 ext4: cleanup ext4_sync_parent()
A condition !hlist_empty(&inode->i_dentry) is always true for open file.
Just remove it. Also ext4_sync_parent() could use some explanation why
races with rmdir() are not an issue - add a comment explaining that.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:21:43 -04:00
Kaho Ng
0b7b77791c ext4: remove old feature helpers
Use the ext4_{has,set,clear}_feature_* helpers to replace the old
feature helpers.

Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-05 23:11:58 -04:00
Jan Kara
49da939272 ext4: enable quota enforcement based on mount options
When quota information is stored in quota files, we enable only quota
accounting on mount and enforcement is enabled only in response to
Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
possibility to enable quota enforcement on mount by specifying
corresponding quota mount option (usrquota, grpquota, prjquota).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:08:16 -04:00
Daeho Jeong
93e3b4e663 ext4: reinforce check of i_dtime when clearing high fields of uid and gid
Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
of deleted and evicted inode to fix up interoperability with old
kernels. However, it checks only i_dtime of an inode to determine
whether the inode was deleted and evicted, and this is very risky,
because i_dtime can be used for the pointer maintaining orphan inode
list, too. We need to further check whether the i_dtime is being
used for the orphan inode list even if the i_dtime is not NULL.

We found that high 16-bit fields of uid/gid of inode are unintentionally
and permanently cleared when the inode truncation is just triggered,
but not finished, and the inode metadata, whose high uid/gid bits are
cleared, is written on disk, and the sudden power-off follows that
in order.

Cc: stable@vger.kernel.org
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 22:56:10 -04:00
Wang Xiaoguang
ed7a694839 btrfs: do not decrease bytes_may_use when replaying extents
When replaying extents, there is no need to update bytes_may_use
in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
WARN_ON about bytes_may_use.

Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-05 17:40:41 +02:00
Nicolas Iooss
0f5aa88a7b ceph: do not modify fi->frag in need_reset_readdir()
Commit f3c4ebe65e ("ceph: using hash value to compose dentry offset")
modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
comparison operator with an assignment one.

This looks like a typo which is reported by clang when building the
kernel with some warning flags:

    fs/ceph/dir.c:600:22: error: using the result of an assignment as a
    condition without parentheses [-Werror,-Wparentheses]
            } else if (fi->frag |= fpos_frag(new_pos)) {
                       ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
    fs/ceph/dir.c:600:22: note: place parentheses around the assignment
    to silence this warning
            } else if (fi->frag |= fpos_frag(new_pos)) {
                                ^
                       (                             )
    fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
    assignment into an inequality comparison
            } else if (fi->frag |= fpos_frag(new_pos)) {
                                ^~
                                !=

Fixes: f3c4ebe65e ("ceph: using hash value to compose dentry offset")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-09-05 14:30:35 +02:00
Miklos Szeredi
e1ff3dd1ae ovl: fix workdir creation
Workdir creation fails in latest kernel.

Fix by allowing EOPNOTSUPP as a valid return value from
vfs_removexattr(XATTR_NAME_POSIX_ACL_*).  Upper filesystem may not support
ACL and still be perfectly able to support overlayfs.

Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: c11b9fdd6a ("ovl: remove posix_acl_default from workdir")
Cc: <stable@vger.kernel.org>
2016-09-05 13:55:20 +02:00
Greg Kroah-Hartman
2f5bb02ff2 Merge 4.8-rc5 into driver-core-next
We want the sysfs and kernfs in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-05 08:09:04 +02:00
Bhaktipriya Shridhar
434e612003 fs/afs/flock: Remove deprecated create_singlethread_workqueue
The workqueue "afs_lock_manager" queues work item &vnode->lock_work,
per vnode. Since there can be multiple vnodes and since their work items
can be executed concurrently, alloc_workqueue has been used to replace
the deprecated create_singlethread_workqueue instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure because the workqueue is being used on a memory reclaim
path.

Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Bhaktipriya Shridhar
4c136dae62 fs/afs/callback: Remove deprecated create_singlethread_workqueue
The workqueue "afs_callback_update_worker" queues multiple work items
viz  &vnode->cb_broken_work, &server->cb_break_work which require strict
execution ordering. Hence, an ordered dedicated workqueue has been used.

Since the workqueue is being used on a memory reclaim path, WQ_MEM_RECLAIM
has been set to ensure forward progress under memory pressure.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Bhaktipriya Shridhar
69ad052aec fs/afs/rxrpc: Remove deprecated create_singlethread_workqueue
The workqueue "afs_async_calls" queues work item
&call->async_work per afs_call. Since there could be multiple calls and since
these calls can be run concurrently, alloc_workqueue has been used to replace
the deprecated create_singlethread_workqueue instance.

The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
memory pressure because the workqueue is being used on a memory reclaim
path.

Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Bhaktipriya Shridhar
9ce4d7d385 fs/afs/vlocation: Remove deprecated create_singlethread_workqueue
The workqueue "afs_vlocation_update_worker" queues a single work item
&afs_vlocation_update and hence it doesn't require execution ordering.
Hence, alloc_workqueue has been used to replace the deprecated
create_singlethread_workqueue instance.

Since the workqueue is being used on a memory reclaim path, WQ_MEM_RECLAIM
flag has been set to ensure forward progress under memory pressure.

Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Trond Myklebust
334a8f3711 pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
If there are outstanding LAYOUTGET rpc calls, then we want to ensure that
we keep the layout stateid around so we that don't inadvertently pick up
an old/misordered sequence id.
The race is as follows:

Client				Server
======				======
LAYOUTGET(seqid)
LAYOUTGET(seqid)
				return LAYOUTGET(seqid+1)
				return LAYOUTGET(seqid+2)
process LAYOUTGET(seqid+2)
	forget layout
process LAYOUTGET(seqid+1)

If it forgets the layout stateid before processing seqid+1, then
the client will not check the layout->plh_barrier, and so will set
the stateid with seqid+1.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-04 12:59:00 -04:00
Linus Torvalds
4b30b6d126 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I'm still prepping a set of fixes for btrfs fsync, just nailing down a
  hard to trigger memory corruption.  For now, these are tested and ready."

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
  Btrfs: fix endless loop in balancing block groups
  Btrfs: kill invalid ASSERT() in process_all_refs()
2016-09-03 12:40:45 -07:00
Linus Torvalds
41488202f1 Merge tag 'driver-core-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
 "Here are three small fixes for 4.8-rc5.

  One for sysfs, one for kernfs, and one documentation fix, all for
  reported issues.  All of these have been in linux-next for a while"

* tag 'driver-core-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  sysfs: correctly handle read offset on PREALLOC attrs
  documentation: drivers/core/of: fix name of of_node symlink
  kernfs: don't depend on d_find_any_alias() when generating notifications
2016-09-03 11:36:55 -07:00
Linus Torvalds
3e423945ea devpts: return NULL pts 'priv' entry for non-devpts nodes
In commit 8ead9dd547 ("devpts: more pty driver interface cleanups") I
made devpts_get_priv() just return the dentry->fs_data directly.  And
because I thought it wouldn't happen, I added a warning if you ever saw
a pts node that wasn't on devpts.

And no, that warning never triggered under any actual real use, but you
can trigger it by creating nonsensical pts nodes by hand.

So just revert the warning, and make devpts_get_priv() return NULL for
that case like it used to.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org # 4.6+
Cc: Eric W Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-03 11:02:50 -07:00
Trond Myklebust
52ec7be2e2 pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
If the server fails to set lrp->res.lrs_present in the LAYOUTRETURN reply,
then that means it believes the client holds no more layout state for that
file, and that the layout stateid is now invalid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 12:10:38 -04:00
Trond Myklebust
2a59a04116 pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
If the layout was marked as invalid, we want to ensure to initialise
the layout header fields correctly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 12:10:37 -04:00
Trond Myklebust
bf0291dd22 pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
According to RFC5661, the client is responsible for serialising
LAYOUTGET and LAYOUTRETURN to avoid ambiguity. Consider the case
where we send both in parallel.

Client					Server
======					======
LAYOUTGET(seqid=X)
LAYOUTRETURN(seqid=X)
					LAYOUTGET return seqid=X+1
					LAYOUTRETURN return seqid=X+2
Process LAYOUTRETURN
          Forget layout stateid
Process LAYOUTGET
          Set seqid=X+1

The client processes the layoutget/layoutreturn in the wrong order,
and since the result of the layoutreturn was to clear the only
existing layout segment, the client forgets the layout stateid.

When the LAYOUTGET comes in, it is treated as having a completely
new stateid, and so the client sets the wrong sequence id...

Fix is to check if there are outstanding LAYOUTGET requests
before we send the LAYOUTRETURN (note that LAYOUGET will already
wait if it sees an outstanding LAYOUTRETURN).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 12:10:37 -04:00
Trond Myklebust
c49edecd51 NFS: Fix error reporting in nfs_file_write()
When doing O_DSYNC writes, the actual write errors are reported through
generic_write_sync(), so we must test the result.

Reported-by: J. R. Okajima <hooanon05g@gmail.com>
Fixes: 18290650b1 ("NFS: Move buffered I/O locking into nfs_file_write()")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 12:10:36 -04:00
Linus Torvalds
f28929ba36 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Most of this is regression fixes for posix acl behavior introduced in
  4.8-rc1 (these were caught by the pjd-fstest suite).  The are also
  miscellaneous fixes marked as stable material and cleanups.

  Other than overlayfs code, it touches <linux/fs.h> to add a constant
  with which to disable posix acl caching.  No changes needed to the
  actual caching code, it automatically does the right thing, although
  later we may want to optimize this case.

  I'm now testing overlayfs with the following test suites to catch
  regressions:

   - unionmount-testsuite
   - xfstests
   - pjd-fstest"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update doc
  ovl: listxattr: use strnlen()
  ovl: Switch to generic_getxattr
  ovl: copyattr after setting POSIX ACL
  ovl: Switch to generic_removexattr
  ovl: Get rid of ovl_xattr_noacl_handlers array
  ovl: Fix OVL_XATTR_PREFIX
  ovl: fix spelling mistake: "directries" -> "directories"
  ovl: don't cache acl on overlay layer
  ovl: use cached acl on underlying layer
  ovl: proper cleanup of workdir
  ovl: remove posix_acl_default from workdir
  ovl: handle umask and posix_acl_default correctly on creation
  ovl: don't copy up opaqueness
2016-09-02 09:32:15 -07:00
David Howells
d001648ec7 rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

     int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent.  It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them.  They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
    	rxrpc_kernel_get_abort_code()
    	rxrpc_kernel_get_error_number()
    	rxrpc_kernel_free_skb()
    	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.  To process the queue internally, a temporary function,
temp_deliver_data() has been added.  This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:43:27 -07:00
Linus Torvalds
511a8cdb65 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small patches to fix some bugs with the audit-by-executable
  functionality we introduced back in v4.3 (both patches are marked
  for the stable folks)"

* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix exe_file access in audit_exe_compare
  mm: introduce get_task_exe_file
2016-09-01 15:55:56 -07:00
Linus Torvalds
7d1ce606a3 Merge tag 'xfs-iomap-for-linus-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs and iomap fixes from Dave Chinner:
 "Most of these changes are small regression fixes that address problems
  introduced in the 4.8-rc1 window.  The two fixes that aren't (IO
  completion fix and superblock inprogress check) are fixes for problems
  introduced some time ago and need to be pushed back to stable kernels.

  Changes in this update:
   - iomap FIEMAP_EXTENT_MERGED usage fix
   - additional mount-time feature restrictions
   - rmap btree query fixes
   - freeze/unmount io completion workqueue fix
   - memory corruption fix for deferred operations handling"

* tag 'xfs-iomap-for-linus-4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: track log done items directly in the deferred pending work item
  iomap: don't set FIEMAP_EXTENT_MERGED for extent based filesystems
  xfs: prevent dropping ioend completions during buftarg wait
  xfs: fix superblock inprogress check
  xfs: simple btree query range should look right if LE lookup fails
  xfs: fix some key handling problems in _btree_simple_query_range
  xfs: don't log the entire end of the AGF
  xfs: disallow mounting of realtime + rmap filesystems
  xfs: don't perform lookups on zero-height btrees
2016-09-01 15:33:16 -07:00
Wang Xiaoguang
e0af24849e btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()
If can_overcommit() in btrfs_calc_reclaim_metadata_size() returns true,
btrfs_async_reclaim_metadata_space() will not reclaim metadata space, just
return directly and also forget to wake up process which are waiting for
their tickets, so these processes will wait endlessly.

Fstests case generic/172 with mount option "-o compress=lzo" have revealed
this bug in my test machine. Here if we have tickets to handle, we must
handle them first.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:23:24 +02:00
Liu Bo
a9b1fc851d Btrfs: fix endless loop in balancing block groups
Qgroup function may overwrite the saved error 'err' with 0
in case quota is not enabled, and this ends up with a
endless loop in balance because we keep going back to balance
the same block group.

It really should use 'ret' instead.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:16:47 +02:00
Josef Bacik
3dc09ec895 Btrfs: kill invalid ASSERT() in process_all_refs()
Suppose you have the following tree in snap1 on a file system mounted with -o
inode_cache so that inode numbers are recycled

└── [    258]  a
    └── [    257]  b

and then you remove b, rename a to c, and then re-create b in c so you have the
following tree

└── [    258]  c
    └── [    257]  b

and then you try to do an incremental send you will hit

ASSERT(pending_move == 0);

in process_all_refs().  This is because we assume that any recycling of inodes
will not have a pending change in our path, which isn't the case.  This is the
case for the DELETE side, since we want to remove the old file using the old
path, but on the create side we could have a pending move and need to do the
normal pending rename dance.  So remove this ASSERT() and put a comment about
why we ignore pending_move.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-01 17:16:47 +02:00
Miklos Szeredi
7cb35119d0 ovl: listxattr: use strnlen()
Be defensive about what underlying fs provides us in the returned xattr
list buffer.  If it's not properly null terminated, bail out with a warning
insead of BUG.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-09-01 11:12:00 +02:00
Andreas Gruenbacher
0eb45fc3bb ovl: Switch to generic_getxattr
Now that overlayfs has xattr handlers for iop->{set,remove}xattr, use
those same handlers for iop->getxattr as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:12:00 +02:00
Miklos Szeredi
ce31513a91 ovl: copyattr after setting POSIX ACL
Setting POSIX acl may also modify the file mode, so need to copy that up to
the overlay inode.

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:12:00 +02:00
Andreas Gruenbacher
0e585ccc13 ovl: Switch to generic_removexattr
Commit d837a49bd5 ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well.  As far as permission
checking goes, the same rules should apply in either case.

While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.

Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:12:00 +02:00
Andreas Gruenbacher
0c97be22f9 ovl: Get rid of ovl_xattr_noacl_handlers array
Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
in ovl_xattr_handlers, like the other filesystems do.  Flag the code
that is now only used conditionally with __maybe_unused.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Andreas Gruenbacher
fe2b759523 ovl: Fix OVL_XATTR_PREFIX
Make sure ovl_own_xattr_handler only matches attribute names starting
with "overlay.", not "overlayXXX".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Colin Ian King
fd36570a88 ovl: fix spelling mistake: "directries" -> "directories"
Trivial fix to spelling mistake in pr_err message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Miklos Szeredi
2a3a2a3f35 ovl: don't cache acl on overlay layer
Some operations (setxattr/chmod) can make the cached acl stale.  We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether.  Preventing caching has the following
advantages:

 - no double caching, less memory used

 - overlay cache doesn't go stale when fs clears it's own cache

Possible disadvantage is performance loss.  If that becomes a problem
get_acl() can be optimized for overlayfs.

This patch disables caching by pre setting i_*acl to a value that

  - has bit 0 set, so is_uncached_acl() will return true

  - is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it

The constant -3 was chosen for this purpose.

Fixes: 39a25b2b37 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Miklos Szeredi
5201dc449e ovl: use cached acl on underlying layer
Instead of calling ->get_acl() directly, use get_acl() to get the cached
value.

We will have the acl cached on the underlying inode anyway, because we do
permission checking on the both the overlay and the underlying fs.

So, since we already have double caching, this improves performance without
any cost.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00