Commit Graph

49262 Commits

Author SHA1 Message Date
Trond Myklebust
d8fdb47fae NFS: Don't let readdirplus revalidate an inode that was marked as stale
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
2d148c7e84 NFSv4.1/pnfs: Mark the layout stateid invalid when all segments are removed
According to RFC5661, section 12.5.3. the layout stateid is no longer
valid once the client no longer holds any layout segments. Ensure that
we mark it invalid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
cbebaf897e NFS: Fix a double page unlock
Since commit 0bcbf039f6, nfs_readpage_release() has been used to
unlock the page in the read code.

Fixes: 0bcbf039f6 ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Weston Andros Adamson
5e3a98883e pnfs_nfs: fix _cancel_empty_pagelist
pnfs_generic_commit_cancel_empty_pagelist calls nfs_commitdata_release,
but that is wrong: nfs_commitdata_release puts the open context, something
that isn't valid until nfs_init_commit is called, which is never the case
when pnfs_generic_commit_cancel_empty_pagelist is called.

This was introduced in "nfs: avoid race that crashes nfs_init_commit".

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Oleg Drokin
cea7f829d3 nfs4: Fix potential use after free of state in nfs4_do_reclaim.
Commit e8d975e73e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ #112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
d2a7de0b34 NFS: Fix up O_DIRECT results
if we read or wrote something, we must report it

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
dd1beb3d16 NFS/pnfs: handle bad delegation stateids in nfs4_layoutget_handle_exception
We must call nfs4_handle_exception() on BAD_STATEID errors. The only
exception is if the stateid argument turns out to be a layout stateid
that is declared invalid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
e5241e4388 NFSv4.1/pnfs: Add sparse lock annotations for pnfs_find_alloc_layout
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
67a3b72146 NFSv4.1/pnfs: Layout stateids start out as being invalid
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Trond Myklebust
bc23676caf NFSv4.1/pnfs: Ensure we handle delegation errors in nfs4_proc_layoutget()
nfs4_handle_exception() relies on the caller setting the 'inode' field
in the struct nfs4_exception argument when the error applies to a
delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-06-24 12:01:00 -04:00
Andy Lutomirski
380cf5ba6b fs: Treat foreign mounts as nosuid
If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.  Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID.  The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents.  If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-24 10:40:41 -05:00
Zheng Lv
34df117414 fat: fix error message for bogus number of directory entries
"bogus directory-entries per block" was reported for what was instead
bogus number of directory entries. The message also mismatched the
argument passed to printk(), which was sbi->dir_entries.

Fix this by replacing the message with "bogus number of directory
entries". printk() argument was kept unchanged.

Signed-off-by: Zheng Lv <lv.zheng.2015@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-06-24 14:28:01 +02:00
Linus Torvalds
63c04ee7d3 Merge tag 'upstream-4.7-rc5' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS fixes from Richard Weinberger:
 "This contains fixes for two critical bugs in UBI and UBIFS:

   - fix the possibility of losing data upon a power cut when UBI tries
     to recover from a write error

   - fix page migration on UBIFS.  It turned out that the default page
     migration function is not suitable for UBIFS"

* tag 'upstream-4.7-rc5' of git://git.infradead.org/linux-ubifs:
  UBIFS: Implement ->migratepage()
  mm: Export migrate_page_move_mapping and migrate_page_copy
  ubi: Make recover_peb power cut aware
  gpio: make library immune to error pointers
  gpio: make sure gpiod_to_irq() returns negative on NULL desc
  gpio: 104-idi-48: Fix missing spin_lock_init for ack_lock
2016-06-23 22:48:48 -07:00
Luis de Bethencourt
a6b6befbb2 cifs: check hash calculating succeeded
calc_lanman_hash() could return -ENOMEM or other errors, we should check
that everything went fine before using the calculated key.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-06-23 23:45:17 -05:00
Jerome Marchand
b8da344b74 cifs: dynamic allocation of ntlmssp blob
In sess_auth_rawntlmssp_authenticate(), the ntlmssp blob is allocated
statically and its size is an "empirical" 5*sizeof(struct
_AUTHENTICATE_MESSAGE) (320B on x86_64). I don't know where this value
comes from or if it was ever appropriate, but it is currently
insufficient: the user and domain name in UTF16 could take 1kB by
themselves. Because of that, build_ntlmssp_auth_blob() might corrupt
memory (out-of-bounds write). The size of ntlmssp_blob in
SMB2_sess_setup() is too small too (sizeof(struct _NEGOTIATE_MESSAGE)
+ 500).

This patch allocates the blob dynamically in
build_ntlmssp_auth_blob().

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2016-06-23 23:45:07 -05:00
Jerome Marchand
202d772ba0 cifs: use CIFS_MAX_DOMAINNAME_LEN when converting the domain name
Currently in build_ntlmssp_auth_blob(), when converting the domain
name to UTF16, CIFS_MAX_USERNAME_LEN limit is used. It should be
CIFS_MAX_DOMAINNAME_LEN. This patch fixes this.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-06-23 23:44:56 -05:00
Jeff Layton
3d22462ae9 cifs: stuff the fl_owner into "pid" field in the lock request
Right now, we send the tgid cross the wire. What we really want to send
though is a hashed fl_owner_t since samba treats this field as a generic
lockowner.

It turns out that because we enforce and release locks locally before
they are ever sent to the server, this patch makes no difference in
behavior. Still, setting OFD locks on the server using the process
pid seems wrong, so I think this patch still makes sense.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Pavel Shilovsky <pshilovsky@samba.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2016-06-23 23:44:44 -05:00
Eric W. Biederman
cc50a07a24 userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
Now that SB_I_NODEV controls the nodev behavior devpts can just clear
this flag during mount.  Simplifying the code and making it easier
to audit how the code works.  While still preserving the invariant
that s_iflags is only modified during mount.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:47:31 -05:00
Eric W. Biederman
67690f937c userns: Remove implicit MNT_NODEV fragility.
Replace the implict setting of MNT_NODEV on mounts that happen with
just user namespace permissions with an implicit setting of SB_I_NODEV
in s_iflags.  The visibility of the implicit MNT_NODEV has caused
problems in the past.

With this change the fragile case where an implicit MNT_NODEV needs to
be preserved in do_remount is removed.  Using SB_I_NODEV is much less
fragile as s_iflags are set during the original mount and never
changed.

In do_new_mount with the implicit setting of MNT_NODEV gone, the only
code that can affect mnt_flags is fs_fully_visible so simplify the if
statement and reduce the indentation of the code to make that clear.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:47:23 -05:00
Eric W. Biederman
a1935c1738 mnt: Simplify mount_too_revealing
Verify all filesystems that we check in mount_too_revealing set
SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags.  That is true for today
and it should remain true in the future.

Remove the now unnecessary checks from mnt_already_visibile that
ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
preserved.  Making the code shorter and easier to read.

Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current
systems where proc and sysfs are mounted with "nosuid, nodev, noexec"
and several slightly buggy container applications don't bother to
set those flags continue to work.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:57 -05:00
Eric W. Biederman
a2982cc922 vfs: Generalize filesystem nodev handling.
Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV.  Use this new function in all of the
places where MNT_NODEV was previously tested.

Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong.  With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:57 -05:00
Eric W. Biederman
29a517c232 kernfs: The cgroup filesystem also benefits from SB_I_NOEXEC
The cgroup filesystem is in the same boat as sysfs.  No one ever
permits executables of any kind on the cgroup filesystem, and there is
no reasonable future case to support executables in the future.

Therefore move the setting of SB_I_NOEXEC which makes the code proof
against future mistakes of accidentally creating executables from
sysfs to kernfs itself.  Making the code simpler and covering the
sysfs, cgroup, and cgroup2 filesystems.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:56 -05:00
Eric W. Biederman
a001e74cef mnt: Move the FS_USERNS_MOUNT check into sget_userns
Allowing a filesystem to be mounted by other than root in the initial
user namespace is a filesystem property not a mount namespace property
and as such should be checked in filesystem specific code.  Move the
FS_USERNS_MOUNT test into super.c:sget_userns().

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:55 -05:00
Eric W. Biederman
6e4eab577a fs: Add user namespace member to struct super_block
Start marking filesystems with a user namespace owner, s_user_ns.  In
this change this is only used for permission checks of who may mount a
filesystem.  Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.

The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
user_ns.

The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a
filesystem.

The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be
implemented.

The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information.  The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.

For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.

For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.

Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.

Getting the details of setting s_user_ns correct has been
a long process.  Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns.  Thanks to Seth Forshee who has kept the work alive.

Thanks-to: Seth Forshee <seth.forshee@canonical.com>
Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-23 15:41:55 -05:00
Eric W. Biederman
e94591d0d9 proc: Convert proc_mount to use mount_ns.
Move the call of get_pid_ns, the call of proc_parse_options, and
the setting of s_iflags into proc_fill_super so that mount_ns
can be used.

Convert proc_mount to call mount_ns and remove the now unnecessary
code.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:54 -05:00
Eric W. Biederman
d91ee87d8d vfs: Pass data, ns, and ns->userns to mount_ns
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks.  The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts.  Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:53 -05:00
Eric W. Biederman
8654df4e2a mnt: Refactor fs_fully_visible into mount_too_revealing
Replace the call of fs_fully_visible in do_new_mount from before the
new superblock is allocated with a call of mount_too_revealing after
the superblock is allocated.   This winds up being a much better location
for maintainability of the code.

The first change this enables is the replacement of FS_USERNS_VISIBLE
with SB_I_USERNS_VISIBLE.  Moving the flag from struct filesystem_type
to sb_iflags on the superblock.

Unfortunately mount_too_revealing fundamentally needs to touch
mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
times.  If the mnt_flags did not need to be touched the code
could be easily moved into the filesystem specific mount code.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:46 -05:00
Chandan Rajendra
b7f67055d2 Btrfs: Force stripesize to the value of sectorsize
Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:42 -07:00
Wang Xiaoguang
c0d2f6104e btrfs: fix disk_i_size update bug when fallocate() fails
When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

And below test case can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs  -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
	i=$((i + 1))
	dst_offset=$((blocksize * i))
	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
		testfile > /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:41 -07:00
Liu Bo
415b35a55b Btrfs: fix error handling in map_private_extent_buffer
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
   than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:40 -07:00
Wei Yongjun
04e1b65af2 Btrfs: fix error return code in btrfs_init_test_fs()
Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:39 -07:00
Amitoj Kaur Chawla
5c93f56f77 dlm: Use kmemdup instead of kmalloc and memcpy
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@

-  to = \(kmalloc\|kzalloc\)(size,flag);
+  to = kmemdup(from,size,flag);
   if (to==NULL || ...) S
-  memcpy(to, from, size);

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-06-23 11:55:58 -05:00
Josef Bacik
c6887cd111 Btrfs: don't do nocow check unless we have to
Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
	--end_fsync=1

we get the worst case scenario where we have to fall back on to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:57:14 -07:00
Chris Mason
0f873eca82 btrfs: fix deadlock in delayed_ref_async_start
"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code.  This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:54:18 -07:00
Josef Bacik
31b9655f43 Btrfs: track transid for delayed ref flushing
Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling.  This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore.  So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling.  If the transaction we
get is different then we can just bail out.  This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22 17:54:18 -07:00
Kirill A. Shutemov
4ac1c17b20 UBIFS: Implement ->migratepage()
During page migrations UBIFS might get confused
and the following assert triggers:
[  213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
[  213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
[  213.490000] Hardware name: Allwinner sun4i/sun5i Families
[  213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
[  213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
[  213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
[  213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
[  213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
[  213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
[  213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
[  213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
[  213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
[  213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
[  213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
[  213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
[  213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
[  213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
[  213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
[  213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
[  213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
[  213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
[  213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)

UBIFS is using PagePrivate() which can have different meanings across
filesystems. Therefore the generic page migration code cannot handle this
case correctly.
We have to implement our own migration function which basically does a
plain copy but also duplicates the page private flag.
UBIFS is not a block device filesystem and cannot use buffer_migrate_page().

Cc: stable@vger.kernel.org
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
[rw: Massaged changelog, build fixes, etc...]
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Christoph Hellwig <hch@lst.de>
2016-06-23 00:29:53 +02:00
Linus Torvalds
f9020d1741 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns fix from Eric Biederman:
 "This contains just a single small patch that fixes a tiny hole in the
  logic of allowing unprivileged mounting of proc and sysfs.

  In practice I don't think anyone is affected because having MNT_RDONLY
  clear in mnt->mnt_flags but MS_RDONLY set in sb->s_flags is very weird
  for a filesystem, and weirder for proc and sysfs.  However if it
  happens let's handle it correctly and then no one has to to worry
  about this crazy case"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  mnt: Account for MS_RDONLY in fs_fully_visible
2016-06-22 14:11:24 -07:00
Trond Myklebust
4f52b6bb8c NFS: Don't call COMMIT in ->releasepage()
While COMMIT has the potential to free up a lot of memory that is being
taken by unstable writes, it isn't guaranteed to free up this particular
page. Also, calling fsync() on the server is expensive and so we want to
do it in a more controlled fashion, rather than have it triggered at
random by the VM.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22 09:59:44 -04:00
Trond Myklebust
93761d9863 NFS: Don't hold the inode lock across fsync()
Commits are no longer required to be serialised.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22 09:59:43 -04:00
Trond Myklebust
811ed92ecc NFS: writepage of a single page should not be synchronous
It is almost always better to wait for more so that we can issue a
bulk commit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22 09:59:43 -04:00
Trond Myklebust
6b56a89833 NFS: Kill NFS_INO_NFS_INO_FLUSHING: it is a performance killer
filemap_datawrite() and friends already deal just fine with livelock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22 09:59:42 -04:00
Trond Myklebust
ca0daa277a NFS: Cache aggressively when file is open for writing
Unless the user is using file locking, we must assume close-to-open
cache consistency when the file is open for writing. Adjust the
caching algorithm so that it does not clear the cache on out-of-order
writes and/or attribute revalidations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-06-22 09:59:42 -04:00
Zhilong Liu
505ee5283c dlm: add log_info config option
This config option can be used to disable the
LOG_INFO recovery messages.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-06-21 09:04:24 -05:00
Zheng Lv
dcc381e833 fat: fix typo s/supeblock/superblock/
Signed-off-by: Zheng Lv <lv.zheng.2015@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-06-21 13:36:31 +02:00
Andreas Gruenbacher
84d86e6d30 ocfs: fix ocfs2_xattr_user_get() argument name
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-06-21 13:22:56 +02:00
Dave Chinner
f477cedc4e Merge branch 'xfs-4.8-misc-fixes-2' into for-next 2016-06-21 11:55:13 +10:00
Darrick J. Wong
19b54ee66c xfs: refactor btree maxlevels computation
Create a common function to calculate the maximum height of a per-AG
btree.  This will eventually be used by the rmapbt and refcountbt
code to calculate appropriate maxlevels values for each.  This is
important because the verifiers and the transaction block
reservations depend on accurate estimates of how many blocks are
needed to satisfy a btree split.

We were mistakenly using the max bnobt height for all the btrees,
which creates a dangerous situation since the larger records and
keys in an rmapbt make it very possible that the rmapbt will be
taller than the bnobt and so we can run out of transaction block
reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00
Darrick J. Wong
e66a4c678e xfs: convert list of extents to free into a regular list
In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00
Dave Chinner
4d89e20bf1 xfs: separate freelist fixing into a separate helper
Break up xfs_free_extent() into a helper that fixes the freelist.
This helper will be used subsequently to ensure the freelist during
deferred rmap processing.

[darrick: refactor to put this at the head of the patchset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00
Darrick J. Wong
59bad075bd xfs: rearrange xfs_bmap_add_free parameters
This is already in xfsprogs' libxfs, so port it to the kernel.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 11:53:28 +10:00