Commit Graph

49262 Commits

Author SHA1 Message Date
Andreas Gruenbacher
28ea06c46f gfs2: Avoid alignment hole in struct lm_lockname
Commit 88ffbf3e03 switches to using rhashtables for glocks, hashing over
the entire struct lm_lockname instead of its individual fields.  On some
architectures, struct lm_lockname contains a hole of uninitialized
memory due to alignment rules, which now leads to incorrect hash values.
Get rid of that hole.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
CC: <stable@vger.kernel.org> #v4.3+
2017-03-15 10:06:07 -04:00
Darrick J. Wong
630a04e79d xfs: verify inline directory data forks
When we're reading or writing the data fork of an inline directory,
check the contents to make sure we're not overflowing buffers or eating
garbage data.  xfs/348 corrupts an inline symlink into an inline
directory, triggering a buffer overflow bug.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
v2: add more checks consistent with _dir2_sf_check and make the verifier
usable from anywhere.
2017-03-15 00:24:25 -07:00
Linus Torvalds
ae50dfd616 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
    from Steffen Klassert.

 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
    Alexei Starovoitov.

 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
    Michal Schmidt.

 4) We can get a divide by zero when TCP socket are morphed into
    listening state, fix from Eric Dumazet.

 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
    skb_complete_tx_timestamp(). From Eric Dumazet.

 6) Use after free in dccp_feat_activate_values(), also from Eric
    Dumazet.

 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
    Jarod Wilson.

 8) Fix use after free in vrf_xmit(), from David Ahern.

 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
    Alexey Kodanev.

10) Properly check napi_complete_done() return value in order to decide
    whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
    Lendacky.

11) Fix double free of hwmon device in marvell phy driver, from Andrew
    Lunn.

12) Don't crash on malformed netlink attributes in act_connmark, from
    Etienne Noss.

13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
    from Sabrina Dubroca.

14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
    Florian Westphal.

15) Fix routing redirect races in dccp and tcp, basically the ICMP
    handler can't modify the socket's cached route in it's locked by the
    user at this moment. From Jon Maxwell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
  qed: Enable iSCSI Out-of-Order
  qed: Correct out-of-bound access in OOO history
  qed: Fix interrupt flags on Rx LL2
  qed: Free previous connections when releasing iSCSI
  qed: Fix mapping leak on LL2 rx flow
  qed: Prevent creation of too-big u32-chains
  qed: Align CIDs according to DORQ requirement
  mlxsw: reg: Fix SPVMLR max record count
  mlxsw: reg: Fix SPVM max record count
  net: Resend IGMP memberships upon peer notification.
  dccp: fix memory leak during tear-down of unsuccessful connection request
  tun: fix premature POLLOUT notification on tun devices
  dccp/tcp: fix routing redirect race
  ucc/hdlc: fix two little issue
  vxlan: fix ovs support
  net: use net->count to check whether a netns is alive or not
  bridge: drop netfilter fake rtable unconditionally
  ipv6: avoid write to a possibly cloned skb
  net: wimax/i2400m: fix NULL-deref at probe
  isdn/gigaset: fix NULL-deref at probe
  ...
2017-03-14 21:31:23 -07:00
Kinglong Mee
366a1569bf NFSv4: fix a reference leak caused WARNING messages
Because nfs4_opendata_access() has close the state when access is denied,
so the state isn't leak.
Rather than revert the commit a974deee47, I'd like clean the strange state close.

[ 1615.094218] ------------[ cut here ]------------
[ 1615.094607] WARNING: CPU: 0 PID: 23702 at lib/list_debug.c:31 __list_add_valid+0x8e/0xa0
[ 1615.094913] list_add double add: new=ffff9d7901d9f608, prev=ffff9d7901d9f608, next=ffff9d7901ee8dd0.
[ 1615.095458] Modules linked in: nfsv4(E) nfs(E) nfsd(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock f2fs snd_seq_midi snd_seq_midi_event fscrypto coretemp ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf vmw_balloon snd_ens1371 joydev gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs]
[ 1615.097663] CPU: 0 PID: 23702 Comm: fstest Tainted: G        W   E   4.11.0-rc1+ #517
[ 1615.098015] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1615.098807] Call Trace:
[ 1615.099183]  dump_stack+0x63/0x86
[ 1615.099578]  __warn+0xcb/0xf0
[ 1615.099967]  warn_slowpath_fmt+0x5f/0x80
[ 1615.100370]  __list_add_valid+0x8e/0xa0
[ 1615.100760]  nfs4_put_state_owner+0x75/0xc0 [nfsv4]
[ 1615.101136]  __nfs4_close+0x109/0x140 [nfsv4]
[ 1615.101524]  nfs4_close_state+0x15/0x20 [nfsv4]
[ 1615.101949]  nfs4_close_context+0x21/0x30 [nfsv4]
[ 1615.102691]  __put_nfs_open_context+0xb8/0x110 [nfs]
[ 1615.103155]  put_nfs_open_context+0x10/0x20 [nfs]
[ 1615.103586]  nfs4_file_open+0x13b/0x260 [nfsv4]
[ 1615.103978]  do_dentry_open+0x20a/0x2f0
[ 1615.104369]  ? nfs4_copy_file_range+0x30/0x30 [nfsv4]
[ 1615.104739]  vfs_open+0x4c/0x70
[ 1615.105106]  ? may_open+0x5a/0x100
[ 1615.105469]  path_openat+0x623/0x1420
[ 1615.105823]  do_filp_open+0x91/0x100
[ 1615.106174]  ? __alloc_fd+0x3f/0x170
[ 1615.106568]  do_sys_open+0x130/0x220
[ 1615.106920]  ? __put_cred+0x3d/0x50
[ 1615.107256]  SyS_open+0x1e/0x20
[ 1615.107588]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1615.107922] RIP: 0033:0x7fab599069b0
[ 1615.108247] RSP: 002b:00007ffcf0600d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 1615.108575] RAX: ffffffffffffffda RBX: 00007fab59bcfae0 RCX: 00007fab599069b0
[ 1615.108896] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 00007ffcf060255e
[ 1615.109211] RBP: 0000000000040010 R08: 0000000000000000 R09: 0000000000000016
[ 1615.109515] R10: 00000000000006a1 R11: 0000000000000246 R12: 0000000000041000
[ 1615.109806] R13: 0000000000040010 R14: 0000000000001000 R15: 0000000000002710
[ 1615.110152] ---[ end trace 96ed63b1306bf2f3 ]---

Fixes: a974deee47 ("NFSv4: Fix memory and state leak in...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-13 10:55:45 -04:00
Kinglong Mee
6f1f622019 nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
This typo cause a memory leak, and a bad client's group id.
unreferenced object 0xffff96d8073998d0 (size 8):
  comm "kworker/0:3", pid 34224, jiffies 4295361338 (age 761.752s)
  hex dump (first 8 bytes):
    30 00 39 07 d8 96 ff ff                          0.9.....
  backtrace:
    [<ffffffffb883212a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffffb8237bc0>] __kmalloc+0x140/0x220
    [<ffffffffc05c921c>] xdr_stream_decode_string_dup+0x7c/0x110 [sunrpc]
    [<ffffffffc08edcf0>] decode_getfattr_attrs+0x940/0x1630 [nfsv4]
    [<ffffffffc08eea7b>] decode_getfattr_generic.constprop.108+0x9b/0x100 [nfsv4]
    [<ffffffffc08eebaf>] nfs4_xdr_dec_open+0xcf/0x100 [nfsv4]
    [<ffffffffc05bf9c7>] rpcauth_unwrap_resp+0xa7/0xe0 [sunrpc]
    [<ffffffffc05afc70>] call_decode+0x1e0/0x810 [sunrpc]
    [<ffffffffc05bc64d>] __rpc_execute+0x8d/0x420 [sunrpc]
    [<ffffffffc05bc9f2>] rpc_async_schedule+0x12/0x20 [sunrpc]
    [<ffffffffb80bb077>] process_one_work+0x197/0x430
    [<ffffffffb80bb35e>] worker_thread+0x4e/0x4a0
    [<ffffffffb80c1d41>] kthread+0x101/0x140
    [<ffffffffb8839a5c>] ret_from_fork+0x2c/0x40
    [<ffffffffffffffff>] 0xffffffffffffffff

Fixes: 686a816ab6 ("NFSv4: Clean up owner/group attribute decode")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-03-13 10:39:10 -04:00
Tahsin Erdogan
4a3a485b1e writeback: fix memory leak in wb_queue_work()
When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work->auto_free is true, it should free the memory.

The leak condition can be reprouced by following these steps:

   mount /dev/sdb /mnt/sdb
   /* In qemu console: device_del sdb */
   umount /dev/sdb

Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.

Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-13 08:27:34 -06:00
NeilBrown
800a938f0b NFSD: fix nfsd_reset_versions for NFSv4.
If you write "-2 -3 -4" to the "versions" file, it will
notice that no versions are enabled, and nfsd_reset_versions()
is called.
This enables all major versions, not no minor versions.
So we lose the invariant that NFSv4 is only advertised when
at least one minor is enabled.

Fix the code to explicitly enable minor versions for v4,
change it to use nfsd_vers() to test and set, and simplify
the code.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-10 17:04:50 -05:00
NeilBrown
928c6fb3a9 NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
Current code will return 1 if the version is supported,
and -1 if it isn't.
This is confusing and inconsistent with the one place where this
is used.
So change to return 1 if it is supported, and zero if not.
i.e. an error is never returned.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-10 17:04:50 -05:00
NeilBrown
abcb4dacb0 NFSD: further refinement of content of /proc/fs/nfsd/versions
Prior to
  e35659f1b0 ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely.  Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently.  To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
  d3635ff07e ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too.  If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely.  Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2"  the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
 writing "-4" will disable all 4.x minor versions.
 writing "+4" will enable all 4.1 minor version if none are currently enabled.
    rpc.nfsd will list minor versions before major versions, so
      rpc.nfsd -V 4.2 -N 4.1
    will write "-4.1 +4.2 +2 +3 +4"
    so it would be a regression for "+4" to enable always all versions.
 reading "-4" implies that no v4.x are enabled
 reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
 "-4.0" is also present.  All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-10 17:04:50 -05:00
Kinglong Mee
c952cd4e94 nfsd: map the ENOKEY to nfserr_perm for avoiding warning
Now that Ext4 and f2fs filesystems support encrypted directories and
files, attempts to access those files may return ENOKEY, resulting in
the following WARNING.

Map ENOKEY to nfserr_perm instead of nfserr_io.

[ 1295.411759] ------------[ cut here ]------------
[ 1295.411787] WARNING: CPU: 0 PID: 12786 at fs/nfsd/nfsproc.c:796 nfserrno+0x74/0x80 [nfsd]
[ 1295.411806] nfsd: non-standard errno: -126
[ 1295.411816] Modules linked in: nfsd nfs_acl auth_rpcgss nfsv4 nfs lockd fscache tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event coretemp crct10dif_pclmul crc32_generic crc32_pclmul snd_ens1371 gameport ghash_clmulni_intel snd_ac97_codec f2fs intel_rapl_perf ac97_bus snd_seq ppdev snd_pcm snd_rawmidi snd_timer vmw_balloon snd_seq_device snd joydev soundcore parport_pc parport nfit acpi_cpufreq tpm_tis vmw_vmci tpm_tis_core tpm shpchp i2c_piix4 grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs_acl]
[ 1295.412522] CPU: 0 PID: 12786 Comm: nfsd Tainted: G        W       4.11.0-rc1+ #521
[ 1295.412959] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1295.413814] Call Trace:
[ 1295.414252]  dump_stack+0x63/0x86
[ 1295.414666]  __warn+0xcb/0xf0
[ 1295.415087]  warn_slowpath_fmt+0x5f/0x80
[ 1295.415502]  ? put_filp+0x42/0x50
[ 1295.415927]  nfserrno+0x74/0x80 [nfsd]
[ 1295.416339]  nfsd_open+0xd7/0x180 [nfsd]
[ 1295.416746]  nfs4_get_vfs_file+0x367/0x3c0 [nfsd]
[ 1295.417182]  ? security_inode_permission+0x41/0x60
[ 1295.417591]  nfsd4_process_open2+0x9b2/0x1200 [nfsd]
[ 1295.418007]  nfsd4_open+0x481/0x790 [nfsd]
[ 1295.418409]  nfsd4_proc_compound+0x395/0x680 [nfsd]
[ 1295.418812]  nfsd_dispatch+0xb8/0x1f0 [nfsd]
[ 1295.419233]  svc_process_common+0x4d9/0x830 [sunrpc]
[ 1295.419631]  svc_process+0xfe/0x1b0 [sunrpc]
[ 1295.420033]  nfsd+0xe9/0x150 [nfsd]
[ 1295.420420]  kthread+0x101/0x140
[ 1295.420802]  ? nfsd_destroy+0x60/0x60 [nfsd]
[ 1295.421199]  ? kthread_park+0x90/0x90
[ 1295.421598]  ret_from_fork+0x2c/0x40
[ 1295.421996] ---[ end trace 0d5a969cd7852e1f ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-03-10 16:54:55 -05:00
Linus Torvalds
baeedc7158 Merge branch 'prep-for-5level'
Merge 5-level page table prep from Kirill Shutemov:
 "Here's relatively low-risk part of 5-level paging patchset. Merging it
  now will make x86 5-level paging enabling in v4.12 easier.

  The first patch is actually x86-specific: detect 5-level paging
  support. It boils down to single define.

  The rest of patchset converts Linux MMU abstraction from 4- to 5-level
  paging.

  Enabling of new abstraction in most cases requires adding single line
  of code in arch-specific code. The rest is taken care by asm-generic/.

  Changes to mm/ code are mostly mechanical: add support for new page
  table level -- p4d_t -- where we deal with pud_t now.

  v2:
   - fix build on microblaze (Michal);
   - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
   - acks from Michal"

* emailed patches from Kirill A Shutemov <kirill.shutemov@linux.intel.com>:
  mm: introduce __p4d_alloc()
  mm: convert generic code to 5-level paging
  asm-generic: introduce <asm-generic/pgtable-nop4d.h>
  arch, mm: convert all architectures to use 5level-fixup.h
  asm-generic: introduce __ARCH_USE_5LEVEL_HACK
  asm-generic: introduce 5level-fixup.h
  x86/cpufeature: Add 5-level paging detection
2017-03-10 08:59:07 -08:00
Linus Torvalds
8fe3ccaed0 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "26 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
  userfaultfd: remove wrong comment from userfaultfd_ctx_get()
  fat: fix using uninitialized fields of fat_inode/fsinfo_inode
  sh: cayman: IDE support fix
  kasan: fix races in quarantine_remove_cache()
  kasan: resched in quarantine_remove_cache()
  mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
  thp: fix another corner case of munlock() vs. THPs
  rmap: fix NULL-pointer dereference on THP munlocking
  mm/memblock.c: fix memblock_next_valid_pfn()
  userfaultfd: selftest: vm: allow to build in vm/ directory
  userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
  userfaultfd: non-cooperative: fix fork fctx->new memleak
  mm/cgroup: avoid panic when init with low memory
  drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
  mm/vmstats: add thp_split_pud event for clarity
  include/linux/fs.h: fix unsigned enum warning with gcc-4.2
  userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
  userfaultfd: non-cooperative: robustness check
  userfaultfd: non-cooperative: rollback userfaultfd_exit
  x86, mm: unify exit paths in gup_pte_range()
  ...
2017-03-10 08:34:42 -08:00
David Howells
cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Linus Torvalds
9db61d6fd6 Merge tag 'xfs-4.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "Here are some bug fixes for -rc2 to clean up the copy on write
  handling and to remove a cause of hangs.

   - Fix various iomap bugs

   - Fix overly aggressive CoW preallocation garbage collection

   - Fixes to CoW endio error handling

   - Fix some incorrect geometry calculations

   - Remove a potential system hang in bulkstat

   - Try to allocate blocks more aggressively to reduce ENOSPC errors"

* tag 'xfs-4.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: try any AG when allocating the first btree block when reflinking
  xfs: use iomap new flag for newly allocated delalloc blocks
  xfs: remove kmem_zalloc_greedy
  xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask
  xfs: fix and streamline error handling in xfs_end_io
  xfs: only reclaim unwritten COW extents periodically
  iomap: invalidate page caches should be after iomap_dio_complete() in direct write
2017-03-09 18:11:28 -08:00
David Hildenbrand
2378cd6181 userfaultfd: remove wrong comment from userfaultfd_ctx_get()
It's a void function, so there is no return value;

Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:10 -08:00
OGAWA Hirofumi
c0d0e35128 fat: fix using uninitialized fields of fat_inode/fsinfo_inode
Recently fallocate patch was merged and it uses
MSDOS_I(inode)->mmu_private at fat_evict_inode().  However,
fat_inode/fsinfo_inode that was introduced in past didn't initialize
MSDOS_I(inode) properly.

With those combinations, it became the cause of accessing random entry
in FAT area.

Link: http://lkml.kernel.org/r/87pohrj4i8.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
Tested-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:10 -08:00
Andrea Arcangeli
70ccb92fdd userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd_remove() has to be execute before zapping the pagetables or
UFFDIO_COPY could keep filling pages after zap_page_range returned,
which would result in non zero data after a MADV_DONTNEED.

However userfaultfd_remove() may have to release the mmap_sem.  This was
handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

The fix consists in revalidating the vma in case userfaultfd_remove()
had to release the mmap_sem.

This also optimizes away an unnecessary down_read/up_read in the
MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
userfaultfd_remove() will be defined as "true" at build time.

Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:10 -08:00
Mike Rapoport
7eb76d457f userfaultfd: non-cooperative: fix fork fctx->new memleak
We have a memleak in the ->new ctx if the uffd of the parent is closed
before the fork event is read, nothing frees the new context.

Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:10 -08:00
Andrea Arcangeli
8c9e7bb7a4 userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
Don't stop running dup_fctx() even if userfaultfd_event_wait_completion
fails as it has to run userfaultfd_ctx_put on all ctx to pair against
the userfaultfd_ctx_get that was run on all fctx->orig in
dup_userfaultfd.

Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Andrea Arcangeli
9a69a829f9 userfaultfd: non-cooperative: robustness check
Similar to the handle_userfault() case, also make sure to never attempt
to send any event past the PF_EXITING point of no return.

This is purely a robustness check.

Link: http://lkml.kernel.org/r/20170224181957.19736-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Andrea Arcangeli
dd0db88d80 userfaultfd: non-cooperative: rollback userfaultfd_exit
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".

Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing.  I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.

I dropped userfaultfd_exit as result.  I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.

Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state.  So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state.  The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.

While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs).  This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.

The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.

This patch (of 3):

I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
      do_exit+0x297/0xd10
      SyS_exit+0x17/0x20
      tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
     RSP <ffff8801f833fe80>
    ---[ end trace 9fecd6dcb442846a ]---

In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected.  So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.

When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager).  So this is an use
after free that was happening for all processes.

One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.

One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.

The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:

	if (mmget_not_zero(ctx->mm)) {
[..]
	} else {
		return -ENOSPC;
	}

It's safer to roll it back and re-introduce it later if at all.

[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
  Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Andrea Arcangeli
6bbc4a4144 userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
__do_fault assumes vmf->page has been initialized and is valid if
VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf).

handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't
return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities).

This VM_FAULT_NOPAGE case is only invoked when signal are pending and it
didn't matter for anonymous memory before.  It only started to matter
since shmem was introduced.  hugetlbfs also takes a different path and
doesn't exercise __do_fault.

Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Kirill A. Shutemov
c2febafc67 mm: convert generic code to 5-level paging
Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical adding handling one more page table level in
places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 11:48:47 -08:00
Linus Torvalds
04bb94b13c overlayfs: remove now unnecessary header file include
This removes the extra include header file that was added in commit
e58bc92783 "Pull overlayfs updates from Miklos Szeredi" now that it
is no longer needed.

There are probably other such includes that got added during the
scheduler header splitup series, but this is the one that annoyed me
personally and I know about.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-08 10:42:13 -08:00
Christoph Hellwig
2fcc319d24 xfs: try any AG when allocating the first btree block when reflinking
When a reflink operation causes the bmap code to allocate a btree block
we're currently doing single-AG allocations due to having ->firstblock
set and then try any higher AG due a little reflink quirk we've put in
when adding the reflink code.  But given that we do not have a minleft
reservation of any kind in this AG we can still not have any space in
the same or higher AG even if the file system has enough free space.
To fix this use a XFS_ALLOCTYPE_FIRST_AG allocation in this fall back
path instead.

[And yes, we need to redo this properly instead of piling hacks over
 hacks.  I'm working on that, but it's not going to be a small series.
 In the meantime this fixes the customer reported issue]

Also add a warning for failing allocations to make it easier to debug.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-08 10:38:53 -08:00
Brian Foster
f65e6fad29 xfs: use iomap new flag for newly allocated delalloc blocks
Commit fa7f138 ("xfs: clear delalloc and cache on buffered write
failure") fixed one regression in the iomap error handling code and
exposed another. The fundamental problem is that if a buffered write
is a rewrite of preexisting delalloc blocks and the write fails, the
failure handling code can punch out preexisting blocks with valid
file data.

This was reproduced directly by sub-block writes in the LTP
kernel/syscalls/write/write03 test. A first 100 byte write allocates
a single block in a file. A subsequent 100 byte write fails and
punches out the block, including the data successfully written by
the previous write.

To address this problem, update the ->iomap_begin() handler to
distinguish newly allocated delalloc blocks from preexisting
delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
->iomap_end() handler to decide when a failed or short write should
punch out delalloc blocks.

This introduces the subtle requirement that ->iomap_begin() should
never combine newly allocated delalloc blocks with existing blocks
in the resulting iomap descriptor. This can occur when a new
delalloc reservation merges with a neighboring extent that is part
of the current write, for example. Therefore, drop the
post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
just return the record inserted into the fork. This ensures only new
blocks are returned and thus that preexisting delalloc blocks are
always handled as "found" blocks and not punched out on a failed
rewrite.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-08 09:58:08 -08:00
Amir Goldstein
b1eaa950f7 ovl: lockdep annotate of nested stacked overlayfs inode lock
An overlayfs instance can be the lower layer of another overlayfs
instance. This setup triggers a lockdep splat of possible recursive
locking of sb->s_type->i_mutex_key in iterate_dir(). Trimmed snip:

 [ INFO: possible recursive locking detected ]
 bash/2468 is trying to acquire lock:
  &sb->s_type->i_mutex_key#14, at: iterate_dir+0x7d/0x15c
 but task is already holding lock:
  &sb->s_type->i_mutex_key#14, at: iterate_dir+0x7d/0x15c

One problem observed with this splat is that ovl_new_inode()
does not call lockdep_annotate_inode_mutex_key() to annotate
the dir inode lock as &sb->s_type->i_mutex_dir_key like other
fs do.

The other problem is that the 2 nested levels of overlayfs inode
lock are annotated using the same key, which is the cause of the
false positive lockdep warning.

Fix this by annotating overlayfs inode lock in ovl_fill_inode()
according to stack level of the super block instance and use
different key for dir vs. non-dir like other fs do.

Here is an edited snip from /proc/lockdep_chains after
iterate_dir() of nested overlayfs:

 [...] &ovl_i_mutex_dir_key[depth]   (stack_depth=2)
 [...] &ovl_i_mutex_dir_key[depth]#2 (stack_depth=1)
 [...] &type->i_mutex_dir_key        (stack_depth=0)

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-03-08 15:05:23 +01:00
Josh Poimboeuf
7c23b33001 livepatch: add /proc/<pid>/patch_state
Expose the per-task patch state value so users can determine which tasks
are holding up completion of a patching operation.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-08 09:38:19 +01:00
Darrick J. Wong
08b005f133 xfs: remove kmem_zalloc_greedy
The sole remaining caller of kmem_zalloc_greedy is bulkstat, which uses
it to grab 1-4 pages for staging of inobt records.  The infinite loop in
the greedy allocation function is causing hangs[1] in generic/269, so
just get rid of the greedy allocator in favor of kmem_zalloc_large.
This makes bulkstat somewhat more likely to ENOMEM if there's really no
pages to spare, but eliminates a source of hangs.

[1] http://lkml.kernel.org/r/20170301044634.rgidgdqqiiwsmfpj%40XZHOUW.usersys.redhat.com

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: remove single-page fallback
2017-03-07 20:10:50 -08:00
Chandan Rajendra
d5825712ee xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask
When block size is larger than inode cluster size, the call to
XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
would have set xfs_sb->sb_inoalignmt to 0. Hence in
xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
-1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
intialized to 0 because for every positive value of xfs_mount->m_dalign,
the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
false.

Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
-1 as the value because blks_per_cluster variable would have the value 1
and hence we would never have a need to use xfs_mount->m_inoalign_mask
to compute the inode chunk's agbno and offset within the chunk.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-07 20:10:50 -08:00
Christoph Hellwig
787eb48550 xfs: fix and streamline error handling in xfs_end_io
There are two different cases of buffered I/O errors:

 - first we can have an already shutdown fs.  In that case we should skip
   any on-disk operations and just clean up the appen transaction if
   present and destroy the ioend
 - a real I/O error.  In that case we should cleanup any lingering COW
   blocks.  This gets skipped in the current code and is fixed by this
   patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-07 20:10:50 -08:00
Christoph Hellwig
3802a34532 xfs: only reclaim unwritten COW extents periodically
We only want to reclaim preallocations from our periodic work item.
Currently this is archived by looking for a dirty inode, but that check
is rather fragile.  Instead add a flag to xfs_reflink_cancel_cow_* so
that the caller can ask for just cancelling unwritten extents in the COW
fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-03-07 16:45:58 -08:00
Linus Torvalds
8a9172356f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "This includes a fix for lockups caused by incorrect nsecs related
  cleanup, and a capabilities check fix for timerfd"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jiffies: Revert bogus conversion of NSEC_PER_SEC to TICK_NSEC
  timerfd: Only check CAP_WAKE_ALARM when it is needed
2017-03-07 14:45:22 -08:00
Kees Cook
30800d9977 pstore: simplify write_user_compat()
Nothing actually uses write_user_compat() currently, but there is no
reason to reuse the dmesg buffer. Instead, just allocate a new record
buffer, copy in from userspace, and pass it to write() as normal.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:01:03 -08:00
Kees Cook
4c9ec21976 pstore: Remove write_buf() callback
Now that write() and write_buf() are functionally identical, this removes
write_buf(), and renames write_buf_user() to write_user(). Additionally
adds sanity-checks for pstore_info's declared functions and flags at
registration time.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:01:02 -08:00
Kees Cook
fdd0311863 pstore: Replace arguments for write_buf_user() API
Removes argument list in favor of pstore record, though the user buffer
remains passed separately since it must carry the __user annotation.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:01:01 -08:00
Kees Cook
b10b471145 pstore: Replace arguments for write_buf() API
As with the other API updates, this removes the long argument list in favor
of passing a single pstore recaord.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:01:01 -08:00
Kees Cook
a61072aae6 pstore: Replace arguments for erase() API
This removes the argument list for the erase() callback and replaces it
with a pointer to the backend record details to be removed.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:01:00 -08:00
Kees Cook
83f70f0769 pstore: Do not duplicate record metadata
This switches the inode-private data from carrying duplicate metadata to
keeping the record passed in during pstore_mkfile().

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:59 -08:00
Kees Cook
2a2b0acf76 pstore: Allocate records on heap instead of stack
In preparation for handling records off to pstore_mkfile(), allocate the
record instead of reusing stack. This still always frees the record,
though, since pstore_mkfile() isn't yet keeping it.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:58 -08:00
Kees Cook
1dfff7dd67 pstore: Pass record contents instead of copying
pstore_mkfile() shouldn't have to memcpy the record contents. It can use
the existing copy instead. This adjusts the allocation lifetime management
and renames the contents variable from "data" to "buf" to assist moving to
struct pstore_record in the future.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:58 -08:00
Kees Cook
7e8cc8dce1 pstore: Always allocate buffer for decompression
Currently, pstore_mkfile() performs a memcpy() of the record contents,
so it can live anywhere. However, this is needlessly wasteful. In
preparation of pstore_mkfile() keeping the record contents, always
allocate a buffer for the contents.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:57 -08:00
Kees Cook
76cc9580e3 pstore: Replace arguments for write() API
Similar to the pstore_info read() callback, there were too many arguments.
This switches to the new struct pstore_record pointer instead. This adds
"reason" and "part" to the record structure as well.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:56 -08:00
Kees Cook
125cc42baf pstore: Replace arguments for read() API
The argument list for the pstore_read() interface is unwieldy. This changes
passes the new struct pstore_record instead. The erst backend was already
doing something similar internally.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:55 -08:00
Kees Cook
1edd1aa397 pstore: Switch pstore_mkfile to pass record
Instead of the long list of arguments, just pass the new record struct.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:55 -08:00
Kees Cook
634f8f5167 pstore: Move record decompression to function
This moves the record decompression logic out to a separate function
to avoid the deep indentation.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:54 -08:00
Kees Cook
9abdcccc3d pstore: Extract common arguments into structure
The read/mkfile pair pass the same arguments and should be cleared
between calls. Move to a structure and wipe it after every loop.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 14:00:53 -08:00
Kees Cook
0d7cd09a3d pstore: Improve register_pstore() error reporting
Uncommon errors are better to get reported to dmesg so developers can
more easily figure out why pstore is unhappy with a backend attempting
to register.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 08:21:38 -08:00
Kees Cook
1344dd86f3 pstore: Avoid race in module unloading
Technically, it might be possible for struct pstore_info to go out of
scope after the module_put(), so report the backend name first.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-03-07 08:21:38 -08:00
Kees Cook
6330d55347 pstore: Shut down worker when unregistering
When built as a module and running with update_ms >= 0, pstore will Oops
during module unload since the work timer is still running. This makes sure
the worker is stopped before unloading.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2017-03-07 08:21:38 -08:00