Commit Graph

66232 Commits

Author SHA1 Message Date
Pavel Begunkov
906a8c3fdb io_uring: fix punting req w/o grabbed env
It's not enough to check for REQ_F_WORK_INITIALIZED and punt async
assuming that io_req_work_grab_env() was called, it may not have been.
E.g. io_close_prep() and personality path set the flag without further
async init.

As a quick fix, always pass next work through io_req_task_queue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:16 -06:00
Pavel Begunkov
8ef77766ba io_uring: fix req->work corruption
req->work and req->task_work are in a union, so io_req_task_queue() screws
everything that was in work. De-union them for now.

[  704.367253] BUG: unable to handle page fault for address:
	ffffffffaf7330d0
[  704.367256] #PF: supervisor write access in kernel mode
[  704.367256] #PF: error_code(0x0003) - permissions violation
[  704.367261] CPU: 6 PID: 1654 Comm: io_wqe_worker-0 Tainted: G
I       5.8.0-rc2-00038-ge28d0bdc4863-dirty #498
[  704.367265] RIP: 0010:_raw_spin_lock+0x1e/0x36
...
[  704.367276]  __alloc_fd+0x35/0x150
[  704.367279]  __get_unused_fd_flags+0x25/0x30
[  704.367280]  io_openat2+0xcb/0x1b0
[  704.367283]  io_issue_sqe+0x36a/0x1320
[  704.367294]  io_wq_submit_work+0x58/0x160
[  704.367295]  io_worker_handle_work+0x2a3/0x430
[  704.367296]  io_wqe_worker+0x2a0/0x350
[  704.367301]  kthread+0x136/0x180
[  704.367304]  ret_from_fork+0x22/0x30

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:10 -06:00
David Howells
719fdd3292 afs: Fix storage of cell names
The cell name stored in the afs_cell struct is a 64-char + NUL buffer -
when it needs to be able to handle up to AFS_MAXCELLNAME (256 chars) + NUL.

Fix this by changing the array to a pointer and allocating the string.

Found using Coverity.

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-27 22:04:24 -07:00
Linus Torvalds
916a3b0fc1 Merge tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Six cifs/smb3 fixes, three of them for stable.

  Fixes xfstests 451, 313 and 316"

* tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: misc: Use array_size() in if-statement controlling expression
  cifs: update ctime and mtime during truncate
  cifs/smb3: Fix data inconsistent when punch hole
  cifs/smb3: Fix data inconsistent when zero file range
  cifs: Fix double add page to memcg when cifs_readpages
  cifs: Fix cached_fid refcnt leak in open_shroot
2020-06-27 15:24:04 -07:00
Linus Torvalds
4e99b32169 Merge tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
 "Stable Fixes:
   - xprtrdma: Fix handling of RDMA_ERROR replies
   - sunrpc: Fix rollback in rpc_gssd_dummy_populate()
   - pNFS/flexfiles: Fix list corruption if the mirror count changes
   - NFSv4: Fix CLOSE not waiting for direct IO completion
   - SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()

  Other Fixes:
   - xprtrdma: Fix a use-after-free with r_xprt->rx_ep
   - Fix other xprtrdma races during disconnect
   - NFS: Fix memory leak of export_path"

* tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
  NFSv4 fix CLOSE not waiting for direct IO compeletion
  pNFS/flexfiles: Fix list corruption if the mirror count changes
  nfs: Fix memory leak of export_path
  sunrpc: fixed rollback in rpc_gssd_dummy_populate()
  xprtrdma: Fix handling of RDMA_ERROR replies
  xprtrdma: Clean up disconnect
  xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect()
  xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()
  xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed
2020-06-27 09:35:47 -07:00
Linus Torvalds
ab0f2473d3 Merge tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Three small fixes:

   - Close a corner case for polled IO resubmission (Pavel)

   - Toss commands when exiting (Pavel)

   - Fix SQPOLL conditional reschedule on perpetually busy submit
     (Xuan)"

* tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block:
  io_uring: fix current->mm NULL dereference on exit
  io_uring: fix hanging iopoll in case of -EAGAIN
  io_uring: fix io_sq_thread no schedule when busy
2020-06-27 09:02:49 -07:00
Randy Dunlap
1e16c2f917 io_uring: fix function args for !CONFIG_NET
Fix build errors when CONFIG_NET is not set/enabled:

../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 19:46:18 -06:00
Jens Axboe
2237d76530 Merge branch 'io_uring-5.8' into for-5.9/io_uring
Merge in changes that went into 5.8-rc3. GIT will silently do the
merge, but we still need a tweak on top of that since
io_complete_rw_common() was modified to take a io_comp_state pointer.
The auto-merge fails on that, and we end up with something that
doesn't compile.

* io_uring-5.8:
  io_uring: fix current->mm NULL dereference on exit
  io_uring: fix hanging iopoll in case of -EAGAIN
  io_uring: fix io_sq_thread no schedule when busy

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 13:44:16 -06:00
Linus Torvalds
7c902e2730 Merge branch 'akpm' (patches from Andrew)
Merge misx fixes from Andrew Morton:
 "31 patches.

  Subsystems affected by this patch series: hotfixes, mm/pagealloc,
  kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap,
  mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86,
  mm/memory-hotplug, MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  MAINTAINERS: update info for sparse
  mm/memory_hotplug.c: fix false softlockup during pfn range removal
  mm: remove vmalloc_exec
  arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page
  x86/hyperv: allocate the hypercall page with only read and execute bits
  mm/memory: fix IO cost for anonymous page
  mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"
  mm: workingset: age nonresident information alongside anonymous pages
  doc: THP CoW fault no longer allocate THP
  docs: mm/gup: minor documentation update
  mm/memcontrol.c: prevent missed memory.low load tears
  mm/memcontrol.c: add missed css_put()
  mm: memcontrol: handle div0 crash race condition in memory.low
  mm/vmalloc.c: fix a warning while make xmldocs
  media: omap3isp: remove cacheflush.h
  make asm-generic/cacheflush.h more standalone
  mm/debug_vm_pgtable: fix build failure with powerpc 8xx
  mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages()
  mm: fix swap cache node allocation mask
  slub: cure list_slab_objects() from double fix
  ...
2020-06-26 12:19:36 -07:00
Pavel Begunkov
f4db7182e0 io-wq: return next work from ->do_work() directly
It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 10:34:27 -06:00
Pavel Begunkov
e883a79d8c io-wq: compact io-wq flags numbers
Renumerate IO_WQ flags, so they take adjacent bits

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 10:34:27 -06:00
Jens Axboe
c40f63790e io_uring: use task_work for links if possible
Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach, it'd be
more ideal if we could just use the task_work handling for this.

Outside of saving an async jump, we can also do less prep work for these
kinds of requests.

Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, the a cache fold 4G file read from a VM runs in about 3
seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.986s
user	0m0.051s
sys	0m2.843s

and a subsequent cache hot run looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.898s
user	0m0.069s
sys	0m0.797s

With this patch in place, the cold case takes about 2.4 seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.400s
user	0m0.020s
sys	0m2.366s

and the cache hot case looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.676s
user	0m0.010s
sys	0m0.665s

As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 10:34:23 -06:00
Olga Kornievskaia
d03727b248 NFSv4 fix CLOSE not waiting for direct IO compeletion
Figuring out the root case for the REMOVE/CLOSE race and
suggesting the solution was done by Neil Brown.

Currently what happens is that direct IO calls hold a reference
on the open context which is decremented as an asynchronous task
in the nfs_direct_complete(). Before reference is decremented,
control is returned to the application which is free to close the
file. When close is being processed, it decrements its reference
on the open_context but since directIO still holds one, it doesn't
sent a close on the wire. It returns control to the application
which is free to do other operations. For instance, it can delete a
file. Direct IO is finally releasing its reference and triggering
an asynchronous close. Which races with the REMOVE. On the server,
REMOVE can be processed before the CLOSE, failing the REMOVE with
EACCES as the file is still opened.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Suggested-by: Neil Brown <neilb@suse.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26 08:43:14 -04:00
Trond Myklebust
8b04013737 pNFS/flexfiles: Fix list corruption if the mirror count changes
If the mirror count changes in the new layout we pick up inside
ff_layout_pg_init_write(), then we can end up adding the
request to the wrong mirror and corrupting the mirror->pg_list.

Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26 08:43:14 -04:00
Tom Rix
4659ed7cc8 nfs: Fix memory leak of export_path
The try_location function is called within a loop by nfs_follow_referral.
try_location calls nfs4_pathname_string to created the export_path.
nfs4_pathname_string allocates the memory. export_path is stored in the
nfs_fs_context/fs_context structure similarly as hostname and source.
But whereas the ctx hostname and source are freed before assignment,
export_path is not.  So if there are multiple loops, the new export_path
will overwrite the old without the old being freed.

So call kfree for export_path.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-26 08:43:14 -04:00
Junxiao Bi
9277f8334f ocfs2: fix value of OCFS2_INVALID_SLOT
In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits.  Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32.
Then the following checking in get_local_system_inode will be always
skipped:

 static struct inode **get_local_system_inode(struct ocfs2_super *osb,
                                               int type,
                                               u32 slot)
 {
 	BUG_ON(slot == OCFS2_INVALID_SLOT);
	...
 }

Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:37 -07:00
Junxiao Bi
e5a15e17a7 ocfs2: fix panic on nfs server over ocfs2
The following kernel panic was captured when running nfs server over
ocfs2, at that time ocfs2_test_inode_bit() was checking whether one
inode locating at "blkno" 5 was valid, that is ocfs2 root inode, its
"suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from
//global_inode_alloc, but here it wrongly assumed that it was got from per
slot inode alloctor which would cause array overflow and trigger kernel
panic.

  BUG: unable to handle kernel paging request at 0000000000001088
  IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
  PGD 1e06ba067 PUD 1e9e7d067 PMD 0
  Oops: 0002 [#1] SMP
  CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2
  Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018
  RIP: _raw_spin_lock+0x18/0xf0
  RSP: e02b:ffff88005ae97908  EFLAGS: 00010206
  RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000
  RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088
  RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00
  R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088
  R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff
  FS:  0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000
  CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660
  Call Trace:
    igrab+0x1e/0x60
    ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2]
    ocfs2_test_inode_bit+0x328/0xa00 [ocfs2]
    ocfs2_get_parent+0xba/0x3e0 [ocfs2]
    reconnect_path+0xb5/0x300
    exportfs_decode_fh+0xf6/0x2b0
    fh_verify+0x350/0x660 [nfsd]
    nfsd4_putfh+0x4d/0x60 [nfsd]
    nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
    nfsd_dispatch+0xe0/0x290 [nfsd]
    svc_process_common+0x412/0x6a0 [sunrpc]
    svc_process+0x123/0x210 [sunrpc]
    nfsd+0xff/0x170 [nfsd]
    kthread+0xcb/0xf0
    ret_from_fork+0x61/0x90
  Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 <f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
  RIP   _raw_spin_lock+0x18/0xf0
  CR2: 0000000000001088
  ---[ end trace 7264463cd1aac8f9 ]---
  Kernel panic - not syncing: Fatal exception

Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:37 -07:00
Junxiao Bi
7569d3c754 ocfs2: load global_inode_alloc
Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will
make it load during mount.  It can be used to test whether some
global/system inodes are valid.  One use case is that nfsd will test
whether root inode is valid.

Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:37 -07:00
Junxiao Bi
4cd9973f9f ocfs2: avoid inode removal while nfsd is accessing it
Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.

This is a series of patches to fix issues on nfsd over ocfs2.  patch 1
is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a
panic issue.

This patch (of 4):

When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.

Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:36 -07:00
Linus Torvalds
52366a107b Merge tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixlet from Jan Kara:
 "A performance improvement to reduce impact of fsnotify for inodes
  where it isn't used"

* tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fs: Do not check if there is a fsnotify watcher on pseudo inodes
2020-06-25 13:02:58 -07:00
Jens Axboe
a1d7c393c4 io_uring: enable READ/WRITE to use deferred completions
A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:49 -06:00
Jens Axboe
229a7b6350 io_uring: pass in completion state to appropriate issue side handlers
Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.

Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.

Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:

17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
 5.94% [kernel] [k] skb_release_data
 4.31% [kernel] [k] udp_rmem_release
 2.68% [kernel] [k] __check_object_size
 2.24% [kernel] [k] __slab_free
 2.22% [kernel] [k] _raw_spin_lock_bh
 2.21% [kernel] [k] kmem_cache_free
 2.13% [kernel] [k] free_pcppages_bulk
 1.83% [kernel] [k] io_submit_sqes
 1.38% [kernel] [k] page_frag_free
 1.31% [kernel] [k] inet_recvmsg

to

19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
 9.36% [kernel] [k] udp_rmem_release
 8.64% [kernel] [k] udp_recvmsg
 6.21% [kernel] [k] __slab_free
 4.39% [kernel] [k] __check_object_size
 3.64% [kernel] [k] free_pcppages_bulk
 2.41% [kernel] [k] kmem_cache_free
 2.00% [kernel] [k] io_submit_sqes
 1.95% [kernel] [k] page_frag_free
 1.54% [kernel] [k] io_put_req
[...]
 0.07% [kernel] [k] io_commit_cqring
 0.44% [kernel] [k] __io_cqring_fill_event

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:46 -06:00
Jens Axboe
f13fad7ba4 io_uring: pass down completion state on the issue side
No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:44 -06:00
Jens Axboe
013538bd65 io_uring: add 'io_comp_state' to struct io_submit_state
No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:22:50 -06:00
Jens Axboe
e1e16097e2 io_uring: provide generic io_req_complete() helper
We have lots of callers of:

io_cqring_add_event(req, result);
io_put_req(req);

Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:22:41 -06:00
Pavel Begunkov
d3cac64c49 io_uring: fix NULL-mm for linked reqs
__io_queue_sqe() tries to handle all request of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.

Don't check req->needs_mm and do it always.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2020-06-25 07:22:38 -06:00
Pavel Begunkov
d60b5fbc1c io_uring: fix current->mm NULL dereference on exit
Don't reissue requests from io_iopoll_reap_events(), the task may not
have mm, which ends up with NULL. It's better to kill everything off on
exit anyway.

[  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[  677.734679] Call Trace:
[  677.734695]  ? __send_signal+0x1f2/0x420
[  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  677.734699]  ? send_signal+0xf5/0x140
[  677.734700]  io_iopoll_getevents+0x12f/0x1a0
[  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
[  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[  677.734704]  io_uring_release+0x20/0x30
[  677.734706]  __fput+0xcd/0x230
[  677.734707]  ____fput+0xe/0x10
[  677.734709]  task_work_run+0x67/0xa0
[  677.734710]  do_exit+0x35d/0xb70
[  677.734712]  do_group_exit+0x43/0xa0
[  677.734713]  get_signal+0x140/0x900
[  677.734715]  do_signal+0x37/0x780
[  677.734717]  ? enqueue_hrtimer+0x41/0xb0
[  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
[  677.734720]  ? ktime_get+0x3e/0xa0
[  677.734721]  ? lapic_next_deadline+0x26/0x30
[  677.734723]  ? tick_program_event+0x4d/0x90
[  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
[  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
[  677.734741]  prepare_exit_to_usermode+0x9/0x40
[  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
[  677.734743]  sysvec_reschedule_ipi+0x92/0x160
[  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
[  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:20:43 -06:00
Pavel Begunkov
cd664b0e35 io_uring: fix hanging iopoll in case of -EAGAIN
io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.

Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().

io_kiocb's result and iopoll_completed")

Fixes: bbde017a32 ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:20:43 -06:00
Linus Torvalds
8be3a53e18 Merge tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
 "Fix a regression which uses potential uninitialized high 32-bit value
  unexpectedly recently observed with specific compiler options"

* tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup
2020-06-24 17:39:30 -07:00
Christoph Hellwig
621c1f4294 block: move struct block_device to blk_types.h
Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
3f1266f1f8 block: move block-related definitions out of fs.h
Move most of the block related definition out of fs.h into more suitable
headers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
764b23bd9a block: mark bd_finish_claiming static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Gao Xiang
3c59728288 erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup
Hongyu reported "id != index" in z_erofs_onlinepage_fixup() with
specific aarch64 environment easily, which wasn't shown before.

After digging into that, I found that high 32 bits of page->private
was set to 0xaaaaaaaa rather than 0 (due to z_erofs_onlinepage_init
behavior with specific compiler options). Actually we only use low
32 bits to keep the page information since page->private is only 4
bytes on most 32-bit platforms. However z_erofs_onlinepage_fixup()
uses the upper 32 bits by mistake.

Let's fix it now.

Reported-and-tested-by: Hongyu Jin <hongyu.jin@unisoc.com>
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200618234349.22553-1-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-06-24 09:47:44 +08:00
Gustavo A. R. Silva
bf1028a41e cifs: misc: Use array_size() in if-statement controlling expression
Use array_size() instead of the open-coded version in the controlling
expression of the if statement.

Also, while there, use the preferred form for passing a size of a struct.
The alternative form where struct name is spelled out hurts readability
and introduces an opportunity for a bug when the pointer variable type is
changed but the corresponding sizeof that is passed as argument is not.

This issue was found with the help of Coccinelle and, audited and fixed
manually.

Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-23 19:06:27 -05:00
Zhang Xiaoxu
5618303d85 cifs: update ctime and mtime during truncate
As the man description of the truncate, if the size changed,
then the st_ctime and st_mtime fields should be updated. But
in cifs, we doesn't do it.

It lead the xfstests generic/313 failed.

So, add the ATTR_MTIME|ATTR_CTIME flags on attrs when change
the file size

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-23 19:06:27 -05:00
Zhang Xiaoxu
acc91c2d8d cifs/smb3: Fix data inconsistent when punch hole
When punch hole success, we also can read old data from file:
  # strace -e trace=pread64,fallocate xfs_io -f -c "pread 20 40" \
           -c "fpunch 20 40" -c"pread 20 40" file
  pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
  fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 20, 40) = 0
  pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40

CIFS implements the fallocate(FALLOCATE_FL_PUNCH_HOLE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page caches not updated, then the
local page caches inconsistent with server.

Also can be found by xfstests generic/316.

So, we need to remove the page caches before send the SMB
ioctl(FSCTL_SET_ZERO_DATA) to server.

Fixes: 31742c5a33 ("enable fallocate punch hole ("fallocate -p") for SMB3")
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-23 19:06:27 -05:00
Zhang Xiaoxu
6b69040247 cifs/smb3: Fix data inconsistent when zero file range
CIFS implements the fallocate(FALLOC_FL_ZERO_RANGE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page cache not update, then the data
inconsistent with server, which leads the xfstest generic/008 failed.

So we need to remove the local page caches before send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. After next read, it will
re-cache it.

Fixes: 30175628bf ("[SMB3] Enable fallocate -z support for SMB3 mounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-06-23 19:06:27 -05:00
Xuan Zhuo
b772f07add io_uring: fix io_sq_thread no schedule when busy
When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:

rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]

This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-23 11:54:30 -06:00
Zhang Xiaoxu
95a3d8f3af cifs: Fix double add page to memcg when cifs_readpages
When xfstests generic/451, there is an BUG at mm/memcontrol.c:
  page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea
       index:0xf
  mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
  flags: 0x2fffff80000001(locked)
  raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
  raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
  page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
  page->mem_cgroup:ffff88817287d000
  ------------[ cut here ]------------
  kernel BUG at mm/memcontrol.c:2659!
  invalid opcode: 0000 [#1] SMP
  CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_
    073836-buildvm-ppc64le-16.ppc.4
  RIP: 0010:commit_charge+0x35/0x50
  Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7
        c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
  RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
  RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
  R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
  R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
  FS:  00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   mem_cgroup_charge+0x166/0x4f0
   __add_to_page_cache_locked+0x4a9/0x710
   add_to_page_cache_locked+0x15/0x20
   cifs_readpages+0x217/0x1270
   read_pages+0x29a/0x670
   page_cache_readahead_unbounded+0x24f/0x390
   __do_page_cache_readahead+0x3f/0x60
   ondemand_readahead+0x1f1/0x470
   page_cache_async_readahead+0x14c/0x170
   generic_file_buffered_read+0x5df/0x1100
   generic_file_read_iter+0x10c/0x1d0
   cifs_strict_readv+0x139/0x170
   new_sync_read+0x164/0x250
   __vfs_read+0x39/0x60
   vfs_read+0xb5/0x1e0
   ksys_pread64+0x85/0xf0
   __x64_sys_pread64+0x22/0x30
   do_syscall_64+0x69/0x150
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f5071fcb1af
  Code: Bad RIP value.
  RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
  RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
  RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
  R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
  Modules linked in:
  ---[ end trace 725fa14a3e1af65c ]---

Since commit 3fea5a499d ("mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API") not cancel the page charge, the pages maybe
double add to pagecache:
thread1                       | thread2
cifs_readpages
readpages_get_pages
 add_to_page_cache_locked(head,index=n)=0
                              | readpages_get_pages
                              | add_to_page_cache_locked(head,index=n+1)=0
 add_to_page_cache_locked(head, index=n+1)=-EEXIST
 then, will next loop with list head page's
 index=n+1 and the page->mapping not NULL
readpages_get_pages
add_to_page_cache_locked(head, index=n+1)
 commit_charge
  VM_BUG_ON_PAGE

So, we should not do the next loop when any page add to page cache
failed.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-06-23 12:04:52 -05:00
Linus Torvalds
3e08a95294 Merge tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "A number of fixes, located in two areas, one performance fix and one
  fixup for better integration with another patchset.

   - bug fixes in nowait aio:
       - fix snapshot creation hang after nowait-aio was used
       - fix failure to write to prealloc extent past EOF
       - don't block when extent range is locked

   - block group fixes:
       - relocation failure when scrub runs in parallel
       - refcount fix when removing fails
       - fix race between removal and creation
       - space accounting fixes

   - reinstante fast path check for log tree at unlink time, fixes
     performance drop up to 30% in REAIM

   - kzfree/kfree fixup to ease treewide patchset renaming kzfree"

* tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
  btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
  btrfs: fix RWF_NOWAIT write not failling when we need to cow
  btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
  btrfs: fix hang on snapshot creation after RWF_NOWAIT write
  btrfs: check if a log root exists before locking the log_mutex on unlink
  btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
  btrfs: fix data block group relocation failure due to concurrent scrub
  btrfs: fix race between block group removal and block group creation
  btrfs: fix a block group ref counter leak after failure to remove block group
2020-06-23 09:20:11 -07:00
Dave Chinner
c7f87f3984 xfs: fix use-after-free on CIL context on shutdown
xlog_wait() on the CIL context can reference a freed context if the
waiter doesn't get scheduled before the CIL context is freed. This
can happen when a task is on the hard throttle and the CIL push
aborts due to a shutdown. This was detected by generic/019:

thread 1			thread 2

__xfs_trans_commit
 xfs_log_commit_cil
  <CIL size over hard throttle limit>
  xlog_wait
   schedule
				xlog_cil_push_work
				wake_up_all
				<shutdown aborts commit>
				xlog_cil_committed
				kmem_free

   remove_wait_queue
    spin_lock_irqsave --> UAF

Fix it by moving the wait queue to the CIL rather than keeping it in
in the CIL context that gets freed on push completion. Because the
wait queue is now independent of the CIL context and we might have
multiple contexts in flight at once, only wake the waiters on the
push throttle when the context we are pushing is over the hard
throttle size threshold.

Fixes: 0e7ab7efe7 ("xfs: Throttle commits on delayed background CIL push")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-06-22 19:22:57 -07:00
Xiyu Yang
77577de641 cifs: Fix cached_fid refcnt leak in open_shroot
open_shroot() invokes kref_get(), which increases the refcount of the
"tcon->crfid" object. When open_shroot() returns not zero, it means the
open operation failed and close_shroot() will not be called to decrement
the refcount of the "tcon->crfid".

The reference counting issue happens in one normal path of
open_shroot(). When the cached root have been opened successfully in a
concurrent process, the function increases the refcount and jump to
"oshr_free" to return. However the current return value "rc" may not
equal to 0, thus the increased refcount will not be balanced outside the
function, causing a refcnt leak.

Fix this issue by setting the value of "rc" to 0 before jumping to
"oshr_free" label.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2020-06-21 22:34:50 -05:00
Pavel Begunkov
f6b6c7d6a9 io_uring: kill NULL checks for submit state
After recent changes, io_submit_sqes() always passes valid submit state,
so kill leftovers checking it for NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
b90cd197f9 io_uring: set @poll->file after @poll init
It's a good practice to modify fields of a struct after but not before
it was initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
24c7467863 io_uring: remove REQ_F_MUST_PUNT
REQ_F_MUST_PUNT may seem looking good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).

Kill theg flag and update already outdated comments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
62ef731650 io_uring: remove setting REQ_F_MUST_PUNT in rw
io_{read,write}() {
	...
copy_iov: // prep async
  	if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
		flags |= REQ_F_MUST_PUNT;
}

REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will go
async path in __io_queue_sqe() anyway. file_can_poll() check
is also repeated in arm_poll*(), so don't need it.

Remove the mentioned assignment REQ_F_MUST_PUNT in preparation
for killing the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:03 -06:00
Jens Axboe
bcf5a06304 io_uring: support true async buffered reads, if file provides it
If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.

The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:26 -06:00
Jens Axboe
8730f12b79 btrfs: flag files as supporting buffered async reads
btrfs uses generic_file_read_iter(), which already supports this.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
f89fb730aa xfs: flag files as supporting buffered async reads
XFS uses generic_file_read_iter(), which already supports this.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
a304f07448 block: flag block devices as supporting IOCB_WAITQ
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00