Commit Graph

1692 Commits

Author SHA1 Message Date
Yafang Shao
eb1d7a65f0 mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED
Our users reported that there're some random latency spikes when their RT
process is running.  Finally we found that latency spike is caused by
FADV_DONTNEED.  Which may call lru_add_drain_all() to drain LRU cache on
remote CPUs, and then waits the per-cpu work to complete.  The wait time
is uncertain, which may be tens millisecond.

That behavior is unreasonable, because this process is bound to a specific
CPU and the file is only accessed by itself, IOW, there should be no
pagecache pages on a per-cpu pagevec of a remote CPU.  That unreasonable
behavior is partially caused by the wrong comparation of the number of
invalidated pages and the number of the target.  For example,

        if (count < (end_index - start_index + 1))

The count above is how many pages were invalidated in the local CPU, and
(end_index - start_index + 1) is how many pages should be invalidated.
The usage of (end_index - start_index + 1) is incorrect, because they are
virtual addresses, which may not mapped to pages.  Besides that, there may
be holes between start and end.  So we'd better check whether there are
still pages on per-cpu pagevec after drain the local cpu, and then decide
whether or not to call lru_add_drain_all().

After I applied it with a hotfix to our production environment, most of
the lru_add_drain_all() can be avoided.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:29 -07:00
Linus Torvalds
6ad4bf6ea1 Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:

 - Add blkcg accounting for io-wq offload (Dennis)

 - A use-after-free fix for io-wq (Hillf)

 - Cancelation fixes and improvements

 - Use proper files_struct references for offload

 - Cleanup of io_uring_get_socket() since that can now go into our own
   header

 - SQPOLL fixes and cleanups, and support for sharing the thread

 - Improvement to how page accounting is done for registered buffers and
   huge pages, accounting the real pinned state

 - Series cleaning up the xarray code (Willy)

 - Various cleanups, refactoring, and improvements (Pavel)

 - Use raw spinlock for io-wq (Sebastian)

 - Add support for ring restrictions (Stefano)

* tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits)
  io_uring: keep a pointer ref_node in file_data
  io_uring: refactor *files_register()'s error paths
  io_uring: clean file_data access in files_register
  io_uring: don't delay io_init_req() error check
  io_uring: clean leftovers after splitting issue
  io_uring: remove timeout.list after hrtimer cancel
  io_uring: use a separate struct for timeout_remove
  io_uring: improve submit_state.ios_left accounting
  io_uring: simplify io_file_get()
  io_uring: kill extra check in fixed io_file_get()
  io_uring: clean up ->files grabbing
  io_uring: don't io_prep_async_work() linked reqs
  io_uring: Convert advanced XArray uses to the normal API
  io_uring: Fix XArray usage in io_uring_add_task_file
  io_uring: Fix use of XArray in __io_uring_files_cancel
  io_uring: fix break condition for __io_uring_register() waiting
  io_uring: no need to call xa_destroy() on empty xarray
  io_uring: batch account ->req_issue and task struct references
  io_uring: kill callback_head argument for io_req_task_work_add()
  io_uring: move req preps out of io_issue_sqe()
  ...
2020-10-13 12:36:21 -07:00
Linus Torvalds
3ad11d7ac8 Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Linus Torvalds
11e3235b43 Merge tag 'for-5.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "Mostly core updates with a few user visible bits and fixes.

  Hilights:

   - fsync performance improvements
      - less contention of log mutex (throughput +4%, latency -14%,
        dbench with 32 clients)
      - skip unnecessary commits for link and rename (throughput +6%,
        latency -30%, rename latency -75%, dbench with 16 clients)
      - make fast fsync wait only for writeback (throughput +10..40%,
        runtime -1..-20%, dbench with 1 to 64 clients on various
        file/block sizes)

   - direct io is now implemented using the iomap infrastructure, that's
     the main part, we still have a workaround that requires an iomap
     API update, coming in 5.10

   - new sysfs exports:
      - information about the exclusive filesystem operation status
        (balance, device add/remove/replace, ...)
      - supported send stream version

  Core:

   - use ticket space reservations for data, fair policy using the same
     infrastructure as metadata

   - preparatory work to switch locking from our custom tree locks to
     standard rwsem, now the locking context is propagated to all
     callers, actual switch is expected to happen in the next dev cycle

   - seed device structures are now using list API

   - extent tracepoints print proper tree id

   - unified range checks for extent buffer helpers

   - send: avoid using temporary buffer for copying data

   - remove unnecessary RCU protection from space infos

   - remove unused readpage callback for metadata, enabling several
     cleanups

   - replace indirect function calls for end io hooks and remove
     extent_io_ops completely

  Fixes:

   - more lockdep warning fixes

   - fix qgroup reservation for delayed inode and an occasional
     reservation leak for preallocated files

   - fix device replace of a seed device

   - fix metadata reservation for fallocate that leads to transaction
     aborts

   - reschedule if necessary when logging directory items or when
     cloning lots of extents

   - tree-checker: fix false alert caused by legacy btrfs root item

   - send: fix rename/link conflicts for orphanized inodes

   - properly initialize device stats for seed devices

   - skip devices without magic signature when mounting

  Other:

   - error handling improvements, BUG_ONs replaced by proper handling,
     fuzz fixes

   - various function parameter cleanups

   - various W=1 cleanups

   - error/info messages improved

  Mishaps:

   - commit 62cf539120 ("btrfs: move btrfs_rm_dev_replace_free_srcdev
     outside of all locks") is a rebase leftover after the patch got
     merged to 5.9-rc8 as a466c85edc ("btrfs: move
     btrfs_rm_dev_replace_free_srcdev outside of all locks"), the
     remaining part is trivial and the patch is in the middle of the
     series so I'm keeping it there instead of rebasing"

* tag 'for-5.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (161 commits)
  btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag
  btrfs: annotate device name rcu_string with __rcu
  btrfs: skip devices without magic signature when mounting
  btrfs: cleanup cow block on error
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove no longer used dio_end_io()
  btrfs: return error if we're unable to read device stats
  btrfs: init device stats for seed devices
  btrfs: remove struct extent_io_ops
  btrfs: call submit_bio_hook directly for metadata pages
  btrfs: stop calling submit_bio_hook for data inodes
  btrfs: don't opencode is_data_inode in end_bio_extent_readpage
  btrfs: call submit_bio_hook directly in submit_one_bio
  btrfs: remove extent_io_ops::readpage_end_io_hook
  btrfs: replace readpage_end_io_hook with direct calls
  btrfs: send, recompute reference path after orphanization of a directory
  btrfs: send, orphanize first all conflicting inodes when processing references
  btrfs: tree-checker: fix false alert caused by legacy btrfs root item
  btrfs: use unaligned helpers for stack and header set/get helpers
  btrfs: free-space-cache: use unaligned helpers to access data
  ...
2020-10-13 09:01:36 -07:00
Linus Torvalds
85ed13e78d Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in <linux/compat.h>
2020-10-12 16:35:51 -07:00
Goldwyn Rodrigues
c33fe275b5 fs: remove no longer used dio_end_io()
Since we removed the last user of dio_end_io() when btrfs got converted
to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
the helper function dio_end_io().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Christoph Hellwig
bfdc59701d iov_iter: refactor rw_copy_check_uvector and import_iovec
Split rw_copy_check_uvector into two new helpers with more sensible
calling conventions:

 - iovec_from_user copies a iovec from userspace either into the provided
   stack buffer if it fits, or allocates a new buffer for it.  Returns
   the actually used iovec.  It also verifies that iov_len does fit a
   signed type, and handles compat iovecs if the compat flag is set.
 - __import_iovec consolidates the native and compat versions of
   import_iovec. It calls iovec_from_user, then validates each iovec
   actually points to user addresses, and ensures the total length
   doesn't overflow.

This has two major implications:

 - the access_process_vm case loses the total lenght checking, which
   wasn't required anyway, given that each call receives two iovecs
   for the local and remote side of the operation, and it verifies
   the total length on the local side already.
 - instead of a single loop there now are two loops over the iovecs.
   Given that the iovecs are cache hot this doesn't make a major
   difference

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:01:56 -04:00
Jens Axboe
ce71bfea20 fs: align IOCB_* flags with RWF_* flags
We have a set of flags that are shared between the two and inherired
in kiocb_set_rw_flags(), but we check and set these individually.
Reorder the IOCB flags so that the bottom part of the space is synced
with the RWF flag space, and then we can do them all in one mask and
set operation.

The only exception is RWF_SYNC, which needs to mark IOCB_SYNC and
IOCB_DSYNC. Do that one separately.

This shaves 15 bytes of text from kiocb_set_rw_flags() for me.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
a3ec600540 io_uring: move io_uring_get_socket() into io_uring.h
Now we have a io_uring kernel header, move this definition out of fs.h
and into io_uring.h where it belongs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
402dd2cf46 fs: remove the unused SB_I_MULTIROOT flag
The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c
("NFS: Add fs_context support.")

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:38 -06:00
Linus Torvalds
e309428590 Merge tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull writeback fixes from Jan Kara:
 "Fixes for writeback code occasionally skipping writeback of some
  inodes or livelocking sync(2)"

* tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: Drop I_DIRTY_TIME_EXPIRE
  writeback: Fix sync livelock due to b_dirty_time processing
  writeback: Avoid skipping inode writeback
  writeback: Protect inode->i_io_list with inode->i_lock
2020-08-28 10:57:14 -07:00
Linus Torvalds
2cc3c4b3c2 Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "A few differerent things in here.

  Seems like syzbot got some more io_uring bits wired up, and we got a
  handful of reports and the associated fixes are in here.

  General fixes too, and a lot of them marked for stable.

  Lastly, a bit of fallout from the async buffered reads, where we now
  more easily trigger short reads. Some applications don't really like
  that, so the io_read() code now handles short reads internally, and
  got a cleanup along the way so that it's now easier to read (and
  documented). We're now passing tests that failed before"

* tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
  io_uring: short circuit -EAGAIN for blocking read attempt
  io_uring: sanitize double poll handling
  io_uring: internally retry short reads
  io_uring: retain iov_iter state over io_read/io_write calls
  task_work: only grab task signal lock when needed
  io_uring: enable lookup of links holding inflight files
  io_uring: fail poll arm on queue proc failure
  io_uring: hold 'ctx' reference around task_work queue + execute
  fs: RWF_NOWAIT should imply IOCB_NOIO
  io_uring: defer file table grabbing request cleanup for locked requests
  io_uring: add missing REQ_F_COMP_LOCKED for nested requests
  io_uring: fix recursive completion locking on oveflow flush
  io_uring: use TWA_SIGNAL for task_work uncondtionally
  io_uring: account locked memory before potential error case
  io_uring: set ctx sq/cq entry count earlier
  io_uring: Fix NULL pointer dereference in loop_rw_iter()
  io_uring: add comments on how the async buffered read retry works
  io_uring: io_async_buf_func() need not test page bit
2020-08-16 10:55:12 -07:00
Mike Kravetz
34ae204f18 hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode.  This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.

Unfortunately, that else clause is exercised when there is no page table
entry.  This will likely lead to a call to huge_pmd_share.  If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.

Simply remove the else clause.  It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.

To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.

Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Jens Axboe
efa8480a83 fs: RWF_NOWAIT should imply IOCB_NOIO
With the change allowing read-ahead for IOCB_NOWAIT, we changed the
RWF_NOWAIT semantics of only doing cached reads. Since we know have
IOCB_NOIO to manage that specific side of it, just make RWF_NOWAIT
imply IOCB_NOIO as well to restore the previous behavior.

Fixes: 2e85abf053 ("mm: allow read-ahead with IOCB_NOWAIT set")
Reported-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-11 08:09:01 -06:00
Linus Torvalds
8c2618a6d0 Merge tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Make sure transactions won't be started recursively in
   gfs2_block_zero_range (bug introduced in 5.4 when switching to
   iomap_zero_range)

 - Fix a glock holder refcount leak introduced in the iopen glock
   locking scheme rework merged in 5.8.

 - A few other small improvements (debugging, stack usage, comment
   fixes).

* tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: When gfs2_dirty_inode gets a glock error, dump the glock
  gfs2: Never call gfs2_block_zero_range with an open transaction
  gfs2: print details on transactions that aren't properly ended
  gfs2: Fix inaccurate comment
  fs: Fix typo in comment
  gfs2: Fix refcount leak in gfs2_glock_poke
  gfs2: Pass glock holder to gfs2_file_direct_{read,write}
  gfs2: Add some flags missing from glock output
2020-08-10 18:22:43 -07:00
Linus Torvalds
b79675e15a Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "No common topic whatsoever in those, sorry"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: define inode flags using bit numbers
  iov_iter: Move unnecessary inclusion of crypto/hash.h
  dlmfs: clean up dlmfs_file_{read,write}() a bit
2020-08-07 21:14:30 -07:00
Linus Torvalds
81e11336d9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few MM hotfixes

 - kthread, tools, scripts, ntfs and ocfs2

 - some of MM

Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  mm: vmscan: consistent update to pgrefill
  mm/vmscan.c: fix typo
  khugepaged: khugepaged_test_exit() check mmget_still_valid()
  khugepaged: retract_page_tables() remember to test exit
  khugepaged: collapse_pte_mapped_thp() protect the pmd lock
  khugepaged: collapse_pte_mapped_thp() flush the right range
  mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
  mm: thp: replace HTTP links with HTTPS ones
  mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
  mm/page_alloc.c: skip setting nodemask when we are in interrupt
  mm/page_alloc: fallbacks at most has 3 elements
  mm/page_alloc: silence a KASAN false positive
  mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
  mm/page_alloc.c: simplify pageblock bitmap access
  mm/page_alloc.c: extract the common part in pfn_to_bitidx()
  mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
  mm/shuffle: remove dynamic reconfiguration
  mm/memory_hotplug: document why shuffle_zone() is relevant
  mm/page_alloc: remove nr_free_pagecache_pages()
  mm: remove vm_total_pages
  ...
2020-08-07 11:39:33 -07:00
Peter Collingbourne
45e55300f1 mm: remove unnecessary wrapper function do_mmap_pgoff()
The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.

The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap().  However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().

Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:27 -07:00
Chris Down
e809d5f0b5 tmpfs: per-superblock i_ino support
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even having
as many as 150 duplicated inode numbers with differing file content.

This causes actual, tangible problems in production.  For example, we have
complaints from those working on remote caches that their application is
reporting cache corruptions because it uses (device, inodenum) to
establish the identity of a particular cache object, but because it's not
unique any more, the application refuses to continue and reports cache
corruption.  Even worse, sometimes applications may not even detect the
corruption but may continue anyway, causing phantom and hard to debug
behaviour.

In general, userspace applications expect that (device, inodenum) should
be enough to be uniquely point to one inode, which seems fair enough.  One
might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
   same inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This
   itself allows some reduction in i_ino churn. This works on both 64-
   and 32- bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
   64-bit ino_t only: we allow users to mount tmpfs with a new inode64
   option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

You can see how this compares to previous related patches which didn't
implement this per-superblock:

- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/

This patch (of 2):

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to become overflowed
  if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
  means it's not that difficult to cause inode number wraparounds on a
  single device, which can result in having multiple distinct inodes
  with the same inode number.

This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount.  This is implemented in the next commit.

For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.

Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:24 -07:00
Linus Torvalds
e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Linus Torvalds
2324d50d05 Merge tag 'docs-5.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "It's been a busy cycle for documentation - hopefully the busiest for a
  while to come. Changes include:

   - Some new Chinese translations

   - Progress on the battle against double words words and non-HTTPS
     URLs

   - Some block-mq documentation

   - More RST conversions from Mauro. At this point, that task is
     essentially complete, so we shouldn't see this kind of churn again
     for a while. Unless we decide to switch to asciidoc or
     something...:)

   - Lots of typo fixes, warning fixes, and more"

* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
  scripts/kernel-doc: optionally treat warnings as errors
  docs: ia64: correct typo
  mailmap: add entry for <alobakin@marvell.com>
  doc/zh_CN: add cpu-load Chinese version
  Documentation/admin-guide: tainted-kernels: fix spelling mistake
  MAINTAINERS: adjust kprobes.rst entry to new location
  devices.txt: document rfkill allocation
  PCI: correct flag name
  docs: filesystems: vfs: correct flag name
  docs: filesystems: vfs: correct sync_mode flag names
  docs: path-lookup: markup fixes for emphasis
  docs: path-lookup: more markup fixes
  docs: path-lookup: fix HTML entity mojibake
  CREDITS: Replace HTTP links with HTTPS ones
  docs: process: Add an example for creating a fixes tag
  doc/zh_CN: add Chinese translation prefer section
  doc/zh_CN: add clearing-warn-once Chinese version
  doc/zh_CN: add admin-guide index
  doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
  futex: MAINTAINERS: Re-add selftests directory
  ...
2020-08-04 22:47:54 -07:00
Linus Torvalds
cdc8fcb499 Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
 "Lots of cleanups in here, hardening the code and/or making it easier
  to read and fixing bugs, but a core feature/change too adding support
  for real async buffered reads. With the latter in place, we just need
  buffered write async support and we're done relying on kthreads for
  the fast path. In detail:

   - Cleanup how memory accounting is done on ring setup/free (Bijan)

   - sq array offset calculation fixup (Dmitry)

   - Consistently handle blocking off O_DIRECT submission path (me)

   - Support proper async buffered reads, instead of relying on kthread
     offload for that. This uses the page waitqueue to drive retries
     from task_work, like we handle poll based retry. (me)

   - IO completion optimizations (me)

   - Fix race with accounting and ring fd install (me)

   - Support EPOLLEXCLUSIVE (Jiufei)

   - Get rid of the io_kiocb unionizing, made possible by shrinking
     other bits (Pavel)

   - Completion side cleanups (Pavel)

   - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

   - Request environment grabbing cleanups (Pavel)

   - File and socket read/write cleanups (Pavel)

   - Improve kiocb_set_rw_flags() (Pavel)

   - Tons of fixes and cleanups (Pavel)

   - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
  io_uring: flip if handling after io_setup_async_rw
  fs: optimise kiocb_set_rw_flags()
  io_uring: don't touch 'ctx' after installing file descriptor
  io_uring: get rid of atomic FAA for cq_timeouts
  io_uring: consolidate *_check_overflow accounting
  io_uring: fix stalled deferred requests
  io_uring: fix racy overflow count reporting
  io_uring: deduplicate __io_complete_rw()
  io_uring: de-unionise io_kiocb
  io-wq: update hash bits
  io_uring: fix missing io_queue_linked_timeout()
  io_uring: mark ->work uninitialised after cleanup
  io_uring: deduplicate io_grab_files() calls
  io_uring: don't do opcode prep twice
  io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
  io_uring: batch put_task_struct()
  tasks: add put_task_struct_many()
  io_uring: return locked and pinned page accounting
  io_uring: don't miscount pinned memory
  io_uring: don't open-code recv kbuf managment
  ...
2020-08-03 13:01:22 -07:00
Linus Torvalds
382625d0d4 Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Linus Torvalds
690b25675f Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
 "This release, we add support for inline encryption via the blk-crypto
  framework which was added in 5.8.

  Now when an ext4 or f2fs filesystem is mounted with '-o inlinecrypt',
  the contents of encrypted files will be encrypted/decrypted via
  blk-crypto, instead of directly using the crypto API. This model
  allows taking advantage of the inline encryption hardware that is
  integrated into the UFS or eMMC host controllers on most mobile SoCs.

  Note that this is just an alternate implementation; the ciphertext
  written to disk stays the same.

  (This pull request does *not* include support for direct I/O on
  encrypted files, which blk-crypto makes possible, since that part is
  still being discussed.)

  Besides the above feature update, there are also a few fixes and
  cleanups, e.g. strengthening some memory barriers that may be too
  weak.

  All these patches have been in linux-next with no reported issues.
  I've also tested them with the fscrypt xfstests, as usual. It's also
  been tested that the inline encryption support works with the support
  for Qualcomm and Mediatek inline encryption hardware that will be in
  the scsi pull request for 5.9. Also, several SoC vendors are already
  using a previous, functionally equivalent version of these patches"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: don't load ->i_crypt_info before it's known to be valid
  fscrypt: document inline encryption support
  fscrypt: use smp_load_acquire() for ->i_crypt_info
  fscrypt: use smp_load_acquire() for ->s_master_keys
  fscrypt: use smp_load_acquire() for fscrypt_prepared_key
  fscrypt: switch fscrypt_do_sha256() to use the SHA-256 library
  fscrypt: restrict IV_INO_LBLK_* to AES-256-XTS
  fscrypt: rename FS_KEY_DERIVATION_NONCE_SIZE
  fscrypt: add comments that describe the HKDF info strings
  ext4: add inline encryption support
  f2fs: add inline encryption support
  fscrypt: add inline encryption support
  fs: introduce SB_INLINECRYPT
2020-08-03 10:09:59 -07:00
Andreas Gruenbacher
c9dff08485 fs: Fix typo in comment
The comment for function filemap_check_wb_err accidentally refers to
it as filemap_check_wb_error.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-08-03 14:11:26 +02:00
Pavel Begunkov
1752f0adea fs: optimise kiocb_set_rw_flags()
Use a local var to collect flags in kiocb_set_rw_flags(). That spares
some memory writes and allows to replace most of the jumps with MOVEcc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-01 11:02:00 -06:00
Christoph Hellwig
fd5ad30c78 fs: expose utimes_common
Rename utimes_common to vfs_utimes and make it available outside of
utimes.c.  This will be used by the initramfs unpacking code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:01 +02:00
Eric Biggers
6414e9b09f fs: define inode flags using bit numbers
Define the VFS inode flags using bit numbers instead of hardcoding
powers of 2, which has become unwieldy now that we're up to 65536.

No change in the actual values.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:39:53 -04:00
Christoph Hellwig
9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmode operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00
Christoph Hellwig
c04011fe8c fs: add a vfs_fchown helper
Add a helper for struct file based chown operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:00 +02:00
Linus Torvalds
b1b11d0063 Merge tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc
Pull in-kernel read and write op cleanups from Christoph Hellwig:
 "Cleanup in-kernel read and write operations

  Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
  all users of in-kernel file I/O use them if they don't use iov_iter
  based methods already.

  The new WARN_ONs in combination with syzcaller already found a missing
  input validation in 9p. The fix should be on your way through the
  maintainer ASAP".

[ This is prep-work for the real changes coming 5.9 ]

* tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc:
  fs: remove __vfs_read
  fs: implement kernel_read using __kernel_read
  integrity/ima: switch to using __kernel_read
  fs: add a __kernel_read helper
  fs: remove __vfs_write
  fs: implement kernel_write using __kernel_write
  fs: check FMODE_WRITE in __kernel_write
  fs: unexport __kernel_write
  bpfilter: switch to kernel_write
  autofs: switch to kernel_write
  cachefiles: switch to kernel_write
2020-07-10 09:45:15 -07:00
Satya Tangirala
457e7a135c fs: introduce SB_INLINECRYPT
Introduce SB_INLINECRYPT, which is set by filesystems that wish to use
blk-crypto for file content en/decryption. This flag maps to the
'-o inlinecrypt' mount option which multiple filesystems will implement,
and code in fs/crypto/ needs to be able to check for this mount option
in a filesystem-independent way.

Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-2-satyat@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-07-08 10:29:14 -07:00
Christoph Hellwig
775802c057 fs: remove __vfs_read
Fold it into the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 08:27:57 +02:00
Christoph Hellwig
61a707c543 fs: add a __kernel_read helper
This is the counterpart to __kernel_write, and skip the rw_verify_area
call compared to kernel_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 08:27:56 +02:00
Andreas Gruenbacher
41da51bce3 fs: Add IOCB_NOIO flag for generic_file_read_iter
Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
shouldn't trigger any filesystem I/O for the actual request or for
readahead.  This allows to do tentative reads out of the page cache as
some filesystems allow, and to take the appropriate locks and retry the
reads only if the requested pages are not cached.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-07-07 23:40:08 +02:00
Mauro Carvalho Chehab
21b9cb3438 fs: fs.h: fix a kernel-doc parameter description
Changeset 3b0311e7ca71 ("vfs: track per-sb writeback errors and report them to syncfs")
added a variant of filemap_sample_wb_err(), but it forgot to
rename the arguments at the kernel-doc markup. Fix it.

Fix those warnings:
	./include/linux/fs.h:2845: warning: Function parameter or member 'file' not described in 'file_sample_sb_err'
	./include/linux/fs.h:2845: warning: Excess function parameter 'mapping' description in 'file_sample_sb_err'

Fixes: 3b0311e7ca71 ("vfs: track per-sb writeback errors and report them to syncfs")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/7b33bbceb29ac80874622a2bc84127bb10103245.1592895969.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-06-26 10:01:05 -06:00
Christoph Hellwig
621c1f4294 block: move struct block_device to blk_types.h
Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
3f1266f1f8 block: move block-related definitions out of fs.h
Move most of the block related definition out of fs.h into more suitable
headers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
dd0dca223e block: simplify sb_is_blkdev_sb
Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
75362a1792 fs: remove the mount_bdev and kill_block_super stubs
No one calls these functions without CONFIG_BLOCK, so don't bother
stubbing them out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
4e24566a13 fs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL defines
These are not defined anywhere, and contrary to the comments we really
do not care about out of tree code at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
7dbac5baa8 fs: remove an unused block_device_operations forward declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
764b23bd9a block: mark bd_finish_claiming static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Christoph Hellwig
b818f09e46 tty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCK
We can also thaw non-block file systems.  Remove the CONFIG_BLOCK in
sysrq.c after making the prototype available unconditionally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:16:02 -06:00
Jens Axboe
c2a25ec0f1 fs: add FMODE_BUF_RASYNC
If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
dd3e6d5039 mm: add support for async page locking
Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Zheng Bin
3373a3461a block: make function 'kill_bdev' static
kill_bdev does not have any external user, so make it static.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 09:24:35 -06:00
Jan Kara
5fcd57505c writeback: Drop I_DIRTY_TIME_EXPIRE
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15 09:18:46 +02:00
Jan Kara
5afced3bf2 writeback: Avoid skipping inode writeback
Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.

However there are two flaws in the current logic:

1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).

2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:

writeback_single_inode()		__mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
					  inode->i_state |= I_DIRTY_PAGES;
					  if (inode->i_state & I_SYNC)
					    bail
  if (!(inode->i_state & I_DIRTY_ALL))
  - not true so nothing done

We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).

Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.

Reported-by: Martijn Coenen <maco@android.com>
Reviewed-by: Martijn Coenen <maco@android.com>
Tested-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15 09:18:45 +02:00