Commit Graph

David Sterba
521e102227 btrfs: scrub: simplify tree block checksum calculation
Use a simpler iteration over the tree block pages, the same as what
csum_tree_block does: the first page always exists, then loop over the rest.
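
For illustration, a minimal sketch of that iteration pattern; the variable
and field names here are assumed from the scrub context, not taken from the
actual patch:

    SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
    char *kaddr = page_address(sblock->pagev[0]->page);
    int i;

    shash->tfm = fs_info->csum_shash;
    crypto_shash_init(shash);
    /* The first page always exists; skip the embedded checksum itself. */
    crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
                        PAGE_SIZE - BTRFS_CSUM_SIZE);
    /* Then loop over the remaining pages, if any. */
    for (i = 1; i < num_pages; i++) {
            kaddr = page_address(sblock->pagev[i]->page);
            crypto_shash_update(shash, kaddr, PAGE_SIZE);
    }
    crypto_shash_final(shash, calculated_csum);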

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:23 +02:00
David Sterba
d41ebef200 btrfs: scrub: clean up temporary page variables in scrub_checksum_data
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:23 +02:00
David Sterba
771aba0d12 btrfs: scrub: simplify data block checksum calculation
As sectorsize is the same as PAGE_SIZE, the checksum can be calculated in
one go.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:23 +02:00
David Sterba
c746054109 btrfs: scrub: clean up temporary page variables in scrub_checksum_super
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:23 +02:00
David Sterba
74710cf1fb btrfs: scrub: remove temporary csum array in scrub_checksum_super
The page contents with the checksum are available during the entire
function, so we don't need to make a copy.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:22 +02:00
David Sterba
83cf6d5eae btrfs: scrub: simplify superblock checksum calculation
BTRFS_SUPER_INFO_SIZE is 4096 and fits in a page on all supported
architectures, so we can calculate the checksum in one go.
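
A minimal sketch of the one-go calculation, with names assumed from the
scrub context rather than quoted from the patch:

    SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
    char *kaddr = page_address(spage->page);

    shash->tfm = fs_info->csum_shash;
    /* BTRFS_SUPER_INFO_SIZE (4096) fits in a single page, so one digest
     * call over the whole super block (minus its csum field) suffices. */
    crypto_shash_digest(shash, kaddr + BTRFS_CSUM_SIZE,
                        BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE,
                        calculated_csum);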

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:22 +02:00
David Sterba
b04852520e btrfs: scrub: unify naming of page address variables
As the page mapping has been removed, rename the variables to 'kaddr', the
name we use everywhere else. The type is changed to 'char *' so that pointer
arithmetic works without casts.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:22 +02:00
David Sterba
a8b3a89074 btrfs: scrub: remove kmap/kunmap of pages
All pages that scrub uses in the scrub_block::pagev array are allocated
with GFP_KERNEL and are never part of any mapping, so kmap is not necessary;
we only need to know the page address.

In scrub_write_page_to_dev_replace we don't even need to call
flush_dcache_page, for the same reason as above.
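
Roughly, the change amounts to the following (a sketch, not the literal
diff):

    /* Before: map the page just to get an address. */
    void *mapped = kmap(page);
    /* ... access the data ... */
    kunmap(page);

    /* After: GFP_KERNEL pages are lowmem and never highmem, so their
     * linear address is always valid and no mapping is needed. */
    char *kaddr = page_address(page);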

Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:22 +02:00
Qu Wenruo
74ef00185e btrfs: introduce "rescue=" mount option
This patch introduces a new "rescue=" mount option group for all mount
options for data recovery.

Different rescue sub-options are separated by ':', e.g.
"ro,rescue=nologreplay:usebackuproot".

The original plan was to use ';', but ';' needs to be escaped/quoted or it
will be interpreted by bash, just like '|'.

Obviously, users can also specify rescue options one by one, like:
"ro,rescue=nologreplay,rescue=usebackuproot".

The following mount options are converted to "rescue="; the old mount
options are deprecated but still available for compatibility purposes:

- usebackuproot
  Now it's "rescue=usebackuproot"

- nologreplay
  Now it's "rescue=nologreplay"

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:22 +02:00
Filipe Manana
a89ef455dd btrfs: use btrfs_alloc_data_chunk_ondemand() when allocating space for relocation
We currently use btrfs_check_data_free_space() when allocating space for
relocating data extents, but that is not necessary because that function
combines btrfs_alloc_data_chunk_ondemand(), which does the actual space
reservation, and btrfs_qgroup_reserve_data().

We can use btrfs_alloc_data_chunk_ondemand() directly because we know we
do not need to reserve qgroup space since we are dealing with a relocation
tree, which can never have qgroups (btrfs_qgroup_reserve_data() does
nothing as is_fstree() returns false for a relocation tree).

Conversely we can use btrfs_free_reserved_data_space_noquota() directly
instead of btrfs_free_reserved_data_space(), since we had no qgroup
reservation when allocating space.

This change is preparatory work for another patch in this series that
makes relocation reserve the exact amount of space it needs to relocate
a data block group. The function btrfs_check_data_free_space() has the
inconvenience of requiring a start offset argument, and we will want to be
able to allocate space for multiple, non-consecutive ranges at once.
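
The substitution described above might look roughly like this (a sketch
under the stated assumptions, not the literal diff):

    /* Before: reserves data space and, needlessly here, qgroup space. */
    ret = btrfs_check_data_free_space(inode, &data_reserved, start, len);

    /* After: relocation trees never have qgroups, so reserve space only. */
    ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), len);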

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Filipe Manana
46d4dac888 btrfs: remove the start argument from btrfs_free_reserved_data_space_noquota()
The start argument for btrfs_free_reserved_data_space_noquota() is only
used to make sure the amount of bytes we decrement from the bytes_may_use
counter of the data space_info object is aligned to the filesystem's
sector size. It serves no other purpose.

All its current callers always pass a length argument that is already
aligned to the sector size, so we can make the start argument go away. In
fact its presence makes it impossible to use the function in a context where
we just want to free a number of bytes but either do not know the range's
start offset or want to free multiple, non-contiguous ranges at once.
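
The assumed shape of the interface change, signatures only (the 'before'
variant used 'start' solely to sector-align 'len'):

    /* before */
    void btrfs_free_reserved_data_space_noquota(struct inode *inode,
                                                u64 start, u64 len);

    /* after: callers already pass a sector-aligned length */
    void btrfs_free_reserved_data_space_noquota(struct btrfs_fs_info *fs_info,
                                                u64 len);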

This change is preparatory work for a patch (the third in this series)
that makes relocation of data block groups that are not full reserve less
data space.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Liao Pingfang
ab48300921 btrfs: check-integrity: remove unnecessary failure messages during memory allocation
As a dump_stack() is already done on memory allocation failures, these
failure messages are redundant and can simply be deleted.

Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Anand Jain
b5790d5180 btrfs: use helper btrfs_get_block_group
Use the helper function where the increment of the block_group reference
count is open coded. As btrfs_get_block_group() is a one-liner we could have
kept it open coded, but its partner function btrfs_put_block_group(), which
does the corresponding free, is not a one-liner.
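
For reference, the helper is essentially a one-liner along these lines (the
refcount field name is assumed):

    static inline void btrfs_get_block_group(struct btrfs_block_group *cache)
    {
            atomic_inc(&cache->count);
    }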

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Anand Jain
69b0e093c7 btrfs: let btrfs_return_cluster_to_free_space() return void
__btrfs_return_cluster_to_free_space() only ever returns 0, and its parent
functions don't need the return value either, so make this a void function.

Further, as none of the callers of btrfs_return_cluster_to_free_space()
is actually using the return from this function, make this function also
return void.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Filipe Manana
f22f457a1a btrfs: remove no longer necessary chunk mutex locking cases
Initially when the 'removed' flag was added to a block group to avoid
races between block group removal and fitrim, by commit 04216820fe
("Btrfs: fix race between fs trimming and block group remove/allocation"),
we had to lock the chunks mutex because we could be moving the block
group from its current list, the pending chunks list, into the pinned
chunks list, or we could just be adding it to the pinned chunks if it was
not in the pending chunks list. Both lists were protected by the chunk
mutex.

However we no longer have those lists since commit 1c11b63eff
("btrfs: replace pending/pinned chunks lists with io tree"), and locking
the chunk mutex is no longer necessary because of that. The same happens
at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block
group's extent map could be part of the pinned chunks list and the call
to remove_extent_mapping() could be deleting it from that list, which
used to be protected by that mutex.

So just remove those lock and unlock calls as they are not needed anymore.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:21 +02:00
Johannes Thumshirn
e3ba67a108 btrfs: factor out reading of bg from find_first_block_group
When find_first_block_group() finds a block group item in the extent-tree,
it does a lookup of the object in the extent mapping tree and does further
checks on the item.

Factor out this step from find_first_block_group() so we can further
simplify the code.

While we're at it, we can also just return early from
find_first_block_group() if the tree slot isn't found.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:20 +02:00
Johannes Thumshirn
89d7da9bc5 btrfs: get mapping tree directly from fsinfo in find_first_block_group
We already have an fs_info in our function parameters, there's no need
to do the maths again and get fs_info from the extent_root just to get
the mapping_tree.

Instead directly grab the mapping_tree from fs_info.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:20 +02:00
Nikolay Borisov
96f9b0f2fa btrfs: simplify checks when adding excluded ranges
Addresses held in the 'logical' array are always guaranteed to fall within
the boundaries of the block group. That is, 'start' can never be smaller
than cache->start. This invariant follows from the way the addresses are
calculated in btrfs_rmap_block:

    stripe_nr = physical - map->stripes[i].physical;
    stripe_nr = div64_u64(stripe_nr, map->stripe_len);
    bytenr = chunk_start + stripe_nr * io_stripe_size;

I.e. it's always some IO stripe within the given chunk.

Exploit this invariant to simplify the body of the loop by removing the
unnecessary 'if' since its 'else' part is the one always executed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:20 +02:00
Nikolay Borisov
9e22b92598 btrfs: read stripe len directly in btrfs_rmap_block
extent_map::orig_block_len contains the size of a physical stripe when
it's used to describe block groups (calculated in read_one_chunk via
calc_stripe_length or calculated in decide_stripe_size and then assigned
to extent_map::orig_block_len in create_chunk). Exploit this fact to get
the size directly rather than opencoding the calculations. No functional
changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:20 +02:00
Nikolay Borisov
6a3c7f5c87 btrfs: don't balance btree inode pages from buffered write path
The call to btrfs_btree_balance_dirty has been there since the early days
of BTRFS, when the btree was directly modified from the write path and
hence dirtied btree inode pages. Since the implementation of b888db2bd7
("Btrfs: Add delayed allocation to the extent based page tree code")
13 years ago, the btree is no longer modified from the write path, so there
is no point in calling this function. Just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:20 +02:00
Greg Kroah-Hartman
eea2c51f81 Merge 5.8-rc7 into driver-core-next
We want the driver core fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-27 12:39:54 +02:00
Randy Dunlap
dcec10a5d1 udf: osta_udf.h: delete a duplicated word
Drop the repeated word "struct" in a comment.

Link: https://lore.kernel.org/r/20200720001455.31882-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 10:58:55 +02:00
Randy Dunlap
269f00a950 reiserfs: reiserfs.h: delete a duplicated word
Drop the repeated word "than" in a comment.

Link: https://lore.kernel.org/r/20200720001431.29718-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 10:58:32 +02:00
Randy Dunlap
17a0445e7b ext2: ext2.h: fix duplicated word + typos
Change the repeated word "the" in "it the the" to "it is the".
Fix typo "recentl" to "recently".
Fix verb "give" to "gives".

Link: https://lore.kernel.org/r/20200720001327.23603-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jan Kara <jack@suse.com>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 10:58:06 +02:00
Chao Yu
b2f57a8e6b f2fs: compress: delay temp page allocation
Currently, we allocate temp pages, which are used to pad holes in a
cluster, during read IO submission, and it may take a long time before they
are released in f2fs_decompress_pages(). Since they are only used as a
temporary output buffer in the decompression context, let's just do the
allocation in that context to shorten the time the memory pool's resources
are occupied.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-26 08:20:16 -07:00
Chao Yu
944dd22ea4 f2fs: compress: fix to update isize when overwriting compressed file
We missed updating the isize of a compressed file in write_end() in the
below case:

cluster size is 16KB

- write 14KB data from offset 0
- overwrite 16KB data from offset 0
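
A hypothetical userspace reproducer of the above sequence; the path and
data are illustrative, and the file is assumed to live on an f2fs mount
with compression enabled:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[16 * 1024];
            int fd = open("/mnt/f2fs/comp_file", O_CREAT | O_RDWR, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            memset(buf, 0xaa, sizeof(buf));
            pwrite(fd, buf, 14 * 1024, 0);  /* write 14KB from offset 0 */
            pwrite(fd, buf, 16 * 1024, 0);  /* overwrite 16KB from offset 0 */
            /* Without the fix, stat(2) could report 14KB instead of 16KB. */
            close(fd);
            return 0;
    }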

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-26 08:19:06 -07:00
Jack Qiu
a87aff1d49 f2fs: space related cleanup
Just for code style, no logic change:
1. delete useless spaces
2. change spaces into tabs

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-07-26 08:15:40 -07:00
Yonghong Song
f9c7927295 bpf: Refactor to provide aux info to bpf_iter_init_seq_priv_t
This patch refactors the target bpf_iter_init_seq_priv_t callback function
to accept additional information. This will be needed in later patches for
map element targets, since a particular map should be passed in to traverse
the elements of that particular map. In the future, other information may be
passed to the target as well, e.g., pid, cgroup id, etc., to customize the
iterator.
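
The refactored callback presumably ends up with a shape like the following;
the aux struct contents are an assumption based on the description above:

    /* Aux info handed to the target at seq_file init time; for map element
     * targets this would carry the map to traverse (assumed layout). */
    struct bpf_iter_aux_info {
            struct bpf_map *map;
    };

    typedef int (*bpf_iter_init_seq_priv_t)(void *priv_data,
                                            struct bpf_iter_aux_info *aux);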

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
David S. Miller
a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Linus Torvalds
17baa44286 Merge tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull EFI fixes from Ingo Molnar:
 "Various EFI fixes:

   - Fix the layering violation in the use of the EFI runtime services
     availability mask in users of the 'efivars' abstraction

   - Revert build fix for GCC v4.8 which is no longer supported

   - Clean up some x86 EFI stub details, some of which are borderline
     bugs that copy around garbage into padding fields - let's fix these
     out of caution.

   - Fix build issues while working on RISC-V support

   - Avoid --whole-archive when linking the stub on arm64"

* tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Revert "efi/x86: Fix build with gcc 4"
  efi/efivars: Expose RT service availability via efivars abstraction
  efi/libstub: Move the function prototypes to header file
  efi/libstub: Fix gcc error around __umoddi3 for 32 bit builds
  efi/libstub/arm64: link stub lib.a conditionally
  efi/x86: Only copy upto the end of setup_header
  efi/x86: Remove unused variables
2020-07-25 13:18:42 -07:00
Linus Torvalds
7cb3a5c5f6 Merge tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6 into master
Pull cifs fix from Steve French:
 "A fix for a recently discovered regression in rename to older servers
  caused by a recent patch"

* tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6:
  Revert "cifs: Fix the target file was deleted when rename failed."
2020-07-25 12:53:46 -07:00
Pavel Begunkov
b089ed390b io-wq: update hash bits
Linked requests are hashed; remove a comment stating otherwise. Also move
the hash bits to emphasise that we don't carry them through loop iterations
but set them every time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 09:47:44 -06:00
Pavel Begunkov
f063c5477e io_uring: fix missing io_queue_linked_timeout()
Whoever called io_prep_linked_timeout() should also do
io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the
punting path leaving linked timeouts prepared but never queued.

Fixes: 6df1db6b54 ("io_uring: fix mis-refcounting linked timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 09:47:44 -06:00
Pavel Begunkov
b65e0dd6a2 io_uring: mark ->work uninitialised after cleanup
Remove REQ_F_WORK_INITIALIZED after io_req_clean_work(). That's a cold
path but it is safer for those using io_req_clean_work() outside of
*dismantle_req()/*io_free(). And for the same reason, zero work.fs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 09:47:44 -06:00
Linus Torvalds
5876aa073f Merge tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux into master
Pull nfsd fix from Bruce Fields:
 "Just one fix for a NULL dereference if someone happens to read
  /proc/fs/nfsd/client/../state at the wrong moment"

* tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix NULL dereference in nfsd/clients display code
2020-07-24 16:27:54 -07:00
Randy Dunlap
94a4beaa6b nfsd: netns.h: delete a duplicated word
Drop the repeated word "the" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-24 17:25:13 -04:00
Linus Torvalds
68845a55c3 Merge branch 'akpm' into master (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Subsystems affected by this patch series: mm/pagemap, mm/shmem,
  mm/hotfixes, mm/memcg, mm/hugetlb, mailmap, squashfs, scripts,
  io-mapping, MAINTAINERS, and gdb"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  scripts/gdb: fix lx-symbols 'gdb.error' while loading modules
  MAINTAINERS: add KCOV section
  io-mapping: indicate mapping failure
  scripts/decode_stacktrace: strip basepath from all paths
  squashfs: fix length field overlap check in metadata reading
  mailmap: add entry for Mike Rapoport
  khugepaged: fix null-pointer dereference due to race
  mm/hugetlb: avoid hardcoding while checking if cma is enabled
  mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
  mm/memcg: fix refcount error while moving and swapping
  mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages()
  mm: initialize return of vm_insert_pages
  vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way
  mm/mmap.c: close race between munmap() and expand_upwards()/downwards()
2020-07-24 14:24:35 -07:00
Linus Torvalds
0669704270 Merge tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master
Pull btrfs fixes from David Sterba:
 "A few resouce leak fixes from recent patches, all are stable material.

  The problems have been observed during testing or have a reproducer"

* tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix mount failure caused by race with umount
  btrfs: fix page leaks after failure to lock page for delalloc
  btrfs: qgroup: fix data leak caused by race between writeback and truncate
  btrfs: fix double free on ulist after backref resolution failure
2020-07-24 14:11:43 -07:00
Linus Torvalds
6a343656d3 Merge tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs into master
Pull zonefs fixes from Damien Le Moal:
 "Two fixes, the first one to remove compilation warnings and the second
  to avoid potentially inefficient allocation of BIOs for direct writes
  into sequential zones"

* tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: count pages after truncating the iterator
  zonefs: Fix compilation warning
2020-07-24 14:09:19 -07:00
Linus Torvalds
1f68f31b51 Merge tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block into master
Pull io_uring fixes from Jens Axboe:

 - Fix discrepancy in how sqe->flags are treated for a few requests,
   this makes it consistent (Daniele)

 - Ensure that poll driven retry works with double waitqueue poll users

 - Fix a missing io_req_init_async() (Pavel)

* tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block:
  io_uring: missed req_init_async() for IOSQE_ASYNC
  io_uring: always allow drain/link/hardlink/async sqe flags
  io_uring: ensure double poll additions work with both request types
2020-07-24 14:02:41 -07:00
Phillip Lougher
2910c59fd0 squashfs: fix length field overlap check in metadata reading
This is a regression introduced by the "migrate from ll_rw_block usage
to BIO" patch.

Squashfs packs structures on byte boundaries, and due to that the length
field (of the metadata block) may not be fully in the current block.
The new code rewrote and introduced a faulty check for that edge case.

Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Bernd Amend <bernd.amend@gmail.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: http://lkml.kernel.org/r/20200717195536.16069-1-phillip@squashfs.org.uk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-24 12:42:41 -07:00
Pavel Begunkov
f56040b819 io_uring: deduplicate io_grab_files() calls
Move io_req_init_async() into io_grab_files(); it's safer this way. Note
that io_queue_async_work() does *init_async(), so it's valid to move it out
of the __io_queue_sqe() punt path. Also, add a helper around io_grab_files().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
ae34817bd9 io_uring: don't do opcode prep twice
Calling into opcode prep handlers may be dangerous, as they re-read
SQE but might not re-initialise requests completely. If io_req_defer()
passed fast checks and is done with preparations, punt it async.

As all other cases are covered with nulling @sqe, this guarantees that
io_[opcode]_prep() are visited only once per request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Xiaoguang Wang
23b3628e45 io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
In io_sq_thread(), if there are task works to handle, the current code
will skip schedule() and go on polling the sq again, but it forgets to
clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers
to set and clear the IORING_SQ_NEED_WAKEUP flag.
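
The two helpers would look something like this; the names are assumed and
any locking around the flag update is omitted:

    static inline void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx)
    {
            ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
    }

    static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
    {
            ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP;
    }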

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
5af1d13e8f io_uring: batch put_task_struct()
As every iopoll request has a task ref, it becomes expensive to put them
one by one; instead we can put several at once by integrating that into
io_req_free_batch().
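
Conceptually, instead of dropping one reference per request, accumulate
references and drop the whole batch at once (the batch-state names are
assumed):

    /* before: one atomic op per request */
    put_task_struct(req->task);

    /* after: accumulate refs for the same task while freeing a batch,
     * then drop them all in one go */
    if (rb->task)
            put_task_struct_many(rb->task, rb->task_refs);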

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
cbcf72148d io_uring: return locked and pinned page accounting
Locked and pinned memory accounting in io_{,un}account_mem() depends on
having ->sqo_mm, which is NULL after a recent change for non SQPOLL'ed
io_ring. That disables the accounting.

Return ->sqo_mm initialisation back, and do __io_sq_thread_acquire_mm()
based on IORING_SETUP_SQPOLL flag.

Fixes: 8eb06d7e8d ("io_uring: fix missing ->mm on exit")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
5dbcad51f7 io_uring: don't miscount pinned memory
io_sqe_buffer_unregister() uses ctx->sqo_mm for memory accounting, but
io_ring_ctx_free() drops ->sqo_mm first, leaving pinned_vm over-accounted.
Postpone the mm cleanup until it is no longer needed.

Fixes: 309758254e ("io_uring: report pinned memory usage")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
7fbb1b541f io_uring: don't open-code recv kbuf managment
Don't implement the fast path of kbuf freeing and management inlined into
io_recv{,msg}(); that's error prone and duplicates handling. Replace it with
a helper, io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in
io_read/write().

This also keeps cflags calculation in one place, removing duplication
between rw and recv/send.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
8ff069bf2e io_uring: extract io_put_kbuf() helper
Extract a common helper for cleaning up a selected buffer; this will be
used shortly. By the way, correct the cflags type to unsigned and, as kbufs
are tracked by a flag anyway, remove the useless zeroing of req->rw.addr.
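
The extracted helper plausibly looks like this; the exact flag handling is
an assumption, not quoted from the patch:

    static unsigned int io_put_kbuf(struct io_kiocb *req,
                                    struct io_buffer *kbuf)
    {
            unsigned int cflags;

            /* Encode the buffer id into the CQE flags for userspace. */
            cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
            cflags |= IORING_CQE_F_BUFFER;
            req->flags &= ~REQ_F_BUFFER_SELECTED;
            kfree(kbuf);
            return cflags;
    }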

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
bc02ef3325 io_uring: move BUFFER_SELECT check into *recv[msg]
Move the REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select() and
do it at its call sites. That saves us from double error checking and
possibly an extra function call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00