64610 Commits

Author SHA1 Message Date
Linus Torvalds
87ad46e601 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
 "A brown paper bag slipped through my proc changes, and syzkaller
  caught it when the code ended up in your tree.

  I have opted to fix it the simplest cleanest way I know how, so there
  is no reasonable chance for the bug to repeat"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use a dedicated lock in struct pid
2020-04-10 12:59:56 -07:00
Steve French
4e8aea30f7 smb3: enable swap on SMB3 mounts
Add experimental support for allowing a swap file to be on an SMB3
mount.  There are use cases where swapping over a secure network
filesystem is preferable. In some cases there are no local
block devices large enough, and network block devices can be
hard to set up and secure.  And in some cases there are no
local block devices at all (e.g. with the recent addition of
remote boot over SMB3 mounts).

There are various enhancements that can be added later e.g.:
- doing a mandatory byte range lock over the swapfile (until
the Linux VFS is modified to notify the file system that an open
is for a swapfile, when the file can be opened "DENY_ALL" to prevent
others from opening it).
- pinning more buffers in the underlying transport to minimize memory
allocations in the TCP stack under the fs
- documenting how to create ACLs (on the server) to secure the
swapfile (or adding additional tools to cifs-utils to make it easier)

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-04-10 13:32:32 -05:00
Linus Torvalds
172edde960 Merge tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "Here's a set of fixes that either weren't quite ready for the first,
  or came about from some intensive testing on memcached with 350K+
  sockets.

  Summary:

   - Fixes for races or deadlocks around poll handling

   - Don't double account fixed files against RLIMIT_NOFILE

   - IORING_OP_OPENAT LFS fix

   - Poll retry handling (Bijan)

   - Missing finish_wait() for SQPOLL (Hillf)

   - Cleanup/split of io_kiocb alloc vs ctx references (Pavel)

   - Fixed file unregistration and init fixes (Xiaoguang)

   - Various little fixes (Xiaoguang, Pavel, Colin)"

* tag 'io_uring-5.7-2020-04-09' of git://git.kernel.dk/linux-block:
  io_uring: punt final io_ring_ctx wait-and-free to workqueue
  io_uring: fix fs cleanup on cqe overflow
  io_uring: don't read user-shared sqe flags twice
  io_uring: remove req init from io_get_req()
  io_uring: alloc req only after getting sqe
  io_uring: simplify io_get_sqring
  io_uring: do not always copy iovec in io_req_map_rw()
  io_uring: ensure openat sets O_LARGEFILE if needed
  io_uring: initialize fixed_file_data lock
  io_uring: remove redundant variable pointer nxt and io_wq_assign_next call
  io_uring: fix ctx refcounting in io_submit_sqes()
  io_uring: process requests completed with -EAGAIN on poll list
  io_uring: remove bogus RLIMIT_NOFILE check in file registration
  io_uring: use io-wq manager as backup task if task is exiting
  io_uring: grab task reference for poll requests
  io_uring: retry poll if we got woken with non-matching mask
  io_uring: add missing finish_wait() in io_sq_thread()
  io_uring: refactor file register/unregister/update handling
2020-04-10 10:02:21 -07:00
Linus Torvalds
8c3c07439e Merge tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
 "As promised last week, this batch changes how xfs interacts with
  memory reclaim; how the log batches and throttles log items; how hard
  writes near ENOSPC will try to squeeze more space out of the
  filesystem; and hopefully fix the last of the umount hangs after a
  catastrophic failure.

  Summary:

   - Validate the realtime geometry in the superblock when mounting

   - Refactor a bunch of tricky flag handling in the log code

   - Flush the CIL more judiciously so that we don't wait until there
     are millions of log items consuming a lot of memory.

   - Throttle transaction commits to prevent the xfs frontend from
     flooding the CIL with too many log items.

   - Account metadata buffers correctly for memory reclaim.

   - Mark slabs properly for memory reclaim. These should help reclaim
     run more effectively when XFS is using a lot of memory.

   - Don't write a garbage log record at unmount time if we're trying to
     trigger summary counter recalculation at next mount.

   - Don't block the AIL on locked dquot/inode buffers; instead trigger
     its backoff mechanism to give the lock holder a chance to finish
     up.

   - Ratelimit writeback flushing when buffered writes encounter ENOSPC.

   - Other minor cleanups.

   - Make reflink a synchronous operation when the fs is mounted with
     wsync or sync, which means that now we force the log to disk to
     record the changes"

* tag 'xfs-5.7-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  xfs: reflink should force the log out if mounted with wsync
  xfs: factor out a new xfs_log_force_inode helper
  xfs: fix inode number overflow in ifree cluster helper
  xfs: remove redundant variable assignment in xfs_symlink()
  xfs: ratelimit inode flush on buffered write ENOSPC
  xfs: return locked status of inode buffer on xfsaild push
  xfs: trylock underlying buffer on dquot flush
  xfs: remove unnecessary ternary from xfs_create
  xfs: don't write a corrupt unmount record to force summary counter recalc
  xfs: factor inode lookup from xfs_ifree_cluster
  xfs: tail updates only need to occur when LSN changes
  xfs: factor common AIL item deletion code
  xfs: correctly acount for reclaimable slabs
  xfs: Improve metadata buffer reclaim accountability
  xfs: don't allow log IO to be throttled
  xfs: Throttle commits on delayed background CIL push
  xfs: Lower CIL flush limit for large logs
  xfs: remove some stale comments from the log code
  xfs: refactor unmount record writing
  xfs: merge xlog_commit_record with xlog_write_done
  ...
2020-04-10 09:54:26 -07:00
Jens Axboe
85faa7b834 io_uring: punt final io_ring_ctx wait-and-free to workqueue
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.

This was reproduced with hundreds of thousands of sockets running
memcached; I haven't been able to reproduce it synthetically.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-09 18:45:27 -06:00
Christoph Hellwig
ecf84096a5 ubifs: remove broken lazytime support
When "ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs" introduced atime
support to ubifs, it also added lazytime support.  As far as I can tell
the lazytime support is terminally broken, as it causes
mark_inode_dirty_sync to be called from __writeback_single_inode, which
will then trigger the locking assert in ubifs_dirty_inode.  Just remove
the broken lazytime support for now; it can be added back later,
especially as some infrastructure changes should make that easier soon.

Fixes: 8c1c5f2638 ("ubifs: introduce UBIFS_ATIME_SUPPORT to ubifs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-04-09 22:03:56 +02:00
Daniel Xu
772b314066 xattr: fix uninitialized out-param
`removed_size` isn't correctly initialized (as the doc comment
suggests) on memory allocation failures. Fix by moving initialization up
a bit.

Fixes: 0c47383ba3 ("kernfs: Add option to enable user xattrs")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-04-09 15:33:09 -04:00
Steve French
1dc94b7381 smb3: change noisy error message to FYI
The noisy posix error message in readdir was supposed
to be an FYI (not enabled by default)
  CIFS VFS: XXX dev 66306, reparse 0, mode 755

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-04-09 13:28:24 -05:00
Linus Torvalds
e4da01d833 Merge tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
 "The bulk of this is the series to make CONFIG_COMPAT user-selectable,
  it's been around for a long time but was blocked behind the
  syscall-in-C series.

  Plus there's also a few fixes and other minor things.

  Summary:

   - A fix for a crash in machine check handling on pseries (ie. guests)

   - A small series to make it possible to disable CONFIG_COMPAT, and
     turn it off by default for ppc64le where it's not used.

   - A few other miscellaneous fixes and small improvements.

  Thanks to: Alexey Kardashevskiy, Anju T Sudhakar, Arnd Bergmann,
  Christophe Leroy, Dan Carpenter, Ganesh Goudar, Geert Uytterhoeven,
  Geoff Levand, Mahesh Salgaonkar, Markus Elfring, Michal Suchanek,
  Nicholas Piggin, Stephen Boyd, Wen Xiong"

* tag 'powerpc-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  selftests/powerpc: Always build the tm-poison test 64-bit
  powerpc: Improve ppc_save_regs()
  Revert "powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled"
  powerpc/time: Replace <linux/clk-provider.h> by <linux/of_clk.h>
  powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory
  powerpc/perf: split callchain.c by bitness
  powerpc/64: Make COMPAT user-selectable disabled on littleendian by default.
  powerpc/64: make buildable without CONFIG_COMPAT
  powerpc/perf: consolidate valid_user_sp -> invalid_user_sp
  powerpc/perf: consolidate read_user_stack_32
  powerpc: move common register copy functions from signal_32.c to signal.c
  powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
  powerpc/ps3: Set CONFIG_UEVENT_HELPER=y in ps3_defconfig
  powerpc/ps3: Remove an unneeded NULL check
  powerpc/ps3: Remove duplicate error message
  powerpc/powernv: Re-enable imc trace-mode in kernel
  powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
  powerpc/pseries: Fix MCE handling on pseries
  selftests/eeh: Skip ahci adapters
  powerpc/64s: Fix doorbell wakeup msgclr optimisation
2020-04-09 11:01:42 -07:00
Eric W. Biederman
63f818f46a proc: Use a dedicated lock in struct pid
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pid->wait_pidfd);
>                                local_irq_disable();
>                                lock(tasklist_lock);
>                                lock(&pid->wait_pidfd);
>   <Interrupt>
>     lock(tasklist_lock);
>
>  *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that wait_pidfd.lock is taken under the tasklist lock, so it
must always be taken with irqs disabled: tasklist_lock can be taken from
interrupt context, and if wait_pidfd.lock was already taken this would
create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
its own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid->lock with irqs disabled.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-09 12:15:35 -05:00
Pavel Begunkov
c398ecb3d6 io_uring: fix fs cleanup on cqe overflow
If completion queue overflow occurs, __io_cqring_fill_event() will
update req->cflags, which is in a union with req->work and happens to
be aliased to req->work.fs. Following io_free_req() ->
io_req_work_drop_env() may get a bunch of different problems (miscount
fs->users, segfault, etc) on cleaning @fs.
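
The aliasing described above can be reproduced in userspace with a small ctypes sketch. The layout here is an assumption made for illustration only (a 32-bit `cflags` overlapping a struct whose first member stands in for the `fs` pointer); it is not the real io_kiocb definition, and `clobbered_fs` is a hypothetical helper.

```python
import ctypes


class Work(ctypes.Structure):
    # Stand-in for req->work; a 64-bit integer plays the role of the fs pointer.
    _fields_ = [("fs", ctypes.c_uint64)]


class Req(ctypes.Union):
    # cflags and work share the same storage, as the commit message describes.
    _fields_ = [("cflags", ctypes.c_uint32), ("work", Work)]


def clobbered_fs(fs_value, cflags):
    """Store a 'pointer' in work.fs, then write cflags and read fs back."""
    req = Req()
    req.work.fs = fs_value
    req.cflags = cflags  # the overflow path writes here
    return req.work.fs


original = 0x1122334455667788
after = clobbered_fs(original, 0xDEADBEEF)
print(hex(original), "->", hex(after))
```

Any later cleanup that trusts the stored value would be working with a corrupted pointer, which matches the range of symptoms listed in the message.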

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-09 09:38:18 -06:00
Linus Torvalds
fcc95f0640 Merge tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
 "The main items are:

   - support for asynchronous create and unlink (Jeff Layton).

     Creates and unlinks are satisfied locally, without waiting for a
     reply from the MDS, provided the client has been granted
     appropriate caps (new in v15.y.z ("Octopus") release). This can be
     a big help for metadata heavy workloads such as tar and rsync.
     Opt-in with the new nowsync mount option.

   - multiple blk-mq queues for rbd (Hannes Reinecke and myself).

     When the driver was converted to blk-mq, we settled on a single
     blk-mq queue because of a global lock in libceph and some other
     technical debt. These have since been addressed, so allocate a
     queue per CPU to enhance parallelism.

   - don't hold onto caps that aren't actually needed (Zheng Yan).

     This has been our long-standing behavior, but it causes issues with
     some active/standby applications (synchronous I/O, stalls if the
     standby goes down, etc).

   - .snap directory timestamps consistent with ceph-fuse (Luis
     Henriques)"

* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
  ceph: fix snapshot directory timestamps
  ceph: wait for async creating inode before requesting new max size
  ceph: don't skip updating wanted caps when cap is stale
  ceph: request new max size only when there is auth cap
  ceph: cleanup return error of try_get_cap_refs()
  ceph: return ceph_mdsc_do_request() errors from __get_parent()
  ceph: check all mds' caps after page writeback
  ceph: update i_requested_max_size only when sending cap msg to auth mds
  ceph: simplify calling of ceph_get_fmode()
  ceph: remove delay check logic from ceph_check_caps()
  ceph: consider inode's last read/write when calculating wanted caps
  ceph: always renew caps if mds_wanted is insufficient
  ceph: update dentry lease for async create
  ceph: attempt to do async create when possible
  ceph: cache layout in parent dir on first sync create
  ceph: add new MDS req field to hold delegated inode number
  ceph: decode interval_sets for delegated inos
  ceph: make ceph_fill_inode non-static
  ceph: perform asynchronous unlink if we have sufficient caps
  ceph: don't take refs to want mask unless we have all bits
  ...
2020-04-08 21:44:05 -07:00
Linus Torvalds
c6b80eb89b Merge tag 'ovl-update-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:

 - Fix failure to copy-up files from certain NFSv4 mounts

 - Sort out inconsistencies between st_ino and i_ino (used in /proc/locks)

 - Allow consistent (POSIX-y) inode numbering in more cases

 - Allow virtiofs to be used as upper layer

 - Miscellaneous cleanups and fixes

* tag 'ovl-update-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: document xino expected behavior
  ovl: enable xino automatically in more cases
  ovl: avoid possible inode number collisions with xino=on
  ovl: use a private non-persistent ino pool
  ovl: fix WARN_ON nlink drop to zero
  ovl: fix a typo in comment
  ovl: replace zero-length array with flexible-array member
  ovl: ovl_obtain_alias(): don't call d_instantiate_anon() for old
  ovl: strict upper fs requirements for remote upper fs
  ovl: check if upper fs supports RENAME_WHITEOUT
  ovl: allow remote upper
  ovl: decide if revalidate needed on a per-dentry basis
  ovl: separate detection of remote upper layer from stacked overlay
  ovl: restructure dentry revalidation
  ovl: ignore failure to copy up unknown xattrs
  ovl: document permission model
  ovl: simplify i_ino initialization
  ovl: factor out helper ovl_get_root()
  ovl: fix out of date comment and unreachable code
  ovl: fix value of i_ino for lower hardlink corner case
2020-04-08 21:40:30 -07:00
Linus Torvalds
9744b923d5 Merge tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
 "Fix a problem in readahead where we can crash if we can't allocate a
  full bio due to GFP_NORETRY"

* tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Handle memory allocation failure in readahead
2020-04-08 21:37:18 -07:00
Linus Torvalds
9b06860d7c Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
 "There were multiple touches outside of drivers/nvdimm/ this round to
  add cross arch compatibility to the devm_memremap_pages() interface,
  enhance numa information for persistent memory ranges, and add a
  zero_page_range() dax operation.

  This cycle I switched from the patchwork api to Konstantin's b4 script
  for collecting tags (from x86, PowerPC, filesystem, and device-mapper
  folks), and everything looks to have gone ok there. This has all
  appeared in -next with no reported issues.

  Summary:

   - Add support for region alignment configuration and enforcement to
     fix compatibility across architectures and PowerPC page size
     configurations.

   - Introduce 'zero_page_range' as a dax operation. This facilitates
     filesystem-dax operation without a block-device.

   - Introduce phys_to_target_node() to facilitate drivers that want to
     know resulting numa node if a given reserved address range was
     onlined.

   - Advertise a persistence-domain for of_pmem and papr_scm. The
     persistence domain indicates where cpu-store cycles need to reach
     in the platform-memory subsystem before the platform will consider
     them power-fail protected.

   - Promote numa_map_to_online_node() to a cross-kernel generic
     facility.

   - Save x86 numa information to allow for node-id lookups for reserved
     memory ranges, deploy that capability for the e820-pmem driver.

   - Pick up some miscellaneous minor fixes that missed v5.6-final,
     including some smatch reports in the ioctl path and some unit
     test compilation fixups.

   - Fixup some flexible-array declarations"

* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
  dax: Move mandatory ->zero_page_range() check in alloc_dax()
  dax,iomap: Add helper dax_iomap_zero() to zero a range
  dax: Use new dax zero page method for zeroing a page
  dm,dax: Add dax zero_page_range operation
  s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
  dax, pmem: Add a dax operation zero_page_range
  pmem: Add functions for reading/writing page to/from pmem
  libnvdimm: Update persistence domain value for of_pmem and papr_scm device
  tools/test/nvdimm: Fix out of tree build
  libnvdimm/region: Fix build error
  libnvdimm/region: Replace zero-length array with flexible-array member
  libnvdimm/label: Replace zero-length array with flexible-array member
  ACPI: NFIT: Replace zero-length array with flexible-array member
  libnvdimm/region: Introduce an 'align' attribute
  libnvdimm/region: Introduce NDD_LABELING
  libnvdimm/namespace: Enforce memremap_compat_align()
  libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
  libnvdimm: Out of bounds read in __nd_ioctl()
  acpi/nfit: improve bounds checking for 'func'
  mm/memremap_pages: Introduce memremap_compat_align()
  ...
2020-04-08 21:03:40 -07:00
Filipe Manana
d611add48b btrfs: fix reclaim counter leak of space_info objects
Whenever we add a ticket to a space_info object we increment the object's
reclaim_size counter with the ticket's bytes, and we decrement it with
the corresponding amount only when we are able to grant the requested
space to the ticket. When we are not able to grant the space to a ticket,
or when the ticket is removed due to a signal (e.g. an application has
received sigterm from the terminal) we never decrement the counter with
the corresponding bytes from the ticket. This leak can cause the
space reclaim code to later do much more work than necessary. So fix it
by decrementing the counter when those two cases happen as well.
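
The accounting problem can be sketched with a toy model. Everything below is hypothetical and only mirrors the description above: `SpaceInfo`, `add_ticket`, `grant` and `remove_ticket` are illustrative names, not btrfs functions, and the `fixed` flag selects between the old and the corrected behavior.

```python
class SpaceInfo:
    """Toy stand-in for a space_info object with a reclaim_size counter."""

    def __init__(self):
        self.reclaim_size = 0
        self.tickets = []

    def add_ticket(self, nbytes):
        # Adding a ticket always bumps the counter.
        self.tickets.append(nbytes)
        self.reclaim_size += nbytes

    def grant(self, nbytes):
        # Space was granted: the counter is decremented.
        self.tickets.remove(nbytes)
        self.reclaim_size -= nbytes

    def remove_ticket(self, nbytes, fixed):
        # Failure or signal path: before the fix nothing was decremented here.
        self.tickets.remove(nbytes)
        if fixed:
            self.reclaim_size -= nbytes


def leaked_bytes(fixed):
    info = SpaceInfo()
    info.add_ticket(4096)
    info.add_ticket(8192)
    info.grant(4096)                  # first ticket is satisfied
    info.remove_ticket(8192, fixed)   # second ticket fails or is interrupted
    return info.reclaim_size


print(leaked_bytes(fixed=False))  # 8192: no tickets left, counter still non-zero
print(leaked_bytes(fixed=True))   # 0
```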

Fixes: db161806dc ("btrfs: account ticket size at add/delete time")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08 19:11:19 +02:00
Filipe Manana
7af597433d btrfs: make full fsyncs always operate on the entire file again
This is a revert of commit 0a8068a3dd ("btrfs: make ranged full
fsyncs more efficient"), with updated comment in btrfs_sync_file.

Commit 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
made full fsyncs operate on the given range only as it assumed it was safe
when using the NO_HOLES feature, since the hole detection was simplified
some time ago and no longer was a source for races with ordered extent
completion of adjacent file ranges.

However it's still not safe to have a full fsync only operate on the given
range, because extent maps for new extents might not be present in memory
due to inode eviction or extent cloning. Consider the following example:

1) We are currently at transaction N;

2) We write to the file range [0, 1MiB);

3) Writeback finishes for the whole range and ordered extents complete,
   while we are still at transaction N;

4) The inode is evicted;

5) We open the file for writing, causing the inode to be loaded to
   memory again, which sets the 'full sync' bit on its flags. At this
   point the inode's list of modified extent maps is empty (figuring
   out which extents were created in the current transaction and were
   not yet logged by an fsync is expensive, that's why we set the
   'full sync' bit when loading an inode);

6) We write to the file range [512KiB, 768KiB);

7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
   This correctly flushes this range and logs its extent into the log
   tree. When the writeback started an extent map for range [512KiB, 768KiB)
   was added to the inode's list of modified extents, and when the fsync()
   finishes logging it removes that extent map from the list of modified
   extent maps. This fsync also clears the 'full sync' bit;

8) We do a regular fsync() (full ranged). This fsync() ends up doing
   nothing because the inode's list of modified extents is empty and
   no other changes happened since the previous ranged fsync(), so
   it just returns success (0) and we end up never logging extents for
   the file ranges [0, 512KiB) and [768KiB, 1MiB).
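
The eight steps above can be replayed with a small model. This is a sketch under simplifying assumptions, not btrfs code: the 1MiB file is treated as four 256KiB chunks, and `Inode`, `fsync` and the `ranged_full_fsync` switch are hypothetical names, where `True` models the behavior being reverted and `False` models a full fsync that always covers the entire file.

```python
CHUNK_KIB = 256  # model the 1MiB file as four 256KiB chunks


class Inode:
    def __init__(self, chunks_in_transaction):
        # Chunks whose extents were created in the current transaction.
        self.dirty_in_transaction = set(chunks_in_transaction)
        # A freshly loaded inode has no modified extent maps in memory
        # and has the 'full sync' bit set (step 5).
        self.modified = set()
        self.full_sync = True
        self.logged = set()

    def write(self, chunks):
        self.dirty_in_transaction |= set(chunks)
        self.modified |= set(chunks)

    def fsync(self, chunks, ranged_full_fsync):
        wanted = set(chunks)
        if self.full_sync:
            if ranged_full_fsync:
                # Reverted behavior: a full fsync only looks at the given range.
                to_log = self.dirty_in_transaction & wanted
            else:
                # A full fsync operates on the entire file.
                to_log = set(self.dirty_in_transaction)
            self.full_sync = False
        else:
            # Fast path: only the in-memory list of modified extents is used.
            to_log = self.modified & wanted
        self.logged |= to_log
        self.modified -= to_log


def unlogged_after_scenario(ranged_full_fsync):
    whole_file = range(4)
    inode = Inode(chunks_in_transaction=whole_file)  # steps 1-5
    inode.write([2])                                 # step 6: [512KiB, 768KiB)
    inode.fsync([2], ranged_full_fsync)              # step 7: ranged fsync
    inode.fsync(whole_file, ranged_full_fsync)       # step 8: regular fsync
    missing = inode.dirty_in_transaction - inode.logged
    return sorted(chunk * CHUNK_KIB for chunk in missing)


print(unlogged_after_scenario(True))   # [0, 256, 768]: chunk start offsets in KiB
print(unlogged_after_scenario(False))  # []
```

The three unlogged chunks in the first case correspond to the ranges [0, 512KiB) and [768KiB, 1MiB) named in step 8.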

Another scenario where this can happen is if we replace steps 2 to 4 with
cloning from another file into our test file, as that sets the 'full sync'
bit in our inode's flags and does not populate its list of modified extent
maps.

This was causing test case generic/457 to fail sporadically when using the
NO_HOLES feature, as it exercised this latter case where the inode has the
'full sync' bit set and has no extent maps in memory to represent the new
extents due to extent cloning.

Fix this by reverting commit 0a8068a3dd ("btrfs: make ranged full fsyncs
more efficient") since there is no easy way to work around it.

Fixes: 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08 19:10:52 +02:00
Filipe Manana
4fdb688c70 btrfs: fix lost i_size update after cloning inline extent
When not using the NO_HOLES feature we were not marking the destination's
file range as written after cloning an inline extent into it. This can
lead to a data loss if the current destination file size is smaller than
the source file's size.

Example:

  $ mkfs.btrfs -f -O ^no-holes /dev/sdc
  $ mount /dev/sdc /mnt

  $ echo "hello world" > /mnt/foo
  $ cp --reflink=always /mnt/foo /mnt/bar
  $ rm -f /mnt/foo
  $ umount /mnt

  $ mount /dev/sdc /mnt
  $ cat /mnt/bar
  $
  $ stat -c %s /mnt/bar
  0

  # -> the file is empty; since we deleted foo, the data is lost forever

Fix that by calling btrfs_inode_set_file_extent_range() after cloning an
inline extent.

A test case for fstests will follow soon.

Link: https://lore.kernel.org/linux-btrfs/20200404193846.GA432065@latitude/
Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Fixes: 9ddc959e80 ("btrfs: use the file extent tree infrastructure")
Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08 19:10:34 +02:00
Josef Bacik
4d4225fc22 btrfs: check commit root generation in should_ignore_root
Previously we would set the reloc root's last snapshot to transid - 1.
However there was a problem with doing this, and we changed it to
setting the last snapshot to the generation of the commit node of the fs
root.

This however broke should_ignore_root().  The assumption is that if we
are in a generation newer than when the reloc root was created, then we
would find the reloc root through normal backref lookups, and thus can
ignore any fs roots we find with an old enough reloc root.

Now that the last snapshot could be considerably further in the past
than before, we'd end up incorrectly ignoring an fs root.  Thus we'd
find no nodes for the bytenr we were searching for, and we'd fail to
relocate anything.  We'd loop through the relocate code again and see
that there was still used space in that block group, attempt to
relocate those bytenr's again, fail in the same way, and just loop like
this forever.  This is tricky in that we have to not modify the fs root
at all during this time, so we need to have a block group that has data
in this fs root that is not shared by any other root, which is why this
has been difficult to reproduce.

Fixes: 054570a1dc ("Btrfs: fix relocation incorrectly dropping data references")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-08 19:10:31 +02:00
Pavel Begunkov
9c280f9087 io_uring: don't read user-shared sqe flags twice
Don't re-read userspace-shared sqe->flags, it can be exploited.
sqe->flags are copied into req->flags in io_submit_sqe(), check them
there instead.
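
Why a second read is dangerous can be shown with a userspace sketch of the double-fetch pattern. The names below are hypothetical: `SharedSqe` imitates memory that another party can rewrite between reads, and `VALID_FLAGS` is an arbitrary mask chosen for the example, not an io_uring constant.

```python
class SharedSqe:
    """Imitates an sqe in shared memory: every read of .flags may return a
    different value, because userspace can rewrite it at any time."""

    def __init__(self, values):
        self._values = iter(values)

    @property
    def flags(self):
        return next(self._values)


VALID_FLAGS = 0x0F  # hypothetical set of accepted flag bits


def submit_double_fetch(sqe):
    if sqe.flags & ~VALID_FLAGS:   # first read: validation
        raise ValueError("invalid flags")
    return sqe.flags               # second read: may no longer be validated


def submit_single_fetch(sqe):
    flags = sqe.flags              # read once into a private copy
    if flags & ~VALID_FLAGS:
        raise ValueError("invalid flags")
    return flags                   # later checks use the copy only


# Userspace flips the flags from 0x01 to 0xF0 between the two reads.
print(hex(submit_double_fetch(SharedSqe([0x01, 0xF0]))))  # 0xf0 slips through
print(hex(submit_single_fetch(SharedSqe([0x01, 0xF0]))))  # 0x1
```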

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:51 -06:00
Pavel Begunkov
0553b8bda8 io_uring: remove req init from io_get_req()
io_get_req() does two different things: io_kiocb allocation and
initialisation. Move the init part out of it and rename it to
io_alloc_req(). It's simpler this way and also has better data
locality.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:28 -06:00
Pavel Begunkov
b1e50e549b io_uring: alloc req only after getting sqe
As io_get_sqe() is split into two stages (get and consume), get an sqe
before allocating the io_kiocb, so no free_req*() is needed for the failure
case, and inline back __io_req_do_free(), which has only one user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:16 -06:00
Pavel Begunkov
709b302fad io_uring: simplify io_get_sqring
Make io_get_sqring() care only about sqes themselves, not initialising
the io_kiocb. Also, split it into get + consume, that will be helpful in
the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:25:18 -06:00
Xiaoguang Wang
45097daea2 io_uring: do not always copy iovec in io_req_map_rw()
In io_read_prep() or io_write_prep(), struct io_async_rw's fast_iov is
passed to io_import_iovec() and then to io_req_map_rw(). If
io_import_iovec() ends up using that fast_iov as the valid iovec array,
io_req_map_rw() does not need to do the memcpy operation, because both
pointers are the same.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:23:42 -06:00
Jens Axboe
08a1d26eb8 io_uring: ensure openat sets O_LARGEFILE if needed
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the
OPENAT opcode. Dmitry reports that his test case that compares openat()
and IORING_OP_OPENAT sees failures on large files:

*** sync openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write succeeded

*** sync openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write succeeded

*** io_uring openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write failed: File too large

*** io_uring openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write failed: File too large

Ensure we set O_LARGEFILE, if force_o_largefile() is true.
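
The failing offset in the log above, 4294967296, is 2^32 bytes (4 GiB). A file opened without O_LARGEFILE is limited to 2^31 - 1 bytes, and writes at or beyond that limit fail with EFBIG ("File too large"). A minimal sketch of that check, assuming a simplified model in which `check_write` is a hypothetical helper and not a kernel function:

```python
import errno

MAX_NON_LFS = (1 << 31) - 1  # size limit for files opened without O_LARGEFILE


def check_write(offset, largefile):
    """Return 0 if a write at this offset may proceed, else a negative errno."""
    if not largefile and offset >= MAX_NON_LFS:
        return -errno.EFBIG
    return 0


assert 4294967296 == 1 << 32

for offset in (0, 4294967296):
    print(offset,
          check_write(offset, largefile=False),
          check_write(offset, largefile=True))
```

This matches the pattern in the report: offset 0 always succeeds, while the 4 GiB offset fails only for files opened through the path that did not set O_LARGEFILE.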

Cc: stable@vger.kernel.org # v5.6
Fixes: 15b71abe7b ("io_uring: add support for IORING_OP_OPENAT")
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:20:54 -06:00
Mike Marshall
0e393a9a8f orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush
Christoph Hellwig noticed that we were doing some unnecessary
work in orangefs_flush:

  orangefs_flush just writes out data on every close(2) call.  There is
  no need to change anything about the dirty state, especially as
  orangefs doesn't treat I_DIRTY_TIMES special in any way.  The code
  seems to come from partially open coding vfs_fsync.

He sent in a patch with the above commit message and also a
patch that was a reversion of another Orangefs patch I had
sent upstream a while ago. I had to fix his reversion patch
so that it would compile which caused his "don't mess with
I_DIRTY_TIMES" patch to fail to apply. So here I have just
remade his patch and applied it after the fixed reversion patch.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-04-08 09:39:11 -04:00
Mike Marshall
ec95f1dedc orangefs: get rid of knob code...
Christoph Hellwig sent in a reversion of "orangefs: remember count
when reading." because:

  ->read_iter calls can race with each other and one or
  more ->flush calls. Remove the scheme to store the read
  count in the file private data as it is completely racy and
  can cause use after free or double free conditions

Christoph's reversion left Orangefs unable to compile, let alone work. I
added a patch that fixed that, but Intel's kbuild test robot pointed
out that sending Christoph's patch upstream followed by mine would
break bisection because of the compile failure. So I have combined
the reversion with my patch. Here's the commit message that was in
my patch:

  Logically, optimal Orangefs "pages" are 4 megabytes. Reading
  large Orangefs files 4096 bytes at a time is like trying to
  kick a dead whale down the beach. Before Christoph's "Revert
  orangefs: remember count when reading." I tried to give users
  a knob whereby they could, for example, use "count" in
  read(2) or bs with dd(1) to get whatever they considered an
  appropriate amount of bytes at a time from Orangefs and fill
  as many page cache pages as they could at once.

  Without the racy code that Christoph reverted Orangefs won't
  even compile, much less work. So this replaces the logic that
  used the private file data that Christoph reverted with
  a static number of bytes to read from Orangefs.

  I ran tests like the following to determine what a
  reasonable static number of bytes might be:

  dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304
  dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152
  dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576
                            .
                            .
                            .
  dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128

  Reads seem faster using the static number, so my "knob code"
  wasn't just racy, it wasn't even a good idea...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Reported-by: kbuild test robot <lkp@intel.com>
2020-04-08 09:38:51 -04:00
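The dd(1) runs quoted in ec95f1dedc above vary only the size of each read(2) request. A rough C equivalent of one such run is sketched here; the function name read_in_blocks is an assumption made for illustration, and unlike dd with count= it simply reads to end of file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `path` to EOF using `bs`-byte read(2)
 * requests, the way "dd bs=<bs>" would.
 * Returns the number of bytes read or a negative errno value.
 */
long long read_in_blocks(const char *path, size_t bs)
{
	long long total = 0;
	char *buf;
	int fd;

	if (bs == 0)
		return -EINVAL;

	buf = malloc(bs);
	if (!buf)
		return -ENOMEM;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		int saved = errno;

		free(buf);
		return -saved;
	}

	for (;;) {
		ssize_t n = read(fd, buf, bs);

		if (n < 0) {
			if (errno == EINTR)
				continue;       /* interrupted, retry */
			total = -errno;
			break;
		}
		if (n == 0)
			break;                  /* end of file */
		total += n;
	}

	close(fd);
	free(buf);
	return total;
}
```

Timing read_in_blocks() on the same file with bs set to 4194304, 1048576, 4096 and so on gives the kind of comparison described in the commit message.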
Linus Torvalds
63bef48fd6 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a lot more of MM, quite a bit more yet to come: (memcg, pagemap,
   vmalloc, pagealloc, migration, thp, ksm, madvise, virtio,
   userfaultfd, memory-hotplug, shmem, rmap, zswap, zsmalloc, cleanups)

 - various other subsystems (procfs, misc, MAINTAINERS, bitops, lib,
   checkpatch, epoll, binfmt, kallsyms, reiserfs, kmod, gcov, kconfig,
   ubsan, fault-injection, ipc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (158 commits)
  ipc/shm.c: make compat_ksys_shmctl() static
  ipc/mqueue.c: fix a brace coding style issue
  lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability"
  ubsan: include bug type in report header
  kasan: unset panic_on_warn before calling panic()
  ubsan: check panic_on_warn
  drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks
  ubsan: split "bounds" checker from other options
  ubsan: add trap instrumentation option
  init/Kconfig: clean up ANON_INODES and old IO schedulers options
  kernel/gcov/fs.c: replace zero-length array with flexible-array member
  gcov: gcc_3_4: replace zero-length array with flexible-array member
  gcov: gcc_4_7: replace zero-length array with flexible-array member
  kernel/kmod.c: fix a typo "assuems" -> "assumes"
  reiserfs: clean up several indentation issues
  kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
  samples/hw_breakpoint: drop use of kallsyms_lookup_name()
  samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes
  fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path
  fs/binfmt_elf.c: allocate less for static executable
  ...
2020-04-07 14:11:54 -07:00
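Several patches in the merge above replace zero-length arrays with flexible-array members. A small sketch of that pattern in plain C follows; struct record and record_new are hypothetical names used only for illustration and do not come from the gcov code.

```c
#include <stdlib.h>
#include <string.h>

/* Old style would be "unsigned char data[0];" (a GNU extension). */
struct record {
	size_t len;
	unsigned char data[];   /* C99 flexible-array member, must be last */
};

/*
 * Hypothetical constructor: allocate the header plus `len` payload
 * bytes in one block and copy the payload in.
 * Returns NULL on allocation failure; release with free().
 */
struct record *record_new(const void *src, size_t len)
{
	struct record *r = malloc(sizeof(*r) + len);

	if (!r)
		return NULL;
	r->len = len;
	if (len)
		memcpy(r->data, src, len);
	return r;
}
```

With the flexible-array form the compiler can diagnose misuse such as placing the array in the middle of a struct, which the zero-length form does not catch.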
Linus Torvalds
04de788e61 Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - Fix a page leak in nfs_destroy_unlinked_subrequests()

   - Fix use-after-free issues in nfs_pageio_add_request()

   - Fix new mount code constant_table array definitions

   - finish_automount() requires us to hold 2 refs to the mount record

  Features:
   - Improve the accuracy of telldir/seekdir by using 64-bit cookies
     when possible.

   - Allow one RDMA active connection and several zombie connections to
     prevent blocking if the remote server is unresponsive.

   - Limit the size of the NFS access cache by default

   - Reduce the number of references to credentials that are taken by
     NFS

   - pNFS files and flexfiles drivers now support per-layout segment
     COMMIT lists.

   - Enable partial-file layout segments in the pNFS/flexfiles driver.

   - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type

   - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
     the DS using the layouterror mechanism.

  Bugfixes and cleanups:
   - SUNRPC: Fix krb5p regressions

   - Don't specify NFS version in "UDP not supported" error

   - nfsroot: set tcp as the default transport protocol

   - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()

   - alloc_nfs_open_context() must use the file cred when available

   - Fix locking when dereferencing the delegation cred

   - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails

   - Various clean ups of the NFS O_DIRECT commit code

   - Clean up RDMA connect/disconnect

   - Replace zero-length arrays with C99-style flexible arrays"

* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
  NFS: Clean up process of marking inode stale.
  SUNRPC: Don't start a timer on an already queued rpc task
  NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
  NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
  NFS: Beware when dereferencing the delegation cred
  NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
  NFS: finish_automount() requires us to hold 2 refs to the mount record
  NFS: Fix a few constant_table array definitions
  NFS: Try to join page groups before an O_DIRECT retransmission
  NFS: Refactor nfs_lock_and_join_requests()
  NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
  NFS: Clean up nfs_lock_and_join_requests()
  NFS: Remove the redundant function nfs_pgio_has_mirroring()
  NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
  NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
  NFS: Fix use-after-free issues in nfs_pageio_add_request()
  NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
  NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
  NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
  pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
  ...
2020-04-07 13:51:39 -07:00
Linus Torvalds
f40f31cadc Merge tag 'f2fs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've mainly focused on fixing bugs and addressing
  issues in recently introduced compression support.

  Enhancement:
   - add zstd support, and set LZ4 by default
   - add ioctl() to show # of compressed blocks
   - show mount time in debugfs
   - replace rwsem with spinlock
   - avoid lock contention in DIO reads

  Some major bug fixes wrt compression:
   - compressed block count
   - memory access and leak
   - remove obsolete fields
   - flag controls

  Other bug fixes and clean ups:
   - fix overflow when handling .flags in inode_info
   - fix SPO issue during resize FS flow
   - fix compression with fsverity enabled
   - potential deadlock when writing compressed pages
   - show missing mount options"

* tag 'f2fs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits)
  f2fs: keep inline_data when compression conversion
  f2fs: fix to disable compression on directory
  f2fs: add missing CONFIG_F2FS_FS_COMPRESSION
  f2fs: switch discard_policy.timeout to bool type
  f2fs: fix to verify tpage before releasing in f2fs_free_dic()
  f2fs: show compression in statx
  f2fs: clean up dic->tpages assignment
  f2fs: compress: support zstd compress algorithm
  f2fs: compress: add .{init,destroy}_decompress_ctx callback
  f2fs: compress: fix to call missing destroy_compress_ctx()
  f2fs: change default compression algorithm
  f2fs: clean up {cic,dic}.ref handling
  f2fs: fix to use f2fs_readpage_limit() in f2fs_read_multi_pages()
  f2fs: xattr.h: Make stub helpers inline
  f2fs: fix to avoid double unlock
  f2fs: fix potential .flags overflow on 32bit architecture
  f2fs: fix NULL pointer dereference in f2fs_verity_work()
  f2fs: fix to clear PG_error if fsverity failed
  f2fs: don't call fscrypt_get_encryption_info() explicitly in f2fs_tmpfile()
  f2fs: don't trigger data flush in foreground operation
  ...
2020-04-07 13:48:26 -07:00
Linus Torvalds
763dede1b2 Merge tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:

 - Fix for memory leaks around UBIFS orphan handling

 - Fix for memory leaks around UBI fastmap

 - Remove zero-length array from ubi-media.h

 - Fix for TNC lookup in UBIFS orphan code

* tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubi: ubi-media.h: Replace zero-length array with flexible-array member
  ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len
  ubi: fastmap: Only produce the initial anchor PEB when fastmap is used
  ubi: fastmap: Free unused fastmap anchor peb during detach
  ubifs: ubifs_add_orphan: Fix a memory leak bug
  ubifs: ubifs_jnl_write_inode: Fix a memory leak bug
  ubifs: Fix ubifs_tnc_lookup() usage in do_kill_orphans()
2020-04-07 12:40:56 -07:00
Linus Torvalds
762a9f2f01 Merge tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:

 - New mode for time travel, external via virtio

 - Fixes for ubd to make sure no requests can get lost

 - Fixes for vector networking

 - Allow CONFIG_STATIC_LINK only when possible

 - Minor cleanups and fixes

* tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: Remove some unnecessary NULL checks in vector_user.c
  um: vector: Avoid NULL ptr deference if transport is unset
  um: Make CONFIG_STATIC_LINK actually static
  um: Implement cpu_relax() as ndelay(1) for time-travel
  um: Implement ndelay/udelay in time-travel mode
  um: Implement time-travel=ext
  um: virtio: Implement VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS
  um: time-travel: Rewrite as an event scheduler
  um: Move timer-internal.h to non-shared
  hostfs: Use kasprintf() instead of fixed buffer formatting
  um: falloc.h needs to be directly included for older libc
  um: ubd: Retry buffer read on any kind of error
  um: ubd: Prevent buffer overrun on command completion
  um: Fix overlapping ELF segments when statically linked
  um: Delete never executed timer
  um: Don't overwrite ethtool driver version
  um: Fix len of file in create_pid_file
  um: Don't use console_drivers directly
  um: Cleanup CONFIG_IOSCHED_CFQ
2020-04-07 12:36:09 -07:00
Steve French
2bcb4fd6ba smb3: smbdirect support can be configured by default
smbdirect support (SMB3 over RDMA) should be enabled by
default in many configurations.

It is not experimental and is stable enough and has enough
performance benefits to recommend that it be configured by
default.  Change the "If unsure N" to "If unsure Y" in
the description of the configuration parameter.

Acked-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-04-07 13:39:00 -05:00
Colin Ian King
5404e7e0ac reiserfs: clean up several indentation issues
There are several places where code is indented incorrectly. Fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200325135018.113431-1-colin.king@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Alexey Dobriyan
aa0d1564b1 fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path
Static executables don't need to free a NULL pointer.

It doesn't really matter, because a static executable is not a common
scenario, but do it anyway out of pedantry.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185330.GA4933@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Alexey Dobriyan
0693ffebcf fs/binfmt_elf.c: allocate less for static executable
PT_INTERP ELF header can be spared if executable is static.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219185012.GB4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Alexey Dobriyan
c69bcc932e fs/binfmt_elf.c: delete "loc" variable
"loc" variable became just a wrapper for PT_INTERP ELF header after main
ELF header was moved to "bprm->buf".  Delete it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200219184847.GA4871@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Jason Baron
efcdd350d1 fs/epoll: make nesting accounting safe for -rt kernel
Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
ep_poll_safewake() can take several non-raw spinlocks after disabling
interrupts.  Since a spinlock can block in the -rt kernel, we can't take a
spinlock after disabling interrupts.  So let's re-work how we determine
the nesting level such that it plays nicely with the -rt kernel.

Let's introduce a 'nests' field in struct eventpoll that records the
current nesting level during ep_poll_callback().  Then, if we nest again
we can find the previous struct eventpoll that we were called from and
increase our count by 1.  The 'nests' field is protected by
ep->poll_wait.lock.

I've also moved the visited field to reduce the size of struct eventpoll
from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which
is typical for a production config.

Reported-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Matthew Wilcox (Oracle)
fad955009c proc: inline m_next_vma into m_next
It's clearer to just put this inline.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
b829a0f0f2 seq_file: remove m->version
The process maps file was the only user of version (introduced back in
2005).  Now that it uses ppos instead, we can remove it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
4781f2c3ab proc: use ppos instead of m->version
The ppos is a private cursor, just like m->version.  Use the canonical
cursor, not a special one.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
c2e88d22e8 proc: remove m_cache_vma
Instead of setting m->version in the show method, set it in m_next(),
where it should be.  Also remove the fallback code for failing to find a
vma, or version being zero.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Matthew Wilcox (Oracle)
d07ded611e proc: inline vma_stop into m_stop
Instead of calling vma_stop() from m_start() and m_next(), do its work
in m_stop().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Alexey Dobriyan
5c5ab9714c proc: speed up /proc/*/statm
top(1) reads all /proc/*/statm files but kernel threads will always have
zeros.  Print those zeroes directly without going through
seq_put_decimal_ull().

Reading /proc/2/statm (which is kthreadd) gets about 3% faster.

My system has more kernel threads than normal processes after booting KDE.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
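The idea behind 5c5ab9714c above is to emit a constant string when every field is known to be zero, instead of converting seven numbers to decimal. A userspace sketch of that shortcut follows. The kernel works with seq_file helpers rather than a caller-supplied buffer, so format_statm and its NULL-means-kernel-thread convention are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

static const char zero_statm[] = "0 0 0 0 0 0 0\n";

/*
 * Hypothetical helper: format the seven statm fields into `buf`.
 * `vals == NULL` stands for a kernel thread (no address space), in
 * which case the precomputed all-zero line is copied directly.
 * Returns the string length written, or 0 if `buf` is too small.
 */
size_t format_statm(char *buf, size_t len, const unsigned long *vals)
{
	int n;

	if (!vals) {
		if (len < sizeof(zero_statm))
			return 0;
		memcpy(buf, zero_statm, sizeof(zero_statm));
		return sizeof(zero_statm) - 1;   /* fast path: no formatting */
	}

	n = snprintf(buf, len, "%lu %lu %lu %lu %lu %lu %lu\n",
		     vals[0], vals[1], vals[2], vals[3],
		     vals[4], vals[5], vals[6]);
	if (n < 0 || (size_t)n >= len)
		return 0;
	return (size_t)n;
}
```

Both paths produce the same output format, so readers such as top(1) cannot tell which one was taken.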
Alexey Dobriyan
d919b33daf proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exists, we can start putting stuff there that
could not fly with the VFS "struct file_operations"...

Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
reliable in the event of disappearing /proc entries, which usually happens
when a module is being removed.  Files like /proc/cpuinfo which never
disappear simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

	/proc/cpuinfo
	/proc/kmsg
	/proc/modules
	/proc/slabinfo
	/proc/stat
	/proc/sysvipc/*
	/proc/swaps

More will come once I figure out a foolproof way to prevent out-of-tree
module authors from marking their stuff "permanent" for performance
reasons when it is not.

This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".

	N	R	t, s (before)	t, s (after)
	-----------------------------------------------------
	64	4096	1.582458	1.530502	-3.2%
	256	4096	6.371926	6.125168	-3.9%
	1024	4096	25.64888	24.47528	-4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
	cpu_set_t m;
	CPU_ZERO(&m);
	CPU_SET(n, &m);
	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
	glue(n % NR_CPUS);

	while (*(volatile int *)&xxx == 0) {
	}

	for (int i = 0; i < R; i++) {
		int fd = open(filename, O_RDONLY);
		char buf[4096];
		ssize_t rv = read(fd, buf, sizeof(buf));
		asm volatile ("" :: "g" (rv));
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	if (argc < 4) {
		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
		return 1;
	}

	N = atoi(argv[1]);
	filename = argv[2];
	R = atoi(argv[3]);

	for (int i = 0; i < NR_CPUS; i++) {
		if (glue(i) == 0)
			break;
	}

	std::vector<std::thread> T;
	T.reserve(N);
	for (int i = 0; i < N; i++) {
		T.emplace_back(f, i);
	}

	auto t0 = std::chrono::system_clock::now();
	{
		*(volatile int *)&xxx = 1;
		for (auto& t: T) {
			t.join();
		}
	}
	auto t1 = std::chrono::system_clock::now();
	std::chrono::duration<double> dt = t1 - t0;
	std::cout << dt.count() << '\n';

	return 0;
}

P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.

[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Jules Irenge
904f394e2e fs/proc/inode.c: annotate close_pdeo() for sparse
Fix sparse locking imbalance warning:

	warning: context imbalance in close_pdeo() - unexpected unlock

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Peter Xu
14819305e0 userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally
Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and if all the checks passed.  Then when the user
registers regions with shmem/hugetlbfs we won't expose the new ioctl to
them.  Even with a completely anonymous memory range, we'll only expose the
new WP ioctl bit if the register mode has MODE_WP.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:40 -07:00
Peter Xu
23080e2783 userfaultfd: wp: don't wake up when doing write protect
It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region.  Only wake up when resolving a write
protected page fault.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Andrea Arcangeli
63b2d4174c userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
Introduce the new uffd-wp APIs for userspace.

Firstly, we'll allow UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults but also
track page data changes along the way.

Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page level
write protection tracking.  Note that the memory region must first be
registered with UFFDIO_REGISTER_MODE_WP.

[peterx@redhat.com: write up the commit message]
[peterx@redhat.com: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
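The two steps described in 63b2d4174c above (register a range with UFFDIO_REGISTER_MODE_WP, then call UFFDIO_WRITEPROTECT on it) can be sketched from userspace as follows. This is a minimal sketch, assuming kernel headers from v5.7 or later and a system that permits the userfaultfd(2) syscall; the function name uffd_wp_one_page is hypothetical, and no fault-handling thread is shown because the page is never written while protected.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: write-protect one anonymous page through
 * userfaultfd and then remove the protection again.
 * Returns 0 on success or a negative errno value (for example -EPERM
 * or -EINVAL where uffd-wp is not available).
 */
int uffd_wp_one_page(void)
{
	struct uffdio_api api;
	struct uffdio_register reg;
	struct uffdio_writeprotect wp;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *area;
	int ret = 0;
	int uffd;

	uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -errno;

	area = mmap(NULL, page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		ret = -errno;
		goto out_close;
	}
	memset(area, 0, page);          /* populate the page before protecting */

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &api) < 0) {
		ret = -errno;
		goto out_unmap;
	}

	/* Step 1: register the range for write-protect tracking. */
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)area;
	reg.range.len = page;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
		ret = -errno;
		goto out_unmap;
	}
	/* The kernel reports which ioctls are usable on this range. */
	if (!(reg.ioctls & (1ULL << _UFFDIO_WRITEPROTECT))) {
		ret = -ENOTSUP;
		goto out_unmap;
	}

	/* Step 2: write-protect the page. */
	memset(&wp, 0, sizeof(wp));
	wp.range.start = (unsigned long)area;
	wp.range.len = page;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0) {
		ret = -errno;
		goto out_unmap;
	}

	/* Mode 0 removes the protection again (and wakes any waiters). */
	wp.mode = 0;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
		ret = -errno;

out_unmap:
	munmap(area, page);
out_close:
	close(uffd);
	return ret;
}
```

A real tracker would also read struct uffd_msg events from the descriptor in another thread and resolve each write fault by clearing the protection on the faulting page.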
Andrea Arcangeli
72981e0e7b userfaultfd: wp: add UFFDIO_COPY_MODE_WP
This allows UFFDIO_COPY to map pages write-protected.

[peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
 around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
 commit messages]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00