xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
As a prelude to implementing asynchronous fileserver operations in the afs
filesystem, rename struct afs_fs_cursor to afs_operation.
This struct is going to form the core of the operation management and is
going to acquire more members in later.
Signed-off-by: David Howells <dhowells@redhat.com>
Set a flag in the call struct to indicate an unmarshalling error rather
than return and handle an error from the decoding of file statuses. This
flag is checked on a successful return from the delivery function.
Signed-off-by: David Howells <dhowells@redhat.com>
afs_vol_interest objects represent the volume IDs currently being accessed
from a fileserver. These hold lists of afs_cb_interest objects that
repesent the superblocks using that volume ID on that server.
When a callback notification from the server telling of a modification by
another client arrives, the volume ID specified in the notification is
looked up in the server's afs_vol_interest list. Through the
afs_cb_interest list, the relevant superblocks can be iterated over and the
specific inode looked up and marked in each one.
Make the following efficiency improvements:
(1) Hold rcu_read_lock() over the entire processing rather than locking it
each time.
(2) Do all the callbacks for each vid together rather than individually.
Each volume then only needs to be looked up once.
(3) afs_vol_interest objects are now stored in an rb_tree rather than a
flat list to reduce the lookup step count.
(4) afs_vol_interest lookup is now done with RCU, but because it's in an
rb_tree which may rotate under us, a seqlock is used so that if it
changes during the walk, we repeat the walk with a lock held.
With this and the preceding patch which adds RCU-based lookups in the inode
cache, target volumes/vnodes can be taken without the need to take any
locks, except on the target itself.
Signed-off-by: David Howells <dhowells@redhat.com>
Show more information in /proc/net/afs/servers to make it easier to see
what's going on with the server probing.
Signed-off-by: David Howells <dhowells@redhat.com>
When an AFS client accesses a file, it receives a limited-duration callback
promise that the server will notify it if another client changes a file.
This callback duration can be a few hours in length.
If a client mounts a volume and then an application prevents it from being
unmounted, say by chdir'ing into it, but then does nothing for some time,
the rxrpc_peer record will expire and rxrpc-level keepalive will cease.
If there is NAT or a firewall between the client and the server, the route
back for the server may close after a comparatively short duration, meaning
that attempts by the server to notify the client may then bounce.
The client, however, may (so far as it knows) still have a valid unexpired
promise and will then rely on its cached data and will not see changes made
on the server by a third party until it incidentally rechecks the status or
the promise needs renewal.
To deal with this, the client needs to regularly probe the server. This
has two effects: firstly, it keeps a route open back for the server, and
secondly, it causes the server to disgorge any notifications that got
queued up because they couldn't be sent.
Fix this by adding a mechanism to emit regular probes.
Two levels of probing are made available: Under normal circumstances the
'slow' queue will be used for a fileserver - this just probes the preferred
address once every 5 mins or so; however, if server fails to respond to any
probes, the server will shift to the 'fast' queue from which all its
interfaces will be probed every 30s. When it finally responds, the record
will switch back to the slow queue.
Further notes:
(1) Probing is now no longer driven from the fileserver rotation
algorithm.
(2) Probes are dispatched to all interfaces on a fileserver when that an
afs_server object is set up to record it.
(3) The afs_server object is removed from the probe queues when we start
to probe it. afs_is_probing_server() returns true if it's not listed
- ie. it's undergoing probing.
(4) The afs_server object is added back on to the probe queue when the
final outstanding probe completes, but the probed_at time is set when
we're about to launch a probe so that it's not dependent on the probe
duration.
(5) The timer and the work item added for this must be handed a count on
net->servers_outstanding, which they hand on or release. This makes
sure that network namespace cleanup waits for them.
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Split the usage count on the afs_server struct to have an active count that
registers who's actually using it separately from the reference count on
the object.
This allows a future patch to dispatch polling probes without advancing the
"unuse" time into the future each time we emit a probe, which would
otherwise prevent unused server records from expiring.
Included in this:
(1) The latter part of afs_destroy_server() in which the RCU destruction
of afs_server objects is invoked and the outstanding server count is
decremented is split out into __afs_put_server().
(2) afs_put_server() now calls __afs_put_server() rather then setting the
management timer.
(3) The calls begun by afs_fs_give_up_all_callbacks() and
afs_fs_get_capabilities() can now take a ref on the server record, so
afs_destroy_server() can just drop its ref and needn't wait for the
completion of these calls. They'll put the ref when they're done.
(4) Because of (3), afs_fs_probe_done() no longer needs to wake up
afs_destroy_server() with server->probe_outstanding.
(5) afs_gc_servers can be simplified. It only needs to check if
server->active is 0 rather than playing games with the refcount.
(6) afs_manage_servers() can propose a server for gc if usage == 0 rather
than if ref == 1. The gc is effected by (5).
Signed-off-by: David Howells <dhowells@redhat.com>
The U-version VLDB volume record retrieved by the VL.GetEntryByNameU rpc op
carries a change counter (the serverUnique field) for each fileserver
listed in the record as backing that volume. This is incremented whenever
the registration details for a fileserver change (such as its address
list). Note that the same value will be seen in all UVLDB records that
refer to that fileserver.
This should be checked before calling the VL server to re-query the address
list for a fileserver. If it's the same, there's no point doing the query.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
When a lookup is done in an AFS directory, the filesystem will speculate
and fetch up to 49 other statuses for files in the same directory and fetch
those as well, turning them into inodes or updating inodes that already
exist.
However, occasionally, a callback break might go missing due to NAT timing
out, but the afs filesystem doesn't then realise that the directory is not
up to date.
Alleviate this by using one of the status slots to check the directory in
which the lookup is being done.
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Suggested-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.
The main thing this requires is some RCU annotation on the list
manipulation operations. Inodes are already freed by RCU in most cases.
Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.
There are at least three instances where this can be of use:
(1) Testing whether the inode number iunique() is going to return is
currently unique (the iunique_lock is still held).
(2) Ext4 date stamp updating.
(3) AFS callback breaking.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org
pstore/blk is similar to pstore/ram, but uses a block device as the
storage rather than persistent ram.
The pstore/blk backend solves two common use-cases that used to preclude
using pstore/ram:
- not all devices have a battery that could be used to persist
regular RAM across power failures.
- most embedded intelligent equipment have no persistent ram, which
increases costs, instead preferring cheaper solutions, like block
devices.
pstore/blk provides separate configurations for the end user and for the
block drivers. User configuration determines how pstore/blk operates, such
as record sizes, max kmsg dump reasons, etc. These can be set by Kconfig
and/or module parameters, but module parameter have priority over Kconfig.
Driver configuration covers all the details about the target block device,
such as total size of the device and how to perform read/write operations.
These are provided by block drivers, calling pstore_register_blkdev(),
including an optional panic_write callback used to bypass regular IO
APIs in an effort to avoid potentially destabilized kernel code during
a panic.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-3-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Now that pstore_register() can correctly pass max_reason to the kmesg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.
The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are now automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.
For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.
Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new member to struct pstore_info for passing information about
kmesg dump maximum reason. This allows a finer control of what kmesg
dumps are sent to pstore storage backends.
Those backends that do not explicitly set this field (keeping it equal to
0), get the default behavior: store only Oopses and Panics, or everything
if the printk.always_kmsg_dump boot param is set.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
In order to more cleanly pass around backend names, make the "name" member
const. This means the module param needs to be dynamic (technically, it
was before, so this actually cleans up a minor memory leak if a backend
was specified and then gets unloaded.)
Link: https://lore.kernel.org/lkml/20200510202436.63222-3-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
The CON_ENABLED flag gets cleared during unregister_console(), so make
sure we already reset the console flags before calling register_console(),
otherwise unloading and reloading a pstore backend will not restart
console logging.
Signed-off-by: Kees Cook <keescook@chromium.org>
The pstore.update_ms value was being disabled during pstore_unregister(),
which would cause any prior value to go unnoticed on the next
pstore_register(). Instead, just let del_timer() stop the timer, which
was always sufficient. This additionally refactors the timer reset code
and allows the timer to be enabled if the module parameter is changed
away from the default.
Link: https://lore.kernel.org/lkml/20200506152114.50375-10-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Under heavy fsstress, we may triggle panic while issuing discard,
because __check_sit_bitmap() detects that discard command may earse
valid data blocks, the root cause is as below race stack described,
since we removed lock when flushing quota data, quota data writeback
may race with write_checkpoint(), so that it causes inconsistency in
between cached discard entry and segment bitmap.
- f2fs_write_checkpoint
- block_operations
- set_sbi_flag(sbi, SBI_QUOTA_SKIP_FLUSH)
- f2fs_flush_sit_entries
- add_discard_addrs
- __set_bit_le(i, (void *)de->discard_map);
- f2fs_write_data_pages
- f2fs_write_single_data_page
: inode is quota one, cp_rwsem won't be locked
- f2fs_do_write_data_page
- f2fs_allocate_data_block
- f2fs_wait_discard_bio
: discard entry has not been added yet.
- update_sit_entry
- f2fs_clear_prefree_segments
- f2fs_issue_discard
: add discard entry
In order to fix this, this patch uses node_write to serialize
f2fs_allocate_data_block and checkpoint.
Fixes: 435cbab95e ("f2fs: fix quota_sync failure due to f2fs_lock_op")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Overflowed requests in io_uring_cancel_files() should be shed only of
inflight and overflowed refs. All other left references are owned by
someone else.
If refcount_sub_and_test() fails, it will go further and put put extra
ref, don't do that. Also, don't need to do io_wq_cancel_work()
for overflowed reqs, they will be let go shortly anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Offset timeouts wait not for sqe->off non-timeout CQEs, but rather
sqe->off + number of prior inflight requests. Wait exactly for
sqe->off non-timeout completions
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Separate flushing offset timeouts io_commit_cqring() by moving it into a
helper. Just a preparation, makes following patches clearer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because of the separation of FS_XFLAG_DAX from S_DAX and the delayed
setting of S_DAX, data invalidation no longer needs to happen when
FS_XFLAG_DAX is changed.
Change xfs_ioctl_setattr_dax_invalidate() to be
xfs_ioctl_dax_check_set_cache() and alter the code to reflect the new
functionality.
Furthermore, we no longer need the locking so we remove the join_flags
logic.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical. The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.
Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.
So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.
While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.
Change the use of xfs_inode_supports_dax() to reflect only if the inode
and underlying storage support dax.
Add a new function xfs_inode_should_enable_dax() which reflects if the
inode should be enabled for DAX.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already
handles initialization of i_rwsem.
Doing so again is unneeded.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the computation of creds from prepare_binfmt into begin_new_exec
so that the creds need only be computed once. This is just code
reorganization no semantic changes of any kind are made.
Moving the computation is safe. I have looked through the kernel and
verified none of the binfmts look at bprm->cred directly, and that
there are no helpers that look at bprm->cred indirectly. Which means
that it is not a problem to compute the bprm->cred later in the
execution flow as it is not used until it becomes current->cred.
A new function bprm_creds_from_file is added to contain the work that
needs to be done. bprm_creds_from_file first computes which file
bprm->executable or most likely bprm->file that the bprm->creds
will be computed from.
The funciton bprm_fill_uid is updated to receive the file instead of
accessing bprm->file. The now unnecessary work needed to reset the
bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
A small comment to document that bprm_fill_uid now only deals with the
work to handle suid and sgid files. The default case is already
heandled by prepare_exec_creds.
The function security_bprm_repopulate_creds is renamed
security_bprm_creds_from_file and now is explicitly passed the file
from which to compute the creds. The documentation of the
bprm_creds_from_file security hook is updated to explain when the hook
is called and what it needs to do. The file is passed from
cap_bprm_creds_from_file into get_file_caps so that the caps are
computed for the appropriate file. The now unnecessary work in
cap_bprm_creds_from_file to reset the ambient capabilites has been
removed. A small comment to document that the work of
cap_bprm_creds_from_file is to read capabilities from the files
secureity attribute and derive capabilities from the fact the
user had uid 0 has been added.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There is a small bug in the code that recomputes parts of bprm->cred
for every bprm->file. The code never recomputes the part of
clear_dangerous_personality_flags it is responsible for.
Which means that in practice if someone creates a sgid script
the interpreter will not be able to use any of:
READ_IMPLIES_EXEC
ADDR_NO_RANDOMIZE
ADDR_COMPAT_LAYOUT
MMAP_PAGE_ZERO.
This accentially clearing of personality flags probably does
not matter in practice because no one has complained
but it does make the code more difficult to understand.
Further remaining bug compatible prevents the recomputation from being
removed and replaced by simply computing bprm->cred once from the
final bprm->file.
Making this change removes the last behavior difference between
computing bprm->creds from the final file and recomputing
bprm->cred several times. Which allows this behavior change
to be justified for it's own reasons, and for any but hunts
looking into why the behavior changed to wind up here instead
of in the code that will follow that computes bprm->cred
from the final bprm->file.
This small logic bug appears to have existed since the code
started clearing dangerous personality bits.
History Tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 1bb0fa189c6a ("[PATCH] NX: clean up legacy binary support")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull ceph fixes from Ilya Dryomov:
"Cache tiering and cap handling fixups, both marked for stable"
* tag 'ceph-for-5.7-rc8' of git://github.com/ceph/ceph-client:
ceph: flush release queue when handling caps for unknown inode
libceph: ignore pool overlay and cache logic on redirects