The pstorefs internal list lock doesn't need to be a spinlock and will
create problems when trying to access the list in the subsequent patch
that will walk the pstorefs records during pstore_unregister(). Change
this to a mutex to avoid may_sleep() warnings when unregistering devices.
Link: https://lore.kernel.org/lkml/20200506152114.50375-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently pstore can only have a single backend attached at a time, and it
tracks the active backend via "psinfo", under a lock. The locking for this
does not need to be a spinlock, and in order to avoid may_sleep() issues
during future changes to pstore_unregister(), switch to a mutex instead.
Link: https://lore.kernel.org/lkml/20200506152114.50375-4-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
There is no reason to be doing a module get/put in pstore_register(),
since the module calling pstore_register() cannot be unloaded since it
hasn't finished its initialization. Remove it so there is no confusion
about how registration ordering works.
Link: https://lore.kernel.org/lkml/20200506152114.50375-2-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
We are not guaranteed that the credential will remain pinned.
Fixes: 6129650720 ("NFSv4: Avoid referencing the cred unnecessarily during NFSv4 I/O")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
(1) The reorganisation of bmap() use accidentally caused the return value
of cachefiles_read_or_alloc_pages() to get corrupted.
(2) The NFS superblock index key accidentally got changed to include a
number of kernel pointers - meaning that the key isn't matchable after
a reboot.
(3) A redundant check in nfs_fscache_get_super_cookie().
(4) The NFS change_attr sometimes set in the auxiliary data for the
caching of an file and sometimes not, which causes the cache to get
discarded when it shouldn't.
(5) There's a race between cachefiles_read_waiter() and
cachefiles_read_copier() that causes an occasional assertion failure.
We add the new state to the nfsi->open_states list, making it
potentially visible to other threads, before we've finished initializing
it.
That wasn't a problem when all the readers were also taking the i_lock
(as we do here), but since we switched to RCU, there's now a possibility
that a reader could see the partially initialized state.
Symptoms observed were a crash when another thread called
nfs4_get_valid_delegation() on a NULL inode, resulting in an oops like:
BUG: unable to handle page fault for address: ffffffffffffffb0 ...
RIP: 0010:nfs4_get_valid_delegation+0x6/0x30 [nfsv4] ...
Call Trace:
nfs4_open_prepare+0x80/0x1c0 [nfsv4]
__rpc_execute+0x75/0x390 [sunrpc]
? finish_task_switch+0x75/0x260
rpc_async_schedule+0x29/0x40 [sunrpc]
process_one_work+0x1ad/0x370
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x10c/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x22/0x30
Fixes: 9ae075fdd1 "NFSv4: Convert open state lookup to use RCU"
Reviewed-by: Seiichi Ikarashi <s.ikarashi@fujitsu.com>
Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled. This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.
Fix this by:
(1) Copying the RTT/RTO calculation code from Linux's TCP implementation
and altering it to fit rxrpc.
(2) Altering the various users of the RTT to make use of the new SRTT
value.
(3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
value instead (which is needed in jiffies), along with a backoff.
Notes:
(1) rxrpc provides RTT samples by matching the serial numbers on outgoing
DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
against the reference serial number in incoming REQUESTED ACK and
PING-RESPONSE ACK packets.
(2) Each packet that is transmitted on an rxrpc connection gets a new
per-connection serial number, even for retransmissions, so an ACK can
be cross-referenced to a specific trigger packet. This allows RTT
information to be drawn from retransmitted DATA packets also.
(3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
on an rxrpc_call because many RPC calls won't live long enough to
generate more than one sample.
(4) The calculated SRTT value is in units of 8ths of a microsecond rather
than nanoseconds.
The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.
Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
Remove duplicate semicolon at the end of line in io_file_from_index()
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
- a small series fixing a use-after-free of bdi name (Christoph,Yufen)
- NVMe fix for a regression with the smaller CQ update (Alexey)
- NVMe fix for a hang at namespace scanning error recovery (Sagi)
- fix race with blk-iocost iocg->abs_vdebt updates (Tejun)
* tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block:
nvme: fix possible hang when ns scanning fails during error recovery
nvme-pci: fix "slimmer CQ head update"
bdi: add a ->dev_name field to struct backing_dev_info
bdi: use bdi_dev_name() to get device name
bdi: move bdi_dev_name out of line
vboxsf: don't use the source name in the bdi name
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.
The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.
Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
The name is only printed for a not registered bdi in writeback. Use the
device name there as is more useful anyway for the unlike case that the
warning triggers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
the blk-iocost changes, but we also need the base of the bdi
use-after-free as well as we build on top of it.
* block-5.7:
nvme: fix possible hang when ns scanning fails during error recovery
nvme-pci: fix "slimmer CQ head update"
bdi: add a ->dev_name field to struct backing_dev_info
bdi: use bdi_dev_name() to get device name
bdi: move bdi_dev_name out of line
vboxsf: don't use the source name in the bdi name
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
block: remove the bd_openers checks in blk_drop_partitions
nvme: prevent double free in nvme_alloc_ns() error handling
null_blk: Cleanup zoned device initialization
null_blk: Fix zoned command handling
block: remove unused header
blk-iocost: Fix error on iocost_ioc_vrate_adj
bdev: Reduce time holding bd_mutex in sync in blkdev_close()
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fixes from Jens Axboe:
- Fix finish_wait() balancing in file cancelation (Xiaoguang)
- Ensure early cleanup of resources in ring map failure (Xiaoguang)
- Ensure IORING_OP_SLICE does the right file mode checks (Pavel)
- Remove file opening from openat/openat2/statx, it's not needed and
messes with O_PATH
* tag 'io_uring-5.7-2020-05-08' of git://git.kernel.dk/linux-block:
io_uring: don't use 'fd' for openat/openat2/statx
splice: move f_mode checks to do_{splice,tee}()
io_uring: handle -EFAULT properly in io_uring_setup()
io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
do_splice() doesn't expect len to be 0. Just always return 0 in this
case as splice(2) does.
Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.
The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
The attempt protecting us from closing the ring itself wasn't really
complete, and we actually don't need it. The referencing of requests
themselve, and the references they hold on the ring, ensures that the
life time of the ring is sane. With the check removed, we can also
remove the need to have the close operation fget() the file.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the aux_data and aux_data_len to parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
In the process it modified the aux_data to include the change_attr,
but missed adding change_attr to a couple places where aux_data was
used. Specifically, when opening a file and the change_attr is not
added, the following attempt to lookup an object will fail inside
cachefiles_check_object_xattr() = -116 due to
nfs_fscache_inode_check_aux() failing memcmp on auxdata and returning
FSCACHE_CHECKAUX_OBSOLETE.
Fix this by adding nfs_fscache_update_auxdata() to set the auxdata
from all relevant fields in the inode, including the change_attr.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Commit f2aedb713c ("NFS: Add fs_context support.") reworked
NFS mount code paths for fs_context support which included
super_block initialization. In the process there was an extra
return left in the code and so we never call
nfs_fscache_get_super_cookie even if 'fsc' is given on as mount
option. In addition, there is an extra check inside
nfs_fscache_get_super_cookie for the NFS_OPTION_FSCACHE which
is unnecessary since the only caller nfs_get_cache_cookie
checks this flag.
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Commit 402cb8dda9 ("fscache: Attach the index key and aux data to
the cookie") added the index_key and index_key_len parameters to
fscache_acquire_cookie(), and updated the callers in the NFS client.
One of the callers was inside nfs_fscache_get_super_cookie()
and was changed to use the full struct nfs_fscache_key as the
index_key. However, a couple members of this structure contain
pointers and thus will change each time the same NFS share is
remounted. Since index_key is used for fscache_cookie->key_hash
and this subsequently is used to compare cookies, the effectiveness
of fscache with NFS is reduced to the point at which a umount
occurs. Any subsequent remount of the same share will cause a
unique NFS super_block index_key and key_hash to be generated for
the same data, rendering any prior fscache data unable to be
found. A simple reproducer demonstrates the problem.
1. Mount share with 'fsc', create a file, drop page cache
systemctl start cachefilesd
mount -o vers=3,fsc 127.0.0.1:/export /mnt
dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1
echo 3 > /proc/sys/vm/drop_caches
2. Read file into page cache and fscache, then unmount
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1
umount /mnt
3. Remount and re-read which should come from fscache
mount -o vers=3,fsc 127.0.0.1:/export /mnt
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/file1.bin of=/dev/null bs=4096 count=1
4. Check for READ ops in mountstats - there should be none
grep READ: /proc/self/mountstats
Looking at the history and the removed function, nfs_super_get_key(),
we should only use nfs_fscache_key.key plus any uniquifier, for
the fscache index_key.
Fixes: 402cb8dda9 ("fscache: Attach the index key and aux data to the cookie")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
This reverts commit df5db5f9ee.
This patch fixes a regression: patch df5db5f9ee allowed function
run_queue() to bypass its call to do_xmote() if revokes were queued for
the glock. That's wrong because its call to do_xmote() is what is
responsible for calling the go_sync() glops functions to sync both
the ail list and any revokes queued for it. By bypassing the call,
gfs2 could get into a stand-off where the glock could not be demoted
until its revokes are written back, but the revokes would not be
written back because do_xmote() was never called.
It "sort of" works, however, because there are other mechanisms like
the log flush daemon (logd) that can sync the ail items and revokes,
if it deems it necessary. The problem is: without file system pressure,
it might never deem it necessary.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, if the go_sync operation returned an error during
the do_xmote process (such as unable to sync metadata to the journal)
the code did goto out. That kept the glock locked, so it could not be
given away, which correctly avoids file system corruption. However,
it never set the withdraw bit or requeueing the glock work. So it would
hang forever, unable to ever demote the glock.
This patch changes to goto to a new label, skip_inval, so that errors
from go_sync are treated the same way as errors from go_inval:
The delayed withdraw bit is set and the work is requeued. That way,
the logd should eventually figure out there's a problem and withdraw
properly there.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pull ceph fixes from Ilya Dryomov:
"Fixes for an endianness handling bug that prevented mounts on
big-endian arches, a spammy log message and a couple error paths.
Also included a MAINTAINERS update"
* tag 'ceph-for-5.7-rc5' of git://github.com/ceph/ceph-client:
ceph: demote quotarealm lookup warning to a debug message
MAINTAINERS: remove myself as ceph co-maintainer
ceph: fix double unlock in handle_cap_export()
ceph: fix special error code in ceph_try_get_caps()
ceph: fix endianness bug when handling MDS session feature bits
This patch rearranges gfs2_add_revoke so that the extra glock
reference is added earlier on in the function to avoid races in which
the glock is freed before the new reference is taken.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_quota_unlock checked if quotas are
turned off, and if so, it branched to label out, which called
gfs2_quota_unhold. With the new system of gfs2_qa_get and put, we
no longer want to call gfs2_quota_unhold or we won't balance our
gets and puts.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_quota_lock checked if it was called
from a privileged user, and if so, it bypassed the quota check:
superuser can operate outside the quotas.
That's the wrong place for the check because the lock/unlock functions
are separate from the lock_check function, and you can do lock and
unlock without actually checking the quotas.
This patch moves the check to gfs2_quota_lock_check.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch removes a check from gfs2_quota_check for whether quotas
are enabled by the superblock. There is a test just prior for the
GIF_QD_LOCKED bit in the inode, and that can only be set by functions
that already check that quotas are enabled in the superblock.
Therefore, the check is redundant.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2_quota_change() would BUG_ON if the
qa_ref counter was not a positive number. This patch changes it to
be a withdraw instead. That way we can debug things more easily.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch fixes a couple of places in which gfs2_qa_get and gfs2_qa_put are
not balanced: we now keep references around whenever a file is open for writing
(see gfs2_open_common and gfs2_release), so we need to put all references we
grab in function gfs2_create_inode. This was broken in the successful case and
on one error path.
This also means that we don't have a reference to put in gfs2_evict_inode.
In addition, gfs2_qa_put was called for the wrong inode in gfs2_link.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pull driver core fixes from Greg KH:
"Here are a number of small driver core fixes for 5.7-rc5 to resolve a
bunch of reported issues with the current tree.
Biggest here are the reverts and patches from John Stultz to resolve a
bunch of deferred probe regressions we have been seeing in 5.7-rc
right now.
Along with those are some other smaller fixes:
- coredump crash fix
- devlink fix for when permissive mode was enabled
- amba and platform device dma_parms fixes
- component error silenced for when deferred probe happens
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work"
driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires
driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings
driver core: Revert default driver_deferred_probe_timeout value to 0
component: Silence bind error on -EPROBE_DEFER
driver core: Fix handling of fw_devlink=permissive
coredump: fix crash when umh is disabled
amba: Initialize dma_parms for amba devices
driver core: platform: Initialize dma_parms for platform devices
The random buffer write failure errortag patch introduced a local
mount pointer variable for the test macro, but the macro is compiled
out on !DEBUG kernels. This results in an unused variable warning.
Access the mount structure through the buffer pointer and remove the
local mount pointer to address the warning.
Fixes: 7376d74547 ("xfs: random buffer write failure errortag")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the helpers that handle incore buffer cancellation records to
xfs_buf_item_recover.c since they're not directly related to the main
log recovery machinery. No functional changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The only purpose of XFS_LI_RECOVERED is to prevent log recovery from
trying to replay recovered intents more than once. Therefore, we can
move the bit setting up to the ->iop_recover caller.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've made the recovered item tests all the same, we can hoist
the test and the ail locking code to the ->iop_recover caller and call
the recovery function directly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rename XFS_{EFI,BUI,RUI,CUI}_RECOVERED to XFS_LI_RECOVERED so that we
track recovery status in the log item, then get rid of the now unused
flags fields in each of those log item types.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
During recovery, every intent that we recover from the log has to be
added to the AIL. Replace the open-coded addition with a helper.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Replace the open-coded AIL item walking with a proper helper when we're
trying to release an intent item that has been finished. We add a new
->iop_match method to decide if an intent item matches a supplied ID.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've finished converting all types of log intent items to
provide an ->iop_recover function, we can convert the "is this an intent
item?" predicate to look for a non-null iop_recover pointer.
Move the predicate closer to the functions that use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>