No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY,
as we can do that right after retrieving it from userspace. Also move the
KIOCB_KEY definition to aio.c as it is an internal value not used by any
other place in the kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on. Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events. But now that we have keyed wakeups and heavily use
those for poll there aren't that many good reason left to keep the
multiple waitqueues, and if there are any ->poll is still around, the
driver just won't support aio poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Use straightline code with failure handling gotos instead of a lot
of nested conditionals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
These two functions now trigger a warning when CONFIG_PROC_FS is disabled:
fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function]
static int xqmstat_proc_show(struct seq_file *m, void *v)
^~~~~~~~~~~~~~~~~
fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function]
static int xqm_proc_show(struct seq_file *m, void *v)
^~~~~~~~~~~~~
Previously, they were referenced from an unused 'static const' structure,
which is silently dropped by gcc.
We can address the warning by adding the same #ifdef around them that
hides the reference.
Fixes: 3f3942aca6 ("proc: introduce proc_create_single{,_data}")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.
commit fd3cfad374 ("udf: Convert udf_disk_stamp_to_time() to use mktime64()")
eliminated the NULL return condition from udf_disk_stamp_to_time().
udf_time_to_disk_time() is always called with a valid dest pointer and
the return value is ignored.
Further, caller can as well check the dest pointer being passed in rather
than return argument.
Make both the functions return void.
This will make the inode timestamp conversion simpler.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: jack@suse.com
----
Changes from v1:
* fixed the pointer error pointed by Jan
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.
This will lead to type mismatch for memcpys.
Use regular assignments instead.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: trond.myklebust@primarydata.com
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.
Convert these print formats to use long long types to
avoid warnings and errors on conversion.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: zyan@redhat.com
Cc: ceph-devel@vger.kernel.org
As vfs moves to using struct timespec64 to represent times,
update the argument to timespec_truncate() to use
struct timespec64. Also change the name of the function.
The rest of the implementation logic is the same.
Move this to fs/inode.c instead of kernel/time/time.c as all the
users of this api are filesystems.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <viro@zeniv.linux.org.uk>
ext4_resize_fs() has an off-by-one bug when checking whether growing of
a filesystem will not overflow inode count. As a result it allows a
filesystem with 8192 inodes per group to grow to 64TB which overflows
inode count to 0 and makes filesystem unusable. Fix it.
Cc: stable@vger.kernel.org
Fixes: 3f8a6411fb
Reported-by: Jaco Kroon <jaco@uls.co.za>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Pull rdma fixes from Jason Gunthorpe:
"This is pretty much just the usual array of smallish driver bugs.
- remove bouncing addresses from the MAINTAINERS file
- kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and
hns drivers
- various small LOC behavioral/operational bugs in mlx5, hns, qedr
and i40iw drivers
- two fixes for patches already sent during the merge window
- a long-standing bug related to not decreasing the pinned pages
count in the right MM was found and fixed"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
RDMA/hns: Move the location for initializing tmp_len
RDMA/hns: Bugfix for cq record db for kernel
IB/uverbs: Fix uverbs_attr_get_obj
RDMA/qedr: Fix doorbell bar mapping for dpi > 1
IB/umem: Use the correct mm during ib_umem_release
iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()'
RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint
RDMA/i40iw: Avoid reference leaks when processing the AEQ
RDMA/i40iw: Avoid panic when objects are being created and destroyed
RDMA/hns: Fix the bug with NULL pointer
RDMA/hns: Set NULL for __internal_mr
RDMA/hns: Enable inner_pa_vld filed of mpt
RDMA/hns: Set desc_dma_addr for zero when free cmq desc
RDMA/hns: Fix the bug with rq sge
RDMA/hns: Not support qp transition from reset to reset for hip06
RDMA/hns: Add return operation when configured global param fail
RDMA/hns: Update convert function of endian format
RDMA/hns: Load the RoCE dirver automatically
RDMA/hns: Bugfix for rq record db for kernel
RDMA/hns: Add rq inline flags judgement
...
Pull btrfs fix from David Sterba:
"A one-liner that prevents leaking an internal error value 1 out of the
ftruncate syscall.
This has been observed in practice. The steps to reproduce make a
common pattern (open/write/fync/ftruncate) but also need the
application to not check only for negative values and happens only for
compressed inlined files.
The conditions are narrow but as this could break userspace I think
it's better to merge it now and not wait for the merge window"
* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix error handling in btrfs_truncate()
The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
chown files when inode owner is invalid. Ordinarily the
capable_wrt_inode_uidgid check is sufficient to allow access to files
but when the underlying filesystem has uids or gids that don't map to
the current user namespace it is not enough, so the chown permission
checks need to be extended to allow this case.
Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.
Once chown has been called the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.
An ordinary filesystem mountable by a userns root will limit all uids
and gids in s_user_ns or the INVALID_UID and INVALID_GID to flag all
others. So having this added permission limited to just INVALID_UID
and INVALID_GID is sufficient to handle every case on an ordinary filesystem.
Of the virtual filesystems at least proc is known to set s_user_ns to
something other than &init_user_ns, while at the same time presenting
some files owned by GLOBAL_ROOT_UID. Those files the mounter of proc
in a user namespace should not be able to chown to get access to.
Limiting the relaxation in permission to just the minimum of allowing
changing INVALID_UID and INVALID_GID prevents problems with cases like
that.
The original version of this patch was written by: Seth Forshee. I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotcha's and putting up with my
semi-paranoid feedback.
Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
These filesystems already always set SB_I_NODEV so mknod will not be
useful for gaining control of any devices no matter their permissions.
This will allow overlayfs and applications like to fakeroot to use
device nodes to represent things on disk.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Changing the link count of an inode via unlink or link will cause a
write back of that inode. If the uids or gids are invalid (aka not known
to the kernel) writing the inode back may change the uid or gid in the
filesystem. To prevent possible filesystem and to avoid the need for
filesystem maintainers to worry about it don't allow operations on
inodes with an invalid uid or gid.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).
To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:
int main(void) {
char buf[256] = { 0 };
int ret;
int fd;
fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
if (fsync(fd) == -1) {
perror("fsync");
close(fd);
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);
return EXIT_SUCCESS;
}
Fixes: ddfae63cc8 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().
Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).
Cc: stable@kernel.org
Fixes: 0460fef2a9 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
struct file *pipe_to_umh;
struct file *pipe_from_umh;
pid_t pid;
};
that GPLed kernel modules (signed or unsigned) can use it to execute part
of its own data as swappable user mode process.
The kernel will do:
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- user-mode-helper code will do_execve that file and, before the process
starts, the kernel will create two unix pipes for bidirectional
communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
'struct umh_info' with two unix pipes and the pid of the user process
As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as traditional kernel module into distro specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it any special from kernel and user space
tooling point of view.
Such approach enables kernel to delegate functionality traditionally done
by the kernel modules into the user space processes (either root or !root) and
reduces security attack surface of the new code. The buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In case of the bpfilter project such architecture allows complex control plane
to be done in the user space while bpf based data plane stays in the kernel.
Since umh can crash, can be oom-ed by the kernel, killed by the admin,
the kernel module that uses them (like bpfilter) needs to manage life
time of umh on its own via two unix pipes and the pid of umh.
The exit code of such kernel module should kill the umh it started,
so that rmmod of the kernel module will cleanup the corresponding umh.
Just like if the kernel module does kmalloc() it should kfree() it
in the exit code.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ext4 will always create ext4 extended attributes which do not have a
value (where e_value_size is zero) with e_value_offs set to zero. In
most places e_value_offs will not be used in a substantive way if
e_value_size is zero.
There was one exception to this, which is in ext4_xattr_set_entry(),
where if there is a maliciously crafted file system where there is an
extended attribute with e_value_offs is non-zero and e_value_size is
0, the attempt to remove this xattr will result in a negative value
getting passed to memmove, leading to the following sadness:
[ 41.225365] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 44.538641] BUG: unable to handle kernel paging request at ffff9ec9a3000000
[ 44.538733] IP: __memmove+0x81/0x1a0
[ 44.538755] PGD 1249bd067 P4D 1249bd067 PUD 1249c1067 PMD 80000001230000e1
[ 44.538793] Oops: 0003 [#1] SMP PTI
[ 44.539074] CPU: 0 PID: 1470 Comm: poc Not tainted 4.16.0-rc1+ #1
...
[ 44.539475] Call Trace:
[ 44.539832] ext4_xattr_set_entry+0x9e7/0xf80
...
[ 44.539972] ext4_xattr_block_set+0x212/0xea0
...
[ 44.540041] ext4_xattr_set_handle+0x514/0x610
[ 44.540065] ext4_xattr_set+0x7f/0x120
[ 44.540090] __vfs_removexattr+0x4d/0x60
[ 44.540112] vfs_removexattr+0x75/0xe0
[ 44.540132] removexattr+0x4d/0x80
...
[ 44.540279] path_removexattr+0x91/0xb0
[ 44.540300] SyS_removexattr+0xf/0x20
[ 44.540322] do_syscall_64+0x71/0x120
[ 44.540344] entry_SYSCALL_64_after_hwframe+0x21/0x86
https://bugzilla.kernel.org/show_bug.cgi?id=199347
This addresses CVE-2018-10840.
Reported-by: "Xu, Wen" <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Fixes: dec214d00e ("ext4: xattr inode deduplication")
Implement network namespacing within AFS, but don't yet let mounts occur
outside the init namespace. An additional patch will be required propagate
the network namespace across automounts.
Signed-off-by: David Howells <dhowells@redhat.com>
The afs_net::ws_cell member is sometimes used under RCU conditions from
within an seq-readlock. It isn't, however, marked __rcu and it isn't set
using the proper RCU barrier-imposing functions.
Fix this by annotating it with __rcu and using appropriate barriers to
make sure accesses are correctly ordered.
Without this, the code can produce the following warning:
>> fs/afs/proc.c:151:24: sparse: incompatible types in comparison expression (different address spaces)
Fixes: f044c8847b ("afs: Lay the groundwork for supporting network namespaces")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Sparse doesn't appear able to handle the conditionally-taken locks in
xdr_decode_AFSFetchStatus(), even though the lock and unlock are both
contingent on the same unvarying function argument.
Deal with this by interpolating a wrapper function that takes the lock if
needed and calls xdr_decode_AFSFetchStatus() on two separate branches, one
with the lock held and one without.
This allows Sparse to work out the locking.
Signed-off-by: David Howells <dhowells@redhat.com>
Similar to the ->copy_from_iter() operation, a platform may want to
deploy an architecture or device specific routine for handling reads
from a dax_device like /dev/pmemX. On x86 this routine will point to a
machine check safe version of copy_to_iter(). For now, add the plumbing
to device-mapper and the dax core.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If ext4_find_inline_data_nolock() returns an error it needs to get
reflected up to ext4_iget(). In order to fix this,
ext4_iget_extra_inode() needs to return an error (and not return
void).
This is related to "ext4: do not allow external inodes for inline
data" (which fixes CVE-2018-11412) in that in the errors=continue
case, it would be useful to for userspace to receive an error
indicating that file system is corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
The inline data feature was implemented before we added support for
external inodes for xattrs. It makes no sense to support that
combination, but the problem is that there are a number of extended
attribute checks that are skipped if e_value_inum is non-zero.
Unfortunately, the inline data code is completely e_value_inum
unaware, and attempts to interpret the xattr fields as if it were an
inline xattr --- at which point, Hilarty Ensues.
This addresses CVE-2018-11412.
https://bugzilla.kernel.org/show_bug.cgi?id=199803
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Cc: stable@kernel.org
... and take the "check if file is open, pick ->f_mode" into a helper;
tid_fd_revalidate() can use it.
The next patch will get rid of tid_fd_revalidate() calls in instantiate
callbacks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, calling pid_revalidate() in the end of <pid>/* lookups
is *not* about closing any kind of races; that used to be true once
upon a time, but these days those comments are actively misleading.
Especially since pid_revalidate() doesn't even do d_drop() on
failure anymore. It doesn't matter, anyway, since once
pid_revalidate() starts returning false, ->d_delete() of those
dentries starts saying "don't keep"; they won't get stuck in
dcache any longer than they are pinned.
These calls cannot be just removed, though - the side effect of
pid_revalidate() (updating i_uid/i_gid/etc.) is what we are calling
it for here.
Let's separate the "update ownership" into a new helper (pid_update_inode())
and use it, both in lookups and in pid_revalidate() itself.
The comments in pid_revalidate() are also out of date - they refer to
the time when pid_revalidate() used to call d_drop() directly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That's one case when unlink() destroys a subtree, thanks to "resource
fork" idiocy. We might forcibly evict that shit on unlink(2), but
for now let's just disallow overmounting; as it is, anything that
plays games with those would leak mounts.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>