I'm writing a tool to visualize the enospc system inside btrfs, I need this
tracepoint in order to keep track of the block groups in the system. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We reserve space for the inode update when we first reserve space for writing to
a file. However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents. Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve. The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly. In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.
We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations. Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.
This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few races in the metadata reservation stuff. First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes. So use the normal
btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation. So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv. This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size. So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info. This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit. Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The f2fs_map_blocks is very related to the performance, so let's avoid any
latency to read ahead node pages.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The readahead nat pages are more likely to be reclaimed quickly, so it'd better
to gather more free nids in advance.
And, let's keep some free nids as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en.h
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
drivers/net/usb/r8152.c
All three conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug can be reproducible with fsfuzzer, although, I couldn't reproduce it
100% of my tries, it is quite easily reproducible.
During the deletion of an inode, ext2_xattr_delete_inode() does not check if the
block pointed by EXT2_I(inode)->i_file_acl is a valid data block, this might
lead to a deadlock, when i_file_acl == 1, and the filesystem block size is 1024.
In that situation, ext2_xattr_delete_inode, will load the superblock's buffer
head (instead of a valid i_file_acl block), and then lock that buffer head,
which, ext2_sync_super will also try to lock, making the filesystem deadlock in
the following stack trace:
root 17180 0.0 0.0 113660 660 pts/0 D+ 07:08 0:00 rmdir
/media/test/dir1
[<ffffffff8125da9f>] __sync_dirty_buffer+0xaf/0x100
[<ffffffff8125db03>] sync_dirty_buffer+0x13/0x20
[<ffffffffa03f0d57>] ext2_sync_super+0xb7/0xc0 [ext2]
[<ffffffffa03f10b9>] ext2_error+0x119/0x130 [ext2]
[<ffffffffa03e9d93>] ext2_free_blocks+0x83/0x350 [ext2]
[<ffffffffa03f3d03>] ext2_xattr_delete_inode+0x173/0x190 [ext2]
[<ffffffffa03ee9e9>] ext2_evict_inode+0xc9/0x130 [ext2]
[<ffffffff8123fd23>] evict+0xb3/0x180
[<ffffffff81240008>] iput+0x1b8/0x240
[<ffffffff8123c4ac>] d_delete+0x11c/0x150
[<ffffffff8122fa7e>] vfs_rmdir+0xfe/0x120
[<ffffffff812340ee>] do_rmdir+0x17e/0x1f0
[<ffffffff81234dd6>] SyS_rmdir+0x16/0x20
[<ffffffff81838cf2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[<ffffffffffffffff>] 0xffffffffffffffff
Fix this by using the same approach ext4 uses to test data blocks validity,
implementing ext2_data_block_valid.
An another possibility when the superblock is very corrupted, is that i_file_acl
is 1, block_count is 1 and first_data_block is 0. For such situations, we might
have i_file_acl pointing to a 'valid' block, but still step over the superblock.
The approach I used was to also test if the superblock is not in the range
described by ext2_data_block_valid() arguments
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory. Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We want to ensure that we write the cached data to the server, but
don't require it be synced to disk. If the server reboots, we will
get a stateid error, which will cause us to retry anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to ensure that any writes to the destination file are serialised
with the copy, meaning that the writeback has to occur under the inode lock.
Also relax the writeback requirement on the source, and rely on the
stateid checking to tell us if the source rebooted. Add the helper
nfs_filemap_write_and_wait_range() to call pnfs_sync_inode() as
is appropriate for pNFS servers that may need a layoutcommit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When punching holes in a file, we want to ensure the operation is
serialised w.r.t. other writes, meaning that we want to call
nfs_sync_inode() while holding the inode lock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When retrieving stat() information, NFS unfortunately does require us to
sync writes to disk in order to ensure that mtime and ctime are up to
date. However we shouldn't have to ensure that those writes are persisted.
Relaxing that requirement does mean that we may see an mtime/ctime change
if the server reboots and forces us to replay all writes.
The exception to this rule are pNFS clients that are required to send
layoutcommit, however that is dealt with by the call to pnfs_sync_inode()
in _nfs_revalidate_inode().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A file that is open for O_DIRECT is by definition not obeying
close-to-open cache consistency semantics, so let's not cache
the attributes too aggressively either.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We're now waiting immediately after taking the locks, so waiting
in fsync() and write_begin() is either redundant or potentially
subject to livelock (if not holding the lock).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is only one caller that sets the "write" argument to true,
so just move the call to nfs_zap_mapping() and get rid of the
now redundant argument.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow dio requests to be scheduled in parallel, but ensuring that they
do not conflict with buffered I/O.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
On success, the RPC callbacks will ensure that we make the appropriate calls
to nfs_writeback_update_inode()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should not be interested in looking at the value of the stable field,
since that could take any value.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we need to update the cached attributes, then we'd better make
sure that we also layoutcommit first. Otherwise, the server may have stale
attributes.
Prior to this patch, the revalidation code tried to "fix" this problem by
simply disabling attributes that would be affected by the layoutcommit.
That approach breaks nfs_writeback_check_extend(), leading to a file size
corruption.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
So ensure that we mark the layout for commit once the write is done,
and then ensure that the commit to ds is finished before sending
layoutcommit.
Note that by doing this, we're able to optimise away the commit
for the case of servers that don't need layoutcommit in order to
return updated attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should always do a layoutcommit after commit to DS, except if
the layout segment we're using has set FF_FLAGS_NO_LAYOUTCOMMIT.
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Mostly supporting filesystems outside of init_user_ns is
s/&init_usre_ns/dquot->dq_sb->s_user_ns/. An actual need for
supporting quotas on filesystems outside of s_user_ns is quite a ways
away and to be done responsibily needs an audit on what can happen
with hostile quota files. Until that audit is complete don't attempt
to support quota files on filesystems outside of s_user_ns.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.
In practice for quota supporting filesystems today this is the exact
same check as qid_valid. As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.
Replace the qid_valid calls with qid_has_mapping as values come in
from userspace. This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.
Call qid_has_mapping from dqget. This ensures the passed in qid has a
prepresentation on the underlying filesystem. Previously this was
unnecessary as filesystesm never had qids that could not map. With
the introduction of filesystems outside of s_user_ns this will not
remain true.
All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
It is expected that filesystems can not represent uids and gids from
outside of their user namespace. Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.
The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.
Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk. In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.
Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.
In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set. Reads of the inodes appear safe and
useful, but any write or modification is disallowed. The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values. After such a chown the inode is normal and may
be treated as such.
Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.
One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.
Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr. As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed. evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache. Use d_in_lookup() to tell one from another, rather
than d_unhashed().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In orangefs_inode_getxattr(), an fsuid is written to dmesg. The kuid is
converted to a userspace uid via from_kuid(current_user_ns(), [...]), but
since dmesg is global, init_user_ns should be used here instead.
In copy_attributes_from_inode(), op_alloc() and fill_default_sys_attrs(),
upcall structures are populated with uids/gids that have been mapped into
the caller's namespace. However, those upcall structures are read by
another process (the userspace filesystem driver), and that process might
be running in another namespace. This effectively lets any user spoof its
uid and gid as seen by the userspace filesystem driver.
To fix the second issue, I just construct the opcall structures with
init_user_ns uids/gids and require the filesystem server to run in the
init namespace. Since orangefs is full of global state anyway (as the error
message in DUMP_DEVICE_ERROR explains, there can only be one userspace
orangefs filesystem driver at once), that shouldn't be a problem.
[
Why does orangefs even exist in the kernel if everything does upcalls into
userspace? What does orangefs do that couldn't be done with the FUSE
interface? If there is no good answer to those questions, I'd prefer to see
orangefs kicked out of the kernel. Can that be done for something that
shipped in a release?
According to commit f7ab093f74 ("Orangefs: kernel client part 1"), they
even already have a FUSE daemon, and the only rational reason (apart from
"but most of our users report preferring to use our kernel module instead")
given for not wanting to use FUSE is one "in-the-works" feature that could
probably be integated into FUSE instead.
]
This patch has been compile-tested.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Mike,
On Fri, Jun 3, 2016 at 9:44 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> We use the return value in this one line you changed, our userspace code gets
> ill when we send it (-ENOMEM +1) as a key length...
ah, my mistake. Here's a fixed version.
Thanks,
Andreas
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Orangefs has a catch-all xattr handler that effectively does what the
trusted handler does already.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>