Use d_fsdata instead, which is the same size. Introduce helpers to hide
the typecasts.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Steve French <sfrench@samba.org>
The problem with writecount is: we want consistent handling of it for
underlying filesystems as well as overlayfs. Making sure i_writecount is
correct on all layers is difficult. Instead this patch makes sure that
when write access is acquired, it's always done on the underlying writable
layer (called the upper layer). We must also make sure to look at the
writecount on this layer when checking for conflicting leases.
Open for write already updates the upper layer's writecount. Leaving only
truncate.
For truncate copy up must happen before get_write_access() so that the
writecount is updated on the upper layer. Problem with this is if
something fails after that, then copy-up was done needlessly. E.g. if
break_lease() was interrupted. Probably not a big deal in practice.
Another interesting case is if there's a denywrite on a lower file that is
then opened for write or truncated. With this patch these will succeed,
which is somewhat counterintuitive. But I think it's still acceptable,
considering that the copy-up does actually create a different file, so the
old, denywrite mapping won't be touched.
On non-overlayfs d_real() is an identity function and d_real_inode() is
equivalent to d_inode() so this patch doesn't change behavior in that case.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
This patch allows flock, posix locks, ofd locks and leases to work
correctly on overlayfs.
Instead of using the underlying inode for storing lock context use the
overlay inode. This allows locks to be persistent across copy-up.
This is done by introducing locks_inode() helper and using it instead of
file_inode() to get the inode in locking code. For non-overlayfs the two
are equivalent, except for an extra pointer dereference in locks_inode().
Since lock operations are in "struct file_operations" we must also make
sure not to call underlying filesystem's lock operations. Introcude a
super block flag MS_NOREMOTELOCK to this effect.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
On overlayfs relatime_need_update() needs inode times to be correct on
overlay inode. But i_mtime and i_ctime are updated by filesystem code on
underlying inode only, so they will be out-of-date on the overlay inode.
This patch copies the times from the underlying inode if needed. This
can't be done if called from RCU lookup (link following) but link m/ctime
are not updated by fs, so this is all right.
This patch doesn't change functionality for anything but overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This fixes a bug where the permission was not properly checked in
overlayfs. The testcase is ltp/utimensat01.
It is also cleaner and safer to do the permission checking in the vfs
helper instead of the caller.
This patch introduces an additional ia_valid flag ATTR_TOUCH (since
touch(1) is the most obvious user of utimes(NULL)) that is passed into
notify_change whenever the conditions for this special permission checking
mode are met.
Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Aihua Zhang <zhangaihua1@huawei.com>
Cc: <stable@vger.kernel.org> # v3.18+
This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set. Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.
I have tested the patch on my machine.
To test the behavior, compile and run this:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/personality.h>
#include <linux/aio_abi.h>
#include <err.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
int main(void) {
personality(READ_IMPLIES_EXEC);
aio_context_t ctx = 0;
if (syscall(__NR_io_setup, 1, &ctx))
err(1, "io_setup");
char cmd[1000];
sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
(int)getpid());
system(cmd);
return 0;
}
In the output, "rw-s" is good, "rwxs" is bad.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure. However, the output
length was redundant with the value written to 'oname->len'. It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.
Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead. For consistency also make other callers check for
a nonzero result rather than a negative result.
This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
fscrypt_complete() was used only for data pages, not for all
encryption/decryption. Rename it to page_crypt_complete().
dir_crypt_complete() was used for filename encryption/decryption for
both directory entries and symbolic links. Rename it to
fname_crypt_complete().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch removes some #includes that are clearly not needed, such as a
reference to ecryptfs, which is unrelated to the new filesystem
encryption code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Kirill A Shutemov reports that the kernel doesn't try to cap dest_count
in any way, and uses the number to allocate kernel memory. This causes
high order allocation warnings in the kernel log if someone passes in a
big enough value. We should clamp the allocation at PAGE_SIZE to avoid
stressing the VM.
The two existing users of the dedupe ioctl never send more than 120
requests, so we can safely clamp dest_range at PAGE_SIZE, because with
4k pages we can handle up to 127 dedupe candidates. Given the max
extent length of 16MB, we can end up doing 2GB of IO which is plenty.
[ Note: the "offsetof()" can't overflow, because 'count' is just a
16-bit integer. That's not obvious in the limited context of the
patch, so I'm noting it here because it made me go look. - Linus ]
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the VFS functions in the dedupe ioctl path return int status, so
the ioctl handler ought to as well.
Found by Coverity, CID 1350952.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Validate fscrypt_context.format and fscrypt_context.flags. If
unrecognized values are set, then the kernel may not know how to
interpret the encrypted file, so it should fail the operation.
- Validate that AES_256_XTS is used for contents and that AES_256_CTS is
used for filenames. It was previously possible for the kernel to
accept these reversed, though it would have taken manual editing of
the block device. This was not intended.
- Fail cleanly rather than BUG()-ing if a file has an unexpected type.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are some repetitive code in jbd2_journal_init_dev() and
jbd2_journal_init_inode(). So this patch moves the common code into
journal_init_common() helper to simplify the code. And fix the coding
style warnings reported by checkpatch.pl by the way.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_alloc_file_blocks() is called from ext4_zero_range() and
ext4_fallocate() both already testing EXT4_INODE_EXTENTS
We can call ext_depth(inode) unconditionnally.
[ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
called for a indirect-mapped inode in the future. -- tytso ]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pages clear buffers after ext4 delayed block allocation failed,
However, it does not clean its pte_dirty flag.
if the pages unmap ,in cording to the pte_dirty ,
unmap_page_range may try to call __set_page_dirty,
which may lead to the bugon at
mpage_prepare_extent_to_map:head = page_buffers(page);.
This patch just call clear_page_dirty_for_io to clean pte_dirty
at mpage_release_unused_pages for pages mmaped.
Steps to reproduce the bug:
(1) mmap a file in ext4
addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
fd, 0);
memset(addr, 'i', 4096);
(2) return EIO at
ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent
which causes this log message to be print:
ext4_msg(sb, KERN_CRIT,
"Delayed block allocation failed for "
"inode %lu at logical offset %llu with"
" max blocks %u with error %d",
inode->i_ino,
(unsigned long long)map->m_lblk,
(unsigned)map->m_len, -err);
(3)Unmap the addr cause warning at
__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));
(4) wait for a minute,then bugon happen.
Cc: stable@vger.kernel.org
Signed-off-by: wangguang <wangguang03@zte.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The macro f2fs_check_acl is defined but never used since
the initial commit, this patch removes the code that has
been dead for several years.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When src and dst is the same file, and the latter part of source region
overlaps with the former part of destination region, current implement
will overwrite data which hasn't been moved yet and truncate data in
overlapped region.
This patch return -EINVAL when such cases occur and return 0 when
source region and destination region is actually the same part of
the same file.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
commit e1defc4ff0
"block: Do away with the notion of hardsect_size"
removed the notion of "hardware sector size" from
the kernel in favor of logical block size, but
references remain in comments and documentation.
Update the remaining sites mentioning hardsect.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
DAX support for block devices was removed in commits 03cdad
("block: disable block device DAX by default") and 99a01cd
("block: remove BLK_DEV_DAX config option"), but we still kept a call to
dax_do_io and some uneeded i_flags manipulations introduced in commit
bbab37 ("block: Add support for DAX reads/writes to block devices").
Remove those leftovers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As it stands today, the "fail immediately" vs. "retry forever"
values for max_retries and retry_timeout_seconds in the xfs metadata
error configurations are not consistent.
A retry_timeout_seconds of 0 means "retry forever," but a
max_retries of 0 means "fail immediately."
retry_timeout_seconds < 0 is disallowed, while max_retries == -1
means "retry forever."
Make this consistent across the error configs, such that a value of
0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
and a value of -1 always means "retry forever."
This makes retry_timeout a signed long to accommodate the -1, even
though it stores jiffies. Given our limit of a 1 day maximum
timeout, this should be sufficient even at much higher HZ values
than we have available today.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent"
changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is
insufficient space in the buffer assuming that setting context->count to -1
would be enough, but all of the ->put_listent callers only check seen_enough.
This results in a failed assertion:
XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
in insufficient buffer size case.
This is only reproducible with at least 2 xattrs and only when the buffer
gets depleted before the last one.
Furthermore if buffersize is such that it is enough to hold the last xattr's
name, but not enough to hold the sum of preceeding xattr names listxattr won't
fail with ERANGE, but will suceed returning last xattr's name without the
first character. The first character end's up overwriting data stored at
(context->alist - 1).
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
"blocks" should be added back to fdblocks at undo time, not taken
away, i.e. the minus sign should not be used.
This is a regression introduced by commit 0d485ada40 ("xfs: use
generic percpu counters for free block counter"). And it's found by
code inspection, I didn't it in real world, so there's no
reproducer.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Previously, f2fs_write_begin sets PageUptodate all the time. But, when user
tries to update the entire page (i.e., len == PAGE_SIZE), we need to consider
that the page is able to be copied partially afterwards. In such the case,
we will lose the remaing region in the page.
This patch fixes this by setting PageUptodate in f2fs_write_end as given copied
result. In the short copy case, it returns zero to let generic_perform_write
retry copying user data again.
As a result, f2fs_write_end() works:
PageUptodate len copied return retry
1. no 4096 4096 4096 false -> return 4096
2. no 4096 1024 0 true -> goto #1 case
3. yes 2048 2048 2048 false -> return 2048
4. yes 2048 1024 1024 false -> return 1024
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__exchange_data_block should take block indexes as parameters
instead of offsets in bytes.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
state accounting
- Fix the CREATE_SESSION slot number
Bugfixes:
- sunrpc: fix a UDP memory accounting regression
- NFS: Fix an error reporting regression in nfs_file_write()
- pNFS: Fix further layout stateid issues
- RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer
exhaustion...")
- RPC/rdma: Fix receive buffer accounting"
* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Fix the CREATE_SESSION slot number accounting
xprtrdma: Fix receive buffer accounting
xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
NFS: Fix error reporting in nfs_file_write()
sunrpc: fix UDP memory accounting
When truncating cached inline_data, we don't need to allocate a new page
all the time. Instead, it must check its page cache only.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Ensure that we conform to the algorithm described in RFC5661, section
18.36.4 for when to bump the sequence id. In essence we do it for all
cases except when the RPC call timed out, or in case of the server returning
NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Pull libnvdimm fixes from Dan Williams:
"nvdimm fixes for v4.8, two of them are tagged for -stable:
- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
DAX pmd mappings end up with an uncached pgprot, and unusable
performance for the device-dax interface. The device-dax interface
appeared in 4.7 so this is tagged for -stable.
- Fix a couple VM_BUG_ON() checks in the show_smaps() path to
understand DAX pmd entries. This fix is tagged for -stable.
- Fix a mis-merge of the nfit machine-check handler to flip the
polarity of an if() to match the final version of the patch that
Vishal sent for 4.8-rc1. Without this the nfit machine check
handler never detects / inserts new 'badblocks' entries which
applications use to identify lost portions of files.
- For test purposes, fix the nvdimm_clear_poison() path to operate on
legacy / simulated nvdimm memory ranges. Without this fix a test
can set badblocks, but never clear them on these ranges.
- Fix the range checking done by dax_dev_pmd_fault(). This is not
tagged for -stable since this problem is mitigated by specifying
aligned resources at device-dax setup time.
These patches have appeared in a next release over the past week. The
recent rebase you can see in the timestamps was to drop an invalid fix
as identified by the updated device-dax unit tests [1]. The -mm
touches have an ack from Andrew"
[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: allow legacy (e820) pmem region to clear bad blocks
nfit, mce: Fix SPA matching logic in MCE handler
mm: fix cache mode of dax pmd mappings
mm: fix show_smap() for zone_device-pmd ranges
dax: fix mapping size check
Pull fscrypto fixes fromTed Ts'o:
"Fix some brown-paper-bag bugs for fscrypto, including one one which
allows a malicious user to set an encryption policy on an empty
directory which they do not own"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fscrypto: require write access to mount to set encryption policy
fscrypto: only allow setting encryption policy on directories
fscrypto: add authorization check for setting encryption policy
Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem. This was handled correctly by f2fs but not by ext4. Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
The patch
fs/cifs: make share unaccessible at root level mountable
makes use of prepaths when any component of the underlying path is
inaccessible.
When mounting 2 separate shares having different prepaths but are other
wise similar in other respects, we end up sharing superblocks when we
shouldn't be doing so.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Fix memory leaks introduced by the patch
fs/cifs: make share unaccessible at root level mountable
Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>