When copying up a file that has multiple hard links we need to break any
association with the origin file. This makes copy-up be essentially an
atomic replace.
The new file has nothing to do with the old one (except having the same
data and metadata initially), so don't set the overlay.origin attribute.
We can relax this in the future when we are able to index upper object by
origin.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Nothing prevents mischief on upper layer while we are busy copying up the
data.
Move the lookup right before the looked up dentry is actually used.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 01ad3eb8a0 ("ovl: concurrent copy up of regular files")
Cc: <stable@vger.kernel.org> # v4.11
The current code works only for the case where we have exactly one slot,
which is no longer true.
nfs4_free_slot() will automatically declare the callback channel to be
drained when all slots have been returned.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the task calling layoutget is signalled, then it is possible for the
calls to nfs4_sequence_free_slot() and nfs4_layoutget_prepare() to race,
in which case we leak a slot.
The fix is to move the call to nfs4_sequence_free_slot() into the
nfs4_layoutget_release() so that it gets called at task teardown time.
Fixes: 2e80dbe7ac ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add a new dqget flag that grabs the dquot without taking the ilock.
This will be used by the scrubber (which will have already grabbed
the ilock) to perform basic sanity checking of the quota data.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by calling __xfs_set_acl() instead of xfs_set_acl() when
setting up inode in xfs_generic_create(). That prevents SGID bit
clearing and mode is properly set by posix_acl_create() anyway. We also
reorder arguments of __xfs_set_acl() to match the ordering of
xfs_set_acl() to make things consistent.
Fixes: 073931017b
CC: stable@vger.kernel.org
CC: Darrick J. Wong <darrick.wong@oracle.com>
CC: linux-xfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XFS runs an eofblocks reclaim scan before returning an ENOSPC error to
userspace for buffered writes. This facilitates aggressive speculative
preallocation without causing user visible side effects such as
premature ENOSPC.
Run a cowblocks scan in the same situation to reclaim lingering COW fork
preallocation throughout the filesystem.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that error injection tags support dynamic frequency adjustment,
replace the debug mode sysfs knob that controls log record CRC error
injection with an error injection tag.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We now have enhanced error injection that can control the frequency
with which errors happen, so convert drop_writes to use this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Since we moved the injected error frequency controls to the mountpoint,
we can get rid of the last argument to XFS_TEST_ERROR.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Creates a /sys/fs/xfs/$dev/errortag/ directory to control the errortag
values directly. This enables us to control the randomness values,
rather than having to accept the defaults.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Remove the xfs_etest structure in favor of a per-mountpoint structure.
This will give us the flexibility to set as many error injection points
as we want, and later enable us to set up sysfs knobs to set the trigger
frequency as we wish. This comes at a cost of higher memory use, but
unti we hit 1024 injection points (we're at 29) or a lot of mounts this
shouldn't be a huge issue.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Use memdup_user() helper instead of open-coding to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
* writehint.c: get or set an inode write hint
*/
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>
#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
int main(int argc, char *argv[])
{
uint64_t hint;
int fd, ret;
if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}
if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
printf("%s: hint %s\n", argv[1], str[hint]);
close(fd);
return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
... and lose HAVE_ARCH_...; if copy_{to,from}_user() on an
architecture sucks badly enough to make it a problem, we have
a worse problem.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Log recovery allocates in-core transaction and member item data
structures on-demand as it processes the on-disk log. Transactions
are allocated on first encounter on-disk and stored in a hash table
structure where they are easily accessible for subsequent lookups.
Transaction items are also allocated on demand and are attached to
the associated transactions.
When a commit record is encountered in the log, the transaction is
committed to the fs and the in-core structures are freed. If a
filesystem crashes or shuts down before all in-core log buffers are
flushed to the log, however, not all transactions may have commit
records in the log. As expected, the modifications in such an
incomplete transaction are not replayed to the fs. The in-core data
structures for the partial transaction are never freed, however,
resulting in a memory leak.
Update xlog_do_recovery_pass() to first correctly initialize the
hash table array so empty lists can be distinguished from populated
lists on function exit. Update xlog_recover_free_trans() to always
remove the transaction from the list prior to freeing the associated
memory. Finally, walk the hash table of transaction lists as the
last step before it goes out of scope and free any transactions that
may remain on the lists. This prevents a memory leak of partial
transactions in the log.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This makes it consistent with ->is_encrypted(), ->empty_dir(), and
fscrypt_dummy_context_enabled().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fscrypt provides facilities to use different encryption algorithms which
are selectable by userspace when setting the encryption policy. Currently,
only AES-256-XTS for file contents and AES-256-CBC-CTS for file names are
implemented. This is a clear case of kernel offers the mechanism and
userspace selects a policy. Similar to what dm-crypt and ecryptfs have.
This patch adds support for using AES-128-CBC for file contents and
AES-128-CBC-CTS for file name encryption. To mitigate watermarking
attacks, IVs are generated using the ESSIV algorithm. While AES-CBC is
actually slightly less secure than AES-XTS from a security point of view,
there is more widespread hardware support. Using AES-CBC gives us the
acceptable performance while still providing a moderate level of security
for persistent storage.
Especially low-powered embedded devices with crypto accelerators such as
CAAM or CESA often only support AES-CBC. Since using AES-CBC over AES-XTS
is basically thought of a last resort, we use AES-128-CBC over AES-256-CBC
since it has less encryption rounds and yields noticeable better
performance starting from a file size of just a few kB.
Signed-off-by: Daniel Walter <dwalter@sigma-star.at>
[david@sigma-star.at: addressed review comments]
Signed-off-by: David Gstir <david@sigma-star.at>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fscrypt_free_filename() only needs to do a kfree() of crypto_buf.name,
which works well as an inline function. We can skip setting the various
pointers to NULL, since no user cares about it (the name is always freed
just before it goes out of scope).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, filesystems allow truncate(2) on an encrypted file without
the encryption key. However, it's impossible to correctly handle the
case where the size being truncated to is not a multiple of the
filesystem block size, because that would require decrypting the final
block, zeroing the part beyond i_size, then encrypting the block.
As other modifications to encrypted file contents are prohibited without
the key, just prohibit truncate(2) as well, making it fail with ENOKEY.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Merge misc fixes from Andrew Morton:
"8 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/exec.c: account for argv/envp pointers
ocfs2: fix deadlock caused by recursive locking in xattr
slub: make sysfs file removal asynchronous
lib/cmdline.c: fix get_options() overflow while parsing ranges
fs/dax.c: fix inefficiency in dax_writeback_mapping_range()
autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL
mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings
mm, thp: remove cond_resched from __collapse_huge_page_copy
When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included. This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.
For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).
The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely. Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).
[akpm@linux-foundation.org: additional commenting from Kees]
Fixes: b6a2fea393 ("mm: variable length argument support")
Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Another deadlock path caused by recursive locking is reported. This
kind of issue was introduced since commit 743b5f1434 ("ocfs2: take
inode lock in ocfs2_iop_set/get_acl()"). Two deadlock paths have been
fixed by commit b891fa5024 ("ocfs2: fix deadlock issue when taking
inode lock at vfs entry points"). Yes, we intend to fix this kind of
case in incremental way, because it's hard to find out all possible
paths at once.
This one can be reproduced like this. On node1, cp a large file from
home directory to ocfs2 mountpoint. While on node2, run
setfacl/getfacl. Both nodes will hang up there. The backtraces:
On node1:
__ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
ocfs2_write_begin+0x43/0x1a0 [ocfs2]
generic_perform_write+0xa9/0x180
__generic_file_write_iter+0x1aa/0x1d0
ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
__vfs_write+0xc3/0x130
vfs_write+0xb1/0x1a0
SyS_write+0x46/0xa0
On node2:
__ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
ocfs2_set_acl+0x22d/0x260 [ocfs2]
ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
set_posix_acl+0x75/0xb0
posix_acl_xattr_set+0x49/0xa0
__vfs_setxattr+0x69/0x80
__vfs_setxattr_noperm+0x72/0x1a0
vfs_setxattr+0xa7/0xb0
setxattr+0x12d/0x190
path_setxattr+0x9f/0xb0
SyS_setxattr+0x14/0x20
Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
exported by commit 439a36b8ef ("ocfs2/dlmglue: prepare tracking logic
to avoid recursive cluster lock").
Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
Fixes: 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Eric Ren <zren@suse.com>
Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull xfs fixes from Darrick Wong:
"I have one more bugfix for you for 4.12-rc7 to fix a disk corruption
problem:
- don't allow swapon on files on the realtime device, because the
swap code will swap pages out to blocks on the data device, thereby
corrupting the filesystem"
* tag 'xfs-4.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't allow bmap on rt files
The main loop in __discard_prealloc is protected by the reiserfs write lock
which is dropped across schedules like the BKL it replaced. The problem is
that it checks the value, calls a routine that schedules, and then adjusts
the state. As a result, two threads that are calling
reiserfs_prealloc_discard at the same time can race when one calls
reiserfs_free_prealloc_block, the lock is dropped, and the other calls
reiserfs_free_prealloc_block with the same block number. In the right
circumstances, it can cause the prealloc count to go negative.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Most extended attributes will fit in a single block. More importantly,
we drop the reference to the inode while holding the transaction open
so the preallocated blocks aren't released. As a result, the inode
may be evicted before it's removed from the transaction's prealloc list
which can cause memory corruption.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
kstrtoull returns 0 on success, however, in reserved_clusters_store we
will return -EINVAL if kstrtoull returns 0, it makes us fail to update
reserved_clusters value through sysfs.
Fixes: 76d33bca55
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For 1k-block filesystems, the filesystem starts at block 1, not block 0.
This fact is recorded in s_first_data_block, so use that to bump up the
start_fsb before we start querying the filesystem for its space map.
Without this, ext4/026 fails on 1k block ext4 because various functions
(notably ext4_get_group_no_and_offset) don't know what to do with an
fsblock that is "before" the start of the filesystem and return garbage
results (blockgroup 2^32-1, etc.) that confuse fsmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously a bad directory block with a bad checksum is skipped; we
should be returning EFSBADCRC (aka EBADMSG).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously, a read error would be ignored and we would eventually return
NULL from ext4_find_entry, which signals "no such file or directory". We
should be returning EIO.
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Currently it's possible to encrypt all files and directories on an ext4
filesystem by deleting everything, including lost+found, then setting an
encryption policy on the root directory. However, this is incompatible
with e2fsck because e2fsck expects to find, create, and/or write to
lost+found and does not have access to any encryption keys. Especially
problematic is that if e2fsck can't find lost+found, it will create it
without regard for whether the root directory is encrypted. This is
wrong for obvious reasons, and it causes a later run of e2fsck to
consider the lost+found directory entry to be corrupted.
Encrypting the root directory may also be of limited use because it is
the "all-or-nothing" use case, for which dm-crypt can be used instead.
(By design, encryption policies are inherited and cannot be overridden;
so the root directory having an encryption policy implies that all files
and directories on the filesystem have that same encryption policy.)
In any case, encrypting the root directory is broken currently and must
not be allowed; so start returning an error if userspace requests it.
For now only do this in ext4, because f2fs and ubifs do not appear to
have the lost+found requirement. We could move it into
fscrypt_ioctl_set_policy() later if desired, though.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Now, when we mount ext4 filesystem with '-o discard' option, we have to
issue all the discard commands for the blocks to be deallocated and
wait for the completion of the commands on the commit complete phase.
Because this procedure might involve a lot of sequential combinations of
issuing discard commands and waiting for that, the delay of this
procedure might be too much long, even to 17.0s in our test,
and it results in long commit delay and fsync() performance degradation.
To reduce this kind of delay, instead of adding callback for each
extent and handling all of them in a sequential manner on commit phase,
we instead add a separate list of extents to free to the superblock and
then process this list at once after transaction commits so that
we can issue all the discard commands in a parallel manner like XFS
filesystem.
Finally, we could enhance the discard command handling performance.
The result was such that 17.0s delay of a single commit in the worst
case has been enhanced to 4.8s.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Kitae Lee <kitae87.lee@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
These days inode reclaim calls evict_inode() only when it has no pages
in the mapping. In that case it is not necessary to wait for transaction
commit in ext4_evict_inode() as there can be no pages waiting to be
committed. So avoid unnecessary transaction waiting in that case.
We still have to keep the check for the case where ext4_evict_inode()
gets called from other paths (e.g. umount) where inode still can have
some page cache pages.
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull cifs fixes from Steve French:
"Various small fixes for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix some return values in case of error in 'crypt_message'
cifs: remove redundant return in cifs_creation_time_get
CIFS: Improve readdir verbosity
CIFS: check if pages is null rather than bv for a failed allocation
CIFS: Set ->should_dirty in cifs_user_readv()
The main purpose of mb cache is to achieve deduplication in
extended attributes. In use cases where opportunity for deduplication
is unlikely, it only adds overhead.
Add a mount option to explicitly turn off mb cache.
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>