This is motivated by orangefs_inode_old_getattr's habit of writing over
live inodes.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Use the result of a local read to determine when to set the eof flag. This
allows us to return the location of the end of the file atomically at the
time of the read.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[bfields: add some documentation]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
API:
- Fix kzalloc error path crash in ecryptfs added by skcipher
conversion. Note the subject of the commit is screwed up and the
correct subject is actually in the body.
Drivers:
- A number of fixes to the marvell cesa hashing code.
- Remove bogus nested irqsave that clobbers the saved flags in ccp"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: marvell/cesa - forward devm_ioremap_resource() error code
crypto: marvell/cesa - initialize hash states
crypto: marvell/cesa - fix memory leak
crypto: ccp - fix lock acquisition code
eCryptfs: Use skcipher and shash
Merge third patch-bomb from Andrew Morton:
- more ocfs2 changes
- a few hotfixes
- Andy's compat cleanups
- misc fixes to fatfs, ptrace, coredump, cpumask, creds, eventfd,
panic, ipmi, kgdb, profile, kfifo, ubsan, etc.
- many rapidio updates: fixes, new drivers.
- kcov: kernel code coverage feature. Like gcov, but not
"prohibitively expensive".
- extable code consolidation for various archs
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
ia64/extable: use generic search and sort routines
x86/extable: use generic search and sort routines
s390/extable: use generic search and sort routines
alpha/extable: use generic search and sort routines
kernel/...: convert pr_warning to pr_warn
drivers: dma-coherent: use memset_io for DMA_MEMORY_IO mappings
drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MAP
memremap: add MEMREMAP_WC flag
memremap: don't modify flags
kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE
mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
ipc/sem: make semctl setting sempid consistent
ubsan: fix tree-wide -Wmaybe-uninitialized false positives
kfifo: fix sparse complaints
scripts/gdb: account for changes in module data structure
scripts/gdb: add cmdline reader command
scripts/gdb: add version command
kernel: add kcov code coverage
profile: hide unused functions when !CONFIG_PROC_FS
hpwdt: use nmi_panic() when kernel panics in NMI handler
...
Since commit e22553e2a2 ("eventfd: don't take the spinlock in
eventfd_poll", 2015-02-17), eventfd is reading ctx->count outside
ctx->wqh.lock.
However, things aren't as simple as the read barrier in eventfd_poll
would suggest. In fact, the read barrier, besides lacking a comment, is
not paired in any obvious manner with another read barrier, and it is
pointless because it is sitting between a write (deep in poll_wait) and
the read of ctx->count. The read barrier is acting just as a compiler
barrier, for which we can use READ_ONCE instead. This is what the code
change in this patch does.
The documentation change is just as important, however. The question,
posed by Andrea Arcangeli, is then why the thing is safe on
architectures where spin_unlock does not imply a store-load memory
barrier. The answer is that it's safe because writes of ctx->count use
the same lock as poll_wait, and hence an acquire barrier implicit in
poll_wait provides the necessary synchronization between eventfd_poll
and callers of wake_up_locked_poll. This is sort of mentioned in the
commit message with respect to eventfd_ctx_read ("eventfd_read is
similar, it will do a single decrement with the lock held") but it
applies to all other callers too. It's tricky enough that it should be
documented in the code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:
- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)
Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.
To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FAT has long supported its own default file name encoding config
setting, separate from CONFIG_NLS_DEFAULT.
However, if UTF-8 encoded file names are desired FAT character set
should not be set to utf8 since this would make file names case
sensitive even if case insensitive matching is requested. Instead,
"utf8" mount options should be provided to enable UTF-8 file names in
FAT file system.
Unfortunately, there was no possibility to set the default value of this
option so on UTF-8 system "utf8" mount option had to be added manually
to most FAT mounts.
This patch adds config option to set such default value.
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4 treats directory offsets differently for 32-bit and 64-bit callers.
Check the caller type using in_compat_syscall, not is_compat_task. This
changes behavior on SPARC slightly.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When there are errors in the ocfs2 filesystem, they are usually
accompanied by the inode number which caused the error. This inode
number would be the input to fixing the file. One of these options
could be considered:
A file in the sys filesytem which would accept inode numbers. This
could be used to communication back what has to be fixed or is fixed.
You could write:
$# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/check
or
$# echo "<inode>" > /sys/fs/ocfs2/devname/filecheck/fix
Compare with second version, I re-design filecheck sysfs interfaces,
there are three sysfs files (check, fix and set) under filecheck
directory (see above), sysfs will accept only one argument <inode>.
Second, I adjust some code in ocfs2_filecheck_repair_inode_block()
function according to upstream feedback, we cannot just add VALID_FL
flag back as a inode block fix, then we will not fix this field
corruption currently until having a complete solution. Compare with
first version, I use strncasecmp instead of double strncmp functions.
Second, update the source file contribution vendor.
This patch (of 4):
Export ocfs2_kset object from ocfs2_stackglue kernel module, then online
file check code will create the related sysfiles under ocfs2_kset
object. We're exporting this because it's built in ocfs2_stackglue.ko.
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Add support for multiple NFSv4.1 callbacks in flight
- Initial patchset for RPC multipath support
- Adapt RPC/RDMA to use the new completion queue API
Bugfixes and cleanups:
- nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
- Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
- Fix RPC/RDMA credit accounting
- Properly handle RDMA_ERROR replies
- xprtrdma: Do not wait if ib_post_send() fails
- xprtrdma: Segment head and tail XDR buffers on page boundaries
- xprtrdma cleanups for dprintk, physical_op_map and unused macros"
* tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits)
nfs/blocklayout: make sure making a aligned read request
nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
nfs: remove nfs_inode_dio_wait
nfs: remove nfs4_file_fsync
xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
xprtrdma: Use an anonymous union in struct rpcrdma_mw
xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
xprtrdma: Serialize credit accounting again
xprtrdma: Properly handle RDMA_ERROR replies
rpcrdma: Add RPCRDMA_HDRLEN_ERR
xprtrdma: Do not wait if ib_post_send() fails
xprtrdma: Segment head and tail XDR buffers on page boundaries
xprtrdma: Clean up dprintk format string containing a newline
xprtrdma: Clean up physical_op_map()
xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
SUNRPC: Allow addition of new transports to a struct rpc_clnt
NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
SUNRPC: Make NFS swap work with multipath
...
We aren't checking to see if the in-inode extended attribute is
corrupted before we try to expand the inode's extra isize fields.
This can lead to potential crashes caused by the BUG_ON() check in
ext4_xattr_shift_entries().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull overlayfs updates from Miklos Szeredi:
"Various fixes and tweaks"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: cleanup unused var in rename2
ovl: rename is_merge to is_lowest
ovl: fixed coding style warning
ovl: Ensure upper filesystem supports d_type
ovl: Warn on copy up if a process has a R/O fd open to the lower file
ovl: honor flag MS_SILENT at mount
ovl: verify upper dentry before unlink and rename
Pull fuse update from Miklos Szeredi:
"This contains direct I/O fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: return patrial success from fuse_direct_io()
fuse: Add reference counting for fuse_io_priv
fuse: do not use iocb after it may have been freed
You could add any multiple of 2^32/PNFS_SCSI_RANGE_SIZE to nr_iomaps and
still pass this check. You'd probably still fail the following kcalloc,
but best to be paranoid since this is from-the-wire data.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
transaction_kthread() is calling try_to_freeze(), but that's just an
expeinsive no-op given the fact that the thread is not marked freezable.
After removing this, disk-io.c is now independent on freezer API.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
cleaner_kthread() is not marked freezable, and therefore calling
try_to_freeze() in its context is a pointless no-op.
In addition to that, as has been clearly demonstrated by 80ad623edd
("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"), it's perfectly
valid / legal for cleaner_kthread() to stay scheduled out in an arbitrary
place during suspend (in that particular example that was waiting for
reading of extent pages), so there is no need to leave any traces of
freezer in this kthread.
Fixes: 80ad623edd ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Fixes: 6962491321 ("btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"We have a good sized cleanup of our internal read ahead code, and the
first series of commits from Chandan to enable PAGE_SIZE > sectorsize
Otherwise, it's a normal series of cleanups and fixes, with many
thanks to Dave Sterba for doing most of the patch wrangling this time"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits)
btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
btrfs: Fix misspellings in comments.
btrfs: Print Warning only if ENOSPC_DEBUG is enabled
btrfs: scrub: silence an uninitialized variable warning
btrfs: move btrfs_compression_type to compression.h
btrfs: rename btrfs_print_info to btrfs_print_mod_info
Btrfs: Show a warning message if one of objectid reaches its highest value
Documentation: btrfs: remove usage specific information
btrfs: use kbasename in btrfsic_mount
Btrfs: do not collect ordered extents when logging that inode exists
Btrfs: fix race when checking if we can skip fsync'ing an inode
Btrfs: fix listxattrs not listing all xattrs packed in the same item
Btrfs: fix deadlock between direct IO reads and buffered writes
Btrfs: fix extent_same allowing destination offset beyond i_size
Btrfs: fix file loss on log replay after renaming a file and fsync
Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
Btrfs: fix lockdep deadlock warning due to dev_replace
btrfs: drop unused argument in btrfs_ioctl_get_supported_features
btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
btrfs: change max_inline default to 2048
...
Pull UDF and quota updates from Jan Kara:
"This contains a rewrite of UDF handling of filename encoding to fix
remaining overflow issues from Andrew Gabbasov and quota changes to
support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix possible GPF due to uninitialised pointers
ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
quota: Fix possible races during quota loading
ocfs2: Implement get_next_id()
quota_v2: Implement get_next_id() for V2 quota format
quota: Add support for ->get_nextdqblk() for VFS quota
udf: Merge linux specific translation into CS0 conversion function
udf: Remove struct ustr as non-needed intermediate storage
udf: Use separate buffer for copying split names
udf: Adjust UDF_NAME_LEN to better reflect actual restrictions
udf: Join functions for UTF8 and NLS conversions
udf: Parameterize output length in udf_put_filename
quota: Allow Q_GETQUOTA for frozen filesystem
quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA
Pull xfs updates from Dave Chinner:
"There's quite a lot in this request, and there's some cross-over with
ext4, dax and quota code due to the nature of the changes being made.
As for the rest of the XFS changes, there are lots of little things
all over the place, which add up to a lot of changes in the end.
The major changes are that we've reduced the size of the struct
xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
>10%), the writepage code now only does a single set of mapping tree
lockups so uses less CPU, delayed allocation reservations won't
overrun under random write loads anymore, and we added compile time
verification for on-disk structure sizes so we find out when a commit
or platform/compiler change breaks the on disk structure as early as
possible.
Change summary:
- error propagation for direct IO failures fixes for both XFS and
ext4
- new quota interfaces and XFS implementation for iterating all the
quota IDs in the filesystem
- locking fixes for real-time device extent allocation
- reduction of duplicate information in the xfs and vfs inode, saving
roughly 100 bytes of memory per cached inode.
- buffer flag cleanup
- rework of the writepage code to use the generic write clustering
mechanisms
- several fixes for inode flag based DAX enablement
- rework of remount option parsing
- compile time verification of on-disk format structure sizes
- delayed allocation reservation overrun fixes
- lots of little error handling fixes
- small memory leak fixes
- enable xfsaild freezing again"
* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
xfs: always set rvalp in xfs_dir2_node_trim_free
xfs: ensure committed is initialized in xfs_trans_roll
xfs: borrow indirect blocks from freed extent when available
xfs: refactor delalloc indlen reservation split into helper
xfs: update freeblocks counter after extent deletion
xfs: debug mode forced buffered write failure
xfs: remove impossible condition
xfs: check sizes of XFS on-disk structures at compile time
xfs: ioends require logically contiguous file offsets
xfs: use named array initializers for log item dumping
xfs: fix computation of inode btree maxlevels
xfs: reinitialise per-AG structures if geometry changes during recovery
xfs: remove xfs_trans_get_block_res
xfs: fix up inode32/64 (re)mount handling
xfs: fix format specifier , should be %llx and not %llu
xfs: sanitize remount options
xfs: convert mount option parsing to tokens
xfs: fix two memory leaks in xfs_attr_list.c error paths
xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
...
Pull f2fs updates from Jaegeuk Kim:
"New Features:
- uplift filesystem encryption into fs/crypto/
- give sysfs entries to control memroy consumption
Enhancements:
- aio performance by preallocating blocks in ->write_iter
- use writepages lock for only WB_SYNC_ALL
- avoid redundant inline_data conversion
- enhance forground GC
- use wait_for_stable_page as possible
- speed up SEEK_DATA and fiiemap
Bug Fixes:
- corner case in terms of -ENOSPC for inline_data
- hung task caused by long latency in shrinker
- corruption between atomic write and f2fs_trace_pid
- avoid garbage lengths in dentries
- revoke atomicly written pages if an error occurs
In addition, there are various minor bug fixes and clean-ups"
* tag 'for-f2fs-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (81 commits)
f2fs: submit node page write bios when really required
f2fs: add missing argument to f2fs_setxattr stub
f2fs: fix to avoid unneeded unlock_new_inode
f2fs: clean up opened code with f2fs_update_dentry
f2fs: declare static functions
f2fs: use cryptoapi crc32 functions
f2fs: modify the readahead method in ra_node_page()
f2fs crypto: sync ext4_lookup and ext4_file_open
fs crypto: move per-file encryption from f2fs tree to fs/crypto
f2fs: mutex can't be used by down_write_nest_lock()
f2fs: recovery missing dot dentries in root directory
f2fs: fix to avoid deadlock when merging inline data
f2fs: introduce f2fs_flush_merged_bios for cleanup
f2fs: introduce f2fs_update_data_blkaddr for cleanup
f2fs crypto: fix incorrect positioning for GCing encrypted data page
f2fs: fix incorrect upper bound when iterating inode mapping tree
f2fs: avoid hungtask problem caused by losing wake_up
f2fs: trace old block address for CoWed page
f2fs: try to flush inode after merging inline data
f2fs: show more info about superblock recovery
...
Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.
After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"
* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path
Only treat write goes up to the inode size as aligned request,
because it always write PAGE_CACHE_SIZE, but read a dynamic size.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The 'is_merge' is an historical naming from when only a single lower layer
could exist. With the introduction of multiple lower layers the meaning of
this flag was changed to mean only the "lowest layer" (while all lower
layers were being merged).
So now 'is_merge' is inaccurate and hence renaming to 'is_lowest'
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In some instances xfs has been created with ftype=0 and there if a file
on lower fs is removed, overlay leaves a whiteout in upper fs but that
whiteout does not get filtered out and is visible to overlayfs users.
And reason it does not get filtered out because upper filesystem does
not report file type of whiteout as DT_CHR during iterate_dir().
So it seems to be a requirement that upper filesystem support d_type for
overlayfs to work properly. Do this check during mount and fail if d_type
is not supported.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Print a warning when overlayfs copies up a file if the process that
triggered the copy up has a R/O fd open to the lower file being copied up.
This can help catch applications that do things like the following:
fd1 = open("foo", O_RDONLY);
fd2 = open("foo", O_RDWR);
where they expect fd1 and fd2 to refer to the same file - which will no
longer be the case post-copy up.
With this patch, the following commands:
bash 5</mnt/a/foo128
6<>/mnt/a/foo128
assuming /mnt/a/foo128 to be an un-copied up file on an overlay will
produce the following warning in the kernel log:
overlayfs: Copying up foo129, but open R/O on fd 5 which will cease
to be coherent [pid=3818 bash]
This is enabled by setting:
/sys/module/overlay/parameters/check_copy_up
to 1.
The warnings are ratelimited and are also limited to one warning per file -
assuming the copy up completes in each case.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch hides error about missing lowerdir if MS_SILENT is set.
We use mount(NULL, "/", "overlay", MS_SILENT, NULL) for testing support of
overlayfs: syscall returns -ENODEV if it's not supported. Otherwise kernel
automatically loads module and returns -EINVAL because lowerdir is missing.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Unlink and rename in overlayfs checked the upper dentry for staleness by
verifying upper->d_parent against upperdir. However the dentry can go
stale also by being unhashed, for example.
Expand the verification to actually look up the name again (under parent
lock) and check if it matches the upper dentry. This matches what the VFS
does before passing the dentry to filesytem's unlink/rename methods, which
excludes any inconsistency caused by overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit c40a3d38af (Btrfs: Compute and look up csums based on
sectorsized blocks) changes around how we walk the bios while looking up
crcs. There's an inner loop that is jumping to the next bvec based on
sectors and before it derefs the next bvec, it needs to make sure we're
still in the bio.
In this case, the outer loop would have decided to stop moving forward
too, and the bvec deref is never actually used for anything. But
CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).
There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.
This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).
This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.
So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.
We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.
There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.
Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"
* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
UBIFS does not support POSIX ACLs, so there is no need for including any
POSIX ACL hesders.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The existing logging macros are fairly large and converting the
macros to functions make the object code smaller.
Use %pV and __builtin_return_address(0) as appropriate.
$ size fs/ubifs/built-in.o*
text data bss dec hex filename
575831 309688 161312 1046831 ff92f fs/ubifs/built-in.o.allyesconfig.new
622457 312872 161120 1096449 10bb01 fs/ubifs/built-in.o.allyesconfig.old
223785 640 644 225069 36f2d fs/ubifs/built-in.o.defconfig.new
251873 640 644 253157 3dce5 fs/ubifs/built-in.o.defconfig.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically. In a couple places, the wrong wb was used leading to
performing operations on the wrong list under the wrong lock
corrupting the io lists.
* writeback_single_inode() was taking @wb parameter and used it to
remove the inode from io lists if it becomes clean after writeback.
The callers of this function were always passing in the root wb
regardless of the actual wb that the inode was associated with,
which could also change while writeback is in progress.
Fix it by dropping the @wb parameter and using
inode_to_wb_and_lock_list() to determine and lock the associated wb.
* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
inode to remove it from or move it to the right io list. It assumes
that the inode is still associated with @wb; however, the inode may
have switched to another wb while writeback was in progress.
Fix it by using inode_to_wb_and_lock_list() to determine and lock
the associated wb after writeback is complete. As the function
requires the original @wb->list_lock locked for the next iteration,
in the unlikely case where the inode has changed association, switch
the locks.
Kudos to Tahsin for pinpointing these subtle breakages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: d10c809552 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
Reported-and-diagnosed-by: Tahsin Erdogan <tahsin@google.com>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Jens Axboe <axboe@fb.com>
locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb. To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held. The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.
This isn't true. If the inode is still associated with the wb, the
inode has reference and it's safe to return the wb; however, if inode
has been switched, the wb still needs to be unlocked which is a
dereference and can lead to use-after-free if it it races with wb
destruction.
Fix it by putting the reference after releasing list_lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 87e1d789bf ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Cc: stable@vger.kernel.org # v4.2+
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs updates from Al Viro:
- Preparations of parallel lookups (the remaining main obstacle is the
need to move security_d_instantiate(); once that becomes safe, the
rest will be a matter of rather short series local to fs/*.c
- preadv2/pwritev2 series from Christoph
- assorted fixes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
splice: handle zero nr_pages in splice_to_pipe()
vfs: show_vfsstat: do not ignore errors from show_devname method
dcache.c: new helper: __d_add()
don't bother with __d_instantiate(dentry, NULL)
untangle fsnotify_d_instantiate() a bit
uninline d_add()
replace d_add_unique() with saner primitive
quota: use lookup_one_len_unlocked()
cifs_get_root(): use lookup_one_len_unlocked()
nfs_lookup: don't bother with d_instantiate(dentry, NULL)
kill dentry_unhash()
ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
configfs: move d_rehash() into configfs_create() for regular files
ceph: don't bother with d_rehash() in splice_dentry()
namei: teach lookup_slow() to skip revalidate
namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
lookup_one_len_unlocked(): use lookup_dcache()
namei: simplify invalidation logics in lookup_dcache()
namei: change calling conventions for lookup_{fast,slow} and follow_managed()
...
Merge second patch-bomb from Andrew Morton:
- a couple of hotfixes
- the rest of MM
- a new timer slack control in procfs
- a couple of procfs fixes
- a few misc things
- some printk tweaks
- lib/ updates, notably to radix-tree.
- add my and Nick Piggin's old userspace radix-tree test harness to
tools/testing/radix-tree/. Matthew said it was a godsend during the
radix-tree work he did.
- a few code-size improvements, switching to __always_inline where gcc
screwed up.
- partially implement character sets in sscanf
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
sscanf: implement basic character sets
lib/bug.c: use common WARN helper
param: convert some "on"/"off" users to strtobool
lib: add "on"/"off" support to kstrtobool
lib: update single-char callers of strtobool()
lib: move strtobool() to kstrtobool()
include/linux/unaligned: force inlining of byteswap operations
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
usb: common: convert to use match_string() helper
ide: hpt366: convert to use match_string() helper
ata: hpt366: convert to use match_string() helper
power: ab8500: convert to use match_string() helper
power: charger_manager: convert to use match_string() helper
drm/edid: convert to use match_string() helper
pinctrl: convert to use match_string() helper
device property: convert to use match_string() helper
lib/string: introduce match_string() helper
radix-tree tests: add test for radix_tree_iter_next
radix-tree tests: add regression3 test
...
Pull core block updates from Jens Axboe:
"Here are the core block changes for this merge window. Not a lot of
exciting stuff going on in this round, most of the changes have been
on the driver side of things. That pull request is coming next. This
pull request contains:
- A set of fixes for chained bio handling from Christoph.
- A tag bounds check for blk-mq from Hannes, ensuring that we don't
do something stupid if a device reports an invalid tag value.
- A set of fixes/updates for the CFQ IO scheduler from Jan Kara.
- A set of blk-mq fixes from Keith, adding support for dynamic
hardware queues, and fixing init of max_dev_sectors for stacking
devices.
- A fix for the dynamic hw context from Ming.
- Enabling of cgroup writeback support on a block device, from
Shaohua"
* 'for-4.6/core' of git://git.kernel.dk/linux-block:
blk-mq: add bounds check on tag-to-rq conversion
block: bio_remaining_done() isn't unlikely
block: cleanup bio_endio
block: factor out chained bio completion
block: don't unecessarily clobber bi_error for chained bios
block-dev: enable writeback cgroup support
blk-mq: Fix NULL pointer updating nr_requests
blk-mq: mark request queue as mq asap
block: Initialize max_dev_sectors to 0
blk-mq: dynamic h/w context count
cfq-iosched: Allow parent cgroup to preempt its child
cfq-iosched: Allow sync noidle workloads to preempt each other
cfq-iosched: Reorder checks in cfq_should_preempt()
cfq-iosched: Don't group_idle if cfqq has big thinktime
Running the following command:
busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null
with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
kernel BUG at include/linux/mm.h:367!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89
(busybox's cat uses sendfile(2), unlike the coreutils version)
This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.
All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.
Cc: stable@vger.kernel.org
Signed-off-by: Rabin Vincent <rabin@rab.in>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>