The only place where we feed to __d_rehash() something other than
d_hash(dentry->d_name.hash) is __d_move(), where we give it d_hash
of another dentry. Postpone rehashing until we'd switched the
names and we are rid of that exception, along with the need to
keep _d_rehash() and __d_rehash() separate.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Probe-mounting a volume too small for UBIFS results in kernel log
polution which might irritate users.
Address this by silencing errors which may happen during boot if the
rootfs is e.g. squashfs (and thus rather small) stored on a UBI volume.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
Pull fuse updates from Miklos Szeredi:
"This fixes error propagation from writeback to fsync/close for
writeback cache mode as well as adding a missing capability flag to
the INIT message. The rest are cleanups.
(The commits are recent but all the code actually sat in -next for a
while now. The recommits are due to conflict avoidance and the
addition of Cc: stable@...)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use filemap_check_errors()
mm: export filemap_check_errors() to modules
fuse: fix wrong assignment of ->flags in fuse_send_init()
fuse: fuse_flush must check mapping->flags for errors
fuse: fsync() did not return IO errors
fuse: don't mess with blocking signals
new helper: wait_event_killable_exclusive()
fuse: improve aio directIO write performance for size extending writes
This reverts commit 3c9fe8cdff.
As Miklos points out in commit c1b2cc1a76, the "lookup_hash()" helper
is now unused, and in fact, with the hash salting changes, since the
hash of a dentry name now depends on the directory dentry it is in, the
helper function isn't even really likely to be useful.
So rather than keep it around in case somebody else might end up finding
a use for it, let's just remove the helper and not trick people into
thinking it might be a useful thing.
For example, I had obviously completely missed how the helper didn't
follow the normal dentry hashing patterns, and how the hash salting
patch broke overlayfs. Things would quietly build and look sane, but
not work.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull overlayfs update from Miklos Szeredi:
"First of all, this fixes a regression in overlayfs introduced by the
dentry hash salting. I've moved the patch fixing this to the front of
the queue, so if (god forbid) something needs to be bisected in
overlayfs this regression won't interfere with that.
The biggest part is preparation for selinux support, done by Vivek
Goyal. Essentially this makes all operations on underlying
filesystems be done with credentials of mounter. This makes
everything nicely consistent.
There are also fixes for a number of known and recently discovered
non-standard behavior (thanks to Eryu Guan for testing and improving
the test suites)"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (23 commits)
ovl: simplify empty checking
qstr: constify instances in overlayfs
ovl: clear nlink on rmdir
ovl: disallow overlayfs as upperdir
ovl: fix warning
ovl: remove duplicated include from super.c
ovl: append MAY_READ when diluting write checks
ovl: dilute permission checks on lower only if not special file
ovl: fix POSIX ACL setting
ovl: share inode for hard link
ovl: store real inode pointer in ->i_private
ovl: permission: return ECHILD instead of ENOENT
ovl: update atime on upper
ovl: fix sgid on directory
ovl: simplify permission checking
ovl: do not require mounter to have MAY_WRITE on lower
ovl: do operations on underlying file system in mounter's context
ovl: modify ovl_permission() to do checks on two inodes
ovl: define ->get_acl() for overlay inodes
ovl: move some common code in a function
...
Pull freevxfs updates from Christoph Hellwig:
"Support for foreign endianess and HP-UP superblocks from
Krzysztof Błaszkowski"
* tag 'freevxfs-for-4.8' of git://git.infradead.org/users/hch/freevxfs:
freevxfs: update Kconfig information
freevxfs: refactor readdir and lookup code
freevxfs: fix lack of inode initialization
freevxfs: fix memory leak in vxfs_read_fshead()
freevxfs: update documentation and cresdits for HP-UX support
freevxfs: implement ->alloc_inode and ->destroy_inode
freevxfs: avoid the need for forward declaring the super operations
freevxfs: move VFS inode allocation into vxfs_blkiget and vxfs_stiget
freevxfs: remove vxfs_put_fake_inode
freevxfs: handle big endian HP-UX file systems
Pull configfs update from Christoph Hellwig:
"A simple error handling fix from Tal Shorer"
* tag 'configfs-for-4.8' of git://git.infradead.org/users/hch/configfs:
configfs: don't set buffer_needs_fill to zero if show() returns error
Pull CIFS/SMB3 fixes from Steve French:
"Various CIFS/SMB3 fixes, most for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix a possible invalid memory access in smb2_query_symlink()
fs/cifs: make share unaccessible at root level mountable
cifs: fix crash due to race in hmac(md5) handling
cifs: unbreak TCP session reuse
cifs: Check for existing directory when opening file with O_CREAT
Add MF-Symlinks support for SMB 2.0
Fix sparse warnings from the use of "struct xattr_handler"
structures that are not exported by making them static. Fixes
the following sparse warnings:
/fs/ubifs/xattr.c:595:28: warning: symbol 'ubifs_user_xattr_handler' was not declared. Should it be static?
/fs/ubifs/xattr.c:601:28: warning: symbol 'ubifs_trusted_xattr_handler' was not declared. Should it be static?
/fs/ubifs/xattr.c:607:28: warning: symbol 'ubifs_security_xattr_handler' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
This change completes commit
90bea5a3f0 ("UBIFS: respect MS_SILENT mount flag")
which already implements support for MS_SILENT except for that one
error message which is still being displayed despite MS_SILENT being
set. Suppress that error message as well in case MS_SILENT is set.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
[rw: massaged commit message]
Signed-off-by: Richard Weinberger <richard@nod.at>
fuse_flush() calls write_inode_now() that triggers writeback, but actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
Due to implementation of fuse writeback filemap_write_and_wait_range() does
not catch errors. We have to do this directly after fuse_sync_writes()
Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:
ovl_check_empty_and_clear() checked for being upper
ovl_remove_and_whiteout() checked for merge OR lower
Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This does not work and does not make sense. So instead of fixing it
(probably not hard) just disallow.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.
Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.
[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Setting POSIX ACL needs special handling:
1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.
2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.
In the mean time switch to using generic_setxattr.
[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.
This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.
This patch attempts to fix this by sharing inodes for hard links.
Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").
The inode is only inserted in the hash if it is non-directoy and upper.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.
Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.
Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix atime update logic in overlayfs.
This patch adds an i_op->update_time() handler to overlayfs inodes. This
forwards atime updates to the upper layer only. No atime updates are done
on lower layers.
Remove implicit atime updates to underlying files and directories with
O_NOATIME. Remove explicit atime update in ovl_readlink().
Clear atime related mnt flags from cloned upper mount. This means atime
updates are controlled purely by overlayfs mount options.
Reported-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.
Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().
1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.
2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.
Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.
This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.
It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.
So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.
Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Previously this was only done for directory inodes. Doing so for all
inodes makes for a nice cleanup in ovl_permission at zero cost.
Inodes are not shared for hard links on the overlay, so this works fine.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.
Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).
Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8387ff2577 ("vfs: make the string hashes salt the hash")
Pull tracing updates from Steven Rostedt:
"This is mostly clean ups and small fixes. Some of the more visible
changes are:
- The function pid code uses the event pid filtering logic
- [ku]probe events have access to current->comm
- trace_printk now has sample code
- PCI devices now trace physical addresses
- stack tracing has less unnessary functions traced"
* tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
printk, tracing: Avoiding unneeded blank lines
tracing: Use __get_str() when manipulating strings
tracing, RAS: Cleanup on __get_str() usage
tracing: Use outer () on __get_str() definition
ftrace: Reduce size of function graph entries
tracing: Have HIST_TRIGGERS select TRACING
tracing: Using for_each_set_bit() to simplify trace_pid_write()
ftrace: Move toplevel init out of ftrace_init_tracefs()
tracing/function_graph: Fix filters for function_graph threshold
tracing: Skip more functions when doing stack tracing of events
tracing: Expose CPU physical addresses (resource values) for PCI devices
tracing: Show the preempt count of when the event was called
tracing: Add trace_printk sample code
tracing: Choose static tp_printk buffer by explicit nesting count
tracing: expose current->comm to [ku]probe events
ftrace: Have set_ftrace_pid use the bitmap like events do
tracing: Move pid_list write processing into its own function
tracing: Move the pid_list seq_file functions to be global
tracing: Move filtered_pid helper functions into trace.c
tracing: Make the pid filtering helper functions global
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
Merge more updates from Andrew Morton:
"The rest of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
mm, compaction: simplify contended compaction handling
mm, compaction: introduce direct compaction priority
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
mm, page_alloc: make THP-specific decisions more generic
mm, page_alloc: restructure direct compaction handling in slowpath
mm, page_alloc: don't retry initial attempt in slowpath
mm, page_alloc: set alloc_flags only once in slowpath
lib/stackdepot.c: use __GFP_NOWARN for stack allocations
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
mm, kasan: account for object redzone in SLUB's nearest_obj()
mm: fix use-after-free if memory allocation failed in vma_adjust()
zsmalloc: Delete an unnecessary check before the function call "iput"
mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
mm: optimize copy_page_to/from_iter_iovec
mm: add cond_resched() to generic_swapfile_activate()
Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
mm: hwpoison: remove incorrect comments
make __section_nr() more efficient
...
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer. In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible. OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.
It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers. But this will not work because
some programs are doing
vfork()
set_oom_adj()
exec()
We can achieve the same though. oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork. As a result all the processes will share the same
oom_score_adj. The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.
Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>