The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If the reply to a successful CLOSE call races with an OPEN to the same
file, we can end up scribbling over the stateid that represents the
new open state.
The race looks like:
Client Server
====== ======
CLOSE stateid A on file "foo"
CLOSE stateid A, return stateid C
OPEN file "foo"
OPEN "foo", return stateid B
Receive reply to OPEN
Reset open state for "foo"
Associate stateid B to "foo"
Receive CLOSE for A
Reset open state for "foo"
Replace stateid B with C
The fix is to examine the argument of the CLOSE, and check for a match
with the current stateid "other" field. If the two do not match, then
the above race occurred, and we should just ignore the CLOSE.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Centralize the checks for inodes_per_block and be more strict to make
sure the inodes_per_block_group can't end up being zero.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
Fix a large number of problems with how we handle mount options in the
superblock. For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters. It's unlikely
this will cause a security problem, but it could result in an invalid
parse. Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock. (Fortunately it only happens on file systems with a
1k block size.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs fixes from Al Viro:
"A couple of regression fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix iov_iter_advance() for ITER_PIPE
xattr: Fix setting security xattrs on sockfs
Pull orangefs fix from Mike Marshall:
"orangefs: add .owner to debugfs file_operations
Without ".owner = THIS_MODULE" it is possible to crash the kernel by
unloading the Orangefs module while someone is reading debugfs files"
* tag 'for-linus-4.9-rc5-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: add .owner to debugfs file_operations
Similar to the simple fast path, but we now need a dio structure to
track multiple-bio completions. It's basically a cut-down version
of the new iomap-based direct I/O code for filesystems, but without
all the logic to call into the filesystem for extent lookup or
allocation, and without the complex I/O completion workqueue handler
for AIO - instead we just use the FUA bit on the bios to ensure
data is flushed to stable storage.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch adds a small and simple fast patch for small direct I/O
requests on block devices that don't use AIO. Between the neat
bio_iov_iter_get_pages helper that avoids allocating a page array
for get_user_pages and the on-stack bio and biovec this avoid memory
allocations and atomic operations entirely in the direct I/O code
(lower levels might still do memory allocations and will usually
have at least some atomic operations, though).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr. Since commit 6c6ef9f2, this flag is checked for
setxattr support as well. This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity. The
smack security module relies on socket labels (xattrs).
Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.
We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity. A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fuse fixes from Miklos Szeredi:
"A regression fix and bug fix bound for stable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix fuse_write_end() if zero bytes were copied
fuse: fix root dentry initialization
Without ".owner = THIS_MODULE" it is possible to crash the kernel
by unloading the Orangefs module while someone is reading debugfs
files.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Some embedded systems have no use for them. This removes about
25KB from the kernel binary size when configured out.
Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.
The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This adds a check for a NULL platform data, which should only be possible
if a driver incorrectly sets up a probe request without also having defined
the platform_data structure. This is based on a patch from Geliang Tang.
Signed-off-by: Kees Cook <keescook@chromium.org>
Maybe I'm missing something, but I don't know why it needs to copy the
input buffer to psinfo->buf and then write. Instead we can write the
input buffer directly. The only implementation that supports console
message (i.e. ramoops) already does it for ftrace messages.
For the upcoming virtio backend driver, it needs to protect psinfo->buf
overwritten from console messages. If it could use ->write_buf method
instead of ->write, the problem will be solved easily.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
When update_ms is set, pstore_get_records() will be called when there's
a new entry. But unlink can be called at the same time and might
contend with the open-read-close loop. Depending on the implementation
of platform driver, it may be safe or not. But I think it'd be better
to protect those race in the first place.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently, pstore doesn't have any filters setup for function tracing.
This has the associated overhead and may not be useful for users looking
for tracing specific set of functions.
ftrace's regular function trace filtering is done writing to
tracing/set_ftrace_filter however this is not available if not requested.
In order to be able to use this feature, the support to request global
filtering introduced earlier in the series should be requested before
registering the ftrace ops. Here we do the same.
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since "przs" (persistent ram zones) is a general name in the code now, so
rename the Oops-dump zones to dprzs from przs.
Based on a patch from Nobuhiro Iwamatsu.
Signed-off-by: Kees Cook <keescook@chromium.org>
When setting ramoops record sizes, sometimes it's not clear which
parameters contributed to the allocation failure. This adds a per-zone
name and expands the failure reports.
Signed-off-by: Kees Cook <keescook@chromium.org>
Up until this patch, each of the per CPU ftrace buffers appear as a
separate ftrace-ramoops-N file. In this patch we merge all the zones into
one and populate a single ftrace-ramoops-0 file.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: clarified variables names, added -ENOMEM handling]
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for merging the per CPU buffers into one buffer when
we retrieve the pstore ftrace data, we store the timestamp as a
counter in the ftrace pstore record. We store the CPU number as well
if !PSTORE_CPU_IN_IP, in this case we shift the counter and may lose
ordering there but we preserve the same record size. The timestamp counter
is also racy, and not doing any locking or synchronization here results
in the benefit of lower overhead. Since we don't care much here for exact
ordering of function traces across CPUs, we don't synchronize and may lose
some counter updates but I'm ok with that.
Using trace_clock() results in much lower performance so avoid using it
since we don't want accuracy in timestamp and need a rough ordering to
perform merge.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: updated commit message, added comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
If the RAMOOPS_FLAG_FTRACE_PER_CPU flag is passed to ramoops pdata, split
the ftrace space into multiple zones depending on the number of CPUs.
This speeds up the performance of function tracing by about 280% in my
tests as we avoid the locking. The trade off being lesser space available
per CPU. Let the ramoops user decide which option they want based on pdata
flag.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: added max_ftrace_cnt to track size, added DT logic and docs]
Signed-off-by: Kees Cook <keescook@chromium.org>
Currently ramoops_init_przs() is hard wired only for panic dump zone
array. In preparation for the ftrace zone array (one zone per-cpu) and pmsg
zone array, make the function more generic to be able to handle this case.
Heavily based on similar work from Joel Fernandes.
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation of not locking at all for certain buffers depending on if
there's contention, make locking optional depending on the initialization
of the prz.
Signed-off-by: Joel Fernandes <joelaf@google.com>
[kees: moved locking flag into prz instead of via caller arguments]
Signed-off-by: Kees Cook <keescook@chromium.org>
If pos is at the beginning of a page and copied is zero then page is not
zeroed but is marked uptodate.
Fix by skipping everything except unlock/put of page if zero bytes were
copied.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 6b12c1b37e ("fuse: Implement write_begin/write_end callbacks")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Runs of xfstest ext4/022 on nojournal file systems result in failures
because the inodes of some of its test files do not expand as expected.
The cause is a conditional in ext4_mark_inode_dirty() that prevents inode
expansion unless the test file system has a journal. Remove this
unnecessary restriction.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.
current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().
Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.
Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.
Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
Pass the file mode of the proc inode to be created to
proc_pid_make_inode. In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode. This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
reply_cache_stats_operations, of type struct file_operations, is never
modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No real change in functionality, but the old interface seems to be
deprecated.
We don't actually care about ordering necessarily, but we do depend on
running at most one work item at a time: nfsd4_process_cb_update()
assumes that no other thread is running it, and that no new callbacks
are starting while it's running.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:
BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
Call Trace:
[<ffffffff81318c89>] dump_stack+0x67/0x9e
[<ffffffff810810b0>] ___might_sleep+0xf0/0x140
[<ffffffff81081153>] __might_sleep+0x53/0xb0
[<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
[<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
[<ffffffff81081153>] ? __might_sleep+0x53/0xb0
[<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320
Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.
This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.
Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Return errors to the caller instead of declaring the file system
corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
This allows us to properly propagate errors back up to
ext4_truncate()'s callers. This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename. Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some filesystems, such as UBIFS, maintain a const pointer for struct
inode.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Not all filesystems work on full pages, thus we should allow them to
hand partial pages to fscrypt for en/decryption.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>