Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing and
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.
After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.
This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>
file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":
BUG: spinlock bad magic on CPU#0, ls/486
lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
Call Trace:
[<ffffffff81327533>] dump_stack+0x65/0x92
[<ffffffff810baf75>] spin_dump+0x85/0xe0
[<ffffffff810baff6>] spin_bug+0x26/0x30
[<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
[<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
[<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
[<ffffffff81181cfd>] __fput+0x5d/0x160
[<ffffffff81181e3e>] ____fput+0xe/0x10
[<ffffffff8109410e>] task_work_run+0x7e/0xa0
[<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
[<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
[<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9
Fixes: 3afca265b5 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.
Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked. While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.
Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.
Reported-by: Mingbo Wan <mingbo@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull btrfs fixes from Chris Mason:
"These are all over the place.
The tracepoint part of the pull fixes a crash and adds a little more
information to two tracepoints, while the rest are good old fashioned
fixes"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: make tracepoint format strings more compact
Btrfs: add truncated_len for ordered extent tracepoints
Btrfs: add 'inode' for extent map tracepoint
btrfs: fix crash when tracepoint arguments are freed by wq callbacks
Btrfs: adjust outstanding_extents counter properly when dio write is split
Btrfs: fix lockdep warning about log_mutex
Btrfs: use down_read_nested to make lockdep silent
btrfs: fix locking when we put back a delayed ref that's too new
btrfs: fix error handling when run_delayed_extent_op fails
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
Pull ceph fixes from Ilya Dryomov:
"Two small fixups for the filesystem changes that went into this merge
window"
* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
ceph: fix get_oldest_context()
ceph: fix mds cluster availability check
If the server reboots multiple times, the client should rely on the
server to tell it that it cannot reclaim state as per section 9.6.3.4
in RFC7530 and section 8.4.2.1 in RFC5661.
Currently, the client is being to conservative, and is assuming that
if the server reboots while state recovery is in progress, then it must
ignore state that was not recovered before the reboot.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit bcb6f6d2b9 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.
Fixes: bcb6f6d2b9 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.9
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.2+
Oops--in 916d2d844a I moved some constants into an array for
convenience, but here I'm accidentally writing to that array.
The effect is that if you ever encounter a filesystem lacking support
for ACLs or security labels, then all queries of supported attributes
will report that attribute as unsupported from then on.
Fixes: 916d2d844a "nfsd: clean up supported attribute handling"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If a file is renamed, but stays in the same directory, we will still receive
2 change_info4 structures describing the change to that directory, but we
only want to apply it once.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We don't want to invalidate the directory attribute and data cache unless we
know that a file was created, or the change attribute differs from the one
in our cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull xfs fixes from Darrick Wong:
"As promised last week, here's some stability fixes from Christoph and
Jan Kara:
- fix free space request handling when low on disk space
- remove redundant log failure error messages
- free truncated dirty pages instead of letting them build up
forever"
* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Timely free truncated dirty pages
xfs: don't print warnings when xfs_log_force fails
xfs: don't rely on ->total in xfs_alloc_space_available
xfs: adjust allocation length in xfs_alloc_space_available
xfs: fix bogus minleft manipulations
xfs: bump up reserved blocks in xfs_alloc_set_aside
For no snapshot case, we should use ci->truncate_{seq,size}.
Fixes: 5f743e4566 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
I have reports of a crash that look like __fput() was called twice for
a NFSv4.0 file. It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.
Since 83bfff23e9 ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Collapsed the two patches, they were nonsensically split and broke
bisection.
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 99579ccec4 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.
CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?
The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.
The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pmd_t of a DAX PMD entry during an *sync operation. This can result
in data loss in the following sequence:
1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
pmd_t dirty and writeable
2) fsync, flushing out PMD data and cleaning the radix tree entry. We
currently fail to mark the pmd_t as clean and write protected.
3) more mmap writes to the PMD. These don't cause any page faults since
the pmd_t is dirty and writeable. The radix tree entry remains clean.
4) fsync, which fails to flush the dirty PMD data because the radix tree
entry was clean.
5) crash - dirty data that should have been fsync'd as part of 4) could
still have been in the processor cache, and is lost.
Fix this by marking the pmd_t clean and write protected in
dax_mapping_entry_mkclean(), which is called as part of the fsync
operation 2). This will cause the writes in step 3) above to generate
page faults where we'll re-dirty the PMD radix tree entry, resulting in
flushes in the fsync that happens in step 4).
Fixes: 4b4bb46d00 ("dax: clear dirty entry tags on cache flush")
Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code currently uses sdio->blkbits to compute the number of blocks to
be cleaned. However sdio->blkbits is derived from the logical block size
of the underlying block device (Refer to the definition of
do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
fail when executed on an ext4 filesystem with 64k as the block size and
when using a virtio based disk (having 512 byte as the logical block
size) inside a kvm guest.
This commit fixes the bug by using inode->i_blkbits to compute the
number of blocks to be cleaned.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once,
to avoid issues with block size changes with IO in flight.
Signed-off-by: Jens Axboe <axboe@fb.com>
We were checking block number without checking partition.
sbi->s_partmaps[iloc->partitionReferenceNum] could lead to
bad memory access. See udf_nfs_get_inode() path for instance.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Move all module attributes at the end of one file like other FS.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_update_extent_cache() is only called from inode_bmap()
with 1 for next_epos
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
loc & 0x02 is empty since first git version in 2005 in
udf_add_extendedattr()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Having struct kernel_long_ad laarr[EXTENT_MERGE_SIZE]
in all function arguments could be understood as by-value parameter.
Use kernel_long_ad pointer for functions depending on
inode_getblk()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
This change was missed the tmpfs modification in In CVE-2016-7097
commit 073931017b ("posix_acl: Clear SGID bit when setting
file permissions")
It can test by xfstest generic/375, which failed to clear
setgid bit in the following test case on tmpfs:
touch $testfile
chown 100:100 $testfile
chmod 2755 $testfile
_runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile
Signed-off-by: Gu Zheng <guzheng1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all but one case, the last two arguments are NULL and 0 resp.;
almost everyone just wants to switch nameidata to non-RCU mode.
The only exception is lookup_fast(), where we have a child dentry
we want to legitimize as well. Split these two cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add MS_KERNMOUNT to the flags that are passed.
Use sget_userns and force &init_user_ns instead of calling sget so that
even if called from a weird context the internal filesystem will be
considered to be in the intial user namespace.
Luis Ressel reported that the the failure to pass MS_KERNMOUNT into
mount_pseudo broke his in development graphics driver that uses the
generic drm infrastructure. I am not certain the deriver was bug
free in it's usage of that infrastructure but since
mount_pseudo_xattr can never be triggered by userspace it is clearer
and less error prone, and less problematic for the code to be explicit.
Reported-by: Luis Ressel <aranea@aixah.de>
Tested-by: Luis Ressel <aranea@aixah.de>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.
Kristen Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.
Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.
I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.
d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.
Cc: stable@vger.kernel.org
Fixes: ce07d891a0 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers. It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].
But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions. Use the maximum of args->total and the
calculated block requirement to make a decision. We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.
[1] which creates lots of confusion as most callers don't decrement it
once doing a first allocation. But that's for a separate series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>