Currently faults are protected against truncate by filesystem specific
i_mmap_sem and page lock in case of hole page. Cow faults are protected
DAX radix tree entry locking. So there's no need for i_mmap_lock in DAX
code. Remove it.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
When doing cow faults, we cannot directly fill in PTE as we do for other
faults as we rely on generic code to do proper accounting of the cowed page.
We also have no page to lock to protect against races with truncate as
other faults have and we need the protection to extend until the moment
generic code inserts cowed page into PTE thus at that point we have no
protection of fs-specific i_mmap_sem. So far we relied on using
i_mmap_lock for the protection however that is completely special to cow
faults. To make fault locking more uniform use DAX entry lock instead.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently DAX page fault locking is racy.
CPU0 (write fault) CPU1 (read fault)
__dax_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE)
get_block(inode, block, &bh, 1) -> allocates blocks
if (page) -> no
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE) {
} else {
dax_load_hole();
}
dax_insert_mapping()
And we are in a situation where we fail in dax_radix_entry() with -EIO.
Another problem with the current DAX page fault locking is that there is
no race-free way to clear dirty tag in the radix tree. We can always
end up with clean radix tree and dirty data in CPU cache.
We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings acting very similarly to page lock and thus
synchronizing properly faults against the same mapping index. The same
lock can later be used to avoid races when clearing radix tree dirty
tag.
Reviewed-by: NeilBrown <neilb@suse.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
We will use lowest available bit in the radix tree exceptional entry for
locking of the entry. Define it. Also clean up definitions of DAX entry
type bits in DAX exceptional entries to use defined constants instead of
hardcoding numbers and cleanup checking of these bits to not rely on how
other bits in the entry are set.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently the handling of huge pages for DAX is racy. For example the
following can happen:
CPU0 (THP write fault) CPU1 (normal read fault)
__dax_pmd_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh) && write)
get_block(inode, block, &bh, 1) -> allocates blocks
truncate_pagecache_range(inode, lstart, lend);
dax_load_hole();
This results in data corruption since process on CPU1 won't see changes
into the file done by CPU0.
The race can happen even if two normal faults race however with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So make THP support in DAX code depend on CONFIG_BROKEN
for now.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
returned buffer has BH_Uptodate set. However that doesn't get set for
any mapping buffer so that branch is actually a dead code. The
BH_Uptodate check doesn't make any sense so just remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
TCP stack can now run from process context.
Use read_lock_bh() variant to restore previous assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS for 4.7. Here's the summary of
the changes:
- ATH79: Support for DTB passuing using the UHI boot protocol
- ATH79: Remove support for builtin DTB.
- ATH79: Add zboot debug serial support.
- ATH79: Add initial support for Dragino MS14 (Dragine 2), Onion Omega
and DPT-Module.
- ATH79: Update devicetree clock support for AR9132 and AR9331.
- ATH79: Cleanup the DT code.
- ATH79: Support newer SOCs in ath79_ddr_ctrl_init.
- ATH79: Fix regression in PCI window initialization.
- BCM47xx: Move SPROM driver to drivers/firmware/
- BCM63xx: Enable partition parser in defconfig.
- BMIPS: BMIPS5000 has I cache filing from D cache
- BMIPS: BMIPS: Add cpu-feature-overrides.h
- BMIPS: Add Whirlwind support
- BMIPS: Adjust mips-hpt-frequency for BCM7435
- BMIPS: Remove maxcpus from BCM97435SVMB DTS
- BMIPS: Add missing 7038 L1 register cells to BCM7435
- BMIPS: Various tweaks to initialization code.
- BMIPS: Enable partition parser in defconfig.
- BMIPS: Cache tweaks.
- BMIPS: Add UART, I2C and SATA devices to DT.
- BMIPS: Add BCM6358 and BCM63268support
- BMIPS: Add device tree example for BCM6358.
- BMIPS: Improve Improve BCM6328 and BCM6368 device trees
- Lantiq: Add support for device tree file from boot loader
- Lantiq: Allow build with no built-in DT.
- Loongson 3: Reserve 32MB for RS780E integrated GPU.
- Loongson 3: Fix build error after ld-version.sh modification
- Loongson 3: Move chipset ACPI code from drivers to arch.
- Loongson 3: Speedup irq processing.
- Loongson 3: Add basic Loongson 3A support.
- Loongson 3: Set cache flush handlers to nop.
- Loongson 3: Invalidate special TLBs when needed.
- Loongson 3: Fast TLB refill handler.
- MT7620: Fallback strategy for invalid syscfg0.
- Netlogic: Fix CP0_EBASE redefinition warnings
- Octeon: Initialization fixes
- Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite
- Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig
- Octeon: Correctly handle endian-swapped initramfs images.
- Octeon: Support CN73xx, CN75xx and CN78xx.
- Octeon: Remove dead code from cvmx-sysinfo.
- Octeon: Extend number of supported CPUs past 32.
- Octeon: Remove some code limiting NR_IRQS to 255.
- Octeon: Simplify octeon_irq_ciu_gpio_set_type.
- Octeon: Mark some functions __init in smp.c
- Octeon: Octeon: Add Octeon III CN7xxx interface detection
- PIC32: Add serial driver and bindings for it.
- PIC32: Add PIC32 deadman timer driver and bindings.
- PIC32: Add PIC32 clock timer driver and bindings.
- Pistachio: Determine SoC revision during boot
- Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER.
- Sibyte: Strip redundant comments from bcm1480_regs.h.
- Panic immediately if panic_on_oops is set.
- module: fix incorrect IS_ERR_VALUE macro usage.
- module: Make consistent use of pr_*
- Remove no longer needed work_on_cpu() call.
- Remove CONFIG_IPV6_PRIVACY from defconfigs.
- Fix registers of non-crashing CPUs in dumps.
- Handle MIPSisms in new vmcore_elf32_check_arch.
- Select CONFIG_HANDLE_DOMAIN_IRQ and make it work.
- Allow RIXI to be used on non-R2 or R6 cores.
- Reserve nosave data for hibernation
- Fix siginfo.h to use strict POSIX types.
- Don't unwind user mode with EVA.
- Fix watchpoint restoration
- Ptrace watchpoints for R6.
- Sync icache when it fills from dcache
- I6400 I-cache fills from dcache.
- Various MSA fixes.
- Cleanup MIPS_CPU_* definitions.
- Signal: Move generic copy_siginfo to signal.h
- Signal: Fix uapi include in exported asm/siginfo.h
- Timer fixes for sake of KVM.
- XPA TLB refill fixes.
- Treat perf counter feature
- Update John Crispin's email address
- Add PIC32 watchdog and bindings.
- Handle R10000 LL/SC bug in set_pte()
- cpufreq: Various fixes for Longson1.
- R6: Fix R2 emulation.
- mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes
- ELF: ABI and FP fixes.
- Allow for relocatable kernel and use that to support KASLR.
- Fix CPC_BASE_ADDR mask
- Plenty fo smp-cps, CM, R6 and M6250 fixes.
- Make reset_control_ops const.
- Fix kernel command line handling of leading whitespace.
- Cleanups to cache handling.
- Add brcm, bcm6345-l1-intc device tree bindings.
- Use generic clkdev.h header
- Remove CLK_IS_ROOT usage.
- Misc small cleanups.
- CM: Fix compilation error when !MIPS_CM
- oprofile: Fix a preemption issue
- Detect DSP ASE v3 support:1"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits)
MIPS: pic32mzda: fix getting timer clock rate.
MIPS: ath79: fix regression in PCI window initialization
MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs
MIPS: Fix VZ probe gas errors with binutils <2.24
MIPS: perf: Fix I6400 event numbers
MIPS: DEC: Export `ioasic_ssr_lock' to modules
MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
MIPS: CM: Fix compilation error when !MIPS_CM
MIPS: Fix genvdso error on rebuild
USB: ohci-jz4740: Remove obsolete driver
MIPS: JZ4740: Probe OHCI platform device via DT
MIPS: JZ4740: Qi LB60: Remove support for AVT2 variant
MIPS: pistachio: Determine SoC revision during boot
MIPS: BMIPS: Adjust mips-hpt-frequency for BCM7435
mips: mt7620: fallback to SDRAM when syscfg0 does not have a valid value for the memory type
MIPS: Prevent "restoration" of MSA context in non-MSA kernels
MIPS: cevt-r4k: Dynamically calculate min_delta_ns
MIPS: malta-time: Take seconds into account
MIPS: malta-time: Start GIC count before syncing to RTC
MIPS: Force CPUs to lose FP context during mode switches
...
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
UDF/OSTA terminology is confusing. Partition Numbers (PNs) are arbitrary
16-bit values, one for each physical partition in the volume. Partition
Reference Numbers (PRNs) are indices into the the Partition Map Table
and do not necessarily equal the PN of the mapped partition.
The current metadata code mistakenly uses the PN instead of the PRN when
mapping metadata blocks to physical/sparable blocks. Windows-created
UDF 2.5 discs for some reason use large, arbitrary PNs, resulting in
mount failure and KASAN read warnings in udf_read_inode().
For example, a NetBSD UDF 2.5 partition might look like this:
PRN PN Type
--- -- ----
0 0 Sparable
1 0 Metadata
Since PRN == PN, we are fine.
But Windows could gives us:
PRN PN Type
--- ---- ----
0 8192 Sparable
1 8192 Metadata
So udf_read_inode() will start out by checking the partition length in
sbi->s_partmaps[8192], which is obviously out of bounds.
Fix this by creating a new field (s_phys_partition_ref) in struct
udf_meta_data, referencing whatever physical or sparable map has the
same partition number as the metadata partition.
[JK: Add comment about s_phys_partition_ref, change its name]
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently when udf_get_pblock_meta25() fails to map a block using the
primary metadata file, it will attempt to load the mirror file entry by
calling udf_find_metadata_inode_efe(). That function will return a ERR_PTR
if it fails, but the return value is only checked against NULL. Test the
return value using IS_ERR() and change it to NULL if needed.
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, if a metadata partition map is missing its partition descriptor,
then udf_get_pblock_meta25() will BUG() out the first time it is called.
This is rather drastic for a corrupted filesystem, so just treat this case
as an invalid mapping instead.
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
init_f2fs_fs does:
1) f2fs_build_trace_ios
2) init_inodecache
3) create_node_manager_caches
4) create_segment_manager_caches
5) create_checkpoint_caches
6) create_extent_cache
7) kset_create_and_add
8) kobject_init_and_add
9) register_shrinker
10) register_filesystem
11) f2fs_create_root_stats
12) proc_mkdir
exit_f2fs_fs should do cleanup in the reverse order
to make the code more clear.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull misc vfs cleanups from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coredump: only charge written data against RLIMIT_CORE
coredump: get rid of coredump_params->written
ecryptfs_lookup(): try either only encrypted or plaintext name
ecryptfs: avoid multiple aliases for directories
bpf: reject invalid names right in ->lookup()
__d_alloc(): treat NULL name as QSTR("/", 1)
mtd: switch ubi_open_volume_path() to vfs_stat()
mtd: switch open_mtd_by_chdev() to use of vfs_stat()
Pull iov_iter cleanups from Al Viro.
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fold checks into iterate_and_advance()
rw_verify_area(): saner calling conventions
aio: remove a pointless assignment
The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in
09cbfea mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release}
macros
The comments for the above functions described a distinction between
those, that is now redundant, so remove those paragraphs
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
In the truncate or hole-punch path in dax, we clear out sub-page ranges.
If these sub-page ranges are sector aligned and sized, we can do the
zeroing through the driver instead so that error-clearing is handled
automatically.
For sub-sector ranges, we still have to rely on clear_pmem and have the
possibility of tripping over errors.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
This allows XFS to perform zeroing using the iomap infrastructure and
avoid buffer heads.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[vishal: fix conflicts with dax-error-handling]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
1/ If a mapping overlaps a bad sector fail the request.
2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Pull parallel lookup fixups from Al Viro:
"Fix for xfs parallel readdir (turns out the cxfs exposure was not
enough to catch all problems), and a reversion of btrfs back to
->iterate() until the fs/btrfs/delayed-inode.c gets fixed"
* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xfs: concurrent readdir hangs on data buffer locks
Revert "btrfs: switch to ->iterate_shared()"
There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.
process 1 process 2 process 3
--------- --------- ---------
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock and read buffer
iunlock(shared)
process entries in buffer
.....
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock buffer
<blocks>
finish ->iterate_shared
file_accessed()
->update_time
start transaction
ilock(excl)
<blocks>
.....
finishes processing buffer
get next buffer
ilock(shared)
<blocks>
And that's the deadlock.
Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.
Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 972b241f84.
Quoth Chris:
didn't take the delayed inode stuff into account
it got an rbtree of items and it pulls things out
so in shared mode, its hugely racey
sorry, lets revert and fix it for real inside of btrfs
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cifs iovec cleanups from Al Viro.
* 'sendmsg.cifs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
cifs: don't bother with kmap on read_pages side
cifs_readv_receive: use cifs_read_from_socket()
cifs: no need to wank with copying and advancing iovec on recvmsg side either
cifs: quit playing games with draining iovecs
cifs: merge the hash calculation helpers
Pull remaining vfs xattr work from Al Viro:
"The rest of work.xattr (non-cifs conversions)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: Switch to generic xattr handlers
ubifs: Switch to generic xattr handlers
jfs: Switch to generic xattr handlers
jfs: Clean up xattr name mapping
gfs2: Switch to generic xattr handlers
ceph: kill __ceph_removexattr()
ceph: Switch to generic xattr handlers
ceph: Get rid of d_find_alias in ceph_set_acl
Pull cifs updates from Steve French:
"Various small CIFS and SMB3 fixes (including some for stable)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
remove directory incorrectly tries to set delete on close on non-empty directories
Update cifs.ko version to 2.09
fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication
fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication
fs/cifs: correctly to anonymous authentication for the LANMAN authentication
fs/cifs: correctly to anonymous authentication via NTLMSSP
cifs: remove any preceding delimiter from prefix_path
cifs: Use file_dentry()
Rearrange the inode tagging functions so that they are higher up in
xfs_cache.c and so there is no need for forward prototypes to be
defined. This is purely code movement, no other change.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Inode radix tree tagging for reclaim passes a lot of unnecessary
variables around. Over time the xfs-perag has grown a xfs_mount
backpointer, and an internal agno so we don't need to pass other
variables into the tagging functions to supply this information.
Rework the functions to pass the minimal variable set required
and simplify the internal logic and flow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The cluster inode variable uses unconventional naming - iq - which
makes it hard to distinguish it between the inode passed into the
function - ip - and that is a vector for mistakes to be made.
Rename all the cluster inode variables to use a more conventional
prefixes to reduce potential future confusion (cilist, cilist_size,
cip).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".
This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such not realising
that it is in the final stages of being freed.
As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.
For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
if the inode number if valid while we have these locks held we know
that it has not progressed through reclaim to the point where it is
clean and is about to be freed.
[bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
had already been zeroed.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.
Unfortunately, this freeing can race with other accesses that are in
the RCU current grace period that have found the inode in the radix
tree with a valid state. This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.
The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get spuriously incorrect clean state returned
from such checks. This can lead to use thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.
This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things it shouldn't with freed memory.
IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.
To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we are guarantee RCU protected
lookup code will always have the memory structures it needs
available during the RCU grace period that lookup races can occur
in.
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When unmounting XFS, we call:
xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
This goes over the whole indirection array and calls
xfs_iext_irec_remove for each one of the erps (from the last one to
the first one). As a result, we keep shrinking (reallocating
actually) the indirection array until we shrink out all of its
elements. When we have files with huge numbers of extents, umount
takes 30-80 sec, depending on the amount of files that XFS loaded
and the amount of indirection entries of each file. The unmount
stack looks like:
[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17
Further, this reallocation prevents us from freeing the extent list
from a RCU callback as allocation can block. Hence if the extent
list is in indirect format, optimise the freeing of the extent list
to only use kmem_free calls by freeing entire extent buffer pages at
a time, rather than extent by extent.
[dchinner: simplified freeing loop based on Christoph's suggestion]
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.
(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
cc: <stable@vger.kernel.org> # 3.10.x-
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.
Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.
cc: <stable@vger.kernel.org> # 3.10.x-
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Joe Lawrence reported a list_add corruption with 4.6-rc1 when
testing some custom md administration code that made it's own
block device nodes for the md array. The simple test loop of:
for i in {0..100}; do
mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
mdadm --detail --export $tmp/tmp_node > /dev/null
rm -f $tmp/tmp_node
done
Would produce this warning in bd_acquire() when mdadm opened the
device node:
list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
And then produce this from bd_forget from kdevtmpfs evicting a block
dev inode:
list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
to vfs inode"). The issue is that xfs_inactive() frees the
unlinked inode, and the above commit meant that this freeing zeroed
the mode in the struct inode. The problem is that after evict() has
called ->evict_inode, it expects the i_mode to be intact so that it
can call bd_forget() or cd_forget() to drop the reference to the
block device inode attached to the XFS inode.
In reality, the only thing we do in xfs_fs_evict_inode() that is not
generic is call xfs_inactive(). We can move the xfs_inactive() call
to xfs_fs_destroy_inode() without any problems at all, and this
will leave the VFS inode intact until it is completely done with it.
So, remove xfs_fs_evict_inode(), and do the work it used to do in
->destroy_inode instead.
cc: <stable@vger.kernel.org> # 4.6
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we take "retry forever" literally on metadata IO errors, we can
hang at unmount, once it retries those writes forever. This is the
default behavior, unfortunately.
Add an error configuration option for this behavior and default it
to "fail" so that an unmount will trigger actuall errors, a shutdown
and allow the unmount to succeed. It will be noisy, though, as it
will log the errors and shutdown that occurs.
To fix this, we need to mark the filesystem as being in the process
of unmounting. Do this with a mount flag that is added at the
appropriate time (i.e. before the blocking AIL sync). We also need
to add this flag if mount fails after the initial phase of log
recovery has been run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
now most of the infrastructure is in place, we can start adding
support for configuring specific errors such as ENODEV, ENOSPC, EIO,
etc. Add these error configurations and configure them all to have
appropriate behaviours. That is, all will be configured to retry
forever by default, except for ENODEV, which is an unrecoverable
error, so it will be configured to not retry on error
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>