commit 0c4bcfdecb1ac0967619ee7ff44871d93c08c909 upstream.
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.
On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.
Fix by copying pages coming from the user address space to new pipe
buffers.
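The difference between handing out a page reference and handing out a copy can be illustrated with a rough userspace analogy. This is not kernel code: a bytearray stands in for the writer's page, a memoryview for a pipe buffer that references it, and all names are hypothetical.

```python
def demo_shared_vs_copied():
    # A bytearray stands in for the page that backs the writer's buffer.
    page = bytearray(b"secret-v1")

    # Zero-copy: the "pipe buffer" is another reference to the same memory.
    shared_pipe_buf = memoryview(page)

    # Approach of the fix: the pipe buffer gets its own copy of the data.
    copied_pipe_buf = bytes(page)

    # The request has completed and write() has returned; the writer now
    # reuses its buffer for unrelated data (same length, so no resize).
    page[:] = b"secret-v2"

    return bytes(shared_pipe_buf), copied_pipe_buf


shared, copied = demo_shared_vs_copied()
print(shared)  # the reference still observes the writer's memory
print(copied)  # the copy is frozen at the time of the request
```

In this model the holder of the reference keeps seeing whatever the writer puts in its buffer later, while the copy only ever contains the data of the original request.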
Bug: 226679409
Reported-by: Jann Horn <jannh@google.com>
Fixes: c3021629a0 ("fuse: support splice() reading from fuse device")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I57a98e96e36bb97ce3e7b1ebf88917c6c8b0247d
Changes in 5.10.87
nfc: fix segfault in nfc_genl_dump_devices_done
drm/msm/dsi: set default num_data_lanes
KVM: arm64: Save PSTATE early on exit
s390/test_unwind: use raw opcode instead of invalid instruction
Revert "tty: serial: fsl_lpuart: drop earlycon entry for i.MX8QXP"
net/mlx4_en: Update reported link modes for 1/10G
ALSA: hda: Add Intel DG2 PCI ID and HDMI codec vid
ALSA: hda/hdmi: fix HDA codec entry table order for ADL-P
parisc/agp: Annotate parisc agp init functions with __init
i2c: rk3x: Handle a spurious start completion interrupt flag
net: netlink: af_netlink: Prevent empty skb by adding a check on len.
drm/amd/display: Fix for the no Audio bug with Tiled Displays
drm/amd/display: add connector type check for CRC source set
tracing: Fix a kmemleak false positive in tracing_map
KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req
staging: most: dim2: use device release method
bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc
fuse: make sure reclaim doesn't write the inode
hwmon: (dell-smm) Fix warning on /proc/i8k creation error
ethtool: do not perform operations on net devices being unregistered
perf inject: Fix itrace space allowed for new attributes
perf intel-pt: Fix some PGE (packet generation enable/control flow packets) usage
perf intel-pt: Fix sync state when a PSB (synchronization) packet is found
perf intel-pt: Fix intel_pt_fup_event() assumptions about setting state type
perf intel-pt: Fix state setting when receiving overflow (OVF) packet
perf intel-pt: Fix next 'err' value, walking trace
perf intel-pt: Fix missing 'instruction' events with 'q' option
perf intel-pt: Fix error timestamp setting on the decoder error path
memblock: free_unused_memmap: use pageblock units instead of MAX_ORDER
memblock: align freed memory map on pageblock boundaries with SPARSEMEM
memblock: ensure there is no overflow in memblock_overlaps_region()
arm: extend pfn_valid to take into account freed memory map alignment
arm: ioremap: don't abuse pfn_valid() to check if pfn is in RAM
Linux 5.10.87
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I56719d03237a8607e6cc0bc357421d0b4a479084
commit 5c791fe1e2a4f401f819065ea4fc0450849f1818 upstream.
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.
Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This results in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.
The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).
Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.
- fallocate(2)/copy_file_range(2): these call file_update_time() or
  file_modified(), so flush the inode before returning from the call
- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
  flush the ctime directly from this helper
Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.63
ext4: fix race writing to an inline_data file while its xattrs are changing
fscrypt: add fscrypt_symlink_getattr() for computing st_size
ext4: report correct st_size for encrypted symlinks
f2fs: report correct st_size for encrypted symlinks
ubifs: report correct st_size for encrypted symlinks
Revert "ucounts: Increase ucounts reference counter before the security hook"
Revert "cred: add missing return error code when set_cred_ucounts() failed"
Revert "Add a reference to ucounts for each cred"
static_call: Fix unused variable warn w/o MODULE
xtensa: fix kconfig unmet dependency warning for HAVE_FUTEX_CMPXCHG
ARM: OMAP1: ams-delta: remove unused function ams_delta_camera_power
gpu: ipu-v3: Fix i.MX IPU-v3 offset calculations for (semi)planar U/V formats
reset: reset-zynqmp: Fixed the argument data type
qed: Fix the VF msix vectors flow
net: macb: Add a NULL check on desc_ptp
qede: Fix memset corruption
perf/x86/intel/pt: Fix mask of num_address_ranges
ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
perf/x86/amd/ibs: Work around erratum #1197
perf/x86/amd/power: Assign pmu.module
cryptoloop: add a deprecation warning
ALSA: hda/realtek: Quirk for HP Spectre x360 14 amp setup
ALSA: hda/realtek: Workaround for conflicting SSID on ASUS ROG Strix G17
ALSA: pcm: fix divide error in snd_pcm_lib_ioctl
serial: 8250: 8250_omap: Fix possible array out of bounds access
spi: Switch to signed types for *_native_cs SPI controller fields
new helper: inode_wrong_type()
fuse: fix illegal access to inode with reused nodeid
media: stkwebcam: fix memory leak in stk_camera_probe
Linux 5.10.63
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5d461fa0b4dd5ba2457663bd20da1001936feaca
commit 15db16837a35d8007cb8563358787412213db25e upstream.
Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...)
with outarg containing nodeid and generation.
If a fuse inode is found in inode cache with the same nodeid but different
generation, the existing fuse inode should be unhashed and marked "bad" and
a new inode with the new generation should be hashed instead.
This can happen, for example, with a passthrough fuse filesystem that returns
the real filesystem ino/generation on lookup and where real inode numbers
can get recycled due to real files being unlinked not via the fuse
passthrough filesystem.
With current code, this situation will not be detected and an old fuse
dentry that used to point to an older generation real inode, can be used to
access a completely new inode, which should be accessed only via the new
dentry.
Note that because the FORGET message carries the nodeid w/o generation, the
server should wait to get FORGET counts for the nlookup counts of the old
and reused inodes combined, before it can free the resources associated to
that nodeid.
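The lookup rule described above (same nodeid, different generation means the cached inode is stale) can be sketched as a toy model. This is a simplified userspace illustration, not the kernel implementation; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    nodeid: int
    generation: int
    bad: bool = False


class InodeCache:
    """Toy inode hash keyed by nodeid (hypothetical names)."""

    def __init__(self):
        self._by_nodeid = {}

    def iget(self, nodeid, generation):
        inode = self._by_nodeid.get(nodeid)
        if inode is not None and inode.generation != generation:
            # Same nodeid but a different generation: the number was reused.
            inode.bad = True             # old dentries now reach a bad inode
            del self._by_nodeid[nodeid]  # unhash the stale inode
            inode = None
        if inode is None:
            inode = Inode(nodeid, generation)
            self._by_nodeid[nodeid] = inode  # hash the new generation
        return inode


cache = InodeCache()
old = cache.iget(42, generation=1)
new = cache.iget(42, generation=2)
print(old.bad, new.bad)
```

Anything still holding the old object sees it marked bad instead of silently reaching the new file.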
Stable backport notes:
* This is not a regression. The bug has been in fuse forever, but only
  a certain class of low level fuse filesystems can trigger this bug
* Because there is no way to check at runtime whether this fix is applied,
  libfuse test_examples.py tests this fix with a hardcoded check for
  kernel version >= 5.14
* After backport to stable kernel(s), the libfuse test can be updated
  to also check minimal stable kernel version(s)
* Depends on "fuse: fix bad inode" which is already applied to stable
  kernels v5.4.y and v5.10.y
* Required backporting helper inode_wrong_type()
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxi8DymG=JO_sAU+wS8akFdzh+PuXwW3Ebgahd2Nwnh7zA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The initial FUSE passthrough interface has the issue of introducing an
ioctl which receives as a parameter a data structure containing a
pointer. Depending on the architecture, the size of this struct might
change, and especially for 32-bit userspace running on a 64-bit kernel,
the size mismatch results in a different ioctl, or in a single ioctl
the behavior of which depends on the data that is passed (e.g., with an
enum). This is just poor ioctl design, as mentioned by Arnd Bergmann [1].
Introduce the new FUSE_PASSTHROUGH_OPEN ioctl which only gets the fd of
the lower file system, which is a fixed-size __u32, dropping the
confusing fuse_passthrough_out data structure.
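The size problem can be made concrete with a small sketch. It assumes the asm-generic ioctl number encoding (direction, size, type and number packed into 32 bits), a hypothetical argument struct of two 32-bit fields followed by a pointer, and illustrative type/number values; none of these are taken from the patch itself.

```python
import struct

# asm-generic layout (assumption): dir(2 bits) | size(14) | type(8) | nr(8)
_IOC_NRSHIFT, _IOC_TYPESHIFT, _IOC_SIZESHIFT, _IOC_DIRSHIFT = 0, 8, 16, 30
_IOC_WRITE = 1


def iow(ioc_type, nr, size):
    """Build a write-direction ioctl number from type, number and arg size."""
    return ((_IOC_WRITE << _IOC_DIRSHIFT) | (size << _IOC_SIZESHIFT)
            | (ioc_type << _IOC_TYPESHIFT) | (nr << _IOC_NRSHIFT))


# Hypothetical struct: two u32 fields plus a pointer.
size_32bit = struct.calcsize("<III")  # pointer is 4 bytes on a 32-bit ABI
size_64bit = struct.calcsize("<IIQ")  # pointer is 8 bytes on a 64-bit ABI
size_u32 = struct.calcsize("<I")      # fixed-size argument, same on both

MAGIC, NR = 229, 126                  # illustrative values only

cmd_32bit = iow(MAGIC, NR, size_32bit)
cmd_64bit = iow(MAGIC, NR, size_64bit)
cmd_fixed = iow(MAGIC, NR, size_u32)

print(hex(cmd_32bit), hex(cmd_64bit), hex(cmd_fixed))
```

Because the argument size is part of the number, the pointer-carrying struct yields two different ioctl numbers for the two ABIs, while a fixed-size 32-bit argument yields one.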
[1] https://lore.kernel.org/lkml/CAK8P3a2K2FzPvqBYL9W=Yut58SFXyetXwU4Fz50G5O3TsS0pPQ@mail.gmail.com/
Bug: 175195837
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I486d71cbe20f3c0c87544fa75da4e2704fe57c7c
Changes in 5.10.36
bus: mhi: core: Fix check for syserr at power_up
bus: mhi: core: Clear configuration from channel context during reset
bus: mhi: core: Sanity check values from remote device before use
nitro_enclaves: Fix stale file descriptors on failed usercopy
dyndbg: fix parsing file query without a line-range suffix
s390/disassembler: increase ebpf disasm buffer size
s390/zcrypt: fix zcard and zqueue hot-unplug memleak
vhost-vdpa: fix vm_flags for virtqueue doorbell mapping
tpm: acpi: Check eventlog signature before using it
ACPI: custom_method: fix potential use-after-free issue
ACPI: custom_method: fix a possible memory leak
ftrace: Handle commands when closing set_ftrace_filter file
ARM: 9056/1: decompressor: fix BSS size calculation for LLVM ld.lld
arm64: dts: marvell: armada-37xx: add syscon compatible to NB clk node
arm64: dts: mt8173: fix property typo of 'phys' in dsi node
ecryptfs: fix kernel panic with null dev_name
fs/epoll: restore waking from ep_done_scan()
mtd: spi-nor: core: Fix an issue of releasing resources during read/write
Revert "mtd: spi-nor: macronix: Add support for mx25l51245g"
mtd: spinand: core: add missing MODULE_DEVICE_TABLE()
mtd: rawnand: atmel: Update ecc_stats.corrected counter
mtd: physmap: physmap-bt1-rom: Fix unintentional stack access
erofs: add unsupported inode i_format check
spi: stm32-qspi: fix pm_runtime usage_count counter
spi: spi-ti-qspi: Free DMA resources
scsi: qla2xxx: Fix crash in qla2xxx_mqueuecommand()
scsi: mpt3sas: Block PCI config access from userspace during reset
mmc: uniphier-sd: Fix an error handling path in uniphier_sd_probe()
mmc: uniphier-sd: Fix a resource leak in the remove function
mmc: sdhci: Check for reset prior to DMA address unmap
mmc: sdhci-pci: Fix initialization of some SD cards for Intel BYT-based controllers
mmc: sdhci-tegra: Add required callbacks to set/clear CQE_EN bit
mmc: block: Update ext_csd.cache_ctrl if it was written
mmc: block: Issue a cache flush only when it's enabled
mmc: core: Do a power cycle when the CMD11 fails
mmc: core: Set read only for SD cards with permanent write protect bit
mmc: core: Fix hanging on I/O during system suspend for removable cards
irqchip/gic-v3: Do not enable irqs when handling spurious interrups
cifs: Return correct error code from smb2_get_enc_key
cifs: fix out-of-bound memory access when calling smb3_notify() at mount point
cifs: detect dead connections only when echoes are enabled.
smb2: fix use-after-free in smb2_ioctl_query_info()
btrfs: handle remount to no compress during compression
x86/build: Disable HIGHMEM64G selection for M486SX
btrfs: fix metadata extent leak after failure to create subvolume
intel_th: pci: Add Rocket Lake CPU support
btrfs: fix race between transaction aborts and fsyncs leading to use-after-free
posix-timers: Preserve return value in clock_adjtime32()
fbdev: zero-fill colormap in fbcmap.c
cpuidle: tegra: Fix C7 idling state on Tegra114
bus: ti-sysc: Probe for l4_wkup and l4_cfg interconnect devices first
staging: wimax/i2400m: fix byte-order issue
spi: ath79: always call chipselect function
spi: ath79: remove spi-master setup and cleanup assignment
bus: mhi: core: Destroy SBL devices when moving to mission mode
crypto: api - check for ERR pointers in crypto_destroy_tfm()
crypto: qat - fix unmap invalid dma address
usb: gadget: uvc: add bInterval checking for HS mode
usb: webcam: Invalid size of Processing Unit Descriptor
x86/sev: Do not require Hypervisor CPUID bit for SEV guests
crypto: hisilicon/sec - fixes a printing error
genirq/matrix: Prevent allocation counter corruption
usb: gadget: f_uac2: validate input parameters
usb: gadget: f_uac1: validate input parameters
usb: dwc3: gadget: Ignore EP queue requests during bus reset
usb: xhci: Fix port minor revision
kselftest/arm64: mte: Fix compilation with native compiler
ARM: tegra: acer-a500: Rename avdd to vdda of touchscreen node
PCI: PM: Do not read power state in pci_enable_device_flags()
kselftest/arm64: mte: Fix MTE feature detection
ARM: dts: BCM5301X: fix "reg" formatting in /memory node
ARM: dts: ux500: Fix up TVK R3 sensors
x86/build: Propagate $(CLANG_FLAGS) to $(REALMODE_FLAGS)
x86/boot: Add $(CLANG_FLAGS) to compressed KBUILD_CFLAGS
efi/libstub: Add $(CLANG_FLAGS) to x86 flags
soc/tegra: pmc: Fix completion of power-gate toggling
arm64: dts: imx8mq-librem5-r3: Mark buck3 as always on
tee: optee: do not check memref size on return from Secure World
soundwire: cadence: only prepare attached devices on clock stop
perf/arm_pmu_platform: Use dev_err_probe() for IRQ errors
perf/arm_pmu_platform: Fix error handling
random: initialize ChaCha20 constants with correct endianness
usb: xhci-mtk: support quirk to disable usb2 lpm
fpga: dfl: pci: add DID for D5005 PAC cards
xhci: check port array allocation was successful before dereferencing it
xhci: check control context is valid before dereferencing it.
xhci: fix potential array out of bounds with several interrupters
bus: mhi: core: Clear context for stopped channels from remove()
ARM: dts: at91: change the key code of the gpio key
tools/power/x86/intel-speed-select: Increase string size
platform/x86: ISST: Account for increased timeout in some cases
spi: dln2: Fix reference leak to master
spi: omap-100k: Fix reference leak to master
spi: qup: fix PM reference leak in spi_qup_remove()
usb: gadget: tegra-xudc: Fix possible use-after-free in tegra_xudc_remove()
usb: musb: fix PM reference leak in musb_irq_work()
usb: core: hub: Fix PM reference leak in usb_port_resume()
usb: dwc3: gadget: Check for disabled LPM quirk
tty: n_gsm: check error while registering tty devices
intel_th: Consistency and off-by-one fix
phy: phy-twl4030-usb: Fix possible use-after-free in twl4030_usb_remove()
crypto: sun8i-ss - Fix PM reference leak when pm_runtime_get_sync() fails
crypto: sun8i-ce - Fix PM reference leak in sun8i_ce_probe()
crypto: stm32/hash - Fix PM reference leak on stm32-hash.c
crypto: stm32/cryp - Fix PM reference leak on stm32-cryp.c
crypto: sa2ul - Fix PM reference leak in sa_ul_probe()
crypto: omap-aes - Fix PM reference leak on omap-aes.c
platform/x86: intel_pmc_core: Don't use global pmcdev in quirks
spi: sync up initial chipselect state
btrfs: do proper error handling in create_reloc_root
btrfs: do proper error handling in btrfs_update_reloc_root
btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
drm: Added orientation quirk for OneGX1 Pro
drm/qxl: do not run release if qxl failed to init
drm/qxl: release shadow on shutdown
drm/ast: Fix invalid usage of AST_MAX_HWC_WIDTH in cursor atomic_check
drm/amd/display: changing sr exit latency
drm/ast: fix memory leak when unload the driver
drm/amd/display: Check for DSC support instead of ASIC revision
drm/amd/display: Don't optimize bandwidth before disabling planes
drm/amdgpu/display: buffer INTERRUPT_LOW_IRQ_CONTEXT interrupt work
drm/amd/display/dc/dce/dce_aux: Remove duplicate line causing 'field overwritten' issue
scsi: lpfc: Fix incorrect dbde assignment when building target abts wqe
scsi: lpfc: Fix pt2pt connection does not recover after LOGO
drm/amdgpu: Fix some unload driver issues
sched/pelt: Fix task util_est update filtering
kvfree_rcu: Use same set of GFP flags as does single-argument
scsi: target: pscsi: Fix warning in pscsi_complete_cmd()
media: ite-cir: check for receive overflow
media: drivers: media: pci: sta2x11: fix Kconfig dependency on GPIOLIB
media: imx: capture: Return -EPIPE from __capture_legacy_try_fmt()
atomisp: don't let it go past pipes array
power: supply: bq27xxx: fix power_avg for newer ICs
extcon: arizona: Fix some issues when HPDET IRQ fires after the jack has been unplugged
extcon: arizona: Fix various races on driver unbind
media: media/saa7164: fix saa7164_encoder_register() memory leak bugs
media: gspca/sq905.c: fix uninitialized variable
power: supply: Use IRQF_ONESHOT
backlight: qcom-wled: Use sink_addr for sync toggle
backlight: qcom-wled: Fix FSC update issue for WLED5
drm/amdgpu: mask the xgmi number of hops reported from psp to kfd
drm/amdkfd: Fix UBSAN shift-out-of-bounds warning
drm/amdgpu : Fix asic reset regression issue introduce by 8f211fe8ac7c4f
drm/amd/pm: fix workload mismatch on vega10
drm/amd/display: Fix UBSAN warning for not a valid value for type '_Bool'
drm/amd/display: DCHUB underflow counter increasing in some scenarios
drm/amd/display: fix dml prefetch validation
scsi: qla2xxx: Always check the return value of qla24xx_get_isp_stats()
drm/vkms: fix misuse of WARN_ON
scsi: qla2xxx: Fix use after free in bsg
mmc: sdhci-esdhc-imx: validate pinctrl before use it
mmc: sdhci-pci: Add PCI IDs for Intel LKF
mmc: sdhci-brcmstb: Remove CQE quirk
ata: ahci: Disable SXS for Hisilicon Kunpeng920
drm/komeda: Fix bit check to import to value of proper type
nvmet: return proper error code from discovery ctrl
selftests/resctrl: Enable gcc checks to detect buffer overflows
selftests/resctrl: Fix compilation issues for global variables
selftests/resctrl: Fix compilation issues for other global variables
selftests/resctrl: Clean up resctrl features check
selftests/resctrl: Fix missing options "-n" and "-p"
selftests/resctrl: Use resctrl/info for feature detection
selftests/resctrl: Fix incorrect parsing of iMC counters
selftests/resctrl: Fix checking for < 0 for unsigned values
power: supply: cpcap-charger: Add usleep to cpcap charger to avoid usb plug bounce
scsi: smartpqi: Use host-wide tag space
scsi: smartpqi: Correct request leakage during reset operations
scsi: smartpqi: Add new PCI IDs
scsi: scsi_dh_alua: Remove check for ASC 24h in alua_rtpg()
media: em28xx: fix memory leak
media: vivid: update EDID
drm/msm/dp: Fix incorrect NULL check kbot warnings in DP driver
clk: socfpga: arria10: Fix memory leak of socfpga_clk on error return
power: supply: generic-adc-battery: fix possible use-after-free in gab_remove()
power: supply: s3c_adc_battery: fix possible use-after-free in s3c_adc_bat_remove()
media: tc358743: fix possible use-after-free in tc358743_remove()
media: adv7604: fix possible use-after-free in adv76xx_remove()
media: i2c: adv7511-v4l2: fix possible use-after-free in adv7511_remove()
media: i2c: tda1997: Fix possible use-after-free in tda1997x_remove()
media: i2c: adv7842: fix possible use-after-free in adv7842_remove()
media: platform: sti: Fix runtime PM imbalance in regs_show
media: sun8i-di: Fix runtime PM imbalance in deinterlace_start_streaming
media: dvb-usb: fix memory leak in dvb_usb_adapter_init
media: gscpa/stv06xx: fix memory leak
sched/fair: Ignore percpu threads for imbalance pulls
drm/msm/mdp5: Configure PP_SYNC_HEIGHT to double the vtotal
drm/msm/mdp5: Do not multiply vclk line count by 100
drm/amdgpu/ttm: Fix memory leak userptr pages
drm/radeon/ttm: Fix memory leak userptr pages
drm/amd/display: Fix debugfs link_settings entry
drm/amd/display: Fix UBSAN: shift-out-of-bounds warning
drm/amdkfd: Fix cat debugfs hang_hws file causes system crash bug
amdgpu: avoid incorrect %hu format string
drm/amd/display: Try YCbCr420 color when YCbCr444 fails
drm/amdgpu: fix NULL pointer dereference
scsi: lpfc: Fix crash when a REG_RPI mailbox fails triggering a LOGO response
scsi: lpfc: Fix error handling for mailboxes completed in MBX_POLL mode
scsi: lpfc: Remove unsupported mbox PORT_CAPABILITIES logic
mfd: intel-m10-bmc: Fix the register access range
mfd: da9063: Support SMBus and I2C mode
mfd: arizona: Fix rumtime PM imbalance on error
scsi: libfc: Fix a format specifier
perf: Rework perf_event_exit_event()
sched,fair: Alternative sched_slice()
block/rnbd-clt: Fix missing a memory free when unloading the module
s390/archrandom: add parameter check for s390_arch_random_generate
sched,psi: Handle potential task count underflow bugs more gracefully
power: supply: cpcap-battery: fix invalid usage of list cursor
ALSA: emu8000: Fix a use after free in snd_emu8000_create_mixer
ALSA: hda/conexant: Re-order CX5066 quirk table entries
ALSA: sb: Fix two use after free in snd_sb_qsound_build
ALSA: usb-audio: Explicitly set up the clock selector
ALSA: usb-audio: Add dB range mapping for Sennheiser Communications Headset PC 8
ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 445 G7
ALSA: hda/realtek: GA503 use same quirks as GA401
ALSA: hda/realtek: fix mic boost on Intel NUC 8
ALSA: hda/realtek - Headset Mic issue on HP platform
ALSA: hda/realtek: fix static noise on ALC285 Lenovo laptops
ALSA: hda/realtek: Add quirk for Intel Clevo PCx0Dx
tools/power/turbostat: Fix turbostat for AMD Zen CPUs
btrfs: fix race when picking most recent mod log operation for an old root
arm64/vdso: Discard .note.gnu.property sections in vDSO
Makefile: Move -Wno-unused-but-set-variable out of GCC only block
fs: fix reporting supported extra file attributes for statx()
virtiofs: fix memory leak in virtio_fs_probe()
kcsan, debugfs: Move debugfs file creation out of early init
ubifs: Only check replay with inode type to judge if inode linked
f2fs: fix error handling in f2fs_end_enable_verity()
f2fs: fix to avoid out-of-bounds memory access
mlxsw: spectrum_mr: Update egress RIF list before route's action
openvswitch: fix stack OOB read while fragmenting IPv4 packets
ACPI: GTDT: Don't corrupt interrupt mappings on watchdow probe failure
NFS: fs_context: validate UDP retrans to prevent shift out-of-bounds
NFS: Don't discard pNFS layout segments that are marked for return
NFSv4: Don't discard segments marked for return in _pnfs_return_layout()
Input: ili210x - add missing negation for touch indication on ili210x
jffs2: Fix kasan slab-out-of-bounds problem
jffs2: Hook up splice_write callback
powerpc/powernv: Enable HAIL (HV AIL) for ISA v3.1 processors
powerpc/eeh: Fix EEH handling for hugepages in ioremap space.
powerpc/kexec_file: Use current CPU info while setting up FDT
powerpc/32: Fix boot failure with CONFIG_STACKPROTECTOR
powerpc: fix EDEADLOCK redefinition error in uapi/asm/errno.h
intel_th: pci: Add Alder Lake-M support
tpm: efi: Use local variable for calculating final log size
tpm: vtpm_proxy: Avoid reading host log when using a virtual device
crypto: arm/curve25519 - Move '.fpu' after '.arch'
crypto: rng - fix crypto_rng_reset() refcounting when !CRYPTO_STATS
md/raid1: properly indicate failure when ending a failed write request
dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences
fuse: fix write deadlock
exfat: fix erroneous discard when clear cluster bit
sfc: farch: fix TX queue lookup in TX flush done handling
sfc: farch: fix TX queue lookup in TX event handling
security: commoncap: fix -Wstringop-overread warning
Fix misc new gcc warnings
jffs2: check the validity of dstlen in jffs2_zlib_compress()
smb3: when mounting with multichannel include it in requested capabilities
smb3: do not attempt multichannel to server which does not support it
Revert 337f13046f ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
futex: Do not apply time namespace adjustment on FUTEX_LOCK_PI
x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
kbuild: update config_data.gz only when the content of .config is changed
ext4: annotate data race in start_this_handle()
ext4: annotate data race in jbd2_journal_dirty_metadata()
ext4: fix check to prevent false positive report of incorrect used inodes
ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
ext4: fix error code in ext4_commit_super
ext4: fix ext4_error_err save negative errno into superblock
ext4: fix error return code in ext4_fc_perform_commit()
ext4: allow the dax flag to be set and cleared on inline directories
ext4: Fix occasional generic/418 failure
media: dvbdev: Fix memory leak in dvb_media_device_free()
media: dvb-usb: Fix use-after-free access
media: dvb-usb: Fix memory leak at error in dvb_usb_device_init()
media: staging/intel-ipu3: Fix memory leak in imu_fmt
media: staging/intel-ipu3: Fix set_fmt error handling
media: staging/intel-ipu3: Fix race condition during set_fmt
media: v4l2-ctrls: fix reference to freed memory
media: venus: hfi_parser: Don't initialize parser on v1
usb: gadget: dummy_hcd: fix gpf in gadget_setup
usb: gadget: Fix double free of device descriptor pointers
usb: gadget/function/f_fs string table fix for multiple languages
usb: dwc3: gadget: Remove FS bInterval_m1 limitation
usb: dwc3: gadget: Fix START_TRANSFER link state check
usb: dwc3: core: Do core softreset when switch mode
usb: dwc2: Fix session request interrupt handler
tty: fix memory leak in vc_deallocate
rsi: Use resume_noirq for SDIO
tools/power turbostat: Fix offset overflow issue in index converting
tracing: Map all PIDs to command lines
tracing: Restructure trace_clock_global() to never block
dm persistent data: packed struct should have an aligned() attribute too
dm space map common: fix division bug in sm_ll_find_free_block()
dm integrity: fix missing goto in bitmap_flush_interval error handling
dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails
lib/vsprintf.c: remove leftover 'f' and 'F' cases from bstr_printf()
thermal/drivers/cpufreq_cooling: Fix slab OOB issue
thermal/core/fair share: Lock the thermal zone while looping over instances
Linux 5.10.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7b8075de5edd8de69205205cddb9a3273d7d0810
commit 4f06dd92b5d0a6f8eec6a34b8d6ef3e1f4ac1e10 upstream.
There are two modes for write(2) and friends in fuse:
a) write through (update page cache, send sync WRITE request to userspace)
b) buffered write (update page cache, async writeout later)
The write through method kept all the page cache pages locked that were
used for the request. Keeping more than one page locked is deadlock prone
and Qian Cai demonstrated this with trinity fuzzing.
The reason for keeping the pages locked is that concurrent mapped reads
shouldn't try to pull possibly stale data into the page cache.
For full page writes, the easy way to fix this is to make the cached page
be the authoritative source by marking the page PG_uptodate immediately.
After this the page can be safely unlocked, since mapped/cached reads will
take the written data from the cache.
Concurrent mapped writes will now cause data in the original WRITE request
to be updated; this however doesn't cause any data inconsistency and this
scenario should be exceedingly rare anyway.
If the WRITE request returns with an error in the above case, currently the
page is not marked uptodate; this means that a concurrent read will always
read consistent data. After this patch the page is uptodate between
writing to the cache and receiving the error: there's a window where a cached
read will read the wrong data. While theoretically this could be a
regression, it is unlikely to be one in practice, since this is normal for
buffered writes.
In case of a partial page write to an already uptodate page the locking is
also unnecessary, with the above caveats.
Partial write of a not uptodate page still needs to be handled. One way
would be to read the complete page before doing the write. This is not
possible, since it might break filesystems that don't expect any READ
requests when the file was opened O_WRONLY.
The other solution is to serialize the synchronous write with reads from
the partial pages. The easiest way to do this is to keep the partial pages
locked. The problem is that a write() may involve two such pages (one head
and one tail). This patch fixes it by only locking the partial tail page.
If there's a partial head page as well, then split that off as a separate
WRITE request.
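The head/tail splitting can be sketched as plain offset arithmetic. This is a simplified model, not the kernel code: it assumes a 4096-byte page, assumes that no touched page is already uptodate, and ignores per-request page limits. The function name is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


def plan_write(pos, length, page_size=PAGE_SIZE):
    """Split a write so that each WRITE request keeps at most one page locked.

    Every touched page is assumed NOT uptodate, so a partially written
    page has to stay locked, and that page ends the current request.
    Returns a list of (offset, length, locked_page_index_or_None).
    """
    requests = []
    end = pos + length
    req_start = pos
    cur = pos
    while cur < end:
        page_index = cur // page_size
        page_end = (page_index + 1) * page_size
        chunk = min(end, page_end) - cur
        cur += chunk
        if chunk < page_size:
            # Partial page: keep it locked and close the request here.
            requests.append((req_start, cur - req_start, page_index))
            req_start = cur
    if cur > req_start:
        # Trailing run of full pages: nothing needs to stay locked.
        requests.append((req_start, cur - req_start, None))
    return requests


# Unaligned write covering a partial head, one full page and a partial tail.
print(plan_write(100, 8192))
```

For the unaligned example the partial head page becomes its own request, and the second request carries the full page plus the partial tail, with only the tail page locked.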
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-fsdevel/4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw/
Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.25
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86/aes-ni-xts - use direct calls to and 4-way stride
bpf: Prohibit alu ops for pointer types not defining ptr_limit
bpf: Fix off-by-one for area size in creating mask to left
bpf: Simplify alu_limit masking for pointer arithmetic
bpf: Add sanity check for upper ptr_limit
bpf, selftests: Fix up some test_verifier cases for unprivileged
RDMA/srp: Fix support for unpopulated and unbalanced NUMA nodes
fuse: fix live lock in fuse_iget()
Revert "nfsd4: remove check_conflicting_opens warning"
Revert "nfsd4: a client's own opens needn't prevent delegations"
ALSA: usb-audio: Don't avoid stopping the stream at disconnection
net: dsa: b53: Support setting learning on port
Linux 5.10.25
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0a19cd5f8dda58a2fa8fdfbe7cbabd2c32cb57bd
Enabling FUSE passthrough for mmap-ed operations not only affects
performance, but has also been shown to be mandatory for the correct
functioning of FUSE passthrough.
yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
data inconsistencies if the same file is also accessed with mmap. What
happens is that read/write operations are directly applied to the lower
file system (and its cache), while mmap-ed operations are affecting the
FUSE cache.
Extend the FUSE passthrough implementation to also handle memory-mapped
FUSE files, to both fix the cache inconsistencies and extend the
passthrough performance benefits to mmap-ed operations.
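The inconsistency can be pictured with a toy two-cache model. This is only an illustration under simplifying assumptions (byte arrays instead of page caches, hypothetical class names), not a description of the actual implementation.

```python
class LowerFile:
    """Stands in for the lower file system file and its cache."""

    def __init__(self, data):
        self.data = bytearray(data)


class FuseFileModel:
    def __init__(self, lower, mmap_passthrough):
        self.lower = lower
        self.mmap_passthrough = mmap_passthrough
        # Snapshot standing in for pages already held in the FUSE cache.
        self.fuse_cache = bytearray(lower.data)

    def write(self, offset, payload):
        # Passthrough write: applied directly to the lower file.
        self.lower.data[offset:offset + len(payload)] = payload

    def mmap_read(self, offset, size):
        # Without mmap passthrough, mapped reads are served by the FUSE cache.
        src = self.lower.data if self.mmap_passthrough else self.fuse_cache
        return bytes(src[offset:offset + size])


before = FuseFileModel(LowerFile(b"old!"), mmap_passthrough=False)
before.write(0, b"new!")
after = FuseFileModel(LowerFile(b"old!"), mmap_passthrough=True)
after.write(0, b"new!")
print(before.mmap_read(0, 4), after.mmap_read(0, 4))
```

In the first case the mapped read returns stale data because the write never reached the cache it reads from; in the second case both paths use the same backing data.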
[1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/
Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-9-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Ifad4698b0380f6e004c487940ac6907b9a9f2964
Signed-off-by: Alessio Balsini <balsini@google.com>
When using FUSE passthrough, read/write operations are directly
forwarded to the lower file system file through VFS, but there is no
guarantee that the process that is triggering the request has the right
permissions to access the lower file system. This would cause the
read/write access to fail.
In passthrough file systems, where the FUSE daemon is responsible for
the enforcement of the lower file system access policies, it often happens
that the process dealing with the FUSE file system doesn't have access
to the lower file system.
Since the FUSE daemon is in charge of implementing the FUSE file
operations, which in the case of read/write operations usually simply
amount to copying memory buffers from/to the lower file system, these
operations are executed with the FUSE daemon privileges.
This patch adds a reference to the FUSE daemon credentials, referenced
at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
to temporarily raise the user credentials when accessing lower file
system files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This
is because it is still the FUSE daemon that decides whether to enable
the passthrough feature at file open time, and it should enable the
feature only after appropriate access policy checks.
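The temporary credential switch described above can be modelled in user
space as follows. This is only a sketch: struct cred_model, the global
current_cred_model and the *_model helpers are hypothetical stand-ins for
the kernel's struct cred, the task's current credentials, and
override_creds()/revert_creds(); it is not the code added by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical credential record; the kernel uses struct cred. */
struct cred_model {
	unsigned int uid;
};

/* Stand-in for the calling task's current credentials. */
static const struct cred_model *current_cred_model;

/* Install new credentials and return the previous ones. */
static const struct cred_model *
override_creds_model(const struct cred_model *new_cred)
{
	const struct cred_model *old = current_cred_model;

	current_cred_model = new_cred;
	return old;
}

/* Restore the credentials saved by override_creds_model(). */
static void revert_creds_model(const struct cred_model *old)
{
	current_cred_model = old;
}

/* Run one I/O callback as the daemon, then restore the caller's creds. */
static unsigned int
passthrough_io_as_daemon(const struct cred_model *daemon_cred,
			 unsigned int (*io)(void))
{
	const struct cred_model *old;
	unsigned int ret;

	assert(daemon_cred != NULL && io != NULL);
	old = override_creds_model(daemon_cred);
	ret = io();
	revert_creds_model(old);	/* reverted as soon as the I/O completes */
	return ret;
}

/* Example I/O callback: reports the uid the operation ran as. */
static unsigned int report_effective_uid(void)
{
	return current_cred_model->uid;
}
```

With the caller's credentials installed first, passing the daemon's
credentials and report_effective_uid to passthrough_io_as_daemon() returns
the daemon's uid, and the caller's credentials are back in place afterwards.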
Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-8-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Idb4f03a2ce7c536691e5eaf8fadadfcf002e1677
Signed-off-by: Alessio Balsini <balsini@google.com>
All the read and write operations performed on fuse_files which have the
passthrough feature enabled are forwarded to the associated lower file
system file via VFS.
Sending the request directly to the lower file system avoids the
userspace round-trip that, because of possible context switches and
additional operations, might reduce the overall performance, especially
in those cases where caching doesn't help, for example in reads at
random offsets.
Whether a fuse_file has an associated lower file system file can be
verified by checking the validity of its passthrough_filp pointer.
This pointer is not NULL only if passthrough has been successfully
enabled via the appropriate ioctl().
When a read/write operation is requested for a FUSE file with
passthrough enabled, a new equivalent VFS request is generated, which
instead targets the lower file system file.
The VFS layer performs additional checks that allow for safer operations
but may cause the operation to fail if the process accessing the FUSE
file system does not have access to the lower file system.
This change only implements synchronous requests in passthrough,
returning an error in the case of asynchronous operations, yet covering
the majority of the use cases.
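The dispatch rule above (passthrough only when passthrough_filp is set,
synchronous requests only) can be sketched in user space as follows. The
struct and function names are hypothetical, and -EOPNOTSUPP is an assumed
error code: the message only says that asynchronous operations return an
error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct file and struct fuse_file. */
struct lower_file {
	int id;
};

struct fuse_file_model {
	struct lower_file *passthrough_filp;	/* NULL unless the ioctl() succeeded */
};

enum io_path { IO_PATH_DAEMON, IO_PATH_PASSTHROUGH };

/* Pick the path for a read/write; async requests are rejected in passthrough. */
static int select_io_path(const struct fuse_file_model *ff, bool is_async,
			  enum io_path *path)
{
	assert(ff != NULL && path != NULL);

	if (ff->passthrough_filp == NULL) {
		*path = IO_PATH_DAEMON;		/* regular userspace round-trip */
		return 0;
	}
	if (is_async)
		return -EOPNOTSUPP;		/* assumed error code */
	*path = IO_PATH_PASSTHROUGH;		/* forward to the lower file via VFS */
	return 0;
}
```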
Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-6-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Ifbe6a247fe7338f87d078fde923f0252eeaeb668
Signed-off-by: Alessio Balsini <balsini@google.com>
Expose the FUSE_PASSTHROUGH interface to user space and declare all the
basic data structures and functions as the skeleton on top of which the
FUSE passthrough functionality will be built.
As part of this, introduce the new FUSE passthrough ioctl, which allows
the FUSE daemon to specify a direct connection between a FUSE file and a
lower file system file. Such ioctl requires user space to pass the file
descriptor of one of its opened files through the fuse_passthrough_out
data structure introduced in this patch. This structure includes extra
fields for possible future extensions.
Also, add the passthrough functions for the set-up and tear-down of the
data structures and locks that will be used both when fuse_conns and
fuse_files are created/deleted.
Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-4-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: I732532581348adadda5b5048a9346c2b0868d539
Signed-off-by: Alessio Balsini <balsini@google.com>
Changes in 5.10.6
Revert "drm/amd/display: Fix memory leaks in S3 resume"
Revert "mtd: spinand: Fix OOB read"
rtc: pcf2127: move watchdog initialisation to a separate function
rtc: pcf2127: only use watchdog when explicitly available
dt-bindings: rtc: add reset-source property
kdev_t: always inline major/minor helper functions
Bluetooth: Fix attempting to set RPA timeout when unsupported
ALSA: hda/realtek - Modify Dell platform name
ALSA: hda/hdmi: Fix incorrect mutex unlock in silent_stream_disable()
drm/i915/tgl: Fix Combo PHY DPLL fractional divider for 38.4MHz ref clock
scsi: ufs: Allow an error return value from ->device_reset()
scsi: ufs: Re-enable WriteBooster after device reset
RDMA/core: remove use of dma_virt_ops
RDMA/siw,rxe: Make emulated devices virtual in the device tree
fuse: fix bad inode
perf: Break deadlock involving exec_update_mutex
rwsem: Implement down_read_killable_nested
rwsem: Implement down_read_interruptible
exec: Transform exec_update_mutex into a rw_semaphore
mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start
Linux 5.10.6
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id4c57a151a1e8f2162163d2337b6055f04edbe9b
[ Upstream commit 5d069dbe8aaf2a197142558b6fb2978189ba3454 ]
Jan Kara's analysis of the syzbot report (edited):
The reproducer opens a directory on FUSE filesystem, it then attaches
dnotify mark to the open directory. After that a fuse_do_getattr() call
finds that attributes returned by the server are inconsistent, and calls
make_bad_inode() which, among other things does:
inode->i_mode = S_IFREG;
This then confuses dnotify which doesn't tear down its structures
properly and eventually crashes.
Avoid calling make_bad_inode() on a live inode: switch to a private flag on
the fuse inode. Also add the test to ops which the bad_inode_ops would
have caught.
This bug goes back to the initial merge of fuse in 2.6.14...
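A user-space sketch of the approach described above: the inode is marked
bad through a private flag, i_mode is left alone, and each operation tests
the flag itself. All names here (struct inode_model, MODEL_I_BAD,
getattr_model) are hypothetical, and -EIO is assumed to be the error the
bad_inode_ops would have returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_I_BAD (1u << 0)	/* hypothetical bit in a private state word */

struct inode_model {
	unsigned int i_mode;	/* never rewritten when the inode goes bad */
	unsigned int state;	/* private flags, invisible to dnotify and friends */
};

static void mark_bad(struct inode_model *inode)
{
	inode->state |= MODEL_I_BAD;
}

static bool is_bad(const struct inode_model *inode)
{
	return (inode->state & MODEL_I_BAD) != 0;
}

/* Every operation performs the test the bad_inode_ops would have provided. */
static int getattr_model(const struct inode_model *inode, unsigned int *mode_out)
{
	assert(inode != NULL && mode_out != NULL);
	if (is_bad(inode))
		return -EIO;	/* assumed error code */
	*mode_out = inode->i_mode;
	return 0;
}
```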
Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Allows FUSE to report to inotify that it is acting as a layered filesystem.
The userspace component returns a string representing the location of the
underlying file. If the string cannot be resolved into a path, the top
level path is returned instead.
Bug: 23904372
Bug: 171780975
Test: FileObserverTest and FileObserverTestLegacyPath on cuttlefish
Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
When using FUSE passthrough, read/write operations are directly forwarded
to the lower file system file through VFS, but there is no guarantee that
the process that is triggering the request has the right permissions to
access the lower file system. This would cause the read/write access to
fail.
In passthrough file systems, where the FUSE daemon is responsible for the
enforcement of the lower file system access policies, it often happens that
the process dealing with the FUSE file system doesn't have access to the
lower file system.
Because the FUSE daemon is in charge of implementing the FUSE file
operations, which in the case of read/write operations usually simply amount
to copying memory buffers from/to the lower file system, these operations
are executed with the FUSE daemon privileges.
This patch adds a reference to the FUSE daemon credentials, referenced at
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used to
temporarily raise the user credentials when accessing lower file system
files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This is
because it is still the FUSE daemon that decides whether to enable the
passthrough feature at file open time, and it should enable the feature only
after appropriate access policy checks.
Bug: 168023149
Link: https://lore.kernel.org/lkml/20201026125016.1905945-6-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I1123f8113578eb8713f2b777a1b5ec76882bd762
All the read and write operations performed on fuse_files which have the
passthrough feature enabled are forwarded to the associated lower file
system file via VFS.
Sending the request directly to the lower file system avoids the userspace
round-trip that, because of possible context switches and additional
operations, might reduce the overall performance, especially in those cases
where caching doesn't help, for example in reads at random offsets.
Whether a fuse_file has an associated lower file system file can be verified
by checking the validity of its passthrough_filp pointer. This pointer is
not NULL only if passthrough has been successfully enabled via the
appropriate ioctl().
When a read/write operation is requested for a FUSE file with passthrough
enabled, a new equivalent VFS request is generated, which instead targets
the lower file system file.
The VFS layer performs additional checks that allow for safer operations
but may cause the operation to fail if the process accessing the FUSE file
system does not have access to the lower file system.
This change only implements synchronous requests in passthrough, returning
an error in the case of asynchronous operations, yet covering the majority
of the use cases.
Bug: 168023149
Link: https://lore.kernel.org/lkml/20201026125016.1905945-4-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: If76bb8725e1ac567f9dbe3edb79ebb4d43d77dfb
Expose the FUSE_PASSTHROUGH interface to userspace and declare all the
basic data structures and functions as the skeleton on top of which the
FUSE passthrough functionality will be built.
As part of this, introduce the new FUSE passthrough ioctl(), which allows
the FUSE daemon to specify a direct connection between a FUSE file and a
lower file system file. Such ioctl() requires userspace to pass the file
descriptor of one of its opened files through the fuse_passthrough_out data
structure introduced in this patch. This structure includes extra fields
for possible future extensions.
Also, add the passthrough functions for the set-up and tear-down of the
data structures and locks that will be used both when fuse_conns and
fuse_files are created/deleted.
Bug: 168023149
Link: https://lore.kernel.org/lkml/20201026125016.1905945-2-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I6dd150b93607e10ed53f7e7975b35b6090080fa2
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/fuse/fuse_i.h
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifa200ce8fae0e3b38c86351006824c62328c00f7
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT
in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the
.d_automount implementation creates a new submount at that location, so
that the submount gets a distinct st_dev value.
Note that all submounts get a distinct superblock and a distinct st_dev
value, so for virtio-fs, even if the same filesystem is mounted more than
once on the host, none of its mount points will have the same st_dev. We
need distinct superblocks because the superblock points to the root node,
but the different host mounts may show different trees (e.g. due to
submounts in some of them, but not in others).
Right now, this behavior is only enabled when fuse_conn.auto_submounts is
set, which is the case only for virtio-fs.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Submounts have their own superblock, which needs to be initialized.
However, they do not have a fuse_fs_context associated with them, and
the root node's attributes should be taken from the mountpoint's node.
Extend fuse_fill_super_common() to work for submounts by making the @ctx
parameter optional, and by adding a @submount_finode parameter.
(There is a plain "unsigned" in an existing code block that is being
indented by this commit. Extend it to "unsigned int" so checkpatch does
not complain.)
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We want to allow submounts for the same fuse_conn, but with different
superblocks so that each of the submounts has its own device ID. To do
so, we need to split all mount-specific information off of fuse_conn
into a new fuse_mount structure, so that multiple mounts can share a
single fuse_conn.
We need to take care only to perform connection-level actions once (i.e.
when the fuse_conn and thus the first fuse_mount are established, or
when the last fuse_mount and thus the fuse_conn are destroyed). For
example, fuse_sb_destroy() must not invoke fuse_send_destroy() until the
last superblock is released.
To do so, we keep track of which fuse_mount is the root mount and
perform all fuse_conn-level actions only when this fuse_mount is
involved.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the last commit, all functions that handle some existing fuse_req
no longer need to be given the associated fuse_conn, because they can
get it from the fuse_req object.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Every fuse_req belongs to a fuse_conn. Right now, we always know which
fuse_conn that is based on the respective device, but we want to allow
multiple (sub)mounts per single connection, and then the corresponding
filesystem is not going to be so trivial to obtain.
Storing a pointer to the associated fuse_conn in every fuse_req will
allow us to trivially find any request's superblock (and thus
filesystem) even then.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add logic to free up a busy memory range. A freed memory range is
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.
A process can also steal one of its own busy dax ranges if no free range
is available. I will refer to this as direct reclaim.
If no free range is available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.
For reclaiming a range, as of now we need to hold the following locks in
the specified order.
down_write(&fi->i_mmap_sem);
down_write(&fi->dax->sem);
We look for a free range in the following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
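The A/B/C order above can be sketched as a small decision function. This is
a user-space model with hypothetical names (struct dax_pool_model,
get_range); the real code works on range lists under the locks named
earlier, which the sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

enum range_source {
	RANGE_FROM_FREE_POOL,		/* step A */
	RANGE_FROM_DIRECT_RECLAIM,	/* step B */
	RANGE_MUST_WAIT			/* step C */
};

/* Hypothetical counters standing in for the real range lists. */
struct dax_pool_model {
	unsigned int nr_free;		/* ranges in the free pool */
	unsigned int nr_inode_busy;	/* busy ranges owned by the same inode */
};

static enum range_source get_range(struct dax_pool_model *pool)
{
	assert(pool != NULL);

	if (pool->nr_free > 0) {
		pool->nr_free--;		/* A. take a free range */
		return RANGE_FROM_FREE_POOL;
	}
	if (pool->nr_inode_busy > 0)
		return RANGE_FROM_DIRECT_RECLAIM; /* B. reuse one of our own ranges */
	return RANGE_MUST_WAIT;			/* C. caller sleeps on the waitq */
}
```

In step B the busy count is left unchanged because the stolen range is
immediately reused by the same inode.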
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently in fuse we don't seem to have any lock which can serialize the
fault path with the truncate/punch_hole path. With dax support I need one
for the following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of the
fault and wants to serialize with truncate/punch_hole, and it explicitly
mentions this.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                               const struct iomap_ops *ops)
        /*
         * Check whether offset isn't beyond end of file now. Caller is
         * supposed to hold locks serializing us with truncate / punch hole so
         * this is a reliable test.
         */
        max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating the file on the host first and then removing mappings
in the guest later (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am taking this semaphore only in the dax fault path and
not in the regular fault path because existing code does not have one.
Maybe existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am focusing
only on the DAX path, which is a new path.
Also added logic to take fuse_inode->i_mmap_sem in the
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exclusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch implements
read/write.
We make use of an interval tree to keep track of per-inode dax mappings.
Do not use dax for file-extending writes; instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write
and the i_size change atomic w.r.t. crash.
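The rule for file-extending writes can be sketched as a single predicate.
This is a user-space model with hypothetical names (choose_write_path, enum
write_path), and it assumes that "file extending" means the write ends
beyond the current i_size.

```c
#include <assert.h>
#include <stdint.h>

enum write_path { WRITE_VIA_DAX, WRITE_VIA_DAEMON };

/* A write that would grow the file is sent to the daemon as a WRITE message. */
static enum write_path choose_write_path(uint64_t i_size, uint64_t pos,
					 uint64_t len)
{
	assert(len <= UINT64_MAX - pos);	/* pos + len must not wrap */
	return (pos + len > i_size) ? WRITE_VIA_DAEMON : WRITE_VIA_DAX;
}
```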
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
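The check can be sketched as follows, assuming that map_alignment carries
the log2 of the required byte alignment and that the 2MB mapping size
corresponds to a shift of 21. The names DAX_RANGE_SHIFT, dax_alignment_ok
and offset_is_aligned are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DAX_RANGE_SHIFT 21u	/* 2MB mappings, per the message above */

/* True if every 2MB-aligned offset also meets the device's constraint. */
static bool dax_alignment_ok(uint16_t map_alignment_log2)
{
	return map_alignment_log2 <= DAX_RANGE_SHIFT;
}

/* Helper showing what "aligned" means for a given log2 alignment. */
static bool offset_is_aligned(uint64_t offset, uint16_t map_alignment_log2)
{
	assert(map_alignment_log2 < 64);
	return (offset & ((UINT64_C(1) << map_alignment_log2) - 1)) == 0;
}
```

For example, a host with 64KB pages would report 16, which passes because
any 2MB-aligned offset is also 64KB aligned; a requirement of 22 (4MB)
would fail the check at connection time.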
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a mount option to allow using dax with virtio_fs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This option was introduced so that for virtio_fs we don't show any mount
options in fuse_show_options(), because we don't offer any of these options
to be controlled by the mounter.
Very soon we are planning to introduce the option "dax", which the mounter
should be able to specify, so no_mount_options no longer works.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In virtiofs (unlike in regular fuse) processing of async replies is
serialized. This can result in a deadlock in rare corner cases when
there's a circular dependency between the completion of two or more async
replies.
Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR ==
SCRATCH_MNT (which is a misconfiguration):
- Process A is waiting for page lock in worker thread context and blocked
(virtio_fs_requests_done_work()).
- Process B is holding page lock and waiting for pending writes to
finish (fuse_wait_on_page_writeback()).
- Write requests are waiting in virtqueue and can't complete because
worker thread is blocked on page lock (process A).
Fix this by creating a unique work_struct for each async reply that can
block (O_DIRECT read).
Fixes: a62a8ef9d9 ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Normal, synchronous requests will have their args allocated on the stack.
After the FR_FINISHED bit is set by receiving the reply from the userspace
fuse server, the originating task may return and reuse the stack frame,
resulting in an Oops if the args structure is dereferenced.
Fix by setting a flag in the request itself upon initializing, indicating
whether it has an asynchronous ->end() callback.
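A user-space sketch of the fix: whether the request is asynchronous is
recorded in the request when it is initialized, so the completion path never
has to look at an args structure that may already be gone. The names
(struct args_model, struct req_model, req_init, req_end) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the request arguments; only ->end() matters here. */
struct args_model {
	void (*end)(struct args_model *args);
	int ended;
};

struct req_model {
	struct args_model *args;
	bool async;	/* recorded at init time, while args is still valid */
	bool finished;
};

static void req_init(struct req_model *req, struct args_model *args)
{
	assert(req != NULL && args != NULL);
	req->args = args;
	req->async = (args->end != NULL);
	req->finished = false;
}

/* Completion path: args is dereferenced only for asynchronous requests. */
static void req_end(struct req_model *req)
{
	assert(req != NULL);
	req->finished = true;
	if (req->async)
		req->args->end(req->args);
}

/* Example ->end() callback. */
static void mark_ended(struct args_model *args)
{
	args->ended = 1;
}
```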
Reported-by: Kyle Sanderson <kyle.leet@gmail.com>
Reported-by: Michael Stapelberg <michael+lkml@stapelberg.ch>
Fixes: 2b319d1f6f ("fuse: don't dereference req->args on finished request")
Cc: <stable@vger.kernel.org> # v5.4
Tested-by: Michael Stapelberg <michael+lkml@stapelberg.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If a filesystem returns negative inode sizes, future reads on the file were
causing the cpu to spin on truncate_pagecache.
Create a helper to validate the attributes. This now does two things:
- check the file mode
- check if the file size fits in i_size without overflowing
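The two checks can be sketched in user space as follows. struct attr_model
and the function names are hypothetical, and the sketch assumes that "fits
in i_size" means the unsigned 64-bit size does not exceed LLONG_MAX.

```c
#define _XOPEN_SOURCE 700	/* for the S_IS*() and S_IF* macros */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Only the two fields these checks need. */
struct attr_model {
	uint64_t size;
	uint32_t mode;
};

/* The mode must name one of the known file types. */
static bool valid_type(uint32_t mode)
{
	return S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode) ||
	       S_ISCHR(mode) || S_ISBLK(mode) || S_ISFIFO(mode) ||
	       S_ISSOCK(mode);
}

/* The size must be representable as a non-negative signed 64-bit value. */
static bool valid_size(uint64_t size)
{
	return size <= (uint64_t)LLONG_MAX;
}

static bool invalid_attr(const struct attr_model *attr)
{
	assert(attr != NULL);
	return !valid_type(attr->mode) || !valid_size(attr->size);
}
```

A server that returns a "negative" size shows up here as a value above
LLONG_MAX and is rejected before it can reach truncate_pagecache().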
Reported-by: Arijit Banerjee <arijit@rubrik.com>
Fixes: d8a5ba4545 ("[PATCH] FUSE - core")
Cc: <stable@vger.kernel.org> # v2.6.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Virtio-fs does not accept any mount options, so it's confusing and wrong to
show any in /proc/mounts.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a basic file system module for virtio-fs. This does not yet contain
shared data support between host and guest or metadata coherency speedups.
However it is already significantly faster than virtio-9p.
Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.
- Use fuse protocol (instead of 9p) for communication between guest and
host. Guest kernel will be fuse client and a fuse server will run on
host to serve the requests.
- For data access inside guest, mmap portion of file in QEMU address space
and guest accesses this memory using dax. That way guest page cache is
bypassed and there is only one copy of data (on host). This will also
enable mmap(MAP_SHARED) between guests.
- For metadata coherency, there is a shared memory region which contains
version number associated with metadata and any guest changing metadata
updates version number and other guests refresh metadata on next access.
This is yet to be implemented.
How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location of
the virtual machine and hypervisor to avoid communication (vmexits).
DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).
These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.
Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control
the level of coherence between the guest and host filesystems.
- cache=none
metadata, data and pathname lookup are not cached in guest. They are
always fetched from host and any changes are immediately pushed to host.
- cache=always
metadata, data and pathname lookup are cached in guest and never expire.
- cache=auto
metadata and pathname lookup cache expires after a configured amount of
time (default is 1 second). Data is cached while the file is open
(close to open consistency).
- writeback/no_writeback
These options control the writeback strategy. If writeback is disabled,
then normal writes will immediately be synchronized with the host fs.
If writeback is enabled, then writes may be cached in the guest until
the file is closed or an fsync(2) performed. This option has no effect
on mmap-ed writes or writes going through the DAX mechanism.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
virtio-fs does not support aborting requests which are being
processed, that is, requests which have been sent to the fuse daemon on
the host.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Don't hold onto a dentry in the lru list if we need to re-lookup it anyway
at the next access. Only do this if explicitly enabled, otherwise it could
result in a performance regression.
A more advanced version of this patch would periodically flush out dentries
from the lru which have gone stale.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As of now fuse_dev_alloc() both allocates a fuse device and installs it in
fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation fails.
virtio-fs needs to initialize multiple fuse devices (one per virtio queue).
It initializes one fuse device as part of call to fuse_fill_super_common()
and rest of the devices are allocated and installed after that.
But, we can't afford to fail after calling fuse_fill_super_common() as we
don't have a way to undo all the actions done by fuse_fill_super_common().
So to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into fuse connection
later.
This patch provides two separate helpers for fuse device allocation and
fuse device installation in fuse_conn.
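The split can be sketched as two phases: an allocation phase that may fail
before anything is visible, and an installation phase that cannot fail. This
is a user-space model; struct dev_model, struct conn_model and the three
helpers are hypothetical and do not mirror the kernel function signatures.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_model {
	int installed;
};

struct conn_model {
	struct dev_model **devices;
	size_t nr_devices;
};

/* Phase 1: may fail, but nothing is linked into the connection yet. */
static struct dev_model **alloc_devices(size_t n)
{
	struct dev_model **devs = calloc(n, sizeof(*devs));

	if (!devs)
		return NULL;
	for (size_t i = 0; i < n; i++) {
		devs[i] = calloc(1, sizeof(*devs[i]));
		if (!devs[i]) {
			while (i > 0)
				free(devs[--i]);
			free(devs);
			return NULL;
		}
	}
	return devs;
}

/* Phase 2: cannot fail, it only links the preallocated devices. */
static void install_devices(struct conn_model *fc, struct dev_model **devs,
			    size_t n)
{
	assert(fc != NULL && devs != NULL);
	for (size_t i = 0; i < n; i++)
		devs[i]->installed = 1;
	fc->devices = devs;
	fc->nr_devices = n;
}

static void free_devices(struct dev_model **devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		free(devs[i]);
	free(devs);
}
```

Any step that cannot be undone would sit between the two phases, so that an
allocation failure is always handled before the point of no return.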
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The /dev/fuse device uses fiq->waitq and fasync to signal that requests are
available. These mechanisms do not apply to virtio-fs. This patch
introduces callbacks so alternative behavior can be used.
Note that queue_interrupt() changes along these lines:
spin_lock(&fiq->waitq.lock);
wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock this should be safe.
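The callback idea can be sketched as an ops table attached to the input
queue, so that each transport decides how to signal that a request is
available. This is a user-space model: struct iqueue_ops_model, its single
wake_pending hook and the two example implementations are hypothetical and
do not reproduce the set of callbacks this patch introduces.

```c
#include <assert.h>
#include <stddef.h>

struct iqueue_model;

/* Hypothetical ops table: one hook per notification the queue can raise. */
struct iqueue_ops_model {
	void (*wake_pending)(struct iqueue_model *fiq);
};

struct iqueue_model {
	const struct iqueue_ops_model *ops;
	unsigned int pending;
	unsigned int dev_wakeups;	/* counts /dev/fuse style wakeups */
	unsigned int virtio_kicks;	/* counts virtqueue style notifications */
};

/* /dev/fuse flavour: would wake fiq->waitq and send SIGIO via fasync. */
static void dev_wake_pending(struct iqueue_model *fiq)
{
	fiq->dev_wakeups++;
}

/* virtio-fs flavour: would hand the request to a virtqueue instead. */
static void virtio_wake_pending(struct iqueue_model *fiq)
{
	fiq->virtio_kicks++;
}

static const struct iqueue_ops_model dev_ops = {
	.wake_pending = dev_wake_pending,
};

static const struct iqueue_ops_model virtio_ops = {
	.wake_pending = virtio_wake_pending,
};

/* Common code queues the request and leaves signalling to the transport. */
static void queue_request_model(struct iqueue_model *fiq)
{
	assert(fiq != NULL && fiq->ops != NULL);
	fiq->pending++;
	fiq->ops->wake_pending(fiq);
}
```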
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file. In virtio-fs there is no file
descriptor because /dev/fuse is not used.
This patch extracts fuse_fill_super_common() so that both classic fuse and
virtio-fs can share the code to initialize a mount.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
File systems like virtio-fs should not have to manipulate the forget list
data structures directly. There is a helper function; use that instead.
Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
stacked filesystems can use it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c. Make the symbol visible.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>