Changes in 5.10.26
ASoC: ak4458: Add MODULE_DEVICE_TABLE
ASoC: ak5558: Add MODULE_DEVICE_TABLE
spi: cadence: set cqspi to the driver_data field of struct device
ALSA: dice: fix null pointer dereference when node is disconnected
ALSA: hda/realtek: apply pin quirk for XiaomiNotebook Pro
ALSA: hda: generic: Fix the micmute led init state
ALSA: hda/realtek: Apply headset-mic quirks for Xiaomi Redmibook Air
ALSA: hda/realtek: fix mute/micmute LEDs for HP 840 G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP 440 G8
ALSA: hda/realtek: fix mute/micmute LEDs for HP 850 G8
Revert "PM: runtime: Update device status before letting suppliers suspend"
s390/vtime: fix increased steal time accounting
s390/pci: refactor zpci_create_device()
s390/pci: remove superfluous zdev->zbus check
s390/pci: fix leak of PCI device structure
zonefs: Fix O_APPEND async write handling
zonefs: prevent use of seq files as swap file
zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()
btrfs: fix race when cloning extent buffer during rewind of an old root
btrfs: fix slab cache flags for free space tree bitmap
vhost-vdpa: fix use-after-free of v->config_ctx
vhost-vdpa: set v->config_ctx to NULL if eventfd_ctx_fdget() fails
drm/amd/display: Correct algorithm for reversed gamma
ASoC: fsl_ssi: Fix TDM slot setup for I2S mode
ASoC: Intel: bytcr_rt5640: Fix HP Pavilion x2 10-p0XX OVCD current threshold
ASoC: SOF: Intel: unregister DMIC device on probe error
ASoC: SOF: intel: fix wrong poll bits in dsp power down
ASoC: qcom: sdm845: Fix array out of bounds access
ASoC: qcom: sdm845: Fix array out of range on rx slim channels
ASoC: codecs: wcd934x: add a sanity check in set channel map
ASoC: qcom: lpass-cpu: Fix lpass dai ids parse
ASoC: simple-card-utils: Do not handle device clock
afs: Fix accessing YFS xattrs on a non-YFS server
afs: Stop listxattr() from listing "afs.*" attributes
ALSA: usb-audio: Fix unintentional sign extension issue
nvme: fix Write Zeroes limitations
nvme-tcp: fix misuse of __smp_processor_id with preemption enabled
nvme-tcp: fix possible hang when failing to set io queues
nvme-tcp: fix a NULL deref when receiving a 0-length r2t PDU
nvmet: don't check iosqes,iocqes for discovery controllers
nfsd: Don't keep looking up unhashed files in the nfsd file cache
nfsd: don't abort copies early
NFSD: Repair misuse of sv_lock in 5.10.16-rt30.
NFSD: fix dest to src mount in inter-server COPY
svcrdma: disable timeouts on rdma backchannel
vfio: IOMMU_API should be selected
vhost_vdpa: fix the missing irq_bypass_unregister_producer() invocation
sunrpc: fix refcount leak for rpc auth modules
i915/perf: Start hrtimer only if sampling the OA buffer
pstore: Fix warning in pstore_kill_sb()
io_uring: ensure that SQPOLL thread is started for exit
net/qrtr: fix __netdev_alloc_skb call
kbuild: Fix <linux/version.h> for empty SUBLEVEL or PATCHLEVEL again
cifs: fix allocation size on newly created files
riscv: Correct SPARSEMEM configuration
scsi: lpfc: Fix some error codes in debugfs
scsi: myrs: Fix a double free in myrs_cleanup()
scsi: ufs: ufs-mediatek: Correct operator & -> &&
RISC-V: correct enum sbi_ext_rfence_fid
counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED
gpiolib: Assign fwnode to parent's if no primary one provided
nvme-rdma: fix possible hang when failing to set io queues
ibmvnic: add some debugs
ibmvnic: serialize access to work queue on remove
tty: serial: stm32-usart: Remove set but unused 'cookie' variables
serial: stm32: fix DMA initialization error handling
bpf: Declare __bpf_free_used_maps() unconditionally
RDMA/rtrs: Remove unnecessary argument dir of rtrs_iu_free
RDMA/rtrs-srv: Jump to dereg_mr label if allocate iu fails
RDMA/rtrs: Introduce rtrs_post_send
RDMA/rtrs: Fix KASAN: stack-out-of-bounds bug
module: merge repetitive strings in module_sig_check()
module: avoid *goto*s in module_sig_check()
module: harden ELF info handling
scsi: pm80xx: Make mpi_build_cmd locking consistent
scsi: pm80xx: Make running_req atomic
scsi: pm80xx: Fix pm8001_mpi_get_nvmd_resp() race condition
scsi: pm8001: Neaten debug logging macros and uses
scsi: libsas: Remove notifier indirection
scsi: libsas: Introduce a _gfp() variant of event notifiers
scsi: mvsas: Pass gfp_t flags to libsas event notifiers
scsi: isci: Pass gfp_t flags in isci_port_link_down()
scsi: isci: Pass gfp_t flags in isci_port_link_up()
scsi: isci: Pass gfp_t flags in isci_port_bc_change_received()
RDMA/mlx5: Allow creating all QPs even when non RDMA profile is used
powerpc/sstep: Fix load-store and update emulation
powerpc/sstep: Fix darn emulation
i40e: Fix endianness conversions
net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081
MIPS: compressed: fix build with enabled UBSAN
drm/amd/display: turn DPMS off on connector unplug
iwlwifi: Add a new card for MA family
io_uring: fix inconsistent lock state
media: cedrus: h264: Support profile controls
ibmvnic: remove excessive irqsave
s390/qeth: schedule TX NAPI on QAOB completion
drm/amd/pm: fulfill the Polaris implementation for get_clock_by_type_with_latency()
io_uring: don't attempt IO reissue from the ring exit path
io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
net: bonding: fix error return code of bond_neigh_init()
regulator: pca9450: Add SD_VSEL GPIO for LDO5
regulator: pca9450: Enable system reset on WDOG_B assertion
regulator: pca9450: Clear PRESET_EN bit to fix BUCK1/2/3 voltage setting
gfs2: Add common helper for holding and releasing the freeze glock
gfs2: move freeze glock outside the make_fs_rw and _ro functions
gfs2: bypass signal_our_withdraw if no journal
powerpc: Force inlining of cpu_has_feature() to avoid build failure
usb-storage: Add quirk to defeat Kindle's automatic unload
usbip: Fix incorrect double assignment to udc->ud.tcp_rx
usb: gadget: configfs: Fix KASAN use-after-free
usb: typec: Remove vdo[3] part of tps6598x_rx_identity_reg struct
usb: typec: tcpm: Invoke power_supply_changed for tcpm-source-psy-
usb: dwc3: gadget: Allow runtime suspend if UDC unbinded
usb: dwc3: gadget: Prevent EP queuing while stopping transfers
thunderbolt: Initialize HopID IDAs in tb_switch_alloc()
thunderbolt: Increase runtime PM reference count on DP tunnel discovery
iio:adc:stm32-adc: Add HAS_IOMEM dependency
iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
iio: adis16400: Fix an error code in adis16400_initial_setup()
iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler
iio: adc: ab8500-gpadc: Fix off by 10 to 3
iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask
iio: adc: adi-axi-adc: add proper Kconfig dependencies
iio: hid-sensor-humidity: Fix alignment issue of timestamp channel
iio: hid-sensor-prox: Fix scale not correct issue
iio: hid-sensor-temperature: Fix issues of timestamp channel
counter: stm32-timer-cnt: fix ceiling write max value
counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register
PCI: rpadlpar: Fix potential drc_name corruption in store functions
perf/x86/intel: Fix a crash caused by zero PEBS status
perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT
x86/ioapic: Ignore IRQ2 again
kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
x86: Move TS_COMPAT back to asm/thread_info.h
x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
efivars: respect EFI_UNSUPPORTED return from firmware
ext4: fix error handling in ext4_end_enable_verity()
ext4: find old entry again if failed to rename whiteout
ext4: stop inode update before return
ext4: do not try to set xattr into ea_inode if value is empty
ext4: fix potential error in ext4_do_update_inode
ext4: fix rename whiteout with fast commit
MAINTAINERS: move some real subsystems off of the staging mailing list
MAINTAINERS: move the staging subsystem to lists.linux.dev
static_call: Fix static_call_update() sanity check
efi: use 32-bit alignment for efi_guid_t literals
firmware/efi: Fix a use after bug in efi_mem_reserve_persistent
genirq: Disable interrupts for force threaded handlers
x86/apic/of: Fix CPU devicetree-node lookups
cifs: Fix preauth hash corruption
Linux 5.10.26
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6f6bdd1dc46dc744c848e778f9edd0be558b46ac
Changes in 5.10.12
gpio: mvebu: fix pwm .get_state period calculation
Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
futex: Ensure the correct return value from futex_lock_pi()
futex: Replace pointless printk in fixup_owner()
futex: Provide and use pi_state_update_owner()
rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
futex: Use pi_state_update_owner() in put_pi_state()
futex: Simplify fixup_pi_state_owner()
futex: Handle faults correctly for PI futexes
HID: wacom: Correct NULL dereference on AES pen proximity
HID: multitouch: Apply MT_QUIRK_CONFIDENCE quirk for multi-input devices
media: Revert "media: videobuf2: Fix length check for single plane dmabuf queueing"
media: v4l2-subdev.h: BIT() is not available in userspace
RDMA/vmw_pvrdma: Fix network_hdr_type reported in WC
iwlwifi: dbg: Don't touch the tlv data
kernel/io_uring: cancel io_uring before task works
io_uring: inline io_uring_attempt_task_drop()
io_uring: add warn_once for io_uring_flush()
io_uring: stop SQPOLL submit on creator's death
io_uring: fix null-deref in io_disable_sqo_submit
io_uring: do sqo disable on install_fd error
io_uring: fix false positive sqo warning on flush
io_uring: fix uring_flush in exit_files() warning
io_uring: fix skipping disabling sqo on exec
io_uring: dont kill fasync under completion_lock
io_uring: fix sleeping under spin in __io_clean_op
objtool: Don't fail on missing symbol table
mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint
mm: fix a race on nr_swap_pages
tools: Factor HOSTCC, HOSTLD, HOSTAR definitions
printk: fix buffer overflow potential for print_text()
printk: fix string termination for record_print_text()
Linux 5.10.12
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6d96ec78494ebbc0daf4fdecfc13e522c6bd6b42
commit 34b1a1ce1458f50ef27c54e28eb9b1947012907a upstream
fixup_pi_state_owner() tries to ensure that the state of the rtmutex,
pi_state and the user space value related to the PI futex are consistent
before returning to user space. In case that the user space value update
faults and the fault cannot be resolved by faulting the page in via
fault_in_user_writeable() the function returns with -EFAULT and leaves
the rtmutex and pi_state owner state inconsistent.
A subsequent futex_unlock_pi() operates on the inconsistent pi_state and
releases the rtmutex despite not owning it which can corrupt the RB tree of
the rtmutex and cause a subsequent kernel stack use after free.
It was suggested to loop forever in fixup_pi_state_owner() if the fault
cannot be resolved, but that results in runaway tasks which is especially
undesired when the problem happens due to a programming error and not due
to malice.
As the user space value cannot be fixed up, the proper solution is to make
the rtmutex and the pi_state consistent so both have the same owner. This
leaves the user space value out of sync. Any subsequent operation on the
futex will fail because the 10th rule of PI futexes (pi_state owner and
user space value are consistent) has been violated.
As a consequence this removes the inept attempts of 'fixing' the situation
in case that the current task owns the rtmutex when returning with an
unresolvable fault by unlocking the rtmutex which left pi_state::owner and
rtmutex::owner out of sync in a different and only slightly less dangerous
way.
Fixes: 1b7558e457 ("futexes: fix fault handling in futex_lock_pi")
Reported-by: gzobqq@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 upstream
Updating pi_state::owner is done at several places with the same
code. Provide a function for it and use that at the obvious places.
This is also a preparation for a bug fix to avoid yet another copy of the
same code or alternatively introducing a completely unpenetratable mess of
gotos.
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 04b79c55201f02ffd675e1231d731365e335c307 upstream
If that unexpected case of inconsistent arguments ever happens then the
futex state is left completely inconsistent and the printk is not really
helpful. Replace it with a warning and make the state consistent.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream
In case that futex_lock_pi() was aborted by a signal or a timeout and the
task returned without acquiring the rtmutex, but is the designated owner of
the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to
establish consistent state. In that case it invokes fixup_pi_state_owner()
which in turn tries to acquire the rtmutex again. If that succeeds then it
does not propagate this success to fixup_owner() and futex_lock_pi()
returns -EINTR or -ETIMEOUT despite having the futex locked.
Return success from fixup_pi_state_owner() in all cases where the current
task owns the rtmutex and therefore the futex and propagate it correctly
through fixup_owner(). Fixup the other callsite which does not expect a
positive return value.
Fixes: c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Steps on the way to 5.10-rc4
Resolves conflict in:
arch/arm64/kvm/sys_regs.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id188ccbec038cf7e30c204e9f5b7866f72b6640d
Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner().
This is one possible chain of events leading to this:
Task Prio Operation
T1 120 lock(F)
T2 120 lock(F) -> blocks (top waiter)
T3 50 (RT) lock(F) -> boosts T1 and blocks (new top waiter)
XX timeout/ -> wakes T2
signal
T1 50 unlock(F) -> wakes T3 (rtmutex->owner == NULL, waiter bit is set)
T2 120 cleanup -> try_to_take_mutex() fails because T3 is the top waiter
and the lower priority T2 cannot steal the lock.
-> fixup_pi_state_owner() sees newowner == NULL -> BUG_ON()
The comment states that this is invalid and rt_mutex_real_owner() must
return a non NULL owner when the trylock failed, but in case of a queued
and woken up waiter rt_mutex_real_owner() == NULL is a valid transient
state. The higher priority waiter has simply not yet managed to take over
the rtmutex.
The BUG_ON() is therefore wrong and this is just another retry condition in
fixup_pi_state_owner().
Drop the locks, so that T3 can make progress, and then try the fixup again.
Gratian provided a great analysis, traces and a reproducer. The analysis is
to the point, but it confused the hell out of that tglx dude who had to
page in all the futex horrors again. Condensed version is above.
[ tglx: Wrote comment and changelog ]
Fixes: c1e2f0eaf0 ("futex: Avoid violating the 10th rule of futex")
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com
Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de
Pull locking fixes from Thomas Gleixner:
"A couple of locking fixes:
- Fix incorrect failure injection handling in the fuxtex code
- Prevent a preemption warning in lockdep when tracking
local_irq_enable() and interrupts are already enabled
- Remove more raw_cpu_read() usage from lockdep which causes state
corruption on !X86 architectures.
- Make the nr_unused_locks accounting in lockdep correct again"
* tag 'locking-urgent-2020-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix nr_unused_locks accounting
locking/lockdep: Remove more raw_cpu_read() usage
futex: Fix incorrect should_fail_futex() handling
lockdep: Fix preemption WARN for spurious IRQ-enable
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret'
variable is set to -EFAULT and then immediately overwritten. So the failure
injection is non-functional.
Fix it by actually leaving the function and returning -EFAULT.
The Fixes tag is kinda blury because the initial commit which introduced
failure injection was already sloppy, but the below mentioned commit broke
it completely.
[ tglx: Massaged changelog ]
Fixes: 6b4f4bc9cb ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/userfaultfd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
Add the hook for the waiter list of futex to allow
vendor perform wait queue enhancement
Bug: 163431711
Signed-off-by: JianMin Liu <jian-min.liu@mediatek.com>
Change-Id: I68218b89c35b23aa5529099bb0bbbd031bdeafef
Since fshared is only conveying true/false values, declare it as bool.
In get_futex_key() the usage of fshared can be restricted to the first part
of the function. If fshared is false the function is terminated early and
the subsequent code can use a constant 'true' instead of the variable.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-5-andrealmeid@collabora.com
As stated in the coding style documentation, "if there is no cleanup
needed then just return directly", instead of jumping to a label and
then returning.
Remove such goto's and replace with a return statement. When there's a
ternary operator on the return value, replace it with the result of the
operation when it is logically possible to determine it by the control
flow.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-3-andrealmeid@collabora.com
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Continued user-access cleanups in the futex code.
- percpu-rwsem rewrite that uses its own waitqueue and atomic_t
instead of an embedded rwsem. This addresses a couple of
weaknesses, but the primary motivation was complications on the -rt
kernel.
- Introduce raw lock nesting detection on lockdep
(CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
lock differences. This too originates from -rt.
- Reuse lockdep zapped chain_hlocks entries, to conserve RAM
footprint on distro-ish kernels running into the "BUG:
MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
chain-entries pool.
- Misc cleanups, smaller fixes and enhancements - see the changelog
for details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
Documentation/locking/locktypes: Minor copy editor fixes
Documentation/locking/locktypes: Further clarifications and wordsmithing
m68knommu: Remove mm.h include from uaccess_no.h
x86: get rid of user_atomic_cmpxchg_inatomic()
generic arch_futex_atomic_op_inuser() doesn't need access_ok()
x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
objtool: whitelist __sanitizer_cov_trace_switch()
[parisc, s390, sparc64] no need for access_ok() in futex handling
sh: no need of access_ok() in arch_futex_atomic_op_inuser()
futex: arch_futex_atomic_op_inuser() calling conventions change
completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
lockdep: Add posixtimer context tracing bits
lockdep: Annotate irq_work
lockdep: Add hrtimer context tracing bits
lockdep: Introduce wait-type checks
completion: Use simple wait queues
sched/swait: Prepare usage in completions
...
Move access_ok() in and pagefault_enable()/pagefault_disable() out.
Mechanical conversion only - some instances don't really need
a separate access_ok() at all (e.g. the ones only using
get_user()/put_user(), or architectures where access_ok()
is always true); we'll deal with that in followups.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The recent futex inode life time fix changed the ordering of the futex key
union struct members, but forgot to adjust the hash function accordingly,
As a result the hashing omits the leading 64bit and even hashes beyond the
futex key causing a bad hash distribution which led to a ~100% performance
regression.
Hand in the futex key pointer instead of a random struct member and make
the size calculation based of the struct offset.
Fixes: 8019ad13ef ("futex: Fix inode life-time issue")
Reported-by: Rong Chen <rong.a.chen@intel.com>
Decoded-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rong Chen <rong.a.chen@intel.com>
Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
Now that {get,drop}_futex_key_refs() have become a glorified NOP,
remove them entirely.
The only thing get_futex_key_refs() is still doing is an smp_mb(), and
now that we don't need to (ab)use existing atomic ops to obtain them,
we can place it explicitly where we need it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
We always set 'key->private.mm' to 'current->mm', getting an extra
reference on 'current->mm' is quite pointless, because as long as the
task is blocked it isn't going to go away.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Oleg provided the following test case:
int main(void)
{
struct sched_param sp = {};
sp.sched_priority = 2;
assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
int lock = vfork();
if (!lock) {
sp.sched_priority = 1;
assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
_exit(0);
}
syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
return 0;
}
This creates an unkillable RT process spinning in futex_lock_pi() on a UP
machine or if the process is affine to a single CPU. The reason is:
parent child
set FIFO prio 2
vfork() -> set FIFO prio 1
implies wait_for_child() sched_setscheduler(...)
exit()
do_exit()
....
mm_release()
tsk->futex_state = FUTEX_STATE_EXITING;
exit_futex(); (NOOP in this case)
complete() --> wakes parent
sys_futex()
loop infinite because
tsk->futex_state == FUTEX_STATE_EXITING
The same problem can happen just by regular preemption as well:
task holds futex
...
do_exit()
tsk->futex_state = FUTEX_STATE_EXITING;
--> preemption (unrelated wakeup of some other higher prio task, e.g. timer)
switch_to(other_task)
return to user
sys_futex()
loop infinite as above
Just for the fun of it the futex exit cleanup could trigger the wakeup
itself before the task sets its futex state to DEAD.
To cure this, the handling of the exiting owner is changed so:
- A refcount is held on the task
- The task pointer is stored in a caller visible location
- The caller drops all locks (hash bucket, mmap_sem) and blocks
on task::futex_exit_mutex. When the mutex is acquired then
the exiting task has completed the cleanup and the state
is consistent and can be reevaluated.
This is not a pretty solution, but there is no choice other than returning
an error code to user space, which would break the state consistency
guarantee and open another can of problems including regressions.
For stable backports the preparatory commits ac31c7ff86 .. ba31c1a485
are required as well, but for anything older than 5.3.y the backports are
going to be provided when this hits mainline as the other dependencies for
those kernels are definitely not stable material.
Fixes: 778e9a9c3e ("pi-futex: fix exit races and locking problems")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stable Team <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
attach_to_pi_owner() returns -EAGAIN for various cases:
- Owner task is exiting
- Futex value has changed
The caller drops the held locks (hash bucket, mmap_sem) and retries the
operation. In case of the owner task exiting this can result in a live
lock.
As a preparatory step for seperating those cases, provide a distinct return
value (EBUSY) for the owner exiting case.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
exec() attempts to handle potentially held futexes gracefully by running
the futex exit handling code like exit() does.
The current implementation has no protection against concurrent incoming
waiters. The reason is that the futex state cannot be set to
FUTEX_STATE_DEAD after the cleanup because the task struct is still active
and just about to execute the new binary.
While its arguably buggy when a task holds a futex over exec(), for
consistency sake the state handling can at least cover the actual futex
exit cleanup section. This provides state consistency protection accross
the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the
cleanup has been finished, this cannot prevent subsequent attempts to
attach to the task in case that the cleanup was not successfull in mopping
up all leftovers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
Instead of relying on PF_EXITING use an explicit state for the futex exit
and set it in the futex exit function. This moves the smp barrier and the
lock/unlock serialization into the futex code.
As with the DEAD state this is restricted to the exit path as exec
continues to use the same task struct.
This allows to simplify that logic in a next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
Setting task::futex_state in do_exit() is rather arbitrarily placed for no
reason. Move it into the futex code.
Note, this is only done for the exit cleanup as the exec cleanup cannot set
the state to FUTEX_STATE_DEAD because the task struct is still in active
use.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de