commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream.
Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called. That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.
Considering the three non-blocking poll systems:
- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.
- aio poll is unaffected, since it doesn't support exclusive waits.
However, that's fragile, as someone could add this feature later.
- epoll doesn't appear to be broken by this, since its wakeup function
returns 0 when it sees POLLFREE. But this is fragile.
Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters. Add such a
function. Also make it verify that the queue really becomes empty after
all waiters have been woken up.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 185125206
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4f69da5bbbad53975024d027fa1bbe22522c6efe
Changes in 5.10.51
drm/mxsfb: Don't select DRM_KMS_FB_HELPER
drm/zte: Don't select DRM_KMS_FB_HELPER
drm/ast: Fixed CVE for DP501
drm/amd/display: fix HDCP reset sequence on reinitialize
drm/amd/amdgpu/sriov disable all ip hw status by default
drm/vc4: fix argument ordering in vc4_crtc_get_margins()
drm/bridge: nwl-dsi: Force a full modeset when crtc_state->active is changed to be true
net: pch_gbe: Use proper accessors to BE data in pch_ptp_match()
drm/amd/display: fix use_max_lb flag for 420 pixel formats
clk: renesas: rcar-usb2-clock-sel: Fix error handling in .probe()
hugetlb: clear huge pte during flush function on mips platform
atm: iphase: fix possible use-after-free in ia_module_exit()
mISDN: fix possible use-after-free in HFC_cleanup()
atm: nicstar: Fix possible use-after-free in nicstar_cleanup()
net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
drm/mediatek: Fix PM reference leak in mtk_crtc_ddp_hw_init()
net: mdio: ipq8064: add regmap config to disable REGCACHE
drm/bridge: lt9611: Add missing MODULE_DEVICE_TABLE
reiserfs: add check for invalid 1st journal block
drm/virtio: Fix double free on probe failure
net: mdio: provide shim implementation of devm_of_mdiobus_register
net/sched: cls_api: increase max_reclassify_loop
pinctrl: equilibrium: Add missing MODULE_DEVICE_TABLE
drm/scheduler: Fix hang when sched_entity released
drm/sched: Avoid data corruptions
udf: Fix NULL pointer dereference in udf_symlink function
drm/vc4: Fix clock source for VEC PixelValve on BCM2711
drm/vc4: hdmi: Fix PM reference leak in vc4_hdmi_encoder_pre_crtc_co()
e100: handle eeprom as little endian
igb: handle vlan types with checker enabled
igb: fix assignment on big endian machines
drm/bridge: cdns: Fix PM reference leak in cdns_dsi_transfer()
clk: renesas: r8a77995: Add ZA2 clock
net/mlx5e: IPsec/rep_tc: Fix rep_tc_update_skb drops IPsec packet
net/mlx5: Fix lag port remapping logic
drm: rockchip: add missing registers for RK3188
drm: rockchip: add missing registers for RK3066
net: stmmac: the XPCS obscures a potential "PHY not found" error
RDMA/rtrs: Change MAX_SESS_QUEUE_DEPTH
clk: tegra: Fix refcounting of gate clocks
clk: tegra: Ensure that PLLU configuration is applied properly
drm: bridge: cdns-mhdp8546: Fix PM reference leak in
virtio-net: Add validation for used length
ipv6: use prandom_u32() for ID generation
MIPS: cpu-probe: Fix FPU detection on Ingenic JZ4760(B)
MIPS: ingenic: Select CPU_SUPPORTS_CPUFREQ && MIPS_EXTERNAL_TIMER
drm/amd/display: Avoid HDCP over-read and corruption
drm/amdgpu: remove unsafe optimization to drop preamble ib
net: tcp better handling of reordering then loss cases
RDMA/cxgb4: Fix missing error code in create_qp()
dm space maps: don't reset space map allocation cursor when committing
dm writecache: don't split bios when overwriting contiguous cache content
dm: Fix dm_accept_partial_bio() relative to zone management commands
net: bridge: mrp: Update ring transitions.
pinctrl: mcp23s08: fix race condition in irq handler
ice: set the value of global config lock timeout longer
ice: fix clang warning regarding deadcode.DeadStores
virtio_net: Remove BUG() to avoid machine dead
net: mscc: ocelot: check return value after calling platform_get_resource()
net: bcmgenet: check return value after calling platform_get_resource()
net: mvpp2: check return value after calling platform_get_resource()
net: micrel: check return value after calling platform_get_resource()
net: moxa: Use devm_platform_get_and_ioremap_resource()
drm/amd/display: Fix DCN 3.01 DSCCLK validation
drm/amd/display: Update scaling settings on modeset
drm/amd/display: Release MST resources on switch from MST to SST
drm/amd/display: Set DISPCLK_MAX_ERRDET_CYCLES to 7
drm/amd/display: Fix off-by-one error in DML
net: phy: realtek: add delay to fix RXC generation issue
selftests: Clean forgotten resources as part of cleanup()
net: sgi: ioc3-eth: check return value after calling platform_get_resource()
drm/amdkfd: use allowed domain for vmbo validation
fjes: check return value after calling platform_get_resource()
selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM
drm/amd/display: Verify Gamma & Degamma LUT sizes in amdgpu_dm_atomic_check
xfrm: Fix error reporting in xfrm_state_construct.
dm writecache: commit just one block, not a full page
wlcore/wl12xx: Fix wl12xx get_mac error if device is in ELP
wl1251: Fix possible buffer overflow in wl1251_cmd_scan
cw1200: add missing MODULE_DEVICE_TABLE
drm/amdkfd: fix circular locking on get_wave_state
drm/amdkfd: Fix circular lock in nocpsch path
bpf: Fix up register-based shifts in interpreter to silence KUBSAN
ice: fix incorrect payload indicator on PTYPE
ice: mark PTYPE 2 as reserved
mt76: mt7615: fix fixed-rate tx status reporting
net: fix mistake path for netdev_features_strings
net: ipa: Add missing of_node_put() in ipa_firmware_load()
net: sched: fix error return code in tcf_del_walker()
io_uring: fix false WARN_ONCE
drm/amdgpu: fix bad address translation for sienna_cichlid
drm/amdkfd: Walk through list with dqm lock hold
mt76: mt7915: fix IEEE80211_HE_PHY_CAP7_MAX_NC for station mode
rtl8xxxu: Fix device info for RTL8192EU devices
MIPS: add PMD table accounting into MIPS'pmd_alloc_one
net: fec: add ndo_select_queue to fix TX bandwidth fluctuations
atm: nicstar: use 'dma_free_coherent' instead of 'kfree'
atm: nicstar: register the interrupt handler in the right place
vsock: notify server to shutdown when client has pending signal
RDMA/rxe: Don't overwrite errno from ib_umem_get()
iwlwifi: mvm: don't change band on bound PHY contexts
iwlwifi: mvm: fix error print when session protection ends
iwlwifi: pcie: free IML DMA memory allocation
iwlwifi: pcie: fix context info freeing
sfc: avoid double pci_remove of VFs
sfc: error code if SRIOV cannot be disabled
wireless: wext-spy: Fix out-of-bounds warning
cfg80211: fix default HE tx bitrate mask in 2G band
mac80211: consider per-CPU statistics if present
mac80211_hwsim: add concurrent channels scanning support over virtio
IB/isert: Align target max I/O size to initiator size
media, bpf: Do not copy more entries than user space requested
net: ip: avoid OOM kills with large UDP sends over loopback
RDMA/cma: Fix rdma_resolve_route() memory leak
Bluetooth: btusb: Fixed too many in-token issue for Mediatek Chip.
Bluetooth: Fix the HCI to MGMT status conversion table
Bluetooth: Fix alt settings for incoming SCO with transparent coding format
Bluetooth: Shutdown controller after workqueues are flushed or cancelled
Bluetooth: btusb: Add a new QCA_ROME device (0cf3:e500)
Bluetooth: L2CAP: Fix invalid access if ECRED Reconfigure fails
Bluetooth: L2CAP: Fix invalid access on ECRED Connection response
Bluetooth: btusb: Add support USB ALT 3 for WBS
Bluetooth: mgmt: Fix the command returns garbage parameter value
Bluetooth: btusb: fix bt fiwmare downloading failure issue for qca btsoc.
sched/fair: Ensure _sum and _avg values stay consistent
bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
flow_offload: action should not be NULL when it is referenced
sctp: validate from_addr_param return
sctp: add size validation when walking chunks
MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
MIPS: set mips32r5 for virt extensions
selftests/resctrl: Fix incorrect parsing of option "-t"
MIPS: MT extensions are not available on MIPS32r1
ath11k: unlock on error path in ath11k_mac_op_add_interface()
arm64: dts: rockchip: add rk3328 dwc3 usb controller node
arm64: dts: rockchip: Enable USB3 for rk3328 Rock64
loop: fix I/O error on fsync() in detached loop devices
mm,hwpoison: return -EBUSY when migration fails
io_uring: simplify io_remove_personalities()
io_uring: Convert personality_idr to XArray
io_uring: convert io_buffer_idr to XArray
scsi: iscsi: Fix race condition between login and sync thread
scsi: iscsi: Fix iSCSI cls conn state
powerpc/mm: Fix lockup on kernel exec fault
powerpc/barrier: Avoid collision with clang's __lwsync macro
powerpc/powernv/vas: Release reference to tgid during window close
drm/amdgpu: Update NV SIMD-per-CU to 2
drm/amdgpu: enable sdma0 tmz for Raven/Renoir(V2)
drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create()
drm/radeon: Call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64
drm/vc4: txp: Properly set the possible_crtcs mask
drm/vc4: crtc: Skip the TXP
drm/vc4: hdmi: Prevent clock unbalance
drm/dp: Handle zeroed port counts in drm_dp_read_downstream_info()
drm/rockchip: dsi: remove extra component_del() call
drm/amd/display: fix incorrrect valid irq check
pinctrl/amd: Add device HID for new AMD GPIO controller
drm/amd/display: Reject non-zero src_y and src_x for video planes
drm/tegra: Don't set allow_fb_modifiers explicitly
drm/msm/mdp4: Fix modifier support enabling
drm/arm/malidp: Always list modifiers
drm/nouveau: Don't set allow_fb_modifiers explicitly
drm/i915/display: Do not zero past infoframes.vsc
mmc: sdhci-acpi: Disable write protect detection on Toshiba Encore 2 WT8-B
mmc: sdhci: Fix warning message when accessing RPMB in HS400 mode
mmc: core: clear flags before allowing to retune
mmc: core: Allow UHS-I voltage switch for SDSC cards if supported
ata: ahci_sunxi: Disable DIPM
arm64: tlb: fix the TTL value of tlb_get_level
cpu/hotplug: Cure the cpusets trainwreck
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
fpga: stratix10-soc: Add missing fpga_mgr_free() call
ASoC: tegra: Set driver_name=tegra for all machine drivers
i40e: fix PTP on 5Gb links
qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
ipmi/watchdog: Stop watchdog timer when the current action is 'none'
thermal/drivers/int340x/processor_thermal: Fix tcc setting
ubifs: Fix races between xattr_{set|get} and listxattr operations
power: supply: ab8500: Fix an old bug
mfd: syscon: Free the allocated name field of struct regmap_config
nvmem: core: add a missing of_node_put
lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE
selftests/lkdtm: Fix expected text for CR4 pinning
extcon: intel-mrfld: Sync hardware and software state on init
seq_buf: Fix overflow in seq_buf_putmem_hex()
rq-qos: fix missed wake-ups in rq_qos_throttle try two
tracing: Simplify & fix saved_tgids logic
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
ipack/carriers/tpci200: Fix a double free in tpci200_pci_probe
coresight: Propagate symlink failure
coresight: tmc-etf: Fix global-out-of-bounds in tmc_update_etf_buffer()
dm zoned: check zone capacity
dm writecache: flush origin device when writing and cache is full
dm btree remove: assign new_root only when removal succeeds
PCI: Leave Apple Thunderbolt controllers on for s2idle or standby
PCI: aardvark: Fix checking for PIO Non-posted Request
PCI: aardvark: Implement workaround for the readback value of VEND_ID
media: subdev: disallow ioctl for saa6588/davinci
media: dtv5100: fix control-request directions
media: zr364xx: fix memory leak in zr364xx_start_readpipe
media: gspca/sq905: fix control-request direction
media: gspca/sunplus: fix zero-length control requests
media: uvcvideo: Fix pixel format change for Elgato Cam Link 4K
io_uring: fix clear IORING_SETUP_R_DISABLED in wrong function
dm writecache: write at least 4k when committing
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
drm/ast: Remove reference to struct drm_device.pdev
jfs: fix GPF in diFree
smackfs: restrict bytes count in smk_set_cipso()
ext4: fix memory leak in ext4_fill_super
f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instances
Linux 5.10.51
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icb10fed733a0050848ecc23db13ae3d134895acd
commit 11c7aa0ddea8611007768d3e6b58d45dc60a19e1 upstream.
Commit 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:
CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker)
rq_qos_wait() rq_qos_wait()
acquire_inflight_cb() -> fails
acquire_inflight_cb() -> fails
completes IOs, inflight
decreased
prepare_to_wait_exclusive()
prepare_to_wait_exclusive()
has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
has_sleeper = !wq_has_single_sleeper() -> true
io_schedule() io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.
Fixes: 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
android_vh_do_wake_up_sync:
To modify the mode value of __wake_up_sync_key
android_vh_set_wake_flags:
To modify the wake flag from a module
Bug: 181743516
Signed-off-by: Namkyu Kim <namkyu78.kim@samsung.com>
Change-Id: I972e2469c3f139373d21f1e8c85974763388a693
trace_sched_blocked_trace in CFS is really useful for debugging via
trace because it tell where the process was stuck on callstack.
For example,
<...>-6143 ( 6136) [005] d..2 50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
<...>-6136 ( 6136) [005] d..2 50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
<...>-6142 ( 6136) [006] d..2 50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
<...>-6144 ( 6136) [006] d..2 50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104
However, sometime it gives pointless information like this.
RenderThread-2322 ( 1805) [006] d.s3 50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
logd.writer-594 ( 587) [002] d.s3 50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
kworker/u16:13-333 ( 333) [007] d.s4 50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
Such wait_on_page_bit, __lock_page_killable are pointless because it doesn't
carry on higher information to identify the callstack.
The reason is page_lock and waitqueue are special synchronization method unlike
other normal locks(mutex, spinlock).
Let's mark them as "__sched" so get_wchan which used in trace_sched_blocked_trace
could detect it and skip them. It will produce more meaningful callstack
function like this.
<...>-2867 ( 1068) [002] d.h4 124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
<...>-2867 ( 1068) [002] d.s3 124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
<...>-2867 ( 1068) [002] d.s4 124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
ScreenDecoratio-2364 ( 1867) [002] d.s3 124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
ScreenDecoratio-2364 ( 1867) [002] d.s4 124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
<...>-329 ( 329) [000] d..3 124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
kworker/u16:13-538 ( 538) [007] d..3 124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470
Test: build pass and boot to home.
Bug: 144961676
Bug: 144713689
Bug: 172212772
Signed-off-by: Minchan Kim <minchan@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
Change-Id: I9c738802a16941ca767dcc37ae4463070b3fabf4
(cherry picked from commit 1e4de875d9e0cfaccf5131bcc709ae8646cdc168)
Signed-off-by: Will McVicker <willmcvicker@google.com>
Add a wakeup call for a case whereby the caller already has the waitqueue
spinlock held. This can be used by pipes to alter the ring buffer indices
and issue a wakeup under the same spinlock.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Remove the nr_exclusive argument from __wake_up_sync_key() and derived
functions as everything seems to set it to 1. Note also that if it wasn't
set to 1, it would clear WF_SYNC anyway.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Do the following cleanups and simplifications:
- sched/sched.h already includes <asm/paravirt.h>, so no need to
include it in sched/core.c again.
- order the <linux/sched/*.h> headers alphabetically
- add all <linux/sched/*.h> headers to kernel/sched/sched.h
- remove all unnecessary includes from the .c files that
are already included in kernel/sched/sched.h.
Finally, make all scheduler .c files use a single common header:
#include "sched.h"
... which now contains a union of the relied upon headers.
This makes the various .c files easier to read and easier to handle.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A good number of small style inconsistencies have accumulated
in the scheduler core, so do a pass over them to harmonize
all these details:
- fix speling in comments,
- use curly braces for multi-line statements,
- remove unnecessary parentheses from integer literals,
- capitalize consistently,
- remove stray newlines,
- add comments where necessary,
- remove invalid/unnecessary comments,
- align structure definitions and other data types vertically,
- add missing newlines for increased readability,
- fix vertical tabulation where it's misaligned,
- harmonize preprocessor conditional block labeling
and vertical alignment,
- remove line-breaks where they uglify the code,
- add newline after local variable definitions,
No change in functionality:
md5:
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following cleanup commit:
50816c4899 ("sched/wait: Standardize internal naming of wait-queue entries")
... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.
Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.
We can have very long page wait list in large system where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long. This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.
[ v2: Remove bookmark_wake_function ]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We encountered workloads that have very long wake up list on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.
We saw page wait list that are up to 3700+ entries long in tests of
large 4 and 8 socket systems. It took 0.8 sec to traverse such list
during wake up. Any other CPU that contends for the list spin lock will
spin for a long time. It is a result of the numa balancing migration of
hot pages that are shared by many threads.
Multiple CPUs waking are queued up behind the lock, and the last one
queued has to wait until all CPUs did all the wakeups.
The page wait list is traversed with interrupt disabled, which caused
various problems. This was the original cause that triggered the NMI
watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
extending the NMI watch dog timer there helped.
This patch bookmarks the waker's scan position in wake list and break
the wake up walk, to allow access to the list before the waker resume
its walk down the rest of the wait list. It lowers the interrupt and
rescheduling latency.
This patch also provides a performance boost when combined with the next
patch to break up page wakeup list walk. We saw 22% improvement in the
will-it-scale file pread2 test on a Xeon Phi system running 256 threads.
[ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
simply access to flags. ]
Reported-by: Kan Liang <kan.liang@intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... is was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.
Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.
So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:
include/linux/wait_bit.h for types and APIs
kernel/sched/wait_bit.c for the implementation
Update all header dependencies.
This reduces the size of wait.h rather significantly, by about 30%.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So wait-bit-queue head variables are often named:
struct wait_bit_queue *q
... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!
They are misnomers in two ways:
- the 'wait_bit_queue' leaves open the question of whether
it's an entry or a head
- the 'q' parameter and local variable naming falsely implies
that it's a 'queue' - while it's an entry.
This resulted in sometimes confusing cases such as:
finish_wait(wq, &q->wait);
where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:
struct wait_bit_queue => struct wait_bit_queue_entry
q => wbq_entry
Which makes it all a much clearer:
struct wait_bit_queue_entry *wbq_entry
... and turns the former confusing piece of code into:
finish_wait(wq_head, &wbq_entry->wq_entry;
which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.
I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so Hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:
wait_queue_head_t *q
wait_queue_head_t *wq
wait_queue_head_t *head
In particular the 'wq' naming is ambiguous in the sense whether it's
a wait-queue head or entry name - as entries were often named 'wait'.
( Not to mention the confusion of any readers coming over from
workqueue-land. )
Standardize all this around a single, unambiguous parameter and
variable name:
struct wait_queue_head *wq_head
which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:
struct wait_queue_entry *wq_entry
Also rename:
struct __wait_queue_head => struct wait_queue_head
... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.
Don't touch any external users and preserve the main wait_queue_head_t
typedef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:
wait_queue_entry_t *wait
wait_queue_entry_t *__wait (even in plain C code!)
wait_queue_entry_t *q (!)
wait_queue_entry_t *new (making anyone who knows C++ cringe)
wait_queue_entry_t *old
I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.
( Some of the documentation talks about a 'wait descriptor', which is
the wait-queue entry itself - further adding to the confusion. )
The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.
To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:
struct wait_queue_entry *wq_entry
The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.
wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.
( We don't touch external users, and we preserve the typedef as well
for actual wait-queue users, to reduce unnecessary churn. )
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The scheduler header file split and cleanups ended up exposing a few
nasty header file dependencies, and in particular it showed how we in
<linux/wait.h> ended up depending on "signal_pending()", which now comes
from <linux/sched/signal.h>.
That's a very subtle and annoying dependency, which already caused a
semantic merge conflict (see commit e58bc92783 "Pull overlayfs updates
from Miklos Szeredi", which added that fixup in the merge commit).
It turns out that we can avoid this dependency _and_ improve code
generation by moving the guts of the fairly nasty helper #define
__wait_event_interruptible_locked() to out-of-line code. The code that
includes the signal_pending() check is all in the slow-path where we
actually go to sleep waiting for the event anyway, so using a helper
function is the right thing to do.
Using a helper function is also what we already did for the non-locked
versions, see the "__wait_event*()" macros and the "prepare_to_wait*()"
set of helper functions.
We might want to try to unify all these macro games, we have a _lot_ of
subtly different wait-event loops. But this is the minimal patch to fix
the annoying header dependency.
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.
If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wait an uninterruptible waiter.
The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.
Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die, see the next change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.
It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- lockless wakeup support for futexes and IPC message queues
(Davidlohr Bueso, Peter Zijlstra)
- Replace spinlocks with atomics in thread_group_cputimer(), to
improve scalability (Jason Low)
- NUMA balancing improvements (Rik van Riel)
- SCHED_DEADLINE improvements (Wanpeng Li)
- clean up and reorganize preemption helpers (Frederic Weisbecker)
- decouple page fault disabling machinery from the preemption
counter, to improve debuggability and robustness (David
Hildenbrand)
- SCHED_DEADLINE documentation updates (Luca Abeni)
- topology CPU masks cleanups (Bartosz Golaszewski)
- /proc/sched_debug improvements (Srikar Dronamraju)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...
There is a race between kthread_stop() and the new wait_woken() that
can result in a lack of progress.
CPU 0 | CPU 1
|
rfcomm_run() | kthread_stop()
... |
if (!test_bit(KTHREAD_SHOULD_STOP)) |
| set_bit(KTHREAD_SHOULD_STOP)
| wake_up_process()
wait_woken() | wait_for_completion()
set_current_state(INTERRUPTIBLE) |
if (!WQ_FLAG_WOKEN) |
schedule_timeout() |
|
After which both tasks will wait.. forever.
Fix this by having wait_woken() check for kthread_should_stop() but
only for kthreads (obviously).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit c1221321b7
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It is currently not possible for various wait_on_bit functions
to implement a timeout.
While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.
The 'action' function is currently passed a pointer to the word
containing the bit being waited on. No current action functions
use this pointer. So changing it to something else will be a
little noisy but will have no immediate effect.
This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.
It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.
An action function can now implement a timeout with something
like
static int timed_out_waiter(struct wait_bit_key *key)
{
unsigned long waited;
if (key->private == 0) {
key->private = jiffies;
if (key->private == 0)
key->private -= 1;
}
waited = jiffies - key->private;
if (waited > 10 * HZ)
return -EAGAIN;
schedule_timeout(waited - 10 * HZ);
return 0;
}
If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".
My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.
While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need. I need the timeout to be sensitive to
the state of the connection with the server, which could change.
So I need to use an 'action' interface.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>